Aho-Corasick multiple string matching algorithm #8
Conversation
Probably you can change it, then pylint will be happy 😄
Thank you, of course it could be easily fixed -- I must have had some kinda blackout :)
Good work, the required changes are rather small.
while not q.empty():
    current = q.get()
    for a, child in [(a, current.goto(a)) for a in alphabet]:
Replace the list with a generator here and in all other places.
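A minimal sketch of the difference the reviewer is pointing at, with a placeholder `Node` class standing in for the PR's trie node (the names here are illustrative, not the PR's API):

```python
# Hypothetical sketch: a generator expression yields (symbol, child)
# pairs lazily instead of materializing the whole list up front.
class Node:
    def __init__(self, children=None):
        self.children = children if children is not None else {}

    def goto(self, a):
        # Placeholder goto: return the child reached by symbol a, or None.
        return self.children.get(a)

alphabet = "ab"
current = Node({"a": Node(), "b": Node()})

# Original form: list comprehension builds the full list in memory first.
pairs_list = [(a, current.goto(a)) for a in alphabet]

# Suggested form: generator expression produces pairs one at a time,
# avoiding the intermediate list inside the BFS loop.
pairs_gen = ((a, current.goto(a)) for a in alphabet)
gen_result = list(pairs_gen)
```

Both forms iterate the same pairs; the generator simply avoids the temporary list, which matters when the loop body consumes the pairs one by one anyway.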
text/PiotrMikolajczyk/main.tex
Outdated
\section{Introduction}

Let $\K=\{k_1,...,k_n\}$ be a set of patterns (non-empty words over an alphabet $\Sigma$). Given a word $T\in\Sigma^m$, we want to determine all indices in $[m]$ at which some pattern from $\K$ begins in $T$. A natural solution is to run one of the efficient (linear) single-pattern matching algorithms $n$ times. This approach is of course correct, but unfortunately yields $\Theta(nm)$ time complexity. It turns out that, based on $\K$, we can construct in $\Theta(\sum_{i\in[n]}|k_i|)$ time a structure that finds all pattern occurrences at once, reading $T$ only a single time. Enter: the Aho-Corasick automaton.
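The single-pass automaton described in this introduction can be sketched as follows. This is a minimal illustrative implementation with dict-based nodes; the names (`build_automaton`, `search`, the `fail`/`out` fields) are assumptions for this sketch, not the PR's actual API:

```python
from collections import deque

def build_automaton(keywords):
    """Build a minimal Aho-Corasick automaton: a keyword trie whose nodes
    carry suffix ('fail') links and output lists of matched keyword lengths."""
    root = {'children': {}, 'fail': None, 'out': []}
    # Phase 1: insert every keyword into the trie.
    for kw in keywords:
        node = root
        for ch in kw:
            node = node['children'].setdefault(
                ch, {'children': {}, 'fail': None, 'out': []})
        node['out'].append(len(kw))
    # Phase 2: BFS to compute suffix links and merge output sets.
    queue = deque()
    for child in root['children'].values():
        child['fail'] = root
        queue.append(child)
    while queue:
        current = queue.popleft()
        for ch, child in current['children'].items():
            queue.append(child)
            fail = current['fail']
            while fail is not None and ch not in fail['children']:
                fail = fail['fail']
            child['fail'] = fail['children'][ch] if fail else root
            # Inherit the suffix link's outputs (a plain list copy here;
            # the text assumes an O(1) linked-list merge instead).
            child['out'] += child['fail']['out']
    return root

def search(root, text):
    """Scan text once, yielding (start, end) slice bounds of every
    keyword occurrence."""
    node = root
    for i, ch in enumerate(text):
        # Follow suffix links until a transition on ch exists.
        while node is not None and ch not in node['children']:
            node = node['fail']
        node = node['children'][ch] if node else root
        for length in node['out']:
            yield i - length + 1, i + 1
```

Construction touches each trie node a constant number of times, and the search advances through `text` exactly once, matching the linear bounds claimed above.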
Nit: use
for i in range(1, n + 1):
    state = state.nxt(text[i])
    for keyword_len in state.output():
        yield i - keyword_len + 1, i + 1
As for the output format: I don't have a very strong opinion on that, but two solutions make sense to me:
- returning only the starting index -- mostly for compatibility with string matching algorithms, and for when we care that something occurred but not which pattern,
- returning the pair (starting index, pattern matched) -- may be implemented even suboptimally, but at least makes it clear that we're interested in which pattern occurred.
Please share your opinion on that. I'd also like to hear @rafalalgo on this, since you both work on different algorithms for the same problem.
Initially, I implemented the second proposition. After realizing that it is possible to completely abandon storing keywords in the automaton, I think the current solution is much more economical and we do not lose any clarity. The only reasonable situation I can imagine where we would need the particular pattern is upon encountering it, but then we can just take the corresponding substring from the text.
Of course, I'd be glad to hear others' opinions and, if needed, change the implementation.
The first option is the one I would rather not choose -- if there is any occurrence, we will return something anyway; in any other situation we would lose information that does not cost much to provide.
Sure, I agree that you may completely abandon storing keywords in the automaton -- however, I was thinking in terms of clarity of explanation vs. optimality.
As for the explanations: I think it's possible that, for a list of patterns w_1, w_2, ..., w_k, we would like to build an automaton and answer queries with an array W[1..k] storing how many times each string appeared, without any further comparisons. Equivalently, we could yield pairs (i, pos), i.e. store the i's at the respective nodes. I understand your concerns, but considering the above, I guess returning (w_i, pos), perhaps with a short explanation that (i, pos) can be done very similarly, is more readable for laymen. At least that's my impression.
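The counting variant mentioned here (an array W with one counter per pattern, filled without further string comparisons) can be sketched like this; the inner matcher below is a naive stand-in for the automaton, and `count_occurrences` is an illustrative name, not anything from the PR:

```python
def count_occurrences(patterns, text):
    """Fill W[0..k-1] with the number of occurrences of each pattern.
    A naive scan stands in here for the Aho-Corasick search, which would
    yield the same (pattern index, position) pairs from its output sets."""
    counts = [0] * len(patterns)  # W[1..k] from the discussion, 0-based
    for pos in range(len(text)):
        for i, pattern in enumerate(patterns):
            if text.startswith(pattern, pos):
                counts[i] += 1  # pattern i occurs starting at pos
    return counts

w = count_occurrences(["ab", "aba"], "ababa")
```

With yielded pairs (i, pos), the caller only increments `counts[i]`, which is why storing pattern indices (or the patterns themselves) at the output nodes suffices.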
It's hard to disagree :) I think that, at least in terms of this repo, consistency and clarity are priorities, so maybe let's wait for @rafalalgo's point of view and I will adapt accordingly.
In the Commentz-Walter algorithm implementation, my proposition is that, for a list of patterns w_1, w_2, ..., w_k, we build an automaton and then answer queries by yielding pairs (pattern, pos), e.g. [ ('acb', 1), ('aba', 5), ('aba', 7), ('aba', 9), ('aba', 11) ]. This is the proposition from the original Commentz-Walter paper :)
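A tiny sketch of this output convention; the matcher below is a naive stand-in (not the Commentz-Walter algorithm), and the input text is chosen so that it reproduces the example pair list above:

```python
def naive_multi_match(patterns, text):
    """Naive stand-in matcher illustrating the proposed output
    convention: yield (pattern, pos) pairs with 1-based positions."""
    for pos in range(len(text)):
        for pattern in patterns:
            if text.startswith(pattern, pos):
                yield pattern, pos + 1  # 1-based, as in the example list

# Input chosen so the output matches the example pairs above.
matches = list(naive_multi_match(["acb", "aba"], "acbxababababa"))
```

Any multi-pattern matcher (Aho-Corasick, Commentz-Walter, or this naive scan) can expose the same (pattern, pos) interface, which is what makes the convention attractive for consistency across the repo.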
Resolved :)
text/PiotrMikolajczyk/main.tex
Outdated
It is very easy to see that each of them can be performed in $\Theta(\sum_{i\in[n]}|k_i|)$ time -- each time we traverse the whole $\trie$ tree, visiting every node exactly once and performing $\Theta(1)$ operations in each of them. We also assume that merging the $\out$ sets takes constant time (as with, e.g., linked lists).

It remains to consider how much time it takes to read the text $T$ ($|T|=m$) and report all pattern occurrences. We treat the cardinality of $\Sigma$ as a constant.
Bonus question: it seems that the whole construction and search depend linearly on
I agree, as the size of the automaton is bounded by the product of the sum of the keywords' lengths and the alphabet cardinality. Every phase -- either construction or searching -- is linear with respect to the size of the graph. I will amend this remark.
text/PiotrMikolajczyk/main.tex
Outdated
self._construct_next(alphabet)
\end{minted}

It is very easy to see that each of them can be performed in $\Theta(\sum_{i\in[n]}|k_i|)$ time -- each time we traverse the whole $\trie$ tree, visiting every node exactly once and performing $\Theta(1)$ operations in each of them. We also assume that merging the $\out$ sets takes constant time (as with, e.g., linked lists).
Just for convention - I think I used
For now I will use
But of course, if you wish, I can still exchange the sum for the length upper bound
Well, I used through my notes
I was convinced after realizing that I had breached the convention also for
Thus we can state the time complexity of multi-pattern search in a text using the Aho-Corasick automaton as linear with respect to the sum of the lengths of all patterns plus the number of their occurrences.

\section{Application to 2D patterns}
Nice addition, thank you.
Hoped you'd like it :) This application was presented as a curiosity during an ADS lecture.
I have applied almost all suggestions -- please review them and let me know what else should be done.
Brief implementation of https://cr.yp.to/bib/1975/aho.pdf with tests and a short description.
There were some minor problems with Pylint:
Namely, I've encapsulated the creation of the automaton within a class, but eventually it exposes only one public method (for searching). I could either disable Pylint's warning or add some dummy method like __str__().
Moreover, in the test script, Pylint has a problem with:
It is an open issue raised here: pylint-dev/pylint#3107