Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aho-Corasick multiple string matching algorithm #8

Merged

Conversation

pmikolajczyk41
Copy link
Contributor

Brief implementation of https://cr.yp.to/bib/1975/aho.pdf with tests and a short description.

There were some minor problems with Pylint:

# pylint: disable=too-few-public-methods
class AhoCorasickAutomaton:
  ...

i.e. I've encapsulated the creation of the automaton within a class, but eventually it exposes only one public method (for searching). I could either disable Pylint's screaming or add some dummy method like __str__().

Moreover in the test script, Pylint has a problem with:

for p in patterns:
  ...
  indices = map(lambda i: (i, i + len(p)), starts)

It is an open issue raised here: pylint-dev/pylint#3107

@rafalalgo
Copy link
Contributor

Probably you can change
indices = map(lambda i: (i, i + len(p)), starts)
into
indices = [(i, i + len(p)) for i in starts]

then pylint will be happy 😄

@pmikolajczyk41
Copy link
Contributor Author

Thank you, of course it could be easily fixed -- I must have had some kinda blackout :)

@krzysztof-turowski krzysztof-turowski self-assigned this Jun 18, 2020
Copy link
Owner

@krzysztof-turowski krzysztof-turowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work, required changes are rather small.


while not q.empty():
current = q.get()
for a, child in [(a, current.goto(a)) for a in alphabet]:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace list with generator here and in all other places.


\section{Wprowadzenie}

Niech $\K=\{k_1,...,k_n\}$ będzie zbiorem wzorców (niepustych słów nad alfabetem $\Sigma$). Otrzymawszy słowo $T\in\Sigma^m$ chcemy wyznaczyć wszystkie indeksy z $[m]$, na których w $T$ zaczyna się pewien wzorzec z $\K$. Naturalnym rozwiązaniem jest $n$-krotne użycie jednego z wydajnych (liniowych) algorytmów do szukania pojedynczego wzorca. Podejście takie oczywiście jest poprawne, niestety skutkuje złożonością czasową $\Theta(nm)$. Okazuje się, że na podstawie $\K$ jesteśmy w stanie skonstruować w czasie $\Theta(\sum_{i\in[n]}|k_i|)$ odpowiednią strukturę, która będzie potrafiła znaleźć naraz wszystkie wystąpienia wzorców oglądając $T$ wyłącznie raz. Przed nami: automat Aho-Corasick.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: use $w_i$'s for patterns for consistency with others.

for i in range(1, n + 1):
state = state.nxt(text[i])
for keyword_len in state.output():
yield i - keyword_len + 1, i + 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for the output format: I don't have a very strong opinion on that, but two solutions make sense to me

  1. returning only the starting index - mostly for compatibility with string matching algorithms and when we do care about the fact that something occurred, but not which patters,
  2. returning pair (starting index, pattern matched) - may be implemented even suboptimally, but at least makes clear that we're interested which pattern occured.

Please share your opinion on that. Also I'd like to hear @rafalalgo on that, since you both work for different algorithms for the same problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, I implemented the 2. proposition. After realizing, that it is possible to completely abandon storing keywords in the automaton, I think that the current solution is much more sparing and we do not lose any clarity. The only reasonable situation I can imagine when we would need particular pattern is when encountering it, but then we can just take a corresponding substring from the text.
Surely, I'd be glad to hear others' opinion and, if needed, change the realization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first option is the one I would rather not to choose -- if there is any occurrence we will return something anyway. In any other situation we will lose information, providing which does not cost much.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I agree that you may completely abandon storing keywords in the automaton - however I was thinking in terms of clarity of explanation vs. optimality.

As for the explanations: I would think that it's possible that we would like for a list of patterns w_1, w_2, ..., w_k to build an automaton and answer queries with arrays W[1..k] storing how many times each string appeared without any further comparisons. Equivalently, we could yield pairs (i, pos), i.e. store i's at the respective nodes. But I understand your concerns, but considering the above I guess returning (w_i, pos) maybe with a short explanation that (i, pos) can be done very similar is more readable by laymen. At least it's just my impression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to disagree :) I think that in terms of this repo at least, consistency and clarity are priorities, so maybe let's wait on @rafalalgo 's point of view and I will adapt properly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Commentz-Walter algorithm implementation, my proposition is that for a list of patterns w_1, w_2, ..., w_k to build an automaton and then to answer queries, we could yield pairs (pattern, pos), i.e. [ ('acb', 1), ('aba', 5), ('aba', 7), ('aba', 9), ('aba', 11) ]. It is a proposition from original Commentz-Walter paper :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved :)


Bardzo łatwo zauważyć, że każdą z nich wykonać można w czasie $\Theta(\sum_{i\in[n]}|k_i|)$ -- za każdym razem przechodzimy po całym drzewie $\trie$ odwiedzając każdy wierzchołek dokładnie raz oraz wykonując $\Theta(1)$ operacji w każdym z nich. Zakładamy również, że łączenie zbiorów $\out$ wykonuje się w czasie stałym (jak np. listy wiązane).

Pozostaje zastanowić się, ile czasu zajmie przeczytanie tekstu $T$ ($|T|=m$) i wypisanie wszystkich wystąpień wzorców. Moc $\Sigma$ traktujemy jak stałą.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bonus question: it seems that the whole construction and search depend linearly on $|\Sigma|$, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, as the size of the automaton is bounded by the product of the sum of keywords' length and alphabet cardinality. Every phase -- either construction or searching -- is linear with respect to the size of the graph. I will amend this remark.

self._construct_next(alphabet)
\end{minted}

Bardzo łatwo zauważyć, że każdą z nich wykonać można w czasie $\Theta(\sum_{i\in[n]}|k_i|)$ -- za każdym razem przechodzimy po całym drzewie $\trie$ odwiedzając każdy wierzchołek dokładnie raz oraz wykonując $\Theta(1)$ operacji w każdym z nich. Zakładamy również, że łączenie zbiorów $\out$ wykonuje się w czasie stałym (jak np. listy wiązane).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for convention - I think I used $k$ for number of patterns and you may use $m$ as the length of the longest string in pattern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I will use $k$ for the number of patters, but maybe I will not take care about the longest pattern -- I will leave the sum of lengths as it is, especially that $m$ is currently the $T$'s length :)
But of course, if you wish, I can still exchange sum for the length upperbound

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I used through my notes $|t| = n$, $|w| = m$ (or $\max_i |w_i| = m$ if there were multiple), but OK, this I can correct during my merge run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was convinced after realizing that I have breached the convention also for $t$ -- my apologies. The patch is on its way


Zatem możemy określić złożoność czasową wyszukiwania wielu wzorców w tekście za pomocą automatu Aho-Corasick jako liniową w stosunku do sumy długości wszystkich wzorców oraz ilości ich wystąpień.

\section{Zastosowanie do wzorców 2D}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition, thank you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hoped you'll like it :) This application was presented as a curiosity during ADS lecture

@pmikolajczyk41
Copy link
Contributor Author

I have applied almost all suggestions -- please review them and let me know what else should be done

@krzysztof-turowski krzysztof-turowski changed the title Aho-Corasick Automaton Aho-Corasick multiple string matching algorithm Jun 20, 2020
@krzysztof-turowski krzysztof-turowski merged commit 412cda4 into krzysztof-turowski:master Jun 20, 2020
@krzysztof-turowski krzysztof-turowski added the enhancement New feature or request label Sep 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants