Aho-Corasick multiple string matching algorithm #8

pmikolajczyk41 · 2020-06-17T22:05:36Z

Brief implementation of https://cr.yp.to/bib/1975/aho.pdf with tests and a short description.

There were some minor problems with Pylint:

# pylint: disable=too-few-public-methods
class AhoCorasickAutomaton:
  ...

i.e. I've encapsulated the creation of the automaton within a class, but eventually it exposes only one public method (for searching). I could either disable Pylint's screaming or add some dummy method like __str__().

Moreover in the test script, Pylint has a problem with:

for p in patterns:
  ...
  indices = map(lambda i: (i, i + len(p)), starts)

It is an open issue raised here: pylint-dev/pylint#3107

rafalalgo · 2020-06-18T00:48:36Z

Probably you can change
indices = map(lambda i: (i, i + len(p)), starts)
into
indices = [(i, i + len(p)) for i in starts]

then pylint will be happy 😄

pmikolajczyk41 · 2020-06-18T08:02:29Z

Thank you, of course it could be easily fixed -- I must have had some kinda blackout :)

krzysztof-turowski

Good work, required changes are rather small.

krzysztof-turowski · 2020-06-19T08:35:26Z

exact_multiple_string_matching/aho_corasick.py

+
+    while not q.empty():
+      current = q.get()
+      for a, child in [(a, current.goto(a)) for a in alphabet]:


Replace list with generator here and in all other places.

krzysztof-turowski · 2020-06-19T08:39:54Z

text/PiotrMikolajczyk/main.tex

+
+\section{Wprowadzenie}
+
+Niech $\K=\{k_1,...,k_n\}$ będzie zbiorem wzorców (niepustych słów nad alfabetem $\Sigma$). Otrzymawszy słowo $T\in\Sigma^m$ chcemy wyznaczyć wszystkie indeksy z $[m]$, na których w $T$ zaczyna się pewien wzorzec z $\K$. Naturalnym rozwiązaniem jest $n$-krotne użycie jednego z wydajnych (liniowych) algorytmów do szukania pojedynczego wzorca. Podejście takie oczywiście jest poprawne, niestety skutkuje złożonością czasową $\Theta(nm)$. Okazuje się, że na podstawie $\K$ jesteśmy w stanie skonstruować w czasie $\Theta(\sum_{i\in[n]}|k_i|)$ odpowiednią strukturę, która będzie potrafiła znaleźć naraz wszystkie wystąpienia wzorców oglądając $T$ wyłącznie raz. Przed nami: automat Aho-Corasick.


Nit: use $w_i$'s for patterns for consistency with others.

krzysztof-turowski · 2020-06-19T22:47:44Z

exact_multiple_string_matching/aho_corasick.py

+    for i in range(1, n + 1):
+      state = state.nxt(text[i])
+      for keyword_len in state.output():
+        yield i - keyword_len + 1, i + 1


As for the output format: I don't have a very strong opinion on that, but two solutions make sense to me

returning only the starting index - mostly for compatibility with string matching algorithms and when we do care about the fact that something occurred, but not which patters,

returning pair (starting index, pattern matched) - may be implemented even suboptimally, but at least makes clear that we're interested which pattern occured.

Please share your opinion on that. Also I'd like to hear @rafalalgo on that, since you both work for different algorithms for the same problem.

Firstly, I implemented the 2. proposition. After realizing, that it is possible to completely abandon storing keywords in the automaton, I think that the current solution is much more sparing and we do not lose any clarity. The only reasonable situation I can imagine when we would need particular pattern is when encountering it, but then we can just take a corresponding substring from the text.
Surely, I'd be glad to hear others' opinion and, if needed, change the realization.

The first option is the one I would rather not to choose -- if there is any occurrence we will return something anyway. In any other situation we will lose information, providing which does not cost much.

Sure, I agree that you may completely abandon storing keywords in the automaton - however I was thinking in terms of clarity of explanation vs. optimality.

As for the explanations: I would think that it's possible that we would like for a list of patterns w_1, w_2, ..., w_k to build an automaton and answer queries with arrays W[1..k] storing how many times each string appeared without any further comparisons. Equivalently, we could yield pairs (i, pos), i.e. store i's at the respective nodes. But I understand your concerns, but considering the above I guess returning (w_i, pos) maybe with a short explanation that (i, pos) can be done very similar is more readable by laymen. At least it's just my impression.

It's hard to disagree :) I think that in terms of this repo at least, consistency and clarity are priorities, so maybe let's wait on @rafalalgo 's point of view and I will adapt properly

In Commentz-Walter algorithm implementation, my proposition is that for a list of patterns w_1, w_2, ..., w_k to build an automaton and then to answer queries, we could yield pairs (pattern, pos), i.e. [ ('acb', 1), ('aba', 5), ('aba', 7), ('aba', 9), ('aba', 11) ]. It is a proposition from original Commentz-Walter paper :)

Resolved :)

krzysztof-turowski · 2020-06-19T22:55:47Z

text/PiotrMikolajczyk/main.tex

+
+Bardzo łatwo zauważyć, że każdą z nich wykonać można w czasie $\Theta(\sum_{i\in[n]}|k_i|)$ -- za każdym razem przechodzimy po całym drzewie $\trie$ odwiedzając każdy wierzchołek dokładnie raz oraz wykonując $\Theta(1)$ operacji w każdym z nich. Zakładamy również, że łączenie zbiorów $\out$ wykonuje się w czasie stałym (jak np. listy wiązane).
+
+Pozostaje zastanowić się, ile czasu zajmie przeczytanie tekstu $T$ ($|T|=m$) i wypisanie wszystkich wystąpień wzorców. Moc $\Sigma$ traktujemy jak stałą.


Bonus question: it seems that the whole construction and search depend linearly on $|\Sigma|$, right?

I agree, as the size of the automaton is bounded by the product of the sum of keywords' length and alphabet cardinality. Every phase -- either construction or searching -- is linear with respect to the size of the graph. I will amend this remark.

krzysztof-turowski · 2020-06-19T22:57:20Z

text/PiotrMikolajczyk/main.tex

+    self._construct_next(alphabet)
+\end{minted}
+
+Bardzo łatwo zauważyć, że każdą z nich wykonać można w czasie $\Theta(\sum_{i\in[n]}|k_i|)$ -- za każdym razem przechodzimy po całym drzewie $\trie$ odwiedzając każdy wierzchołek dokładnie raz oraz wykonując $\Theta(1)$ operacji w każdym z nich. Zakładamy również, że łączenie zbiorów $\out$ wykonuje się w czasie stałym (jak np. listy wiązane).


Just for convention - I think I used $k$ for number of patterns and you may use $m$ as the length of the longest string in pattern.

For now I will use $k$ for the number of patters, but maybe I will not take care about the longest pattern -- I will leave the sum of lengths as it is, especially that $m$ is currently the $T$'s length :)
But of course, if you wish, I can still exchange sum for the length upperbound

Well, I used through my notes $|t| = n$, $|w| = m$ (or $\max_i |w_i| = m$ if there were multiple), but OK, this I can correct during my merge run.

I was convinced after realizing that I have breached the convention also for $t$ -- my apologies. The patch is on its way

krzysztof-turowski · 2020-06-19T22:59:33Z

text/PiotrMikolajczyk/main.tex

+
+Zatem możemy określić złożoność czasową wyszukiwania wielu wzorców w tekście za pomocą automatu Aho-Corasick jako liniową w stosunku do sumy długości wszystkich wzorców oraz ilości ich wystąpień.
+
+\section{Zastosowanie do wzorców 2D}


Nice addition, thank you.

Hoped you'll like it :) This application was presented as a curiosity during ADS lecture

pmikolajczyk41 · 2020-06-20T00:09:35Z

I have applied almost all suggestions -- please review them and let me know what else should be done

pmikolajczyk41 added 4 commits June 17, 2020 19:06

Add implementation of Aho-Corasick Automaton

f88eca7

Do not store keywords; Rename next -> nxt

d1d72e5

Add tests

9c8d328

Add tex+pdf

1cb0d04

Refactor making Pylint happier and the code cleaner

e0c5d7f

krzysztof-turowski self-assigned this Jun 18, 2020

krzysztof-turowski self-requested a review June 18, 2020 23:51

krzysztof-turowski requested changes Jun 19, 2020

View reviewed changes

pmikolajczyk41 added 2 commits June 20, 2020 01:38

Replace list with generator

1bba6d4

Apply style recommendations

868372e

pmikolajczyk41 requested a review from krzysztof-turowski June 20, 2020 00:10

Comply with the convention

9535e1c

krzysztof-turowski approved these changes Jun 20, 2020

View reviewed changes

krzysztof-turowski changed the title ~~Aho-Corasick Automaton~~ Aho-Corasick multiple string matching algorithm Jun 20, 2020

krzysztof-turowski merged commit 412cda4 into krzysztof-turowski:master Jun 20, 2020

krzysztof-turowski assigned pmikolajczyk41 and unassigned krzysztof-turowski Jan 12, 2022

krzysztof-turowski added the enhancement New feature or request label Sep 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Aho-Corasick multiple string matching algorithm #8

Aho-Corasick multiple string matching algorithm #8

pmikolajczyk41 commented Jun 17, 2020

rafalalgo commented Jun 18, 2020

pmikolajczyk41 commented Jun 18, 2020

krzysztof-turowski left a comment

krzysztof-turowski Jun 19, 2020

krzysztof-turowski Jun 19, 2020

krzysztof-turowski Jun 19, 2020

pmikolajczyk41 Jun 19, 2020

pmikolajczyk41 Jun 19, 2020

krzysztof-turowski Jun 20, 2020

pmikolajczyk41 Jun 20, 2020

rafalalgo Jun 20, 2020

pmikolajczyk41 Jun 20, 2020

krzysztof-turowski Jun 19, 2020

pmikolajczyk41 Jun 19, 2020 •

edited

krzysztof-turowski Jun 19, 2020

pmikolajczyk41 Jun 19, 2020

krzysztof-turowski Jun 20, 2020

pmikolajczyk41 Jun 20, 2020

krzysztof-turowski Jun 19, 2020

pmikolajczyk41 Jun 19, 2020

pmikolajczyk41 commented Jun 20, 2020


		\section{Wprowadzenie}

		Niech $\K=\{k_1,...,k_n\}$ będzie zbiorem wzorców (niepustych słów nad alfabetem $\Sigma$). Otrzymawszy słowo $T\in\Sigma^m$ chcemy wyznaczyć wszystkie indeksy z $[m]$, na których w $T$ zaczyna się pewien wzorzec z $\K$. Naturalnym rozwiązaniem jest $n$-krotne użycie jednego z wydajnych (liniowych) algorytmów do szukania pojedynczego wzorca. Podejście takie oczywiście jest poprawne, niestety skutkuje złożonością czasową $\Theta(nm)$. Okazuje się, że na podstawie $\K$ jesteśmy w stanie skonstruować w czasie $\Theta(\sum_{i\in[n]}\|k_i\|)$ odpowiednią strukturę, która będzie potrafiła znaleźć naraz wszystkie wystąpienia wzorców oglądając $T$ wyłącznie raz. Przed nami: automat Aho-Corasick.


		Bardzo łatwo zauważyć, że każdą z nich wykonać można w czasie $\Theta(\sum_{i\in[n]}\|k_i\|)$ -- za każdym razem przechodzimy po całym drzewie $\trie$ odwiedzając każdy wierzchołek dokładnie raz oraz wykonując $\Theta(1)$ operacji w każdym z nich. Zakładamy również, że łączenie zbiorów $\out$ wykonuje się w czasie stałym (jak np. listy wiązane).

		Pozostaje zastanowić się, ile czasu zajmie przeczytanie tekstu $T$ ($\|T\|=m$) i wypisanie wszystkich wystąpień wzorców. Moc $\Sigma$ traktujemy jak stałą.


		Zatem możemy określić złożoność czasową wyszukiwania wielu wzorców w tekście za pomocą automatu Aho-Corasick jako liniową w stosunku do sumy długości wszystkich wzorców oraz ilości ich wystąpień.

		\section{Zastosowanie do wzorców 2D}

Aho-Corasick multiple string matching algorithm #8

Aho-Corasick multiple string matching algorithm #8

Conversation

pmikolajczyk41 commented Jun 17, 2020

rafalalgo commented Jun 18, 2020

pmikolajczyk41 commented Jun 18, 2020

krzysztof-turowski left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pmikolajczyk41 Jun 19, 2020 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pmikolajczyk41 commented Jun 20, 2020

pmikolajczyk41 Jun 19, 2020 •

edited