how about the new approach based on 'Double Array Trie' #137

Open
SeekPoint opened this issue Jan 3, 2021 · 15 comments

@SeekPoint

Aho Corasick algorithm based on Double Array Trie
https://github.com/hankcs/AhoCorasickDoubleArrayTrie
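
For readers unfamiliar with the structure, here is a minimal sketch of the double-array transition rule (my own illustration, not code from that project; building the BASE/CHECK arrays, which is the hard part, is omitted):

def da_goto(base, check, state, symbol):
    # Double-array trie transition: candidate = base[state] + symbol,
    # valid only when check[candidate] points back to the source state.
    candidate = base[state] + symbol
    if 0 <= candidate < len(check) and check[candidate] == state:
        return candidate
    return None  # no goto edge; an AC automaton would follow the failure link instead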

@WojciechMula
Owner

Looks interesting, thanks for the pointer!

@SeekPoint
Author

Here is the paper:
https://dl.acm.org/doi/10.1109/32.31365

Waiting for good news.

@Dobatymo

Dobatymo commented Jan 4, 2021

In addition to the performance enhancements, this might also be a good candidate for the 'shared memory data structure' discussed in #114.

@pombredanne
Collaborator

FWIW, @mischasan has had this other approach using interleaved arrays for a long while: https://github.com/mischasan/aho-corasick (design notes at https://docs.google.com/document/d/1e9Qbn22__togYgQ7PNyCz3YzIIVPKvrf8PCrFa74IFM/edit), and it is in C++.
@WojciechMula's take in this project is fairly tight and compact... IMHO any fundamental change to the core would need super careful testing and benchmarking, as it is doubtful that alternatives would perform any better until proven ;)

@WojciechMula
Owner

That @mischasan project seems great!

I wish I could put more effort into this project, but there are some obstacles I'm unable to overcome. Like, a day has only 24 hours. :) I quite often think about improving the module, trying different data structures, extending the API, etc. But that would require allocating a lot of free time if it is to be done well.

@mischasan

mischasan commented Jan 31, 2021 via email

@mischasan

mischasan commented Jan 31, 2021 via email

@mischasan

mischasan commented Jan 31, 2021 via email

@pombredanne
Collaborator

pombredanne commented Jan 31, 2021

Looked at your README: "... approximate (multi-pattern string search)" had me curious. spamd uses acism of exact chunks, to support regexes. What's your "approximate" about?

@mischasan I am the culprit for adding the word "approximate" to the README (I did not write the feature); this is essentially because of the wildcard feature https://pyahocorasick.readthedocs.io/en/latest/#wildcard-search, which I reckon may not be entirely correct as a qualifier.
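
A small usage sketch of that wildcard feature, going by the linked docs (the '?' wildcard argument and the ahocorasick.MATCH_EXACT_LENGTH flag are as I read them there, so treat this as approximate rather than verified):

import ahocorasick

A = ahocorasick.Automaton()
for word in ("license", "licence", "copyright"):
    A.add_word(word, word)

# Wildcard lookup over the stored keys: '?' stands for any single character.
# This searches stored keys, not arbitrary input text.
print(list(A.keys("licen?e", "?", ahocorasick.MATCH_EXACT_LENGTH)))
# expected to yield both 'license' and 'licence'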

@mischasan

mischasan commented Jan 31, 2021 via email

@pombredanne
Collaborator

pombredanne commented Jan 31, 2021

@mischasan I do not have hard stats, but I use pyahocorasick with a dataset of about 250MB for about 21K patterns of very different sizes: some a few bytes and some tens of KB. This is in https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data
The main automaton has these stats:

$ python
Python 3.6.10 (default, Jun 13 2020, 08:53:46) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from licensedcode.cache import *
>>> get_index().rules_automaton.get_stats()
{'nodes_count': 3156589, 'words_count': 21370, 'longest_word': 14062, 'links_count': 3156588, 'sizeof_node': 32, 'total_size': 126263552}

... knowing that these are strings converted to sequences of 16-bit ints using a dictionary mapping known words to 16-bit ints, prior to being added to the trie, so in my case this is likely more compact than using the original bytes directly. 126MB is roughly the size of that automaton.
I did not run explicit benchmarks on that, but anecdotally it is orders of magnitude faster than anything else with a Python API that I have tried before or since.
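
As a back-of-the-envelope check (my own arithmetic, not anything documented), the reported total_size is consistent with 32 bytes per node plus 8 bytes per link on a 64-bit build:

nodes_count, links_count = 3156589, 3156588
sizeof_node, sizeof_link = 32, 8   # sizeof_link = 8 is an assumption: one 64-bit pointer per link
print(nodes_count * sizeof_node + links_count * sizeof_link)  # 126263552, matching total_size above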

@pombredanne
Collaborator

@mischasan I guess that a long while ago I was considering building a Python wrapper for your C++ acism, based on mischasan/aho-corasick#12 (comment) ... but then I found Wojciech's pyahocorasick. Now it's all coming back to me :)

@mischasan

mischasan commented Jan 31, 2021 via email

@mischasan

mischasan commented Jan 31, 2021 via email

@pombredanne
Copy link
Collaborator

Just to clarify: the input pattern strings are UTF-16; or just the internally-converted dictionary keys? And the dictionary mapping can handle up to 65535 strings?

@mischasan in my case the inputs are long open source license texts, medium-length notices and short mentions. They are broken into tokens (i.e. split into words, where any space or punctuation is a separator). Each word is looked up in a dictionary keyed by these words, where the value is a 16-bit int. The sequence of all 16-bit values for a given input string (e.g. a license text) becomes the bytes fed to the automaton for indexing.

At search time, the input text to search for patterns in (e.g. some text where we want to find a license) is subjected to the same transformation into a sequence of 2-byte tokens, which is then used for an Aho-Corasick search.

That's a highly specialized use case of course... the dictionary basically maps words known to be used in any license and that are more or less "legally" relevant... Think of these as the whole vocabulary of a language that I call "legalese".
Once the dictionary transformation is done, words like "license" or "copyright" may be represented by \xAB or \x78 and are all concatenated back into a bytes string. After that step we have something plain from an ahocorasick point of view, except that it uses a 2-byte "alphabet" that is not ASCII, Unicode or UTF, just an encoded bytes string.

And yes, in my case the dictionary has up to 65535 known words... But guess what, the legalese vocabulary is much poorer than that ;) ... ATM it is about 20000 words and growing fairly slowly as new licenses and notices are added.
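
A rough sketch of the pipeline described above, under stated assumptions: the vocabulary dict, the big-endian 2-byte packing and the token splitting are illustrative only (not the actual scancode-toolkit code), and it assumes a pyahocorasick build that accepts bytes keys (with the default unicode build the ids would have to be mapped to code points, e.g. via chr(), instead):

import re
import struct
import ahocorasick

# Hypothetical "legalese" vocabulary: each known word maps to a 16-bit id.
vocabulary = {"permission": 0x0001, "is": 0x0002, "hereby": 0x0003, "granted": 0x0004}

def encode(text):
    # Split on anything that is not a word character, keep only known words,
    # and concatenate their 16-bit ids into a plain byte string (the 2-byte "alphabet").
    tokens = re.split(r"\W+", text.lower())
    ids = [vocabulary[t] for t in tokens if t in vocabulary]
    return b"".join(struct.pack(">H", i) for i in ids)

A = ahocorasick.Automaton()
rule_text = "Permission is hereby granted"
A.add_word(encode(rule_text), rule_text)   # index the encoded rule text
A.make_automaton()

scanned = "... permission is hereby granted, free of charge, to any person ..."
for end_index, matched_rule in A.iter(encode(scanned)):
    print(end_index, matched_rule)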
