Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plans for future 3.0.0 branch #100

Open
WojciechMula opened this issue Dec 3, 2018 · 13 comments
Open

Plans for future 3.0.0 branch #100

WojciechMula opened this issue Dec 3, 2018 · 13 comments

Comments

@WojciechMula
Copy link
Owner

Few things I consider for the next, incompatible version (in random order):

  1. Remove value_type from Automaton constructor (i.e. STORE_ANY, STORE_INTS, STORE_LENGTH). Using STORE_{INT,LENGTH} makes sense at C-language level, but in Python isn't good. I mean, even if we save bare integers underneath, they have to be converted back to Python object. Removing this might also improve memory usage. To be honest, STORE_LENGTH is stupid, I suspect it was added because "I could do it".
  2. Rework wildcard matching. Use syntax from glob module, i.e. always the '?' marks "any character". It will require a bit of parsing, but the current solution is silly and unintuitive.
  3. Support any iterator-like object in search. We have just gained support for tuples, which is great, but more generic code -- even if not specialised and fast -- seems be useful. (There was a question about mmaped files, for instance).
  4. Change also a little memory layout: when node has one child store pointer directly, rather allocating subarray of size 1.
  5. Remove kind "EMPTY" --- it's a technical detail; an automaton should have got always the "root" node, that makes adding and removing words a bit easier.
  6. Rewrote pickling mechanism. Now plain C structures by the module are mapped almost directly, which is not a good idea; the future format should be tailored to pickling needs, and also more validation has to be done put there.
@WojciechMula
Copy link
Owner Author

In the end I want to clean up my two pet peeves: 1) replace tabs with spaces and 2) move C files to subdirectory src.

@WojciechMula WojciechMula changed the title Plans for future 1.2.x branch Plans for future 2.0.0 branch Dec 12, 2018
@WojciechMula
Copy link
Owner Author

@pombredanne Philippe, do you have any comments, ideas? Is there anything which is now annoying and can be fixed.

@frankier
Copy link
Contributor

frankier commented Feb 13, 2019

So I never rebased it because I was not really happy with it as is, but this patch could be made nicer if based against a pyahocorasick with structure outlined below

frankier@433ca1b

So my request would be that the C module is moved to _ahocorasick and then everything re-exported from ahocorasick so that "porcelain" like that patch can be implemented in Python.

@pombredanne
Copy link
Collaborator

pombredanne commented Jun 4, 2019

@WojciechMula sorry for the late reply!... here are a few ideas, in no specific order:

  • build wheels for linux/mac/windows Provide manylinux wheel #91
  • in the Aho-Corasick search, have a way to return only the longest of contained matches iter with longest_prefix match #108
  • consider some of the similar shortcuting as implemented in ahocorapy
  • this is likely a dream and not suited to the algorithm: in the Aho-Corasick search, how some simple wildcard-like skipping mechanism: have a way to allow continue searching if the next character is not matching but the one after next matches and that up to a certain max number of characters for a given match.
  • in the Aho-Corasick search, optionally return all partial longest matches above a length threshold
  • Support typed array as an input when storing sequences Support typed array as an input when storing sequences  #113

@pombredanne
Copy link
Collaborator

@frankier re

So my request would be that the C module is moved to _ahocorasick and then everything re-exported from ahocorasick so that "porcelain" like that patch can be implemented in Python.

Your patch is in Python... so I am not sure I get the requirement. Can you elaborate?

@frankier
Copy link
Contributor

frankier commented Jun 4, 2019

So the idea is to make it so Python and non Python stuff can be imported from ahocorasick together. This is typically done by naming the C module _mymodule and re-exporting everything from a Python module mymodule. You can see use of this pattern in the standard library here: https://github.com/python/cpython/blob/master/Lib/pickle.py#L39

@WojciechMula
Copy link
Owner Author

TBH I would be happy to remove plain-python module. It hasn't been upgraded for ages and seems alternatives are more mature.

@pombredanne
Copy link
Collaborator

@WojciechMula re:

TBH I would be happy to remove plain-python module. It hasn't been upgraded for ages and seems alternatives are more mature.

I am fine too. What you could do it move it to tests or a gist and explain this was used as an early prototype or not do anything at all and just remove it. It is in the history in anycase.

@pombredanne
Copy link
Collaborator

Some notes:

  1. Remove value_type from Automaton constructor (i.e. STORE_ANY, STORE_INTS, STORE_LENGTH). Using STORE_{INT,LENGTH} makes sense at C-language level, but in Python isn't good. I mean, even if we save bare integers underneath, they have to be converted back to Python object. Removing this might also improve memory usage. To be honest, STORE_LENGTH is stupid, I suspect it was added because "I could do it".

I guess there are two ways I think of this:

  1. only store integers. Actual mapping to a Python object is done elsewhere... It could be done in Python.
  2. only store arbitrary pickable object.

As an example of some of the things I store see https://github.com/nexB/scancode-toolkit/blob/d1e725d3603a8f96c25f7e3f7595c68999b92a67/src/licensedcode/match_aho.py#L52
Here I store some int id for some object that represents what was matched plus some extra start and end positions.
This could be a simple integers alright and we could have a separate plain Python mapping that maps the integer to an actual object to return. This could simplify the code?

  1. Rework wildcard matching. Use syntax from glob module, i.e. always the '?' marks "any character". It will require a bit of parsing, but the current solution is silly and unintuitive.

Frankly I would drop entirely wildcard matching as this is both a rare feature of such an automaton and I am not sure anyone is using this. I could see the value to have a regex engine leveraging the automaton to speed up matching (say more or less like esmre or may be parts of hyperscan) but I cannot how I would use the wildcard matching

@melsabagh
Copy link
Contributor

melsabagh commented Jun 26, 2022

  1. Rework wildcard matching. Use syntax from glob module, i.e. always the '?' marks "any character". It will require a bit of parsing, but the current solution is silly and unintuitive.

I can give this a shot. Glob-style pattern matching should be possible if you slightly adjust the stack-based iterator in AutomatonItemsIter.c. I will work on a PR that supports the ? (match any single char) and * (match anything) wildcard patterns (perhaps also supports escaping them). I have extensively used similar techniques in a couple of projects to query large Tries with wildcard patterns and the performance often was significantly better than using regexes or Esmre. It is also more use friendly since glob patterns are much simpler and more intuitive than regex, especially when the queries tend to contain many special characters. Glob patterns also force the caller to "keep it simple".

@pombredanne
Copy link
Collaborator

@melsabagh-kw this looks like a great idea! I love it. Note that AFAIK, a glob can be converted to a regex (at least Python can translate https://docs.python.org/3/library/fnmatch.html#fnmatch.translate ) but this is a much simpler syntax than full regex and therefore better IMHO as you pointed out!

@melsabagh
Copy link
Contributor

I submitted a PR at #171

@pombredanne pombredanne changed the title Plans for future 2.0.0 branch Plans for future 3.0.0 branch Jan 14, 2023
@pombredanne
Copy link
Collaborator

I renamed this to target a future v3.0. I am working on a v2.0 today and this is not as ambitious as what we have here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants