Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should work with bytes? #65

Open
orbisvicis opened this issue Aug 16, 2017 · 18 comments
Open

Should work with bytes? #65

orbisvicis opened this issue Aug 16, 2017 · 18 comments

Comments

@orbisvicis
Copy link

No description provided.

@orbisvicis
Copy link
Author

Hmm just noticed ahocorasick.unicode in the docs and TRIE_LETTER_TYPE in the source. Is there no way to support both bytes and strings in python 3?

@orbisvicis
Copy link
Author

orbisvicis commented Aug 16, 2017

Constantly segfaults when I disable AHOCORASICK_UNICODE, even when not using pyahocorasick objects - py3.5.3.

@WojciechMula
Copy link
Owner

Bytes are supported. Could you please run python unitests.py? There're many tests that use bytes.

@orbisvicis
Copy link
Author

orbisvicis commented Aug 17, 2017

This is with pristine pyahocorasick on python 3.5.3:

test (unittest.loader._FailedTest) ... ERROR

======================================================================
ERROR: test (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: test
Traceback (most recent call last):
  File "/usr/lib64/python3.5/unittest/loader.py", line 428, in _find_test_path
    module = self._get_module_from_name(name)
  File "/usr/lib64/python3.5/unittest/loader.py", line 369, in _get_module_from_name
    __import__(name)
  File "<...>/pyahocorasick/test.py", line 16, in <module>
    a.add_word(w, (i, w))
TypeError: string expected


----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)
Test failed: <unittest.runner.TextTestResult run=1 errors=1 failures=0>
error: Test failed: <unittest.runner.TextTestResult run=1 errors=1 failures=0>

@sborpo
Copy link

sborpo commented May 16, 2018

Hello,
Is the fix to this bug is planned for the near future?
I have the same problem :(
Unfortunately cannot work in bytes mode

@WojciechMula
Copy link
Owner

Unfortunately I cannot devote as much time as previously. Waiting for contributors.

@vapier
Copy link

vapier commented Jun 25, 2021

could we at least flip the default and make a 1.5 bump ? it's easy to convert strings->bytes, but hard to convert bytes->strings reliably.

@Dobatymo
Copy link

could we at least flip the default and make a 1.5 bump ? it's easy to convert strings->bytes, but hard to convert bytes->strings reliably.

Not exactly the best solution, but you can still use a codec which cannot fail to do the bytes to string conversation like latin-1.

@pombredanne
Copy link
Collaborator

@vapier hey 👋

re:

could we at least flip the default and make a 1.5 bump ? it's easy to convert strings->bytes, but hard to convert bytes->strings reliably.

Can you be more explicit? what would need to be fixed/changed and where?

@pombredanne
Copy link
Collaborator

@vapier you rock! .. I guess there is an interesting derived tidbit: pyahocorasick may be used in chromium OS? 😇

And I see you are also the author to credit for the patch https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+log/refs/heads/release-R100-14526.B/dev-python/pyahocorasick/files/pyahocorasick-1.4.0-bytes.patch

I also now recall of so many of your awesome contribs like on strace and so many other places. I am not worthy.

@pombredanne
Copy link
Collaborator

@WojciechMula this deserves a 1.5 bump indeed. I reckon I am not impacted in most cases as usually stuff only integers in automatons. But I wonder if we should not publish two builds now that I started pushing pre-built wheels for all OSes: one for bytes and one for unicode?

@WojciechMula
Copy link
Owner

Oh man, it's really great news about usage!
@pombredanne I think it's better to finish releasing and then add patch. More work, but a bit more logical. Of course IMHO it's up to you.

Anyway, it would be great if we provided specialisations for each type and then have some lightweight factory for this, even in pure python. This would be pretty easy if we have some code generator underneath.

pombredanne added a commit that referenced this issue Mar 8, 2022
Tests are now mostly passing when building without unicode.
The tests should also run in the CI though.

Reference: #65
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Mar 8, 2022
The behaviour with bytes build has some issues. This helps testing this

Reference: #65
Reference: #165
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Mar 8, 2022
The environment variables AHOCORASICK_UNICODE and AHOCORASICK_BYTES now
drive the flavor of the build if defined (using any value).

Reference: #65
Reference: #165
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Mar 8, 2022
Use environment variables AHOCORASICK_UNICODE and AHOCORASICK_BYTES
to test vboth builds on all supported OSes.

Reference: #65
Reference: #165
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Copy link
Collaborator

I pushed some tests in this branch https://github.com/WojciechMula/pyahocorasick/tree/release-2.0.0

  • all the tests have been adapted or duplicated with a condition on the build type to run on unicode and bytes builds
  • there are new environment variable to drive the build as unicode or bytes See f6a5d20

Define either AHOCORASICK_UNICODE=yes or AHOCORASICK_BYTES=yes to build one or the other such as:
with AHOCORASICK_UNICODE=yes pip install -e . or AHOCORASICK_BYTES=yes make

There are some issues that show up:

@vapier the new environment variable should help you at the minimum get rid of your setup.py patch

Now to support strings & bytes simultaneously, we could either build both and manage a factory in Python as @WojciechMula suggested above... or find a way to deal with both together.

I have been toying with the idea of an alternative unified way (on the C side) where the number of symbols in some "alphabet" would be the key input as opposed to the implied bytes = length 1 (e.g. 256 symbols in alphabet) and unicode = variable length based on build.

Also re: #65 (comment)

I reckon I am not impacted in most cases as usually stuff only integers in automatons.

Actually I usually use sequences of integers as keys with ahocorasick.KEY_SEQUENCE
I could very much benefit from less memory usage if the KEY_SEQUENCE integers could be 16bits rather 32bits... My set of unique values are in a well defined range roughly under 15 bits long.

@pombredanne
Copy link
Collaborator

pombredanne commented Mar 8, 2022

You can see the runs failures there https://github.com/WojciechMula/pyahocorasick/actions/runs/1952208882

  • all bytes builds have at two tests failing, both linked to the Automaton.items() method
  • the windows unicode builds have also the same two tests failing as if they were built always a bytes and not unicode.

@pombredanne
Copy link
Collaborator

At this stage all the bytes-based tests pass.
I wonder if it would be useful to build two different packages like pyahocorasick and pyahocorasick-bytes to make it easier for users to pick a byte build? In this case the wheels would be built for bytes and the sdist would be by default building as bytes. This would be fairly simple.

@pombredanne
Copy link
Collaborator

We have #165 as a related issue

@vapier
Copy link

vapier commented Jul 12, 2023

personally i'd rip the bandaid off, but failing that, publishing two wheels with the diff names sounds OK assuming they can be installed in parallel

i guess there's no way to build 2 C components into a single package and pick the right one based on the inputs ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants