Use the regex package in replacement of the built-in re module #128

Open
eldipa opened this issue Aug 2, 2020 · 4 comments · Fixed by #133
Labels
  • enhancement: something nice to have but it is not neither critical nor urgent
  • far in the future: something that it could be cool but it is too hard to implement

Comments

@eldipa
Collaborator

eldipa commented Aug 2, 2020

Describe the feature you'd like
The idea is to replace the standard re module with the third party regex module.

There are a few reasons to do it:

  • re was never intended to release the GIL, while regex can release it (and this can be enforced with concurrent=True). The lack of multithreading support in re prevented byexample from using threads instead of multiple processes, which are obviously more expensive.
  • byexample uses the regex engine heavily, but the regexes used in a single run are all different and unique. The traditional cache of re is therefore pointless: it disappears after each run (byexample is restarted and, as with any process, the OS frees its memory). In each run byexample needs to recompile every single regex, which is very expensive. Pickling does not help because re currently pickles only the expression and compiles it again when the pickle is loaded, so no time is saved. regex, however, supports pickling the bytecode directly. Note: we should test how much we gain from this.
  • despite being heavily optimized, the regexes created by byexample may lead to catastrophic backtracking (endless high CPU usage). re does not support atomic groups or possessive quantifiers, which could reduce the impact of catastrophic backtracking (see "Use a Thompson NFA instead of the Python default re (regex) engine if it is possible", #16). re does not support timeouts either. regex, on the other hand, supports all of them.
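
The pickling point in the second bullet can be checked with the standard library alone. The following is a minimal sketch, not byexample code: the pattern is a made-up example, and it only shows that a pickled re pattern carries the expression source, so a fresh process has to compile it again on load.

```python
import pickle
import re

# Hypothetical pattern, only for illustration; byexample builds its own.
source = r"(?P<word>\w+)\s+(?P<number>\d+)"
compiled = re.compile(source)

payload = pickle.dumps(compiled)

# The pickle stores the expression source (plus the flags),
# not the compiled bytecode.
carries_source = source.encode("utf-8") in payload

# Loading rebuilds the pattern from that source; in a fresh process
# (empty re cache) this means compiling it again.
restored = pickle.loads(payload)
same_pattern = (restored.pattern, restored.flags) == (compiled.pattern, compiled.flags)

print(carries_source, same_pattern)
```

Whether pickling regex patterns actually saves time is the open question from the note above; this sketch does not measure it.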
@eldipa eldipa added the enhancement something nice to have but it is not neither critical nor urgent label Aug 2, 2020
@eldipa
Collaborator Author

eldipa commented Mar 5, 2021

From 81edc42 we have the following observations after instrumenting and measuring the time for building a regex object (compile function) and using that regex to match a given string, for all the examples of parse_sm.py's documentation:

  • regex's compile function is 96 times slower than re's (on the order of 300 microseconds against 3 microseconds).
  • regex's match function is 2.5 times slower than re's (on the order of 8 microseconds against 3 microseconds).
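
For anyone who wants to reproduce this kind of measurement, here is a rough sketch of how compile and match could be timed separately. It is not the instrumentation used in 81edc42: the helper name, the pattern and the input text are assumptions, and regex is imported only if it is installed. Both re and regex expose purge(), which is used here so that compile does real work on every iteration instead of hitting the engine's cache.

```python
import re
import time


def time_compile_and_match(engine, pattern, text, repeat=200):
    """Return (average compile seconds, average match seconds)."""
    compile_total = 0.0
    match_total = 0.0
    for _ in range(repeat):
        # Empty the engine's internal cache so compile() is not a lookup.
        engine.purge()

        start = time.perf_counter()
        compiled = engine.compile(pattern)
        compile_total += time.perf_counter() - start

        start = time.perf_counter()
        compiled.match(text)
        match_total += time.perf_counter() - start
    return compile_total / repeat, match_total / repeat


engines = {"re": re}
try:
    import regex  # third-party package, optional for this sketch
    engines["regex"] = regex
except ImportError:
    pass

# Hypothetical pattern and input, not taken from parse_sm.py.
pattern = r"(?P<key>\w+)\s*=\s*(?P<value>\d+)"
text = "answer = 42"

results = {
    name: time_compile_and_match(module, pattern, text)
    for name, module in engines.items()
}
for name, (compile_avg, match_avg) in results.items():
    print(f"{name}: compile {compile_avg * 1e6:.1f} us, match {match_avg * 1e6:.1f} us")
```

The absolute numbers depend on the machine and on the patterns, so they should not be expected to match the figures above.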

Despite showing the worst performance regression, the slowness of compile may not be so important if we can implement a regex cache; the performance of match, however, may be a problem.

@eldipa eldipa reopened this Mar 5, 2021
@eldipa
Collaborator Author

eldipa commented Mar 5, 2021

The merge of the branch onto master (#133) was reverted in fc9b1c4 due to the performance regression.

@eldipa eldipa added this to the 10.0.0 milestone Mar 6, 2021
@eldipa
Collaborator Author

eldipa commented Mar 11, 2021

Running the full test suite with the profiler on showed that the _exec function ran 12% slower using regex than using the re engine.

@eldipa
Collaborator Author

eldipa commented Mar 13, 2021

At the moment, regex will not be integrated in byexample:

It may be possible to integrate part of #139 to add a layer of abstraction that makes byexample and its modules agnostic of the regex engine in use. This would make a future transition to regex easier at minimal cost.

@eldipa eldipa removed this from the 10.0.0 milestone Mar 13, 2021
@eldipa eldipa added the far in the future something that it could be cool but it is too hard to implement label Mar 13, 2021
eldipa added a commit that referenced this issue Mar 14, 2021
Use a home-made regex module to replace Python's re. This home-made
module serves as a layer to "hide" the use of re: its only
purpose is to serve as a thin layer between client code and the real
regex engine.

Supports only compile and escape to promote the use of regex/pattern
objects instead of the global functions search/match/find/sub/...

This layer can be used later to do advanced caching or to use a different
regex engine.

This commit was originally developed for #128.

(cherry picked from commit 5de8a41)
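
A thin layer like the one described in this commit message could look roughly like the sketch below. This is not the code from the commit: the class name RegexFacade and the regex_layer instance are hypothetical, and the only facts taken from the message are that the layer exposes just compile and escape and delegates to the real engine.

```python
import re as _real_engine  # the engine hidden behind the layer


class RegexFacade:
    """Hypothetical thin layer: only compile and escape are exposed."""

    def __init__(self, engine=_real_engine):
        # Swapping the engine (for example, for the regex package) would
        # only require passing a different module here.
        self._engine = engine

    def compile(self, pattern, flags=0):
        # Client code gets a pattern object and calls match/search on it,
        # instead of using the engine's global functions.
        return self._engine.compile(pattern, flags)

    def escape(self, text):
        return self._engine.escape(text)


regex_layer = RegexFacade()

literal = regex_layer.escape("1+1=2")
matcher = regex_layer.compile(r"^" + literal + r"\s*(?P<rest>.*)$")
found = matcher.match("1+1=2 obviously")
print(found.group("rest"))  # prints "obviously"
```

Because client code never imports the engine directly, a cache or a different engine could be added later by changing only this layer.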
eldipa added a commit that referenced this issue Mar 14, 2021
The commit was originally developed for #128
(cherry picked from commit a3ebe9b)
eldipa added a commit that referenced this issue Mar 24, 2021
Use a home-made regex module to replace Python's re. This home-made
module serves as a layer to "hide" the use of re: its only
purpose is to serve as a thin layer between client code and the real
regex engine.

Supports only compile and escape to promote the use of regex/pattern
objects instead of the global functions search/match/find/sub/...

This layer can be used later to do advanced caching or to use a different
regex engine.

This commit was originally developed for #128.

(cherry picked from commit 5de8a41)