
Implement multiprocessing speedup for suggest in absence of python-Levenshtein #178

Open
bskinn opened this issue Jan 10, 2021 · 7 comments

Comments


bskinn commented Jan 10, 2021

NOTE: With the deprecation of the python-Levenshtein speedup for the suggest functionality (see #211 & #218), identifying other methods to increase performance is a priority. This multi-processing based approach is the best one I've thought of so far. If anyone has another suggestion, please open a new issue to discuss it.


Very early POC on suggest-multiproc branch.

Early results suggest some speedup is possible, but it's not a slam-dunk whether it's worth it, given the sub-2 s processing times for most inventories. Properly exploiting difflib.SequenceMatcher's caching behavior may change this, however.

For comparison, python-Levenshtein provides a better speedup with less internal complexity, and doesn't appear to benefit at all from multiprocessing.

Latitude laptop (4 cores), yt (17420 objects)

Quick, rough timing:

>>> import sphobjinv as soi
>>> import timeit
>>> timeit.timeit("soi.Inventory('tests/resource/objects_yt.inv').suggest('ndarray')", globals=globals(), number=1)

NO mp,      WITH lev:  2.177 s
WITH mp(4), WITH lev:  2.396 s
WITH mp(6), WITH lev:  2.355 s

NO mp,      NO lev:   11.795 s
WITH mp(2), NO lev:   10.361 s
WITH mp(3), NO lev:    8.471 s
WITH mp(4), NO lev:    7.583 s (~35% speedup)
WITH mp(5), NO lev:    7.399 s (oversubscribed)
WITH mp(6), NO lev:    8.372 s (oversubscribed)

Notes

  • Probably can switch to pool.map(), without the context manager
  • Looks like using multiprocessing.cpu_count() would be a reasonable default pool size
  • Want user to be able to set max of process count (probably refuse to oversubscribe??)
    • API -- arg to Inventory.suggest(), likely
    • CLI -- new parameter for suggest subparser
  • If nproc == 1 then skip multiprocessing entirely
  • Probably need to check at sphobjinv import-time whether multiprocessing is available
    • Doing that in a way that doesn't further slow down the import time may be tricky
  • Per the difflib docs, implementing a bespoke scoring function directly with difflib.SequenceMatcher may allow significant speed gains, due to the ability to cache the suggest search term
  • For v2.3, perhaps keep the current single-process fuzzywuzzy method as the default, but implement with a feature flag to switch to the mp-accelerated version?
  • A naive, direct use of difflib.SequenceMatcher does not give good suggestions for some searches (e.g., "dataobj" in the sphobjinv inventory), compared to the default, WRatio-based partial-matcher in fuzzywuzzy
    • Going to want to port this fuzzywuzzy method directly into the codebase, adapted to allow for multiprocess, I think.
@bskinn bskinn added this to the Future milestone Jan 10, 2021
@bskinn bskinn modified the milestones: Future, v2.2 Dec 12, 2021
@bskinn bskinn modified the milestones: v2.2, v2.3 Dec 23, 2021
@bskinn bskinn pinned this issue Jan 31, 2022
@bskinn bskinn changed the title Implement multiprocessing speedup in absence of python-Levenshtein Implement multiprocessing speedup for suggest in absence of python-Levenshtein Jan 31, 2022
bskinn referenced this issue in sopel-irc/sopel-rtfm Oct 25, 2022:

> Honestly, it's still not great. The best next step appears to be taking
> the list of inventory items and using `fuzzywuzzy` directly for more
> control over how the query is scored against each item.
>
> In short, `sphobjinv` does enough for a proof-of-concept, but how its
> `.suggest()` method works isn't quite search-engine-y enough.

bskinn commented Oct 25, 2022

If I'm going to have to re-implement fuzzywuzzy's WRatio as part of this, I might as well do it in a way that exposes the internal weighting values inside WRatio, so that if someone wanted to they could adjust those values. Have to think about whether there's a way to do that ~cleanly in a plugin system, though...
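A port with exposed weights might look something like this simplified sketch. To be clear, this is not fuzzywuzzy's actual WRatio logic; the windowed partial-ratio stand-in and the `partial_weight` parameter are illustrative assumptions about what a tunable combiner could expose:

```python
from difflib import SequenceMatcher

def simple_partial_ratio(query, target):
    # Best ratio of the shorter string against any same-length window of
    # the longer one (a simplified stand-in for fuzzywuzzy's partial_ratio)
    short, long_ = sorted((query, target), key=len)
    if not short:
        return 0.0
    return max(
        SequenceMatcher(None, short, long_[i:i + len(short)]).ratio()
        for i in range(len(long_) - len(short) + 1)
    )

def weighted_ratio(query, target, *, partial_weight=0.9):
    # WRatio-like combiner, but with the internal weighting exposed as a
    # parameter instead of a hard-coded constant
    full = SequenceMatcher(None, query, target).ratio()
    partial = partial_weight * simple_partial_ratio(query, target)
    return round(100 * max(full, partial))
```

With the weight surfaced like this, a plugin system could pass alternate values through without re-implementing the scorer.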


bskinn commented Nov 29, 2022

The scorer should make its own internal decision about whether to use multiprocessing, but suggest should have an option to coerce single- or multi-process operation if the user prefers, or if the multi-detector is making a bad determination.

The scorer should have a fully inspectable/introspectable API, with a defined interface for how suggest will call it. It should receive the most general, information-rich inputs possible, which is probably the search term and the inventory itself!

Provide index and score flags, too? Threshold? I could see a scorer knowing enough about its own properties to make a quick first pass and discard sufficiently poor matches. Might as well provide all the information possible. This has to be set up so that it's easy to add more information if something new comes up.
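One way to pin down that "defined interface" is a structural protocol. This is purely a hypothetical sketch; a plain list of object names stands in for the Inventory here just to keep the example self-contained:

```python
from typing import Iterable, List, Protocol, Tuple

class Scorer(Protocol):
    # Hypothetical interface: suggest() hands the scorer the richest
    # inputs available and gets back (name, 0-100 score) pairs
    def score(self, search: str, objects: List[str]) -> Iterable[Tuple[str, int]]:
        ...

class ExactScorer:
    """Trivial conforming scorer: 100 for an exact match, 0 otherwise."""

    def score(self, search, objects):
        return [(obj, 100 if obj == search else 0) for obj in objects]
```

Any object implementing `score()` with this shape conforms, so built-in and plugged-in scorers would be interchangeable without inheritance.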


bskinn commented Dec 7, 2022

SuggestPayload object with Inventory and suggest parameters (index, score, thresh... others?) passed to scorer, along with an extra_kwargs dict of additional arguments that can adjust the scorer behavior.

Best practice... recommend that any scorer, builtin or plugged, define Enums for use as the keys in extra_kwargs?
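Sketched as a dataclass, the payload idea might look like the following. Every field name here is illustrative rather than a settled API, including the Enum-keyed `extra_kwargs` convention suggested above:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict

@dataclass
class SuggestPayload:
    # Hypothetical shape; field names and defaults are illustrative
    inventory: Any               # the sphobjinv Inventory being searched
    search: str                  # the suggest search term
    index: bool = False
    score: bool = False
    thresh: int = 75
    extra_kwargs: Dict[Any, Any] = field(default_factory=dict)

class PenaltyKeys(Enum):
    # Per the best-practice note: each scorer defines Enums to use as
    # the keys into extra_kwargs, avoiding key collisions between scorers
    PENALTY = "penalty"

payload = SuggestPayload(
    inventory=None,  # a real Inventory instance in practice
    search="dataobj",
    extra_kwargs={PenaltyKeys.PENALTY: 0.5},
)
```

Enum keys make collisions between different scorers' options structurally impossible, since two distinct Enum members never compare equal even with the same value.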


bskinn commented Dec 7, 2022

As part of the new implementation of the multiprocessing-enabled WRatio scorer, perhaps add an option to 'devalue' object match scores if there's no substring match?

E.g., if `search not in rst`, then adjust the score as

$$ \mathrm{score}_{adj} = 100 \left( \mathrm{score}_0 / 100 \right)^{1 + penalty} $$

$penalty$ would likely be a decimal less than one in most cases.
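Worked numerically, the devaluation could be a small helper like this (function names are hypothetical):

```python
def devalue(score0, penalty=0.5):
    """Deflate a 0-100 match score per score_adj = 100 * (score0/100)**(1+penalty)."""
    return 100 * (score0 / 100) ** (1 + penalty)

def adjusted_score(search, rst, score0, penalty=0.5):
    # Apply the penalty only when `search` has no substring match in `rst`
    return score0 if search in rst else devalue(score0, penalty)
```

With `penalty = 0.5`, a raw score of 80 deflates to `100 * 0.8 ** 1.5`, about 71.6, while substring matches keep their full score; perfect scores (100) are unaffected either way.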


bskinn commented Dec 7, 2022

Either way, keep the legacy scorer, available under the legacy id... and call the new one default, probably.


bskinn commented Dec 8, 2022

Can consider using https://pypi.org/project/editdistance/ as a new speedup extra, integrating into the new default scorer.

Or --- if it's fast enough with the editdistance speedup, may be able to avoid writing the multiprocessing-accelerated scorer entirely.
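For reference, the quantity editdistance computes is plain Levenshtein distance; `editdistance.eval()` returns the same values as this pure-Python dynamic-programming sketch, just much faster via its C extension. The mapping onto a 0-100 similarity scale is a hypothetical normalization, not the library's API:

```python
def levenshtein(a, b):
    # Classic row-by-row DP edit distance (what editdistance.eval computes)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # Hypothetical mapping of edit distance onto a fuzzywuzzy-style 0-100 scale
    if not a and not b:
        return 100
    return round(100 * (1 - levenshtein(a, b) / max(len(a), len(b))))
```

Whether this normalized distance ranks suggestions as well as a WRatio-style scorer is exactly the open question for making it the new default.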

@bskinn bskinn added issue: planned ⌚ Assigned to a specific version's milestone and removed issue: maybe 🤔 Being considered, but not certain pr: needs tests ✔️ labels Dec 13, 2022

bskinn commented Jul 6, 2023

Sprinkling this in various issues: rapidfuzz as another fuzzywuzzy alternative.
