Revise handling of index and score in Inventory.suggest #220

Open
bskinn opened this issue Jan 25, 2022 · 2 comments
@bskinn
Owner

bskinn commented Jan 25, 2022

The current stringify-then-regex-extract approach is kind of horrifying. I must have been on a regex kick when I wrote it. But, it was the best way I could think of at the time to retain each object's index value when passed into fuzzywuzzy.process.

Should be possible to just keep everything as tuples throughout? Catch might be on trying to implement #213, since a multiprocess implementation likely won't retain the ordering of the items, and so a simple enumerate(...) on the (e.g.) fwp.process() call might end up with meaningless index values.
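One way to sidestep the ordering problem is to pair each name with its index before scoring and have every worker hand that index back with its result. Below is a minimal sketch of that idea using only the standard library. Every name in it (`toy_score`, `score_one`, `suggest`) is hypothetical and not part of sphobjinv, and the thread pool merely stands in for whatever parallel approach #213 ends up using.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def toy_score(query, name):
    """Placeholder scorer (hypothetical): 100 on a substring hit, else 0."""
    return 100 if query in name else 0


def score_one(scorer, query, index, name):
    """Score a single name and hand its original index back with the result."""
    return name, scorer(query, name), index


def suggest(query, names, scorer=toy_score, threshold=50):
    """Return (name, score, index) tuples for names scoring >= threshold."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(score_one, scorer, query, index, name)
            for index, name in enumerate(names)
        ]
        # as_completed yields in completion order, not submission order,
        # so the index has to come from the result tuple itself.
        results = [future.result() for future in as_completed(futures)]
    kept = [r for r in results if r[1] >= threshold]
    # Best score first; ties broken by original index for a stable order.
    return sorted(kept, key=lambda r: (-r[1], r[2]))
```

Because the index travels inside each tuple, no string formatting or regex extraction is needed afterwards, and the final sort restores a deterministic order regardless of which worker finished first.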

On the other hand, this sort of ugly stringified approach will probably be horrid for trying to integrate with pluggable scoring-callables (#207), and a re-implementation will be needed anyways.

Might be better to implement my own scoring based on difflib, move away from fuzzywuzzy? Or, perhaps switch to a single-string scoring function within fuzzywuzzy? (That switch doesn't help dealing with possible loss of ordering by multiprocessed and/or third-party scoring functions....)
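For reference, a single-string scorer built on difflib could look roughly like the sketch below. `difflib_score` is a hypothetical name, and lowercasing both strings and scaling the ratio to 0-100 are assumptions made for illustration; this is not a drop-in replacement for fuzzywuzzy's scoring and will not produce the same numbers.

```python
from difflib import SequenceMatcher


def difflib_score(query, candidate):
    """Return an integer similarity score on a 0-100 scale."""
    # SequenceMatcher.ratio() gives a float in [0, 1]; case is ignored here
    # purely as an illustrative choice.
    ratio = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
    return round(100 * ratio)


def rank(query, names):
    """Score every name one at a time, keeping each original index."""
    scored = [
        (name, difflib_score(query, name), index)
        for index, name in enumerate(names)
    ]
    return sorted(scored, key=lambda item: (-item[1], item[2]))
```

Since the scorer takes one candidate at a time, the caller controls the loop and can attach the index directly, which is the property the stringified approach was working around.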

@bskinn bskinn added type: refactor 🔀 Some code needs restructuring pr: needs changelog 📍 labels Jan 25, 2022
@bskinn bskinn added this to the v2.3 milestone Jan 25, 2022
@eirrgang

I notice that underscores don't contribute much to the matching score. Is this from the scoring heuristic or from the regex?

For instance, searching sphobjinv suggest https://docs.python.org/3/library '__module__' -su scores everything containing the string "module" with 90, including every occurrence of a :py:module: suggestion.

@bskinn
Owner Author

bskinn commented Apr 20, 2022

It's just the way fuzzywuzzy works, @eirrgang -- I believe it strips non-alphanumeric characters, which is why the underscores don't affect the match.

In terms of the broader question of scoring quality, fuzzywuzzy does a Levenshtein-style string-diff calculation, and then transforms that to a 0-100 scale in some fashion. I haven't ever taken a close look at what it's doing... it's definitely not an optimal scoring function, but it worked well enough when I was looking for something lightweight and easy to integrate.

I want to work toward #207, so that users can develop and use higher-quality and/or customized scoring functions. I'm picturing a full plugin system, so that rather than housing a bunch of different scoring functions in sphobjinv itself, they can be maintained as separate sphobjinv-scoring-foo packages. Not sure how long that will take, though. I suspect the work I'm going to need to do to implement #178 will also move part of the way toward #207... will see.
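As a rough picture of what a pluggable scoring-callable might look like, here is a minimal in-process registry sketch. Everything in it is hypothetical: the `Scorer` signature, `register_scorer`, `get_scorer`, and the `"difflib"` key are illustrative names, not an existing or planned sphobjinv API. Separately installed packages would more likely be discovered through packaging entry points, which this sketch does not attempt.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict

# Assumed contract: a scorer takes (query, candidate) and returns 0-100.
Scorer = Callable[[str, str], int]

_SCORERS: Dict[str, Scorer] = {}


def register_scorer(name):
    """Decorator that files a scoring callable under the given name."""
    def decorator(func):
        _SCORERS[name] = func
        return func
    return decorator


@register_scorer("difflib")
def difflib_scorer(query, candidate):
    """Example scorer: difflib ratio scaled to an integer 0-100."""
    return round(100 * SequenceMatcher(None, query, candidate).ratio())


def get_scorer(name):
    """Look up a registered scorer, failing loudly on unknown names."""
    try:
        return _SCORERS[name]
    except KeyError:
        raise ValueError(f"no scorer registered as {name!r}") from None
```

With a contract like this, the suggest logic only needs to call `get_scorer(name)(query, candidate)` per object, so third-party scorers never have to know about index bookkeeping.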

@bskinn bskinn changed the title Refactor handling of index and score in Inventory.suggest Revise handling of index and score in Inventory.suggest Apr 2, 2024
@bskinn bskinn added the issue: future ⏳ Planned, but not for a specific release label Apr 2, 2024