_set_token_ratio now keeps tokenization. #300

Open

wants to merge 2 commits into master
Conversation

MWLever

@MWLever commented Feb 21, 2021

Previously, `_set_token_ratio`, through a mixture of `join`, `split`, and `strip`, concatenated all tokens together with no whitespace. This allowed partial matches across token boundaries. This can occur in practice when a human enters a search query, but it is rare.
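As a sketch of the problem (standard library only; `collapse` is a hypothetical helper mimicking the old join/split behavior, not code from this repository):

```python
def collapse(s):
    # Mimics the old behavior: split into tokens, sort them,
    # then re-join with no separating whitespace.
    return "".join(sorted(s.split()))

collapsed = collapse("new york")
print(collapsed)  # -> "newyork"

# The substring "wyor" spans the original token boundary, so a
# partial match against it can now succeed:
print("wyor" in collapsed)   # True
print("wyor" in "new york")  # False
```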

Change: Implement Levenshtein's setratio() scoring and preserve tokenization in fuzz._set_token_ratio

Two tests now fail due to score changes, which is expected:

testTokenSetRatio: score improves
testWithCutOff: score improves to above 50

Previous issue: partial_token_set_ratio matching strings across tokens.

Fix: Preserve tokenization of the comparison sets and use Levenshtein's setratio/seqratio over ratio.

Detail:
Previously, token_set_ratio used Python's strip to remove whitespace. Since strip removes all whitespace, the set comparisons are no longer tokenized, so partial_token_set_ratio could match strings across word boundaries. This is generally unexpected behavior. This change enables a more bag-of-words style of comparison.
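A minimal sketch of a set-based score that preserves tokenization (a stdlib approximation of the idea behind Levenshtein's setratio, not the actual implementation; `set_ratio` and `one_way` are hypothetical names):

```python
from difflib import SequenceMatcher

def set_ratio(tokens_a, tokens_b):
    # Each token is matched only against whole tokens from the other
    # set, so matches cannot cross word boundaries.
    def one_way(src, dst):
        if not src:
            return 1.0 if not dst else 0.0
        return sum(
            max((SequenceMatcher(None, t, u).ratio() for u in dst), default=0.0)
            for t in src
        ) / len(src)
    a, b = set(tokens_a), set(tokens_b)
    # Symmetrize by averaging both directions.
    return (one_way(a, b) + one_way(b, a)) / 2

# Identical token sets score 1.0 regardless of order:
print(set_ratio({"new", "york"}, {"york", "new"}))  # 1.0
```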