
token_set_ratio Degenerate Case #325

Open
rogerrohrbach opened this issue Oct 13, 2021 · 0 comments
Referring to the description of token_set_ratio in the original blog post: if the tokens of STRING1 form a strict subset of the tokens of STRING2 (so that SORTED_INTERSECTION equals one of the combined strings), the result ratio will be 100. E.g.,

fuzz.token_set_ratio("Deep Learning", "Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2")

yields 100. This is patently incorrect, and does not uphold the purported intuition ("because the SORTED_INTERSECTION component is always exactly the same, the scores increase when (a) that makes up a larger percentage of the full string, and (b) the string remainders are more similar").

Looking at fuzz._token_set, we see that it returns

max(
    [
        ratio_func(sorted_sect, combined_1to2),
        ratio_func(sorted_sect, combined_2to1),
        ratio_func(combined_1to2, combined_2to1)
    ]
)
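The degenerate case can be reproduced with a minimal, self-contained sketch of that decomposition. This is an approximation, not fuzzywuzzy's actual code: it uses difflib.SequenceMatcher as a stand-in for ratio_func and naive whitespace tokenization in place of fuzzywuzzy's full_process preprocessing.

```python
from difflib import SequenceMatcher

def ratio_func(s1, s2):
    # Stand-in for fuzzywuzzy's ratio: 0-100 similarity score.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))

def token_set(s1, s2):
    # Sketch of fuzz._token_set's decomposition (simplified tokenization).
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())
    sorted_sect = " ".join(sorted(tokens1 & tokens2))
    combined_1to2 = (sorted_sect + " " + " ".join(sorted(tokens1 - tokens2))).strip()
    combined_2to1 = (sorted_sect + " " + " ".join(sorted(tokens2 - tokens1))).strip()
    return max(
        ratio_func(sorted_sect, combined_1to2),
        ratio_func(sorted_sect, combined_2to1),
        ratio_func(combined_1to2, combined_2to1),
    )

# When every token of s1 appears in s2, sorted_sect == combined_1to2,
# so the first comparison is between identical strings and scores 100:
token_set("Deep Learning", "Deep Learning with Python")  # → 100
```

Because the remainder of STRING1 is empty, combined_1to2 collapses to SORTED_INTERSECTION itself, and the max is taken over a comparison of a string with itself — hence the unconditional 100.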

It appears the assumption is that neither string remainder will ever be empty. Perhaps something like this is more appropriate:

max(
    [
        0 if sorted_sect == combined_1to2 else ratio_func(sorted_sect, combined_1to2),
        0 if sorted_sect == combined_2to1 else ratio_func(sorted_sect, combined_2to1),
        ratio_func(combined_1to2, combined_2to1)
    ]
)
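Dropping the proposed guard into the same self-contained sketch (again using difflib.SequenceMatcher and naive whitespace tokenization as stand-ins, not fuzzywuzzy's actual implementation) shows that the subset case no longer scores an automatic 100, while identical inputs still do — the third comparison, combined-vs-combined, carries that case:

```python
from difflib import SequenceMatcher

def ratio_func(s1, s2):
    # Stand-in for fuzzywuzzy's ratio: 0-100 similarity score.
    return int(round(100 * SequenceMatcher(None, s1, s2).ratio()))

def token_set_guarded(s1, s2):
    # Sketch of _token_set with the proposed guard applied (simplified tokenization).
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())
    sorted_sect = " ".join(sorted(tokens1 & tokens2))
    combined_1to2 = (sorted_sect + " " + " ".join(sorted(tokens1 - tokens2))).strip()
    combined_2to1 = (sorted_sect + " " + " ".join(sorted(tokens2 - tokens1))).strip()
    return max(
        # Skip comparisons that degenerate into a string vs. itself:
        0 if sorted_sect == combined_1to2 else ratio_func(sorted_sect, combined_1to2),
        0 if sorted_sect == combined_2to1 else ratio_func(sorted_sect, combined_2to1),
        ratio_func(combined_1to2, combined_2to1),
    )

# Subset case now falls through to the combined-vs-combined comparison,
# so the score reflects how much of the longer string is unmatched:
token_set_guarded("Deep Learning", "Deep Learning with Python")  # < 100
# Identical inputs still score 100 via the third comparison:
token_set_guarded("deep learning", "deep learning")  # → 100
```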