process.extractOne does not match fuzz.ratio #288

Open
Pedro-Saad opened this issue Oct 31, 2020 · 1 comment

@Pedro-Saad

process.extractOne and fuzz.ratio give different results in this case (reproduced here with process.extract):

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

stringToMatch = 'Florinia-SP'
possibleResults = ['São Bernado do Campo-SP', 'Florínea-SP']

# Score each candidate individually with fuzz.ratio
print(fuzz.ratio(stringToMatch, possibleResults[0]))
print(fuzz.ratio(stringToMatch, possibleResults[1]))

# Score all candidates at once with the default settings of process.extract
print(process.extract(stringToMatch, possibleResults))

While the individual fuzz.ratio calls give the expected results (41 for the lower score and 82 for the higher score), process.extract gives 86 for both choices.

teste.zip

@maxbachmann

maxbachmann commented Nov 1, 2020

These are the docs of process.extract:

Select the best match in a list or dictionary of choices.

Find best matches in a list or dictionary of choices, return a
list of tuples containing the match and its score. If a dictionary
is used, also returns the key for each match.

Arguments:
    query: An object representing the thing we want to find.
    choices: An iterable or dictionary-like object containing choices
        to be matched against the query. Dictionary arguments of
        {key: value} pairs will attempt to match the query against
        each value.
    processor: Optional function of the form f(a) -> b, where a is the query or
        individual choice and b is the choice to be used in matching.

        This can be used to match against, say, the first element of
        a list:

        lambda x: x[0]

        Defaults to fuzzywuzzy.utils.full_process().
    scorer: Optional function for scoring matches between the query and
        an individual processed choice. This should be a function
        of the form f(query, choice) -> int.
        By default, fuzz.WRatio() is used and expects both query and
        choice to be strings.
    limit: Optional maximum for the number of elements returned. Defaults
        to 5.

Returns:
    List of tuples containing the match and its score.

    If a list is used for choices, then the result will be 2-tuples.
    If a dictionary is used, then the result will be 3-tuples containing
    the key for each match.

    For example, searching for 'bird' in the dictionary

    {'bard': 'train', 'dog': 'man'}

    may return

    [('train', 22, 'bard'), ('man', 0, 'dog')]

The docs state that the default scorer for process.extract is fuzz.WRatio, which gives different results than fuzz.ratio. If you want to use fuzz.ratio, you can specify it with the scorer argument. In addition, fuzz.ratio does not preprocess strings before matching them, while process.extract preprocesses them by default using fuzzywuzzy.utils.full_process(). So if you want results that match fuzz.ratio, this behaviour should also be disabled using the processor argument:

process.extract(stringToMatch, possibleResults, scorer=fuzz.ratio, processor=None)

Other process functions like process.extractOne use similar defaults.
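
To make the role of the two arguments more concrete, here is a minimal stdlib-only sketch of an extractOne-style lookup with a pluggable scorer and processor. It is not fuzzywuzzy's implementation: simple_ratio, simple_process and best_match are hypothetical names, simple_ratio is only a difflib-based stand-in for a plain ratio, and simple_process is a rough assumption about what default preprocessing does (lowercasing and replacing non-word characters with spaces). It does not reproduce fuzz.WRatio or the 86 reported above.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Plain similarity on a 0-100 scale (stdlib stand-in, not fuzzywuzzy's code)."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def simple_process(s):
    """Hypothetical preprocessing: non-word chars become spaces, then lowercase and strip."""
    return re.sub(r"\W", " ", s).lower().strip()


def best_match(query, choices, scorer=simple_ratio, processor=None):
    """Return (choice, score) for the highest-scoring choice."""
    if processor is None:
        # No preprocessing: compare the raw strings, like a direct scorer call
        processor = lambda s: s
    processed_query = processor(query)
    scored = [(c, scorer(processed_query, processor(c))) for c in choices]
    return max(scored, key=lambda pair: pair[1])


stringToMatch = 'Florinia-SP'
possibleResults = ['São Bernado do Campo-SP', 'Florínea-SP']

# Raw strings, plain ratio: the score equals a direct simple_ratio call
print(best_match(stringToMatch, possibleResults))

# Same scorer, but both sides are preprocessed first
print(best_match(stringToMatch, possibleResults, processor=simple_process))
```

Because the score depends on both the scorer and the processor, a lookup only agrees with a direct scorer call when both are set to match it. With fuzzywuzzy itself, the corresponding call should be process.extractOne(stringToMatch, possibleResults, scorer=fuzz.ratio, processor=None).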
