process.extractOne does not match fuzz.ratio #288

Pedro-Saad · 2020-10-31T02:12:21Z

process.extractOne and fuzz.ratio give different results in this case:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

stringToMatch = 'Florinia-SP'
possibleResults = ['São Bernado do Campo-SP', 'Florínea-SP']
print(fuzz.ratio(stringToMatch, possibleResults[0]))  # reported: 41
print(fuzz.ratio(stringToMatch, possibleResults[1]))  # reported: 82
print(process.extract(stringToMatch, possibleResults))  # reported: 86 for both choices

While the individual fuzz.ratio calls give the expected results (41 for the lower score and 82 for the higher one), process.extract gives 86 for both choices.

teste.zip


maxbachmann · 2020-11-01T12:12:18Z

These are the docs of process.extract:

    Select the best match in a list or dictionary of choices.

    Find best matches in a list or dictionary of choices, return a
    list of tuples containing the match and its score. If a dictionary
    is used, also returns the key for each match.

    Arguments:
        query: An object representing the thing we want to find.
        choices: An iterable or dictionary-like object containing choices
            to be matched against the query. Dictionary arguments of
            {key: value} pairs will attempt to match the query against
            each value.
        processor: Optional function of the form f(a) -> b, where a is the query or
            individual choice and b is the choice to be used in matching.

            This can be used to match against, say, the first element of
            a list:

            lambda x: x[0]

            Defaults to fuzzywuzzy.utils.full_process().
        scorer: Optional function for scoring matches between the query and
            an individual processed choice. This should be a function
            of the form f(query, choice) -> int.
            By default, fuzz.WRatio() is used and expects both query and
            choice to be strings.
        limit: Optional maximum for the number of elements returned. Defaults
            to 5.

    Returns:
        List of tuples containing the match and its score.

        If a list is used for choices, then the result will be 2-tuples.
        If a dictionary is used, then the result will be 3-tuples containing
        the key for each match.

        For example, searching for 'bird' in the dictionary

        {'bard': 'train', 'dog': 'man'}

        may return

        [('train', 22, 'bard'), ('man', 0, 'dog')]

The docs state that the default scorer for process.extract is fuzz.WRatio, which gives different results than fuzz.ratio. If you want to use fuzz.ratio, you can specify it with the scorer argument. Besides this, fuzz.ratio does not preprocess strings before matching them, while process.extract preprocesses them by default using fuzzywuzzy.utils.full_process(). So if you want results that match fuzz.ratio, disable this behaviour with the processor argument:

process.extract(stringToMatch, possibleResults, scorer=fuzz.ratio, processor=None)

Other process functions like process.extractOne use similar defaults.
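
To illustrate the preprocessing point, here is a minimal standard-library sketch, not fuzzywuzzy's own code. `simple_ratio` and `rough_preprocess` are hypothetical stand-ins: the first assumes a plain similarity ratio computed with difflib, the second assumes that preprocessing drops non-ASCII characters, replaces remaining non-alphanumeric characters with spaces, lowercases, and strips. Under those assumptions, preprocessing alone is enough to move the score for the second choice:

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Hypothetical stand-in for a plain ratio scorer (0-100), built on difflib."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def rough_preprocess(s: str) -> str:
    """Hypothetical approximation of default preprocessing (an assumption):
    drop non-ASCII characters, turn other non-alphanumerics into spaces,
    lowercase, and strip surrounding whitespace."""
    ascii_only = s.encode("ascii", "ignore").decode("ascii")
    spaced = re.sub(r"[^0-9A-Za-z]", " ", ascii_only)
    return spaced.lower().strip()


query = "Florinia-SP"
choice = "Florínea-SP"

# Raw strings: 'í' and 'e' both count as mismatches against 'i'.
print(simple_ratio(query, choice))  # 82

# Preprocessed strings: 'í' is dropped, so the choice becomes 'flornea sp'.
print(rough_preprocess(choice))  # flornea sp
print(simple_ratio(rough_preprocess(query), rough_preprocess(choice)))  # 86
```

The 86 here is consistent with the score reported above for 'Florínea-SP', but the sketch does not reproduce WRatio, which combines several scorers. It only shows why passing processor=None matters when comparing against fuzz.ratio.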

Azrael1 mentioned this issue May 13, 2021

process.extract broken in fuzzywuzzy=0.13 #314

Open
