process.dedupe() using sets instead of lists & dict.keys() #255

Open · wants to merge 2 commits into master
fuzzywuzzy/process.py: 20 changes (5 additions, 15 deletions)
@@ -228,11 +228,9 @@ def dedupe(contains_dupes, threshold=70, scorer=fuzz.token_set_ratio):
    score greater than a user defined threshold. Then, it looks for the longest item in the duplicate list
    since we assume this item contains the most entity information and returns that. It breaks string
    length ties on an alphabetical sort.

    Note: as the threshold DECREASES the number of duplicates that are found INCREASES. This means that the
    returned deduplicated list will likely be shorter. Raise the threshold for fuzzy_dedupe to be less
    sensitive.

    Args:
        contains_dupes: A list of strings that we would like to dedupe.
        threshold: the numerical value (0,100) point at which we expect to find duplicates.
@@ -242,44 +240,36 @@ def dedupe(contains_dupes, threshold=70, scorer=fuzz.token_set_ratio):
            of the form f(query, choice) -> int.
            By default, fuzz.token_set_ratio() is used and expects both query and
            choice to be strings.

    Returns:
        A deduplicated list. For example:

            In: contains_dupes = ['Frodo Baggin', 'Frodo Baggins', 'F. Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins']
            In: fuzzy_dedupe(contains_dupes)
            Out: ['Frodo Baggins', 'Samwise G.', 'Bilbo Baggins', 'Gandalf']
    """

-    extractor = []
+    extractor = set()

    # iterate over items in *contains_dupes*
    for item in contains_dupes:
        # return all duplicate matches found
        matches = extract(item, contains_dupes, limit=None, scorer=scorer)
        # filter matches based on the threshold
        filtered = [x for x in matches if x[1] > threshold]
-        # if there is only 1 item in *filtered*, no duplicates were found so append to *extracted*
+        # if there is only 1 item in *filtered*, no duplicates were found so add to *extractor*
        if len(filtered) == 1:
-            extractor.append(filtered[0][0])
+            extractor.add(filtered[0][0])

        else:
            # alpha sort
            filtered = sorted(filtered, key=lambda x: x[0])
            # length sort
            filter_sort = sorted(filtered, key=lambda x: len(x[0]), reverse=True)
            # take first item as our 'canonical example'
-            extractor.append(filter_sort[0][0])

-    # uniquify *extractor* list
-    keys = {}
-    for e in extractor:
-        keys[e] = 1
-    extractor = keys.keys()
+            extractor.add(filter_sort[0][0])

    # check that extractor differs from contains_dupes (i.e. duplicates were found)
    # if not, then return the original list
    if len(extractor) == len(contains_dupes):
        return contains_dupes
    else:
-        return extractor
+        return list(extractor)
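
A quick note for reviewers, with a runnable sketch of the patched logic. This is a minimal stand-in, not the library code: similarity() below substitutes difflib.SequenceMatcher for fuzz.token_set_ratio, and dedupe_sketch is a hypothetical name; the real dedupe() scores candidates via process.extract().

from difflib import SequenceMatcher

def similarity(query, choice):
    # stand-in scorer: integer 0-100, mirroring fuzz.token_set_ratio's range
    return int(SequenceMatcher(None, query, choice).ratio() * 100)

def dedupe_sketch(contains_dupes, threshold=70, scorer=similarity):
    extractor = set()
    for item in contains_dupes:
        # score *item* against every entry (the real code calls process.extract here)
        matches = [(choice, scorer(item, choice)) for choice in contains_dupes]
        filtered = [m for m in matches if m[1] > threshold]
        if len(filtered) == 1:
            # only the item itself cleared the threshold: no duplicates found
            extractor.add(filtered[0][0])
        else:
            # alpha sort first, so the stable length sort breaks ties alphabetically
            filtered = sorted(filtered, key=lambda x: x[0])
            filter_sort = sorted(filtered, key=lambda x: len(x[0]), reverse=True)
            # set membership does the uniquifying the old dict.keys() pass did
            extractor.add(filter_sort[0][0])
    if len(extractor) == len(contains_dupes):
        return contains_dupes  # nothing collapsed: hand back the original list
    return list(extractor)     # set iteration order is arbitrary

names = ['Frodo Baggin', 'Frodo Baggins', 'F. Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins']
print(dedupe_sketch(names))  # membership depends on the stand-in scorer; order is arbitrary

One behavioral nuance: the dict-based uniquify this PR removes preserves first-seen order on CPython 3.6+ (guaranteed from 3.7), while a set does not, so list(extractor) may come back in any order. On Python 2, dict.keys() was equally unordered, and under Python 3 the old code could return a dict_keys view instead of a list, which the new return list(extractor) avoids. A small demonstration, assuming CPython 3.7+:

items = ['b', 'a', 'b']
print(list({e: 1 for e in items}.keys()))  # ['b', 'a'] -- insertion order kept
print(list(set(items)))                    # same elements, arbitrary order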