String fuzzy-matching From R to Python #317

Open
Magic-fan opened this issue Jul 6, 2021 · 1 comment

@Magic-fan

I am trying to do fuzzy string matching in both R and Python, using two packages:

  • stringdist from R
  • fuzzywuzzy from Python

When I run amatch("PARI", c("HELLO", "WORLD"), maxDist = 2) in R, I get NA, which makes sense. But when I try what I think is the equivalent in Python, process.extract("PARI", ["HELLO", "WORLD"], limit = 2), I get [('world', 22), ('HELLO', 0)].

How can I get the same result as in R?

Thanks in advance

@maxbachmann

maxbachmann commented Jul 7, 2021

There are a couple of important differences between the two packages:

  1. In FuzzyWuzzy, limit specifies how many elements extract returns. extract has no argument equivalent to maxDist. For that you would have to use extractBests with its score_cutoff argument.

  2. stringdist appears to use an edit distance, while FuzzyWuzzy only provides normalized string metrics (0-100, where 100 means identical). So instead of a maximum distance you would pass a minimum similarity, e.g. score_cutoff=90. You can specify the string metric using the scorer argument.

  3. FuzzyWuzzy preprocesses strings by default in the extract function (it lowercases them and replaces non-alphanumeric characters with whitespace). You can disable this using processor=None.
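
To make these three points concrete without depending on either library, here is a minimal sketch in plain Python. The helper names (levenshtein, normalized_similarity, simple_preprocess) are made up for this example. The 0-100 formula is just one common way to normalize an edit distance and is not necessarily what FuzzyWuzzy computes, and the preprocessing is an ASCII-only approximation of the default behaviour described above.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete, substitute, each costing 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Illustrative 0-100 score (assumption: not FuzzyWuzzy's exact formula)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (longest - levenshtein(a, b)) / longest


def simple_preprocess(s: str) -> str:
    """Hypothetical stand-in for default preprocessing (ASCII only):
    lowercase, replace each non-alphanumeric character with a space, strip."""
    return re.sub(r"[^0-9a-z]", " ", s.lower()).strip()


choices = ["HELLO", "WORLD"]

# Distance cutoff (like maxDist = 2): keep choices at most 2 edits away.
near_pari = [c for c in choices if levenshtein("PARI", c) <= 2]
near_hell = [c for c in choices if levenshtein("HELL", c) <= 2]

# Similarity cutoff (like score_cutoff=90): keep choices scoring at least 90.
similar_hell = [c for c in choices if normalized_similarity("HELL", c) >= 90]

print(near_pari)     # []
print(near_hell)     # ['HELLO']
print(similar_hell)  # [] because "HELL" vs "HELLO" scores 80.0 here
```

Note how the two cutoffs point in opposite directions: a distance cutoff is an upper bound, while a similarity cutoff is a lower bound, so maxDist = 2 has no fixed equivalent on the 0-100 scale. It depends on the string lengths.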

As an alternative, you could use RapidFuzz, which supports edit distances and a score_cutoff parameter in the extract function (with a distance scorer, score_cutoff acts as a maximum distance):

>>> from rapidfuzz import process, string_metric
>>> process.extract("PARI", ["HELLO", "WORLD"], processor=None, scorer=string_metric.levenshtein, score_cutoff=2)
[]
>>> process.extract("HELL", ["HELLO", "WORLD"], processor=None, scorer=string_metric.levenshtein, score_cutoff=2)
[('HELLO', 1, 0)]
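
For comparison, here is a rough library-free sketch of what an amatch-style lookup does: return the position of the closest entry within a maximum distance, or nothing at all. The function names osa_distance and amatch_like are hypothetical. It uses the optimal string alignment distance, which as far as I know is stringdist's default method, and it returns a 0-based index or None where R would return a 1-based index or NA.

```python
from typing import Optional, Sequence


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: Levenshtein plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def amatch_like(query: str, table: Sequence[str], max_dist: int) -> Optional[int]:
    """Return the 0-based index of the closest entry within max_dist, else None."""
    best_index: Optional[int] = None
    best_dist: Optional[int] = None
    for index, candidate in enumerate(table):
        dist = osa_distance(query, candidate)
        if dist <= max_dist and (best_dist is None or dist < best_dist):
            best_index, best_dist = index, dist
    return best_index


print(amatch_like("PARI", ["HELLO", "WORLD"], max_dist=2))  # None
print(amatch_like("HELL", ["HELLO", "WORLD"], max_dist=2))  # 0
```

The RapidFuzz calls above reflect the API at the time of writing. I believe later releases expose the same scorer as rapidfuzz.distance.Levenshtein.distance instead of string_metric.levenshtein, so check the documentation for the version you have installed.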
