Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to get fuzzy index? #293

Open
mridu-enigma opened this issue Dec 13, 2020 · 3 comments
Open

How to get fuzzy index? #293

mridu-enigma opened this issue Dec 13, 2020 · 3 comments

Comments

@mridu-enigma
Copy link

I am using fuzzywuzzy to look for key-phrase like terms in corpuses.

FWIW, when there's a tie I'd like the tie to be broken by earliest match, so: is there a way to get the fuzzy-index of a match? I tried all functions in fuzz and process (using dir() to discover funcs like QWRatio, etc.)

For instance, I want some mechanism that ranks fuzz.partial_ratio('alex', 'alexa not') higher than fuzz.partial_ratio('alex', 'not alexa'), but for fuzzy matches (that's a simplistic example). How can I achieve this?

@acslater00
Copy link
Contributor

I haven't thought about it deeply, but I think it's an ambiguous problem definition. You probably need a minimum threshold that you would consider a match before being able to implement. If you did that, a naive solution would be to just iterate the string and return the first index with a fuzzy-threshold matching your search key. There might be other more optimized solutions that would scale to longer strings. Hope this helps.

@s-c-p
Copy link

s-c-p commented Dec 14, 2020

@mridu-enigma I don't understand it.

You want index of first token match or index where max similarity (density) is observed?

Say you are searching for o ordin in strings like co-ordinate, what output do you seek? 1 or 3?

@acslater00 I haven't grokked the entire code, perhaps you could shed some light of correct behavior.

@maxbachmann
Copy link

maxbachmann commented Dec 16, 2020

fuzz.partial_ratio searches for the best alignment of the shorter string to the longer string. It does not matter which way you insert them in as long as they do not have a similar length (for similar lengths the results can differ)
In the code this is achieved by

    if len(s1) <= len(s2):
        shorter = s1
        longer = s2
    else:
        shorter = s2
        longer = s1

Since you mention that you match a key against a phrase I assume that the key always has to be shorter than the phrase, so you might be able to implement this the following way:

def your_scorer(s1, s2):
  if len(s1) > len(s2):
    return 0
  return fuzz.partial_ratio(s1, s2)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants