Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature request] Prefix match support fuzziness #30720

Open
KazaBen opened this issue Mar 22, 2024 · 5 comments
Open

[Feature request] Prefix match support fuzziness #30720

KazaBen opened this issue Mar 22, 2024 · 5 comments
Assignees
Milestone

Comments

@KazaBen
Copy link

KazaBen commented Mar 22, 2024

Is your feature request related to a problem? Please describe.

When implementing autocomplete application, most of the queries are incomplete words.
Sometimes misspells happen in incomplete queries too. Let's look at an example:

Document = Fjallraven (notice double L)
Query = Fjalr (incomplete query, with misspell: only a single L)

It is a must do for autocomplete application to be still able to match Fjallraven with user query Fjalr.

Describe the solution you'd like

When using Prefix match add support for Fuzziness.

Describe alternatives you've considered

Combining prefix matching with word fuzzy matching: term (prefix) contains OR term contains fuzzy
However, it would not cover the previously mentioned common case:
Document = Fjallraven (notice double L)
Query = Fjalr (incomplete query, with misspell: only a single L)

term (prefix) contains couldn't match as it query has a misspell and there is no fuzziness support
term contains fuzzy couldn't match as it is incomplete word

Additional context

This exact functionality exists in Elasticsearch’s Completion Suggester Fuzzy Queries

@geirst geirst added this to the soon milestone Apr 3, 2024
@KazaBen
Copy link
Author

KazaBen commented Apr 15, 2024

@vekterli @geirst Hi 👋

We have a major project to implement autocomplete on Vespa for Vinted, and we depend on fuzziness :o
Are there any estimates when this could be implemented, or should we look for other solutions?

Thank you ❤️

@tkaessmann
Copy link

Wouldn‘t it be possible to add a simple component that first run a internal fuzzy matching query and with the corrected result a usual prefix query?
Or did i understand something wrong?

Greetings,
Tobias

@vekterli
Copy link
Member

@KazaBen I've started working on this feature between a few other things. Implementation will be a two-step process:

  1. Add support for prefix matching to our core Levenshtein algorithm implementations (Deterministic Finite Automata used for max edits {1, 2} and a generalized Levenshtein matrix fallback for max edits > 2). Work in progress.
  2. Wire this new prefix matching mode into the query evaluation pipeline. Not yet started.

I should be able to give an estimate once I've started work on part 2 (hopefully within a few days, assuming nothing else pops up) and have gotten a gist of the complexity involved.

@vekterli
Copy link
Member

Update: part 2 has been completed and its PR is pending code review. Once it has been merged and a new Vespa version has been released containing the changes, fuzzy prefix matching can be used by adding a prefix:true annotation to your query term. Example YQL:

{maxEditDistance:1,prefix:true}fuzzy("Fjalr")

Assuming no other blockers, I'd expect it to be available as part of an Open-source release some time next week. I'll update this issue with a concrete version number once that happens.

A few notes:

  • Fuzzy prefix matching will often end up matching a lot more terms than non-prefix matching, so this should be taken into account when constructing queries—in particular when query strings are short. For instance, the query {maxEditDistance:2,prefix:true}fuzzy("Fj") will match all terms since any possible prefix can be trivially transformed to Fj with 2 edits. This generalizes to be the case for every query where maxEditDistance >= len(query).
  • Related to the above, prefix locking (prefixLength:n) can be used alongside prefix matching to constrain the candidate set to terms that have prefix that exactly matches n characters of the query term. This also greatly speeds up dictionary scans.
  • Although this implements fuzzy prefix matching, one piece of the puzzle that is still missing is a ranking feature that exposes the actual edit distance between the term and the query. This is nothing new, as the same applies to non-prefix fuzzy queries. We have an existing ticket Consider add ranking features for fuzzy query operator #24242 for adding this.

@vekterli
Copy link
Member

@KazaBen version 8.337.85 is now on Docker hub, which contains support for fuzzy prefix matching. I haven't added it to the official documentation just yet, but my previous comment should have the most important bits of information (and caveats...!). Would be great if you could test it out and let me know if it solves your use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants