[Feature request] Prefix match support fuzziness #30720

KazaBen · 2024-03-22T15:01:42Z

Is your feature request related to a problem? Please describe.

When implementing an autocomplete application, most queries are incomplete words.
Misspellings occur in incomplete queries too. Let's look at an example:

Document = Fjallraven (notice double L)
Query = Fjalr (incomplete query, with misspell: only a single L)

An autocomplete application must still be able to match Fjallraven for the user query Fjalr.

Describe the solution you'd like

Add fuzziness support to prefix matching.

Describe alternatives you've considered

Combining prefix matching with whole-word fuzzy matching: term (prefix) contains OR term contains fuzzy.
However, this would not cover the common case mentioned above:
Document = Fjallraven (notice double L)
Query = Fjalr (incomplete query, with misspell: only a single L)

term (prefix) contains cannot match because the query has a misspelling and prefix matching has no fuzziness support.
term contains fuzzy cannot match because the query is an incomplete word.
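
To make the gap concrete, here is a minimal Python sketch (not Vespa code) that checks both alternatives against the example. The `levenshtein` helper is a textbook edit-distance implementation written for this illustration, and the strings are lowercased by hand for simplicity.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


document = "fjallraven"
query = "fjalr"

# Alternative 1: exact prefix match fails because of the missing "l".
exact_prefix_hit = document.startswith(query)

# Alternative 2: whole-word fuzzy match fails because the query is incomplete,
# so the distance to the full word is far above a typical limit of 1 or 2.
whole_word_distance = levenshtein(query, document)

print(exact_prefix_hit)     # False
print(whole_word_distance)  # 5
```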

Additional context

This exact functionality exists in Elasticsearch’s Completion Suggester Fuzzy Queries

KazaBen · 2024-04-15T10:16:46Z

@vekterli @geirst Hi 👋

We have a major project to implement autocomplete on Vespa for Vinted, and we depend on fuzziness :o
Are there any estimates when this could be implemented, or should we look for other solutions?

Thank you ❤️

tkaessmann · 2024-04-15T10:58:51Z

Wouldn't it be possible to add a simple component that first runs an internal fuzzy matching query and then runs a regular prefix query with the corrected result?
Or did I misunderstand something?

Greetings,
Tobias

vekterli · 2024-04-15T12:02:12Z

@KazaBen I've started working on this feature between a few other things. Implementation will be a two-step process:

1. Add support for prefix matching to our core Levenshtein algorithm implementations (Deterministic Finite Automata used for max edits {1, 2} and a generalized Levenshtein matrix fallback for max edits > 2). Work in progress.
2. Wire this new prefix matching mode into the query evaluation pipeline. Not yet started.
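
As a rough illustration of what prefix matching means for a Levenshtein matrix, the Python sketch below computes the smallest edit distance between the query and any prefix of a term. It is a simplified model written for this thread, not the Vespa implementation (which uses DFAs for max edits 1 and 2), and the name `prefix_edit_distance` is hypothetical.

```python
def prefix_edit_distance(query: str, term: str) -> int:
    """Smallest edit distance between `query` and any prefix of `term`."""
    # row[j] holds the distance between query[:j] and the term prefix seen so far.
    row = list(range(len(query) + 1))
    best = row[-1]  # distance to the empty prefix of `term`
    for i, tc in enumerate(term, start=1):
        curr = [i]
        for j, qc in enumerate(query, start=1):
            cost = 0 if tc == qc else 1
            curr.append(min(row[j] + 1,          # skip a term character
                            curr[j - 1] + 1,     # skip a query character
                            row[j - 1] + cost))  # substitute or keep
        row = curr
        # The last column is the distance from the whole query to term[:i].
        best = min(best, row[-1])
    return best


print(prefix_edit_distance("fjalr", "fjallraven"))  # 1
```

The only change from a regular Levenshtein matrix is that the result is the minimum over the last column for every term prefix, rather than the single bottom-right cell.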

I should be able to give an estimate once I've started work on part 2 (hopefully within a few days, assuming nothing else pops up) and have gotten a gist of the complexity involved.

vekterli · 2024-04-19T15:51:33Z

Update: part 2 has been completed and its PR is pending code review. Once it has been merged and a new Vespa version has been released containing the changes, fuzzy prefix matching can be used by adding a prefix:true annotation to your query term. Example YQL:

{maxEditDistance:1,prefix:true}fuzzy("Fjalr")
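
For context, the annotated term above goes inside a `contains` clause of a full YQL query. The Python helper below is a small sketch that assembles such a query string; the source name `product`, the field name `title`, and the helper `fuzzy_prefix_yql` itself are hypothetical and only serve the illustration.

```python
def fuzzy_prefix_yql(source: str, field: str, text: str,
                     max_edits: int = 1, prefix_length: int = 0) -> str:
    """Build a YQL query string using a fuzzy term with prefix:true."""
    # Escape backslashes and double quotes so the text stays a valid YQL string.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    parts = [f"maxEditDistance:{max_edits}", "prefix:true"]
    if prefix_length > 0:
        # Optional prefix locking, as described in this thread.
        parts.append(f"prefixLength:{prefix_length}")
    annotation = "{" + ",".join(parts) + "}"
    return (f"select * from {source} where {field} contains "
            f'({annotation}fuzzy("{escaped}"))')


print(fuzzy_prefix_yql("product", "title", "Fjalr"))
# select * from product where title contains ({maxEditDistance:1,prefix:true}fuzzy("Fjalr"))
```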

Assuming no other blockers, I'd expect it to be available as part of an Open-source release some time next week. I'll update this issue with a concrete version number once that happens.

A few notes:

Fuzzy prefix matching will often match far more terms than non-prefix matching, so take this into account when constructing queries, particularly when query strings are short. For instance, the query {maxEditDistance:2,prefix:true}fuzzy("Fj") will match all terms, since any possible prefix can be trivially transformed to Fj with 2 edits. This generalizes to every query where maxEditDistance >= len(query).
Related to the above, prefix locking (prefixLength:n) can be used alongside prefix matching to constrain the candidate set to terms that have a prefix exactly matching the first n characters of the query term. This also greatly speeds up dictionary scans.
Although this implements fuzzy prefix matching, one piece of the puzzle that is still missing is a ranking feature that exposes the actual edit distance between the term and the query. This is nothing new, as the same applies to non-prefix fuzzy queries. We have an existing ticket for adding this: Consider add ranking features for fuzzy query operator #24242.
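
The first two notes can be illustrated with a small Python model. This is a sketch under assumptions, not Vespa's implementation: `prefix_edit_distance` and `fuzzy_prefix_match` are hypothetical helpers, the term list is made up, and prefix locking is modelled as "the first n characters must match exactly, and the remainder is fuzzy prefix matched".

```python
def prefix_edit_distance(query: str, term: str) -> int:
    """Smallest edit distance between `query` and any prefix of `term`."""
    row = list(range(len(query) + 1))
    best = row[-1]
    for i, tc in enumerate(term, start=1):
        curr = [i]
        for j, qc in enumerate(query, start=1):
            cost = 0 if tc == qc else 1
            curr.append(min(row[j] + 1, curr[j - 1] + 1, row[j - 1] + cost))
        row = curr
        best = min(best, row[-1])
    return best


def fuzzy_prefix_match(query: str, term: str, max_edits: int,
                       prefix_length: int = 0) -> bool:
    """Simplified model of maxEditDistance + prefix:true + prefixLength."""
    locked, rest = query[:prefix_length], query[prefix_length:]
    if not term.startswith(locked):
        return False  # the locked prefix must match exactly
    return prefix_edit_distance(rest, term[prefix_length:]) <= max_edits


terms = ["fjallraven", "patagonia", "fila", "x"]  # made-up dictionary

# maxEditDistance >= len(query): every term matches.
unlocked = [t for t in terms if fuzzy_prefix_match("fj", t, max_edits=2)]
print(unlocked)  # ['fjallraven', 'patagonia', 'fila', 'x']

# Locking the first character narrows the candidates to terms starting with "f".
locked = [t for t in terms
          if fuzzy_prefix_match("fj", t, max_edits=2, prefix_length=1)]
print(locked)  # ['fjallraven', 'fila']
```

In this model the locked prefix also lets a dictionary scan skip every term that does not start with it, which is in line with the speed-up mentioned above.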

vekterli · 2024-04-30T13:19:33Z

@KazaBen version 8.337.85 is now on Docker hub, which contains support for fuzzy prefix matching. I haven't added it to the official documentation just yet, but my previous comment should have the most important bits of information (and caveats...!). Would be great if you could test it out and let me know if it solves your use case.

geirst assigned vekterli Apr 3, 2024

geirst added this to the soon milestone Apr 3, 2024

This was referenced Apr 16, 2024

Add prefix match support to Levenshtein algorithm implementations #30932

Merged

Wire fuzzy prefix matching support through the query stack #30976

Merged

Add system testing of fuzzy prefix matching vespa-engine/system-test#3806

Merged

vekterli mentioned this issue May 10, 2024

Document fuzzy prefix matching functionality vespa-engine/documentation#3190

Merged
