maybe detect acronyms? #47

Open
leeoniya opened this issue Oct 2, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@leeoniya
Owner

leeoniya commented Oct 2, 2023

an option to detect acronyms in the needle might be interesting, but also tricky

Teenage Mutant Ninja Turtles, commonly abbreviated as TMNT, is an American media franchise created by the comic book artists Kevin Eastman and Peter Laird.

searching for TMNT would modify the term to t m n t and maybe set interLft: 2. not sure this can actually work, e.g. NASA and NBA are never actually spelled out. plus interLft: 2 affects the whole needle, so it would have unwanted side effects. it's always possible to do better discarding for acronyms after the initial filter, or maybe not...
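
A minimal sketch of the needle rewrite described above, in plain JavaScript. The helper name `toAcronymNeedle` and the "all caps means acronym" heuristic are assumptions made for illustration; neither is part of uFuzzy.

```javascript
// Hypothetical helper, not part of uFuzzy.
// Assumption: a needle typed entirely in capitals (2+ letters) is meant as an acronym.
function toAcronymNeedle(needle) {
  const trimmed = needle.trim();
  if (!/^[A-Z]{2,}$/.test(trimmed)) return null; // not treated as an acronym
  // "TMNT" -> "t m n t": one single-letter term per initial
  return trimmed.toLowerCase().split("").join(" ");
}

console.log(toAcronymNeedle("TMNT")); // "t m n t"
console.log(toAcronymNeedle("fast")); // null
```

The rewritten needle would then go to a search configured with `interLft: 2`, kept separate from the normal search so the option does not affect ordinary needles.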

@leeoniya leeoniya added the enhancement New feature or request label Oct 2, 2023
@leeoniya leeoniya changed the title from "detect acronyms?" to "maybe detect acronyms?" Oct 2, 2023
@theBowja

theBowja commented Feb 9, 2024

farzher/fuzzysort handles acronyms pretty well right?

image

I think after modifying to t m n t you could add a sort function that prioritizes prefix matches?

@leeoniya
Owner Author

leeoniya commented Feb 9, 2024

the main problem here is knowing if you have an acronym or not. once we know that, we can easily set the correct options:

https://leeoniya.github.io/uFuzzy/demos/compare.html?libs=uFuzzy&search=t%20m%20n%20t&interLft=2

how do we know that lowercase tmnt is an acronym but fast is not? you would not want "fast" to match "for a strange test"
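
To make the ambiguity concrete, here is a small sketch. The `initials` helper is hypothetical and not a uFuzzy function. Both needles are the word initials of some haystack entry, so the strings alone do not say which one was meant as an acronym.

```javascript
// Hypothetical helper: first letter of each whitespace-separated word, lowercased.
function initials(text) {
  return text
    .trim()
    .split(/\s+/)
    .filter(Boolean)
    .map((word) => word[0].toLowerCase())
    .join("");
}

console.log(initials("Teenage Mutant Ninja Turtles")); // "tmnt"
console.log(initials("for a strange test"));           // "fast"
```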

@theBowja

theBowja commented Feb 9, 2024

I see. I think it probably goes more into search relevance than fuzzy search can handle. But I believe there are some rules we can use to get close to good results.

Exact full term matches:

  • Self-explanatory.
  • Example: for haystack ["fast", "for a strange test"], searching fast should return "fast" first because it is an exact match.

Exact acronym matches:

  • I think this is what we're really interested in.
  • Generally we search for acronyms by providing the entire acronym as needle.
  • Example: for haystack ["faster", "for a strange test"], searching fast should return "for a strange test" first because it is an exact match against the first character of each "word" within "for a strange test".

Partial acronym matches:

  • There are special considerations where things get tough.
  • Easy example: we don't match tmn against "Teenage Mutant Ninja Turtles" because it is a typo.
  • Hard example: for haystack ["Code Vein", "Call of Duty: Black Ops"], what should be the best result of searching cod? Based on previous rules, it should be "Code Vein". It is reasonable to expect that the user will modify the search term to codbo if they actually wanted to get "Call of Duty: Black Ops".
  • Another example: what about "Teenage Mutant Ninja Turtles: Mutants in Manhattan"? Let's say that the popular acronym for it is tmnt mm. In this case, it'll no longer be an exact acronym match. But an observation is that the longer the acronym, the lower the chance that it forms an actual word that collides with our desired expanded acronym. So it should still be ok following the previous rules. Probably.
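
A rough sketch of these rules as a tiered ranking, in plain JavaScript. The names `wordInitials`, `tier`, and `rank` are hypothetical, the word-prefix tier is an assumed stand-in for "everything else", and none of this is uFuzzy behavior.

```javascript
// Hypothetical helper: first letter of each whitespace-separated word.
function wordInitials(text) {
  return text.split(/\s+/).filter(Boolean).map((w) => w[0]).join("");
}

// Lower tier = better match. Tier order follows the rules listed above.
function tier(needle, item) {
  const n = needle.trim().toLowerCase();
  const s = item.trim().toLowerCase();
  if (s === n) return 0;                // exact full term match
  if (wordInitials(s) === n) return 1;  // exact acronym match
  if (s.split(/\s+/).some((w) => w.startsWith(n))) return 2; // word prefix (assumed fallback)
  return Infinity;                      // no match under these rules
}

function rank(needle, haystack) {
  return haystack
    .map((item) => ({ item, tier: tier(needle, item) }))
    .filter((r) => r.tier !== Infinity)
    .sort((a, b) => a.tier - b.tier)
    .map((r) => r.item);
}

console.log(rank("fast", ["fast", "for a strange test"]));
// [ 'fast', 'for a strange test' ]
console.log(rank("fast", ["faster", "for a strange test"]));
// [ 'for a strange test', 'faster' ]
console.log(rank("cod", ["Code Vein", "Call of Duty: Black Ops"]));
// [ 'Code Vein' ]
console.log(rank("codbo", ["Code Vein", "Call of Duty: Black Ops"]));
// [ 'Call of Duty: Black Ops' ]
```

Partial acronyms such as tmn are dropped entirely in this sketch, and the tmnt mm case is not handled.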

@leeoniya
Owner Author

leeoniya commented Feb 9, 2024

i'm not sure this belongs in the core, honestly. you can simply pre-process the needle, create a few different needles + uFuzzy options, run several independent searches, then combine and sort the results as you see fit. it will be slightly slower, but that's an okay trade-off for keeping the internals relatively unopinionated and straightforward.

  • Example: for haystack ["faster", "for a strange test"], searching fast should return "for a strange test" first because it is an exact match against the first character of each "word" within "for a strange test".

that's just your preference. many others (myself included) would expect the prefix match first. it's not black and white, unfort.
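
A sketch of the "several independent searches, then combine" approach, kept library-agnostic. `combineSearches` and the two stand-in passes are hypothetical. With uFuzzy, each pass could instead wrap a separately configured instance; that wiring is left out here. The pass order encodes the ranking preference, which is why it can live outside the core.

```javascript
// Hypothetical combiner. Each pass has a needle and a search function
// (haystack, needle) => array of matching haystack indices, best first.
// Earlier passes win; an index is only reported once.
function combineSearches(haystack, passes) {
  const seen = new Set();
  const merged = [];
  for (const { needle, search } of passes) {
    for (const idx of search(haystack, needle)) {
      if (!seen.has(idx)) {
        seen.add(idx);
        merged.push(idx);
      }
    }
  }
  return merged.map((idx) => haystack[idx]);
}

// Stand-in passes for demonstration only.
const prefixPass = (haystack, needle) =>
  haystack.flatMap((s, i) => (s.toLowerCase().startsWith(needle) ? [i] : []));
const initialsPass = (haystack, needle) =>
  haystack.flatMap((s, i) =>
    s.toLowerCase().split(/\s+/).map((w) => w[0]).join("") === needle ? [i] : []
  );

const haystack = ["faster", "for a strange test"];
console.log(
  combineSearches(haystack, [
    { needle: "fast", search: prefixPass },   // prefix matches first
    { needle: "fast", search: initialsPass }, // then acronym matches
  ])
); // [ 'faster', 'for a strange test' ]
```

Swapping the two passes yields the acronym-first order instead.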
