Update bm25 algorithm with term matching and field normalization #190

Open
wants to merge 1 commit into master

Conversation

@agologan commented Jul 6, 2019

This fixes the way the BM25 implementation works and introduces field normalization, both inspired by Elasticsearch.
This breaks fuzzy matching, but if this looks like a change in the right direction I'm open to completing it.

@nticaric (Contributor) commented Jul 7, 2019

First of all, thanks for this PR. It clearly took a lot of effort to create. Let's first focus a little on the BM25 algorithm. You changed some of the weights and reintroduced the document length variable. Earlier versions already had the document length variable, but it turned out better to ignore it. In theory it's good to have, but practice showed something different. There was a blog post from Sphinx Search that covered why ignoring it is a good choice. Maybe you could point to some Elasticsearch resources so we can take a further look at it.

@agologan (Author) commented Jul 7, 2019

I wrote these changes because we have an internal project that uses tntsearch, and our result ranking was oddly off in some cases.

I wasn't sure why, but after reading the implementation I realized the BM25 ranking was different from what I expected.
I read a few papers and checked whether any current projects still use it and whether they had made any changes to it.
After reading https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables I implemented a few changes, which resulted in quite noticeable improvements.
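
For context, the scoring function that post walks through is the standard BM25 formula for a single query term $q_i$ and document $D$:

$$
\mathrm{score}(D, q_i) = \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
$$

Here $f(q_i, D)$ is the term frequency in the document, $\mathrm{IDF}(q_i)$ the inverse document frequency of the term, $|D|$ the document length, $\mathrm{avgdl}$ the average document length in the index, and $k_1$, $b$ the two tunable weights. The $b \cdot |D| / \mathrm{avgdl}$ part is the document length normalization being discussed; setting $b = 0$ drops it, which is effectively what ignoring document length amounts to.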

I have not read the Sphinx blog post you're referring to, but I can imagine why document length can be ignored in some cases. In our case, keeping it produces better results.

To be honest, document length improved the result ranking a bit, but what improved it even more was field normalization. This really depends on the data set you're using, but having matches in fields like category or title, which are usually much shorter than the document body, rank higher is very important to us.

As for the weights, I changed them because the defaults Elasticsearch uses seem more reasonable for general-purpose usage, based on the documents I read.
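
To make the field normalization part concrete, here's a rough sketch of per-field scoring. This is illustration only, not the code in this PR; the field names and statistics below are made up. For reference, Elasticsearch's defaults for the two weights are k1 = 1.2 and b = 0.75.

```php
<?php
// Illustration only, not TNTSearch's implementation: score each field with its
// own length normalization, then combine the per-field scores with boosts.

function bm25FieldScore(int $tf, float $idf, int $fieldLen, float $avgFieldLen,
                        float $k1 = 1.2, float $b = 0.75): float
{
    // Length normalization uses the field's length relative to the average
    // length of that field across the index, not the whole document's length.
    $norm = 1 - $b + $b * ($fieldLen / $avgFieldLen);

    return $idf * ($tf * ($k1 + 1)) / ($tf + $k1 * $norm);
}

// Hypothetical per-field statistics for one document and one query term.
$perFieldStats = [
    'title' => ['tf' => 1, 'idf' => 2.1, 'len' => 6,   'avgLen' => 8.0],
    'body'  => ['tf' => 3, 'idf' => 2.1, 'len' => 420, 'avgLen' => 300.0],
];

// Short fields like title or category get a higher boost, so a match there
// outranks the same match buried in a long body field.
$boosts = ['title' => 2.0, 'body' => 1.0];

$score = 0.0;
foreach ($perFieldStats as $field => $stats) {
    $score += $boosts[$field] * bm25FieldScore(
        $stats['tf'], $stats['idf'], $stats['len'], $stats['avgLen']
    );
}
```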

@nticaric (Contributor) commented Jul 7, 2019

Do you think we could come up with a compromise by creating another class that extends the current one and contains those changes? That way the end user would be able to choose and see what works best for them.

@agologan (Author) commented Jul 7, 2019

I think we can leave this as is, and if anybody needs to change the default behavior they can be pointed to this PR.
This PR exists because I found these changes to be useful, but they may not necessarily be the best defaults for this project.
For example, I found it strange that boolean search doesn't do ranking by default, or that the ranked results don't provide an option to match all terms, but that's just me and my use case.

These changes also introduce a fair bit of complexity and a performance penalty for somebody looking to just search a list of cities or other small objects.
For engineers looking for a simple document search solution these changes can be beneficial, but if they understand what I'm proposing here, they may be better off using a full-fledged solution like Solr, Sphinx, or Elasticsearch.

@nticaric (Contributor) commented Jul 7, 2019

Your changes are good and could be useful to a lot of people. I'll try to find a way to include them in the package without introducing breaking changes.
My current thinking is to write a class that extends the TNTSearch class and includes those changes.

I can't think of a compact name for the class, but something like TNTSearchWithFieldWeighting. I'm open to suggestions.

The indexer could stay as is, with only an option that would run the two functions saveHitList and saveDocInfo. This wouldn't affect performance for the default behavior.

And if someone uses the TNTSearchWithFieldWeighting class on an index without the required tables and data, we could simply throw an exception with a message telling them to reindex with the option turned on.
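
Roughly something like this, purely as a sketch; the override point, the table name, and the helper method here are guesses for illustration, not the package's actual internals.

```php
<?php
// Rough sketch only; method and table names are guesses, not the package's API.

use TeamTNT\TNTSearch\TNTSearch;

class TNTSearchWithFieldWeighting extends TNTSearch
{
    public function selectIndex($indexName)
    {
        parent::selectIndex($indexName);

        // Fail fast if the index was built without the extra data that
        // saveHitList()/saveDocInfo() would have written during indexing.
        if (!$this->hasFieldWeightingData()) {
            throw new \RuntimeException(
                "Index '$indexName' has no field weighting data. " .
                "Reindex with the field weighting option enabled."
            );
        }
    }

    protected function hasFieldWeightingData(): bool
    {
        // Assumes $this->index is the PDO handle selectIndex() opens and that
        // the indexer option created a 'hitlist' table; both are assumptions.
        $stmt = $this->index->query(
            "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'hitlist'"
        );

        return $stmt !== false && $stmt->fetch() !== false;
    }
}
```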

@agologan (Author) commented Jul 7, 2019

There's another question: how do others use tntsearch?
Our engineer used it because he was familiar with Laravel, and this solution with the Scout driver looked the easiest to integrate for the project he was starting.
If used directly, extending via a subclass should make it easy enough to use, but if used through Scout it probably needs a flag in the config.

Also, a better way might be to use composition and/or feature flagging.
Instead of subclassing, adding a flag in the config and then branching inside the relevant functions would keep all the code in the same context, encouraging reuse and refactoring.

This PR introduces three major changes, and if you agree we could create three config flags that change both the indexer and the search algorithm as needed (a rough config sketch follows the list):

  • document length weights
  • field length weights
  • match all terms for ranked results
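
Something like this in the config, with all three flags off by default so current behavior is unchanged. The flag names here are made up just to illustrate the idea; nothing like this exists in the package yet.

```php
<?php
// Hypothetical config sketch; none of these flags exist in the package today.

return [
    'tntsearch' => [
        // Proposed switches, all off by default so existing indexes and
        // result rankings are unaffected:
        'documentLengthNormalization' => false, // use |D| / avgdl in the BM25 score
        'fieldLengthNormalization'    => false, // score title/category/body fields separately
        'matchAllTermsForRanking'     => false, // ranked results must contain every query term
    ],
];
```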

detain added a commit to detain/tntsearch that referenced this pull request Dec 12, 2019