Currently, idf is miscalculated in case of indexes with multiple chunks, e.g.:
drop table if exists t;
create table t(f text);
insert into t values(0,'abc'),(0,'def');
flush ramchunk t;
insert into t values(0,'abc'),(0,'def');
flush ramchunk t;
select *,weight(),packedfactors() from t where match ('abc') option ranker=expr('bm25');
idf=0.43067655
The idf calculated here is equal to 0.43067655 while the correct idf, calculated manually, should be 0.12596481.
We can get this expected value by optimizing the index:
drop table if exists t;
create table t(f text);
insert into t values(0,'abc'),(0,'def');
flush ramchunk t;
insert into t values(0,'abc'),(0,'def');
flush ramchunk t;
optimize index t option sync=1, cutoff=1;
select *,weight(),packedfactors() from t where match ('abc') option ranker=expr('bm25');
idf=0.12596481
The probable reason is that the number of documents containing the search term is taken from a single chunk only, rather than summed across all chunks.
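As a rough check of this hypothesis, the two idf values can be reproduced by hand. The sketch below assumes the normalized idf form log((N - n + 1) / n) / (2 * log(N + 1)), where N is the total number of documents and n is the number of documents containing the term. That formula is an assumption: it was chosen because it reproduces both numbers reported above, not taken from the ranker source. The helper name `bm25_idf` and the in-memory `chunks` list are hypothetical and only mirror the example data.

```python
import math


def bm25_idf(total_docs: int, docs_with_term: int) -> float:
    """Assumed normalized idf: log((N - n + 1) / n) / (2 * log(N + 1))."""
    ratio = (total_docs - docs_with_term + 1) / docs_with_term
    return math.log(ratio) / (2 * math.log(total_docs + 1))


# Two chunks, each holding the documents inserted before a flush.
chunks = [["abc", "def"], ["abc", "def"]]

total_docs = sum(len(chunk) for chunk in chunks)                   # N = 4
df_all_chunks = sum(doc == "abc" for c in chunks for doc in c)     # n = 2
df_single_chunk = sum(doc == "abc" for doc in chunks[0])           # n = 1

idf_expected = bm25_idf(total_docs, df_all_chunks)
idf_observed = bm25_idf(total_docs, df_single_chunk)

print(f"{idf_expected:.4f}")  # 0.1260, matches the value after optimize
print(f"{idf_observed:.4f}")  # 0.4307, matches the value with two chunks
```

Under this assumption, the miscalculated value corresponds to a document total taken over the whole index (N = 4) combined with a term count taken from one chunk (n = 1), which is consistent with the probable cause described above.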
Also, the global_idf option for CREATE TABLE doesn't appear to work with RT indexes. It isn't displayed in the index settings after the table has been created, and if we create a global idf file and set global_idf=1 when searching, all idf values become 0.
Discussion
➤ Stan commented:
We already have the local_df query option, which should work for the local indexes of a distributed index; however, it could also work for the disk chunks of an RT index.
Could you check that?
➤ Nick Sergeev commented:
I've just checked it with the previous example:
select *,weight(),packedfactors() from t where match ('abc') option ranker=expr('bm25'), local_df='0';
select *,weight(),packedfactors() from t where match ('abc') option ranker=expr('bm25'), local_df='1';
but this had no effect on the idf; it stayed the same.
➤ Sergey Nikolaev commented:
I think local_df=1 should be supported implicitly by RT indexes without even the need to specify it.
Checklist:
To be completed by the assignee. Check off tasks that have been completed or are not applicable.
Task estimated
Specification created, reviewed and approved
Implementation completed
Tests developed
Documentation updated
Documentation proofread
Changelog updated
OpenAPI YAML updated and issue created to rebuild clients
Proposal:
Moved from https://github.com/manticoresoftware/dev/issues/371