Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

使用smlar插件, 相似度计算公式采用tfidf, 效率和你列出的差别特别大 #33

Open
miaojianxin opened this issue Jul 9, 2018 · 2 comments

Comments

@miaojianxin
Copy link

我参照网页上你的步骤1步步做的,
数据是从https://www.mockaroo.com/c5418bd0,通过命令curl "https://api.mockaroo.com/api/c5418bd0?count=100000&key=b21830e0" > "tfidf.csv"下载的,一次只能下载5000,所以我下载了20次,得到10万数据,

然后我的执行计划:

postgres=# explain (analyze,verbose,timing,costs,buffers) select
postgres-# *,
postgres-# smlar( tokenize(body), '{Etiam,pretium,iaculis}'::text[] )
postgres-# from
postgres-# documents
postgres-# where
postgres-# tokenize(body) %% '{Etiam,pretium,iaculis}'::text[] -- where TFIDF similarity >= smlar.threshold
postgres-# order by
postgres-# smlar( tokenize(body), '{Etiam,pretium,iaculis}'::text[] ) desc
postgres-# limit 10;
QUERY PLAN



Limit (cost=368.05..368.07 rows=10 width=258) (actual time=4957.830..4957.832 rows=10 loops=1)
Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{Etiam,pretium
,iaculis}'::text[]))
Buffers: shared hit=4870
-> Sort (cost=368.05..368.30 rows=100 width=258) (actual time=4957.829..4957.829 rows=10 loops=1)
Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{Etiam,p
retium,iaculis}'::text[]))
Sort Key: (smlar(regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text), '{Etiam,pretium,
iaculis}'::text[])) DESC
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=4870
-> Bitmap Heap Scan on public.documents (cost=17.05..365.89 rows=100 width=258) (actual time=111.75
7..4957.313 rows=666 loops=1)
Output: document_id, body, smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{Et
iam,pretium,iaculis}'::text[])
Recheck Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) %% '{Etiam,pr
etium,iaculis}'::text[])
Rows Removed by Index Recheck: 17737
Heap Blocks: exact=3525
Buffers: shared hit=4870
-> Bitmap Index Scan on documents_tokenize_idx (cost=0.00..17.03 rows=100 width=0) (actual ti
me=95.413..95.413 rows=18403 loops=1)
Index Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) %% '{Etia
m,pretium,iaculis}'::text[])
Buffers: shared hit=1345
Planning time: 0.456 ms
Execution time: 4957.952 ms
(19 rows)

而你的类似的查询,才7毫秒左右, 我shared hit的特别多, 我自己中文的文档,10万文档,然类似查询的话,时间也在几千毫秒左右,这是哪边配置不对吗,
set smlar.type = tfidf;
set smlar.stattable = 'documents_body_stats'
set smlar.persistent_cache=true;
set smlar.threshold =0.4;
set smlar.tf_method = 'n';

@miaojianxin
Copy link
Author

@miaojianxin
Copy link
Author

set enable_bitmap_indexscan=false;
只走index scan,查询计划

                                                              QUERY PLAN                         


Limit (cost=416.94..416.97 rows=10 width=258) (actual time=3561.033..3561.033 rows=0 loops=1)
Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringilla,mal
esuada,euismod}'::text[]))
Buffers: shared hit=13022
-> Sort (cost=416.94..417.19 rows=100 width=258) (actual time=3561.032..3561.032 rows=0 loops=1)
Output: document_id, body, (smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fringil
la,malesuada,euismod}'::text[]))
Sort Key: (smlar(regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text), '{fringilla,male
suada,euismod}'::text[])) DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=13022
-> Index Scan using documents_tokenize_idx on public.documents (cost=0.28..414.78 rows=100 width=25
8) (actual time=3561.019..3561.019 rows=0 loops=1)
Output: document_id, body, smlar(regexp_split_to_array(lower(body), '[^[:alnum:]]'::text), '{fr
ingilla,malesuada,euismod}'::text[])
Index Cond: (regexp_split_to_array(lower(documents.body), '[^[:alnum:]]'::text) %% '{fringilla,
malesuada,euismod}'::text[])
Rows Removed by Index Recheck: 12299
Buffers: shared hit=13022
Planning time: 0.247 ms
Execution time: 3561.105 ms

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant