gav-ctf/Solr-ctf-query-parser

Solr-ctf-query-parser

Reranks and expands Solr query returns using filtered clickstream data, providing a simple, flexible collaborative filtering framework. Clickthroughfilter-x.x.x.jar runs the filter as a query parser plugin for Solr/Lucene.

The filter samples click data (from a separate clicks core) for items returned by a query and uses it to:

1. reorganise - boost items in proportion to their click traffic (primary items)
2. extend - inject new items that do not match the query but are connected to the query's primary items by click traffic (secondary items)
3. customise - boost & inject selected item types, and boost & inject using selected components of click traffic

These elements can be used separately or in combination to improve search returns, support current awareness, identify related material, make personal recommendations etc. (see example CTF queries below).
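As a rough illustration of the "extend" step, here is a toy Python sketch of how secondary items could be derived from click pairs. It is not the plugin's code: it assumes that "connected by click traffic" means "clicked to from a primary item", and the function name and sample docIDs are hypothetical.

```python
from collections import Counter


def secondary_items(primary_ids, clicks):
    """Return docIDs reached by clicks from primary items, most clicked first.

    `clicks` is an iterable of (fromDocID, toDocID) pairs. Treating a click
    from a primary item as the only kind of connection is an assumption made
    for this sketch.
    """
    primary = set(primary_ids)
    counts = Counter(
        to_id
        for from_id, to_id in clicks
        if from_id in primary and to_id not in primary
    )
    # Descending click count, then docID so that ties have a stable order.
    return sorted(counts, key=lambda doc_id: (-counts[doc_id], doc_id))


clicks = [
    ("melon-1", "fruit-salad"),
    ("melon-1", "fruit-salad"),
    ("melon-2", "smoothie"),
    ("melon-1", "melon-2"),  # primary to primary: not a secondary item
    ("other", "pie"),        # does not start at a primary item
]
print(secondary_items(["melon-1", "melon-2"], clicks))
# ['fruit-salad', 'smoothie']
```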

To minimise processing times, the filter acts either on the top n items of a query return ("base=matches": faster, but lower improvement), or looks deeper into the return and draws out the top n items in terms of click traffic ("base=clicks": potentially slower, but with potentially higher improvement).
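The difference between the two base modes can be pictured with a small Python sketch. This is a simplified model under stated assumptions rather than the plugin's implementation: `ranked_ids` stands for the query return in relevance order, and `click_counts` is a hypothetical precomputed map of docID to click count.

```python
def sample_primary(ranked_ids, click_counts, n, base="matches"):
    """Pick the n primary items the filter would work on (toy model)."""
    if base == "matches":
        # Cheap: only the top n query matches are considered.
        return ranked_ids[:n]
    if base == "clicks":
        # Looks through the whole return and keeps the n most clicked items.
        # sorted() is stable, so ties keep their original query order.
        by_clicks = sorted(ranked_ids, key=lambda d: -click_counts.get(d, 0))
        return by_clicks[:n]
    raise ValueError("base must be 'matches' or 'clicks'")


ranked = ["a", "b", "c", "d"]           # query return, best match first
clicks = {"c": 9, "d": 4, "a": 1}       # hypothetical click counts

print(sample_primary(ranked, clicks, 2, base="matches"))  # ['a', 'b']
print(sample_primary(ranked, clicks, 2, base="clicks"))   # ['c', 'd']
```

In this model "base=clicks" has to examine every returned item before it can choose, which is one way to picture why it may be slower on large returns.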

For the most part, the filter can be built straight into many types of complex queries, including in conjunction with other parsers, methods, facets etc. Where a CTF query is not directly compatible with other complex queries (e.g. certain joins & group functions), you can usually work around it by rearranging your input query or writing your own request handler.

A "/ctf" request handler is provided to help with testing and tuning (see attached solrconfig.xml). This quantifies primary and secondary improvements relative to the unmodified return and provides other useful metrics.

For more detail about the CTF plugin go to https://www.slideshare.net/pontneo/better-search-implementation-of-click-through-filter-as-a-query-parser-plugin-for-apache-solr-lucene

SOME EXAMPLE CTF QUERIES (parameters are described below):

1. search in text field of your documents core for "melon" at your default CTF settings;
q={!ctf}text:melon or q={!ctf v=$qq} with qq=text:melon
2. weighted search for "melon" in title or body fields using only "botanist" click traffic but otherwise default settings;
q={!ctf ctp="user:botanist" cts="user:botanist" v=$qq} with qq=title:melon^2 body:melon
3. weighted search for "melon" or "cherry" blending click traffic & publication_date boosting;
q={!boost b=recip(ms(NOW/HOUR,publication_date),3.16e-11,1,1)}{!ctf cb=5 cx=2 v=$qq} with qq=text:melon^3 text:cherry^2
or q={!ctf cb=5 cx=2 v=$qq} with qq=({!boost b=recip(ms(NOW/HOUR,publication_date),3.16e-11,1,1)}body:melon^3 {!boost b=recip(ms(NOW/HOUR,publication_date),3.16e-11,1,1)}body:cherry^2) etc
4. filtered search for "orange" with all returned items in category fruit;
q={!ctf v=$qq} with qq=text:orange and fq=category:fruit
5. filtered search for "orange" with just secondary items in category fruit;
q={!ctf cf="category:fruit" v=$qq} with qq=text:orange
6. filtered search for "orange" with just primary items in category fruit;
q={!ctf v=$qq} with qq=+text:orange +category:fruit
7. search for "lychee" with a high sensitivity to changes over time, recoiling quickly to the unmodified search return;
q={!ctf cp=5 cd=2 ctp="time_stamp:[NOW-1DAY TO NOW]" v=$qq} with qq=text:lychee
8. search for "lychee" in title field preserving original sort & including best 10% of secondary items with title word like "lychee";
q={!ctf reorder=false cf="title:lychee~0.5" cs=0.1 v=$qq} with qq=title:lychee
9. last week's search for "kiwi" based on the 10 most clicked "kiwi" items, with the best 20% of non-pdf secondary items that are then pushed down the list to help maximise improvement;
q={!ctf base=clicks cn=10 cz="NOW-7DAY" cs=0.2 cf="-doctype:pdf" cy=0.95 v=$qq} with qq=text:kiwi
10. top 5 most visited "lemon" recipes by user types 2 & 3 over the last 7 days;
q={!ctf cn=5 base=clicks restrict=true extend=false ctp="+user_type:(2 3) +time_stamp:[NOW-7DAY/DAY TO NOW]" v=$qq} with qq=recipe:*lemon*
11. top 10 recommendations for userID:x, based on userID:x's up to 20 most visited items over the last month and click traffic through those items by any other user with an interest in "fruit" since userID:x's last visit;
q={!ctf base=clicks only2y=true cn=20 ctp="userID:x AND time_stamp:[NOW-31DAY TO NOW]" cts="-userID:x AND user_interests:fruit AND time_stamp:[NOW-(last_visit)DAY TO NOW]" v=$qq} with qq=docID:* and rows=10
To remove any recommendations that userID:x has visited before, include in q;
cf="-({!join from=toDocID to=docID fromIndex=clicks_core}userID:x {!join from=fromDocID to=docID fromIndex=clicks_core}userID:x)"
12. next item in non-repeating discovery query with a simple fallback to avoid blind alleys;
q={!ctf only2y=true cts="-sessionID:currentSessionID" v=$qq} with qq=docID:currentDocID^10 categoryID:currentCategoryID and rows=1
13. related material for a given document with a simple fallback query for where there are few direct connections;
q={!ctf only2y=true base=clicks v=$qq} OR {!ctf base=clicks v=$qqq} with qq=docID:currentDocID^10 and qqq=categoryID:currentCategoryID
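
The queries above all follow the same pattern: a `{!ctf ...}` local-params prefix plus a dereferenced `qq` parameter. A minimal Python sketch for assembling such requests is shown below. The helper name, the host URL, the "documents" core name and the use of the "/select" handler are assumptions for illustration only; the helper simply quotes any value that contains characters beyond letters, digits, `_`, `.` and `-`.

```python
import re
from urllib.parse import urlencode

_SAFE = re.compile(r"[A-Za-z0-9_.\-]+")


def _fmt(value):
    """Format one local-param value, quoting it only when needed."""
    if isinstance(value, bool):
        return "true" if value else "false"
    text = str(value)
    if _SAFE.fullmatch(text):
        return text
    # Escape backslashes and double quotes before wrapping in quotes.
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def ctf_params(qq, rows=10, **local):
    """Build request parameters for a {!ctf ... v=$qq} query (hypothetical helper)."""
    parts = ["!ctf"] + [f"{name}={_fmt(value)}" for name, value in local.items()]
    parts.append("v=$qq")
    return {"q": "{" + " ".join(parts) + "}", "qq": qq, "rows": rows}


params = ctf_params("text:kiwi", base="clicks", cn=10, cs=0.2, cf="-doctype:pdf")
print(params["q"])
# {!ctf base=clicks cn=10 cs=0.2 cf="-doctype:pdf" v=$qq}

# Hypothetical host and core name; adjust to your own installation.
print("http://localhost:8983/solr/documents/select?" + urlencode(params))
```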

REQUIREMENTS:

1. document core(s) with unique docIDs
2. a clicks core for storing clicks (see attached schema.xml) with a minimum of the following fields:
timestamp - timestamp of click
fromDocID - referrer docID (may be a null value where necessary)
toDocID - destination docID
other fields (such as userID, user_type, user_interests, to_posn_in_list, user_query etc.) are not required, but filtering on them is much of the point of using this plugin (see example queries above)
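
For illustration, a click record with the minimum fields might be built like this in Python. The helper name and the `NULL_FROM_ID` placeholder are hypothetical; the null value should match whatever you configure as click_null_fromID_value, and the field names should match your own schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical placeholder; use the value configured as click_null_fromID_value.
NULL_FROM_ID = "none"


def click_doc(to_doc_id, from_doc_id=None, when=None, **extra):
    """Build one clicks-core document with the minimum required fields."""
    when = when or datetime.now(timezone.utc)
    doc = {
        # Solr date fields expect UTC in this ISO 8601 form.
        "timestamp": when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "fromDocID": from_doc_id if from_doc_id is not None else NULL_FROM_ID,
        "toDocID": to_doc_id,
    }
    doc.update(extra)  # optional fields such as userID or user_type
    return doc


doc = click_doc(
    "doc-42",
    from_doc_id="doc-7",
    when=datetime(2015, 7, 14, 11, 32, tzinfo=timezone.utc),
    userID="x",
)
# A JSON array of documents is one format Solr's update handler accepts.
print(json.dumps([doc], indent=2))
```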

PARAMETERS IN SOLRCONFIG.XML AND FOR USE IN QUERIES (see attached solrconfig.xml):

ctf settings:
solr_host_url - root containing your data cores
document_core_to_query - document core name
clicks_core_to_query - clicks core name
ctf mappings:
document_ID_field_name - document core docID field name
click_fromID_field_name - clicks core fromDocID field name
click_null_fromID_value - clicks core null fromDocID value (for clicks without a fromID)
click_toID_field_name - clicks core toDocID field name
click_time_stamp_field_name - clicks core timestamp field name
ctf parameters:
base - get primary (1y) items from best query matches or most clicked items (String: matches or clicks)
restrict - show only items with click boosts (String: true or false)
reorder - allow click boosts to affect score and sort (String: true or false)
extend - include secondary (2y) items (String: true or false)
only2y - show only 2y items (String: true or false)
cn - number of 1y items to sample (int)
cd - average clicks per 1y item (double); controls click sample period
cp - time integration (int: number of clicks, >0); low values = responsive, high values = stable
cb - click boost = cb.(fn clicks)^cx (double: cb values >0)
cx - click boost = cb.(fn clicks)^cx (double: cx values >0)
ctp - click traffic type for 1y items, a filter query on click traffic (String: use ctp="*" for any, ctp="user_type:2", ctp="userID:xxxx", ctp="time_stamp:[NOW-7DAY TO NOW]" or ctp="some function query" etc.)
cts - click traffic type for 2y items, a filter query on click traffic (String: as ctp)
cg - skews 2y scores from popular to specific (double: value 0 to 1)
cs - proportion of 2y items to allow through, lowest & oldest traffic removed first (double: value 0 to 1)
cf - 2y item type, a filter query on 2y items (String: use cf="*" for any, cf="cat:2", cf="-doc_type:pdf", cf="published:[NOW-31DAY TO NOW]", cf="some function query" etc.)
cy - position of 2y items in list (double: value between 0 (next to parent) and 1 (own click boost))
cz - lookback parameter for observing past query returns at a certain time (Solr date: use cz="NOW" or e.g. cz="NOW-7DAY" or cz="2015-07-14T11:32:00Z")
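
To give a feel for how cb and cx interact in the click boost expression cb.(fn clicks)^cx, here is a small Python sketch. The function applied to the click count (written "fn" above) is not specified here, so the sketch takes it as an argument and defaults to the identity purely as an assumption; the function name is hypothetical.

```python
import math


def click_boost(clicks, cb=1.0, cx=1.0, fn=lambda c: c):
    """Evaluate cb * fn(clicks) ** cx.

    `fn` stands in for the plugin's unspecified click function; the identity
    default is an assumption made for illustration only.
    """
    if cb <= 0 or cx <= 0:
        raise ValueError("cb and cx must be > 0")
    return cb * fn(clicks) ** cx


print(click_boost(4, cb=5, cx=2))             # 80
print(click_boost(4, cb=5, cx=0.5))           # 10.0
print(click_boost(0, cb=5, fn=math.log1p))    # 0.0
```

In this model cb scales every boost linearly, while cx above 1 widens the gap between heavily and lightly clicked items and cx below 1 narrows it.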