Vectors in Search

Dice.com code for implementing the ideas discussed in the following talks:

'Vectors in Search' - Activate 2018 conference
'Searching with Vectors' - Haystack 2019 conference

This extends my earlier work on 'Conceptual Search', which can be found at https://github.com/DiceTechJobs/ConceptualSearch (including slides and video links). In these talks, I present a number of different approaches for searching vectors at scale using an inverted index. The code implements approaches to approximate k-nearest-neighbor search, including:

LSH (using SimHash)
K-Means Tree
Vector Thresholding

and describes how these ideas can be implemented and queried efficiently within an inverted index.
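
As a rough illustration of the LSH approach listed above, the sketch below hashes a vector with SimHash (one sign bit per random hyperplane) and turns the bits into position-tagged tokens that an inverted index could store as ordinary terms. It is a minimal sketch and not the code from this repository: the helper names, the token format and the number of bits are all assumptions made for the example.

```python
import random


def make_hyperplanes(num_bits, dim, seed=42):
    """Draw num_bits random Gaussian hyperplanes (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def simhash_bits(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector is on its non-negative side."""
    return [
        1 if sum(v * p for v, p in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    ]


def bits_to_tokens(bits):
    """Tag each bit with its position (assumed format '<position>_<bit>')."""
    return ["{}_{}".format(i, b) for i, b in enumerate(bits)]


if __name__ == "__main__":
    planes = make_hyperplanes(num_bits=16, dim=4)
    doc = [0.9, 0.1, -0.3, 0.5]
    query = [0.85, 0.15, -0.25, 0.55]
    doc_tokens = set(bits_to_tokens(simhash_bits(doc, planes)))
    query_tokens = set(bits_to_tokens(simhash_bits(query, planes)))
    # The number of shared tokens approximates the angular similarity.
    print(len(doc_tokens & query_tokens), "of", len(planes), "bits agree")
```

Indexing the tokens in a whitespace-tokenized field and querying with the query vector's tokens as one OR query ranks documents by the number of matching bits.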

UPDATE: After I talked with Trey Grainger and Erik Hatcher from LucidWorks, they recommended using term frequencies in place of payloads for the solutions where I embed term weights into the index and use a special payload-aware similarity function (which would then no longer be needed). Payloads incur a significant performance penalty. The challenge with this approach is negative weights: I assume it is not possible to encode negative term frequencies. This can be worked around by using different tokens for positively and negatively weighted terms, and making similar adjustments at query time (where negative boosts can be applied in Solr as needed).
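
The workaround described above might look something like the following sketch. It drops small components (vector thresholding), then writes each remaining weight as a `token|frequency` pair, the input form expected by Lucene's delimited term frequency filter with its default `|` delimiter. The `pos`/`neg` token prefixes, the threshold and the scale factor are assumptions for illustration, not values taken from this repository. A similarity that scores on raw term frequency is also assumed.

```python
def encode_vector(vector, threshold=0.1, scale=100):
    """Encode a dense vector as 'token|tf' pairs with separate sign tokens."""
    tokens = []
    for i, weight in enumerate(vector):
        if abs(weight) < threshold:
            continue  # vector thresholding: drop small components
        tf = int(round(abs(weight) * scale))  # term frequencies must be positive integers
        if tf < 1:
            continue
        sign = "pos" if weight > 0 else "neg"  # assumed token naming
        tokens.append("{}{}|{}".format(sign, i, tf))
    return " ".join(tokens)


def query_term_weights(vector, threshold=0.1):
    """Return (token, weight) pairs for a query vector.

    A matching sign contributes positively to the dot product and an
    opposite sign contributes negatively, hence the two entries per component.
    """
    weights = []
    for i, weight in enumerate(vector):
        if abs(weight) < threshold:
            continue
        same, opposite = ("pos", "neg") if weight > 0 else ("neg", "pos")
        weights.append(("{}{}".format(same, i), abs(weight)))
        weights.append(("{}{}".format(opposite, i), -abs(weight)))
    return weights


if __name__ == "__main__":
    print(encode_vector([0.52, -0.31, 0.04, 0.0]))
    print(query_term_weights([0.5, -0.2, 0.0]))
```

How the query-side weights are turned into Solr boosts is deliberately left out, since that depends on the query parser and Solr version in use.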

Lucene Documentation: Lucene Delimited Term Frequency Filter

There has also been a recent update to Lucene core that is applicable here and, at the time of writing, is soon to make its way into Elasticsearch: Block Max WAND. This produces a significant speed-up for large boolean OR queries where you don't need to know the exact number of results but just care about getting the top-N results as fast as possible. All of the approaches I discuss here generate relatively large OR queries, so this is very relevant. I have also read that the current implementation of minimum-should-match includes similar optimizations, so the same sort of performance gain may already be attainable using appropriate mm settings, something I was already experimenting with in my code.
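
To make the mm idea concrete, here is a minimal sketch of request parameters for a large OR query with a minimum-should-match setting, using Solr's edismax parser. The field name `vector_tokens`, the example tokens and the `50%` value are assumptions chosen for illustration. Whether a given mm value actually yields the speed-up mentioned above depends on the Lucene/Solr version and would need to be measured.

```python
from urllib.parse import urlencode


def build_or_query_params(tokens, field="vector_tokens", mm="50%", rows=10):
    """Build Solr edismax parameters for an OR query over vector tokens.

    'field' is a hypothetical field name; 'mm' requires that at least this
    share of the optional clauses match before a document is scored.
    """
    return {
        "defType": "edismax",
        "q": " ".join(tokens),
        "qf": field,
        "mm": mm,
        "rows": rows,
    }


if __name__ == "__main__":
    params = build_or_query_params(["0_1", "1_0", "2_1", "3_1"])
    # The encoded string can be appended to a collection's /select URL.
    print(urlencode(params))
```

Raising mm trades recall for speed: fewer documents qualify, so fewer need to be scored.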

Directory Structure

python
- Code for implementing the k-means tree, LSH SimHash and vector thresholding algorithms, and for indexing and searching vectors in Solr using these techniques.
solr_plugins
- Java code for implementing the custom similarity classes and payloadEdismax parser described in the talk.
solr_configs
- XML snippets for importing the Solr plugins from the 'solr_vectors_in_search_plugins' Java code.

Implementation Details

Solr Version - 7.5
Python Version - 3.x+ (3.5 used)

Links to Talks

Activate 2018: 'Vectors in Search'
- Slides
- Video
Haystack 2019: 'Searching with Vectors'
- Slides
- Video

Author

Simon Hughes (Chief Data Scientist, Dice.com)

LinkedIn - https://www.linkedin.com/in/simon-hughes-data-scientist/
Twitter - https://twitter.com/hughes_meister

