Search Index Benchmark Game

A set of standardized benchmarks for comparing the speed of various aspects of search engine technologies.

This is useful both for comparing different libraries and as tooling for more easily and comprehensively comparing versions of the same technology.

Getting Started

These instructions will get you a copy of the project up and running on your local machine.

Prerequisites

The lucene benchmarks require Gradle, which can be installed from the Gradle website.

The tantivy benchmarks and the benchmark driver code require Cargo, which can be installed using rustup.
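
To confirm both tools are available before running anything, a small check like the following may help. This is a minimal sketch, not part of the project: `missing_tools` is a hypothetical helper name, and it assumes the tools are exposed on your PATH as `gradle` and `cargo`.

```shell
# Hypothetical helper (not part of this repo): print every command name
# passed as an argument that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Calling `missing_tools gradle cargo` prints nothing when both are installed, and otherwise prints the name of each missing tool on its own line.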

Installing

Clone this repo.

git clone git@github.com:jason-wolfe/search-index-benchmark-game.git

And that's it!

Running

run_all.sh takes two arguments: a file containing articles in JSON format, and a set of queries. A minimal example article file and a small set of queries are included in the project.

To run with the included examples:

./run_all.sh ./common/datasets/minimal.json ./common/queries

This will:

Build the benchmark driver code.
For each engine being tested:
1. Build the code necessary to use it.
2. Build an index from the supplied documents, and write the build time in seconds to output/$engine/build_time.txt.
3. Run all of the supplied queries a number of times, recording the time taken in output/$engine/query_output.txt.

The queries argument can be either a directory, which is searched recursively for .txt files to run, or a single .txt file, which is used directly.
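
To preview which query files a given argument would resolve to, the rule above can be approximated in shell. This is only a sketch of the described behaviour, not the project's own lookup code: `collect_query_files` is a hypothetical name, and the sorted ordering is an assumption added to make the output stable.

```shell
# Hypothetical helper (not part of this repo): list the query files that a
# queries argument resolves to.
#   directory -> every .txt file beneath it, found recursively
#   .txt file -> that file itself
collect_query_files() {
  if [ -d "$1" ]; then
    find "$1" -type f -name '*.txt' | sort
  elif [ -f "$1" ]; then
    case "$1" in
      *.txt) printf '%s\n' "$1" ;;
      *) echo "not a .txt file: $1" >&2; return 1 ;;
    esac
  else
    echo "no such file or directory: $1" >&2
    return 1
  fi
}
```

For example, `collect_query_files ./common/queries` lists every .txt file under the bundled queries directory.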

The output goes into the output subdirectory. It contains one folder per engine tested.
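
Given that layout, the build times of all engines can be compared at a glance with a short loop. This is a rough sketch: `print_build_times` is a hypothetical helper, and it assumes each build_time.txt holds a single value, which this README does not spell out.

```shell
# Hypothetical helper (not part of this repo): print "<engine> <build time>"
# for every engine folder under the output directory (default: ./output).
# Assumes each build_time.txt contains a single value.
print_build_times() {
  out_dir="${1:-output}"
  for f in "$out_dir"/*/build_time.txt; do
    [ -f "$f" ] || continue   # skip the unexpanded pattern when nothing matches
    engine=$(basename "$(dirname "$f")")
    printf '%s %s\n' "$engine" "$(cat "$f")"
  done
}
```

Run `print_build_times` from the repository root after run_all.sh has finished, or pass a different output directory as the first argument.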

Running more

Maybe you want to query again once you know the page cache is warmed up, to better represent your production workload. Or maybe you're debugging something or trying to improve query performance, and would like to run some queries without building the indexes again. For these use cases, the query_all.sh script runs a given set of queries against the already-built indexes.

query_all.sh takes the same arguments as drive_queries.rs. This format differs from that of run_all.sh, but allows more flexibility than run_all.sh currently offers.

./query_all.sh --queries ./common/queries/my_expensive_queries.txt -n 1

Important note: This assumes that each engine's project is already compiled the way you want it. If that is not the case, run preprocess_all.sh to build them with the standard process.
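
For the warm-cache case, one option is to run the same command twice and keep only the second run. The wrapper below is a minimal sketch of that idea: `run_warm` is a hypothetical helper, and it only discards the standard output of the first run. Whether query_all.sh overwrites or appends to files under output/ on a repeat run is not covered here, so check the output folder afterwards.

```shell
# Hypothetical helper (not part of this repo): run a command once to warm the
# page cache, discarding its standard output, then run it again unchanged.
run_warm() {
  "$@" >/dev/null || return 1   # warm-up run; stop if it fails
  "$@"                          # the run whose output is kept
}
```

Usage: `run_warm ./query_all.sh --queries ./common/queries/my_expensive_queries.txt -n 1`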

TODO

Supply a better representative training set for easy use.

Support more engines.

Improve benchmark/run_all.sh to allow passing more parameters to the drive.sh program.

Output a more consumable summary format of any measurements made, to make comparison easier.
