Search Engine

After reading Improving the performance of full-text search on the Dropbox Blog (which is an excellent insight into the company) I was inspired to try building my own full-text search engine.

To build on what I learnt in last year's object oriented module, I'm going to implement this in Java 8.

Aims

Implement an inverted index algorithm
Measure performance
Reiterate on original algorithm

Current Functionality

Mapper
The Mapper class takes a path to a file and uses a Tokenizer instance to generate a map, with a K<Token>, V<Posting> structure.

class Main {
    public static void main(String[] args) {
        Mapper mapper = new Mapper("./data/example/example_1.txt");
        mapper._printMap();
    }
}

This code would output the structure of the generated map.

Term    Attributes
'fish'  (fileId: 0, frequency: 2)
'one'   (fileId: 0, frequency: 1)
'two'   (fileId: 0, frequency: 1)

Grouper
The Grouper class takes a list of Mapper objects, and generates an inverted index, which is a map of terms to an array of (file ID, token frequency) pairs. Note that terms in the map are sorted alphabetically.

class Main {
    public static void main(String[] args) {
        Mapper m1 = new Mapper("./data/example_1.txt");
        Mapper m2 = new Mapper("./data/example_2.txt");

        Grouper grouper = new Grouper(m1, m2);
        grouper._printInvertedIndex();
    }
}

This code would produce an inverted index with the following structure.

Term    Postings
blue    (fileId: 0, frequency: 1), 
fish    (fileId: 0, frequency: 2), (fileId: 1, frequency: 2), 
one     (fileId: 1, frequency: 1), 
red     (fileId: 0, frequency: 1), 
two     (fileId: 1, frequency: 1),

Reducer
The Reducer class takes the inverted index output by the Grouper and writes an encoded version of the postings lists to disk.

class Main {
    public static void main(String[] args) {
        Mapper m1 = new Mapper("./data/example_1.txt");
        Mapper m2 = new Mapper("./data/example_2.txt");
        Grouper grouper = new Grouper(m1, m2);
        
        Reducer reducer = new Reducer(grouper.getInvertedIndex());
    }
}

The Reducer constructor generates an _index0.txt file.

0:1,
0:2,1:2,
1:1,
0:1,
1:1,

The encoding of this file is simple at the moment.

Lines are ordered alphabetically according to their corresponding token. Each line contains information for files containing that token only. The number before the : corresponds to the file ID (which is assigned by the Mapper class). The second number corresponds to the number of times the token appears in that file.

The reducer is capable of distributing the index across multiple files to prevent the creation of a single monolithic file. In this simple example, that limit is set arbitrarily at 100, meaning if the search space contains greater than 100 tokens then the index will be split across multiple _index<i>.txt files.

Retriever
The Retriever class builds a map of term to term location whilst terms are indexed. This map can then be queried using the get() method to return the encoded locations of the relevant files.

Retriever.get("fish");

This code returns

0:2,1:2

which indicates that the term "fish" occurs in the file with ID 0 twice and in the file with ID 1 twice. This result can then be interpreted by the Query class.

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
data		data
src/com/benhadfield		src/com/benhadfield
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
SearchEngine.iml		SearchEngine.iml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

src/com/benhadfield

src/com/benhadfield

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

SearchEngine.iml

SearchEngine.iml

Repository files navigation

Search Engine

Aims

Current Functionality

Resources

About

Releases

Packages

Languages

License

benmoose/search-engine

Folders and files

Latest commit

History

Repository files navigation

Search Engine

Aims

Current Functionality

Resources

About

Topics

Resources

License

Stars

Watchers

Forks

Languages