Skip to content

benmoose/search-engine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

55 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Search Engine

After reading Improving the performance of full-text search on the Dropbox Blog (which is an excellent insight into the company) I was inspired to try building my own full-text search engine.

To build on what I learnt in last year's object oriented module, I'm going to implement this in Java 8.

Aims

  • Implement an inverted index algorithm
  • Measure performance
  • Reiterate on original algorithm

Current Functionality

Mapper
The Mapper class takes a path to a file and uses a Tokenizer instance to generate a map, with a K<Token>, V<Posting> structure.

class Main {
    public static void main(String[] args) {
        Mapper mapper = new Mapper("./data/example/example_1.txt");
        mapper._printMap();
    }
}

This code would output the structure of the generated map.

Term    Attributes
'fish'  (fileId: 0, frequency: 2)
'one'   (fileId: 0, frequency: 1)
'two'   (fileId: 0, frequency: 1)

Grouper
The Grouper class takes a list of Mapper objects, and generates an inverted index, which is a map of terms to an array of (file ID, token frequency) pairs. Note that terms in the map are sorted alphabetically.

class Main {
    public static void main(String[] args) {
        Mapper m1 = new Mapper("./data/example_1.txt");
        Mapper m2 = new Mapper("./data/example_2.txt");

        Grouper grouper = new Grouper(m1, m2);
        grouper._printInvertedIndex();
    }
}

This code would produce an inverted index with the following structure.

Term    Postings
blue    (fileId: 0, frequency: 1), 
fish    (fileId: 0, frequency: 2), (fileId: 1, frequency: 2), 
one     (fileId: 1, frequency: 1), 
red     (fileId: 0, frequency: 1), 
two     (fileId: 1, frequency: 1),

Reducer
The Reducer class takes the inverted index output by the Grouper and writes an encoded version of the postings lists to disk.

class Main {
    public static void main(String[] args) {
        Mapper m1 = new Mapper("./data/example_1.txt");
        Mapper m2 = new Mapper("./data/example_2.txt");
        Grouper grouper = new Grouper(m1, m2);
        
        Reducer reducer = new Reducer(grouper.getInvertedIndex());
    }
}

The Reducer constructor generates an _index0.txt file.

0:1,
0:2,1:2,
1:1,
0:1,
1:1, 

The encoding of this file is simple at the moment.

Lines are ordered alphabetically according to their corresponding token. Each line contains information for files containing that token only. The number before the : corresponds to the file ID (which is assigned by the Mapper class). The second number corresponds to the number of times the token appears in that file.

The reducer is capable of distributing the index across multiple files to prevent the creation of a single monolithic file. In this simple example, that limit is set arbitrarily at 100, meaning if the search space contains greater than 100 tokens then the index will be split across multiple _index<i>.txt files.

Retriever
The Retriever class builds a map of term to term location whilst terms are indexed. This map can then be queried using the get() method to return the encoded locations of the relevant files.

Retriever.get("fish");

This code returns

0:2,1:2

which indicates that the term "fish" occurs in the file with ID 0 twice and in the file with ID 1 twice. This result can then be interpreted by the Query class.

Resources

Releases

No releases published

Packages

No packages published

Languages