
Build the Inverted Index #1

Open
RobbieMcKinstry opened this issue Nov 22, 2016 · 0 comments

We should build the index in Java so that we can use it from both MapReduce (MR) and Spark.

Our inverted index needs to be stored on disk. That means the InvertedIndex class must implement Serializable, and so must all of its instance variables. Check the Java documentation for each type you use to confirm this is true.

At the beginning of each Spark/MR job, we will load the index from disk. The first time a job is started there will be no file present, so we'll start with an empty index; every run after that will load the existing file. At the conclusion of each job, the index will be flushed back to disk. MR and Spark will each have their own path for the inverted index, so two indices in total.

Because we need the index to be a singleton (it must only exist once; if it could be created twice, two copies would overwrite the same spot on disk), we make the InvertedIndex constructor private and expose a public static method called createInvertedIndex, which takes a single argument, the String or Path to the file, and returns an InvertedIndex object. The createInvertedIndex method will check whether that file exists, and deserialize it if it does, returning the deserialized object; otherwise it returns a fresh, empty index.
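A minimal sketch of that factory, plus the flush-to-disk step described above. The flush() method name, the String path field, and the nested Posting placeholder are assumptions for illustration; the issue only pins down createInvertedIndex and the private constructor.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.HashMap;

public class InvertedIndex implements Serializable {
    private static final long serialVersionUID = 1L;

    // Placeholder for the posting struct described later in this issue.
    public static class Posting implements Serializable {
        private static final long serialVersionUID = 1L;
    }

    // HashMap and ArrayList both implement Serializable, satisfying
    // the on-disk requirement.
    private final HashMap<String, ArrayList<Posting>> index = new HashMap<>();

    private static InvertedIndex instance;
    private final String path; // where the index is flushed (assumed field)

    // Private constructor enforces the singleton property.
    private InvertedIndex(String path) {
        this.path = path;
    }

    // Loads the index from disk if the file exists; otherwise starts
    // with an empty index (the first-run case).
    public static synchronized InvertedIndex createInvertedIndex(String path) {
        if (instance != null) {
            return instance;
        }
        File file = new File(path);
        if (file.exists()) {
            try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream(file))) {
                instance = (InvertedIndex) in.readObject();
            } catch (IOException | ClassNotFoundException e) {
                throw new RuntimeException("could not deserialize index at " + path, e);
            }
        } else {
            instance = new InvertedIndex(path);
        }
        return instance;
    }

    // Flushes the index to disk at the conclusion of a job (assumed name).
    public void flush() throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(this);
        }
    }
}
```

Making createInvertedIndex synchronized also keeps the singleton safe if two threads race to create it.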

The index object itself will need a Map from String to List. The choice of Map and List implementation doesn't matter as long as both implement Serializable. As we talked about, each element of the List is a small struct containing the DocumentID, the path to that document, and the number of times the string appears in the document (the hit count). The DocumentID needs to be some kind of unique identifier for the document; luckily, the name of the document should suffice.
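The struct might look like the sketch below (the name Posting and its field names are assumptions). One wrinkle worth noting while checking the docs for each type: java.nio.file.Path does not implement Serializable, so the path is stored here as a String and converted at the boundary.

```java
import java.io.Serializable;
import java.nio.file.Path;

// One entry in a term's posting list: which document, where it
// lives, and how many times the term appears in it.
public class Posting implements Serializable {
    private static final long serialVersionUID = 1L;

    public final String documentID;     // the document name serves as the unique id
    public final String pathToDocument; // String, because Path is not Serializable
    public final long hitCount;

    public Posting(String documentID, Path toDocument, long hitCount) {
        this.documentID = documentID;
        this.pathToDocument = toDocument.toString();
        this.hitCount = hitCount;
    }
}
```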

Next, we obviously need a way to add Documents to the index.

public void add(String key, ID uniqueID, Path toDocument, long count)

Our MR and Spark jobs call this method once for each unique word appearing in a document.
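A possible implementation of add, shown here in a self-contained class so it runs on its own. The issue leaves the ID type in the signature open; this sketch assumes it is a String (the document name), per the DocumentID discussion above, and the helper postingCount exists only for demonstration.

```java
import java.io.Serializable;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;

public class IndexAddSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    // Minimal posting struct: (documentID, path, hit count).
    static class Posting implements Serializable {
        private static final long serialVersionUID = 1L;
        final String documentID;
        final String path; // String, because Path is not Serializable
        final long hitCount;

        Posting(String documentID, Path toDocument, long hitCount) {
            this.documentID = documentID;
            this.path = toDocument.toString();
            this.hitCount = hitCount;
        }
    }

    private final HashMap<String, ArrayList<Posting>> index = new HashMap<>();

    // Called once per (unique word, document) pair by the MR/Spark jobs.
    public void add(String key, String uniqueID, Path toDocument, long count) {
        // Create the posting list on first sight of the word, then append.
        index.computeIfAbsent(key, k -> new ArrayList<>())
             .add(new Posting(uniqueID, toDocument, count));
    }

    // Demonstration helper: how many documents contain this word.
    public int postingCount(String key) {
        return index.getOrDefault(key, new ArrayList<>()).size();
    }

    public static void main(String[] args) {
        IndexAddSketch idx = new IndexAddSketch();
        idx.add("hadoop", "doc1.txt", Paths.get("/data/doc1.txt"), 3);
        idx.add("hadoop", "doc2.txt", Paths.get("/data/doc2.txt"), 1);
        System.out.println(idx.postingCount("hadoop")); // prints 2
    }
}
```

computeIfAbsent keeps the "first word occurrence" and "subsequent occurrence" cases in one line instead of an explicit containsKey check.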
