
Build the Inverted Index #1

Open
RobbieMcKinstry opened this issue Nov 22, 2016 · 0 comments

We should build the index in Java so that we can use it from both MapReduce (MR) and Spark.

Our inverted index needs to be stored on disk. That means the InvertedIndex class must implement Serializable, and so must all of its instance variables. Check the Java documentation for each type you use to confirm this is true.

At the beginning of each Spark/MR job, we will load the index from disk. The first time a job is started there will be no file present, so we'll start with an empty index; every run after that will load the existing file. At the conclusion of each job, the index will be flushed back to disk. MR and Spark will each have their own path for the inverted index, so two indices in total.

Because we need the index to be a singleton (it must only exist once; if it could be created twice, two copies would overwrite the same spot on disk), we make the InvertedIndex constructor private and expose a public static method called createInvertedIndex, which takes a single argument, the String or Path to the file, and returns an InvertedIndex object. The createInvertedIndex method will check whether that file exists, and deserialize it if it does, returning the deserialized object; otherwise it returns a fresh, empty index.
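A minimal sketch of that factory, plus the flush-to-disk step described above. The flush() method name, the String path field, and the nested Posting placeholder are assumptions for illustration; the issue only pins down createInvertedIndex and the private constructor.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.HashMap;

public class InvertedIndex implements Serializable {
    private static final long serialVersionUID = 1L;

    // Placeholder for the posting struct described later in this issue.
    public static class Posting implements Serializable {
        private static final long serialVersionUID = 1L;
    }

    // HashMap and ArrayList both implement Serializable, satisfying
    // the on-disk requirement.
    private final HashMap<String, ArrayList<Posting>> index = new HashMap<>();

    private static InvertedIndex instance;
    private final String path; // where the index is flushed (assumed field)

    // Private constructor enforces the singleton property.
    private InvertedIndex(String path) {
        this.path = path;
    }

    // Loads the index from disk if the file exists; otherwise starts
    // with an empty index (the first-run case).
    public static synchronized InvertedIndex createInvertedIndex(String path) {
        if (instance != null) {
            return instance;
        }
        File file = new File(path);
        if (file.exists()) {
            try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream(file))) {
                instance = (InvertedIndex) in.readObject();
            } catch (IOException | ClassNotFoundException e) {
                throw new RuntimeException("could not deserialize index at " + path, e);
            }
        } else {
            instance = new InvertedIndex(path);
        }
        return instance;
    }

    // Flushes the index to disk at the conclusion of a job (assumed name).
    public void flush() throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(this);
        }
    }
}
```

Making createInvertedIndex synchronized also keeps the singleton safe if two threads race to create it.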

The index object itself will need a Map from String to List. The choice of Map and List implementation doesn't matter as long as both implement Serializable. As we talked about, each element of the List is a small struct containing the DocumentID, the path to that document, and the number of times the string appears in the document (the hit count). The DocumentID needs to be some kind of unique identifier for the document; luckily, the name of the document should suffice.
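The struct might look like the sketch below (the name Posting and its field names are assumptions). One wrinkle worth noting while checking the docs for each type: java.nio.file.Path does not implement Serializable, so the path is stored here as a String and converted at the boundary.

```java
import java.io.Serializable;
import java.nio.file.Path;

// One entry in a term's posting list: which document, where it
// lives, and how many times the term appears in it.
public class Posting implements Serializable {
    private static final long serialVersionUID = 1L;

    public final String documentID;     // the document name serves as the unique id
    public final String pathToDocument; // String, because Path is not Serializable
    public final long hitCount;

    public Posting(String documentID, Path toDocument, long hitCount) {
        this.documentID = documentID;
        this.pathToDocument = toDocument.toString();
        this.hitCount = hitCount;
    }
}
```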

Next, we obviously need a way to add Documents to the index.

public void add(String key, ID uniqueID, Path toDocument, long count)

Our MR and Spark jobs call this method once for each unique word appearing in a document.
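A possible implementation of add, shown here in a self-contained class so it runs on its own. The issue leaves the ID type in the signature open; this sketch assumes it is a String (the document name), per the DocumentID discussion above, and the helper postingCount exists only for demonstration.

```java
import java.io.Serializable;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;

public class IndexAddSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    // Minimal posting struct: (documentID, path, hit count).
    static class Posting implements Serializable {
        private static final long serialVersionUID = 1L;
        final String documentID;
        final String path; // String, because Path is not Serializable
        final long hitCount;

        Posting(String documentID, Path toDocument, long hitCount) {
            this.documentID = documentID;
            this.path = toDocument.toString();
            this.hitCount = hitCount;
        }
    }

    private final HashMap<String, ArrayList<Posting>> index = new HashMap<>();

    // Called once per (unique word, document) pair by the MR/Spark jobs.
    public void add(String key, String uniqueID, Path toDocument, long count) {
        // Create the posting list on first sight of the word, then append.
        index.computeIfAbsent(key, k -> new ArrayList<>())
             .add(new Posting(uniqueID, toDocument, count));
    }

    // Demonstration helper: how many documents contain this word.
    public int postingCount(String key) {
        return index.getOrDefault(key, new ArrayList<>()).size();
    }

    public static void main(String[] args) {
        IndexAddSketch idx = new IndexAddSketch();
        idx.add("hadoop", "doc1.txt", Paths.get("/data/doc1.txt"), 3);
        idx.add("hadoop", "doc2.txt", Paths.get("/data/doc2.txt"), 1);
        System.out.println(idx.postingCount("hadoop")); // prints 2
    }
}
```

computeIfAbsent keeps the "first word occurrence" and "subsequent occurrence" cases in one line instead of an explicit containsKey check.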
