
Hadoop-Letter-File-Index-Counter

A Hadoop-based Java project that counts the max number of word occurrences for each letter across the text files of a folder.

Project Description

For every letter of the alphabet, we need to retrieve the maximum number of words beginning with that letter, along with the file in which these words are located. In essence, this is a word-count problem with a distributed file system twist.

To make this work with the MapReduce model, we need two rounds of execution.

1st execution round

Map (Map_A Class)

We initially map the input text into key-value pairs as follows: we scan each file, pick out the first character of every word, and add 1 as the initial value, since a given letter is found at least once in the given file by a particular compute node.

((letter, filename), 1)
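
A minimal sketch of what Map_A might look like, assuming the classic org.apache.hadoop.mapred API (the part-00000 output name in the execution guide hints at it); the map.input.file lookup and the LetterFilenameCompositeKey constructor signature are assumptions, not the project's actual code:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class Map_A extends MapReduceBase
            implements Mapper<LongWritable, Text, LetterFilenameCompositeKey, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private String filename;

        @Override
        public void configure(JobConf job) {
            filename = job.get("map.input.file");  // name of the file this split comes from
        }

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<LetterFilenameCompositeKey, IntWritable> output,
                        Reporter reporter) throws IOException {
            for (String word : value.toString().split("\\s+")) {
                if (word.isEmpty()) continue;
                // Emit ((letter, filename), 1) for the first character of each word.
                String letter = word.substring(0, 1).toLowerCase();
                output.collect(new LetterFilenameCompositeKey(letter, filename), ONE);
            }
        }
    }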

Reduce (Reduce_A Class)

Then, we reduce the resulting pairs by key, so that for each individual letter and file we get the sum of occurrences of words starting with that very letter, producing key-value pairs of the form:

((letter, filename), sum_of_occurrences)
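
Under the same API assumption, Reduce_A could be as simple as summing the 1s per (letter, filename) key:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapred.*;

    public class Reduce_A extends MapReduceBase
            implements Reducer<LetterFilenameCompositeKey, IntWritable,
                               LetterFilenameCompositeKey, IntWritable> {

        @Override
        public void reduce(LetterFilenameCompositeKey key, Iterator<IntWritable> values,
                           OutputCollector<LetterFilenameCompositeKey, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();  // add up the 1s emitted by Map_A
            }
            output.collect(key, new IntWritable(sum));
        }
    }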

In-between results example

2nd execution round

Map (Map_B Class)

We map our in-between results from the first round so that the key is each letter of the alphabet and the value is a composite one, pairing the filename with the sum of occurrences of words starting with that particular letter.

(letter, (filename, sum_of_occurrences))
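
A sketch of Map_B, assuming the first round wrote its text output as tab-separated letter, filename, and sum fields (the actual layout depends on the composite key's toString()):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class Map_B extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, FilenameSumCompositeKey> {

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, FilenameSumCompositeKey> output,
                        Reporter reporter) throws IOException {
            // Assumed line layout from round 1: letter <TAB> filename <TAB> sum
            String[] fields = value.toString().split("\t");
            output.collect(new Text(fields[0]),
                    new FilenameSumCompositeKey(fields[1], Integer.parseInt(fields[2])));
        }
    }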

Reduce (Reduce_B Class)

At last, we can simply reduce by key once again, only this time, instead of computing the sum, we find the maximum of all the sums of occurrences and keep the corresponding filename.

(letter, (filename, max_num_of_occurrences))
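
Reduce_B then keeps only the maximum per letter; the getFilename()/getSum() accessors are assumed names, and the defensive copy guards against Hadoop reusing value objects between iterations:

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class Reduce_B extends MapReduceBase
            implements Reducer<Text, FilenameSumCompositeKey, Text, FilenameSumCompositeKey> {

        @Override
        public void reduce(Text key, Iterator<FilenameSumCompositeKey> values,
                           OutputCollector<Text, FilenameSumCompositeKey> output,
                           Reporter reporter) throws IOException {
            FilenameSumCompositeKey max = null;
            while (values.hasNext()) {
                FilenameSumCompositeKey current = values.next();
                if (max == null || current.getSum() > max.getSum()) {
                    // Copy, since Hadoop may reuse 'current' on the next iteration.
                    max = new FilenameSumCompositeKey(current.getFilename(), current.getSum());
                }
            }
            output.collect(key, max);
        }
    }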

Final results example

In order to make those pairs work within the Hadoop environment, we need to create specific classes (FilenameSumCompositeKey, LetterFilenameCompositeKey) whose objects hold the data for the composite keys and values during the execution rounds.
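
As an illustration, a composite key like LetterFilenameCompositeKey typically implements WritableComparable so Hadoop can serialize, compare, and partition it; the field names and the tab-separated toString() below are assumptions, not the project's actual code:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class LetterFilenameCompositeKey
            implements WritableComparable<LetterFilenameCompositeKey> {

        private String letter;
        private String filename;

        public LetterFilenameCompositeKey() {}  // no-arg constructor required by Hadoop

        public LetterFilenameCompositeKey(String letter, String filename) {
            this.letter = letter;
            this.filename = filename;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(letter);
            out.writeUTF(filename);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            letter = in.readUTF();
            filename = in.readUTF();
        }

        @Override
        public int compareTo(LetterFilenameCompositeKey other) {
            int cmp = letter.compareTo(other.letter);
            return cmp != 0 ? cmp : filename.compareTo(other.filename);
        }

        @Override
        public int hashCode() {
            // Equal keys must hash alike so the default partitioner groups them together.
            return letter.hashCode() * 31 + filename.hashCode();
        }

        @Override
        public String toString() {
            return letter + "\t" + filename;  // determines the round-1 text output layout
        }
    }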

Execution Guide

(while in the project directory)

  • Create the input directory in the HDFS: hadoop fs -mkdir input

  • Copy all the files from the local input directory to the HDFS input directory: hadoop fs -put ./input input

  • Create a directory to store the .class files: mkdir HLFIC_classes

  • Compile the project: javac -classpath "$(yarn classpath)" -d HLFIC_classes LetterFileIndexCounter.java LetterFilenameCompositeKey.java FilenameSumCompositeKey.java

  • Create the jar file of the project: jar -cvf HLFIC.jar -C HLFIC_classes/ .

  • Run the project: hadoop jar HLFIC.jar org.myorg.LetterFileIndexCounter input inbetween output

  • Print the results to the terminal: hadoop fs -cat output/part-00000

Project Info

  • All the input files are located in the input directory: 20 files, each containing a small phrase from Metamorphosis by Franz Kafka.
  • All the in-between results files will appear in the distributed file system under the inbetween directory.
  • All the final results files will appear in the distributed file system under the output directory.
  • The number of reducers can be changed for both execution rounds.
  • Also check out the Spark implementation.
