Formatted version of the exercise from Coursera course with all files included

nihil0/map-reduce-join-exercise

Hadoop Platform and Application Framework

Week 4: Introduction to Map/Reduce

Assignment: Joining Data

Since the formatting and exercise description in the original course are quite poor, here is a detailed tutorial version of the exercise from this Coursera course, with all files included. Just clone this repository into your home folder on the Cloudera Quickstart VM and you're all set!

If you're not familiar with Git, here's the command

git clone https://github.com/nihil0/map-reduce-join-exercise.git

Exercise in Joining data with streaming using Python code:

In Lesson 2 of the Introduction to Map/Reduce module, the Join task was described. In this assignment you are first given a Python mapper and reducer that perform the Join described in the video (the third video of the lesson). The purpose of the first part of this assignment is mainly to provide an example for the second part. Your second task is to modify that Python code (or, if you are so inclined, write it from scratch) to perform a different Join. You will be asked to upload the output file(s).

Please read through all the instructions and, if you are not a programmer, especially the programming notes. It is not a hard programming assignment, but it is well worth the effort to understand the nature of the map/reduce framework.
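The general shape of a streaming reduce-side join, which the two scripts in this exercise follow, can be sketched as below. This is a hypothetical illustration with an assumed comma-separated input layout; the actual field layout is defined by join1_mapper.py and join1_reducer.py in the repository:

```python
# Hypothetical sketch of a streaming reduce-side join.
# Assumes input lines of the form "key,value"; the real exercise
# files may use a different layout.

def mapper(lines):
    """Emit key<TAB>value pairs so that records from both input
    files that share a key become adjacent after sorting."""
    for line in lines:
        key, value = line.strip().split(",", 1)
        yield f"{key}\t{value}"

def reducer(pairs):
    """Walk the sorted pairs once, collecting all values for the
    current key and emitting one joined line per key."""
    current_key, values = None, []
    for pair in pairs:
        key, value = pair.split("\t", 1)
        if current_key is not None and key != current_key:
            yield current_key + " " + " ".join(values)
            values = []
        current_key = key
        values.append(value)
    if current_key is not None:
        yield current_key + " " + " ".join(values)

# Serial simulation of: cat files | mapper | sort | reducer
records = ["able,Apr-04 13", "able,n-01 5", "about,Feb-02 3"]
print(list(reducer(sorted(mapper(records)))))
# ['able Apr-04 13 n-01 5', 'about Feb-02 3']
```

The reducer can get away with a single sequential pass only because the shuffle (or `sort` in the serial test) delivers all lines with the same key consecutively.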


Part 1

Step 1: Make the Python scripts for the mapper and reducer executable by running the following commands (printed on one line here for ease of copying and pasting). Note that for chmod +x to be effective, each script needs a shebang line such as #!/usr/bin/env python at the top.

chmod +x join1_mapper.py && chmod +x join1_reducer.py

Step 2: Test the program in serial execution by executing the following command:

cat join1_File*.txt | ./join1_mapper.py | sort | ./join1_reducer.py

Explanation: cat join1_File*.txt prints the contents of the files join1_FileA.txt and join1_FileB.txt to the standard output stream. The "pipe" operator | redirects that output to the script join1_mapper.py. The next pipe sends the output of join1_mapper.py to the built-in program sort, which sorts its lines in alphabetical order. The sorted output is then piped to join1_reducer.py, which prints its result to standard output. It should look like this:

Apr-04 able 13 n-01 5
Dec-15 able 100 n-01 5
Feb-02 about 3 11
Mar-03 about 8 11
Feb-22 actor 3 22
Feb-23 burger 5 15
Mar-08 burger 2 15
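The shell pipeline above mimics what Hadoop does between the map and reduce phases: sort plays the role of the shuffle, guaranteeing that all lines sharing a key reach the reducer consecutively, so the reducer can join them in one pass. A small self-contained illustration (hypothetical data, not the actual exercise files):

```python
from itertools import groupby

# Unsorted "key<TAB>value" lines, as a mapper might emit them
# from two different input files (hypothetical sample data).
mapped = [
    "burger\tFeb-23 5",
    "able\tApr-04 13",
    "burger\tMar-08 2",
    "able\tn-01 5",
]

# Sorting brings equal keys together; groupby then hands the
# reducer logic one contiguous run of lines per key.
joined = {}
for key, group in groupby(sorted(mapped), key=lambda l: l.split("\t")[0]):
    joined[key] = [l.split("\t", 1)[1] for l in group]

print(joined)
# {'able': ['Apr-04 13', 'n-01 5'], 'burger': ['Feb-23 5', 'Mar-08 2']}
```

Without the sort step in the middle, lines for the same key would arrive interleaved and a single-pass streaming reducer could not join them.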

Step 3: Run the following commands to put the data files join1_FileA.txt and join1_FileB.txt into the Hadoop Distributed File System (HDFS). You can also do this graphically using HUE.

Create a folder called "input"

hdfs dfs -mkdir /user/cloudera/input

Put the first file in

hdfs dfs -put ~/map-reduce-join-exercise/join1_FileA.txt /user/cloudera/input/

Now the second one

hdfs dfs -put ~/map-reduce-join-exercise/join1_FileB.txt /user/cloudera/input/

Step 4: Run the map-reduce job using the following command (on one line to prevent line-break issues):

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/cloudera/input -output /user/cloudera/output_join  -mapper /home/cloudera/join1_mapper.py -reducer /home/cloudera/join1_reducer.py

It should run successfully. You can check the output using HUE, or from the terminal with hdfs dfs -cat /user/cloudera/output_join/part-00000. The result should be similar to that produced at the end of Step 2. If the job complains that it cannot find the scripts, adding -file /home/cloudera/join1_mapper.py -file /home/cloudera/join1_reducer.py to the command ships them with the job.

Part 2

Step 1: Create the data files using make_join2data.py with different arguments. This is done by the shell script make_data.sh, which can be run by invoking the command:

sh make_data.sh
