Norvig Web Data Science Award Examples

This is a fork of the original CommonCrawl examples, adapted to be used as a starting point for your entry to the Norvig Web Data Science Award.

Getting started

We recommend using the virtual machine image as your development environment, as described on the contest website.

Overview of the examples

Example MapReduce code

See the code for all examples on GitHub.

All examples support the same arguments:

org.commoncrawl.examples.Example*
                         -in <inputpath>
                         -out <outputpath>
                       [ -overwrite ]
                       [ -numreducers <number_of_reducers> ]
                       [ -conf <conffile> ]
                       [ -maxfiles <maxfiles> ]

Where:

-in
Point to the path of your input files. You can use globbing if your Hadoop distribution supports it.
-out
Point to the path to store the output files.
-overwrite
If the output path exists, this switch allows the example to overwrite the existing directory.
-numreducers
Set the maximum number of reducers to run. Defaults to a single reducer.
-conf
Path to additional configuration.
-maxfiles
Maximum number of files to process.
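
As an illustration of how the switches above fit together, here is a minimal Java sketch of a parser for them. It is not the parser used by the examples themselves: the class name `ExampleOptions`, its field names, and the choice of `-1` to mean "no file limit" are assumptions made for this sketch. Only the switch names and the single-reducer default come from the list above.

```java
/** Hypothetical holder for the shared switches; not the repository's own parser. */
final class ExampleOptions {
    String inputPath;         // value of -in (required)
    String outputPath;        // value of -out (required)
    boolean overwrite;        // true when -overwrite is present
    int numReducers = 1;      // documented default: a single reducer
    String confFile;          // value of -conf, or null when absent
    long maxFiles = -1;       // assumption: -1 means "no limit"

    /** Parses the switches described above and rejects anything unknown. */
    static ExampleOptions parse(String[] args) {
        ExampleOptions o = new ExampleOptions();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-in":          o.inputPath = value(args, ++i); break;
                case "-out":         o.outputPath = value(args, ++i); break;
                case "-overwrite":   o.overwrite = true; break;
                case "-numreducers": o.numReducers = Integer.parseInt(value(args, ++i)); break;
                case "-conf":        o.confFile = value(args, ++i); break;
                case "-maxfiles":    o.maxFiles = Long.parseLong(value(args, ++i)); break;
                default:
                    throw new IllegalArgumentException("Unknown argument: " + args[i]);
            }
        }
        if (o.inputPath == null || o.outputPath == null) {
            throw new IllegalArgumentException("-in and -out are required");
        }
        return o;
    }

    /** Returns the value following a switch, or fails if the switch is the last token. */
    private static String value(String[] args, int i) {
        if (i >= args.length) {
            throw new IllegalArgumentException("Missing value after " + args[i - 1]);
        }
        return args[i];
    }
}
```

Calling `ExampleOptions.parse` with `-in`, `-out` and `-maxfiles 5` would leave `numReducers` at 1 and `overwrite` at false, which matches the defaults described above.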

These examples are included:

org.commoncrawl.examples.ExampleArcMicroformat
An example showing how to analyze the CommonCrawl ARC web content files.
org.commoncrawl.examples.ExampleMetadataDomainPageCount
An example showing how to use the CommonCrawl 'metadata' files to quickly gather high level information about the corpus' content.
org.commoncrawl.examples.ExampleMetadataStats
An example showing how to use the CommonCrawl 'metadata' files to quickly gather high level information about the corpus' content.
org.commoncrawl.examples.ExampleTextWordCount
An example showing how to use the CommonCrawl 'textData' files to efficiently work with CommonCrawl corpus text content.
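
To give a feel for the kind of job these examples run, the following Java sketch mimics a per-domain page count in memory, without Hadoop. It is not the code of ExampleMetadataDomainPageCount: the class `DomainPageCountSketch`, its `map` and `reduce` methods, and the use of the URL host as the "domain" are assumptions for illustration. The real examples read CommonCrawl files through Hadoop and may define a domain differently.

```java
import java.net.URI;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** In-memory illustration of the map and reduce steps; hypothetical, not from the repository. */
final class DomainPageCountSketch {

    /** "Map" step: emit one (host, 1) pair for every URL with a parseable host. */
    static List<Map.Entry<String, Integer>> map(List<String> urls) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String url : urls) {
            try {
                String host = URI.create(url).getHost();
                if (host != null) {
                    pairs.add(new SimpleEntry<>(host.toLowerCase(Locale.ROOT), 1));
                }
            } catch (IllegalArgumentException e) {
                // Malformed URL: skip it rather than fail the whole run.
            }
        }
        return pairs;
    }

    /** "Reduce" step: sum the counts per host, sorted by host name. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }
}
```

In a real MapReduce job the framework groups the emitted pairs by key between the two steps and may spread the reduce work over several reducers; here `reduce` simply does the grouping itself.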

Build and package the examples

In the terminal, you can build and package the examples by moving to the commoncrawl-examples directory (~/git/commoncrawl-examples) and running:

$ ant

Inside Eclipse you can build the project by selecting "Project → Build Project" from the menu bar.

Both methods will create a jar bundle in ~/git/commoncrawl-examples/dist/lib.

Running the MapReduce examples

To run an example on at most 5 input files, open a shell and run:

$ hadoop jar dist/lib/commoncrawl-examples-1.0.1.jar [EXAMPLECLASS] -in [INPUT] -out [OUTPUT] -maxfiles 5

For org.commoncrawl.examples.ExampleMetadataStats that would be:

$ hadoop jar dist/lib/commoncrawl-examples-1.0.1.jar org.commoncrawl.examples.ExampleMetadataStats -in [INPUT] -out [OUTPUT] -maxfiles 5

You can use this same command for each included example.

The Eclipse project includes a run configuration for the ExampleTextWordCount example. You can select it from the "Run" menu entry. You can use this run configuration as a template for other configurations.

Example Pig script

example.pig

An example counting the occurrences of HTTP status codes. You can run the Pig script from the terminal by moving to the examples directory and running:

$ pig example.pig
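
For readers who are new to Pig, the group-and-count idea behind such a script can be sketched in plain Java. This is not a translation of example.pig, whose contents are not reproduced here. The class `StatusCodeCountSketch`, its method names, and the assumption that each record carries a status line such as `HTTP/1.1 200 OK` are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical in-memory equivalent of grouping by status code and counting. */
final class StatusCodeCountSketch {

    /** Extracts the numeric code from a status line such as "HTTP/1.1 404 Not Found". */
    static int parseStatusCode(String statusLine) {
        String[] parts = statusLine.trim().split("\\s+");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Not a status line: " + statusLine);
        }
        return Integer.parseInt(parts[1]);
    }

    /** Groups the status lines by code and counts each group, sorted by code. */
    static Map<Integer, Long> countStatusCodes(List<String> statusLines) {
        return statusLines.stream()
                .map(StatusCodeCountSketch::parseStatusCode)
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }
}
```

The `groupingBy` collector with `counting()` plays the role that a group-then-count pair of statements would play in a Pig script.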

Using the CommonCrawl ARC files in MapReduce and Pig

These examples come with an InputFormat for MapReduce and a Loader for Pig. The examples above show how to load the CommonCrawl ARC files using these classes.

MennoTammens/commoncrawl-examples