Simple Apache Spark jobs for aggregating the IRS Form 909 dataset available here. The jobs compute averages and medians on a nationwide and per-state basis.
- Scala 2.11
- sbt 0.13.8
- Apache Spark 2.1 for Hadoop 2.4
More recent Hadoop versions are not supported because of a bug related to S3 support. See more about it here.
- Start the Spark master
$SPARK_HOME/sbin/start-master.sh
- Start a Spark slave
$SPARK_HOME/sbin/start-slave.sh MASTER_URL
- Build the fat jar
sbt clean assembly
- Deploy the jar
$SPARK_HOME/bin/spark-submit --class "IrsRevenueApp" --master "MASTER_URL" --executor-memory MAX_MEMORY target/scala-2.11/irs-revenue-assembly-1.0.jar
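
The four steps above can be chained into a single script. The following is a minimal sketch, not part of this repository: the `run` and `deploy` function names and the `DRY_RUN` switch are hypothetical helpers, and `SPARK_HOME` is assumed to be set in the environment. By default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the steps listed above (not shipped with the repo).
# DRY_RUN=1 (the default) prints each command instead of executing it.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# deploy MASTER_URL MAX_MEMORY
# Both arguments are the same placeholders used in the commands above.
deploy() {
  local master_url="$1"
  local max_memory="$2"
  # Abort with a message if SPARK_HOME is not set.
  local spark_home="${SPARK_HOME:?SPARK_HOME must be set}"

  run "$spark_home/sbin/start-master.sh"
  run "$spark_home/sbin/start-slave.sh" "$master_url"
  run sbt clean assembly
  run "$spark_home/bin/spark-submit" --class IrsRevenueApp \
    --master "$master_url" --executor-memory "$max_memory" \
    target/scala-2.11/irs-revenue-assembly-1.0.jar
}
```

To try it, source the file and call for example `deploy spark://HOST:7077 2g`, where the host and memory values are illustrative. Set `DRY_RUN=0` to execute the commands instead of printing them.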