This project builds a data warehouse with an Airflow pipeline. The data comes from Kaggle: https://www.kaggle.com/lakritidis/identifying-influential-bloggers-techcrunch/data?select=authors.csv. The data model was designed with business users in mind, so they can derive meaningful insights such as comparing sentiment on a post or identifying the blogger with the most influence. A sentiment analysis model classifies blog comments as positive or negative.
The data is stored in an AWS S3 bucket, extracted and transformed with a PySpark script running on an EMR cluster via Livy, and the transformed data is written back to S3 in parquet format. The transformed data is then copied into a Redshift database. All of this is orchestrated with Airflow.
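At a high level, the transform step reads the raw Kaggle CSVs from S3, cleans them, and writes parquet back to S3 for the Redshift COPY. Below is a minimal PySpark sketch of that step; the bucket names, paths, and the specific transformation are illustrative placeholders, not the project's actual values.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("blog_dwh_transform").getOrCreate()

# Read a raw CSV straight from S3 (the EMR nodes need an IAM role with S3 access).
posts = spark.read.csv("s3://example-raw-bucket/posts.csv", header=True, inferSchema=True)

# Illustrative clean-up: parse the date column and drop rows with no content.
posts_clean = (
    posts
    .withColumn("date", F.to_date("date"))
    .filter(F.col("content").isNotNull())
)

# Write the result back to S3 as parquet, ready for the COPY into Redshift.
posts_clean.write.mode("overwrite").parquet("s3://example-transformed-bucket/posts/")
```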
The data was organised into a star schema dimensional model. The star schema consists of 3 dimension tables and 2 fact tables:
- author - author_id, author, meibi, meibx
- comments - comment_id, post_id, content, author, date, vote
- posts - post_id, title, blogger_name, blogger_id, number_of_comments, content, url, date, number_of_retrieved_comments
- word_count - author_id, author, word_count_stopwords, word_count_nostopwords
- comment_review - author_id, author, post_id, comment_id, date, content, sentiment
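These tables are created in Redshift by create_tables.py before the pipeline runs (see the setup steps below). The DDL for the comment_review fact table might look roughly like the sketch below; the column types and distribution/sort keys are assumptions, not the exact definitions used in the project.

```python
# Hypothetical DDL for the comment_review fact table, of the kind create_tables.py
# would execute against Redshift. Types and keys are assumed, not taken from the repo.
CREATE_COMMENT_REVIEW = """
CREATE TABLE IF NOT EXISTS comment_review (
    comment_id  INTEGER NOT NULL,
    post_id     INTEGER NOT NULL,
    author_id   INTEGER NOT NULL,
    author      VARCHAR(256),
    date        DATE,
    content     VARCHAR(65535),
    sentiment   VARCHAR(16)        -- 'positive' or 'negative' from the sentiment model
)
DISTKEY(post_id)
SORTKEY(date);
"""
```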
Before starting the DAG, you need to set up the Redshift and S3 connections and the Airflow variables. This configuration can be done in the Airflow UI by navigating to the Connections and Variables sections under the Admin tab.
The Airflow variables to set are listed in airflow_variables.json.
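Inside the DAG these settings are read through Airflow's standard mechanisms. The sketch below assumes Airflow 2 with the Amazon and Postgres provider packages installed; the connection IDs and variable names are examples, and the real ones are in airflow_variables.json.

```python
from airflow.models import Variable
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.postgres.hooks.postgres import PostgresHook

# Variables set under Admin > Variables (names here are placeholders).
s3_bucket = Variable.get("s3_bucket")
aws_region = Variable.get("aws_region", default_var="us-west-2")

# Connections set under Admin > Connections (IDs here are placeholders).
s3_hook = S3Hook(aws_conn_id="aws_credentials")
redshift_hook = PostgresHook(postgres_conn_id="redshift")
```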
The Airflow DAG is scheduled to run the ETL pipeline daily.
The ETL pipeline needs the Redshift tables to exist before the full DAG runs:
- Run create_cluster.sh to create a Redshift cluster (a rough boto3 equivalent is sketched after this list).
- Create the tables by running create_tables.py.
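create_cluster.sh presumably wraps the AWS CLI or boto3; a rough boto3 equivalent of creating a Redshift cluster is shown below. The cluster identifier, node type, credentials, and IAM role ARN are placeholders you would replace with your own values.

```python
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")

redshift.create_cluster(
    ClusterIdentifier="blog-dwh-cluster",          # placeholder identifier
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=2,
    DBName="blogdb",
    MasterUsername="admin",
    MasterUserPassword="ReplaceWithASecret1",      # never commit real credentials
    IamRoles=["arn:aws:iam::123456789012:role/redshift-s3-read"],  # role that lets COPY read from S3
    PubliclyAccessible=True,
)
```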
The configuration settings and file paths are defined in export_env_variables.sh.
An EMR cluster is needed to run the transformation scripts. The PySpark script can be found in the transform folder. The emr_util file contains functions for creating a cluster, starting a Spark session, and terminating a Spark session.
The EMR tasks are included in the DAG file.
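To give a feel for how the pieces fit together, here is a simplified sketch of the DAG wiring: create the EMR cluster, submit the PySpark transform through Livy, then terminate the cluster before the Redshift load continues. The helper names imported from emr_util and the task IDs are assumptions, not the exact names used in the repo.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Assumed helper names; the real functions live in emr_util (create a cluster,
# start a Spark/Livy session, terminate the session and cluster).
from emr_util import create_emr_cluster, submit_transform_via_livy, terminate_emr_cluster

with DAG(
    dag_id="blog_dwh_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",   # the pipeline runs once a day
    catchup=False,
) as dag:

    create_cluster = PythonOperator(
        task_id="create_emr_cluster", python_callable=create_emr_cluster
    )
    run_transform = PythonOperator(
        task_id="run_pyspark_transform", python_callable=submit_transform_via_livy
    )
    terminate_cluster = PythonOperator(
        task_id="terminate_emr_cluster", python_callable=terminate_emr_cluster
    )
    # ...COPY-to-Redshift and data quality tasks would follow here...

    create_cluster >> run_transform >> terminate_cluster
```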