Data-modeling-Postgres

Easy-to-understand data modeling with Postgres to ETL data into a Star Schema

Introduction

This project is one of several I worked on during my Udacity data engineering training. In it, I model user activity data for a music streaming app called Sparkify: I create a relational database and an ETL pipeline designed to optimize queries for understanding which songs users are listening to. In PostgreSQL, I define fact and dimension tables and insert data into them.

Dataset & Data model

Song Dataset

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log Dataset

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.

The log files in the dataset I'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset:

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

And below is an example of what the data in a log file, 2018-11-12-events.json, looks like:

[pandas-overview: screenshot of the log events loaded into a pandas DataFrame]
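The log files contain one JSON event per line, so pandas can load a file directly. A minimal sketch, assuming the line-delimited layout produced by the event simulator (the preview call is illustrative):

```python
import pandas as pd

# Each log file holds one JSON event per line, hence lines=True.
df = pd.read_json("log_data/2018/11/2018-11-12-events.json", lines=True)
print(df.head())
```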

Data model

ERD

[ERD-Sparkify: entity-relationship diagram of the Sparkify star schema]

Data Dictionary

songs: dimension table about songs

| Column name | Type | Example |
| --- | --- | --- |
| song_id | varchar, PRIMARY KEY | "SONHOTT12A8C13493C" |
| title | varchar | "Something Girls" |
| artist_id | varchar | "AR7G5I41187FB4CE6C" |
| year | integer | 1982 |
| duration | float | 233.40363 |

artists: dimension table about artists

| Column name | Type | Example |
| --- | --- | --- |
| artist_id | varchar, PRIMARY KEY | "ARGSJW91187B9B1D6B" |
| name | varchar | "JennyAnyKind" |
| location | varchar | "North Carolina" |
| latitude | float | 35.21962 |
| longitude | float | -80.01955 |

time: dimension table about the date-time units of each play

| Column name | Type | Example |
| --- | --- | --- |
| start_time | timestamp, PRIMARY KEY | "2018-11-11 02:33:56.796" |
| hour | integer | 2 |
| day | integer | 11 |
| week | integer | 45 |
| month | integer | 11 |
| year | integer | 2018 |
| dayofweek | integer | 6 |

users: dimension table about users

| Column name | Type | Example |
| --- | --- | --- |
| userId | integer, PRIMARY KEY | 69 |
| firstName | varchar | "Anabelle" |
| lastName | varchar | "Simpson" |
| gender | varchar | "F" |
| level | varchar | "free" |

songplays: fact table about songplays

| Column name | Type | Example |
| --- | --- | --- |
| songplay_id | serial, PRIMARY KEY | 1 |
| start_time | timestamp, FOREIGN KEY references time (start_time) | "2018-11-11 02:33:56.796" |
| userId | integer, FOREIGN KEY references users (userId) | 69 |
| song_id | varchar, FOREIGN KEY references songs (song_id) | null |
| artist_id | varchar, FOREIGN KEY references artists (artist_id) | null |
| level | varchar | "free" |
| session_id | integer | 455 |
| location | varchar | "Philadelphia-Camden-Wilmington, PA-NJ-DE-MD" |
| user_agent | varchar | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36" |
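Put together, these tables form a star schema with songplays at the center. As an illustration, the fact table's DDL might look like the following Python query string; constraint details such as ON CONFLICT and NOT NULL choices are assumptions here, and the authoritative definitions are the ones created by create_tables.py:

```python
# Hypothetical DDL string mirroring the data dictionary above.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP REFERENCES time (start_time),
        userId      INTEGER   REFERENCES users (userId),
        song_id     VARCHAR   REFERENCES songs (song_id),
        artist_id   VARCHAR   REFERENCES artists (artist_id),
        level       VARCHAR,
        session_id  INTEGER,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```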

ETL process

To fill the database created above with these local JSON files, we need an ETL pipeline that uses pandas to parse the files and extract the needed fields for each table. The steps to extract one file and fill one table are as follows (a minimal code sketch follows the list):

  • Create a get_files function that returns the JSON song files
  • Use pandas to read a JSON song file
  • Select the columns for song ID, title, artist ID, year, and duration
  • Select the record in the dataframe and convert it to a list
  • Execute the insert statement with that one record to add the song to the songs table
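As a concrete illustration of those steps, here is a minimal sketch for one song file. The connection string, the recursive glob, and the ON CONFLICT clause are assumptions; the real statements live in the project's scripts.

```python
import glob

import pandas as pd
import psycopg2


def get_files(filepath):
    """Return all JSON files under filepath, walking the nested partitions."""
    return sorted(glob.glob(f"{filepath}/**/*.json", recursive=True))


# Hypothetical local connection; adjust credentials to your setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# A song file holds a single JSON object, so read it as a pandas Series.
song_files = get_files("song_data")
df = pd.read_json(song_files[0], typ="series")

# Keep only the fields the songs table needs, as a plain Python list.
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].tolist()

# Insert that one record into the songs table.
cur.execute(
    """INSERT INTO songs (song_id, title, artist_id, year, duration)
       VALUES (%s, %s, %s, %s, %s)
       ON CONFLICT (song_id) DO NOTHING""",  # assumed conflict handling
    song_data,
)
conn.commit()
```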

The same process is applied to the other tables, with different fields extracted for each.
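For example, the time table is filled by expanding each event's timestamp into the units listed in the data dictionary with pandas' datetime accessor. A sketch, assuming a log DataFrame df like the one loaded earlier, whose ts field holds milliseconds since the epoch:

```python
import pandas as pd

# Convert the millisecond epoch timestamps to pandas datetimes.
t = pd.to_datetime(df["ts"], unit="ms")

# One row per event, matching the columns of the time table.
time_df = pd.DataFrame({
    "start_time": t,
    "hour": t.dt.hour,
    "day": t.dt.day,
    "week": t.dt.isocalendar().week,
    "month": t.dt.month,
    "year": t.dt.year,
    "dayofweek": t.dt.dayofweek,
})
```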

Note about the songplays table:

This table is a little more complicated, since it needs information from the songs table, the artists table, and the original log files. Because the log files do not specify an ID for either the song or the artist, you'll need to get the song ID and artist ID by querying the songs and artists tables for matches on song title, artist name, and song duration. More details are in the notebook (etl-proto.ipynb).
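A minimal sketch of that lookup, continuing from the code above (the exact query text and the log field names song, artist, and length are assumptions based on the typical Sparkify log schema):

```python
import pandas as pd

# Match each log event to a song/artist pair by title, artist name, and duration.
song_select = """
    SELECT s.song_id, a.artist_id
    FROM songs s
    JOIN artists a ON s.artist_id = a.artist_id
    WHERE s.title = %s AND a.name = %s AND s.duration = %s
"""

for _, row in df.iterrows():
    cur.execute(song_select, (row.song, row.artist, row.length))
    match = cur.fetchone()
    # Most log songs are absent from the million-song subset,
    # so song_id and artist_id are frequently null.
    song_id, artist_id = match if match else (None, None)
    cur.execute(
        """INSERT INTO songplays (start_time, userId, song_id, artist_id,
                                  level, session_id, location, user_agent)
           VALUES (%s, %s, %s, %s, %s, %s, %s, %s)""",
        (pd.to_datetime(row.ts, unit="ms"), row.userId, song_id, artist_id,
         row.level, row.sessionId, row.location, row.userAgent),
    )
conn.commit()
```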

Code execution steps:

  • Run create_tables.py to drop the database and tables if they exist, then create them.
  • Run etl.py to execute the ETL process on the whole dataset and load the data into the tables.
  • Run test.ipynb to confirm that the tables were created with the correct columns. Make sure to click "Restart kernel" to close the database connection after running this notebook, and after etl-proto.ipynb as well.
  • Run EXAMPLE-QUERIES.ipynb to execute some queries for analysis purposes; an example of the kind of query it contains follows.
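For instance, the star schema makes a question like "where do free-tier users listen from?" a single aggregation over the fact table. A hypothetical example of such a query, not necessarily one from the notebook:

```python
# Hypothetical analysis query: top listening locations for free-tier users.
cur.execute("""
    SELECT location, COUNT(*) AS plays
    FROM songplays
    WHERE level = 'free'
    GROUP BY location
    ORDER BY plays DESC
    LIMIT 5;
""")
for location, plays in cur.fetchall():
    print(location, plays)
```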
