Swiss Air Quality Index

This project has been developed as part of the Data Engineering Zoomcamp course provided by DataTalks.Club. The data used has been extracted from the Google Air Quality API.

Below, you can find the project description to understand the content, as well as setup instructions. Feel free to ⭐ and clone this repo 😉

Tech Stack

Visual Studio Code · Python · Pandas · Anaconda · dbt · Linux Ubuntu · Google Cloud · Looker Studio · Terraform · Git

  • Data Analysis & Exploration: SQL/Python
  • Cloud: Google Cloud Platform
    • Data Lake: Google Cloud Storage
    • Data Warehouse: BigQuery
  • Infrastructure as Code (IaC): Terraform
  • Data ingestion (Batch/Workflow Orchestration): Mage
  • Data Transformation: dbt
  • Data Visualization: Looker Studio
  • CI/CD: dbt

Project Structure

The project has been structured with the following folders and files:

  • mage: Workflow orchestration pipeline
  • dbt: Data transformation and CI/CD pipeline
  • looker: Report from Looker Studio
  • terraform: IaC for provisioning the pipeline infrastructure in GCP using Terraform
  • requirements.txt: Project requirements
  • images: Screenshots of the results

Project Description

The dataset was obtained from the Google Air Quality API and contains various columns with air quality data for a selected list of Swiss cities. To prepare the data, some preprocessing steps were conducted. The following actions were performed using Mage to get a clean dataset (a minimal sketch follows the list below). This Medium article was used as a reference to understand the API and extract the data. A big shout-out to Robert Martin-Short for the tutorial.

  • Extract the relevant pollutants and air quality index (AQI) from the API
  • Create the columns with the selected cities, latitude, and longitude
  • Remove rows with NaN values
  • Remove duplicates
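
As an illustration of these steps, here is a minimal, hedged sketch of the extraction and cleaning logic. It is not the actual Mage code from this repo: the API key, the city list, and the request/response fields of the currentConditions:lookup endpoint are assumptions based on the public Google Air Quality API.

    import requests
    import pandas as pd

    API_KEY = "your-api-key"  # placeholder; keep real keys out of the repo
    URL = f"https://airquality.googleapis.com/v1/currentConditions:lookup?key={API_KEY}"

    # Hypothetical subset of the Swiss cities used in the project
    CITIES = {
        "Zurich": (47.3769, 8.5417),
        "Geneva": (46.2044, 6.1432),
        "Bern": (46.9480, 7.4474),
    }

    rows = []
    for city, (lat, lon) in CITIES.items():
        payload = {
            "location": {"latitude": lat, "longitude": lon},
            "extraComputations": ["POLLUTANT_CONCENTRATION"],  # also return pollutant values
        }
        resp = requests.post(URL, json=payload, timeout=30)
        resp.raise_for_status()
        body = resp.json()

        # Flatten the AQI and the pollutant concentrations into one row per city
        row = {"city": city, "latitude": lat, "longitude": lon}
        for index in body.get("indexes", []):
            row["aqi"] = index.get("aqi")
        for pollutant in body.get("pollutants", []):
            row[pollutant["code"]] = pollutant.get("concentration", {}).get("value")
        rows.append(row)

    df = pd.DataFrame(rows)
    df = df.dropna()           # remove rows with NaN values
    df = df.drop_duplicates()  # remove duplicates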

Afterward, the final clean data is ingested into a GCS bucket and BigQuery. Finally, transformations are performed using dbt (see the dbt folder) to get the production-ready data for dashboarding in Looker.
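
For reference, a hedged sketch of this ingestion step could look as follows. The bucket and table names are placeholders rather than the ones used in the repo, and the actual project runs this logic inside Mage blocks rather than a standalone script.

    import pandas as pd
    from google.cloud import bigquery, storage

    BUCKET = "my-air-quality-bucket"           # placeholder bucket name
    TABLE = "my-project.air_quality.raw_data"  # placeholder project.dataset.table

    # The cleaned dataset produced by the Mage pipeline
    df = pd.read_parquet("clean_air_quality.parquet")

    # Upload the file to the data lake (GCS)
    storage.Client().bucket(BUCKET).blob("air_quality/clean_air_quality.parquet") \
        .upload_from_filename("clean_air_quality.parquet")

    # Load the same data into the warehouse (BigQuery)
    client = bigquery.Client()
    client.load_table_from_dataframe(df, TABLE).result()  # wait for the load job to finish
    print(f"Loaded {client.get_table(TABLE).num_rows} rows into {TABLE}")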

Mage Data Ingestion


The following picture shows the two pipelines used to send the data to the Google Cloud bucket. The data can be sent either directly to the bucket or to a partitioned folder inside the bucket with a year/month/day structure. The latter approach is used so that the file can be updated on a daily basis while the data from previous days is kept. Finally, the data is sent from the bucket to BigQuery.
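
A simplified sketch of the partitioned export is shown below. In the project this logic lives in a Mage data exporter block; the bucket name and the use of pyarrow here are assumptions.

    from datetime import date

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyarrow import fs

    BUCKET = "my-air-quality-bucket"  # placeholder bucket name

    def export_partitioned(df: pd.DataFrame) -> None:
        """Write today's data under a year/month/day folder structure in GCS."""
        today = date.today()
        df = df.assign(year=today.year, month=today.month, day=today.day)

        gcs = fs.GcsFileSystem()  # authenticates via GOOGLE_APPLICATION_CREDENTIALS
        pq.write_to_dataset(
            pa.Table.from_pandas(df),
            root_path=f"{BUCKET}/air_quality",
            partition_cols=["year", "month", "day"],  # creates year=/month=/day= folders
            filesystem=gcs,
        )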

dbt Data Transformation


Once the data is in BigQuery, a complete transformation step is performed using dbt to get the final clean dataset back in BigQuery. Four datasets are generated with dbt: two staging and two production. In each pair, one dataset contains the air quality data for all cities and the other for a single city, which can be changed in dbt accordingly. The dataset called prod_all_cities is the one used for the dashboard visualization in Looker. For the deployment on GitHub, a CI/CD check was run in dbt with the command dbt build --select state:modified+

Visualization

CI/CD

Finally, to streamline the development process, a fully automated Build and CI/CD pipeline was created using dbt:

dbt Build


dbt CI/CD


Project Set Up

The Python version used for this project is Python 3.10.

  1. Clone the repo (or download it as zip):

    git clone https://github.com/benitomartin/de-ch-weather.git
  2. Create the virtual environment named main-env using Conda with Python version 3.10:

    conda create -n main-env python=3.10
    conda activate main-env
  3. Install the project dependencies from the requirements.txt file:

    pip install -r requirements.txt
    
  4. Install Terraform:

    conda install -c conda-forge terraform

Each project folder contains a README.md file with instructions on how to run the code. I highly recommend creating a virtual environment for each one. Additionally, please note that a GCP account, credentials, and proper IAM roles are necessary for the scripts to work correctly (a minimal credentials check is sketched after the list below). The following IAM roles have been used for this project:

  • BigQuery Data Editor
  • BigQuery Job User
  • BigQuery User
  • BigQuery Admin
  • Storage Admin
  • Compute Admin
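
As a quick, hedged check that the credentials and roles are configured correctly, a short snippet like the one below can list the buckets and datasets the service account can see; the service-account path is a placeholder.

    import os
    from google.cloud import bigquery, storage

    # Point the client libraries at the service-account key (placeholder path)
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account.json"

    # Both clients pick up the credentials from the environment variable
    print("Buckets:", [b.name for b in storage.Client().list_buckets()])
    print("Datasets:", [d.dataset_id for d in bigquery.Client().list_datasets()])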

To run the project, first go to the terraform folder to create the infrastructure, then to the mage folder to extract the data and store it in the GCS bucket and BigQuery, and finally to the dbt folder to transform the data.

Evaluation Criteria

The following criteria for the evaluation have been fulfilled:

  • Problem description: The project is well described and it's clear and understandable
  • Cloud: The project is developed on the cloud (Google) and IaC tools (Terraform) are used for provisioning the infrastructure
  • Data Ingestion: Fully deployed workflow orchestration using Mage
  • Data warehouse: Tables are created in BigQuery
  • Transformations: Tables are transformed using dbt
  • Dashboard: 3 Pages with 6 visualizations using Looker
  • Reproducibility: Instructions are clear, it's easy to run the code, and it works.