ETL pipeline hosted on GCP. Uses Cloud SQL, GCS, and Cloud Functions, with data validation and cron-scheduled updates that fetch data from public sources.

DnanaDev/Covid19-GCP-Data-Pipeline


Data Pipeline For COVID-19 - India Analysis

Documentation for the data pipeline created for the mentorskool community COVID-19 project.

The project provides an automated ETL pipeline that collects data related to the COVID-19 outbreak from APIs, web sources, and similar feeds, and gives analysts access to consistent, clean, and up-to-date data. The pipeline is hosted on the Google Cloud Platform and has the following components:

Pipeline

1. Starter Functions

  1. Cloud SQL: connecting and querying - starter notebook
  2. Data Ingestion: the helper code and functions used to build the ETL data pipeline - notebook
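
The Cloud SQL connect-and-query step in the starter notebook can be sketched as below. This is a minimal sketch, assuming a Cloud SQL PostgreSQL instance reached over its IP with `psycopg2` as the driver; the host, database, and user names are placeholders, not values from the repo (the notebooks may use SQLAlchemy or the Cloud SQL proxy instead):

```python
try:
    import psycopg2  # assumed driver; the notebooks may use SQLAlchemy instead
except ImportError:  # allow the DSN helper below to be used without the driver
    psycopg2 = None

def build_dsn(host, dbname, user, password, port=5432):
    """Assemble a libpq-style DSN string for a Cloud SQL PostgreSQL instance."""
    return f"host={host} port={port} dbname={dbname} user={user} password={password}"

def query_cloudsql(dsn, sql):
    """Open a connection, run a read-only query, and return all rows."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(sql)
            return cur.fetchall()

# Hypothetical usage (table and credentials are illustrative only):
# dsn = build_dsn("10.0.0.3", "covid19", "analyst", "secret")
# rows = query_cloudsql(dsn, "SELECT state, confirmed FROM cases LIMIT 5;")
```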

2. Raw Data Ingestion Functions

  1. Cloud Function - fetch_raw_covid_api_data
  2. Offline function - script and documentation
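
The shape of a raw-ingestion Cloud Function like `fetch_raw_covid_api_data` can be sketched as below. This is an assumption-laden sketch: the API URL and bucket name are placeholders, a Pub/Sub-triggered (background) function signature is assumed, and the real function's validation logic is not reproduced here:

```python
import json
from datetime import date
from urllib.request import urlopen  # stdlib; the real function may use requests

def object_name_for(run_date):
    """Build a date-partitioned GCS object name for the raw dump."""
    return f"raw/covid_api/{run_date.isoformat()}.json"

def fetch_raw(url):
    """Download and parse the raw JSON payload from a public API."""
    with urlopen(url) as resp:
        return json.loads(resp.read())

def fetch_raw_covid_api_data(event, context):
    """Cloud Function entry point (background-function signature assumed):
    fetch the raw payload and stage it in a GCS bucket as a dated object."""
    from google.cloud import storage  # google-cloud-storage client library
    data = fetch_raw("https://example.com/covid-api")   # placeholder URL
    bucket = storage.Client().bucket("my-raw-bucket")   # hypothetical bucket
    blob = bucket.blob(object_name_for(date.today()))
    blob.upload_from_string(json.dumps(data), content_type="application/json")
```

Staging the raw payload unchanged in GCS before any cleaning keeps a replayable record: if the clean-ingestion step later changes, old dumps can be reprocessed without re-hitting the source API.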

3. Clean SQL Ingestion Functions

  1. Cloud Function - Cloud_function_Ingestion_SQL
  2. Functions for creating the local DB (pg_dump was used to create the Cloud SQL instance), connecting to Cloud SQL, and offline clean-data ingestion, in a well-documented notebook
  3. Current DB ERD
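
The clean-ingestion step (validate a raw record, then load it into the SQL schema) can be sketched as below. The field names and the `cases` table are assumptions for illustration; the authoritative column layout is the ERD linked above:

```python
def clean_record(raw):
    """Validate and normalise one raw API record into a row ready for SQL insert.
    The field names here are assumptions, not the repo's confirmed schema."""
    required = ("state", "confirmed", "recovered", "deaths", "date")
    if any(key not in raw for key in required):
        raise ValueError(f"missing required fields in record: {raw}")
    return {
        "state": raw["state"].strip().title(),   # normalise state-name casing
        "confirmed": int(raw["confirmed"]),      # counts arrive as strings
        "recovered": int(raw["recovered"]),
        "deaths": int(raw["deaths"]),
        "report_date": raw["date"],
    }

def to_insert_sql(table, row):
    """Render a parameterised INSERT (psycopg2 named-placeholder style)."""
    cols = ", ".join(row)
    placeholders = ", ".join(f"%({col})s" for col in row)
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders})"

# Hypothetical usage with a cursor: cur.execute(to_insert_sql("cases", row), row)
```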
