COVID-19 Data Engineering Project

Note: The infrastructure (Terraform) code has not been uploaded yet due to some conflicts.

This project focuses on collecting, processing, and analyzing COVID-19 data using various data engineering tools and technologies. The project employs Terraform for infrastructure setup, dbt for analytical engineering, Mage.ai for workflow orchestration and data transformation, Google Cloud Platform (GCP) BigQuery for data warehousing, PySpark for batch processing, and Confluent Kafka for real-time data processing.

Table of Contents

  • Introduction
  • Technologies Used
  • Project Structure
  • Setup Instructions
  • Dashboard
  • Usage
  • Contributing
  • License

Introduction

The COVID-19 pandemic has generated massive amounts of data related to infection rates, testing, hospitalizations, and more. This project aims to centralize, process, and analyze this data to provide valuable insights for healthcare professionals, policymakers, and the general public.

Technologies Used

  • Terraform: Infrastructure as code tool used to provision and manage the project's infrastructure on cloud platforms.
  • dbt (Data Build Tool): Analytics engineering tool used for transforming and modeling data in the data warehouse.
  • Mage.ai: Workflow orchestration and data transformation platform used to streamline data processing tasks.
  • Google Cloud Platform (GCP) BigQuery: Fully managed, serverless data warehouse used for storing and querying large datasets.
  • PySpark: Python API for Apache Spark used for large-scale batch processing of data (a minimal batch-processing sketch follows this list).
  • Confluent Kafka: Distributed streaming platform used for real-time data processing and event streaming.
  • Docker Compose: Tool for defining and running multi-container Docker applications. Used to run Mage.ai and Confluent Kafka services.
  • Looker: Business intelligence and data visualization platform used to create dashboards and reports.
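
To illustrate the batch-processing layer, below is a minimal PySpark sketch that aggregates daily case counts. The input path, column names, and output location are placeholder assumptions for illustration, not the project's actual schema.

# batch_aggregate.py -- minimal PySpark batch sketch (paths and columns are assumptions)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("covid19-batch").getOrCreate()

# Read raw daily case records; the schema here is assumed for illustration.
raw = spark.read.csv("gs://covid19-raw/cases/*.csv", header=True, inferSchema=True)

# Aggregate confirmed cases and deaths per country per day.
daily = (
    raw.groupBy("date", "country")
       .agg(
           F.sum("confirmed").alias("total_confirmed"),
           F.sum("deaths").alias("total_deaths"),
       )
)

# Write the aggregated result back to cloud storage as Parquet.
daily.write.mode("overwrite").parquet("gs://covid19-processed/daily_counts/")

spark.stop()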

Project Structure

The project is structured as follows:

covid19/
│
├── analytics/
│   ├── dbt/
│   │   ├── analyses/
│   │   ├── macros/
│   │   └── ...
│   └── ...
│
├── containerization/
│   ├── docker/
│   │   ├── docker-compose.yml
│   │   └── ...
│   └── ...
│
├── workflows/
│   ├── mage/
│   │   ├── export_data/
│   │   │   └── export_to_gcp.py
│   │   └── load_data/
│   │       └── load_data_to_gcp.py
│   └── ...
│
├── kafka/
│   ├── consumer.py
│   └── producer.py
│
└── README.md

Setup Instructions

  1. Infrastructure Setup: Use the Terraform scripts in the infrastructure/terraform/ directory to provision the required cloud resources (this directory is not yet included in the repository; see the note above). Make sure to configure your cloud provider credentials and settings.

  2. Analytical Engineering: Utilize dbt models in the analytics/dbt/models/ directory to transform and model data in the data warehouse.

  3. Workflow Orchestration: Define and manage data processing pipelines with Mage.ai in the workflows/mage/workflows/ directory.

  4. Data Warehousing: Load data into the BigQuery warehouse using the pipelines in the workflows/mage/ directory (see the Mage.ai exporter sketch after this list).

  5. Real-time Processing: Develop real-time data processing pipelines using the Confluent Kafka consumer and producer scripts in the kafka/ directory (a minimal producer/consumer sketch follows this list).

  6. Docker Compose Setup: Use the provided docker-compose.yml file to run Mage.ai and Confluent Kafka services. Make sure Docker is installed on your system.

  7. Looker Dashboards: Use Looker to import and customize the dashboards.
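
For steps 3 and 4, a Mage.ai data exporter block can be sketched as below, mirroring Mage's generated exporter template for BigQuery. The table ID and config profile are placeholders, and the exact import path for get_repo_path varies across Mage versions; the actual blocks live in workflows/mage/export_data/ and workflows/mage/load_data/.

# export_to_bigquery.py -- sketch of a Mage.ai data exporter block (placeholder table and config)
from os import path

from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from mage_ai.settings.repo import get_repo_path  # in older Mage versions this lives elsewhere
from pandas import DataFrame

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_big_query(df: DataFrame, **kwargs) -> None:
    # Table ID and config profile are assumptions for illustration.
    table_id = 'my-project.covid19.daily_cases'
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        table_id,
        if_exists='replace',  # replace the table each run; use 'append' for incremental loads
    )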
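
For step 5, the real-time layer can be sketched with the Confluent Kafka Python client (confluent-kafka). The topic name, broker address, and message shape below are illustrative assumptions; the repository's kafka/producer.py and kafka/consumer.py hold the actual implementation.

# kafka_sketch.py -- minimal producer/consumer sketch using the confluent-kafka client
# (topic, broker address, and payload are illustrative assumptions)
import json

from confluent_kafka import Consumer, Producer

BOOTSTRAP_SERVERS = 'localhost:9092'
TOPIC = 'covid19_cases'


def produce_case_event() -> None:
    producer = Producer({'bootstrap.servers': BOOTSTRAP_SERVERS})
    event = {'country': 'PK', 'date': '2021-01-01', 'confirmed': 1234}
    # Queue the message asynchronously, then block until it is delivered.
    producer.produce(TOPIC, value=json.dumps(event).encode('utf-8'))
    producer.flush()


def consume_case_events() -> None:
    consumer = Consumer({
        'bootstrap.servers': BOOTSTRAP_SERVERS,
        'group.id': 'covid19-consumers',
        'auto.offset.reset': 'earliest',
    })
    consumer.subscribe([TOPIC])
    try:
        while True:
            msg = consumer.poll(1.0)  # wait up to 1 second for a message
            if msg is None:
                continue
            if msg.error():
                print(f'Consumer error: {msg.error()}')
                continue
            print(json.loads(msg.value().decode('utf-8')))
    finally:
        consumer.close()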

Dashboard

  • Currently in progress.

Usage

  • Modify and extend the provided scripts and configurations to suit your specific data processing requirements.
  • Run Docker Compose to start Mage.ai and Confluent Kafka services.
  • Use Looker to visualize and explore data through the imported dashboards.
  • Refer to individual tool documentation for detailed usage instructions and best practices.

Contributing

Contributions to improve and expand this project are welcome! Feel free to fork the repository, make your changes, and submit a pull request.

License

This project is licensed under the MIT License.
