
Realtime Data Streaming | End-to-End Data Engineering Project

Table of Contents

  • Introduction
  • System Architecture
  • What You'll Learn
  • Technologies
  • Getting Started
  • Screenshots of the Project Setup Steps

Introduction

This project serves as a comprehensive guide to building an end-to-end data engineering pipeline. It covers each stage from data ingestion to processing and finally to storage, utilizing a robust tech stack that includes Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. Everything is containerized using Docker for ease of deployment and scalability.

System Architecture

(System architecture diagram)

The project is designed with the following components:

  • Data Source: The randomuser.me API generates random user data for the pipeline (see the producer sketch after this list).
  • Apache Airflow: Orchestrates the pipeline and stores the fetched data in a PostgreSQL database.
  • Apache Kafka and Zookeeper: Stream data from PostgreSQL to the processing engine.
  • Control Center and Schema Registry: Monitor the Kafka streams and manage their schemas.
  • Apache Spark: Processes the data with its master and worker nodes.
  • Cassandra: Stores the processed data.
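
Between the data source and the processing engine sits a small producer step: fetch a record and publish it to a Kafka topic. The snippet below is a minimal sketch of that step using the kafka-python library (which appears in the screenshots); the topic name users_created, the broker address, and the selected fields are assumptions, not the project's exact schema.

# Minimal sketch: fetch one random user from randomuser.me and publish it to Kafka.
# The topic name, broker address, and selected fields are assumptions.
import json

import requests
from kafka import KafkaProducer  # kafka-python


def fetch_random_user() -> dict:
    """Call the randomuser.me API and flatten the first result."""
    response = requests.get("https://randomuser.me/api/", timeout=10)
    response.raise_for_status()
    user = response.json()["results"][0]
    return {
        "first_name": user["name"]["first"],
        "last_name": user["name"]["last"],
        "email": user["email"],
        "city": user["location"]["city"],
    }


def publish_user(bootstrap_servers: str = "localhost:9092") -> None:
    """Serialize the record as JSON and send it to the broker."""
    producer = KafkaProducer(
        bootstrap_servers=[bootstrap_servers],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("users_created", fetch_random_user())
    producer.flush()


if __name__ == "__main__":
    publish_user()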

What You'll Learn

  • Setting up a data pipeline with Apache Airflow (see the DAG sketch after this list)
  • Real-time data streaming with Apache Kafka
  • Distributed synchronization with Apache Zookeeper
  • Data processing techniques with Apache Spark
  • Data storage solutions with Cassandra and PostgreSQL
  • Containerizing your entire data engineering setup with Docker
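
The orchestration piece can be as small as a single DAG with one task that runs the fetch-and-publish step on a schedule. Below is a minimal Airflow 2.x sketch; the dag_id (borrowed from the user_automation_dag screenshot), start date, schedule, and callable name are illustrative assumptions rather than the project's exact DAG.

# Minimal Airflow DAG sketch: run the fetch-and-publish step once a day.
# dag_id, start_date, schedule, and the callable are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def stream_users_to_kafka():
    """Placeholder for the producer logic sketched above."""
    ...


with DAG(
    dag_id="user_automation",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
):
    PythonOperator(
        task_id="stream_data_from_api",
        python_callable=stream_users_to_kafka,
    )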

Technologies

  • Apache Airflow
  • Python
  • Apache Kafka
  • Apache Zookeeper
  • Apache Spark
  • Cassandra
  • PostgreSQL
  • Docker

Getting Started

  1. Clone the repository:
git clone https://github.com/ZeroTwoDataRW/DE-Stream-Project-Random-Generated-User-Data.git
  2. Navigate to the project directory:
cd DE-Stream-Project-Random-Generated-User-Data
  3. Run Docker Compose to spin up the services:
docker-compose up -d
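
Once the containers are up, the processing job reads the Kafka topic with Spark Structured Streaming and writes each record to Cassandra. The sketch below shows that wiring under assumed names: the topic users_created, keyspace spark_streams, table created_users, hostnames, and connector versions are illustrative and should be adjusted to match the actual Compose setup.

# Minimal sketch: stream the Kafka topic into Cassandra with Spark Structured Streaming.
# Topic, keyspace, table, hostnames, and connector versions are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("user-stream")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,"
        "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1",
    )
    .config("spark.cassandra.connection.host", "localhost")
    .getOrCreate()
)

# JSON schema of the records produced to the users_created topic.
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
    StructField("city", StringType()),
])

users = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "users_created")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Write the stream to the Cassandra table; a checkpoint location is required for streaming sinks.
query = (
    users.writeStream
    .format("org.apache.spark.sql.cassandra")
    .option("checkpointLocation", "/tmp/spark_checkpoint")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .start()
)
query.awaitTermination()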

Screenshots of the Project Setup Steps

  • installing_venv_python3
  • checking_formated_data_results
  • kafka-and-zookeeper-connected
  • installed_kafka_python_library
  • docker-images-installed
  • sending_data_to_kafka_broker
  • kafka_topic_created
  • running_images
  • checking_airflow_working
  • user_automation_dag
  • all_docker_images_running
  • dag_is_running
  • adding_git_to_project
  • keyspace_and_table_created_successfully
  • describe_spark_streams
