Building-streaming-ETL-Data-pipeline


Building a streaming ETL data pipeline using Apache Airflow, Apache Kafka, Apache Spark, and MinIO (S3-compatible object storage).

1. Project overview and architecture

In this project, we build a real-time ETL (Extract, Transform, Load) data pipeline that pulls data from an open API. Building a streaming ETL data pipeline involves ingesting, processing, transforming, and loading real-time data into a data storage or analytics system. This overview outlines the process of building such a pipeline using Apache Kafka for data ingestion, Apache Spark for data processing, and MinIO (an S3-compatible object store) for data storage.

Apache Kafka

  • Set up Kafka Cluster: Deploy a Kafka cluster with multiple brokers for high availability and scalability.

  • Create Kafka Topics: Define topics to categorize and organize the incoming data streams based on their sources or types.

  • Configure Kafka Producers: Implement Kafka producers that send data from the open API to the appropriate Kafka topics (a producer sketch follows this list).
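
As a minimal sketch of the producer side (assuming the kafka-python and requests packages; the API URL, topic name, and broker address below are placeholders, not values taken from this repository):

import json
import time

import requests
from kafka import KafkaProducer

# Placeholder settings: adjust to your broker, topic, and open API endpoint.
BOOTSTRAP_SERVERS = "localhost:9092"
TOPIC = "open_api_events"
API_URL = "https://example.com/api/data"  # hypothetical open API

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP_SERVERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Fetch a batch of records from the open API and publish each one to Kafka.
    response = requests.get(API_URL, timeout=10)
    response.raise_for_status()
    for record in response.json():
        producer.send(TOPIC, value=record)
    producer.flush()
    time.sleep(5)  # simple polling interval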

Automation and Orchestration

Leverage automation and orchestration tools (e.g., Apache Airflow) to manage and coordinate the various components of the pipeline, enabling efficient deployment, scheduling, and maintenance.
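
As a minimal sketch of how Airflow could schedule the producer job (the DAG id, schedule, and script path are assumptions for illustration, not the repository's actual DAG):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal DAG sketch: run a hypothetical producer script every 10 minutes.
with DAG(
    dag_id="streaming_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="*/10 * * * *",
    catchup=False,
) as dag:
    run_producer = BashOperator(
        task_id="run_kafka_producer",
        bash_command="python /opt/airflow/dags/kafka_producer.py",  # placeholder path
    )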

Data Processing with Apache Spark

Apache Spark is a powerful open-source distributed processing framework that excels at processing large-scale data streams. In this pipeline, Spark consumes data from Kafka topics, performs transformations and computations, and prepares the data for storage in MinIO's S3-compatible object storage.

  • Configure Spark Streaming: Set up a Spark Streaming application to consume data from Kafka topics in real time.
  • Define Transformations: Implement the necessary transformations and computations on the incoming data streams using Spark's APIs. This may include data cleaning, filtering, aggregations, and enrichment from other data sources.
  • Integrate with S3 Storage: Configure Spark to write the processed data to the S3-compatible bucket in a suitable format (e.g., Parquet, Avro, or CSV) for efficient storage and querying, as in the sketch after this list.
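
The sketch below shows what such a job could look like using Spark Structured Streaming: read from the Kafka topic, parse the JSON payload, and write Parquet files to an S3 path. The schema, topic, bucket, endpoint, and credentials are placeholder assumptions; the S3A options assume a MinIO endpoint and require the hadoop-aws and Spark Kafka connector packages on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Placeholder schema for the incoming JSON messages.
schema = StructType([
    StructField("id", StringType()),
    StructField("value", StringType()),
])

spark = (
    SparkSession.builder.appName("streaming-etl")
    # S3A settings pointing at a MinIO endpoint (placeholder credentials).
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Consume the Kafka topic and parse the JSON value column.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "open_api_events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Write the processed stream to the S3-compatible bucket as Parquet.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://streaming-etl/events/")
    .option("checkpointLocation", "s3a://streaming-etl/checkpoints/events/")
    .start()
)
query.awaitTermination()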

Data Storage in MinIO S3

MinIO is a high-performance, S3-compatible object store. A MinIO bucket is equivalent to an S3 bucket, the fundamental container used to store objects (files) in object storage. In this pipeline, MinIO will serve as the final destination for the processed data streams.

  • Create S3 Bucket: Set up an S3 bucket in MinIO to store the processed data streams (a minimal client sketch follows this list).

  • Define Data Organization: Determine the appropriate folder structure and naming conventions for organizing the data in the bucket based on factors such as time, source, or data type.

  • Configure Access and Permissions: Set up appropriate access controls and permissions for the bucket to ensure data security and compliance with organizational policies.
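
As a minimal sketch of bucket creation with the MinIO Python client (endpoint, credentials, and bucket name are placeholders; the same can be done from the MinIO console):

from minio import Minio

# Placeholder endpoint and credentials for a local MinIO instance.
client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False,
)

bucket = "streaming-etl"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)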


2. Getting Started

Prerequisites

  • Docker and Docker Compose
  • An S3 bucket: we will use MinIO object storage

3. Setting up the project environment

a. Make sure Docker is running. From the terminal: docker --version

b. Clone the repository and navigate to the project directory

 git clone https://github.com/fermat01/Building-streaming-ETL-Data-pipeline.git

and

 cd Building-streaming-ETL-Data-pipeline

c. Create two folders, dags and logs, for Apache Airflow:

mkdir dags/ logs/

and give them the required permissions:

chmod -R 777 dags/
chmod -R 777 logs/

d. From the terminal, create a Docker network:

docker network create streaming_network

To be continued ...