Building-streaming-ETL-Data-pipeline


Building a streaming ETL data pipeline using Apache Airflow, Apache Kafka, Apache Spark, and MinIO (S3-compatible object storage).

1. Project overview and architecture

In this project, we build a real-time ETL (Extract, Transform, Load) data pipeline that pulls data from an open API. Building a streaming ETL data pipeline involves ingesting, processing, transforming, and loading real-time data into a data storage or analytics system. This overview outlines the process of building such a pipeline using Apache Kafka for data ingestion, Apache Spark for data processing, and MinIO (an S3-compatible object store) for data storage.

Apache Kafka

  • Set up Kafka Cluster: Deploy a Kafka cluster with multiple brokers for high availability and scalability.

  • Create Kafka Topics: Define topics to categorize and organize the incoming data streams based on their sources or types.

  • Configure Kafka Producers: Implement Kafka producers that send data from the open API to the appropriate Kafka topics (a producer sketch follows this list).
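
As a minimal sketch of the producer side (assuming the kafka-python and requests packages; the API URL, topic name, and broker address below are placeholders, not values taken from this repository):

import json
import time

import requests
from kafka import KafkaProducer

# Placeholder settings: adjust to your broker, topic, and open API endpoint.
BOOTSTRAP_SERVERS = "localhost:9092"
TOPIC = "open_api_events"
API_URL = "https://example.com/api/data"  # hypothetical open API

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP_SERVERS,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Fetch a batch of records from the open API and publish each one to Kafka.
    response = requests.get(API_URL, timeout=10)
    response.raise_for_status()
    for record in response.json():
        producer.send(TOPIC, value=record)
    producer.flush()
    time.sleep(5)  # simple polling interval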

Automation and Orchestration

Leverage automation and orchestration tools (e.g., Apache Airflow) to manage and coordinate the various components of the pipeline, enabling efficient deployment, scheduling, and maintenance.
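
As a minimal sketch of how Airflow could schedule the producer job (the DAG id, schedule, and script path are assumptions for illustration, not the repository's actual DAG):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal DAG sketch: run a hypothetical producer script every 10 minutes.
with DAG(
    dag_id="streaming_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="*/10 * * * *",
    catchup=False,
) as dag:
    run_producer = BashOperator(
        task_id="run_kafka_producer",
        bash_command="python /opt/airflow/dags/kafka_producer.py",  # placeholder path
    )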

Data Processing with Apache Spark

Apache Spark is a powerful open-source distributed processing framework that excels at processing large-scale data streams. In this pipeline, Spark consumes data from Kafka topics, performs transformations and computations, and prepares the data for storage in MinIO's S3-compatible object storage.

  • Configure Spark Streaming: Set up a Spark Streaming application to consume data from Kafka topics in real time.
  • Define Transformations: Implement the necessary transformations and computations on the incoming data streams using Spark's APIs. This may include data cleaning, filtering, aggregations, and enrichment from other data sources.
  • Integrate with S3 Storage: Configure Spark to write the processed data to the S3-compatible bucket in a suitable format (e.g., Parquet, Avro, or CSV) for efficient storage and querying, as in the sketch after this list.
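
The sketch below shows what such a job could look like using Spark Structured Streaming: read from the Kafka topic, parse the JSON payload, and write Parquet files to an S3 path. The schema, topic, bucket, endpoint, and credentials are placeholder assumptions; the S3A options assume a MinIO endpoint and require the hadoop-aws and Spark Kafka connector packages on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Placeholder schema for the incoming JSON messages.
schema = StructType([
    StructField("id", StringType()),
    StructField("value", StringType()),
])

spark = (
    SparkSession.builder.appName("streaming-etl")
    # S3A settings pointing at a MinIO endpoint (placeholder credentials).
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Consume the Kafka topic and parse the JSON value column.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "open_api_events")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Write the processed stream to the S3-compatible bucket as Parquet.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://streaming-etl/events/")
    .option("checkpointLocation", "s3a://streaming-etl/checkpoints/events/")
    .start()
)
query.awaitTermination()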

Data Storage in MinIO S3

MinIO is a high-performance, S3-compatible object store. A MinIO bucket is equivalent to an S3 bucket, the fundamental container used to store objects (files) in object storage. In this pipeline, MinIO will serve as the final destination for the processed data streams.

  • Create S3 Bucket: Set up an S3 bucket in MinIO to store the processed data streams (a minimal client sketch follows this list).

  • Define Data Organization: Determine the appropriate folder structure and naming conventions for organizing the data in the bucket based on factors such as time, source, or data type.

  • Configure Access and Permissions: Set up appropriate access controls and permissions for the bucket to ensure data security and compliance with organizational policies.
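
As a minimal sketch of bucket creation with the MinIO Python client (endpoint, credentials, and bucket name are placeholders; the same can be done from the MinIO console):

from minio import Minio

# Placeholder endpoint and credentials for a local MinIO instance.
client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False,
)

bucket = "streaming-etl"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)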


2. Getting Started

Prerequisites

  • Docker and Docker Compose
  • An S3 bucket: we will use MinIO object storage

3. Setting up the project environment

a. Make sure Docker is running. From the terminal: docker --version

b. Clone the repository and navigate to the project directory

 git clone https://github.com/fermat01/Building-streaming-ETL-Data-pipeline.git

and

 cd Building-streaming-ETL-Data-pipeline

c. Create two folders, dags and logs, for Apache Airflow:

mkdir dags/ logs/

and give them the required permissions:

chmod -R 777 dags/
chmod -R 777 logs/

d. From the terminal, create a Docker network:

docker network create streaming_network

To be continued ...