# Twitter ETL with Airflow and PySpark

| 🪧 Vitrine.Dev |  |
| -------------- | --- |
| ✨ Name | Twitter ETL with Airflow and PySpark |
| 🏷️ Technologies | Apache Airflow, PySpark, Spark, Python |

A scheduled process of data extraction, transformation, and loading is orchestrated with Apache Airflow, pulling data from a tweet-simulator website (the official Twitter API became paid).

The data lake is divided into three layers: bronze, silver, and gold. Raw extracted data is stored in the bronze folder. The silver layer holds data with some treatment applied, to ease visualization and the creation of statistical/machine learning models. The gold layer holds the most refined data, organized automatically for consumption.
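As a rough illustration of a bronze-to-silver step, here is a minimal PySpark sketch; the folder paths and column names are assumptions for the example, not the repository's actual ones.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.appName("twitter_silver").getOrCreate()

# Bronze: raw JSON exactly as extracted from the tweet-simulator API.
bronze = spark.read.json("datalake/bronze/twitter_datascience")

# Silver: keep only the fields useful for analysis, with light typing.
silver = bronze.select(
    f.col("id").alias("tweet_id"),
    f.col("author_id"),
    f.to_date("created_at").alias("created_date"),
    f.col("text"),
)

silver.write.mode("overwrite").partitionBy("created_date").parquet(
    "datalake/silver/twitter_datascience"
)
```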

Airflow is used to schedule and order the steps. Although this is a small-data project, Spark is used for educational and training purposes: the result is a simple version of a data engineering task built with big data technology and automated pipelines.
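A minimal sketch of what the scheduling side can look like, assuming a daily run; the DAG id, dates, and task callables here are illustrative, not the repository's actual code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # call the hook and land raw JSON in the bronze folder

def transform():
    ...  # run the Spark cleaning job that feeds the silver folder

with DAG(
    dag_id="twitter_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # Airflow enforces the step order
```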

The Airflow pipeline is divided into three folders: hook, operator, and insight. The hook folder has the script in which the connection with the API is established and the data is obtained. The operator script performs the data cleaning and treatment. The insight script is responsible for the loading process.
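For the hook side, here is a minimal sketch of a custom hook built on Airflow's HttpHook; the connection id, endpoint path, and query parameters are assumptions for the example.

```python
from airflow.providers.http.hooks.http import HttpHook

class TwitterHook(HttpHook):
    """Fetches recent tweets from the simulator's REST endpoint."""

    def __init__(self, query: str, start_time: str, end_time: str):
        self.query = query
        self.start_time = start_time
        self.end_time = end_time
        # The simulator's base URL lives in this (hypothetical) Airflow connection.
        super().__init__(http_conn_id="twitter_default")

    def run_extraction(self) -> dict:
        session = self.get_conn()  # also resolves self.base_url from the connection
        url = (
            f"{self.base_url}/2/tweets/search/recent"
            f"?query={self.query}"
            f"&start_time={self.start_time}&end_time={self.end_time}"
        )
        response = session.get(url)
        response.raise_for_status()
        return response.json()
```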

The 'src' folder also contains a notebooks folder, with short data analyses of each data lake layer.
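To give a flavor of those analyses, here is a minimal notebook-style sketch against a hypothetical gold table; the path and column names are assumptions, not the repository's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.appName("gold_analysis").getOrCreate()

gold = spark.read.parquet("datalake/gold/twitter_insight")

# Example question: which days had the most tweet activity?
(
    gold.groupBy("created_date")
    .agg(f.sum("n_tweets").alias("total_tweets"))
    .orderBy(f.desc("total_tweets"))
    .show(10)
)
```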
