
AIRFLOW DATA PIPELINE TO DOWNLOAD PODCASTS

Building a four-step data pipeline using Airflow. The pipeline downloads podcast episodes.

PREREQUISITES

Install the following locally:

  • Python 3.8+
  • Airflow 2.3+

Use the official Airflow constraints file to set up Airflow locally in a virtual environment (venv).

Python packages: install the required packages from the requirements file.

RUNNING THE DAGS

Ensure that your DAGs are present in the dags folder of $AIRFLOW_HOME. If the webserver and scheduler are running, you can view and trigger your DAGs at localhost:8080.

podcasts pipeline: This code creates a DAG (Directed Acyclic Graph) in Airflow with a single task that downloads, parses, and extracts the list of episodes from the Marketplace podcast metadata. The DAG is scheduled to run once a day and does not catch up on missed runs. The download_parse_extract task uses the requests library to download the metadata from the specified URL, then uses xmltodict to parse the XML data and extract the list of episodes. (Screenshot: pipeline output)
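A minimal sketch of what this DAG could look like using the TaskFlow API; the DAG id, start date, and feed URL are assumptions, and only the task name, libraries, and scheduling behaviour come from the description above:

```python
import pendulum
import requests
import xmltodict
from airflow.decorators import dag, task


@dag(
    dag_id="podcasts_pipeline",                 # assumed DAG id
    schedule_interval="@daily",                 # run once a day
    start_date=pendulum.datetime(2023, 1, 1),   # assumed start date
    catchup=False,                              # do not backfill missed runs
)
def podcasts_pipeline():
    @task()
    def download_parse_extract():
        # Download the Marketplace podcast metadata (feed URL assumed here),
        # parse the XML with xmltodict, and return the list of episodes.
        response = requests.get("https://www.marketplace.org/feed/podcast/marketplace/")
        feed = xmltodict.parse(response.text)
        episodes = feed["rss"]["channel"]["item"]
        print(f"Found {len(episodes)} episodes.")
        return episodes

    download_parse_extract()


podcasts_pipeline()
```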

pipeline with database: This code adds a new task to the Airflow pipeline using the SqliteOperator to create a table in a SQLite database. The create_table task uses the sqlite_default connection ID to connect to the SQLite database and runs an SQL command that creates a table named episodes with the specified fields if it does not already exist. (Screenshot: pipeline with db)
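A sketch of the table-creation task, assuming it is defined inside the DAG function above; the column names are illustrative, while the table name, operator, and sqlite_default connection ID come from the description:

```python
from airflow.providers.sqlite.operators.sqlite import SqliteOperator

# Create the episodes table if it does not already exist.
create_table = SqliteOperator(
    task_id="create_table",
    sqlite_conn_id="sqlite_default",
    sql="""
        CREATE TABLE IF NOT EXISTS episodes (
            link TEXT PRIMARY KEY,
            title TEXT,
            filename TEXT,
            published TEXT,
            description TEXT
        );
    """,
)
```

Running this task requires the apache-airflow-providers-sqlite package and a sqlite_default connection pointing at the database file.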

storing data to db: This code adds a new task to the Airflow pipeline using the @task decorator to store the episode metadata in the SQLite database. The load_to_db task uses the SqliteHook to connect to the SQLite database and query which episodes are already stored. It then loops through the list of episodes and inserts new rows into the episodes table for any episodes that are not yet stored. (Screenshot: storing data to db)
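A sketch of the load step under the same assumptions; the episode field names (link, title, pubDate, description) follow a typical podcast RSS feed parsed by xmltodict and are not taken from the repository:

```python
from airflow.decorators import task
from airflow.providers.sqlite.hooks.sqlite import SqliteHook


@task()
def load_to_db(episodes):
    hook = SqliteHook(sqlite_conn_id="sqlite_default")

    # Which episode links are already stored?
    stored = {row[0] for row in hook.get_records("SELECT link FROM episodes;")}

    # Collect rows for episodes that are not in the table yet.
    new_rows = []
    for episode in episodes:
        if episode["link"] not in stored:
            filename = episode["link"].split("/")[-1] + ".mp3"
            new_rows.append([
                episode["link"],
                episode["title"],
                filename,
                episode["pubDate"],
                episode["description"],
            ])

    # Insert only the new episodes.
    hook.insert_rows(
        table="episodes",
        rows=new_rows,
        target_fields=["link", "title", "filename", "published", "description"],
    )
```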

checking sqlite database: This code creates a new DAG in Airflow named check_sqlite_database with a single task that checks the contents of the SQLite database and logs the results. The check_database function uses the SqliteHook to connect to the SQLite database, retrieves all rows from the episodes table, then loops through the rows and prints each one. (Screenshot: checking sqlite database)
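A sketch of this standalone DAG; the start date and manual schedule are assumptions, while the DAG name, task logic, and hook come from the description:

```python
import pendulum
from airflow.decorators import dag, task
from airflow.providers.sqlite.hooks.sqlite import SqliteHook


@dag(
    dag_id="check_sqlite_database",
    schedule_interval=None,                     # trigger manually from the UI
    start_date=pendulum.datetime(2023, 1, 1),   # assumed start date
    catchup=False,
)
def check_sqlite_database():
    @task()
    def check_database():
        hook = SqliteHook(sqlite_conn_id="sqlite_default")
        rows = hook.get_records("SELECT * FROM episodes;")
        # Log every stored episode row.
        for row in rows:
            print(row)

    check_database()


check_sqlite_database()
```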

downloading podcast episodes: This code adds a new task to the Airflow pipeline using the @task decorator to download the actual podcast episodes. The download_episodes task loops through the list of episodes and, for each one, builds a filename and filepath for the corresponding audio file. If the audio file does not already exist, it uses the requests library to download it from the episode's URL and saves it to that filepath. (Screenshots: downloading episodes, downloaded episodes)
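A sketch of the download step; the episodes folder and the enclosure/@url field are assumptions based on a typical podcast RSS feed, not taken from the repository:

```python
import os

import requests
from airflow.decorators import task


@task()
def download_episodes(episodes):
    os.makedirs("episodes", exist_ok=True)  # assumed local download folder
    for episode in episodes:
        # Build a filename and local filepath for the audio file.
        filename = episode["link"].split("/")[-1] + ".mp3"
        filepath = os.path.join("episodes", filename)

        if not os.path.exists(filepath):
            # The audio URL typically lives in the feed's <enclosure> tag.
            audio_url = episode["enclosure"]["@url"]
            print(f"Downloading {filename}")
            audio = requests.get(audio_url)
            with open(filepath, "wb") as f:
                f.write(audio.content)
```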

NEXT STEPS

You can customize this project in the following ways:

  • Schedule the project to run daily, so you'll have a new episode when it goes live.

  • Parallelize tasks and run them in the cloud.

  • Add speech recognition and episode summaries using Airflow.

CONTRIBUTING

Contributions are welcome! If you have any ideas, improvements, or bug fixes, please open an issue or submit a pull request.

LICENSE

This project is licensed under the MIT License.
