
Udacity - Data Engineer Nanodegree

In this course, students learn how to design data models, build data warehouses and data lakes, automate data pipelines, and work with massive datasets. At the end of the program, students must combine these new skills by completing a capstone project.

Skills Developed:

  • Dimensional Modeling of databases
  • SQL and NoSQL data modeling
  • ETL Techniques and strategies
  • Data Flows
  • Python and SQL Programming
  • Creation and Automation of Data Pipelines

Technologies used in this nanodegree:

  • PostgreSQL
  • Apache Cassandra
  • Amazon Web Services (IAM, EC2, Redshift, S3, ElasticMapReduce, Athena...)
  • Apache Spark using PySpark
  • Airflow

In the sections below I briefly describe each technology and project I developed during the course.

Section 1 - Data modeling using PostgreSQL and Apache Cassandra

In this section of the Data Engineering Nanodegree, students practice the following concepts from the lessons:

  • Data modeling
  • Database Schemas (snowflake/star)
  • Creation of ETL pipelines
  • Database CRUD

This section has two hands-on projects that exercise database modeling, SQL, and Python programming.
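Both projects follow the extract-transform-load (ETL) pattern. As a minimal sketch of that pattern, assuming a hypothetical JSON log line similar in spirit to the app logs used in the projects (field names here are illustrative):

```python
import json

# Hypothetical raw log line from a music streaming app
raw_log = '{"user_id": "7", "song": "Imagine", "duration": "183.2"}'

def extract(line):
    """Extract: parse one raw JSON log line into a dict."""
    return json.loads(line)

def transform(record):
    """Transform: cast string fields to proper types."""
    return {
        "user_id": int(record["user_id"]),
        "song": record["song"],
        "duration_sec": float(record["duration"]),
    }

def load(record, table):
    """Load: append the cleaned record to an in-memory 'table'."""
    table.append(record)

songplays = []
load(transform(extract(raw_log)), songplays)
```

In the actual projects the load step writes to PostgreSQL or Cassandra rather than an in-memory list, but the three-stage structure is the same.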

The first project involves designing a PostgreSQL database to help a fictional startup called Sparkify analyze data from its product, a music streaming app. For more information about the project, click here.
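A relational design like this typically uses a star schema: a central fact table of events joined to dimension tables. The sketch below illustrates the idea with SQLite (for a self-contained example) and illustrative table names, not the project's exact schema:

```python
import sqlite3

# Star schema sketch: one fact table (songplays) referencing dimensions.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE songs (song_id INTEGER PRIMARY KEY, title TEXT)")
cur.execute("""
    CREATE TABLE songplays (
        songplay_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        song_id INTEGER REFERENCES songs(song_id),
        ts INTEGER
    )
""")

cur.execute("INSERT INTO users VALUES (1, 'Alice')")
cur.execute("INSERT INTO songs VALUES (10, 'Imagine')")
cur.execute("INSERT INTO songplays VALUES (100, 1, 10, 1541106106)")

# Analytical query: join the fact table to its dimensions.
cur.execute("""
    SELECT u.name, s.title
    FROM songplays sp
    JOIN users u ON sp.user_id = u.user_id
    JOIN songs s ON sp.song_id = s.song_id
""")
row = cur.fetchone()
```

The payoff of the star layout is that analytical questions ("who played what?") become simple joins from one fact table out to its dimensions.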

The second project takes a different approach: students model the app's database using Apache Cassandra, a NoSQL database.
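Cassandra modeling is query-first: you design one denormalized table per query, choosing a partition key so each query reads a single partition. A plain-Python sketch of that idea, where a dict key stands in for the partition key (table and field names here are illustrative, not the project's exact schema):

```python
# Raw events, one per song play in a session
events = [
    {"session_id": 338, "item": 4, "artist": "Faithless", "song": "Music Matters"},
    {"session_id": 338, "item": 5, "artist": "Kaiser Chiefs", "song": "Ruby"},
]

# "Table" built specifically for the query:
# look up song info by (session_id, item_in_session).
song_by_session = {}
for e in events:
    song_by_session[(e["session_id"], e["item"])] = (e["artist"], e["song"])

result = song_by_session[(338, 4)]
```

Unlike the relational design, there are no joins: if a second query pattern is needed, the same data is duplicated into a second table keyed for that query.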

Section 2 - Cloud Data Warehouses with AWS

Description

Section 3 - Data Lakes with Spark

Description

Section 4 - Data Pipelines with Airflow

Description

Section 5 - Capstone Project