
Modern Data Platform PoC

A proof of concept for the core of a Modern Data Platform that uses DataOps, Kubernetes, and the Cloud-Native ecosystem to build a resilient Big Data platform based on the Data Lakehouse architecture, which serves as the foundation for Machine Learning (MLOps) and Artificial Intelligence (AIOps).

Note

This project is part of my Master of Science in Data Engineering at Edinburgh Napier University (April 2023).

Contents

  • Architecture
  • Deployment
  • Benchmarking

Architecture

Core Components

The core components of the platform are:

  • Infrastructure (Kubernetes)
  • Data Ingestion (Argo Workflows + Python)
  • Data Storage (MinIO)
  • Data Processing (Dremio)

Initial Model

To visualise the interactions of the current implementation, the C4 software architecture model (Context, Containers, Components, and Code) is used.

The following is a simplified view of the initial architecture model (all abstractions combined).

Modern Data Platform Initial Architecture Model

Deployment

Prerequisites: asdf, Linux operating system, and Docker Engine (tested with asdf 0.11.1, Ubuntu 20.04.5 LTS, and Docker Engine Community 23.0.1).

The following tools are used in the development:

  • Helm
  • KinD
  • Kubectl
  • Kustomize

They can be installed at their pinned versions via asdf:

asdf install
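asdf resolves the versions from the repository's `.tool-versions` file. The version numbers below are illustrative placeholders, not the project's actual pins:

```
helm 3.11.1
kind 0.17.0
kubectl 1.26.1
kustomize 5.0.0
```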

Create the local Kubernetes cluster:

kind create cluster \
  --config clusters/local/kind-cluster-config.yaml
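The referenced `clusters/local/kind-cluster-config.yaml` is not reproduced here; as a rough sketch, a KinD config that maps the ingress controller's HTTP/HTTPS ports to the host might look like this (port mappings are an assumption, not the project's actual config):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    # Expose the ingress controller on the host machine.
    extraPortMappings:
      - containerPort: 80
        hostPort: 80
      - containerPort: 443
        hostPort: 443
```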

Deploy the applications to the Kubernetes cluster:

kustomize build --enable-helm clusters/local | kubectl apply -f -
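The `--enable-helm` flag lets Kustomize inflate Helm charts declaratively through the `helmCharts` field. A minimal hypothetical `kustomization.yaml` for one of the components might look like this (chart repo, version, and values file are illustrative):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
helmCharts:
  - name: minio
    repo: https://charts.min.io/
    releaseName: minio
    namespace: minio
    version: 5.0.0
    valuesFile: values.yaml
```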

Wait for deployments to be ready:

# Ingress-Nginx.
kubectl rollout status deployment \
  --watch --namespace ingress-nginx ingress-nginx-controller

# MinIO.
kubectl rollout status deployment \
  --watch --namespace minio minio

# Argo Workflows.
kubectl rollout status deployment \
  --watch --namespace argo-workflows argo-workflows-server

# Dremio.
kubectl rollout status statefulset \
  --watch --namespace dremio dremio-master

Apply the data pipeline:

kubectl apply --namespace argo-workflows --filename \
  pipelines/ingestion/argo-workflow-covid19-subnational-data.yaml
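The pipeline manifest itself is not reproduced here; as an illustration only, a minimal Argo Workflow that runs a single Python ingestion step could be shaped like this (image, names, and the inline script are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: covid19-subnational-ingestion-
spec:
  entrypoint: ingest
  templates:
    - name: ingest
      container:
        image: python:3.11-slim
        command: [python, -c]
        # Placeholder for the actual fetch-and-store logic.
        args: ["print('fetch source data and upload to MinIO')"]
```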

Benchmarking

The TPC-DS test suite was used to assess the platform's performance.

For the complete results, see the benchmarking section of the project's Jupyter Notebook.
