
Processing petabytes in Python with Argo Workflows & Dask


Goal of this repository

  • Showcase the combination of Dask and Argo Workflows to dynamically scale a computational workload
  • Provide a basic Argo Workflows installation suitable for a production-grade Kubernetes cluster
    • The set-up has been tested on AWS EKS, and would likely work for similar Kubernetes providers
    • The set-up will almost certainly NOT work for a local Kubernetes installation, such as those provided by Docker Desktop or k3s
  • Package a Dask data pipeline into a Docker container
  • Create an Argo Workflows WorkflowTemplate and the related resources required to scale out the Dask pipeline in Kubernetes

The talk

Processing petabytes in Python with Argo Workflows & Dask

The pipeline

This project includes a Dask data pipeline that showcases a simple use of the Dask Futures interface; a minimal sketch follows the list below. The pipeline will:

  • Connect to a pre-existing Dask Scheduler
  • Consider a set of timeseries weather data for major cities in Spain
  • Submit a data-processing task for each timestamp to the available Dask Workers; each task accepts a single timestamp argument and:
    • Extracts the windspeed data at that timestamp for each city
    • Identifies the city with the highest windspeed
    • Returns that city's name
  • Count the observations where each city had the fastest windspeed
  • Report the city that is most often the windiest
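
The sketch below illustrates this shape of pipeline with the Dask Futures interface. It is not the repository's actual code: the scheduler address being read from a DASK_SCHEDULER_ADDRESS environment variable, the city list, the timestamp range, and the load_city_data helper (which generates random windspeeds here so the example is self-contained) are all assumptions for illustration.

```python
# Minimal sketch of the Futures-based pipeline described above.
# Hypothetical assumptions: the scheduler address comes from the
# DASK_SCHEDULER_ADDRESS environment variable, and load_city_data
# stands in for the repository's real data loading.
import os
from collections import Counter

import numpy as np
import pandas as pd
from dask.distributed import Client

CITIES = ["Madrid", "Barcelona", "Valencia", "Seville", "Bilbao"]
TIMESTAMPS = pd.date_range("2021-01-01", "2021-01-31", freq="h")


def load_city_data(city: str) -> pd.DataFrame:
    """Hypothetical loader: a timestamp-indexed DataFrame with a
    'windspeed' column for one city (random data for illustration)."""
    rng = np.random.default_rng(sum(map(ord, city)))  # deterministic per city
    return pd.DataFrame(
        {"windspeed": rng.uniform(0, 30, len(TIMESTAMPS))}, index=TIMESTAMPS
    )


def windiest_city(timestamp) -> str:
    # Extract each city's windspeed at this timestamp and return the
    # name of the city with the highest value. A real pipeline would
    # load the data once per worker rather than once per task.
    speeds = {c: load_city_data(c).loc[timestamp, "windspeed"] for c in CITIES}
    return max(speeds, key=speeds.get)


if __name__ == "__main__":
    # Connect to the pre-existing Dask Scheduler.
    client = Client(os.environ["DASK_SCHEDULER_ADDRESS"])

    # Submit one task per timestamp; each future resolves to a city name.
    futures = [client.submit(windiest_city, ts) for ts in TIMESTAMPS]

    # Count how often each city had the fastest windspeed, then report
    # the city that is most often the windiest.
    counts = Counter(client.gather(futures))
    city, n = counts.most_common(1)[0]
    print(f"Windiest city: {city} ({n} of {len(TIMESTAMPS)} observations)")
```

With DASK_SCHEDULER_ADDRESS pointing at the pre-existing scheduler, the work fans out across however many Dask Workers are available, which is what lets Argo Workflows scale the pipeline by scaling the workers.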

About Pipekit

Pipekit is the control plane for Argo Workflows. Platform teams use Pipekit to manage data & CI pipelines at scale, while giving developers self-serve access to Argo. Pipekit's unified logging view, enterprise-grade RBAC, and multi-cluster management capabilities lower maintenance costs for platform teams while delivering a superior developer experience for Argo users. Sign up for a 30-day free trial at pipekit.io/signup.

Learn more about Pipekit's professional support for companies already using Argo at pipekit.io/services.