Robust Data Transformation with Pandas: Typing, Validation, Testing

Materials for Robust Pandas Tutorial, EuroPython, Prague, 2023.

Abstract

We will explore ways to make our data analyses and transformations in Pandas robust and production-ready. We will see how advanced group-by, resample, and rolling aggregations work on large time series weather data. (As a bonus, you will learn about the Prague climate.) We will use type annotations and schema validation with the Pandera library to make our code more readable and robust. We will also show the potential of property-based testing using the Hypothesis package, with strategies generated from Pandera schemas. We will show how to avoid issues with time zones when working with time series data. By the end of the tutorial, you will have a deeper understanding of advanced Pandas aggregations and be able to write robust, production-ready Pandas code.

Data sources

Two data sources are used in this workshop:

Preparation

Please prepare a Python environment that you can use during the workshop. We will work in Jupyter Notebook as well as in an editor or IDE of your choice. We recommend Visual Studio Code or PyCharm.

Note: All the instructions below are for Unix-like systems (Linux, macOS, WSL on Windows). If you want or need to work in native Windows cmd or PowerShell, you will need to adapt the commands accordingly. We cannot provide support for native Windows environments.

Clone this repository

git clone https://github.com/coobas/robust-pandas-workshop.git

or, using the GitHub CLI (gh):

gh repo clone coobas/robust-pandas-workshop

Prepare Python Environment

We have included both requirements.txt and environment.yml files so that you can create a Python environment using pip or conda, respectively.

Python version 3.10+ is required.
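If you are unsure which interpreter your shell picks up, a quick check along these lines can confirm that it meets the 3.10+ requirement. This is only a sketch; the helper name `python_is_supported` is made up for illustration and is not part of the workshop code.

```python
import sys

# Minimum version stated in this README.
MIN_VERSION = (3, 10)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given (major, minor, ...) version is at least MIN_VERSION."""
    return tuple(version_info[:2]) >= MIN_VERSION


# Report on the interpreter that is running this snippet.
print(
    f"Python {sys.version_info.major}.{sys.version_info.minor} supported:",
    python_is_supported(),
)
```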

First, cd into the repository directory:

cd robust-pandas-workshop

Pip installation

python -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
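To double-check that the virtual environment is the one actually in use, you can compare the interpreter prefixes from within Python. This is a minimal sketch; `in_virtualenv` is a hypothetical helper, and the check applies to environments created with venv (it does not detect conda environments).

```python
import sys


def in_virtualenv() -> bool:
    """Return True when running inside an environment created by venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Running inside a virtual environment:", in_virtualenv())
```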

Conda installation

conda env create -f environment.yml -n "robust-pandas-workshop"
conda activate robust-pandas-workshop

The `weatherlyser` package

Code for this workshop is in the weatherlyser package in this repository. Before you work with it, whether in Jupyter notebooks, in your IDE, or when running tests, Python needs to be able to find it.

Either set the PYTHONPATH environment variable to the repository directory:

export PYTHONPATH=$PWD

(this assumes that your current directory is the repository root)

or, more robustly, install the package in editable mode:

pip install -e .
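Whichever option you choose, you can verify from Python that the package can be found on the import path. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `is_importable` is an assumption made for this example.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module or package can be found on the import path."""
    # find_spec returns None when no such top-level module exists;
    # it locates the module without importing it.
    return importlib.util.find_spec(module_name) is not None


# Expected to print True once PYTHONPATH is set or the package is installed.
print("weatherlyser importable:", is_importable("weatherlyser"))
```

If this prints False, check that you ran the command from the repository root and in the same environment that your notebook or IDE uses.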

Online option

Clone in Deepnote

Follow the instructions there and, if you do not have one, create a free Deepnote account.

Workshop materials

All materials that we will use during the workshop are in Jupyter notebooks.

Visual Studio Code or PyCharm Professional users can work with notebooks directly in their IDE; this is the recommended way. You can also use JupyterLab, which will be installed in your environment and offers an IDE-like environment with an editor and a command line.

Testing and linting

The tests directory contains tests for the weatherlyser package. We will use the tests throughout the workshop to test our code. It is also a good idea to run the tests to check whether your installation is working correctly.

To run tests, use pytest:

pytest

The mypy static type checker is configured to check the weatherlyser and tests folders. You can run it with:

mypy
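As a rough illustration of what the two tools look at, the sketch below pairs a type-annotated function with a pytest-style test. Both names are hypothetical and do not come from the weatherlyser package: mypy would check the annotations, and pytest would collect and run the `test_` function.

```python
def temperature_range(temperatures: list[float]) -> float:
    """Return the difference between the highest and lowest temperature."""
    if not temperatures:
        raise ValueError("temperatures must not be empty")
    return max(temperatures) - min(temperatures)


def test_temperature_range() -> None:
    # pytest discovers functions named test_* and reports a failed assert as a test failure.
    assert temperature_range([1.0, 5.5, 3.0]) == 4.5
    assert temperature_range([-2.0]) == 0.0
```

A call such as `temperature_range("warm")` would be flagged by mypy, because a `str` is not a `list[float]`.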
