nsphung/pyspark-template

PySpark Template Project

made-with-python python-3.10 Code style: black Imports: isort Checked with mypy made-with-Markdown

People have asked me several times how to set up a good, clean code organization for a Python project with PySpark. I didn't find a fully featured project, so this is my attempt at one. It also includes a simple integration with Jupyter Notebook inside the project.

Table of Contents

Inspiration

Development

Prerequisites

All you need is the following configuration already installed:

  • Git
  • The project was tested with Python 3.10.9 managed by pyenv. On Debian/Ubuntu, install the pyenv build dependencies first:
    sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
    libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
    libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
    
  • JAVA_HOME environment variable pointing to a Java 11 JDK
  • SPARK_HOME environment variable pointing to the Spark spark-3.3.1-bin-hadoop3 package
  • PYSPARK_PYTHON environment variable set to "python3.10"
  • PYSPARK_DRIVER_PYTHON environment variable set to "python3.10"
  • Make, to run the Makefile
  • Python 3.10 is used because PySpark 3.3.1 does not seem to work with Python 3.11 at the moment (I haven't tried Python 3.12)
  • Install Python 3.10 with pyenv on Homebrew/Linuxbrew:
CONFIGURE_OPTS="--with-openssl=$(brew --prefix openssl)" pyenv install 3.10
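
Before running the project, it can help to check that the environment variables listed above are actually set. The following is a minimal sketch of such a check; the helper name missing_prerequisites is hypothetical and not part of this repository.

```python
import os
from typing import Mapping

# Environment variables named in the prerequisites list above.
REQUIRED_VARS = (
    "JAVA_HOME",
    "SPARK_HOME",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
)


def missing_prerequisites(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or blank in env.

    Hypothetical helper: it only checks presence, not that the paths
    point to a working JDK or Spark installation.
    """
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All prerequisite environment variables are set.")
```

Passing a plain dictionary instead of os.environ makes the check easy to exercise without touching the real environment.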

Add format, lint code tools

Autolint/Format code with Black in IDE:

  • Auto-format via the IDE: https://github.com/psf/black#pycharmintellij-idea

  • [Optional] You can set up a pre-commit hook to enforce Black formatting before each commit: https://github.com/psf/black#version-control-integration

  • Or remember to run black . to apply Black formatting to all sources before committing

  • Add an integration with Jenkins so that it complains and the tests fail if Black formatting is not applied

  • Use the option to lint/format with black and flake8 on editor save in VS Code

Optional type checking with Mypy (PEP 484)

Configure Mypy to check type annotations (type hints) in Python code. It is very useful in the IDE and for catching errors early.

  • Install the Mypy plugin for IntelliJ
  • Adjust the plugin with the following options:
    "--follow-imports=silent",
    "--show-column-numbers",
    "--ignore-missing-imports",
    "--disallow-untyped-defs",
    "--check-untyped-defs"
    
  • Documentation: Type hints cheat sheet (Python 3)
  • Apply the same Mypy options in VS Code via Preferences: Open User Settings
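
To illustrate what the --disallow-untyped-defs option listed above enforces, here is a small sketch with two hypothetical functions that are not taken from this project. With that option enabled, Mypy reports the first definition because it has no annotations, while the second one passes.

```python
def mean_untyped(values):
    # No annotations: Mypy flags this definition when
    # --disallow-untyped-defs is enabled.
    return sum(values) / len(values)


def mean(values: list[float]) -> float:
    """Fully annotated version, accepted under --disallow-untyped-defs."""
    return sum(values) / len(values)
```

Both functions behave identically at runtime; the annotations only matter to the type checker and the IDE.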

Isort

  • isort configuration for VS Code, in the user settings:

{
    "editor.formatOnSave": true,
    "python.formatting.provider": "black",
    "[python]": {
        "editor.codeActionsOnSave": {
            "source.organizeImports": true
        }
    }
}
  • isort configuration for PyCharm: see "Set isort and black formatting code in pycharm"
  • You can use the make lint command to check the flake8/mypy rules and automatically apply black and isort formatting to the code with the previous configuration
isort .

Fix

  • Show a way to handle a malformed JSON file such as data/pubmed.json
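
One possible approach is sketched below with the standard library only. It assumes the defect is a trailing comma before a closing bracket or brace, which is a common cause of invalid JSON; the actual problem in data/pubmed.json is not described here, and the function name load_lenient_json is hypothetical.

```python
import json
import re
from typing import Any

# A comma followed only by whitespace before a closing ] or }.
_TRAILING_COMMA = re.compile(r",\s*([\]}])")


def load_lenient_json(text: str) -> Any:
    """Parse JSON, retrying once after removing trailing commas.

    Assumption: the only defect is a trailing comma. The fallback is a
    naive textual repair and would also alter a string value that
    happens to contain a comma directly before ] or }.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        repaired = _TRAILING_COMMA.sub(r"\1", text)
        return json.loads(repaired)
```

Valid input is parsed untouched, so the repair step only runs when strict parsing has already failed. If the repaired text is still invalid, the second json.loads call raises json.JSONDecodeError as usual.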

Usage Local

  • Create a Poetry env with Python 3.10
poetry env use 3.10
  • Install dependencies in the Poetry env (virtualenv): make deps
  • Lint & test: make build
  • Lint, test & run: make run
  • Run dev: make dev
  • Build the binary/Python wheel: make dist

Use with poetry

poetry run drugs_gen --help

Usage: drugs_gen [OPTIONS]

Options:
  -d, --drugs TEXT             Path to drugs.csv
  -p, --pubmed TEXT            Path to pubmed.csv
  -c, --clinicals_trials TEXT  Path to clinical_trials.csv
  -o, --output TEXT            Output path to result.json (e.g
                               /path/to/result.json)
  --help                       Show this message and exit.
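
For readers who want to see how such an interface can be declared, here is a minimal stand-in written with argparse from the standard library. It only mirrors the option names shown in the help output above; the real drugs_gen entry point is not shown in this README and may be implemented with a different CLI library, so build_parser and parse_args are hypothetical names.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Declare the four options listed in the help output above."""
    parser = argparse.ArgumentParser(prog="drugs_gen")
    parser.add_argument("-d", "--drugs", help="Path to drugs.csv")
    parser.add_argument("-p", "--pubmed", help="Path to pubmed.csv")
    parser.add_argument(
        "-c", "--clinicals_trials", help="Path to clinical_trials.csv"
    )
    parser.add_argument("-o", "--output", help="Output path to result.json")
    return parser


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    # argv=None makes argparse read sys.argv[1:], as a real CLI would.
    return build_parser().parse_args(argv)
```

Options that are not passed are left as None in the returned namespace, so the caller decides which defaults to apply.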

Usage in distributed-mode depending on your cluster manager type

  • Use spark-submit with the Python wheel file built by make dist in the dist folder.