
Data Job


Overview

VDK is the Versatile Data Kit SDK.

It provides standard functionality for data ingestion and processing, and a CLI for managing the lifecycle of a Data Job.

A Data Job is a data processing unit that allows data engineers to implement automated pull ingestion (the E in ELT) or batch data transformation into a Data Warehouse (the T in ELT). At its core, it is a directory with different scripts and configuration files inside of it.

Data Job Types

There can be different types of Data Jobs depending on what they do:

  • Ingestion jobs - jobs that add data to a database
  • Processing jobs - jobs that process data from one database and populate another database
  • Publishing jobs - jobs that update and publish end-user-facing reports or export data for consumption by user-facing apps

You can read more about the differences in data between those job types on the Business Intelligence journey page.

A typical Ingestion job:

  • Reads from some API or database
  • Does NOT do any transformations on the data (besides formatting the payload so that it is accepted by the target, e.g., JSON serialization)
  • Pushes the data to some ingestion target, as sketched below
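
A minimal sketch of such an ingestion step; the REST endpoint and the destination table name are made-up placeholders, while send_object_for_ingestion is the VDK ingestion API:

import requests

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # Hypothetical source API; replace with the real endpoint.
    response = requests.get("https://api.example.com/v1/users")
    response.raise_for_status()

    # Push each record as-is to the configured ingestion target;
    # no transformation beyond what the target needs to accept the payload.
    for record in response.json():
        job_input.send_object_for_ingestion(
            payload=record,
            destination_table="users",
        )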

A typical Processing job:

  • Creates a materialised view
  • Data comes from a source database
  • Data goes to a target database
  • Data in the target database is in a star schema
  • Schema is populated using standard fact/dimension loading strategies (relevant ones are implemented in the platform, so it is a one-liner in terms of Data Job code, as sketched below)
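
As a sketch of that one-liner, using VDK's execute_template API; the template name and arguments below are illustrative, since the available templates depend on the database plugin installed with VDK:

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # Illustrative template name and arguments; actual values depend
    # on the loading templates shipped with your VDK database plugin.
    job_input.execute_template(
        template_name="scd1",
        template_args={
            "source_schema": "staging",
            "source_view": "vw_users",
            "target_schema": "warehouse",
            "target_table": "dim_users",
        },
    )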

A typical Publishing job:

  • Pulls data into memory and creates a (Tableau) extract, or publishes a SQL query/view to a reporting system
  • Caches data in a caching service so that it can be used, for example, by a user-facing UI (see the sketch below)
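
A rough sketch of a publishing step that caches query results for a UI; the summary table and the caching endpoint are made-up placeholders for whatever system you publish to:

import requests

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # Pull the prepared data into memory.
    rows = job_input.execute_query("SELECT region, total_sales FROM sales_summary")

    # Push the result to a caching service so a user-facing UI can read it.
    # The URL below is a placeholder for your actual caching service.
    requests.post(
        "https://cache.example.com/api/datasets/sales-summary",
        json=[{"region": region, "total_sales": total} for region, total in rows],
    ).raise_for_status()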

Data Job Structure

example/
├── 10_ensure_example_table_exists.sql (optional)
├── 20_add_data_to_example_table.py (optional)
├── 30_sql_step.sql (optional)
├── config.ini
├── requirements.txt (optional)
└── example.keytab (optional)

Data Job Steps

A Data Job consists of steps. A step is a single unit of work for a Data Job. Which data job scripts or files are considered steps and executed by vdk is customizable.

By default, there are two types of steps:

  • SQL steps (SQL files)
  • Python steps (Python files implementing a run(job_input) method)

Data Job steps exist as .sql and/or .py files in a directory along with the job configuration. Scripts are executed in alphanumeric order, e.g., if a Data Job consists of file1.sql and file2.py, then file1.sql will be executed first, followed by file2.py.

See the example (figure: data job step sequence):

The steps will be executed in the order of the respective file names: 10_drop_table.sql, 20_create_table.sql, and 30_ingest_to_table.py.

SQL steps (.sql)

SQL steps are standard SQL scripts. They are executed against the configured database (run vdk config-help to see how to configure it).

Common uses of SQL steps are:

  • aggregating data from other tables to a new one
  • creating a table or a view that is needed for the Python steps

Queries in .sql files can be parametrised. A valid query parameter looks like {parameter}. Parameters will be automatically replaced if a corresponding value exists in the Data Job properties or (if used) in the Data Job arguments.
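
For example, assuming a Data Job property named target_table has been set (the property name here is illustrative), the sketch below shows the substitution from Python; a .sql step file containing the same query text behaves identically:

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # Assumes a property "target_table" was set earlier, e.g. with
    # job_input.set_all_properties({"target_table": "example_table"}).
    # {target_table} is replaced with the property value before execution.
    rows = job_input.execute_query("SELECT COUNT(*) FROM {target_table}")
    print(rows)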

Python steps (.py)

This is the structure of a Python step:

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # Entry point: VDK calls run() and passes the job_input object.
    job_input.do_something()

See the documentation of the job_input object methods for details about the capabilities.

Only scripts that implement a run(job_input) method will be executed.
VDK will not execute any other Python files; they can be used to store common code (libraries) to be reused in job steps.
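
For instance, a helper module next to the steps (file names below are illustrative) will be skipped by VDK but can be imported from a step:

# common_utils.py - has no run() method, so VDK does not execute it as a step.
def normalize_name(name: str) -> str:
    return name.strip().lower()

# 40_python_step.py - a regular step that reuses the shared code.
from common_utils import normalize_name

from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    print(normalize_name("  Example  "))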

Note: Query parameter substitution works the same way for queries executed from Python via the job_input.execute_query() method as it does for SQL scripts.

Create Your First Data Job

To create your first Data Job you need to:

  1. Install Quickstart VDK
  2. Execute the vdk create command
  3. Follow the Create First Data Job page

Data Job Execution

An instance of a running Data Job deployment is called an execution.

To execute your Data Job you need to:

  1. Execute the vdk run command
  2. Follow the output of the run

Local executions always comprise a single attempt.

➡️ Next section: Ingestion
