
Install VDK Control Service with custom SDK

a_git_a edited this page Feb 26, 2024 · 26 revisions

Overview

In this tutorial, we will install the Versatile Data Kit (VDK) Control Service using a custom-built SDK.

This SDK will be used automatically by all Data Jobs deployed to that Control Service. Any change to the SDK is applied automatically to all deployed data jobs, starting from their next run.

Prerequisites

These are the minimum prerequisites for installing the VDK Control Service with a custom SDK.

More details follow below, together with one example of how each can be set up.

1. Git and Docker repository.

This tutorial assumes GitHub will be used. GitHub provides both a Docker (container) registry and a git repository. Any other Docker registry and git repository would work.

Go to https://github.com/new and create a repository. For this example, we have created "github.com/tozka/demo-vdk.git".

1.2. Generate GitHub Token.

You will need this GitHub token later. Make sure to save it in a known place.

Make sure you grant it permissions for both repo and packages (as we will use it for both the git repository and the Docker registry).

See example:

github-token

2. Python (PyPi) repository

This is where we will release (upload) our custom SDK. For proof-of-concept purposes we will use https://test.pypi.org. You will need an account there to upload the SDK later.

3. Kubernetes and Helm

We need Kubernetes to run the Control Service and Helm to install it.

In production, you may want to use a cloud provider offering such as GKE, TKG, EKS or another three-letter abbreviation.

In this example, though, we will use kind and set things up locally.

  • First, install kind
  • Create a demo cluster using:
kind create cluster --name demo
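
Before continuing, it can help to confirm that the command-line tools this tutorial relies on are available on the PATH. The following is a minimal Python sketch, not part of VDK: the tool list is an assumption based on the commands used in this tutorial, and the helper name `missing_tools` is hypothetical.

```python
import shutil

# Tools invoked in this tutorial (assumed list; adjust it to your setup).
REQUIRED_TOOLS = ["docker", "kind", "kubectl", "helm", "twine"]


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

The check only verifies that an executable with that name exists; it does not validate versions or that the kind cluster is running.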

Optional integrations

VDK comes with optional integrations with third-party systems that provide more value and can be enabled through configuration only.

These will not be covered in this tutorial. Start a new discussion or contact us on Slack to learn how to integrate them, since the options are not as clearly documented as we'd like.

1. External Logging

All job logs can be forwarded to a centralized logging system.

Prerequisites: SysLog or Fluentd

2. Notifications

SMTP server for mail notifications. It is configured in both the SDK and the Control Service.

Prerequisites: SMTP Server

3. Integration with a monitoring system (e.g. Prometheus).

See the list of supported metrics. See more in the monitoring configuration.

Prerequisites: Prometheus or Wavefront or similar

4. Advanced Alerting rules

You can define more advanced monitoring rules. The Helm chart comes with prepared PrometheusRules (e.g. Job Delay alerting) that can be used with AlertManager and Prometheus.

Prerequisites: The out-of-the-box rules require AlertManager

5. SSO Support

The Control Service supports OAuth2-based authorization of all operations, which makes it easy to integrate with a company SSO. Authorization using claims is also supported.

See more in the security section of the Control Service Helm chart.

Prerequisites: OAuth2

6. Access Control Webhooks

Access Control Webhooks make it possible to create more complex rules for who is allowed to perform which operations in the Control Service (for cases where OAuth2 is not enough).

Prerequisites: Webhook endpoint

Install Versatile Data Kit with custom SDK

Here we will install the Versatile Data Kit.

First, we will create our custom SDK. This is a very simple process. If you are familiar with python packaging using setuptools, you will find these steps trivial.

1. Create custom VDK

custom-sdk-process

NOTE: You can skip this step if you do not want to create a custom SDK. Quickstart VDK is one such custom SDK that can be used to get started quickly.

1. Create a directory for our SDK

mkdir my-org-vdk
cd my-org-vdk

Note that you should change the my-org-vdk name to something appropriate to your organisation.

2. Create and edit setup.py

Open setup.py in your favorite IDE.

We want to create an SDK that will support:

  • Database queries to both Postgres and Snowflake
  • Ingesting data into Postgres and Snowflake, over HTTP, and into files
  • Control Service operations, such as deploying data jobs

In install_requires we specify the plugins we need to achieve that:

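import setuptools

setuptools.setup(
    name="my-org-vdk",
    version="1.0",
    install_requires=[
        "vdk-core",
        # Control Service operations (e.g. deploying data jobs)
        "vdk-plugin-control-cli",
        # Queries and ingestion for Postgres and Snowflake
        "vdk-postgres",
        "vdk-snowflake",
        # Ingestion over HTTP and into files
        "vdk-ingest-http",
        "vdk-ingest-file",
    ],
)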

Note that you should change the package name to something appropriate to your organisation, and amend subsequent commands to refer to that name instead of my-org-vdk.

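3. Upload our SDK distribution to a PyPI repository

In order for our Python SDK to be installable and usable, we need to release it.

  • First, we build and package it:
python setup.py sdist --formats=gztar
  • Then we upload it to test.pypi.org. Set PIP_REPO_UPLOAD_USER_NAME and PIP_REPO_UPLOAD_USER_PASSWORD to the credentials of the account from step 2 of the Prerequisites section. The exact archive name can vary with the setuptools version (for example my_org_vdk-1.0.tar.gz), so check the dist directory before uploading.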
twine upload --repository-url https://test.pypi.org/legacy/ -u "$PIP_REPO_UPLOAD_USER_NAME" -p "$PIP_REPO_UPLOAD_USER_PASSWORD" dist/my-org-vdk-1.0.tar.gz
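
As an alternative to passing credentials with -u and -p, twine also reads them from the TWINE_USERNAME and TWINE_PASSWORD environment variables, which keeps the password out of the command line. The sketch below shows one way to prepare such an invocation from Python; the helper name `twine_upload_invocation` is hypothetical, and it assumes the two PIP_REPO_UPLOAD_* variables from this tutorial are exported.

```python
def twine_upload_invocation(dist_file, environ):
    """Build the argv and environment for running `twine upload`.

    Credentials are handed over through TWINE_USERNAME and TWINE_PASSWORD
    instead of the -u/-p flags, so they do not appear in the argument list.
    """
    child_env = dict(environ)
    # A KeyError here means a variable from the tutorial was not exported.
    child_env["TWINE_USERNAME"] = environ["PIP_REPO_UPLOAD_USER_NAME"]
    child_env["TWINE_PASSWORD"] = environ["PIP_REPO_UPLOAD_USER_PASSWORD"]
    argv = [
        "twine", "upload",
        "--repository-url", "https://test.pypi.org/legacy/",
        dist_file,
    ]
    return argv, child_env


# Example usage (not executed here, since it would perform a real upload):
#   import os, subprocess
#   argv, env = twine_upload_invocation("dist/my-org-vdk-1.0.tar.gz", os.environ)
#   subprocess.run(argv, env=env, check=True)
```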

2. Create SDK Docker image

We need to create a simple Docker image with our SDK installed, which will be used by all jobs managed by the VDK Control Service.

1. Create Dockerfile with our SDK installed

Create an empty file named Dockerfile-vdk-base and open it with a text editor or IDE. The content of the Dockerfile is simply this:

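FROM python:3.7-slim

WORKDIR /vdk

# Declare the build argument; without it, $vdk_version expands to an empty string.
# The default of 1.0 is an assumption matching the version in setup.py.
ARG vdk_version=1.0
ENV VDK_VERSION=$vdk_version

# Install VDK
RUN pip install --extra-index-url https://test.pypi.org/simple my-org-vdk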

As you can see it's pretty basic. We just want to install VDK.

2. Build and publish the Docker image

First, we need to log in to the GitHub Container Registry. Export the following environment variable:

export CR_PAT=*Github Personal Access Token*

and replace *Github Personal Access Token* with the token you created earlier.

Then, run the following command, replacing USERNAME with your GitHub account name:

echo $CR_PAT | docker login ghcr.io -u USERNAME --password-stdin

Next, build the image. Make sure to tag it with both the version of the SDK and the tag "release".

For example (replace tozka with your own GitHub account from the prerequisites):

docker build -t ghcr.io/tozka/my-org-vdk:1.0 -t ghcr.io/tozka/my-org-vdk:release -f Dockerfile-vdk-base .

docker push ghcr.io/tozka/my-org-vdk:release
docker push ghcr.io/tozka/my-org-vdk:1.0
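
When the SDK is released repeatedly, the build and push commands above can be generated from the version number so that both tags always stay in sync. This is a minimal sketch under that assumption; the function names are hypothetical, and the account `tozka` is just the example account used in this tutorial.

```python
def image_references(registry, account, repository, tags):
    """Return one full image reference per tag."""
    return [f"{registry}/{account}/{repository}:{tag}" for tag in tags]


def build_and_push_commands(refs, dockerfile="Dockerfile-vdk-base"):
    """Return the docker build command followed by one push per reference."""
    build = ["docker", "build"]
    for ref in refs:
        build += ["-t", ref]
    build += ["-f", dockerfile, "."]
    pushes = [["docker", "push", ref] for ref in refs]
    return [build] + pushes


if __name__ == "__main__":
    refs = image_references("ghcr.io", "tozka", "my-org-vdk", ["1.0", "release"])
    for command in build_and_push_commands(refs):
        # Print only; pass each list to subprocess.run(..., check=True) to execute.
        print(" ".join(command))
```

The "release" tag is the one referenced by the Control Service configuration later in this tutorial, so re-pushing it is presumably how a new SDK version reaches deployed jobs, while the version tag keeps each release addressable.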

3. Install Versatile Data Kit Control Service with Helm.

Here it is time to put everything together.

1. Create and edit new file values.yaml

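Here we will use the GitHub token, account name, and repository created in step 1 of the Prerequisites.

We need to export the following variables:

export GITHUB_ACCOUNT_NAME=*your account name*
export GITHUB_URL=*URL of the repo you created earlier*
export GITHUB_TOKEN=*Github Personal Access Token*

The content of the values.yaml is shown below. Helm does not expand environment variables inside a values file, so replace each ${...} placeholder with the actual value before installing: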


resources:
   limits:
      memory: 0
   requests:
      memory: 0

cockroachdb:
   statefulset:
      resources:
         limits:
            memory: 0
         requests:
            memory: 0  
   init:
      resources:
         limits:
            cpu: 0
            memory: 0
         requests:
            cpu: 0
            memory: 0


deploymentGitUrl: "${GITHUB_URL}"
deploymentGitUsername: "${GITHUB_ACCOUNT_NAME}"
deploymentGitPassword: "${GITHUB_TOKEN}"
uploadGitReadWriteUsername: "${GITHUB_ACCOUNT_NAME}"
uploadGitReadWritePassword: "${GITHUB_TOKEN}"
deploymentDockerRegistryType: generic
deploymentDockerRegistryUsernameReadOnly: "${GITHUB_ACCOUNT_NAME}"
deploymentDockerRegistryPasswordReadOnly: "${GITHUB_TOKEN}"
deploymentDockerRegistryUsername: "${GITHUB_ACCOUNT_NAME}"
deploymentDockerRegistryPassword: "${GITHUB_TOKEN}"
deploymentDockerRepository: "ghcr.io/${GITHUB_ACCOUNT_NAME}/data-jobs/demo-vdk"
proxyRepositoryURL: "ghcr.io/${GITHUB_ACCOUNT_NAME}/data-jobs/demo-vdk"


deploymentVdkDistributionImage:

  registryUsernameReadOnly: "${GITHUB_ACCOUNT_NAME}"
  registryPasswordReadOnly: "${GITHUB_TOKEN}"

  registry: ghcr.io/${GITHUB_ACCOUNT_NAME}
  repository: "my-org-vdk"
  tag: "release"

security:
  enabled: False
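
One way to fill in the ${...} placeholders is a small rendering script, since Helm treats a values file as plain YAML. The sketch below uses Python's `string.Template`, which understands the same `${NAME}` syntax. The file name `values.yaml.template` and the helper name `render_values` are hypothetical; the script assumes the template contains the content above and that the GITHUB_* variables are exported.

```python
import os
import string
from pathlib import Path


def render_values(template_text, variables):
    """Replace ${NAME} placeholders with values from `variables`.

    Raises KeyError if a placeholder has no matching variable, which
    avoids silently installing the chart with empty credentials.
    """
    return string.Template(template_text).substitute(variables)


if __name__ == "__main__":
    template_path = Path("values.yaml.template")  # hypothetical file name
    if template_path.exists():
        rendered = render_values(template_path.read_text(), os.environ)
        Path("values.yaml").write_text(rendered)
        print("Wrote values.yaml")
    else:
        print("No values.yaml.template found; nothing to render.")
```

The rendered values.yaml contains the token in clear text, so it should not be committed to a repository.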

2. Install VDK Helm chart

helm repo add vdk-gitlab https://gitlab.com/api/v4/projects/28814611/packages/helm/stable
helm repo update

helm install my-vdk-runtime vdk-gitlab/pipelines-control-service -f values.yaml

3. Expose Control Service API

In order to access the application from our browser, we need to expose it using the kubectl port-forward command:

kubectl port-forward service/my-vdk-runtime-svc 8092:8092

Note that this command does not return, and you will need to open a new terminal window to proceed.
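
To confirm from the new terminal that the port-forward is active, a plain TCP connection test is enough. This is a minimal sketch; the helper name `is_port_open` is hypothetical, and it only checks that something is listening on the port, not that the Control Service API is healthy.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    print("Port 8092 reachable:", is_port_open("localhost", 8092))
```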

Use

Now let's see how data or analytics engineers in our organization would use it to create, develop and deploy jobs:

Install custom VDK

pip install --extra-index-url https://test.pypi.org/simple/ my-org-vdk
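
After installation, it can be useful to check that the SDK pulled in every plugin listed in install_requires. The sketch below uses the standard `importlib.metadata` module (Python 3.8 or newer); the helper name `installed_versions` is hypothetical, and the distribution names are the ones from the example setup.py.

```python
from importlib import metadata

# Distributions from the example setup.py, plus the SDK itself.
EXPECTED = [
    "my-org-vdk",
    "vdk-core",
    "vdk-plugin-control-cli",
    "vdk-postgres",
    "vdk-snowflake",
    "vdk-ingest-http",
    "vdk-ingest-file",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(EXPECTED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```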

Configure VDK to know about Control Service

export VDK_CONTROL_SERVICE_REST_API_URL=http://localhost:8092

Create a sample data job

This will create a data job and register it in the Control Service. Locally it will create a directory with sample files of a data job:

vdk create --name example --team my-team --path .

Develop the data job

Browse the files in the example directory

Deploy the data job

It's a single "click" (or CLI command). Behind the scenes, VDK will package and install all dependencies, create docker images and container, release and version it, and finally schedule it (if configured) for execution.

vdk deploy --job-path example --reason "reason"

We can see some details about our job

vdk show --name example --team my-team

Note that there is both a VDK version and a job version. They are deployed independently. The VDK version is taken from the Control Service configuration and is managed centrally, while the job version is separate and is controlled by the data engineer developing the job.

Both the VDK version and the job version can be changed if needed with the vdk deploy --update command.
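
The create, deploy and show steps above can also be chained in a script, for example in a CI pipeline. The following is a minimal sketch that only uses the commands and flags shown in this tutorial; the function names are hypothetical, and the default URL assumes the local port-forward from the installation section.

```python
import os
import subprocess


def job_workflow_commands(job_name, team, reason):
    """Return the vdk commands from this tutorial as argument lists."""
    return [
        ["vdk", "create", "--name", job_name, "--team", team, "--path", "."],
        ["vdk", "deploy", "--job-path", job_name, "--reason", reason],
        ["vdk", "show", "--name", job_name, "--team", team],
    ]


def run_workflow(commands, api_url="http://localhost:8092"):
    """Run each command in order, stopping at the first failure."""
    env = dict(os.environ, VDK_CONTROL_SERVICE_REST_API_URL=api_url)
    for argv in commands:
        subprocess.run(argv, env=env, check=True)


# Example usage (requires the SDK to be installed and the Control Service reachable):
#   run_workflow(job_workflow_commands("example", "my-team", "reason"))
```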


➡️ Next Section: Properties and Secrets
