
Research: Getting started and Configuration

Antoni Ivanov edited this page Oct 2, 2023 · 20 revisions

As specified in https://github.com/vmware/versatile-data-kit/tree/main/specs/vep-2420-getting-started-with-my-data, we have 5 goals, so let's outline solutions for each.

These solutions go beyond the scope of a single initiative, and only some of them will be implemented in this one. The goal is to gather as many ideas as possible; they can be prioritized and scoped better later.

Pre-requisite reading. To make sense of the page please read

1. Easily finding out which properties need to be set for a given task

UI/Notebook Integration

Utilize the existing configuration builder collected metadata to automatically

Configuration Grouping

Extend add() to include a group_id, which will group related properties together.

Grouping: The add() method signature can be modified as follows:

add(key: ConfigKey, default_value: ConfigValue, ..., group_id: str = None)

Groups can be anything. For example, all Postgres settings would be in one group, Redshift settings in another, DAG plugin settings in yet another, and so on. This would make it easier to search for all relevant and related properties. Grouping can also be used to drive a wizard type of workflow (see below).

  • [CLI] vdk config --group postgres
  • [Notebook] show in Settings in Group Postgres ?
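
As a minimal sketch of how grouping could work, the toy builder below stores a group_id with every key and filters by it, which is roughly what `vdk config --group postgres` would need. `GroupedConfigBuilder`, `ConfigEntry`, and `by_group` are hypothetical names used only for illustration; they are not the existing VDK configuration builder API.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ConfigEntry:
    key: str
    default_value: Any
    description: str = ""
    group_id: Optional[str] = None


class GroupedConfigBuilder:
    """Toy stand-in for a configuration builder (hypothetical, not the VDK class)."""

    def __init__(self) -> None:
        self._entries: Dict[str, ConfigEntry] = {}

    def add(self, key: str, default_value: Any = None,
            description: str = "", group_id: Optional[str] = None) -> None:
        # Remember the group together with the key definition.
        self._entries[key] = ConfigEntry(key, default_value, description, group_id)

    def by_group(self, group_id: str) -> List[ConfigEntry]:
        # Return entries of one group in the order they were added.
        return [e for e in self._entries.values() if e.group_id == group_id]


builder = GroupedConfigBuilder()
builder.add("POSTGRES_HOST", "localhost", "Postgres host name", group_id="postgres")
builder.add("POSTGRES_PORT", 5432, "Postgres port", group_id="postgres")
builder.add("REDSHIFT_HOST", None, "Redshift host name", group_id="redshift")

postgres_keys = [entry.key for entry in builder.by_group("postgres")]
```

The same `by_group` lookup could feed both the CLI listing and a settings panel in the notebook.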

Python Files Configuration

Consider switching from .ini to Python files for configuration. This gives you autocompletion, type checking, syntax highlighting, tooltips on hover, better deprecation of options, and so on. Some tools, like Flask and Jupyter, already use Python files as configuration, so there is existing tooling that can be reused.

Configuration values could be declared in Python like this (the classes below could be auto-generated from the VDK configuration builder):

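from dataclasses import dataclass


@dataclass
class DBConfig:
    """Common base class for all database configurations."""


@dataclass
class SnowflakeConfig(DBConfig):
    account: str
    user: str
    password: str
    warehouse: str
    role: str
    database: str
    schema: str = 'PUBLIC'  # default to PUBLIC schema


@dataclass
class MainConfig:
    db_default_type: DBConfig
    ...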

and the user provides config.vdk.py:

config = MainConfig(db_default_type=SnowflakeConfig(account="xxx", ...))

You can also have config.staging.vdk.py with a different configuration for staging.

The file would still be a Python file, but with a special .vdk.py extension to set it apart. To make sure these classes are not used outside of a configuration file, we can add checks in the constructor. Alternatively, we can override the import function (or extend sys.meta_path with a new loader) to introduce custom behavior when certain modules or classes are imported.

  • [CLI] vdk run --config-file config.staging.vdk.py
  • [Notebook] We can add a VDK Config Cell. These config cells would need to be obfuscated (e.g. if a config option is marked as sensitive). This could be achieved using IPython cell magic (e.g. the user enters 1234 as the password below, and on save it is obfuscated):
%%vdkconfig
c.Postgres.user = name
c.Postgres.password = ****
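
One possible way to load such a file for `--config-file` is sketched below with the standard library's runpy.run_path. The helper name `load_config_file` and the convention that the file exposes a top-level `config` variable are assumptions for illustration; the demo uses a plain dict instead of the MainConfig classes to stay self-contained.

```python
import runpy
import tempfile
from pathlib import Path


def load_config_file(path: str):
    """Execute a *.vdk.py file and return its top-level `config` object."""
    if not path.endswith(".vdk.py"):
        raise ValueError(f"Expected a .vdk.py file, got: {path}")
    # run_path executes the file and returns its module globals as a dict.
    namespace = runpy.run_path(path)
    if "config" not in namespace:
        raise ValueError(f"{path} must define a top-level variable named 'config'")
    return namespace["config"]


# Demo with a throwaway staging config file.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.staging.vdk.py"
    config_path.write_text(
        "config = {'db_default_type': 'snowflake', 'schema': 'PUBLIC'}\n"
    )
    loaded = load_config_file(str(config_path))
```

Executing a configuration file runs arbitrary Python code, so the file must be trusted just like the data job code itself.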

Possible implementation:

Leverage the traitlets library, which provides some support for Python configuration. Create a new plugin, named vdk-traitlets, to facilitate this. (TODO: Evaluate alternatives to traitlets, such as Pydantic or others listed here.)

Runtime Validation:

Failing at the time of use may be too late; it is better to fail as soon as the user sets the value. Enhance the add() method to accept a validator function that validates configuration values at runtime.

add(key: ConfigKey, default_value: ConfigValue, ..., validator: Callable) 
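
A minimal sketch of fail-fast validation, assuming the validator is a callable that raises ValueError for bad input. `ValidatingConfigBuilder`, `set_value`, `get_value`, and `validate_port` are hypothetical names for illustration, not existing VDK APIs.

```python
from typing import Any, Callable, Dict, Optional

Validator = Callable[[Any], None]  # raises ValueError when the value is invalid


class ValidatingConfigBuilder:
    """Toy configuration store that validates values when they are set."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._validators: Dict[str, Optional[Validator]] = {}

    def add(self, key: str, default_value: Any = None,
            validator: Optional[Validator] = None) -> None:
        self._values[key] = default_value
        self._validators[key] = validator

    def set_value(self, key: str, value: Any) -> None:
        if key not in self._validators:
            raise KeyError(f"Unknown configuration key: {key}")
        validator = self._validators[key]
        if validator is not None:
            try:
                validator(value)
            except ValueError as exc:
                # Fail at set time and name the offending key.
                raise ValueError(f"Invalid value for {key}: {exc}") from exc
        self._values[key] = value

    def get_value(self, key: str) -> Any:
        return self._values[key]


def validate_port(value: Any) -> None:
    if not isinstance(value, int) or not 0 < value < 65536:
        raise ValueError("port must be an integer between 1 and 65535")


config = ValidatingConfigBuilder()
config.add("DB_PORT", default_value=5432, validator=validate_port)
config.set_value("DB_PORT", 5433)

try:
    config.set_value("DB_PORT", 70000)
    error = None
except ValueError as exc:
    error = str(exc)
```

The rejected value never overwrites the previous one, so the configuration stays in a valid state.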

Search Functionality

Introduce a search feature that enables users to easily find properties in the UI or CLI

  • [CLI] vdk config --search .?
  • [Notebook/IDE] However if we adopt Python based properties we could leverage the native python based auto-complete and IDE search capabilities.
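
For the CLI case, a search could be as simple as a case-insensitive substring match over key names and descriptions. The sketch below assumes the metadata is available as a key-to-description mapping; `search_config` and the descriptions are illustrative only.

```python
from typing import Dict, List


def search_config(entries: Dict[str, str], term: str) -> List[str]:
    """Return keys whose name or description contains the search term."""
    needle = term.lower()
    return sorted(
        key for key, description in entries.items()
        if needle in key.lower() or needle in description.lower()
    )


# Illustrative key descriptions, not copied from VDK.
known_keys = {
    "POSTGRES_HOST": "Host name of the Postgres server",
    "POSTGRES_PASSWORD": "Password used to connect to Postgres",
    "LOG_LEVEL": "Logging verbosity",
}

matches = search_config(known_keys, "postgres")
```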

Guided workflow / Wizard Assistant

Provide blueprints with pre-filled configurations for common tasks, so users can start with a working example.

See below in section 2 for the workflow

2. Easily finding out which SDK functionalities and methods are needed for a given task

Guided workflow / Wizard Assistant

Extend the CLI and Jupyter Notebooks to offer an interactive job or step creation process that handles all needed configuration dynamically.

Below is example workflow with the CLI

  • Initiate Interactive CLI
vdk create --interactive

Here, --interactive flag initiates the guided workflow

Step 1: Choose Job Type

Prompt: "What type of job would you like to create?"
- Data Ingestion
- Data Transformation
- Data Validation
- Custom

Choose job type: Data Ingestion

Step 2: Source Configuration

Prompt: "Select your data source type:"
- File
- Database
- Stream
- API
- Custom

Choose source type: Database

Prompt: What database 

Choose some_db

Prompt: Database-specific configurations will appear based on the group_id (group_id == some_db)

Step 3: Destination Configuration

Prompt: "Select your data destination:"
- File
- Database
- Stream
- API

Similar to source

Step 4 Display a summary of all configurations.

Prompt: "Would you like to proceed?"

The system will then generate the necessary code for the chosen configuration. The code will be production-ready: it will have the necessary configuration keys set and will call the correct methods for extracting and loading data (in the case of ingestion). The necessary plugins and dependencies will be added to the job's requirements.txt and installed automatically.

Step 5. Run the job

vdk run <job_name>

Also it's important that it is a flexible, extendable framework that allows contributors to easily add new blueprints with custom workflows.

Possible implementation

Key Components:

  • Blueprint Repository: A GitHub repository (vdk-blueprints) where contributors can add their own folder blueprint with the necessary files and description.
  • Configuration Metafile: Each folder (blueprint) should have a config.meta file that describes the parameters and the workflow logic. This can be written in JSON or YAML.
  • CLI Interface: Enhance the existing VDK CLI to support the guided workflow by fetching all blueprints and then interpreting the config.meta file.
  • Notebook interface: Enhance existing Notebook UI to support the guided workflow similar to CLI interface

Blueprint repository

Structure: Every blueprint folder should contain:

  • The code blueprint
  • A config.meta file
  • A README.md with manual instructions

/blueprint_folder
    /example_code_folder
    config.meta
    README.md

# config.meta
{
  "job_type": "Data Ingestion",
  "parameters": [
    {"name": "Database Host", "type": "string", "group_id": "database"},
    // ...
  ],
  "workflow_logic": "workflow.py" // optional, more below
}

For more complex, dynamic workflow logic, the workflow_logic key in config.meta can point to a Python script that's responsible for conditionally setting parameters.

# workflow.py
def execute_workflow(user_choices):
    if user_choices['source'] == 'API':
        # Do something
        pass
    else:
        # Do something else
        pass

Since both the CLI and the UI/Notebook need to be supported, we need to make sure the workflow logic is abstracted from the interface.
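
One way to keep the workflow logic independent of the interface is to pass in a prompt function: the CLI supplies one based on input(), while a notebook could supply one backed by widgets. This is a sketch under that assumption; `run_wizard`, `cli_prompt`, and `scripted_prompt` are hypothetical names, and the questions are taken from the example workflow above.

```python
from typing import Callable, Dict, Iterable, List

# A prompt function receives a question and its options and returns the chosen option.
PromptFn = Callable[[str, List[str]], str]


def run_wizard(prompt: PromptFn) -> Dict[str, str]:
    """Interface-agnostic workflow: it only talks to the user through `prompt`."""
    choices: Dict[str, str] = {}
    choices["job_type"] = prompt(
        "What type of job would you like to create?",
        ["Data Ingestion", "Data Transformation", "Data Validation", "Custom"],
    )
    if choices["job_type"] == "Data Ingestion":
        endpoints = ["File", "Database", "Stream", "API"]
        choices["source"] = prompt("Select your data source type:", endpoints + ["Custom"])
        choices["destination"] = prompt("Select your data destination:", endpoints)
    return choices


def cli_prompt(question: str, options: List[str]) -> str:
    """Terminal front end: print numbered options and read the selection."""
    print(question)
    for index, option in enumerate(options, start=1):
        print(f"  {index}. {option}")
    return options[int(input("Choose: ")) - 1]


def scripted_prompt(answers: Iterable[str]) -> PromptFn:
    """Non-interactive front end, e.g. for tests or pre-filled notebook widgets."""
    remaining = iter(answers)

    def _prompt(question: str, options: List[str]) -> str:
        answer = next(remaining)
        if answer not in options:
            raise ValueError(f"{answer!r} is not a valid answer to: {question}")
        return answer

    return _prompt


summary = run_wizard(scripted_prompt(["Data Ingestion", "Database", "File"]))
```

Calling `run_wizard(cli_prompt)` would run the same logic interactively in a terminal.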

CLI/Notebook interface

  1. The CLI and Notebook should be able to fetch the list of available blueprints from the vdk-blueprints GitHub repo and present them to the user as options.

  2. Parse config.meta if there is one; if not, just copy the example job.

  • Interpret and validate config.meta for each blueprint.
  • Present options and parameters to the user based on the config.meta.
  • Dynamic Workflow Logic: Optionally, execute a Python script (workflow.py mentioned above) to allow conditional logic based on user's choices.
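
The interpret-and-validate step could look roughly like the sketch below. It assumes config.meta is strict JSON (so without the // comments used in the illustration above) and that every parameter needs at least name and type; both the required fields and the helper names are assumptions.

```python
import json
from typing import Dict, List

REQUIRED_PARAMETER_FIELDS = {"name", "type"}  # assumed minimal schema


def parse_config_meta(text: str) -> dict:
    """Parse config.meta and fail early on structural problems."""
    meta = json.loads(text)
    if "job_type" not in meta:
        raise ValueError("config.meta must define 'job_type'")
    for index, param in enumerate(meta.get("parameters", [])):
        missing = REQUIRED_PARAMETER_FIELDS - param.keys()
        if missing:
            raise ValueError(f"parameter #{index} is missing fields: {sorted(missing)}")
    return meta


def parameters_by_group(meta: dict) -> Dict[str, List[str]]:
    """Group parameter names by group_id so they can be prompted together."""
    groups: Dict[str, List[str]] = {}
    for param in meta.get("parameters", []):
        groups.setdefault(param.get("group_id", "default"), []).append(param["name"])
    return groups


example_meta = """
{
  "job_type": "Data Ingestion",
  "parameters": [
    {"name": "Database Host", "type": "string", "group_id": "database"},
    {"name": "Database Port", "type": "int", "group_id": "database"}
  ]
}
"""

meta = parse_config_meta(example_meta)
groups = parameters_by_group(meta)
```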

Adding a New Blueprint/Example

  • Creating a new folder in the vdk-blueprints GitHub repository.
  • Adding the necessary code blueprint.
  • Optionally, writing a config.meta file that defines the parameters and workflow.
  • Optionally, adding a workflow.py for dynamic logic.

TODO: evaluate also leveraging libraries like

Snippet Generation

Provide user ability to auto-generate code snippets based on keywords or some other way

More advanced - Auto-generate code snippets based on the user's activity in the IDE to accelerate development.

Documentation (API Reference)

Provide the standard API reference documentation that users are used to. We could generate it using Sphinx or a similar tool.

3. Production-Ready Jobs

Store config.ini in CS Properties/Secrets

Currently, config.ini is stored in source control, making it difficult to maintain confidential or sensitive information securely.

We can transition to a more secure and centralized configuration by leveraging the VDK Control Service's Properties and Secrets API to keep vdk configuration.

To make the change smooth and ensure the user experience is preserved, the new workflow will allow users to still use config.ini for configuration. However, instead of committing this file to source control, we'll parse it and securely upload its contents to VDK Control Service.

Workflow:

  1. The user runs vdk deploy -p <directory> --env prod
  2. VDK reads the configuration from config.ini (or config.vdk.py) or config.prod.ini
  3. Instead of being uploaded to source control, the configuration is stored in Secrets or Properties, in a dedicated part reserved for it.
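
The parsing half of this workflow can be sketched with the standard configparser module: read config.ini and split it into non-sensitive properties and sensitive secrets before uploading. The name-based sensitivity rule, the `split_config` helper, and the sample keys are assumptions for illustration; the actual upload call to the Control Service is intentionally left out.

```python
import configparser
from typing import Dict, Tuple

# Assumption: a key is treated as sensitive if its name contains one of these markers.
SENSITIVE_MARKERS = ("password", "secret", "token")


def split_config(ini_text: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split config.ini content into (properties, secrets)."""
    # Disable interpolation so values containing '%' are kept verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    properties: Dict[str, str] = {}
    secrets: Dict[str, str] = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            is_sensitive = any(marker in key for marker in SENSITIVE_MARKERS)
            target = secrets if is_sensitive else properties
            target[f"{section}.{key}"] = value
    return properties, secrets


sample_ini = """
[vdk]
db_default_type = postgres
postgres_user = analyst
postgres_password = 1234
"""

properties, secrets = split_config(sample_ini)
```

If configuration options carry a "sensitive" flag in their metadata, that flag would be a more reliable signal than matching on key names.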

What if a user wants to keep config.ini in their own source control? They can still do that. We could provide a vdk obfuscate-config command that obfuscates only the sensitive values.

New Commands that may be introduced

  • vdk generate-config: To generate a new config.ini.
  • vdk upload-config: To upload the parsed configuration to the centralized system.

For this to happen we need to have Dynamic configuration - See research here for more : https://github.com/vmware/versatile-data-kit/wiki/Research:-Dynamic-Configuration

4. Environment Variables

Fix Documentation

Remove the promotion of environment variables from the documentation. Search for all environment variables mentioned and replace them with config.ini.

Provide machine level Global Settings:

The problem with config.ini is that it is per data job, while users often have many jobs that share common configuration (e.g. database settings).

  • [CLI] We should introduce more global settings using ~/.vdk/config file
  • [Notebook] Use Jupyter Settings
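
For the CLI case, layering a machine-level file under the per-job config.ini could be sketched with configparser, which merges files in the order they are read. The file names, the [vdk] section, and the keys in the demo are illustrative assumptions.

```python
import configparser
import tempfile
from pathlib import Path


def load_layered_config(global_path: str, job_path: str) -> configparser.ConfigParser:
    """Read the global file first, then the job file, so job values win."""
    parser = configparser.ConfigParser(interpolation=None)
    # read() skips files that do not exist; later files override earlier ones.
    parser.read([global_path, job_path])
    return parser


# Demo with temporary files standing in for ~/.vdk/config and <job>/config.ini.
with tempfile.TemporaryDirectory() as tmp:
    global_file = Path(tmp) / "global_config"
    job_file = Path(tmp) / "config.ini"
    global_file.write_text(
        "[vdk]\ndb_default_type = postgres\npostgres_host = shared-db\n"
    )
    job_file.write_text("[vdk]\npostgres_host = job-specific-db\n")

    merged = load_layered_config(str(global_file), str(job_file))
    db_type = merged.get("vdk", "db_default_type")
    host = merged.get("vdk", "postgres_host")
```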

Precedence Rules

Establish and document rules for what takes precedence when both env vars and properties are set as the number of configuration providers rises.
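
To make the idea concrete, a precedence chain can be expressed with collections.ChainMap, where the first mapping that contains a key wins. The order used below (environment variables, then job config, then global config, then defaults) is only an assumed example, not a decided rule.

```python
from collections import ChainMap
from typing import Mapping


def resolve_config(env: Mapping, job_config: Mapping,
                   global_config: Mapping, defaults: Mapping) -> ChainMap:
    """Combine providers; earlier arguments take precedence over later ones."""
    return ChainMap(dict(env), dict(job_config), dict(global_config), dict(defaults))


settings = resolve_config(
    env={"DB_DEFAULT_TYPE": "impala"},
    job_config={"DB_DEFAULT_TYPE": "postgres", "LOG_LEVEL": "DEBUG"},
    global_config={"LOG_LEVEL": "INFO", "TEAM": "analytics"},
    defaults={"DB_DEFAULT_TYPE": "sqlite", "LOG_LEVEL": "WARNING", "TEAM": None},
)

effective = {key: settings[key] for key in ("DB_DEFAULT_TYPE", "LOG_LEVEL", "TEAM")}
```

Whatever order is chosen, keeping it in one place like this makes it easy to document and to extend when new configuration providers are added.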

5. IDE Support

Non CLI entry point option

Allow running jobs within the IDE without going through the VDK CLI. This can be facilitated by enabling a method like StandaloneDataJob().run() in the main Python file.

Implementation details: Define a class StandaloneDataJob with a run() method that internally calls the necessary hooks and setup required by VDK. Developers would then use it like this:

def main():
    result = StandaloneDataJob().run()

Solutions scoring

  1. Configuration Simplicity Does the solution reduce the number of steps or complexity in setting up a configuration? How well does it guide the user in avoiding common mistakes like wrong names or wrong sections?
  2. Secure Data Connection Ease Does the solution offer a straightforward, documented way to securely handle credentials and connect to data sources?
  3. Seamless Dev Experience Can a user easily run a basic example from within an IDE? Does the solution offer a one-click setup or boilerplate code to run within popular IDEs (incl. notebooks)?
  4. Configuration/environment Consistency Does it solve the "it works on my machine" problem by ensuring environment and configuration consistency?
  5. Run successfully on first try How easy is it to create a working code on "first" try?
  6. New users Does the solution improve the chance of new users understanding, trying out and using vdk?
  7. Cost of Implementation What are the time and resource costs associated with implementing the solution?

(Scores: 0 = no match for the criterion, 1 = small improvement, 2 = very good/big improvement, 3 = perfect match.)

| Solution | Configuration Simplicity | Secure Config Ease | Seamless Dev Experience | Config Consistency | Run on first try | New users | Cost | Total Score |
|---|---|---|---|---|---|---|---|---|
| Guided workflow/Wizard Assistant | 3 | 1 | 3 | 2 | 3 | 3 | 1 | 16 |
| Python Files Configuration | 3 | 0 | 2 | 1 | 2 | 2 | 2 | 12 |
| UI/Notebook Integration | 2 | 1 | 2 | 1 | 1 | 2 | 2 | 11 |
| Production-Ready Jobs | 1 | 3 | 0 | 3 | 1 | 1 | 2 | 11 |
| Store config.ini in CS | 1 | 3 | 0 | 3 | 1 | 1 | 2 | 11 |
| Machine Level Global Settings | 2 | 2 | 0 | 2 | 2 | 1 | 2 | 11 |
| Snippet Generation | 1 | 0 | 3 | 1 | 2 | 2 | 1 | 10 |
| Configuration Grouping | 2 | 0 | 1 | 2 | 0 | 1 | 2 | 8 |
| Documentation (API Reference) | 1 | 1 | 1 | 0 | 1 | 2 | 1 | 7 |
| No-CLI Entry Point Option | 0 | 0 | 2 | 1 | 1 | 1 | 2 | 7 |
| Runtime Validation | 2 | 0 | 0 | 2 | 1 | 0 | 1 | 6 |
| Search Functionality | 2 | 0 | 1 | 1 | 0 | 1 | 1 | 6 |

Feedback

Here are my preferences and notes on the topic:

  • Python Files Configuration – I believe this is a great idea… It would be even better if you can have an interactive tool – even a CLI too which asks through a series of prompts to fill in the configuration and validates/stores it and saves it in a file… Does it make sense?
  • Guided workflow / Wizard Assistant – I would prefer this one to even a large library of blueprints (although I would assume it would be based on exactly on a – potentially extensible – library of blueprints)
  • Environment Variables – this ties in pretty well with my first choice and it has been a bane of mine… I find it very hard to change configuration through environment variables and putting everything in a single well organized file would be great.

Separate notes: Isn’t the IDE Support relatively cheap to implement? It could be a low hanging fruit for. Production-ready jobs – before we/you start implementing these, should we clarify the general idea for staging/prod deployments, environments, etc… this has a long way to go in terms of maturing…

./setup-vdk [existing_config.py]

  1. which db do you want impala, presto ...

enter impala configs


logging configuration telemetry endpoint smtp server control service


database="SuperCollider"

job_input.execute_query("SuperCollider", "select 1")


my three favorites:

I selected the ones that seem to add the most value when you start with VDK or try to set up your initial PoC with the framework. However, the next thing (which might be extremely important for evaluators) is security. Therefore, after the three in the list above, I would add: https://github.com/vmware/versatile-data-kit/wiki/Research:-Getting-started-and-Configuration#store-configini-in-cs-propertiessecrets or alternatively (as a cheaper solution) provide a very well documented way, with a tutorial, for VDK OSS users to securely set their sensitive data (i.e. how to set and use CS secrets, with an example).


Based on the goals and solutions you've outlined, the following top 3 features could have the greatest impact for new users evaluating VDK for the first time:

  • Guided Workflow / Wizard Assistant: When someone is new to a framework, the learning curve can often be steep. Offering a guided workflow can simplify this experience. By implementing an interactive CLI or notebook-based wizard that helps users set up their data jobs, source and destination configurations, etc., you can make it easier for new users to understand the framework's capabilities quickly. This can significantly reduce the time it takes to set up an initial PoC and evaluate VDK.

  • Comprehensive API Reference Documentation: Quality documentation is essential for new users to quickly grasp the capabilities of the VDK framework. Automatically generated, standardized API reference documentation can greatly assist in this. Implementation: Use tools like Sphinx to automatically generate API documentation that is easy to navigate and understand. These features aim to streamline the initial experience, bolster security, and provide robust documentation, all of which are critical factors in evaluating a new framework.

  • Production-Ready Jobs with CS Properties/Secrets: Security is often a major concern when evaluating new frameworks. By providing a secure, centralized place to store configurations, you alleviate concerns about sensitive data. Implementation: Transition from storing config.ini in source control to using VDK Control Service's Properties and Secrets API. Introduce new commands like vdk generate-config and vdk upload-config.

  • Python File Configuration with Runtime Validation: Switching from .ini to Python files for configuration, along with runtime validation features. Python configurations will provide immediate familiarity and offer better tooling, such as autocompletion and syntax highlighting. Runtime validation will catch errors early, enhancing the robustness of the PoC and improving the framework's reliability.
