
Using Data Job Properties vs Secrets

Antoni Ivanov edited this page Aug 4, 2023 · 12 revisions

This article outlines when and how you should use Data Job Properties or Secrets.

While the two mechanisms can be used somewhat interchangeably, there are important differences you should be aware of:

Properties vs Secrets

  • Properties are used to store state. They are generally faster to access and modify. If you need to overwrite a value often, possibly several times during a single data job execution, properties are the way to go. They are not encrypted at rest. In terms of data classification levels, they are likely appropriate for public data, internal data, or low-sensitivity private data.

  • Secrets are used to store sensitive data. They are generally fast to access (somewhat slower than Properties) but slow to modify, because they are encrypted and decrypted during storage and retrieval. They are best suited for sensitive values such as passwords, credentials, tokens, and API keys. They are stored encrypted in secure storage, for example a HashiCorp Vault instance, which makes them suitable for highly sensitive data.

Use cases

Properties

  • Last processed state: store the timestamp of the last successful run, the timestamp of the last ingested record, or the last row id.
  • Query Parameters: properties are automatically expanded in SQL queries (select * from {db}.{table})
  • Progress information: information about the progress of a long-running ETL task, such as the percentage of data processed.
  • Configurations: like environment (staging or production) or other non-sensitive configurations
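
To make the query-parameter use case concrete, the sketch below mimics the placeholder expansion with plain Python string formatting. It only illustrates the effect and is not VDK's actual implementation; the helper name `expand_query` and the property values are hypothetical.

```python
def expand_query(sql: str, properties: dict) -> str:
    """Replace {name} placeholders in sql with matching property values.

    Illustration only: VDK performs the substitution itself when a job
    runs a query. This helper just shows the resulting SQL text.
    """
    return sql.format(**properties)


# Hypothetical property values for a data job
properties = {"db": "analytics", "table": "events"}
query = expand_query("select * from {db}.{table}", properties)
print(query)  # select * from analytics.events
```

Because the values are pasted into the query text, this mechanism is a fit for non-sensitive configuration such as database or table names.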

Secrets

  • API keys or tokens: tasks that pull data from third-party APIs that require an API key to authenticate.
  • Service Account Credentials: credentials for connecting to internal or third-party services, such as email server passwords.
  • Cloud Service Credentials: access keys, client IDs, client secrets, and other sensitive credentials needed when interacting with cloud services like AWS, Google Cloud, or Azure.

Example Scenarios

Properties

You need to store the date of the last processed data entry to ensure the job begins processing new data from the correct point the next day.

In this case, the 'last processed date' can be stored as a property. It's not sensitive information but necessary for maintaining the job's state.

import logging
from datetime import date


def run(job_input):
    # get the properties
    properties = job_input.get_all_properties()

    current_date = str(date.today())

    if ('last_ingested_timestamp' not in properties) or current_date != properties['last_ingested_timestamp']:

        # some very complex processing goes here...

        # update the property value and store it
        properties['last_ingested_timestamp'] = current_date
        job_input.set_all_properties(properties)
    else:
        logging.info("Skipped ingestion")

You can also use the vdk properties command to store and retrieve properties via the command line. You can check all options and examples using vdk properties --help.
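
The same checkpoint pattern also works with a row id instead of a date. The sketch below is a minimal illustration: `FakeJobInput` is a hypothetical in-memory stand-in so the snippet runs without VDK, and `process_new_rows` and the `last_row_id` property name are assumptions made for this example. It assumes that `get_property` accepts a default value, as the VDK properties API does.

```python
class FakeJobInput:
    """Hypothetical in-memory stand-in for job_input, for local experiments only."""

    def __init__(self):
        self._props = {}

    def get_property(self, name, default_value=None):
        return self._props.get(name, default_value)

    def get_all_properties(self):
        return dict(self._props)

    def set_all_properties(self, properties):
        self._props = dict(properties)


def process_new_rows(job_input, rows):
    """Return rows newer than the stored checkpoint and advance the checkpoint."""
    last_id = job_input.get_property("last_row_id", 0)
    new_rows = [row for row in rows if row["id"] > last_id]
    if new_rows:
        # Read-modify-write so other properties of the job are preserved
        properties = job_input.get_all_properties()
        properties["last_row_id"] = max(row["id"] for row in new_rows)
        job_input.set_all_properties(properties)
    return new_rows


job_input = FakeJobInput()
rows = [{"id": 1}, {"id": 2}, {"id": 3}]
first = process_new_rows(job_input, rows)                 # all three rows are new
second = process_new_rows(job_input, rows + [{"id": 4}])  # only id 4 is new
```

In a real data job, the `job_input` passed to `run` would take the place of the stand-in.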

Secrets

Now, suppose you have to extract data from a third-party service that requires API authentication. The API key, being a sensitive piece of information, needs to be securely stored. In this scenario, you will store the API key as a secret.

You can use the vdk secrets command to store and retrieve secrets via the command line. You can check all options and examples using vdk secrets --help.

If you are using the VDK CLI on a private, secure console, you can use the "--set-prompt" option. You are then prompted to enter the secret value, so it is not kept in your console's history.

vdk secrets -n my-job -t my-team --set-prompt "api_key"

In a data job, you can access Job Secrets via the JobInput's secrets methods. In the following example we'll get the value of a single secret and use it to make an authenticated REST call:

import requests
from datetime import date, timedelta
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # Get the API Key from the Job Secrets
    api_key = job_input.get_secret('api_key')

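    # Get the data
    url = ...
    # Assumption: this API accepts the key as a query parameter;
    # adjust to the authentication scheme of your API
    params = {'api_key': api_key}
    response = requests.get(url, params=params)
    response.raise_for_status()
    data = response.json()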

    #  ...
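
Many APIs expect the key in a request header rather than in the URL. The sketch below shows one way to build such a header from a job secret and to fail early when the secret is missing. The `Bearer` scheme, the helper `build_auth_headers`, and the stand-in class `FakeSecretsInput` are assumptions made for illustration; check your API's documentation for the scheme it actually requires.

```python
class FakeSecretsInput:
    """Hypothetical stand-in exposing get_secret, so the sketch runs without VDK."""

    def __init__(self, secrets):
        self._secrets = dict(secrets)

    def get_secret(self, key, default_value=None):
        return self._secrets.get(key, default_value)


def build_auth_headers(job_input, secret_name="api_key"):
    """Build request headers from a job secret, failing early if it is missing."""
    api_key = job_input.get_secret(secret_name)
    if not api_key:
        # Name the missing secret, but never include secret values in messages
        raise ValueError(f"Secret '{secret_name}' is not set for this job")
    # Assumption: the API uses the Bearer token scheme
    return {"Authorization": f"Bearer {api_key}"}


headers = build_auth_headers(FakeSecretsInput({"api_key": "example-key"}))
```

The resulting dictionary can be passed to `requests.get(url, headers=headers)`. Avoid logging the headers, since they contain the secret value.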

Summary

At a Glance: Properties vs Secrets

| Feature | Properties | Secrets |
|---|---|---|
| Recommended data type | State or non/low-sensitive data | Medium or highly sensitive data |
| Use cases | state, configuration, status | passwords, API keys, tokens, credentials |
| Size limit | ~10 KB per key/value | 512 bytes per key/value |
| Read access | Fast | Slightly slower |
| Update request rate | High (many times per job execution) | Low (usually through UI or CLI) |
| Backend storage | OLTP Database | HashiCorp Vault |
| Encryption at rest | No | Yes |
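
The size limits above can be checked before a value is stored. The sketch below is a minimal illustration that assumes the limits apply to the UTF-8 encoded size of the value; the constants and the helper `fits_limit` are hypothetical names, and the properties limit is approximate.

```python
PROPERTY_LIMIT_BYTES = 10 * 1024  # approximate limit for a property value
SECRET_LIMIT_BYTES = 512          # limit for a secret value


def fits_limit(value: str, limit_bytes: int) -> bool:
    """Return True if the UTF-8 encoded value is within limit_bytes."""
    return len(value.encode("utf-8")) <= limit_bytes


short_token = "x" * 512
long_token = "x" * 513
short_ok = fits_limit(short_token, SECRET_LIMIT_BYTES)  # exactly at the limit
long_ok = fits_limit(long_token, SECRET_LIMIT_BYTES)    # one byte over
```

Measuring bytes rather than characters matters for non-ASCII text, where a single character can take several bytes.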

By understanding these differences, you can optimize your data jobs and maintain best practices for data security and efficiency.
Remember to consider the nature of your data before deciding whether to use properties or secrets.
