Research: Progress Indicators

Antoni Ivanov edited this page Sep 1, 2023 · 7 revisions

Overview

Different types of Journeys

There are 4 ways to develop and run jobs.

  1. Using the CLI command vdk run job-name locally. Logs are local.
  2. If the job is deployed: job cloud execution (scheduled or manual via the API). Logs go to the Cloud Logging Service.
  3. Notebook cell-by-cell development and execution. Logs go to notebook cells.
  4. Using the Notebook "Run Job" button to run the full job. Logs go to a file (and to notebook cells?).

Hierarchy of operations

There is a hierarchy of how operations are executed.

  • The data job is the root process
  • Each data job run consists of initialization, steps, and finalization
  • The steps are where the user code runs, so they are the most relevant
  • Each step may do different things:
    • Execute queries
    • Execute templates (other jobs encapsulated behind a method)
    • Send data for ingestion
    • Use other methods of JobInput provided by VDK
    • Arbitrary python code unrelated to VDK

Each of these operations can take time, and the user should have visibility into them.

Starting data job X 
   Starting step X.y 
      Executing query select * from ... 

User Journeys

The standard logging cases

1. Using CLI Command (vdk run job-name locally)

Success Case:

  1. User runs vdk run job-name.
  2. Terminal shows real-time status: "Running step 01", "Executing query XYZ", etc. The previous status is replaced inline. Longer log lines (e.g. long SQL) should be truncated (full logs remain available in the log file).
  3. Once completed, the terminal outputs "Job succeeded."
  4. Terminal provides a file path to detailed logs.
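Inline status replacement in a terminal is typically done with a carriage return plus truncation to the terminal width. A rough sketch, assuming a plain TTY (not how VDK necessarily implements it):

```python
import shutil
import sys

def truncate(message, width):
    """Cut a status line down to the terminal width, marking the cut with '...'."""
    if len(message) > width:
        return message[: width - 3] + "..."
    return message

def show_status(message, width=None):
    """Overwrite the current terminal line with a (possibly truncated) status."""
    width = width or shutil.get_terminal_size().columns
    line = truncate(message, width)
    # \r returns to the line start; padding clears leftovers of the previous status
    sys.stdout.write("\r" + line.ljust(width))
    sys.stdout.flush()

for status in ["Running step 01", "Executing query SELECT * FROM ...", "Job succeeded."]:
    show_status(status, width=40)
print()  # move to a fresh line once the final status is shown
```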

Error Case:

  1. User runs vdk run job-name.
  2. An error occurs in step 03. The terminal stops updating and shows the current step stack (3 lines).
  3. Terminal shows the step name, file name, line number, and root cause (no more than 10 lines in total).
  4. Terminal provides a file path to detailed logs with stack trace.
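A compact error summary like this could be derived from the exception traceback. A sketch with hypothetical formatting and helper names (not VDK's actual output):

```python
import traceback

def summarize_error(exc, step_name, log_path, max_lines=10):
    """Build a short, fixed-size error summary pointing at the root cause."""
    # the last frame of the traceback is the closest to the root cause
    tb = traceback.extract_tb(exc.__traceback__)
    frame = tb[-1] if tb else None
    lines = [
        f"Step {step_name} failed.",
        f"  File: {frame.filename}, line {frame.lineno}" if frame else "  (no traceback)",
        f"  Root cause: {type(exc).__name__}: {exc}",
        f"Full logs with stack trace: {log_path}",
    ]
    return lines[:max_lines]  # enforce the "no more than 10 lines" budget

try:
    raise ValueError("table 'users' does not exist")
except ValueError as e:
    summary = summarize_error(e, "03", "/tmp/job.log")
    print("\n".join(summary))
```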

2. Job Cloud Execution (Scheduled or Manual Using API)

Success Case:

  1. Job starts in the cloud.
  2. Cloud Logging captures keys such as timestamps, job name, opId, executionId, step name, and perhaps the query executed. The format is controlled via configuration.
  3. On completion, a "Job succeeded" log appears in the Cloud Logging Service.
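Structured cloud logs are commonly emitted as one JSON object per line. A minimal sketch using the standard `logging` module; the key names (`opId`, `executionId`, etc.) follow the list above, but the exact schema is an assumption:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line with job-level context keys."""

    def __init__(self, job_name, op_id, execution_id):
        super().__init__()
        self.context = {"jobName": job_name, "opId": op_id, "executionId": execution_id}

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # step name is attached per-record via the `extra` argument below
            "stepName": getattr(record, "step_name", None),
            **self.context,
        }
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("job-name", "op-123", "exec-456"))
log = logging.getLogger("vdk-demo")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("Job succeeded", extra={"step_name": "10_final.sql"})
```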

Error Case:

  1. Job starts in the cloud.
  2. An error occurs.
  3. Cloud Logging captures the error details, including the root cause and line number.

3. Notebook Cell-by-Cell Development and Execution

Success Case:

  1. User executes a cell.
  2. The cell is considered a VDK step, so the cell itself is a natural progress indicator for the step.
  3. Lower-level details (sub-steps) may need to be output in the logs. If possible, this should mirror the CLI case.
  4. A success status appears as soon as the cell finishes. If everything succeeds, nothing should be written to stderr (this might be configurable, as the user should have the option to see detailed logs).
  5. stdout is reserved for output (e.g. a data frame) and must never be used for logging.
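Keeping stdout clean in a notebook means routing all diagnostic output to stderr. A sketch of the standard-library configuration (not VDK-specific):

```python
import logging
import sys

# All diagnostic output goes to stderr; stdout stays reserved for cell results.
# force=True replaces any handlers a previous cell may have configured.
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format="%(levelname)s %(message)s",
    force=True,
)

log = logging.getLogger("notebook-step")
log.info("Executing query ...")    # lands on stderr
print("result dataframe here")     # stdout: the cell's actual output
```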

Error Case:

  1. User executes a cell.
  2. The cell fails.
  3. The root cause error message is displayed near the top of the cell output.
  4. A link to the detailed file logs, or a collapsible section with the same content, is provided.

4. Using the Notebook Run Job Button

Success Case:

  1. User clicks "Run Job".
  2. Notebook executes all "production" marked cells.
  3. The user can see which cell is currently being executed, if they want to.
  4. On completion, the user sees a success status. They can optionally access the logs of each cell (e.g. by opening a copied notebook).

Error Case:

  1. User clicks "Run Job".
  2. Execution stops at the failing cell, which displays the error message below it.
  3. The user can see the logs below each cell.

Long running operation

Finalization Phase

When the job completes, it runs finalization hooks. Among other things, these flush all ingestion queues to make sure the data is sent, blocking the process until that is done. This can take time (if a lot of data is being sent) and may make the job appear stuck.

1. vdk run job-name locally

  • Success Case: Shows "Finalizing" with a progress bar.
  • Error Case: Outputs an error and stops the process.
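A "Finalizing" progress bar over the ingestion queue flush could look like the sketch below, where `items` and `flush_one` stand in for the real ingestion internals (hypothetical names):

```python
import sys

def flush_with_progress(items, flush_one, width=30):
    """Flush queued items one by one, drawing a simple inline progress bar."""
    total = len(items)
    for i, item in enumerate(items, start=1):
        flush_one(item)
        filled = width * i // total
        bar = "#" * filled + "-" * (width - filled)
        # \r redraws the bar in place on each update
        sys.stderr.write(f"\rFinalizing [{bar}] {i}/{total}")
        sys.stderr.flush()
    sys.stderr.write("\n")

sent = []
flush_with_progress(["batch-1", "batch-2", "batch-3"], sent.append)
```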

2. Job Cloud Execution

  • Success Case: Logs "Finalizing" and provides updates at a regular interval (every minute?).
  • Error Case: Logs an error and stops the process.
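Regular "still finalizing" updates could come from a background heartbeat thread. A sketch using the standard library; the one-minute default matches the interval suggested above, shortened here for the demonstration:

```python
import threading
import time

def run_with_heartbeat(work, log, interval=60.0):
    """Run `work()` while logging a heartbeat every `interval` seconds."""
    done = threading.Event()

    def beat():
        elapsed = 0.0
        # Event.wait doubles as a cancellable sleep: returns True once done is set
        while not done.wait(interval):
            elapsed += interval
            log(f"Finalizing... still flushing after {elapsed:.0f}s")

    t = threading.Thread(target=beat, daemon=True)
    t.start()
    try:
        work()
    finally:
        done.set()
        t.join()

messages = []
# Simulated flush taking longer than the heartbeat interval.
run_with_heartbeat(lambda: time.sleep(0.05), messages.append, interval=0.02)
```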

3. Notebook Cell-by-Cell

  • Success Case: Finalization happens automatically when the notebook is closed. TODO: this should be tested. What happens exactly?
  • Error Case: Popup alert warning if notebook closure failed.

4. Notebook Run Job

  • Success Case: Job concludes normally.
  • Error Case: The job fails, and an error is printed in the last cell and in the log file.

Other Long-running Operations like DAG Jobs

Or long-running SQL queries. These are partially covered in the main scenarios above.

Success Case:

  1. User runs a DAG job.
  2. Each sub-job shows its status and a progress bar (this works similarly to steps above). A step is considered the parent of an orchestrated job, since the DAG runs within a job step.
  3. Once a sub-job finishes, a status update of what happened is printed.

Error Case:

  1. User runs a DAG job.
  2. An error occurs in one of the sub-jobs.
  3. The error log appears and stops the entire DAG job process.
  4. The error is concise: it states the step, file and line number, the DAG job that failed, and the root cause message only (no more than 10 lines in total).
  5. The log output also provides a file path to detailed logs with the stack trace.