As a user, I'd like to have the notion of a dataset of records over which I can run an application or a set of evals. Common dataset purposes are:
Golden Dataset - contains queries and golden responses
QA Asset - a set of difficult queries over which you want to perform regression testing
Sample of production data - pick and choose records from production to convert to an asset (overlaps with above)
Motivation
LLM outputs are non-deterministic and teams need a proper way to evaluate the system. With datasets, teams can select a “test suite” of data points that they can evaluate changes on. This allows them to have trust in their application when they make modifications such as:
Modifying a prompt template
Iterating on various components of the application and seeing if there are differences in output
Swapping out for new model releases
Seeing performance on a cheaper new model or a fine-tuned model
Use-cases
Datasets will contain data from various data sources:
Pre-deployment
Store and maintain a set of synthetic queries
Store and maintain a set of hand curated queries
Make a copy of Hugging Face / CSV data to be maintained internally (a rough sketch of the CSV case follows this list)
Post-deployment
Move data from production or staging for regression testing or fine-tuning
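As a rough sketch of the CSV case above, curated rows could be converted into the input / output / metadata shape described under Dataset Examples below. The column names "query" and "expected" are assumptions for illustration, not a fixed format:

```python
import csv
from typing import Any, Dict, List


def examples_from_csv(path: str) -> List[Dict[str, Any]]:
    """Convert a CSV of curated queries into dataset examples.

    Assumes columns named "query" and "expected"; any other columns are
    carried along as metadata so they stay available during evaluation.
    """
    examples = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            examples.append(
                {
                    "input": {"query": row.pop("query")},
                    "output": {"expected": row.pop("expected")},
                    "metadata": row,  # remaining columns, e.g. topic or difficulty
                }
            )
    return examples
```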
Architecture
Dataset
A dataset maintains a set of records. These records are versioned: when a record is added, edited, or deleted, the change is tracked as a new dataset version. Versions must be immutable so that code depending on a specific version always sees the same data.
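A minimal sketch of the immutability property, assuming an append-only revision model; the names (Revision, dataset_version_id, RevisionKind) are illustrative, not the actual Phoenix schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class RevisionKind(Enum):
    CREATE = "CREATE"
    PATCH = "PATCH"
    DELETE = "DELETE"


@dataclass(frozen=True)
class Revision:
    example_id: int
    dataset_version_id: int  # monotonically increasing; a version is never rewritten
    kind: RevisionKind


def example_ids_at_version(revisions: List[Revision], version_id: int) -> List[int]:
    """Return the examples visible at a pinned dataset version.

    Revisions are append-only, so replaying them up to `version_id`
    always yields the same result -- the property that lets code depend
    on a version without the data changing underneath it.
    """
    latest: Dict[int, Revision] = {}
    for rev in sorted(revisions, key=lambda r: r.dataset_version_id):
        if rev.dataset_version_id <= version_id:
            latest[rev.example_id] = rev
    return [eid for eid, rev in latest.items() if rev.kind is not RevisionKind.DELETE]
```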
Dataset Examples
A dataset is a set of examples. These examples contain:
input - data passed to an LLM, prompt, or function (e.g. a retriever)
expected / output - the result of the invocation of an LLM, prompt, or function
metadata - any additional information that can be used during experimentation
In addition to the above, a dataset record should optionally have:
metadata - any additional data associated with the record (e.g. attributes from a span)
**source span_rowid / trace_rowid** - if the data came from a span, it should link back to the source
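A minimal sketch of an example record under the fields above; DatasetExample and its field names are illustrative, not the Phoenix data model:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class DatasetExample:
    # data passed to the LLM, prompt, or function under test
    input: Dict[str, Any]
    # expected / reference output of the invocation
    output: Dict[str, Any]
    # additional information usable during experimentation
    metadata: Dict[str, Any] = field(default_factory=dict)
    # optional link back to the originating span / trace
    span_rowid: Optional[int] = None
    trace_rowid: Optional[int] = None


example = DatasetExample(
    input={"query": "What is Phoenix?"},
    output={"answer": "An open-source LLM observability library."},
    metadata={"source": "hand-curated"},
)
```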
Dataset Experiment
A dataset experiment is run using the examples of a dataset. Experiments are tied to a specific dataset version and span a duration of time. During an experiment, certain components of an LLM application are modified, such as the prompt template or the underlying model.
Experiment Output
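A minimal sketch of what an experiment's output could look like, assuming one output row per example against a pinned dataset version; run_experiment and the row fields are illustrative, not the Experiments SDK:

```python
import time
from typing import Any, Callable, Dict, List


def run_experiment(
    examples: List[Dict[str, Any]],
    task: Callable[[Dict[str, Any]], Any],
    dataset_version_id: int,
) -> Dict[str, Any]:
    """Run `task` over every example of a pinned dataset version.

    Produces one output row per example so the results can later be
    compared against the expected output or scored by evaluators.
    """
    started_at = time.time()
    runs = []
    for example in examples:
        output = task(example["input"])  # e.g. call the prompt / LLM / retriever under test
        runs.append(
            {
                "input": example["input"],
                "expected": example["output"],
                "output": output,
                "metadata": example.get("metadata", {}),
            }
        )
    return {
        "dataset_version_id": dataset_version_id,  # the immutable version this experiment ran against
        "duration_seconds": time.time() - started_at,
        "runs": runs,
    }
```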
Planning
Infra
arize-phoenix-client package #2914
Tables
dataset_example_revisions table #3241
Rest API
GraphQL
Experiments SDK
UI
Tests
Bugs
data payload #3363