🗺️ Datasets and Experiments #2017

mikeldking · 2023-12-28T17:57:53Z

As a user, I'd like to have the notion of a dataset of records over which I can run an application or a set of evals. Common dataset purposes are:

Golden Dataset - contains queries and golden responses
QA Asset - a set of difficult queries over which you want to perform regression testing
Sample of production data - pick and choose records from production to convert to an asset (overlaps with above)

Motivation

LLM outputs are non-deterministic, so teams need a systematic way to evaluate the system. With datasets, teams can select a “test suite” of data points against which to evaluate changes. This lets them trust their application when they make modifications such as:

Modifying a prompt template
Iterating on various components of the application and checking whether the output changes
Swapping in new model releases
Measuring performance on a cheaper new model or a fine-tuned model

Use-cases

Datasets will contain data from various data sources:

Pre-deployment

Store and maintain a set of synthetic queries
Store and maintain a set of hand curated queries
Make a copy of Hugging Face / CSV data to be maintained internally

Post-deployment

Move data from production or staging for regression testing or fine-tuning

Architecture

Dataset

A dataset maintains a set of records. Records are versioned: whenever a record is added, edited, or deleted, the change is tracked as a new version. Versions must be immutable so that code depending on a specific version always sees the same data.
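One way to picture the versioning requirement is an append-only revision log, where a version is simply a read-only replay of the log up to that point. The sketch below is illustrative only and is not the Phoenix schema; the class name `VersionedDataset` and its methods `put`, `delete`, and `snapshot` are assumptions made for this example.

```python
from types import MappingProxyType
from typing import Any, Dict, List, Mapping, Optional, Tuple


class VersionedDataset:
    """Hypothetical in-memory dataset backed by an append-only revision log."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Each entry is (version, record_id, data); data is None for a delete.
        self._log: List[Tuple[int, str, Optional[Mapping[str, Any]]]] = []

    @property
    def latest_version(self) -> int:
        return len(self._log)

    def _append(self, record_id: str, data: Optional[Mapping[str, Any]]) -> int:
        # Copy and wrap the payload so later caller-side mutation cannot
        # alter a revision that has already been recorded (shallow only).
        frozen = MappingProxyType(dict(data)) if data is not None else None
        self._log.append((self.latest_version + 1, record_id, frozen))
        return self.latest_version

    def put(self, record_id: str, data: Mapping[str, Any]) -> int:
        """Add or edit a record; returns the new dataset version."""
        return self._append(record_id, data)

    def delete(self, record_id: str) -> int:
        """Delete a record; returns the new dataset version."""
        return self._append(record_id, None)

    def snapshot(self, version: Optional[int] = None) -> Dict[str, Mapping[str, Any]]:
        """Replay the log up to `version` and return the records at that point."""
        if version is None:
            version = self.latest_version
        if not 0 <= version <= self.latest_version:
            raise ValueError(f"unknown version: {version}")
        records: Dict[str, Mapping[str, Any]] = {}
        for _, record_id, data in self._log[:version]:
            if data is None:
                records.pop(record_id, None)
            else:
                records[record_id] = data
        return records


dataset = VersionedDataset("golden")
v1 = dataset.put("r1", {"input": "What is Phoenix?"})
v2 = dataset.put("r1", {"input": "What is Arize Phoenix?"})
v3 = dataset.delete("r1")
```

Because revisions are only ever appended, `dataset.snapshot(v1)` keeps returning the original record even after the later edit and delete, which is the property code pinned to a version relies on.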

Dataset Examples

A dataset is a set of examples. These examples contain:

input - data passed to an LLM, prompt, or function (e.g. a retriever)
expected / output - the result of the invocation of an LLM, prompt, or function
metadata - any additional information that can be used during experimentation

In addition to the above, a dataset record should optionally have:

metadata - any additional data associated with the record (e.g. attributes from a span)
source span_rowid / trace_rowid - if the data came from a span, it should link back to the source
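As a rough illustration of the fields listed above, an example record could be modeled as a small immutable data class. This is a sketch under assumptions, not the actual Phoenix model: the class name `DatasetExample`, the helper `example_from_span`, and the span dictionary keys (`rowid`, `trace_rowid`, `input`, `output`, `attributes`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class DatasetExample:
    """Hypothetical shape of one dataset example."""

    input: Dict[str, Any]    # data passed to an LLM, prompt, or function
    output: Dict[str, Any]   # expected result of the invocation
    metadata: Dict[str, Any] = field(default_factory=dict)
    span_rowid: Optional[int] = None   # set only when sourced from a span
    trace_rowid: Optional[int] = None  # set only when sourced from a span


def example_from_span(span: Dict[str, Any]) -> DatasetExample:
    """Build an example from a span-like dict, keeping a link to its source.

    The span keys used here are assumptions made for this sketch.
    """
    return DatasetExample(
        input=dict(span["input"]),
        output=dict(span["output"]),
        metadata=dict(span.get("attributes", {})),
        span_rowid=span["rowid"],
        trace_rowid=span["trace_rowid"],
    )


span = {
    "rowid": 7,
    "trace_rowid": 3,
    "input": {"query": "What is a dataset?"},
    "output": {"response": "A versioned set of examples."},
    "attributes": {"model": "example-model"},
}
from_production = example_from_span(span)
hand_curated = DatasetExample(
    input={"query": "What is an experiment?"},
    output={"response": "A run over a dataset version."},
)
```

Hand-curated or synthetic examples leave both source identifiers as `None`, while examples promoted from production keep the link back to their span.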

Dataset Experiment

A dataset experiment is a run over the examples of a dataset. Experiments are tied to a specific dataset version and have a duration of time. During an experiment, certain components of an LLM application are modified. This includes:

Change in LLM or LLM params
Change in prompt template
Change in retrieval strategy
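The relationship between an experiment, a pinned dataset version, and the component being varied can be sketched as follows. Everything named here (`Experiment`, `ExperimentRun`, `run_experiment`, the `config` dictionary, and the shape of `examples`) is a hypothetical illustration, not the Experiments SDK.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Mapping


@dataclass
class ExperimentRun:
    example_id: str
    output: Any      # what the task produced for this example
    expected: Any    # the example's reference output, if any


@dataclass
class Experiment:
    dataset_name: str
    dataset_version: int        # experiments are pinned to one version
    config: Dict[str, Any]      # what was varied: model, prompt, retriever...
    duration_seconds: float = 0.0
    runs: List[ExperimentRun] = field(default_factory=list)


def run_experiment(
    dataset_name: str,
    dataset_version: int,
    examples: Mapping[str, Mapping[str, Any]],
    task: Callable[[Any], Any],
    config: Dict[str, Any],
) -> Experiment:
    """Apply `task` to every example input and record outputs and duration."""
    experiment = Experiment(dataset_name, dataset_version, dict(config))
    start = time.perf_counter()
    for example_id, example in examples.items():
        experiment.runs.append(
            ExperimentRun(example_id, task(example["input"]), example.get("output"))
        )
    experiment.duration_seconds = time.perf_counter() - start
    return experiment


examples = {
    "e1": {"input": "hello", "output": "HELLO"},
    "e2": {"input": "world", "output": "WORLD"},
}
# Stand-in task; a real one would call an LLM with the configured prompt.
experiment = run_experiment(
    "golden", 2, examples, task=str.upper, config={"prompt_template": "v2"}
)
matches = sum(run.output == run.expected for run in experiment.runs)
```

Recording the dataset version and the configuration alongside the runs is what makes two experiments comparable: they can be checked to have used the same examples while differing only in the modified component.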

Experiment Output

Planning

[datasets][planning] ERD for datasets, dataset records, audit tables #3043

Infra

Tables

Rest API

GraphQL

Experiments SDK

[experiments] Evaluator Interface / Protocol #3356

UI

Tests

Bugs

[datasets][bug] dataset examples gql endpoint pulling from all revisions #3300
[datasets][ui] alert for dataset example created gets hidden under slideovers
[datasets] don't throw errors on CSV download (just return an empty CSV) #3309
[ENHANCEMENT] use consistent download URLs for dataset downloads #3308
[datasets][gql] ensure dataset examples query pulls only from dataset id #3332
[datasets][bug] fix dataset fixtures #3337
[datasets] [bug] messages in input output is string #3342
[datasets] make /upload endpoint return dataset in data payload #3363

mikeldking · 2024-05-21T01:12:47Z

Note that if a span does not meet certain criteria (like embeddings) it might make sense to avoid allowing it to be added to a dataset

axiomofjoy · 2024-05-21T01:35:10Z

> Note that if a span does not meet certain criteria (like embeddings) it might make sense to avoid allowing it to be added to a dataset

What other criteria can we think of?

mikeldking added the roadmap label Dec 28, 2023

mikeldking mentioned this issue Apr 30, 2024

create an arize-phoenix-client package #2914

Open

mikeldking changed the title ~~🗺️ Evaluation / Fine-Tuning Datasets~~ 🗺️ Datasets Apr 30, 2024

mikeldking mentioned this issue May 10, 2024

feat(datasets): datasets feature #3167

Draft

mikeldking mentioned this issue May 30, 2024

[ENHANCEMENT] build golden datasets or manual evals #3249

Open

mikeldking changed the title ~~🗺️ Datasets~~ 🗺️ Datasets and Experiments May 31, 2024

mikeldking mentioned this issue May 31, 2024

🗺️ Evaluation Experiment Tracking #2220

Closed

mikeldking assigned axiomofjoy, RogerHYang, anticorrelator and mikeldking May 31, 2024
