🗺️ Datasets and Experiments #2017

Open
40 of 87 tasks
mikeldking opened this issue Dec 28, 2023 · 2 comments
mikeldking commented Dec 28, 2023

As a user, I'd like to have the notion of a dataset of records over which I can run an application or a set of evals. Common dataset purposes are:

  • Golden Dataset - contains queries and golden responses
  • QA Asset - a set of difficult queries over which you want to perform regression testing
  • Sample of production data - pick and choose records from production to convert to an asset (overlaps with above)

Motivation

LLM outputs are non-deterministic, so teams need a repeatable way to evaluate the system. With datasets, teams can select a “test suite” of data points against which to evaluate changes. This lets them trust their application when they make modifications such as:

  • Modifying a prompt template
  • Iterating on various components of the application and checking for differences in output
  • Swapping in new model releases
  • Checking performance on a cheaper new model or a fine-tuned model

Use-cases

Datasets will contain data from various data sources:

Pre-deployment

  • Store and maintain a set of synthetic queries
  • Store and maintain a set of hand curated queries
  • Make a copy of Hugging Face / CSV data to be maintained internally
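
As a rough illustration of the CSV case, the sketch below turns CSV rows into records with the input / output / metadata shape described under Dataset Examples. The function name `examples_from_csv`, its parameters, and the column names are hypothetical and are not part of any existing API.

```python
import csv
import io


def examples_from_csv(text, input_columns, output_columns):
    """Hypothetical helper: convert CSV text into a list of example dicts."""
    examples = []
    for row in csv.DictReader(io.StringIO(text)):
        examples.append(
            {
                "input": {c: row[c] for c in input_columns},
                "output": {c: row[c] for c in output_columns},
                # Every remaining column is kept as metadata.
                "metadata": {
                    k: v
                    for k, v in row.items()
                    if k not in input_columns and k not in output_columns
                },
            }
        )
    return examples


# Illustrative data only.
CSV_TEXT = "question,answer,topic\nWhat is 2 + 2?,4,math\n"
csv_examples = examples_from_csv(CSV_TEXT, ["question"], ["answer"])
```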

Post-deployment

  • Move data from production or staging for regression testing or fine-tuning

Architecture

Dataset

A dataset maintains a set of records. The records are versioned: whenever a record is added, edited, or deleted, the change is tracked as a new version. Versions must be immutable so that code depending on a specific version always sees the same data.
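
One way to picture the versioning requirement is the in-memory sketch below, where every change appends a new read-only snapshot. It is an assumption-laden illustration, not the planned storage design: the class `VersionedDataset` and its methods are hypothetical, and a real implementation would live in database tables.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Any, Mapping, Optional


@dataclass
class VersionedDataset:
    """Hypothetical dataset where each change creates a new immutable version."""

    name: str
    # Version 0 is the empty dataset; snapshots are only ever appended.
    _versions: list = field(default_factory=lambda: [MappingProxyType({})])

    @property
    def latest_version(self) -> int:
        return len(self._versions) - 1

    def _commit(self, records: dict) -> int:
        # Store a read-only view so existing versions cannot be mutated.
        self._versions.append(MappingProxyType(records))
        return self.latest_version

    def upsert(self, record_id: str, record: Mapping[str, Any]) -> int:
        """Add or edit a record and return the new version number."""
        records = dict(self._versions[-1])
        records[record_id] = MappingProxyType(dict(record))
        return self._commit(records)

    def delete(self, record_id: str) -> int:
        """Delete a record and return the new version number."""
        records = dict(self._versions[-1])
        del records[record_id]
        return self._commit(records)

    def records(self, version: Optional[int] = None) -> Mapping[str, Any]:
        """Return the records as of a version (latest if omitted)."""
        return self._versions[self.latest_version if version is None else version]


# Illustrative data only.
ds = VersionedDataset("golden")
v1 = ds.upsert("r1", {"input": "What is 2 + 2?", "output": "4"})
v2 = ds.upsert("r1", {"input": "What is 2 + 2?", "output": "four"})
v3 = ds.delete("r1")
```

Code pinned to `v1` keeps seeing the original record even after the later edit and delete. The read-only views here are shallow, so nested mutable values would need extra care.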

Dataset Examples

A dataset is a set of examples. These examples contain:

  • input - data passed to an LLM, prompt, or function (e.g. a retriever)
  • expected / output - the result of the invocation of an LLM, prompt, or function
  • metadata - any additional information that can be used during experimentation

In addition to the above, a dataset record should optionally have:

  • metadata - any additional data associated with the record (e.g. attributes from a span)
  • source span_rowid / trace_rowid - if the data came from a span, it should link back to the source
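
The fields above could be modeled roughly as follows. This is a sketch under stated assumptions: the class `DatasetExample`, the helper `example_from_span`, and the span dictionary keys (`rowid`, `trace_rowid`, `input`, `output`, `attributes`) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class DatasetExample:
    """Hypothetical record shape: input, expected output, metadata, source links."""

    input: dict[str, Any]
    output: Optional[dict[str, Any]] = None
    metadata: dict[str, Any] = field(default_factory=dict)
    # Set only when the example was created from a span.
    source_span_rowid: Optional[int] = None
    source_trace_rowid: Optional[int] = None


def example_from_span(span: dict) -> DatasetExample:
    """Build an example from a span-like dict (assumed shape, see lead-in)."""
    return DatasetExample(
        input=span["input"],
        output=span.get("output"),
        metadata=dict(span.get("attributes", {})),
        source_span_rowid=span["rowid"],
        source_trace_rowid=span["trace_rowid"],
    )


# Illustrative data only.
span = {
    "rowid": 42,
    "trace_rowid": 7,
    "input": {"question": "What is 2 + 2?"},
    "output": {"answer": "4"},
    "attributes": {"model": "example-model"},
}
example = example_from_span(span)
```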

Dataset Experiment

A dataset experiment is a run over the examples of a dataset. Experiments are tied to a specific dataset version and have a duration of time. During an experiment, certain components of an LLM application are modified. This includes:

  • Change in LLM or LLM params
  • Change in prompt template
  • Change in retrieval strategy
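
To make the relationship concrete, here is a minimal sketch of an experiment that is pinned to one dataset version, records what was changed, and tracks its duration. The names `Experiment`, `ExperimentRun`, `run_experiment`, and the `task` callable are hypothetical and do not describe the eventual Experiments SDK.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Mapping, Optional


@dataclass
class ExperimentRun:
    example_id: str
    output: Any
    error: Optional[str] = None


@dataclass
class Experiment:
    dataset_name: str
    dataset_version: int  # the experiment is tied to exactly one version
    params: Mapping[str, Any]  # what changed: model, prompt template, retrieval
    started_at: float = 0.0
    duration_s: float = 0.0
    runs: list = field(default_factory=list)


def run_experiment(
    dataset_name: str,
    dataset_version: int,
    examples: Mapping[str, Mapping[str, Any]],
    task: Callable[[Any], Any],
    params: Mapping[str, Any],
) -> Experiment:
    """Run `task` on each example input and collect outputs or errors."""
    exp = Experiment(dataset_name, dataset_version, dict(params), started_at=time.time())
    t0 = time.perf_counter()
    for example_id, example in examples.items():
        try:
            exp.runs.append(ExperimentRun(example_id, task(example["input"])))
        except Exception as exc:  # keep going so one failure does not end the run
            exp.runs.append(ExperimentRun(example_id, None, repr(exc)))
    exp.duration_s = time.perf_counter() - t0
    return exp


# Illustrative data only; the second example has no question, so the task fails on it.
examples_v3 = {
    "ex-1": {"input": {"question": "What is 2 + 2?"}, "output": {"answer": "4"}},
    "ex-2": {"input": {}, "output": {"answer": "n/a"}},
}


def task(inp):
    # Stand-in for the application under test (LLM call, prompt, retriever).
    return {"answer": f"echo: {inp['question']}"}


experiment = run_experiment("golden", 3, examples_v3, task, {"prompt_template": "v2"})
```

Recording errors per example, instead of aborting, keeps runs comparable across experiments even when a change breaks some inputs.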

Experiment Output

Planning

Infra

Tables

Rest API

GraphQL

Experiments SDK

UI

Tests

Bugs

mikeldking changed the title from "🗺️ Evaluation / Fine-Tuning Datasets" to "🗺️ Datasets" on Apr 30, 2024
mikeldking commented

Note that if a span does not meet certain criteria (like embeddings) it might make sense to avoid allowing it to be added to a dataset

axiomofjoy commented

> Note that if a span does not meet certain criteria (like embeddings) it might make sense to avoid allowing it to be added to a dataset

What other criteria can we think of?
