Google Research Datasets

scin Public
The SCIN dataset contains 10,000+ images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-reported demographic and symptom information and dermatologist labels. The dataset also contains estimated Fitzpatrick skin type and Monk Skin Tone.

Jupyter Notebook 51 1 1 0 Updated May 8, 2024
MISeD Public
MISeD (Meeting Information Seeking Dialogs dataset) is an information-seeking dialog dataset focused on meeting transcripts. It includes 432 dialogs over transcripts from the QMSum dataset. MISeD is described in detail in the paper: Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts.

2 0 0 0 Updated May 6, 2024
cpcd Public
The Conversational Playlist Creation Dataset (CPCD) contains 917 conversations between two people, in which users express preferences over sets of songs in natural language and wizards elicit preferences from users. The dataset includes per-song ratings and can be used to design and evaluate conversational recommendation systems.

Python 8 2 1 1 Updated May 3, 2024
indic-gen-bench Public
IndicGenBench is a high-quality, multilingual, multi-way parallel benchmark for evaluating Large Language Models (LLMs) on 4 user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families.

19 2 0 0 Updated May 2, 2024
adversarial-nibbler Public
This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative text2image models and validations for unsafe images. There will be two sets of data: all prompts submitted and all prompts attempted (sent to t2i models but not submitted as unsafe).

3 CC0-1.0 0 0 0 Updated Apr 29, 2024
thesios Public
This repository describes I/O traces of Google storage servers and disks synthesized by Thesios. Thesios synthesizes representative I/O traces by combining down-sampled I/O traces collected from multiple disks (HDDs) attached to multiple storage servers in Google's distributed storage system.

4 0 0 0 Updated Apr 29, 2024
D3code Public
D3code is a large-scale cross-cultural dataset of parallel annotations for offensive language detection by over 4k annotators, balanced across gender and age, from across 21 countries, representing eight geo-cultural regions.

0 CC0-1.0 0 0 0 Updated Apr 25, 2024
Taskmaster Public
Please see the README file as well as our 2019 EMNLP paper.

188 57 4 0 Updated Apr 24, 2024
Crosslingual-Morphosyntactic-Divergence-dataset Public
This repository contains the annotations from the paper "To Diverge or Not to Diverge: A Morphosyntactic Perspective on Machine Translation vs Human Translation."

0 0 0 0 Updated Mar 25, 2024
QuoteSum Public
QuoteSum is a textual QA dataset containing Semi-Extractive Multi-source Question Answering (SEMQA) examples written by humans, based on Wikipedia passages.

Python 8 CC-BY-SA-4.0 0 0 0 Updated Mar 25, 2024
