@google-research-datasets

Google Research Datasets

Datasets released by Google Research

Pinned

  1. natural-questions Public

    Natural Questions (NQ) contains real user questions issued to Google search, and answers found from Wikipedia by annotators. NQ is designed for the training and evaluation of automatic question ans…

    Python 890 153

  2. conceptual-captions Public

    Conceptual Captions is a dataset containing (image-URL, caption) pairs designed for the training and evaluation of machine learned image captioning systems.

    Shell 489 24

  3. Objectron Public

    Objectron is a dataset of short, object-centric video clips. In addition, the videos also contain AR session metadata including camera poses, sparse point-clouds and planes. In each video, the came…

    Jupyter Notebook 2.2k 264

  4. wit Public

    WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

    958 39

  5. paws Public

    This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase ident…

    Python 535 52

  6. dstc8-schema-guided-dialogue Public

    The Schema-Guided Dialogue Dataset

    Python 520 120

Repositories

Showing 10 of 153 repositories
  • scin Public

    The SCIN dataset contains 10,000+ images of dermatology conditions, crowdsourced with informed consent from US internet users. Contributions include self-reported demographic and symptom information and dermatologist labels. The dataset also contains estimated Fitzpatrick skin type and Monk Skin Tone.

    Jupyter Notebook 51 1 1 0 Updated May 8, 2024
  • MISeD Public

    MISeD (Meeting Information Seeking Dialogs dataset) is an information-seeking dialog dataset focused on meeting transcripts. It includes 432 dialogs over transcripts from the QMSum dataset. MISeD is described in detail in the paper: Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts.

    2 0 0 0 Updated May 6, 2024
  • cpcd Public

    The Conversational Playlist Creation Dataset (CPCD) contains 917 conversations between two people, in which users express preferences over sets of songs in natural language and wizards elicit preferences from users. The dataset includes per-song ratings and can be used to design and evaluate conversational recommendation systems.

    Python 8 2 1 1 Updated May 3, 2024
  • indic-gen-bench Public

    IndicGenBench is a high-quality, multilingual, multi-way parallel benchmark for evaluating Large Language Models (LLMs) on 4 user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families.

    19 2 0 0 Updated May 2, 2024
  • adversarial-nibbler Public

    This dataset contains results from all rounds of Adversarial Nibbler. This data includes adversarial prompts fed into public generative text2image models and validations for unsafe images. There will be two sets of data: all prompts submitted and all prompts attempted (sent to t2i models but not submitted as unsafe).

    3 CC0-1.0 0 0 0 Updated Apr 29, 2024
  • thesios Public

    This repository describes I/O traces of Google storage servers and disks synthesized by Thesios. Thesios synthesizes representative I/O traces by combining down-sampled I/O traces collected from multiple disks (HDDs) attached to multiple storage servers in Google's distributed storage system.

    4 0 0 0 Updated Apr 29, 2024
  • D3code Public

    D3code is a large-scale cross-cultural dataset of parallel annotations for offensive language detection by over 4k annotators, balanced across gender and age, from 21 countries representing eight geo-cultural regions.

    0 CC0-1.0 0 0 0 Updated Apr 25, 2024
  • Taskmaster Public

    Please see the readme file as well as our 2019 EMNLP paper linked here -->

    188 57 4 0 Updated Apr 24, 2024
  • Crosslingual-Morphosyntactic-Divergence-dataset Public

    This repository contains the annotations from the paper "To Diverge or Not to Diverge: A Morphosyntactic Perspective on Machine Translation vs Human Translation."

    0 0 0 0 Updated Mar 25, 2024
  • QuoteSum Public

    QuoteSum is a textual QA dataset containing Semi-Extractive Multi-source Question Answering (SEMQA) examples written by humans, based on Wikipedia passages.

    Python 8 CC-BY-SA-4.0 0 0 0 Updated Mar 25, 2024
