The M-AILABS Speech Dataset

The M-AILABS Speech Dataset is the first large dataset that we are providing free of charge, freely usable as training data for speech recognition and speech synthesis.

Most of the data is based on LibriVox and Project Gutenberg. The training data consists of nearly a thousand hours of audio, together with the text files in a prepared format.

A transcription is provided for each clip. Clips vary in length from 1 to 20 seconds; their approximate total length is shown in the list below (and in the respective info.txt files).
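As a quick sanity check after downloading, the clip lengths described above can be verified locally. The following is a minimal sketch, not part of the dataset tooling: it assumes the clips are uncompressed WAV files somewhere under a local directory, and the function names and the directory path in the usage note are hypothetical.

```python
import wave
from pathlib import Path


def clip_duration_seconds(path):
    """Return the duration of one WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def summarize_clips(root, min_s=1.0, max_s=20.0):
    """Scan `root` recursively for *.wav files (assumed format).

    Returns the total length in hours and a list of (path, seconds)
    pairs for clips that fall outside the [min_s, max_s] range.
    """
    total_seconds = 0.0
    outliers = []
    for path in sorted(Path(root).rglob("*.wav")):
        duration = clip_duration_seconds(path)
        total_seconds += duration
        if not (min_s <= duration <= max_s):
            outliers.append((path, duration))
    return total_seconds / 3600.0, outliers
```

Calling `summarize_clips("path/to/extracted/dataset")` (a placeholder path) returns the total hours, which can be compared against the figure in the corresponding info.txt file, plus any clips outside the 1 to 20 second range.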

The texts were published between 1884 and 1964, and are in the public domain. The audio was recorded by the LibriVox project and is also in the public domain – except for Ukrainian.

Ukrainian audio was kindly provided by either Nash Format or Gwara Media for machine learning purposes only (please check the info.txt files in the data for details).

Before downloading, please read the license agreement at the bottom of this posting!

You can download the M-AILABS Speech Dataset from my website.

This repository is only a placeholder for the actual Dataset.
