Integrate HPLT Datasets v1.2 as a monolingual dataset #537

Open
gregtatum opened this issue Apr 25, 2024 · 0 comments

gregtatum commented Apr 25, 2024

The data was produced from web crawls and is available in a cleaned version. It includes language detection via FastSpell (a combination of FastText and Hunspell) and fluency scoring (a 7-gram modified Kneser-Ney character language model).

This fluency score can be used to estimate the ‘quality’ of paragraphs in the document, allowing to filter out noise that may be detrimental for training language models.

https://arxiv.org/abs/2403.14009

The data is distributed as JSONL. Each line is one document, and the paragraphs within a document's "text" field are newline delimited.

Example line (pretty-printed here for readability):

{
  "id": 65,
  "document_lang": "fi",
  "scores": [
    0.826,
    0.386,
    0.789,
    ...
  ],
  "langs": [
    "fi",
    "en",
    "fi",
    ...
  ],
  "text": "Tulevaisuuden työelämä vaatii uudenlaista osaamista - DigiMaMa\nSkip to content\nLiity jäseneksi\n...",
  "url": "https://www.digimama.fi/artikkelit/tulevaisuuden-tyoelama-vaatii-uudenlaista-osaamista/",
  "collection": "cc40"
}
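As a rough sketch of how one such line could be read, the snippet below parses a document and pairs each paragraph with its score and detected language. It assumes that "scores" and "langs" are parallel to the newline-delimited paragraphs in "text", which the example suggests but which should be verified against the real files. The names Paragraph and iter_paragraphs, and the sample document, are hypothetical.

```python
import json
from typing import Iterator, NamedTuple


class Paragraph(NamedTuple):
    text: str
    score: float
    lang: str


def iter_paragraphs(jsonl_line: str) -> Iterator[Paragraph]:
    """Yield one Paragraph per newline-delimited paragraph of a document."""
    doc = json.loads(jsonl_line)
    texts = doc["text"].split("\n")
    scores = doc["scores"]
    langs = doc["langs"]
    # Assumption: the three lists are parallel. Fail loudly if they are not.
    if not (len(texts) == len(scores) == len(langs)):
        raise ValueError(f"document {doc.get('id')}: misaligned paragraph fields")
    for text, score, lang in zip(texts, scores, langs):
        yield Paragraph(text, score, lang)


# Hypothetical three-paragraph document, shaped like the example above.
sample = json.dumps({
    "id": 1,
    "document_lang": "fi",
    "scores": [0.826, 0.386, 0.789],
    "langs": ["fi", "en", "fi"],
    "text": "Ensimmäinen kappale\nSkip to content\nKolmas kappale",
})
paragraphs = list(iter_paragraphs(sample))
```

Raising on misaligned fields is a deliberate choice in this sketch: silently zipping lists of different lengths would attach scores to the wrong paragraphs.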

In order to integrate this data source we would need to locate and download the files. These are structured logically and documented here: https://hplt-project.org/datasets/v1.2

We would want to use the clean data.

Then for each document, we would need to split at the "paragraph" level, which is newline delimited. Optionally we could include a hyperparameter to combine multiple paragraphs into one.
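One possible shape for the paragraph split with an optional combining hyperparameter is sketched below. The function name combine_paragraphs and the parameter paragraphs_per_chunk are hypothetical, and joining with a single space is an arbitrary choice for illustration.

```python
from typing import Iterator


def combine_paragraphs(text: str, paragraphs_per_chunk: int = 1) -> Iterator[str]:
    """Split a document's text on newlines and regroup consecutive paragraphs.

    With paragraphs_per_chunk=1 (the default) each paragraph is yielded as-is.
    """
    if paragraphs_per_chunk < 1:
        raise ValueError("paragraphs_per_chunk must be at least 1")
    # Empty or whitespace-only paragraphs are dropped before grouping.
    paragraphs = [p for p in text.split("\n") if p.strip()]
    for start in range(0, len(paragraphs), paragraphs_per_chunk):
        yield " ".join(paragraphs[start:start + paragraphs_per_chunk])


single = list(combine_paragraphs("a\nb\nc"))
paired = list(combine_paragraphs("a\nb\nc", paragraphs_per_chunk=2))
```

Since the scores appear to be per paragraph, any score-based filtering would likely need to run before paragraphs are combined.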

Then we would need to decide on a score threshold, which is another hyperparameter.
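A minimal sketch of the threshold step follows. It assumes a higher score means more fluent text and that scores line up one-to-one with paragraphs. The function name filter_paragraphs, the default of 0.5, and the optional language check are all illustrative placeholders, not values taken from HPLT or the pipeline.

```python
from typing import List, Optional, Sequence


def filter_paragraphs(
    texts: Sequence[str],
    scores: Sequence[float],
    langs: Sequence[str],
    min_score: float = 0.5,  # placeholder value, to be tuned
    required_lang: Optional[str] = None,
) -> List[str]:
    """Keep paragraphs whose score is at least min_score.

    If required_lang is given, also drop paragraphs detected as another language.
    """
    kept = []
    for text, score, lang in zip(texts, scores, langs):
        if score < min_score:
            continue
        if required_lang is not None and lang != required_lang:
            continue
        kept.append(text)
    return kept


# Scores and languages taken from the first three entries of the example line.
texts = ["first", "second", "third"]
scores = [0.826, 0.386, 0.789]
langs = ["fi", "en", "fi"]
kept = filter_paragraphs(texts, scores, langs, min_score=0.5, required_lang="fi")
```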

I think with this we would be good to use the data in the pipeline.

| Language | Code | Docs | Words |
| --- | --- | --- | --- |
| Afrikaans | af | 747.23K | 829.49M |
| Arabic | ar | 26.80M | 31.85B |
| Azerbaijani | az | 1.10M | 1.13B |
| Belarusian | be | 356.53K | 394.19M |
| Bulgarian | bg | 6.50M | 8.76B |
| Bangla | bn | 2.88M | 2.77B |
| Catalan | ca | 4.54M | 5.76B |
| Czech | cs | 16.99M | 19.11B |
| Welsh | cy | 111.25K | 124.06M |
| Danish | da | 8.18M | 9.37B |
| German | de | 101.41M | 110.98B |
| Greek | el | 15.83M | 33.76B |
| English | en | 1.02B | 2.31T |
| Esperanto | eo | 67.81K | 101.70M |
| Spanish | es | 129.29M | 181.23B |
| Estonian | et | 1.48M | 1.74B |
| Basque | eu | 343.95K | 324.64M |
| Persian | fa | 30.90M | 47.58B |
| Finnish | fi | 7.15M | 9.04B |
| French | fr | 99.59M | 122.88B |
| Irish | ga | 115.53K | 130.68M |
| Galician | gl | 731.36K | 847.40M |
| Gujarati | gu | 264.82K | 303.63M |
| Serbo-Croatian | hbs | 8.68M | 10.03B |
| Hebrew | he | 4.98M | 7.49B |
| Hindi | hi | 5.77M | 7.54B |
| Hungarian | hu | 11.71M | 14.39B |
| Armenian | hy | 621.47K | 589.95M |
| Indonesian | id | 31.42M | 42.08B |
| Icelandic | is | 481.33K | 562.01M |
| Italian | it | 53.53M | 74.45B |
| Japanese | ja | 190.41M | 63.23B |
| Georgian | ka | 533.07K | 573.88M |
| Kazakh | kk | 406.35K | 471.76M |
| Kannada | kn | 228.22K | 235.58M |
| Korean | ko | 31.85M | 25.52B |
| Kyrgyz | ky | 88.32K | 101.62M |
| Latin | la | 301.70K | 294.13M |
| Lithuanian | lt | 2.72M | 2.95B |
| Latvian | lv | 1.54M | 1.59B |
| Macedonian | mk | 734.69K | 736.55M |
| Malayalam | ml | 469.98K | 517.83M |
| Mongolian | mn | 594.90K | 803.21M |
| Marathi | mr | 453.69K | 519.55M |
| Malay | ms | 4.87M | 9.03B |
| Maltese | mt | 111.12K | 102.42M |
| Burmese | my | 239.47K | 357.11M |
| Norwegian Bokmål | nb | 6.12M | 8.30B |
| Nepali | ne | 863.35K | 694.40M |
| Dutch | nl | 31.75M | 33.30B |
| Norwegian Nynorsk | nn | 228.48K | 298.57M |
| Punjabi | pa | 152.78K | 184.77M |
| Polish | pl | 39.38M | 44.17B |
| Pashto | ps | 88.21K | 113.19M |
| Portuguese | pt | 58.24M | 81.41B |
| Romanian | ro | 14.47M | 19.49B |
| Russian | ru | 224.20M | 284.58B |
| Sinhala | si | 322.51K | 568.03M |
| Slovak | sk | 4.62M | 4.98B |
| Slovenian | sl | 2.20M | 2.51B |
| Somali | so | 283.71K | 211.80M |
| Albanian | sq | 1.24M | 1.34B |
| Swedish | sv | 13.67M | 16.91B |
| Swahili | sw | 698.57K | 668.17M |
| Tamil | ta | 1.24M | 1.91B |
| Telugu | te | 415.60K | 437.74M |
| Thai | th | 8.19M | 4.33B |
| Filipino | tl | 585.24K | 911.06M |
| Turkish | tr | 27.05M | 42.65B |
| Tatar | tt | 65.15K | 74.86M |
| Ukrainian | uk | 9.31M | 10.57B |
| Urdu | ur | 1.44M | 1.42B |
| Uzbek | uz | 290.29K | 367.25M |
| Vietnamese | vi | 31.50M | 49.36B |
| Chinese | zh | 1.08B | 432.88B |