The data was produced from web crawls and ships with a cleaned version. It includes language detection via FastSpell (a combination of FastText and Hunspell) as well as fluency scoring (a 7-gram modified Kneser-Ney character language model). This fluency score can be used to estimate the "quality" of paragraphs in a document, allowing us to filter out noise that may be detrimental for training language models.
In order to integrate this data source we would need to locate and download the files. These are structured logically and documented here: https://hplt-project.org/datasets/v1.2
We would want to use the clean data.
Then for each document, we would need to split at the "paragraph" level, which is newline delimited. Optionally we could include a hyperparameter to combine multiple paragraphs into one.
Then we would need to decide on a score threshold, which is another hyperparameter.
I think with this we would be ready to use the data in the pipeline.
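The split/filter/combine steps above could be sketched roughly as follows. This is a minimal sketch, not a final implementation: the `"text"` and `"scores"` field names are assumptions about the JSONL schema and should be checked against the dataset docs, and `score_threshold` / `combine` are the two hyperparameters discussed above.

```python
import json

def iter_paragraphs(jsonl_path, score_threshold=0.5, combine=1):
    """Yield fluency-filtered paragraphs from an HPLT-style JSONL file.

    `score_threshold` and `combine` are the two hyperparameters discussed
    above; "text" and "scores" are assumed field names.
    """
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            # Documents are newline-delimited at the paragraph level.
            paragraphs = doc["text"].split("\n")
            scores = doc.get("scores", [1.0] * len(paragraphs))
            # Keep only paragraphs at or above the fluency threshold.
            kept = [p for p, s in zip(paragraphs, scores) if s >= score_threshold]
            # Optionally merge runs of `combine` consecutive kept paragraphs.
            for i in range(0, len(kept), combine):
                yield "\n".join(kept[i : i + combine])
```

With `combine=1` this yields one unit per surviving paragraph; larger values glue consecutive surviving paragraphs back together, which may help when single paragraphs are too short to be useful training examples.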
| Language | Code | Docs | Words |
| --- | --- | --- | --- |
| Afrikaans | af | 747.23K | 829.49M |
| Arabic | ar | 26.80M | 31.85B |
| Azerbaijani | az | 1.10M | 1.13B |
| Belarusian | be | 356.53K | 394.19M |
| Bulgarian | bg | 6.50M | 8.76B |
| Bangla | bn | 2.88M | 2.77B |
| Catalan | ca | 4.54M | 5.76B |
| Czech | cs | 16.99M | 19.11B |
| Welsh | cy | 111.25K | 124.06M |
| Danish | da | 8.18M | 9.37B |
| German | de | 101.41M | 110.98B |
| Greek | el | 15.83M | 33.76B |
| English | en | 1.02B | 2.31T |
| Esperanto | eo | 67.81K | 101.70M |
| Spanish | es | 129.29M | 181.23B |
| Estonian | et | 1.48M | 1.74B |
| Basque | eu | 343.95K | 324.64M |
| Persian | fa | 30.90M | 47.58B |
| Finnish | fi | 7.15M | 9.04B |
| French | fr | 99.59M | 122.88B |
| Irish | ga | 115.53K | 130.68M |
| Galician | gl | 731.36K | 847.40M |
| Gujarati | gu | 264.82K | 303.63M |
| Serbo-Croatian | hbs | 8.68M | 10.03B |
| Hebrew | he | 4.98M | 7.49B |
| Hindi | hi | 5.77M | 7.54B |
| Hungarian | hu | 11.71M | 14.39B |
| Armenian | hy | 621.47K | 589.95M |
| Indonesian | id | 31.42M | 42.08B |
| Icelandic | is | 481.33K | 562.01M |
| Italian | it | 53.53M | 74.45B |
| Japanese | ja | 190.41M | 63.23B |
| Georgian | ka | 533.07K | 573.88M |
| Kazakh | kk | 406.35K | 471.76M |
| Kannada | kn | 228.22K | 235.58M |
| Korean | ko | 31.85M | 25.52B |
| Kyrgyz | ky | 88.32K | 101.62M |
| Latin | la | 301.70K | 294.13M |
| Lithuanian | lt | 2.72M | 2.95B |
| Latvian | lv | 1.54M | 1.59B |
| Macedonian | mk | 734.69K | 736.55M |
| Malayalam | ml | 469.98K | 517.83M |
| Mongolian | mn | 594.90K | 803.21M |
| Marathi | mr | 453.69K | 519.55M |
| Malay | ms | 4.87M | 9.03B |
| Maltese | mt | 111.12K | 102.42M |
| Burmese | my | 239.47K | 357.11M |
| Norwegian Bokmål | nb | 6.12M | 8.30B |
| Nepali | ne | 863.35K | 694.40M |
| Dutch | nl | 31.75M | 33.30B |
| Norwegian Nynorsk | nn | 228.48K | 298.57M |
| Punjabi | pa | 152.78K | 184.77M |
| Polish | pl | 39.38M | 44.17B |
| Pashto | ps | 88.21K | 113.19M |
| Portuguese | pt | 58.24M | 81.41B |
| Romanian | ro | 14.47M | 19.49B |
| Russian | ru | 224.20M | 284.58B |
| Sinhala | si | 322.51K | 568.03M |
| Slovak | sk | 4.62M | 4.98B |
| Slovenian | sl | 2.20M | 2.51B |
| Somali | so | 283.71K | 211.80M |
| Albanian | sq | 1.24M | 1.34B |
| Swedish | sv | 13.67M | 16.91B |
| Swahili | sw | 698.57K | 668.17M |
| Tamil | ta | 1.24M | 1.91B |
| Telugu | te | 415.60K | 437.74M |
| Thai | th | 8.19M | 4.33B |
| Filipino | tl | 585.24K | 911.06M |
| Turkish | tr | 27.05M | 42.65B |
| Tatar | tt | 65.15K | 74.86M |
| Ukrainian | uk | 9.31M | 10.57B |
| Urdu | ur | 1.44M | 1.42B |
| Uzbek | uz | 290.29K | 367.25M |
| Vietnamese | vi | 31.50M | 49.36B |
| Chinese | zh | 1.08B | 432.88B |
https://arxiv.org/abs/2403.14009
The data comes in as JSONL: each line is one document, and the document's text is newline-delimited into paragraphs.
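A rough sketch of what one such line might look like and how we would split it. The record below is illustrative only, under the assumption of per-paragraph fluency scores; the field names are not confirmed against the actual HPLT v1.2 schema:

```python
import json

# Illustrative record only -- field names are assumptions, not the
# actual HPLT v1.2 schema; see the dataset docs for the real format.
line = json.dumps({
    "id": "doc-000001",
    "document_lang": "en",
    "text": "First paragraph.\nSecond paragraph.",
    "scores": [0.92, 0.41],  # one fluency score per paragraph
})

doc = json.loads(line)
paragraphs = doc["text"].split("\n")  # newline-delimited paragraphs
```

The key point is that paragraph boundaries and per-paragraph scores stay aligned, so a threshold can be applied per paragraph rather than per document.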