Skip to content

dumitrescustefan/RO-STS

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

39 Commits
 
 
 
 
 
 
 
 

Repository files navigation

License

RO-STS: Romanian Semantic Textual Similarity Dataset

This dataset is the Romanian version of the STS dataset. It is a high-quality translation of the aforementioned dataset, containing 8628 pairs of sentences with their similarity score. The dataset respects the same split: 5749 train, 1500 dev and 1379 test sentence pairs.

Dataset format

The dataset is offered in two flavours:

1. Textual similarity dataset

1.5	Un bărbat cântă la harpă.	Un bărbat cântă la claviatură.
1.8	O femeie taie cepe.	O femeie taie tofu.
3.5	Un bărbat merge pe o bicicletă electrică.	Un bărbat merge pe bicicletă.
2.2	Un bărbat cântă la tobe.	Un bărbat cântă la chitară.
2.2	Un bărbat cântă la chitară.	O doamnă cântă la chitară.

The train/dev/test splits are identical to the original English STS corpus splits.

Direct download link:

More information in the dataset folder.

2. Parallel corpus (RO-EN)

The parallel corpus is a direct result of the translation process. It can be used as-is in any other downstream NLP task. It is split in 3 train/dev/test pair of ro & en files, totaling 6 files. It is formatted in the standard one-sentence per line.

Direct download link, as a single zip file containing all the ro-en files: RO-STS.ro-en.zip

For more information and the unzipped files go to the dataset folder.

Baseline evaluation

We provide 2 baselines for this dataset, a transformer-based model and a recurrent neural network model. Both models were trained on the train set until the Pearson score did not improve on the dev set, and results are reported on both dev and test sets.

Model # of parameters Dev-set Pearson Dev-set Spearman Test-set Pearson Test-set Spearman
RNN 16.7M 0.7342 0.7349 0.6744 0.6662
Romanian BERT v1 (uncased) 124M 0.8459 0.8426 0.8159 0.8086
Romanian BERT v1 (cased) 124M 0.8426 0.8409 0.7911 0.7826
Multilingual BERT (uncased) 167M 0.8237 0.8235 0.7690 0.7650
Multilingual BERT (cased) 167M 0.8071 0.8077 0.7664 0.7641

For more details on how to reproduce these scores please check out the detailed evaluation page.

Creation process

The dataset was created in three steps:

  1. Automatic translation with Google's translation service.
  2. Correction round by a person that rectified all errors resulting from the automatic translation - and there were plenty.
  3. Validation by a different person that double-checked the translation.

This process has ensured the high-quality translation of this dataset.

Here are the annotators/contributors, alphabetically listed:

Licensing

This work, like it's original, is licensed as CC BY-SA 4.0. That means you're free to do anything you want with it, as long as you keep the same license.

Citation

Dumitrescu, S. D., Rebeja, P., Lorincz, B., Gaman, M., Avram, A., Ilie, M., Pruteanu, A., Stan, A., Rosia, L., Iacobescu, C., & others. (2021). Liro: Benchmark and leaderboard for romanian language tasks. Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
@inproceedings{dumitrescu2021liro,
  title={Liro: Benchmark and leaderboard for romanian language tasks},
  author={Dumitrescu, Stefan Daniel and Rebeja, Petru and Lorincz, Beata and Gaman, Mihaela and Avram, Andrei and Ilie, Mihai and Pruteanu, Andrei and Stan, Adriana and Rosia, Lorena and Iacobescu, Cristina and others},
  booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
  year={2021}
}