text-normalization

Dataset prepared for the publication "Numbers Normalisation in the Inflected Languages: a Case Study of Polish".

The zip archive contains two files:

  • data_set.txt (file with our whole dataset),
  • tts_answers.txt (file with written answers from TTS systems: Google Cloud Text-to-Speech and Amazon Polly).

In the TTS answers, the <dav> tag stands for "different accepted version". It marks cases where a TTS system normalizes a phrase differently from our dataset, but the normalization is still acceptable. We treat these cases as correct normalizations.
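The exact layout of tts_answers.txt is not described here, so the following is only a minimal sketch of how the <dav> marker might be handled when scoring TTS output. It assumes the marker appears literally as `<dav>` (optionally with a closing `</dav>`) somewhere in a line. The helper names `has_dav`, `strip_dav`, and `count_dav_lines` are hypothetical and not part of the dataset.

```python
import re

# Assumption: the marker is written literally as "<dav>", possibly paired
# with "</dav>". Adjust the pattern if the file uses a different form.
DAV_TAG = re.compile(r"</?dav>")


def has_dav(line: str) -> bool:
    """Return True if the line carries a <dav> marker."""
    return DAV_TAG.search(line) is not None


def strip_dav(line: str) -> str:
    """Remove <dav> markers and collapse the whitespace left behind."""
    return " ".join(DAV_TAG.sub(" ", line).split())


def count_dav_lines(lines) -> int:
    """Count lines marked as a different accepted version."""
    return sum(1 for line in lines if has_dav(line))
```

Because marked answers count as correct, a scoring script could call `has_dav` first and only fall back to an exact comparison with the reference when no marker is present.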

If you use the dataset from this repository, please cite:

@inproceedings{poswiata-perelkiewicz-2019-numbers,
    title = "Numbers Normalisation in the Inflected Languages: a Case Study of {P}olish",
    author = "Po{\'s}wiata, Rafa{\l} and Pere{\l}kiewicz, Micha{\l}",
    booktitle = "Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W19-3703",
    doi = "10.18653/v1/W19-3703",
    pages = "23--28"
}