text-normalization

Dataset prepared for the publication "Numbers Normalisation in the Inflected Languages: a Case Study of Polish".

The zip archive contains two files:

data_set.txt (file with our whole dataset),
tts_answers.txt (file with written answers from TTS systems: Google Cloud Text-to-Speech and Amazon Polly).
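
As a starting point for working with the archive, here is a minimal sketch that reads every .txt member into a list of lines. The helper name `read_archive` is hypothetical, and both the UTF-8 encoding and the line-oriented layout of the files are assumptions, since the README does not describe the file format.

```python
import zipfile
from typing import Dict, List


def read_archive(zip_path: str, encoding: str = "utf-8") -> Dict[str, List[str]]:
    """Read every .txt member of the archive into a list of lines.

    Assumption: the files are plain text in the given encoding.
    """
    contents: Dict[str, List[str]] = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith(".txt"):
                text = archive.read(name).decode(encoding)
                contents[name] = text.splitlines()
    return contents
```

Calling `read_archive("dataset_and_tts_answers.zip")` should return a dictionary keyed by member name. If the files sit inside a subfolder of the archive, the keys will include that folder prefix.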

The TTS answers contain <dav> tags, which stand for "different accepted version". A tag marks a case where the TTS system normalizes a phrase differently than we do in our dataset, but the result is still acceptable. We treat these cases as correct normalizations.
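
The rule above could be applied when scoring TTS answers roughly as follows. This is only a sketch: the function names `split_dav` and `is_correct` are hypothetical, and it assumes the marker appears literally as `<dav>`, optionally paired with a closing `</dav>`, which the README does not specify.

```python
import re
from typing import Tuple

# Assumption: markers are written literally as <dav> and, if paired, </dav>.
DAV_TAG = re.compile(r"</?dav>")


def split_dav(line: str) -> Tuple[str, bool]:
    """Return the line with <dav> markers removed and whether any were present."""
    has_dav = DAV_TAG.search(line) is not None
    cleaned = " ".join(DAV_TAG.sub("", line).split())
    return cleaned, has_dav


def is_correct(tts_answer: str, reference: str) -> bool:
    """Count an answer as correct if it matches the reference or carries a <dav> marker."""
    cleaned, has_dav = split_dav(tts_answer)
    return has_dav or cleaned == " ".join(reference.split())
```

Whitespace is collapsed before comparison so that removing a marker does not leave a spurious mismatch.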

If you use the dataset from this repository, please cite:

@inproceedings{poswiata-perelkiewicz-2019-numbers,
    title = "Numbers Normalisation in the Inflected Languages: a Case Study of {P}olish",
    author = "Po{\'s}wiata, Rafa{\l} and Pere{\l}kiewicz, Micha{\l}",
    booktitle = "Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W19-3703",
    doi = "10.18653/v1/W19-3703",
    pages = "23--28"
}
