emanuelzaymus/Diacritics

Diacritics

What it is

Diacritics reconstruction (restoration) for Slovak text, based on finding the best match in n-grams (an n-gram is a group of n words that usually occur together in a language). This program was created for a Bachelor's thesis at the Faculty of Management Science and Informatics, University of Žilina.

How it works

The program uses data from the Slovak National Corpus of the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. We used the language corpus prim-8.0-public-all, which consists of 1.5 billion tokens (namely its subcorpora of 4-grams, 3-grams, 2-grams and words). You can find them all here. The algorithm reconstructs every word separately. It uses a trie data structure for fast access to the list of candidate n-grams for each word without diacritics. This list consists only of n-grams that contain the word. In addition, the list is grouped by n (from 4-grams down to 1-grams) and sorted by absolute frequency of occurrence in the language. The n-grams are then compared one by one with the word and its surrounding words until a match is found, and the word is replaced with the diacritics form found in the matching n-gram.
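The lookup described above can be illustrated with a minimal Python sketch. It is not the code of this program: the names `NgramTrie`, `restore` and `strip_diacritics` are hypothetical, the n-gram counts are invented for the example rather than taken from the corpus, and the sketch assumes lower-cased input tokens and only 1-grams and 2-grams to stay short.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. 'súd' -> 'sud'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class TrieNode:
    def __init__(self):
        self.children = {}
        self.ngrams = []  # entries: (n, count, words, position of the key word)


class NgramTrie:
    """Trie keyed by the non-diacritics form of each word (hypothetical layout)."""

    def __init__(self):
        self.root = TrieNode()

    def _node(self, key, create=False):
        node = self.root
        for ch in key:
            if ch not in node.children:
                if not create:
                    return None
                node.children[ch] = TrieNode()
            node = node.children[ch]
        return node

    def add(self, ngram, count):
        words = tuple(ngram.lower().split())
        # Register the n-gram under every word it contains.
        for pos, word in enumerate(words):
            node = self._node(strip_diacritics(word), create=True)
            node.ngrams.append((len(words), count, words, pos))

    def finalize(self):
        # Group by n (longest first), then sort by occurrence count (highest first).
        stack = [self.root]
        while stack:
            node = stack.pop()
            node.ngrams.sort(key=lambda e: (-e[0], -e[1]))
            stack.extend(node.children.values())

    def candidates(self, word):
        node = self._node(word)
        return node.ngrams if node else []


def restore(tokens, trie):
    """Restore each token separately using the first n-gram that fits its context."""
    result = []
    for i, token in enumerate(tokens):
        restored = token  # fall back to the input when nothing matches
        for n, _count, words, pos in trie.candidates(token):
            start = i - pos
            if start < 0 or start + n > len(tokens):
                continue  # the n-gram would reach outside the sentence
            window = tokens[start:start + n]
            if all(strip_diacritics(w) == t for w, t in zip(words, window)):
                restored = words[pos]
                break
        result.append(restored)
    return result


# Toy data with invented counts (assumption, not corpus values).
trie = NgramTrie()
for ngram, count in [
    ("najvyšší súd", 50), ("sud vína", 20), ("najvyšší", 40),
    ("súd", 30), ("sud", 10), ("vina", 25), ("vína", 15),
]:
    trie.add(ngram, count)
trie.finalize()

print(restore(["najvyssi", "sud"], trie))  # ['najvyšší', 'súd']
print(restore(["sud", "vina"], trie))      # ['sud', 'vína']
```

In the second call the 2-gram "sud vína" wins over the more frequent 1-grams "súd" and "vina" because longer n-grams are tried first, which is the effect of grouping the list by n before sorting by occurrence.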

More information

Used technologies

Final software

There are two final versions of the program. The first, faster one (0.4 ms per word) uses RAM only and has a success rate of 98.07%. The second, slower one (4 ms per word) uses the hard disk and has a success rate of 98.17%. Here you will find:

  • DLL ready to use
  • Simple website for easy, user-friendly interaction with the program

Try it here

diakritika.fri.uniza.sk

To run it you need to download these files:

https://www.dropbox.com/s/7uraxif4ocfay8k/diacritics-reconstructor-necessary-files.zip?dl=0