Diacritics

What it is

Diacritics reconstruction (restoration) for Slovak text, based on finding the best match among word n-grams (an n-gram is a sequence of n words that commonly occur together in a language). This program was created for a bachelor's thesis at the Faculty of Management Science and Informatics, University of Žilina.

How it works

The program uses data from the Slovak National Corpus of the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. We used the prim-8.0-public-all language corpus, which is built from 1.5 billion tokens (namely its subcorpora of 4-grams, 3-grams, 2-grams and single words). You can find them all here.

The algorithm reconstructs each word separately. It uses a trie data structure for fast access to the list of candidate n-grams for each word without diacritics. That list contains only n-grams that include the word. The list is grouped by n (from 4-grams down to 1-grams) and, within each group, sorted by absolute frequency of occurrence in the language. The n-grams are then compared one by one with the word and its surrounding words until one matches, and the word is replaced with the diacritics form found in the matching n-gram.
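The lookup-and-match procedure described above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's C# implementation: the names `Trie`, `build_index`, `restore` and `TOY_COUNTS` are hypothetical, the n-gram counts are invented toy values rather than corpus data, and the sketch assumes lowercase, whitespace-separated input with no punctuation.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. 'môže' -> 'moze'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class Trie:
    """Character trie; each node stores the n-grams for the key ending there."""

    def __init__(self):
        self.children = {}
        self.ngrams = []

    def insert(self, key, ngram):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.ngrams.append(ngram)

    def find(self, key):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return []
        return node.ngrams


def build_index(counts):
    """Index every n-gram under the diacritics-free form of each of its words.

    N-grams are inserted longest first, then by descending count, so each
    node's list is already grouped by n and sorted by frequency.
    """
    index = Trie()
    ordered = sorted(counts, key=lambda ng: (-len(ng), -counts[ng]))
    for ngram in ordered:
        for key in {strip_diacritics(w) for w in ngram}:
            index.insert(key, ngram)
    return index


def match(ngram, plain, i):
    """Return the diacritics form of plain[i] if the n-gram fits its context."""
    stripped = [strip_diacritics(w) for w in ngram]
    for pos, w in enumerate(stripped):
        start = i - pos  # where the n-gram would begin in the text
        if w == plain[i] and start >= 0:
            if plain[start:start + len(ngram)] == stripped:
                return ngram[pos]
    return None


def restore(text, index):
    """Restore each word separately using the first matching n-gram."""
    words = text.split()
    plain = [strip_diacritics(w) for w in words]
    result = []
    for i, word in enumerate(plain):
        restored = words[i]  # unknown words are left unchanged
        for ngram in index.find(word):
            found = match(ngram, plain, i)
            if found is not None:
                restored = found
                break
        result.append(restored)
    return " ".join(result)


# Invented toy counts (assumption, not corpus data).
TOY_COUNTS = {
    ("môže", "byť"): 60,
    ("nový", "byt"): 30,
    ("môže",): 100,
    ("byť",): 90,
    ("byt",): 40,
}
index = build_index(TOY_COUNTS)

print(restore("moze byt", index))  # môže byť
print(restore("novy byt", index))  # nový byt
print(restore("byt", index))       # byť
```

In the toy data the word "byt" is ambiguous: the 2-grams decide it when the neighbouring word is known, and the most frequent 1-gram is used otherwise. Words with no entry in the index are returned unchanged.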

More information

Bachelor's thesis in Slovak: Automatická rekonštrukcia diakritiky pre slovenčinu or in this repo here
Conference paper in English: Automatic restoration of diacritics based on word n-grams for Slovak texts or on IEEE Xplore
Article in English: Diacritics restoration based on word n-grams for Slovak texts or on De Gruyter

Used technologies

C#
ASP.NET Core
PBCD.DataStructures.Trie

Final software

There are two final versions of the program. The first is faster (0.4 ms per word), uses RAM only, and has a success rate of 98.07%. The second is slower (4 ms per word), uses the hard disk, and has a success rate of 98.17%. Here you will find:

A ready-to-use DLL
A simple website for easy, user-friendly interaction with the program

Try it here

diakritika.fri.uniza.sk

To run it you need to download these files:

https://www.dropbox.com/s/7uraxif4ocfay8k/diacritics-reconstructor-necessary-files.zip?dl=0
