Bezhta morphology

This repository contains a prototype morphological analyzer for the Bezhta language (< Tsezic < Avar-Andic-Tsezic < Nakh-Daghestanian; Glottolog: bezh1248). It is part of a larger project by students of the School of Linguistics and the Linguistic Convergence Laboratory at NRU HSE that aims to provide digital tools for endangered languages.

The project is distributed under the GNU General Public License v3.0.

Sources

Grammar & dictionary

The parser follows the descriptions of Bezhta Proper in (Comrie et al., 2015) and (Madieva, 1965), with the lexicon gathered from the (Khalilov, 2015) dictionary. The digitized version of the dictionary is available at bezhta_dict.

Texts

For evaluation, I use the Bezhta translations of the Gospel of Luke and the Book of Proverbs, a text from Madieva's grammar (1965), and two annotated texts. The texts are available in the corpora directory.

Usage

The project requires lexd and hfst. You can install them with the following commands:

curl -sS https://apertium.projectjj.com/apt/install-nightly.sh | sudo bash
sudo apt install lexd
sudo apt install hfst

Making the analyzer

make

Analyze a word:

echo 'соралила' | hfst-lookup bezhta.analyzer.hfst
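
When the lookup is run over many words, it can help to collect the readings per token. The sketch below is not part of the repository. It assumes the usual hfst-lookup output layout (input, analysis and weight separated by tabs, with unrecognized input echoed back with a trailing `+?`). The function name `parse_lookup` and the tags in the sample are hypothetical.

```python
def parse_lookup(output: str) -> dict[str, list[str]]:
    """Group hfst-lookup output lines by input token.

    Assumes tab-separated lines of the form: input, analysis, weight.
    Tokens the transducer does not recognize map to an empty list.
    """
    analyses: dict[str, list[str]] = {}
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # blank separator lines between tokens
        token, analysis = fields[0], fields[1]
        readings = analyses.setdefault(token, [])
        if not analysis.endswith("+?"):  # "+?" marks an unrecognized input
            readings.append(analysis)
    return analyses


# Hypothetical output: the tags below are invented for illustration only.
sample = "соралила\tсора<N><pl>\t0.000000\n\nxyz\txyz+?\tinf\n"
print(parse_lookup(sample))
```

Keeping unrecognized tokens in the result with an empty list makes it easy to list them later.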

Making the transliterator

The transliterator converts Bezhta words from Cyrillic to Latin script.

make cy2lat.transliterator.disam.hfst

Transliterate a word:

echo 'соралила' | hfst-lookup cy2lat.transliterator.disam.hfst

Build the transliterated analyzer:

make bezhta.tr.analyzer.hfst

Look up a word in Latin script:

echo 'soralila' | hfst-lookup bezhta.tr.analyzer.hfst

Making the segmenter

The segmenter identifies the morpheme boundaries in the input word.

make bezhta.segm.hfst

Segmenting a word:

echo 'нисойо' | hfst-lookup bezhta.segm.hfst

Result:

 нисойо        нисо>йо
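
For downstream processing, the segmented form can be split into morphemes. This is a minimal sketch, assuming that `>` is the only boundary symbol (as in the result above) and that the segmented form is the second whitespace-separated column of the output line. The function name `split_morphemes` is hypothetical.

```python
def split_morphemes(lookup_line: str, boundary: str = ">") -> list[str]:
    """Return the morphemes of the segmented form in one hfst-lookup line.

    The line is expected to hold the input word, then the segmented form,
    optionally followed by a weight, separated by whitespace.
    """
    fields = lookup_line.split()
    if len(fields) < 2:
        raise ValueError(f"no segmented form in line: {lookup_line!r}")
    return fields[1].split(boundary)


print(split_morphemes(" нисойо        нисо>йо"))
```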

Evaluating coverage

Analyzer:

make bezhta.analyzer.hfstol
mv bezhta.analyzer.hfstol coverage
cd coverage
make check-coverage

Additionally, make check-unrecog can be used to get a list of unrecognized tokens. Note that the names of all text files should start with text-.

Current performance: ~75% naive coverage
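
Naive coverage is usually understood as the share of tokens that receive at least one analysis, whether or not that analysis is correct. The sketch below illustrates this on hfst-proc output in the Apertium stream format (`^surface/reading1/reading2$`, with unknown words marked by `*`). It is an assumption about how the figure could be computed, not the repository's check-coverage target. The function name and the tags in the sample are hypothetical.

```python
import re

# One lexical unit in the Apertium stream format: ^surface/reading/...$
# Escaped "^" or "$" inside a unit are not handled in this sketch.
LEXICAL_UNIT = re.compile(r"\^(.*?)\$")


def naive_coverage(stream: str) -> float:
    """Fraction of lexical units that have at least one known reading."""
    units = LEXICAL_UNIT.findall(stream)
    if not units:
        return 0.0
    known = 0
    for unit in units:
        _surface, _, readings = unit.partition("/")
        # Unknown words come back as "*surface" instead of a reading.
        if readings and not readings.startswith("*"):
            known += 1
    return known / len(units)


# Hypothetical stream: tags are invented for illustration only.
sample_stream = "^нисойо/нисо<N><erg>$ ^абв/*абв$ ^ва/ва<CONJ>$ ^где/*где$"
print(f"{naive_coverage(sample_stream):.0%}")
```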

Transliterator:

make bezhta.tr.analyzer.hfst
mv bezhta.tr.analyzer.hfst transliterator
cd transliterator
make check-coverage

Note: some symbols may be recognized incorrectly, so I recommend using transliterator_coverage.ipynb instead.

Evaluating accuracy

make bezhta.analyzer.hfstol
mv bezhta.analyzer.hfstol accuracy
cd accuracy

To analyze the texts with the parser, run:

hfst-proc bezhta.analyzer.hfstol text-annotated-1.txt > FILENAME-1.txt
hfst-proc bezhta.analyzer.hfstol text-annotated-2.txt > FILENAME-2.txt

Then compute accuracy:

python3 accuracy.py FILENAME-1.txt text-1-gold.txt
python3 accuracy.py FILENAME-2.txt text-2-gold.txt
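
The logic of accuracy.py and the format of the gold files are not described here. As a general illustration only, token-level accuracy can be sketched as the share of tokens whose gold analysis appears among the readings the analyzer proposes. The function name, the input structures and the tags below are all hypothetical.

```python
def token_accuracy(predicted: list[set[str]], gold: list[str]) -> float:
    """Share of tokens whose gold analysis is among the predicted readings.

    predicted[i] holds every reading proposed for token i (empty if the
    token was not recognized); gold[i] is its single reference analysis.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned token by token")
    if not gold:
        return 0.0
    correct = sum(
        1 for readings, answer in zip(predicted, gold) if answer in readings
    )
    return correct / len(gold)


# Hypothetical data: one correct token, one wrong reading, one unrecognized.
predicted = [{"a<N>"}, {"b<V>", "b<N>"}, set()]
gold = ["a<N>", "b<ADJ>", "c<N>"]
print(round(token_accuracy(predicted, gold), 3))
```

This definition counts an ambiguous token as correct if any of its readings matches. A stricter variant would require the analyzer to return exactly one reading.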

Making the guesser

cd guesser
make bezhta.guesser.hfst

Guessing a token:

echo 'войъис' | hfst-lookup bezhta.guesser.hfst

For evaluation, see guesser_evaluation.ipynb.
