Icelandic NLP resources

This is a list of known tools and resources developed specifically for linguistic processing of Icelandic. It is intended to give readers a clear, at-a-glance overview of the ever-growing arsenal of tools for working with Icelandic natural language data.

This list is organized by task for clarity, so some multi-functional tools and toolkits appear more than once. If you notice that a category or resource is missing, or have suggestions for improving this list, please open a pull request.

Notable papers and reports ↑

Máltækniáætlun fyrir íslensku 2018-2022 (English version)
- The project plan for an ongoing language technology programme funded by the Icelandic Ministry of Education.
- A short paper describing the programme. Note that the programme has been postponed by a year compared to the original plan.
Risamálheild: A Very Large Icelandic Text Corpus
- Paper describing the Icelandic Gigaword Corpus, a tagged and lemmatized corpus containing over 10^9 tokens.
A Wide-Coverage Context-Free Grammar for Icelandic and an Accompanying Parsing System
Please send a pull request with additions to this list.

Other resource collections ↑

CLARIN-IS
- The Icelandic branch of the CLARIN-ERIC language resource initiative. Contains information on and downloads for many tools and datasets.
SÍM homepage
- Overview page for SÍM (the Icelandic Language Technology Consortium), which contains mirrors and descriptions for all Language Technology Programme projects.
malfong.is
- List of language technology resources, maintained by Árnastofnun.
Comprehensive list of language resources
- This list of over 100 Icelandic language technology resources was compiled by @bjarnigithub in the summer of 2021.

Corpora ↑

Talrómur
- A large public domain TTS corpus designed for research and development. Contains over 160 hours of studio-recorded prompted speech, divided between 8 speakers.
Samrómur
- An open and accessible speech recognition dataset with FLAC audio files, corresponding text and metadata.
Icelandic broadcast speech
- 193 hours of radio and TV data from the Icelandic National Broadcasting Service (RÚV).
Spjallromur
- Icelandic Conversational Speech
Kennslurómur
- Icelandic lectures with audio and corresponding text.
GreynirCorpus
- A large, parsed treebank of modern Icelandic text.

European Language Grid Services ↑

Toolkits ↑

Greynir

- Python 3 package capable of syntactic parsing, lemmatization, POS tagging, noun phrase inflection and more
- The GitHub repo for this project
- Developed by Miðeind ehf.

IceNLP

- Java toolkit for tokenization, POS tagging, lemmatization, parsing and NER
- Developed by Hrafn Loftsson

LVL-tts-frontend

- TTS frontend designed to work with the Merlin speech synthesis system developed by CSTR
- Contains a pronunciation dictionary, a Sequitur G2P model, a stress analysis component and more. Unfortunately, it does not include any documentation.
- Developed by Anna Björk Nikulásdóttir at LVL

Tokenization and text normalization ↑

Icelandic tokenizer
Textahaukur - text normalization toolkit
- Development appears to be suspended, and the project states that it is not yet functional.
Regína normalizer
- Regex-based text normalization in Python. Currently in early stages of development.
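
To illustrate what regex-based text normalization means in practice, here is a minimal sketch using only Python's standard `re` module. It is not taken from Regína or any other tool listed here: the `ABBREVIATIONS` table, the percent rule and the `normalize` function are hypothetical and chosen purely for illustration. Real normalizers for Icelandic must also handle inflection of numbers, dates, units and much more.

```python
import re

# Hypothetical, illustrative rules. These are NOT the rules used by Regína
# or any other tool in this list.
ABBREVIATIONS = {
    "t.d.": "til dæmis",
    "o.s.frv.": "og svo framvegis",
    "m.a.": "meðal annars",
}

# One alternation over all abbreviations, longest first so that longer
# entries are tried before shorter ones. The lookarounds prevent matches
# inside longer words.
_ABBR_RE = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(k) for k in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")(?!\w)"
)


def normalize(text: str) -> str:
    """Expand a few abbreviations, spell out '%', and collapse whitespace."""
    # Replace each abbreviation with its expansion from the table.
    text = _ABBR_RE.sub(lambda m: ABBREVIATIONS[m.group(0)], text)
    # Replace a percent sign that follows a digit with the word "prósent".
    text = re.sub(r"(\d)\s*%", r"\1 prósent", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(normalize("Verðið hækkaði um 5 %, t.d. á mjólk."))
# prints: Verðið hækkaði um 5 prósent, til dæmis á mjólk.
```

A table-driven design like this keeps the rules easy to inspect and extend, but it cannot resolve ambiguity (for example, whether the final period of an abbreviation also ends the sentence), which is one reason dedicated normalization tools exist.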

POS tagging ↑

Syntactic parsing ↑

Neural parsing pipeline for Icelandic
- The GitHub repo for this project
Greynir, see above
IceNLP, see above

Grapheme-to-phoneme ↑

Stress analysis ↑

LVL-tts-frontend performs stress analysis

Speech synthesis ↑

Speech recognition ↑

Ice-ASR
Alþingi
- Just the recipe
Samromur ASR
- Contains a vanilla recipe (base), subword modelling, and specialized recipes for children and adolescents
Alignment and segmentation
- Scripts to prepare RÚV TV data for alignment and segmentation to make an ASR dataset
Tiro Speech Core
Tal
