cadia-lvl/icelandic-NLP-resources


Icelandic NLP resources

This is a list of known tools and resources developed specifically for linguistic processing of Icelandic. It is intended to give readers a clear, at-a-glance overview of the ever-growing arsenal of tools for working with Icelandic natural language data.

This list is organized by task for clarity. As a result, some multi-functional tools and toolkits may appear more than once. If you notice that a category or resource is missing, or have suggestions on how to improve this list, please open a pull request.

Contents

Notable papers and reports

Other resource collections

  • CLARIN-IS
    • The Icelandic branch of the CLARIN-ERIC language resource initiative. Contains information on and downloads for many tools and datasets.
  • SÍM homepage
    • Overview page for SÍM (the Icelandic Language Technology Consortium), which contains mirrors and descriptions for all Language Technology Programme projects.
  • malfong.is
    • List of language technology resources, maintained by Árnastofnun.
  • Comprehensive list of language resources
    • This list of over 100 Icelandic language technology resources was compiled by @bjarnigithub in the summer of 2021.

Corpora

  • Talrómur
    • A large public domain TTS corpus designed for research and development. Contains over 160 hours of studio-recorded prompted speech, divided between 8 speakers.
  • Samrómur
    • An open and accessible speech recognition dataset with FLAC audio files, corresponding text and metadata.
  • Icelandic broadcast speech
    • 193 hours of radio and TV data from the Icelandic National Broadcasting Service (RÚV).
  • Spjallrómur
    • A corpus of Icelandic conversational speech.
  • Kennslurómur
    • Icelandic lectures with audio and corresponding text.
  • GreynirCorpus
    • A large, parsed treebank of modern Icelandic text.

European Language Grid Services

Toolkits

  • Java toolkit that performs tokenization, POS tagging, lemmatization, parsing, and NER
  • Developed by Hrafn Loftsson
  • TTS frontend designed to work with the Merlin speech synthesis system developed by CSTR
  • It contains a pronunciation dictionary, a Sequitur g2p model, a stress analysis component, and more. Unfortunately, it does not include any documentation.
    • Developed by Anna Björk Nikulásdóttir at LVL

Tokenization and text normalization

POS tagging

Syntactic parsing

Grapheme-to-phoneme

Stress analysis

Speech synthesis

Speech recognition