Usage

For details, follow the tutorial in the docs. 📖

Data Preparation

Parallel Data

For training a translation model, you need parallel data, i.e. a collection of source sentences and reference translations that are aligned sentence-by-sentence and stored in two files, such that each line in the reference file is the translation of the same line in the source file.
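Because the alignment is purely positional, a quick sanity check before training can catch files that have drifted apart. The following is a minimal sketch, not part of JoeyNMT; the function name `read_parallel` and the file paths are hypothetical.

```python
def read_lines(path):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def read_parallel(src_path, trg_path):
    """Return (source, reference) pairs from two line-aligned files."""
    src_lines = read_lines(src_path)
    trg_lines = read_lines(trg_path)
    # Line i of the reference file must translate line i of the source file,
    # so differing line counts mean the files cannot be aligned.
    if len(src_lines) != len(trg_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} source vs. "
            f"{len(trg_lines)} reference lines"
        )
    return list(zip(src_lines, trg_lines))
```

Equal line counts are necessary but not sufficient: the check cannot tell whether line i really is a translation of line i, so spot-checking a few pairs by eye is still worthwhile.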

Pre-processing

Before training a model on it, parallel data is most commonly filtered by length ratio, tokenized and true- or lowercased.
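To make the length-ratio filter concrete, here is a minimal sketch. It assumes whitespace splitting as a stand-in for real tokenization, and the threshold `max_ratio=1.5` is an arbitrary illustrative value, not a JoeyNMT or Moses default.

```python
def filter_by_length_ratio(pairs, max_ratio=1.5, lowercase=False):
    """Keep sentence pairs whose token-count ratio is at most max_ratio."""
    kept = []
    for src, trg in pairs:
        # Whitespace splitting is an assumption; real pipelines tokenize first.
        src_len, trg_len = len(src.split()), len(trg.split())
        # Pairs with an empty side cannot be scored and are dropped.
        if src_len == 0 or trg_len == 0:
            continue
        ratio = max(src_len, trg_len) / min(src_len, trg_len)
        if ratio <= max_ratio:
            if lowercase:
                src, trg = src.lower(), trg.lower()
            kept.append((src, trg))
    return kept
```

Filtering pairs jointly, rather than each file separately, keeps the two sides line-aligned after sentences are dropped.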

The Moses toolkit provides a set of useful scripts for this purpose.

In addition, you might want to build the NMT model not on the basis of words, but rather on sub-words or characters (set via level in the JoeyNMT configuration). Currently, JoeyNMT supports the byte-pair encoding (BPE) formats produced by subword-nmt and sentencepiece.
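The difference between the levels can be illustrated with a small sketch. The function `split_tokens` is hypothetical and only approximates how a toolkit might turn a line into tokens; it assumes that BPE input has already been segmented (for example with the `@@ ` markers that subword-nmt writes), so it is split on whitespace just like words.

```python
def split_tokens(line, level="word"):
    """Split a pre-processed line into model tokens for the given level."""
    if level == "char":
        # Every character, including spaces, becomes its own token.
        return list(line)
    # "word" and "bpe": the line is assumed to be segmented already,
    # so whitespace separates the tokens.
    return line.split()


print(split_tokens("new@@ er words", level="bpe"))
print(split_tokens("ab c", level="char"))
```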

Configuration

Experiments are specified in configuration files, in simple YAML format. You can find examples in the configs directory. small.yaml contains a detailed explanation of configuration options.

Most importantly, the configuration contains the description of the model architecture (e.g. number of hidden units in the encoder RNN), paths to the training, development and test data, and the training hyperparameters (learning rate, validation frequency etc.).

Training

Start

For training, run

python3 -m joeynmt train configs/small.yaml

This will train a model on the training data specified in the config (here: small.yaml), validate on validation data, and store model parameters, vocabularies, validation outputs and a small number of attention plots in the model_dir (also specified in config).

Note that pre-processing such as tokenization or BPE segmentation is not part of training and has to be done manually beforehand.

Tip: To avoid accidentally overwriting existing models, set overwrite: False in the model configuration.

Validations

The validations.txt file in the model directory reports the validation results at every validation point. Models are saved whenever a new best validation score is reached, in batch_no.ckpt, where batch_no is the number of batches the model has been trained on so far. best.ckpt links to the checkpoint that has so far achieved the best validation score.
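Given this naming scheme, the most recent step-numbered checkpoint can be located with a few lines of Python. This is a sketch under the assumption that checkpoint files are named with a plain integer stem, as described above; `latest_checkpoint` is a hypothetical helper, not a JoeyNMT function.

```python
from pathlib import Path


def latest_checkpoint(model_dir):
    """Return the step-numbered checkpoint with the highest batch count, or None."""
    numbered = [
        path
        for path in Path(model_dir).glob("*.ckpt")
        # Only integer stems count; this skips best.ckpt, which is a link.
        if path.stem.isdigit()
    ]
    # Compare numerically so that 2000.ckpt ranks above 900.ckpt.
    return max(numbered, key=lambda path: int(path.stem), default=None)
```

Sorting by the integer value rather than by file name matters, because a plain string sort would place 900.ckpt after 2000.ckpt.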

Visualization

JoeyNMT uses TensorBoard to visualize training and validation curves and attention matrices during training. Launch TensorBoard with tensorboard --logdir model_dir/tensorboard (or python -m tensorboard.main ...) and then open the URL (default: localhost:6006) in a browser.

For a stand-alone plot, run python3 scripts/plot_validation.py model_dir --plot_values bleu PPL --output_path my_plot.pdf to plot curves of validation BLEU and PPL.

CPU vs. GPU

For training on a GPU, set use_cuda in the config file to True. This requires the necessary CUDA libraries to be installed.

Translating

There are three options for testing what the model has learned.

Whatever data you feed the model for translating, make sure it is properly pre-processed, just as you pre-processed the training data, e.g. tokenized and split into subwords (if working with BPEs).

1. Test Set Evaluation

For testing and evaluating on your parallel test/dev set, run

python3 -m joeynmt test configs/small.yaml --output_path out

This will generate translations for the validation and test sets (as specified in the configuration) in out.[dev|test] with the latest/best model in the model_dir (or a specific checkpoint set with load_model). It will also evaluate the outputs with eval_metric. If --output_path is not specified, the translations are not stored; only the evaluation is run and its results are printed.

2. File Translation

In order to translate the contents of a file not contained in the configuration (here my_input.txt), simply run

python3 -m joeynmt translate configs/small.yaml < my_input.txt > out

The translations will be written to stdout or, alternatively, to --output_path if specified.
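The same redirection can be driven from a Python script, for instance when translating many files in a loop. The sketch below only wraps the command line shown above; `translate_cmd` and `run_with_redirect` are hypothetical helper names, and `sys.executable` is assumed to be the interpreter in which JoeyNMT is installed.

```python
import subprocess
import sys


def translate_cmd(config):
    """Build the translate command for a given configuration file."""
    # sys.executable stands in for python3 (an assumption about the setup).
    return [sys.executable, "-m", "joeynmt", "translate", config]


def run_with_redirect(cmd, input_path, output_path):
    """Run cmd with stdin read from input_path and stdout written to output_path."""
    with open(input_path, "rb") as fin, open(output_path, "wb") as fout:
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(cmd, stdin=fin, stdout=fout, check=True)
```

A call such as `run_with_redirect(translate_cmd("configs/small.yaml"), "my_input.txt", "out")` then corresponds to the shell command above. The input file still has to be pre-processed in the same way as the training data.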

3. Interactive

If you just want to try a few examples, run

python3 -m joeynmt translate configs/small.yaml

and you'll be prompted to type input sentences that JoeyNMT will then translate with the model specified in the configuration.

Documentation and Tutorial

The docs include an overview of the NMT implementation, a walk-through tutorial for building, training, tuning, testing and inspecting an NMT system, the API documentation and FAQs.
A screencast of the tutorial is available on YouTube. 🎥
Jade Abbott wrote a notebook that runs on Colab and shows how to prepare data and train and evaluate a model, using low-resource African languages as an example.
Matthias Müller wrote a collection of scripts for installation, data download and preparation, model training and evaluation.

Benchmarks

Benchmark results on WMT and IWSLT datasets are reported here. Please also check the Masakhane MT repository for benchmarks and available models for African languages.

Pre-trained Models

Pre-trained models from the reported benchmarks are available for download (each contains the config, vocabularies, best checkpoint and dev/test hypotheses):

IWSLT14 de-en

Pre-processing with Moses decoder tools as in this script.

IWSLT14 de-en BPE RNN (641M)
IWSLT14 de-en Transformer (210M)

IWSLT15 en-vi

The data came preprocessed from Stanford NLP, see this script.

IWSLT15 en-vi Transformer (186M)

WMT17

Following the pre-processing of the Sockeye paper.

WMT17 en-de "best" RNN (2G)
WMT17 lv-en "best" RNN (1.9G)
WMT17 en-de Transformer (664M)
WMT17 lv-en Transformer (650M)

Autshumato

Training with data provided in the Ukuxhumana project, with additional tokenization of the training data with the Moses tokenizer.

Autshumato en-af small Transformer (147M)
Autshumato af-en small Transformer (147M)
Autshumato en-nso small Transformer (147M)
Autshumato nso-en small Transformer (147M)
Autshumato en-tn small Transformer (319M)
Autshumato tn-en small Transformer (321M)
Autshumato en-ts small Transformer (229M)
Autshumato ts-en small Transformer (229M)
Autshumato en-zu small Transformer (147M)
Autshumato zu-en small Transformer (147M)

If you trained JoeyNMT on your own data and would like to share your model, please email us so we can add it to the collection of pre-trained models.
