Gabriel opusmt2 #202

Draft
wants to merge 16 commits into
base: main
Conversation

gabrielBusta
Member

No description provided.

TommiNieminen and others added 16 commits September 12, 2023 15:37
…her changes) (#117)

* Integrated Tatoeba-Challenge models as part of the firefox-translations-training pipeline:
- Added download scripts and rules for downloading Tatoeba-Challenge data and models.
- Modified training rules to accept downloaded Tatoeba-Challenge models as teachers and backward models.
- Modified containerization to include conda environments inside the container (to abide by CSC's conda deprecation).
- Added subword segmentation rules to marian-specific rules (since the default pipeline uses Marian's integrated sentencepiece support and Tatoeba-Challenge models don't).
NOTE: The pipeline is still a work in progress, and it may fail for some Tatoeba-Challenge models due to subtle differences in the model make-up.

* reduced workspace, since Marian crashes training with larger workspaces (this might be fixed in newer marian versions)

* Update README.md

Added note about changing CSC account

* Update config.opusmt.yml

Fixed opusmt-teacher value to URL as it should be

* added target language token addition for multilingual models

* new test config for multilingual models

* fixed data language pair reverse with tatoeba data

* added config parameter for pretrained teacher model (only pretrained models using marian sentencepiece integration)

* Update flores.sh

Fixed Swahili code in Flores importer

* Working on using multiple teacher models, not ready for action yet

* added profiles for csc mahti

* Update README.md

* multiteach additions

* more multiteacher changes

* multiple teachers added, monolingual src fixed

* fixed vocabs with multiteacher, other minor fixes

* fixed dummy mono src rules

* fixed model indices if no opus mt teachers

* added file for preinstalling snakemake envs (for easier containerization)

* added profiles for lumi, support for amd gpus, fixing the broken non-opus-mt training pipeline

* both train from scratch and opus-mt teacher should work now

* added separate compile script for browsermt marian

* new marian-dev submodule version (old one did not work with fp16 and opus models), cuda dirs and root specified in Snakefile if not in config, new makefile targets

* updated lumi profiles with automatic paths and energy monitoring

* fixing bicleaner-ai (model repository link changed), some more energy use monitoring additions

* updated bicleaner-ai env, the old one did not work for some reason

* added langid file to bicleaner-ai env also

* Update bicleaner-ai.yml

Added tensorflow-rocm to bi-cleaner env to get it working on lumi

* lumi slurm fixes and bicleaner-ai bug fixing

* Update README.md

Added instructions for using Snakemake without non-containerized conda installation.

* Update README.md

Formatting changes.

* updated mtdata in base env

* updated container to match envs

* added env variables required by new clean mono

* added separate bicleaner-ai env for lumi

* added lumi bicleaner env

* added tensorflow to bicleaner-ai env

* fixed bicleaner-ai script bug and added a missing argument for train_spm

* singularity fixes: kenlm installation, added hunspell dict download, edited local-container profile to work with current Snakefile setup
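
For readers unfamiliar with the "target language token addition for multilingual models" entry above: multilingual OPUS-MT models expect each source sentence to start with a token such as `>>fin<<` that selects the output language. The following is only a minimal sketch of that idea, not the code in this PR; the function name `add_target_token` is hypothetical, and the pipeline's actual implementation may differ.

```python
def add_target_token(lines, target_lang):
    """Prefix each source sentence with a >>lang<< target language token.

    Hypothetical helper for illustration only. The >>lang<< format follows
    the convention of multilingual OPUS-MT models; the language code to use
    depends on the specific model.
    """
    token = f">>{target_lang}<<"
    # Strip surrounding whitespace (including the newline) before prefixing.
    return [f"{token} {line.strip()}" for line in lines]


if __name__ == "__main__":
    for sentence in add_target_token(["Hello world\n", "How are you?\n"], "fin"):
        print(sentence)
```

The token has to be added before subword segmentation, so that it reaches the model as the single vocabulary item it was trained with.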

---------

Co-authored-by: Tommi Nieminen <niemine1@mahti-login11.mahti.csc.fi>
Co-authored-by: Tommi Nieminen <niemine1@mahti-login12.mahti.csc.fi>
Co-authored-by: Tommi Nieminen <tommi.nieminen@helsinki.fi>
# Conflicts:
#	Makefile
#	pipeline/bicleaner/packs.py
#	pipeline/cefilter/score.sh
#	pipeline/translate/collect.sh
#	pipeline/translate/merge-corpus.sh
# Conflicts:
#	Snakefile
#	pipeline/train/spm-vocab.sh
#	taskcluster/ci/tests/kind.yml
#	taskcluster/translations_taskgraph/parameters.py
3 participants