Teacher model does not continue training on original corpus #472
Closed · Tracked by #216 · Fixed by #596
eu9ene opened this issue Mar 6, 2024 · 1 comment
eu9ene commented Mar 6, 2024

Language pair: en-ru

[Screenshot: en-ru teacher training chart, 2024-03-06 11:16 AM]

After training on a mix with back-translated data, the model makes no further progress when it continues training on the original corpus. This is the same problem I investigated a couple of years ago, and I don't think it's caused by anything new; it's likely related to the quality of the datasets and cleaning.

At the same time, for lt-en:

[Screenshot: lt-en teacher training chart, 2024-03-06 11:25 AM]
eu9ene commented Apr 30, 2024

The alternative would be to do just one stage of training with, say, a 70:30 original/back-translated ratio. I'll run more experiments, but I think that if we back-translate news with a strong student model, we get a very clean back-translated corpus while the original corpus remains imperfect and noisy. So it's understandable that the model might not beat the best checkpoint from the pre-training stage.
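
A minimal sketch of how the data for such a single-stage run could be prepared by mixing the two corpora at a fixed ratio; the file names and the ratio constant are hypothetical illustrations, not part of the pipeline:

```python
import random

# Hypothetical file names; the 70:30 split matches the ratio proposed above.
ORIGINAL = "corpus.original.en-ru.tsv"            # tab-separated src/trg pairs
BACKTRANSLATED = "corpus.backtranslated.en-ru.tsv"
OUTPUT = "corpus.mixed.en-ru.tsv"
RATIO_ORIGINAL = 0.7  # fraction of the mixed corpus drawn from the original data

def read_pairs(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

original = read_pairs(ORIGINAL)
backtranslated = read_pairs(BACKTRANSLATED)

# Size the mix so the original corpus is used in full and back-translated
# sentences are sampled to hit the target ratio.
n_back = int(len(original) * (1 - RATIO_ORIGINAL) / RATIO_ORIGINAL)
sampled_back = random.sample(backtranslated, min(n_back, len(backtranslated)))

mixed = original + sampled_back
random.shuffle(mixed)  # single-stage training sees both sources interleaved

with open(OUTPUT, "w", encoding="utf-8") as f:
    f.write("\n".join(mixed) + "\n")
```

With a single mixed corpus like this there is no separate fine-tuning stage, so early stopping can't trigger on a checkpoint that fails to beat the pre-training best.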

However, I recently found some issues with the en-ru data (for example this), so that might be another reason. cc @gregtatum
