Language pair: en-ru
After training on a mix that includes back-translated data, the model doesn't improve further when training continues on the original corpus. This is the same problem I investigated a couple of years ago, and I don't think it's caused by anything new; it's likely related to the quality of the datasets and cleaning.
At the same time for lt-en:
The alternative would be to do just one stage of training with, say, a 70:30 original/back-translated ratio. I'll run more experiments, but I think that if we back-translate news with a strong student model, we get a very clean back-translated corpus while the original corpus remains imperfect and noisy. So it's understandable that the model might not beat the best checkpoint from the pre-training stage.
However, I recently found some issues with the en-ru data (for example this), so there might be another reason. cc @gregtatum
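For what it's worth, here is a minimal sketch of what a single-stage 70:30 mix could look like. The file names and the downsampling scheme are assumptions for illustration, not the pipeline's actual mixing code:

```python
import random

# A minimal sketch of building a single-stage training mix at a fixed
# original:back-translated ratio. File names and the downsampling scheme
# are assumptions for illustration, not the pipeline's actual code.
ORIGINAL = "original.en-ru.tsv"               # noisy original parallel corpus
BACK_TRANSLATED = "backtranslated.en-ru.tsv"  # clean back-translated news

RATIO = 0.7  # target share of original data in the final mix
SEED = 1234  # fixed seed so the mix is reproducible

def read_lines(path: str) -> list[str]:
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

original = read_lines(ORIGINAL)
back_translated = read_lines(BACK_TRANSLATED)

# Downsample the back-translated side so the mix ends up ~70:30.
target_bt = int(len(original) * (1 - RATIO) / RATIO)
random.seed(SEED)
sampled_bt = random.sample(back_translated, min(target_bt, len(back_translated)))

mix = original + sampled_bt
random.shuffle(mix)

with open("mix.en-ru.tsv", "w", encoding="utf-8") as f:
    f.write("\n".join(mix) + "\n")
```

Oversampling the original side instead of downsampling the back-translated side would keep all the clean data; the point is just that the ratio is fixed in a single stage rather than split across two.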