Teacher model does not continue training on original corpus #472
Closed · Tracked by #216 · Fixed by #596
eu9ene opened this issue Mar 6, 2024 · 1 comment
eu9ene commented Mar 6, 2024

Language pair: en-ru

[Screenshot: en-ru teacher training chart, 2024-03-06 11:16 AM]

After training on a mix with back-translated data, the model makes no further progress when it continues training on the original corpus. This is the same problem I investigated a couple of years ago, and I don't think it's caused by anything new; it's likely related to the quality of the datasets and cleaning.

At the same time, for lt-en:

[Screenshot: lt-en teacher training chart, 2024-03-06 11:25 AM]
eu9ene commented Apr 30, 2024

The alternative would be to do just one stage of training with, say, a 70:30 original/back-translated ratio. I'll run more experiments, but I think that if we back-translate news with a strong student model, we get a very clean back-translated corpus while the original corpus remains imperfect and noisy. So it's understandable that the model might not beat the best checkpoint from the pre-training stage.
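
A minimal sketch of how the data for such a single-stage run could be prepared by mixing the two corpora at a fixed ratio; the file names and the ratio constant are hypothetical illustrations, not part of the pipeline:

```python
import random

# Hypothetical file names; the 70:30 split matches the ratio proposed above.
ORIGINAL = "corpus.original.en-ru.tsv"            # tab-separated src/trg pairs
BACKTRANSLATED = "corpus.backtranslated.en-ru.tsv"
OUTPUT = "corpus.mixed.en-ru.tsv"
RATIO_ORIGINAL = 0.7  # fraction of the mixed corpus drawn from the original data

def read_pairs(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

original = read_pairs(ORIGINAL)
backtranslated = read_pairs(BACKTRANSLATED)

# Size the mix so the original corpus is used in full and back-translated
# sentences are sampled to hit the target ratio.
n_back = int(len(original) * (1 - RATIO_ORIGINAL) / RATIO_ORIGINAL)
sampled_back = random.sample(backtranslated, min(n_back, len(backtranslated)))

mixed = original + sampled_back
random.shuffle(mixed)  # single-stage training sees both sources interleaved

with open(OUTPUT, "w", encoding="utf-8") as f:
    f.write("\n".join(mixed) + "\n")
```

With a single mixed corpus like this there is no separate fine-tuning stage, so early stopping can't trigger on a checkpoint that fails to beat the pre-training best.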

However, I recently found some issues with the en-ru data (for example this), so that might be another reason. cc @gregtatum
