weights and biases reporting doesn't work correctly when resuming training after a spot termination #594

Closed
Tracked by #443 ...
bhearsum opened this issue May 14, 2024 · 0 comments · Fixed by #595

Over in #580 we've been working on getting spot instances re-enabled. That part works correctly now, and training continues from the last checkpoint, but I noticed in our latest run that W&B reporting is turned off when resuming.

In run #1 we had:

[task 2024-05-13T23:07:22.900Z] wandb: Currently logged in as: moz-translations-wandb-bot (moz-translations). Use `wandb login --relogin` to force relogin
[task 2024-05-13T23:07:23.803Z] wandb: wandb version 0.17.0 is available!  To upgrade, please run:
[task 2024-05-13T23:07:23.803Z] wandb:  $ pip install wandb --upgrade
[task 2024-05-13T23:07:23.803Z] wandb: Tracking run with wandb version 0.16.1
[task 2024-05-13T23:07:23.803Z] wandb: Run data is saved locally in /home/ubuntu/tasks/task_171564135626786/checkouts/vcs/pipeline/train/wandb/run-20240513_230722-eeqsrpg6
[task 2024-05-13T23:07:23.803Z] wandb: Run `wandb offline` to turn off syncing.
[task 2024-05-13T23:07:23.805Z] wandb: Syncing run backwards
[task 2024-05-13T23:07:23.805Z] wandb: ⭐️ View project at https://wandb.ai/moz-translations/en-ru
[task 2024-05-13T23:07:23.805Z] wandb: 🚀 View run at https://wandb.ai/moz-translations/en-ru/runs/eeqsrpg6

But in run #2 we got:

[task 2024-05-14T00:51:23.329Z] [tracking WARNING] This run already exists on W&B: [<Run moz-translations/en-ru/eeqsrpg6 (running)>]. No data will be published.
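
For illustration, one way a resumed task could attach to the existing run rather than skip publication is to pass the original run ID to `wandb.init` with `resume="allow"`. This is only a sketch: `build_init_kwargs` and `start_run` are hypothetical helpers, not functions from this repository, and how the pipeline's tracking code would obtain the run ID is an assumption.

```python
from typing import Any, Dict, Optional


def build_init_kwargs(project: str, entity: str, run_id: Optional[str]) -> Dict[str, Any]:
    """Build keyword arguments for wandb.init() (hypothetical helper).

    When run_id is given, ask W&B to resume that run instead of creating a new one.
    """
    kwargs: Dict[str, Any] = {"project": project, "entity": entity}
    if run_id is not None:
        kwargs["id"] = run_id
        # "allow" resumes the run if it already exists, otherwise creates it.
        kwargs["resume"] = "allow"
    return kwargs


def start_run(project: str, entity: str, run_id: Optional[str] = None):
    """Start or resume a W&B run (hypothetical helper, needs the wandb package)."""
    import wandb  # imported lazily so build_init_kwargs works without wandb installed

    return wandb.init(**build_init_kwargs(project, entity, run_id))
```

With the values from the logs above, a resumed task would call something like `start_run("en-ru", "moz-translations", run_id="eeqsrpg6")`.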

Note that when resuming, we will usually end up redoing some of the work that happened after the last checkpoint but before the termination. I'm not sure of the best way to handle this in W&B; I imagine others will have a better idea.
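
One possible way to handle the redone work, sketched here purely as an assumption and not as what the fix does, is to drop any metric whose training step was already published before the termination. `skip_already_reported` and the `(step, values)` shape are hypothetical names chosen for the example.

```python
from typing import Dict, Iterable, Iterator, Tuple

# A metric point: (training step, metric name -> value). Hypothetical shape.
Metric = Tuple[int, Dict[str, float]]


def skip_already_reported(points: Iterable[Metric], last_reported_step: int) -> Iterator[Metric]:
    """Yield only the points whose step is newer than the last published step."""
    for step, values in points:
        if step > last_reported_step:
            yield step, values


# Example: the checkpoint was at step 100, but step 150 was already published
# before the spot termination, so the replayed steps 100 and 150 are dropped.
replayed = [(100, {"loss": 2.1}), (150, {"loss": 1.9}), (200, {"loss": 1.7})]
fresh = list(skip_already_reported(replayed, last_reported_step=150))
```

The trade-off is that values recomputed after the resume are discarded in favor of the ones published by the first run, which may differ slightly.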
