Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Override W&B data on a resumed training #595

Merged
merged 2 commits into from
May 15, 2024

Conversation

vrigal
Copy link
Contributor

@vrigal vrigal commented May 14, 2024

Closes #594

@vrigal
Copy link
Contributor Author

vrigal commented May 15, 2024

I applied the changes (to resume an existing run on W&B) and tested here: https://wandb.ai/teklia/test/runs/tv2vtdha

we can just pass the artifacts dir to the script parse_tc_logs --from-stream -v --wandb-artifacts ${model_dir}.

Its probably worth opening an issue for this. W&B seems to sync (override) logs & artifacts automatically. (https://wandb.ai/teklia/test/groups/test/files/output.log).

One complication here is that we continue training from the last checkpoint so there will be small overlap with previously written data

W&B should handle this automatically, as it ignores data with a step inferior the last data written.
I think it should simply trigger some warnings from the W&B client.

@vrigal vrigal requested a review from eu9ene May 15, 2024 12:58
@eu9ene
Copy link
Collaborator

eu9ene commented May 15, 2024

@vrigal one thing I don't understand is who sets "RUN_ID" env?

@bhearsum
Copy link
Collaborator

@vrigal one thing I don't understand is who sets "RUN_ID" env?

This is set automatically by generic-worker before the task starts: https://github.com/taskcluster/taskcluster/blob/a85c8b9f7be096f6b9a4bad38612374b9a702372/workers/generic-worker/multiuser_posix.go#L146-L148

Copy link
Collaborator

@eu9ene eu9ene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this can work but from the code perspective there are some issues:

  • w&b publisher should not know anything about Taskcluster environment variables
  • why don't we just use resume="allow" so that if the run already exists, it continues automatically? I don't think there is a use case where we have a run with the same name when running things on Taskcluster except the restart of the same task. I guess this logic was implemented in the publisher for publishing offline experiments to prevent republishing the same ones. In this case, the publisher should accept an argument set by a cli that indicates what to do if the run already exists.

Other known issues:

  • W&B drops overlapped data from the new run instead of overwriting (wandb: WARNING (User provided step: 15000 is less than current step: 15001. Dropping entry: {'gnorm': 0.7161, '_timestamp': 1715790811.6654584}). )

I think since it kind of works we can merge it to unblock enabling spot instances but we should address those issues later.

@eu9ene eu9ene merged commit ea95bc0 into mozilla:main May 15, 2024
4 checks passed
@vrigal
Copy link
Contributor Author

vrigal commented May 16, 2024

why don't we just use resume="allow" so that if the run already exists, it continues automatically? I don't think there is a use case where we have a run with the same name when running things on Taskcluster except the restart of the same task. I guess this logic was implemented in the publisher for publishing offline experiments to prevent republishing the same ones. In this case, the publisher should accept an argument set by a cli that indicates what to do if the run already exists.

This is an old issue. The simpler way to handle this would be to use <group>-<model> as a unique ID in W&B.
It should be possible to keep resume="allow" then. It would be compatible with the override option and would probably work in most case (W&B drops overlapped data, as you mentioned), but requires some important changes in the code.

@eu9ene
Copy link
Collaborator

eu9ene commented May 16, 2024

why don't we just use resume="allow" so that if the run already exists, it continues automatically? I don't think there is a use case where we have a run with the same name when running things on Taskcluster except the restart of the same task. I guess this logic was implemented in the publisher for publishing offline experiments to prevent republishing the same ones. In this case, the publisher should accept an argument set by a cli that indicates what to do if the run already exists.

This is an old issue. The simpler way to handle this would be to use <group>-<model> as a unique ID in W&B. It should be possible to keep resume="allow" then. It would be compatible with the override option and would probably work in most case (W&B drops overlapped data, as you mentioned), but requires some important changes in the code.

We can rethink all that in #408, but I would use UID in model names as a last resort because they would clutter the dashboards.

@vrigal
Copy link
Contributor Author

vrigal commented May 17, 2024

@eu9ene to be clear, run ID (used to identify a run) is different that run name (used to display graphs). For now we do not use an ID, it is automatically set by W&B (e.g. brmhnekj) which guarantees unicity. But if there is a unique way to identify a run (e.g. <group>-<model_name>) it could help with deletion(override) or resuming a run. It can be a separate issue than #408. I'll write an issue for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

weights and biases reporting doesn't work correctly when resuming training after a spot termination
3 participants