Override W&B data on a resumed training #595

vrigal · 2024-05-14T13:56:42Z

Closes #594

tracking/translations_parser/publishers.py

vrigal · 2024-05-15T10:46:06Z

I applied the changes (to resume an existing run on W&B) and tested here: https://wandb.ai/teklia/test/runs/tv2vtdha

We can just pass the artifacts dir to the script: `parse_tc_logs --from-stream -v --wandb-artifacts ${model_dir}`.

It's probably worth opening an issue for this. W&B seems to sync (override) logs & artifacts automatically (https://wandb.ai/teklia/test/groups/test/files/output.log).

One complication here is that we continue training from the last checkpoint, so there will be a small overlap with previously written data.

W&B should handle this automatically, as it ignores data whose step is lower than the last step already written.
I think it should simply trigger some warnings from the W&B client.
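
To make the overlap concrete, here is a minimal sketch (not the code from this PR) of filtering replayed entries before they are published. The function name `drop_overlapping` and the `(step, metrics)` data shape are assumptions made for illustration; the W&B client applies a comparable rule on its side and logs a warning for each dropped entry.

```python
from typing import Dict, Iterable, Iterator, Tuple

Metrics = Dict[str, float]


def drop_overlapping(
    entries: Iterable[Tuple[int, Metrics]], last_step: int
) -> Iterator[Tuple[int, Metrics]]:
    """Yield only the entries that are newer than the last published step."""
    for step, metrics in entries:
        if step <= last_step:
            # Already published before the task was interrupted: skip it.
            continue
        yield step, metrics


# Hypothetical scenario: data up to step 15000 was published, then training
# resumed from an earlier checkpoint and replays part of the same range.
replayed = [
    (14500, {"gnorm": 0.70}),
    (15000, {"gnorm": 0.72}),
    (15500, {"gnorm": 0.69}),
]
kept = list(drop_overlapping(replayed, last_step=15000))
```

Only the entry at step 15500 survives here; the two replayed entries are discarded rather than overwritten.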

eu9ene · 2024-05-15T17:42:11Z

@vrigal one thing I don't understand is who sets "RUN_ID" env?

bhearsum · 2024-05-15T18:04:30Z

> @vrigal one thing I don't understand is who sets "RUN_ID" env?

This is set automatically by generic-worker before the task starts: https://github.com/taskcluster/taskcluster/blob/a85c8b9f7be096f6b9a4bad38612374b9a702372/workers/generic-worker/multiuser_posix.go#L146-L148
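
As a rough sketch of how a script could use that variable to detect a restarted task: the helper name `is_resumed_task` is hypothetical, and the sketch assumes that `RUN_ID` holds the zero-based run index of the task, so any value above 0 means the task has been re-run.

```python
import os
from typing import Mapping


def is_resumed_task(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when RUN_ID indicates a re-run of the same task.

    Assumption: RUN_ID is a zero-based integer, so the first attempt is 0.
    A missing or malformed value is treated as a first attempt.
    """
    raw = env.get("RUN_ID", "0")
    try:
        return int(raw) > 0
    except ValueError:
        return False
```

Passing the environment as a parameter keeps the check easy to exercise outside Taskcluster, for example `is_resumed_task({"RUN_ID": "1"})`.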

eu9ene

Ok, this can work but from the code perspective there are some issues:

- The W&B publisher should not know anything about Taskcluster environment variables.
- Why don't we just use resume="allow" so that if the run already exists, it continues automatically? I don't think there is a use case where we have a run with the same name when running things on Taskcluster except the restart of the same task. I guess this logic was implemented in the publisher for publishing offline experiments to prevent republishing the same ones. In this case, the publisher should accept an argument set by the CLI that indicates what to do if the run already exists.
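
One possible shape for that suggestion, as a sketch only: the CLI decides what happens when the run already exists, and the publisher receives plain keyword arguments with no knowledge of Taskcluster. The flags `--wandb-run-id` and `--wandb-if-exists`, the `RESUME_MODES` mapping, and `build_init_kwargs` are all hypothetical names. The `id` and `resume` keys correspond to real `wandb.init` parameters, and `resume` only has an effect when a run ID is supplied. `wandb.init` is deliberately not called, so the sketch runs without W&B installed.

```python
import argparse
from typing import Any, Dict, List, Optional

# Hypothetical CLI policy names mapped to values accepted by wandb.init(resume=...).
RESUME_MODES = {
    "resume": "allow",  # continue the run if it exists, otherwise create it
    "fail": "never",    # refuse to touch an existing run
}


def parse_args(argv: List[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--wandb-run-id", default=None)
    parser.add_argument(
        "--wandb-if-exists", choices=sorted(RESUME_MODES), default="resume"
    )
    return parser.parse_args(argv)


def build_init_kwargs(
    project: str, group: str, name: str, run_id: Optional[str], if_exists: str
) -> Dict[str, Any]:
    """Build keyword arguments that a publisher could pass to wandb.init(**kwargs)."""
    return {
        "project": project,
        "group": group,
        "name": name,
        "id": run_id,
        "resume": RESUME_MODES[if_exists],
    }


args = parse_args(["--wandb-run-id", "abc123", "--wandb-if-exists", "resume"])
kwargs = build_init_kwargs(
    "test", "test", "student", args.wandb_run_id, args.wandb_if_exists
)
```

With this split, restarting a task and republishing offline experiments become two values of the same option instead of two code paths inside the publisher.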

Other known issues:

- W&B drops overlapped data from the new run instead of overwriting it: `wandb: WARNING (User provided step: 15000 is less than current step: 15001. Dropping entry: {'gnorm': 0.7161, '_timestamp': 1715790811.6654584}).`

Since it kind of works, I think we can merge it to unblock enabling spot instances, but we should address those issues later.

vrigal · 2024-05-16T06:57:40Z

> Why don't we just use resume="allow" so that if the run already exists, it continues automatically? I don't think there is a use case where we have a run with the same name when running things on Taskcluster except the restart of the same task. I guess this logic was implemented in the publisher for publishing offline experiments to prevent republishing the same ones. In this case, the publisher should accept an argument set by the CLI that indicates what to do if the run already exists.

This is an old issue. The simplest way to handle this would be to use <group>-<model> as a unique ID in W&B.
It should then be possible to keep resume="allow". It would be compatible with the override option and would probably work in most cases (W&B drops overlapped data, as you mentioned), but it requires significant changes in the code.

eu9ene · 2024-05-16T16:44:24Z

> Why don't we just use resume="allow" so that if the run already exists, it continues automatically? I don't think there is a use case where we have a run with the same name when running things on Taskcluster except the restart of the same task. I guess this logic was implemented in the publisher for publishing offline experiments to prevent republishing the same ones. In this case, the publisher should accept an argument set by the CLI that indicates what to do if the run already exists.

> This is an old issue. The simplest way to handle this would be to use <group>-<model> as a unique ID in W&B. It should then be possible to keep resume="allow". It would be compatible with the override option and would probably work in most cases (W&B drops overlapped data, as you mentioned), but it requires significant changes in the code.

We can rethink all that in #408, but I would use UIDs in model names only as a last resort because they would clutter the dashboards.

vrigal · 2024-05-17T09:34:13Z

@eu9ene to be clear, the run ID (used to identify a run) is different from the run name (used to display graphs). For now we do not set an ID; it is generated automatically by W&B (e.g. brmhnekj), which guarantees uniqueness. But if there is a unique way to identify a run (e.g. <group>-<model_name>), it could help with deleting (overriding) or resuming a run. This can be a separate issue from #408. I'll write an issue for this.
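
A minimal sketch of deriving such an identifier while leaving the display name untouched. The helper `make_run_id` is hypothetical, and the set of characters it replaces (`/ \ # ? % :`) is an assumption about what W&B rejects in run IDs that should be checked against the W&B documentation.

```python
import re

# Assumption: these characters are not accepted in a W&B run ID.
_FORBIDDEN = re.compile(r"[/\\#?%:]")


def make_run_id(group: str, model_name: str) -> str:
    """Build a deterministic run ID of the form <group>-<model_name>."""
    return _FORBIDDEN.sub("_", f"{group}-{model_name}")


run_id = make_run_id("test", "student")  # "test-student"
```

Because the ID depends only on the group and model name, a restarted task computes the same value and can target the existing run, while the run name shown on dashboards stays free of UIDs.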

vrigal force-pushed the publish-resumed-train branch from 7e975b6 to 84e5d3a Compare May 14, 2024 14:42

eu9ene requested changes May 14, 2024

vrigal added 2 commits May 15, 2024 09:16

Override W&B data on a resumed training

26f340c

Suggestions

7977001

vrigal force-pushed the publish-resumed-train branch from 84e5d3a to 7977001 Compare May 15, 2024 10:22

vrigal requested a review from eu9ene May 15, 2024 12:58

eu9ene approved these changes May 15, 2024

eu9ene merged commit ea95bc0 into mozilla:main May 15, 2024
4 checks passed

eu9ene mentioned this pull request May 15, 2024

Address issues with resuming training in W&B #601

Open

vrigal mentioned this pull request May 17, 2024

Use a defined run ID on W&B (refactoring) #610

Open
