
Add support for automatically continuing training from earlier runs of a Task (fixes #270) #580

Merged

2 commits merged into mozilla:main from the autocontinue branch on May 17, 2024

Conversation


@bhearsum bhearsum commented May 8, 2024

Aside from the included tests, I tested this by hand. What I did was:

[task 2024-05-07T20:54:35.749Z] INFO:root:run_id > 0, attempting to resume training from an earlier run...
[task 2024-05-07T20:54:35.880Z] INFO:root:Run 0 appears to have the artifacts we need! Downloading them...
[task 2024-05-07T20:54:35.880Z] INFO:root:Fetching public/build/config.opustrainer.yml...
[task 2024-05-07T20:54:36.123Z] INFO:root:Fetching public/build/config.opustrainer.yml.state...
[task 2024-05-07T20:54:36.319Z] INFO:root:Fetching public/build/devset.out...
[task 2024-05-07T20:54:36.584Z] INFO:root:Fetching public/build/model.npz...
[task 2024-05-07T20:54:38.649Z] INFO:root:Fetching public/build/model.npz.best-bleu-detok.npz...
[task 2024-05-07T20:54:40.588Z] INFO:root:Fetching public/build/model.npz.best-bleu-detok.npz.decoder.yml...
[task 2024-05-07T20:54:40.770Z] INFO:root:Fetching public/build/model.npz.best-ce-mean-words.npz...
[task 2024-05-07T20:54:42.583Z] INFO:root:Fetching public/build/model.npz.best-ce-mean-words.npz.decoder.yml...
[task 2024-05-07T20:54:42.821Z] INFO:root:Fetching public/build/model.npz.best-chrf.npz...
[task 2024-05-07T20:54:44.517Z] INFO:root:Fetching public/build/model.npz.best-chrf.npz.decoder.yml...
[task 2024-05-07T20:54:44.758Z] INFO:root:Fetching public/build/model.npz.decoder.yml...
[task 2024-05-07T20:54:44.990Z] INFO:root:Fetching public/build/model.npz.optimizer.npz...
[task 2024-05-07T20:54:51.438Z] INFO:root:Fetching public/build/model.npz.progress.yml...
[task 2024-05-07T20:54:51.634Z] INFO:root:Fetching public/build/model.npz.yml...
[task 2024-05-07T20:54:51.834Z] INFO:root:Fetching public/build/opustrainer.log...
[task 2024-05-07T20:54:52.042Z] INFO:root:Fetching public/build/train.log...
[task 2024-05-07T20:54:52.243Z] INFO:root:Fetching public/build/valid.log...
[task 2024-05-07T20:54:52.458Z] INFO:root:Fetching public/build/vocab.spm...
<a bit further down>
[task 2024-05-07T20:57:19.459Z] [2024-05-07 20:57:19] [training] Master parameters and optimizers restored from training checkpoint /home/ubuntu/tasks/task_171511512470744/artifacts/model.npz and /home/ubuntu/tasks/task_171511512470744/artifacts/model.npz.optimizer.npz
[task 2024-05-07T21:04:06.641Z] INFO:root:run_id > 0, attempting to resume training from an earlier run...
[task 2024-05-07T21:04:06.718Z] INFO:root:Run 1 appears to have the artifacts we need! Downloading them...
[task 2024-05-07T21:04:06.718Z] INFO:root:Fetching public/build/config.opustrainer.yml...
[task 2024-05-07T21:04:06.955Z] INFO:root:Fetching public/build/config.opustrainer.yml.state...
[task 2024-05-07T21:04:07.181Z] INFO:root:Fetching public/build/devset.out...
[task 2024-05-07T21:04:07.433Z] INFO:root:Fetching public/build/model.npz...
[task 2024-05-07T21:04:10.476Z] INFO:root:Fetching public/build/model.npz.best-bleu-detok.npz...
[task 2024-05-07T21:04:12.597Z] INFO:root:Fetching public/build/model.npz.best-bleu-detok.npz.decoder.yml...
[task 2024-05-07T21:04:12.812Z] INFO:root:Fetching public/build/model.npz.best-ce-mean-words.npz...
[task 2024-05-07T21:04:15.629Z] INFO:root:Fetching public/build/model.npz.best-ce-mean-words.npz.decoder.yml...
[task 2024-05-07T21:04:15.833Z] INFO:root:Fetching public/build/model.npz.best-chrf.npz...
[task 2024-05-07T21:04:19.250Z] INFO:root:Fetching public/build/model.npz.best-chrf.npz.decoder.yml...
[task 2024-05-07T21:04:19.447Z] INFO:root:Fetching public/build/model.npz.decoder.yml...
[task 2024-05-07T21:04:19.660Z] INFO:root:Fetching public/build/model.npz.optimizer.npz...
[task 2024-05-07T21:04:28.699Z] INFO:root:Fetching public/build/model.npz.progress.yml...
[task 2024-05-07T21:04:28.941Z] INFO:root:Fetching public/build/model.npz.yml...
[task 2024-05-07T21:04:29.166Z] INFO:root:Fetching public/build/opustrainer.log...
[task 2024-05-07T21:04:29.399Z] INFO:root:Fetching public/build/train.log...
[task 2024-05-07T21:04:29.609Z] INFO:root:Fetching public/build/valid.log...
[task 2024-05-07T21:04:29.842Z] INFO:root:Fetching public/build/vocab.spm...
<a bit further down>
[task 2024-05-07T21:06:50.280Z] [2024-05-07 21:06:50] [training] Master parameters and optimizers restored from training checkpoint /home/ubuntu/tasks/task_171511569156440/artifacts/model.npz and /home/ubuntu/tasks/task_171511569156440/artifacts/model.npz.optimizer.npz

(I ended up canceling run 2 to avoid wasting resources.)

Additional interpretation of the logs and artifacts is welcome; I'm also happy to do more simulated test runs if that's useful.

Landing this depends on an update to the GPU images that includes the latest spot termination handling code. I expect that update to land within the next day.

@bhearsum bhearsum requested review from gregtatum and eu9ene May 8, 2024 17:56
@bhearsum bhearsum requested a review from a team as a code owner May 8, 2024 17:56
@bhearsum bhearsum requested a review from hneiva May 8, 2024 17:56
import subprocess
import sys

TRAINING_SCRIPT = os.path.join(os.path.dirname(__file__), "train-taskcluster.sh")
CONTINUATION_ARTIFACTS = {
"config.opustrainer.yml",

Collaborator Author

@eu9ene - I originally used the list from

for this, but found it didn't work due to final models being present. Please sanity check this list.

And tangentially related: should we update the training continuation artifact list to expect opustrainer files?
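
For reference, a rough sketch of how a previous run's artifact listing could be checked against CONTINUATION_ARTIFACTS before anything is downloaded. The helper name and response handling are assumptions, not the code in this PR; the endpoint shape is inferred from the download URL used elsewhere in this script.

import os

import requests


def run_has_continuation_artifacts(root_url, task_id, run_id, required):
    """Return True if the given run exposes every artifact needed to resume training."""
    # List the run's artifacts via the Taskcluster queue API (pagination ignored here).
    resp = requests.get(f"{root_url}/api/queue/v1/task/{task_id}/runs/{run_id}/artifacts")
    resp.raise_for_status()
    # Artifact names look like "public/build/model.npz"; compare their basenames
    # against the required set (e.g. CONTINUATION_ARTIFACTS).
    names = {os.path.basename(artifact["name"]) for artifact in resp.json()["artifacts"]}
    return required <= names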

Collaborator

Yes, to continue training from the same place, OpusTrainer needs config.opustrainer.yml.state (this is the default path for the state file). I see that it already outputs this to the artifacts.

Ideally, we should modify train.sh so that it doesn't rebuild the config when one is already present, but that code is deterministic, so it should work either way.

@bhearsum bhearsum changed the title Add support for automatically continuing training from earlier runs of a Task Add support for automatically continuing training from earlier runs of a Task (fixes #270) May 9, 2024
@bhearsum bhearsum force-pushed the autocontinue branch 3 times, most recently from 9814d5c to 6e45e93 Compare May 13, 2024 22:41
r.raise_for_status()

with open(os.path.join(model_dir, out_name), "wb+") as fd:
    for chunk in r.iter_content(chunk_size=8192):

Member

Any reasoning behind chunk_size=8192?

Collaborator Author

Honestly, it was just what I copied from Stack Overflow, and it seemed like a reasonable amount of data to write at one time.
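
For context, a minimal sketch of the streaming-download pattern the excerpt above uses; the helper name and URL handling here are illustrative assumptions, not the exact code in this PR. With stream=True, iter_content writes the artifact to disk in fixed-size chunks instead of loading the whole response into memory, and 8192 bytes is simply a common default for that chunk size.

import os

import requests


def download_artifact(url, model_dir, out_name, chunk_size=8192):
    """Stream an artifact to disk without holding the whole file in memory."""
    r = requests.get(url, stream=True)
    r.raise_for_status()
    with open(os.path.join(model_dir, out_name), "wb+") as fd:
        # Larger chunks mean fewer write calls; smaller chunks mean less memory
        # held per iteration. 8192 bytes is a conventional middle ground.
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)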

Member

Yay for tests!

@eu9ene eu9ene (Collaborator) left a comment

I'm approving after reviewing the last training run. W&B dashboards and Marian training logs look correct. There are some comments to address about the number of arguments though.

f"{root_url}/api/queue/v1/task/{task_id}/runs/{prev_run_id}/artifacts/{artifact['name']}",
stream=True,
)
r.raise_for_status()

Collaborator

I wonder what happens if there's an exception in this code, say from data downloading. Will the task just be retried? Do we want to implement any retries right here?

Collaborator Author

Taskgraph has some support for automatic retries based on exit status; I'll set that up.
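
The approach that landed (see DOWNLOAD_ERROR_EXIT_CODE further down, and the follow-up commit "Automatically retry training tasks on exit code 17") is to exit with a distinct status on download failures and let the task definition retry on that code. A rough sketch of that pattern; the helper name and the exact retry wiring are assumptions:

import sys

import requests

# Matches the DOWNLOAD_ERROR_EXIT_CODE constant shown below: an arbitrary but
# distinct exit code that the task definition can mark as retryable.
DOWNLOAD_ERROR_EXIT_CODE = 17


def fetch_artifacts_or_exit(urls):
    """Download each artifact; exit with a retryable status on any network failure."""
    try:
        for url in urls:
            r = requests.get(url, stream=True)
            r.raise_for_status()
    except requests.RequestException:
        # Taskcluster can be told (e.g. via the worker's retry-exit-status /
        # onExitStatus settings) to retry the run when it exits with this code.
        sys.exit(DOWNLOAD_ERROR_EXIT_CODE)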

@bhearsum (Collaborator Author)

I've kicked off two test runs to validate the latest changes:

@gregtatum gregtatum (Member) left a comment

Thanks! Looks good to me.

PRETRAINED_MODEL_MODE_ARG_NUMBER = 12
# Nothing special about 17...just a number plucked out of thin air that
# should be distinct enough to retry on.
DOWNLOAD_ERROR_EXIT_CODE = 17


def main(args):

Member

Thought: I guess ultimately we'll peel back all the layers of this onion and merge all of these separate scripts into a single training script that can handle all of this, written in Python. I filed #607 for this.

Collaborator Author

Yeah. I'm not proud of introducing this extra layer of indirection, but I do intend to follow up when there's more time.



@pytest.mark.parametrize(
"current_run_id,resumable_run_id,reasons_created,run_artifacts,orig_pretrained_model_mode,expected_pretrained_model_mode",

Member

Thought (no action needed): with this style of parameterization it's hard to connect the parameters with their names. I'm not sure if there's a way to label them more easily with a TypedDict or something. I do like the fact that they have ids on them.

Collaborator Author

I agree. I'm not aware of any way to do this with pytest params :(. It looks like pytest-dev/pytest#9216 talks about this a bit.
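
One possible workaround, sketched under the assumption that restructuring the parameters would be acceptable (the field names and values below are illustrative and cover only a subset of the real parametrize string): bundle each case into a dataclass so fields are referenced by name while keeping explicit ids.

from dataclasses import dataclass

import pytest


@dataclass
class ContinuationCase:
    # Only a few illustrative fields; the real test parametrizes several more.
    current_run_id: int
    resumable_run_id: int
    expected_pretrained_model_mode: str


CASES = [
    pytest.param(
        ContinuationCase(
            current_run_id=1,
            resumable_run_id=0,
            expected_pretrained_model_mode="continue",
        ),
        id="resume_from_run_0",
    ),
]


@pytest.mark.parametrize("case", CASES)
def test_continuation(case):
    # Fields are accessed by name, which keeps long parameter lists readable.
    assert case.current_run_id > case.resumable_run_id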


tt_mock["requests"].get = mock.Mock(wraps=fake_get)

with tempfile.TemporaryDirectory() as model_dir:

Member

Suggestion: we have the DataDir fixture that puts these directories in the same place, which makes it easier to debug failing tests.

data_dir = DataDir("test_train_taskcluster")
data_dir.path

Collaborator Author

That's handy! Let's use it here too.

@bhearsum (Collaborator Author)

#605 appears to be breaking finetune-student in the PR triggered tasks :(

I've kicked off two test runs to validate the latest changes:

* https://firefox-ci-tc.services.mozilla.com/tasks/O9QxLh3FQ2uE29lb1y257w is the same as previous runs, where I wait for a checkpoint, simulate a spot termination, and verify that training continues. I've done this once so far, and will do it again after the next checkpoint.

The run is now complete, and worked fine with the two simulated spot terminations.

@bhearsum bhearsum force-pushed the autocontinue branch 3 times, most recently from b392d36 to 9bf4f6b Compare May 17, 2024 17:56
This is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume.
@bhearsum bhearsum merged commit 7e97421 into mozilla:main May 17, 2024
7 of 13 checks passed
eu9ene pushed a commit that referenced this pull request May 21, 2024
…f a Task (fixes #270) (#580)

* Add support for automatically continuing training from earlier runs of a Task.

* Automatically retry training tasks on exit code 17

This is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume.
eu9ene added a commit that referenced this pull request May 22, 2024
* Update bicleaner

* Lock requirements

* Use larger multilingual models

* Fix requirements

* Fix typo

* Add libs for hunspell

* Fix model downloading

* Fix model downloading

* Use a toolchain for cyhunspell

* Remove a hard github dependency

* Fix test

* Add COMET to the evaluation (#587)

* Custom cleaning (#547)

* Update default config

* Pre-download fast text model

* Add custom filters

* Add unit tests for config generation

* Make using custom filtering configs configurable

* Fix substitution

* Parse tasks with label finetune-student (#609)

* Add test case

* Update regex

* Add support for automatically continuing training from earlier runs of a Task (fixes #270) (#580)

* Add support for automatically continuing training from earlier runs of a Task.

* Automatically retry training tasks on exit code 17

This is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume.

* Support news-crawl importer (#608)

* Ensure all of the task labels can be parsed in the task graph (#612)

* Show the stack trace when there is an error in tracking

* Rename parse_tag to parse_task_label

* Document the regexes

* Turn the parsed label into a NamedTuple for better typing hints

* Get the tag parsing working with the full task graph

* Allow for 3 letter language codes

* Temporarily disable an evaluate task that is failing

* Update code docs a bit

* Fix tests for finetune-student

* Use shorter names for URLs (#611)

* Change the caching strategy for teachers (#606)

* Fix comment

---------

Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>
Co-authored-by: Valentin Rigal <rigal@teklia.com>
Co-authored-by: Ben Hearsum (he/him) <ben@mozilla.com>