
Add support for automatically continuing training from earlier runs of a Task (fixes #270) #580

Merged

2 commits merged into mozilla:main from the autocontinue branch on May 17, 2024

Conversation


@bhearsum bhearsum commented May 8, 2024

Aside from the included tests, I tested this by hand. What I did was:

[task 2024-05-07T20:54:35.749Z] INFO:root:run_id > 0, attempting to resume training from an earlier run...
[task 2024-05-07T20:54:35.880Z] INFO:root:Run 0 appears to have the artifacts we need! Downloading them...
[task 2024-05-07T20:54:35.880Z] INFO:root:Fetching public/build/config.opustrainer.yml...
[task 2024-05-07T20:54:36.123Z] INFO:root:Fetching public/build/config.opustrainer.yml.state...
[task 2024-05-07T20:54:36.319Z] INFO:root:Fetching public/build/devset.out...
[task 2024-05-07T20:54:36.584Z] INFO:root:Fetching public/build/model.npz...
[task 2024-05-07T20:54:38.649Z] INFO:root:Fetching public/build/model.npz.best-bleu-detok.npz...
[task 2024-05-07T20:54:40.588Z] INFO:root:Fetching public/build/model.npz.best-bleu-detok.npz.decoder.yml...
[task 2024-05-07T20:54:40.770Z] INFO:root:Fetching public/build/model.npz.best-ce-mean-words.npz...
[task 2024-05-07T20:54:42.583Z] INFO:root:Fetching public/build/model.npz.best-ce-mean-words.npz.decoder.yml...
[task 2024-05-07T20:54:42.821Z] INFO:root:Fetching public/build/model.npz.best-chrf.npz...
[task 2024-05-07T20:54:44.517Z] INFO:root:Fetching public/build/model.npz.best-chrf.npz.decoder.yml...
[task 2024-05-07T20:54:44.758Z] INFO:root:Fetching public/build/model.npz.decoder.yml...
[task 2024-05-07T20:54:44.990Z] INFO:root:Fetching public/build/model.npz.optimizer.npz...
[task 2024-05-07T20:54:51.438Z] INFO:root:Fetching public/build/model.npz.progress.yml...
[task 2024-05-07T20:54:51.634Z] INFO:root:Fetching public/build/model.npz.yml...
[task 2024-05-07T20:54:51.834Z] INFO:root:Fetching public/build/opustrainer.log...
[task 2024-05-07T20:54:52.042Z] INFO:root:Fetching public/build/train.log...
[task 2024-05-07T20:54:52.243Z] INFO:root:Fetching public/build/valid.log...
[task 2024-05-07T20:54:52.458Z] INFO:root:Fetching public/build/vocab.spm...
<a bit further down>
[task 2024-05-07T20:57:19.459Z] [2024-05-07 20:57:19] [training] Master parameters and optimizers restored from training checkpoint /home/ubuntu/tasks/task_171511512470744/artifacts/model.npz and /home/ubuntu/tasks/task_171511512470744/artifacts/model.npz.optimizer.npz
[task 2024-05-07T21:04:06.641Z] INFO:root:run_id > 0, attempting to resume training from an earlier run...
[task 2024-05-07T21:04:06.718Z] INFO:root:Run 1 appears to have the artifacts we need! Downloading them...
[task 2024-05-07T21:04:06.718Z] INFO:root:Fetching public/build/config.opustrainer.yml...
[task 2024-05-07T21:04:06.955Z] INFO:root:Fetching public/build/config.opustrainer.yml.state...
[task 2024-05-07T21:04:07.181Z] INFO:root:Fetching public/build/devset.out...
[task 2024-05-07T21:04:07.433Z] INFO:root:Fetching public/build/model.npz...
[task 2024-05-07T21:04:10.476Z] INFO:root:Fetching public/build/model.npz.best-bleu-detok.npz...
[task 2024-05-07T21:04:12.597Z] INFO:root:Fetching public/build/model.npz.best-bleu-detok.npz.decoder.yml...
[task 2024-05-07T21:04:12.812Z] INFO:root:Fetching public/build/model.npz.best-ce-mean-words.npz...
[task 2024-05-07T21:04:15.629Z] INFO:root:Fetching public/build/model.npz.best-ce-mean-words.npz.decoder.yml...
[task 2024-05-07T21:04:15.833Z] INFO:root:Fetching public/build/model.npz.best-chrf.npz...
[task 2024-05-07T21:04:19.250Z] INFO:root:Fetching public/build/model.npz.best-chrf.npz.decoder.yml...
[task 2024-05-07T21:04:19.447Z] INFO:root:Fetching public/build/model.npz.decoder.yml...
[task 2024-05-07T21:04:19.660Z] INFO:root:Fetching public/build/model.npz.optimizer.npz...
[task 2024-05-07T21:04:28.699Z] INFO:root:Fetching public/build/model.npz.progress.yml...
[task 2024-05-07T21:04:28.941Z] INFO:root:Fetching public/build/model.npz.yml...
[task 2024-05-07T21:04:29.166Z] INFO:root:Fetching public/build/opustrainer.log...
[task 2024-05-07T21:04:29.399Z] INFO:root:Fetching public/build/train.log...
[task 2024-05-07T21:04:29.609Z] INFO:root:Fetching public/build/valid.log...
[task 2024-05-07T21:04:29.842Z] INFO:root:Fetching public/build/vocab.spm...
<a bit further down>
[task 2024-05-07T21:06:50.280Z] [2024-05-07 21:06:50] [training] Master parameters and optimizers restored from training checkpoint /home/ubuntu/tasks/task_171511569156440/artifacts/model.npz and /home/ubuntu/tasks/task_171511569156440/artifacts/model.npz.optimizer.npz

(I ended up canceling run 2 to avoid wasting resources.)

Additional interpretation of the logs and artifacts is welcome; I'm also happy to do more simulated test runs if that's useful.

Landing this depends on an update to the GPU images that includes the latest spot termination handling code. I expect that update to land within the next day.

@bhearsum bhearsum requested review from gregtatum and eu9ene May 8, 2024 17:56
@bhearsum bhearsum requested a review from a team as a code owner May 8, 2024 17:56
@bhearsum bhearsum requested a review from hneiva May 8, 2024 17:56
import subprocess
import sys

TRAINING_SCRIPT = os.path.join(os.path.dirname(__file__), "train-taskcluster.sh")
CONTINUATION_ARTIFACTS = {
"config.opustrainer.yml",

Collaborator Author

@eu9ene - I originally used the list from

for this, but found it didn't work due to final models being present. Please sanity check this list.

And tangentially related: should we update the training continuation artifact list to expect opustrainer files?
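
For reference, a rough sketch of how a previous run's artifact listing could be checked against CONTINUATION_ARTIFACTS before anything is downloaded. The helper name and response handling are assumptions, not the code in this PR; the endpoint shape is inferred from the download URL used elsewhere in this script.

import os

import requests


def run_has_continuation_artifacts(root_url, task_id, run_id, required):
    """Return True if the given run exposes every artifact needed to resume training."""
    # List the run's artifacts via the Taskcluster queue API (pagination ignored here).
    resp = requests.get(f"{root_url}/api/queue/v1/task/{task_id}/runs/{run_id}/artifacts")
    resp.raise_for_status()
    # Artifact names look like "public/build/model.npz"; compare their basenames
    # against the required set (e.g. CONTINUATION_ARTIFACTS).
    names = {os.path.basename(artifact["name"]) for artifact in resp.json()["artifacts"]}
    return required <= names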

Collaborator

Yes, to continue training from the same place, OpusTrainer needs config.opustrainer.yml.state (this is the default path for the state file). I see that it already outputs this to the artifacts.

Ideally, we should modify train.sh so that it doesn't rebuild the config when one is already present, but that code is deterministic, so it should work either way.

@bhearsum bhearsum changed the title Add support for automatically continuing training from earlier runs of a Task Add support for automatically continuing training from earlier runs of a Task (fixes #270) May 9, 2024
@bhearsum bhearsum force-pushed the autocontinue branch 3 times, most recently from 9814d5c to 6e45e93 Compare May 13, 2024 22:41
r.raise_for_status()

with open(os.path.join(model_dir, out_name), "wb+") as fd:
    for chunk in r.iter_content(chunk_size=8192):

Member

Any reasoning behind chunk_size=8192?

Collaborator Author

Honestly, it was just what I copied from Stack Overflow, and it seemed like a reasonable amount of data to write at one time.
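
For context, a minimal sketch of the streaming-download pattern the excerpt above uses; the helper name and URL handling here are illustrative assumptions, not the exact code in this PR. With stream=True, iter_content writes the artifact to disk in fixed-size chunks instead of loading the whole response into memory, and 8192 bytes is simply a common default for that chunk size.

import os

import requests


def download_artifact(url, model_dir, out_name, chunk_size=8192):
    """Stream an artifact to disk without holding the whole file in memory."""
    r = requests.get(url, stream=True)
    r.raise_for_status()
    with open(os.path.join(model_dir, out_name), "wb+") as fd:
        # Larger chunks mean fewer write calls; smaller chunks mean less memory
        # held per iteration. 8192 bytes is a conventional middle ground.
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)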

Member

Yay for tests!

@eu9ene eu9ene (Collaborator) left a comment

I'm approving after reviewing the last training run. W&B dashboards and Marian training logs look correct. There are some comments to address about the number of arguments though.

f"{root_url}/api/queue/v1/task/{task_id}/runs/{prev_run_id}/artifacts/{artifact['name']}",
stream=True,
)
r.raise_for_status()

Collaborator

I wonder what happens if there's an exception in this code, say from data downloading. Will the task just be retried? Do we want to implement any retries right here?

Collaborator Author

Taskgraph has some support for automatic retries based on exit status; I'll set that up.
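
The approach that landed (see DOWNLOAD_ERROR_EXIT_CODE further down, and the follow-up commit "Automatically retry training tasks on exit code 17") is to exit with a distinct status on download failures and let the task definition retry on that code. A rough sketch of that pattern; the helper name and the exact retry wiring are assumptions:

import sys

import requests

# Matches the DOWNLOAD_ERROR_EXIT_CODE constant shown below: an arbitrary but
# distinct exit code that the task definition can mark as retryable.
DOWNLOAD_ERROR_EXIT_CODE = 17


def fetch_artifacts_or_exit(urls):
    """Download each artifact; exit with a retryable status on any network failure."""
    try:
        for url in urls:
            r = requests.get(url, stream=True)
            r.raise_for_status()
    except requests.RequestException:
        # Taskcluster can be told (e.g. via the worker's retry-exit-status /
        # onExitStatus settings) to retry the run when it exits with this code.
        sys.exit(DOWNLOAD_ERROR_EXIT_CODE)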

@bhearsum (Collaborator Author)

I've kicked off two test runs to validate the latest changes:

@gregtatum gregtatum (Member) left a comment

Thanks! Looks good to me.

PRETRAINED_MODEL_MODE_ARG_NUMBER = 12
# Nothing special about 17...just a number plucked out of thin air that
# should be distinct enough to retry on.
DOWNLOAD_ERROR_EXIT_CODE = 17


def main(args):

Member

Thought: I guess ultimately we'll peel back all the layers of this onion and merge all of these separate scripts into a single training script that can handle all of this, written in Python. I filed #607 for this.

Collaborator Author

Yeah. I'm not proud of introducing this extra layer of indirection, but I do intend to follow up when there's more time.



@pytest.mark.parametrize(
"current_run_id,resumable_run_id,reasons_created,run_artifacts,orig_pretrained_model_mode,expected_pretrained_model_mode",

Member

Thought (no action needed): with this style of parameterization it's hard to connect the parameters with their names. I'm not sure if there's a way to label them more easily with a TypedDict or something. I do like the fact that they have ids on them.

Collaborator Author

I agree. I'm not aware of any way to do this with pytest params :(. It looks like pytest-dev/pytest#9216 talks about this a bit.
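
One possible workaround, sketched under the assumption that restructuring the parameters would be acceptable (the field names and values below are illustrative and cover only a subset of the real parametrize string): bundle each case into a dataclass so fields are referenced by name while keeping explicit ids.

from dataclasses import dataclass

import pytest


@dataclass
class ContinuationCase:
    # Only a few illustrative fields; the real test parametrizes several more.
    current_run_id: int
    resumable_run_id: int
    expected_pretrained_model_mode: str


CASES = [
    pytest.param(
        ContinuationCase(
            current_run_id=1,
            resumable_run_id=0,
            expected_pretrained_model_mode="continue",
        ),
        id="resume_from_run_0",
    ),
]


@pytest.mark.parametrize("case", CASES)
def test_continuation(case):
    # Fields are accessed by name, which keeps long parameter lists readable.
    assert case.current_run_id > case.resumable_run_id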


tt_mock["requests"].get = mock.Mock(wraps=fake_get)

with tempfile.TemporaryDirectory() as model_dir:

Member

Suggestion: we have the DataDir fixture that puts these directories in the same place, which makes it easier to debug failing tests.

data_dir = DataDir("test_train_taskcluster")
data_dir.path

Collaborator Author

That's handy! Let's use it here too.

@bhearsum (Collaborator Author)

#605 appears to be breaking finetune-student in the PR triggered tasks :(

I've kicked off two test runs to validate the latest changes:

* https://firefox-ci-tc.services.mozilla.com/tasks/O9QxLh3FQ2uE29lb1y257w is the same as previous runs, where I wait for a checkpoint, simulate a spot termination, and verify that training continues. I've done this once so far, and will do it again after the next checkpoint.

The run is now complete, and worked fine with the two simulated spot terminations.

@bhearsum bhearsum force-pushed the autocontinue branch 3 times, most recently from b392d36 to 9bf4f6b Compare May 17, 2024 17:56
This is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume.
@bhearsum bhearsum merged commit 7e97421 into mozilla:main May 17, 2024
7 of 13 checks passed
eu9ene pushed a commit that referenced this pull request May 21, 2024
…f a Task (fixes #270) (#580)

* Add support for automatically continuing training from earlier runs of a Task.

* Automatically retry training tasks on exit code 17

This is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume.
eu9ene added a commit that referenced this pull request May 22, 2024
* Update bicleaner

* Lock requirements

* Use larger multilingual models

* Fix requirements

* Fix typo

* Add libs for hunspell

* Fix model downloading

* Fix model downloading

* Use a toolchain for cyhunspell

* Remove a hard github dependency

* Fix test

* Add COMET to the evaluation (#587)

* Custom cleaning (#547)

* Update default config

* Pre-download fast text model

* Add custom filters

* Add unit tests for config generation

* Make using custom filtering configs configurable

* Fix substitution

* Parse tasks with label finetune-student (#609)

* Add test case

* Update regex

* Add support for automatically continuing training from earlier runs of a Task (fixes #270) (#580)

* Add support for automatically continuing training from earlier runs of a Task.

* Automatically retry training tasks on exit code 17

This is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume.

* Support news-crawl importer (#608)

* Ensure all of the task labels can be parsed in the task graph (#612)

* Show the stack trace when there is an error in tracking

* Rename parse_tag to parse_task_label

* Document the regexes

* Turn the parsed label into a NamedTuple for better typing hints

* Get the tag parsing working with the full task graph

* Allow for 3 letter language codes

* Temporarily disable an evaluate task that is failing

* Update code docs a bit

* Fix tests for finetune-student

* Use shorter names for URLs (#611)

* Change the caching strategy for teachers (#606)

* Fix comment

---------

Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>
Co-authored-by: Valentin Rigal <rigal@teklia.com>
Co-authored-by: Ben Hearsum (he/him) <ben@mozilla.com>