Add support for automatically continuing training from earlier runs of a Task (fixes #270) #580
Conversation
import subprocess
import sys

TRAINING_SCRIPT = os.path.join(os.path.dirname(__file__), "train-taskcluster.sh")
CONTINUATION_ARTIFACTS = {
    "config.opustrainer.yml",
@eu9ene - I originally used the list from taskcluster/translations_taskgraph/transforms/training_continuation.py (CONTINUE_TRAINING_ARTIFACTS, line 5 in 99015e1), […] final models being present. Please sanity check this list.
And tangentially related: should we update the training continuation artifact list to expect opustrainer files?
Yes, to continue training from the same place OpusTrainer needs config.opustrainer.yml.state (this is the default path for the state file). I see that it already outputs this to the artifacts.
Ideally, we should modify train.sh not to rebuild the config if it's already present, but this code is deterministic, so it should work.
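For illustration, a minimal sketch of what the continuation set might look like once the state file is included. Only config.opustrainer.yml is taken from the hunk above; the other names are assumptions (the state file is OpusTrainer's default state path, and the model.npz* names are standard Marian training outputs), not the exact set in this PR:

# Hypothetical sketch: artifacts a resumed run must re-download from the
# previous run. The .state file records OpusTrainer's position in the data
# stream, so a resumed run picks up the curriculum where it left off.
CONTINUATION_ARTIFACTS = {
    "config.opustrainer.yml",
    "config.opustrainer.yml.state",
    "model.npz",
    "model.npz.optimizer.npz",
    "model.npz.progress.yml",
    "model.npz.yml",
}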
Force-pushed from 9814d5c to 6e45e93
r.raise_for_status()

with open(os.path.join(model_dir, out_name), "wb+") as fd:
    for chunk in r.iter_content(chunk_size=8192):
Any reasoning behind chunk_size=8192?
Honestly, it was just what I copy/pasted from StackOverflow, and it seemed like a reasonable amount of data to write at one time.
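For what it's worth, 8192 bytes matches CPython's io.DEFAULT_BUFFER_SIZE, so it's a defensible default. A self-contained sketch of the streaming pattern under discussion (the function name and arguments are placeholders, not this PR's code):

import requests


def download_artifact(url: str, dest: str, chunk_size: int = 8192) -> None:
    # Stream the response to disk instead of loading it into memory; for
    # multi-hundred-MB model files this keeps memory use flat. chunk_size
    # only trades write-call overhead against per-iteration memory, so the
    # exact value matters little.
    r = requests.get(url, stream=True)
    r.raise_for_status()
    with open(dest, "wb") as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)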
Yay for tests!
I'm approving after reviewing the last training run. The W&B dashboards and Marian training logs look correct. There are some comments to address about the number of arguments, though.
f"{root_url}/api/queue/v1/task/{task_id}/runs/{prev_run_id}/artifacts/{artifact['name']}", | ||
stream=True, | ||
) | ||
r.raise_for_status() |
I wonder what happens if there's an exception in this code, say from data downloading. Will the task just be retried? Do we want to implement any retries right here?
Taskgraph has some support for automatic retries based on exit status; I'll set that up.
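Roughly the shape that takes (a sketch, not the exact diff; download_continuation_artifacts is a hypothetical stand-in for the download loop shown above): the script funnels download failures into a distinct exit status, and the task definition is configured to retry on that status.

import sys

import requests

# Arbitrary, but distinct enough to retry on (mirrors the
# DOWNLOAD_ERROR_EXIT_CODE hunk later in this conversation).
DOWNLOAD_ERROR_EXIT_CODE = 17


def download_continuation_artifacts(args):
    # Hypothetical stand-in for the real download logic; it raises
    # requests.RequestException subclasses on network failures.
    ...


def main(args):
    try:
        download_continuation_artifacts(args)
    except requests.RequestException:
        # Exit with a well-known status so the worker-level retry
        # configuration re-runs the task instead of failing it outright.
        sys.exit(DOWNLOAD_ERROR_EXIT_CODE)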
I've kicked off two test runs to validate the latest changes.
Thanks! Looks good to me.
PRETRAINED_MODEL_MODE_ARG_NUMBER = 12
# Nothing special about 17...just a number plucked out of thin air that
# should be distinct enough to retry on.
DOWNLOAD_ERROR_EXIT_CODE = 17


def main(args):
Thought: I guess ultimately we'll peel back all the layers of this onion and merge all of these separate scripts into a single training script, written in Python, that can handle all of this. I filed #607 for this.
Yeah. I'm not proud of introducing this extra layer of indirection, but I do intend to follow up when there's more time.
@pytest.mark.parametrize(
    "current_run_id,resumable_run_id,reasons_created,run_artifacts,orig_pretrained_model_mode,expected_pretrained_model_mode",
Thought (no action needed): this style of parameterization makes it hard to connect the parameters with their names. I'm not sure if there is a way to label them more easily with a TypedDict or something. I do like the fact that they have ids on them.
I agree. I'm not aware of any way to do this with pytest params :(. It looks like pytest-dev/pytest#9216 talks about this a bit.
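One partial workaround (just a sketch of the idea, not something this PR does) is to bundle the values into a NamedTuple so each case's fields are named, while pytest.param keeps the readable ids:

from typing import NamedTuple, Optional

import pytest


class Case(NamedTuple):
    current_run_id: int
    resumable_run_id: Optional[int]
    orig_pretrained_model_mode: str
    expected_pretrained_model_mode: str


@pytest.mark.parametrize(
    "case",
    [
        # The values here are illustrative, not taken from the real tests.
        pytest.param(Case(0, None, "", ""), id="run-0-nothing-to-resume"),
        pytest.param(Case(1, 0, "", "continue"), id="run-1-resumes-run-0"),
    ],
)
def test_pretrained_model_mode(case):
    # Fields are accessed by name, so the case table stays readable as
    # more parameters are added.
    assert case.current_run_id >= 0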
tests/test_train_taskcluster.py (outdated)
tt_mock["requests"].get = mock.Mock(wraps=fake_get)

with tempfile.TemporaryDirectory() as model_dir:
Suggestion: We have the DataDir fixture that puts these directories in the same place, which makes it easier to debug failing tests.
data_dir = DataDir("test_train_taskcluster")
data_dir.path
That's handy! Let's use it here too.
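For the record, the swap is small (DataDir and its path attribute are taken from the suggestion above; the exact import path for the fixture is assumed to come from this repo's test helpers):

# Instead of a throwaway tempfile.TemporaryDirectory(), which is deleted
# even when a test fails, DataDir keeps output in a predictable location
# so the artifacts of a failing test can be inspected after the run.
data_dir = DataDir("test_train_taskcluster")
model_dir = data_dir.path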
#605 appears to be breaking finetune-student in the PR-triggered tasks :(
The run is now complete, and worked fine with the two simulated spot terminations.
Force-pushed from b392d36 to 9bf4f6b
Automatically retry training tasks on exit code 17: this is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume.
* Update bicleaner
* Lock requirements
* Use larger multilingual models
* Fix requirements
* Fix typo
* Add libs for hunspell
* Fix model downloading
* Fix model downloading
* Use a toolchain for cyhunspell
* Remove a hard github dependency
* Fix test
* Add COMET to the evaluation (#587)
* Custom cleaning (#547)
* Update default config
* Pre-download fast text model
* Add custom filters
* Add unit tests for config generation
* Make using custom filtering configs configurable
* Fix substitution
* Parse tasks with label finetune-student (#609)
* Add test case
* Update regex
* Add support for automatically continuing training from earlier runs of a Task (fixes #270) (#580)
* Add support for automatically continuing training from earlier runs of a Task.
* Automatically retry training tasks on exit code 17
  This is the exit code when train_taskcluster.py fails to download an artifact when attempting to resume.
* Support news-crawl importer (#608)
* Ensure all of the task labels can be parsed in the task graph (#612)
* Show the stack trace when there is an error in tracking
* Rename parse_tag to parse_task_label
* Document the regexes
* Turn the parsed label into a NamedTuple for better typing hints
* Get the tag parsing working with the full task graph
* Allow for 3 letter language codes
* Temporarily disable an evaluate task that is failing
* Update code docs a bit
* Fix tests for finetune-student
* Use shorter names for URLs (#611)
* Change the caching strategy for teachers (#606)
* Fix comment

Co-authored-by: Greg Tatum <gregtatum@users.noreply.github.com>
Co-authored-by: Valentin Rigal <rigal@teklia.com>
Co-authored-by: Ben Hearsum (he/him) <ben@mozilla.com>
Aside from the included tests, I tested this by hand. What I did was:
* Kicked off a train-backwards job: https://firefox-ci-tc.services.mozilla.com/tasks/bNP6s4FaRwaU7Bz8gxgJ-Q/runs/0
* […] (We reuse the artifacts directory when we continue training, so even though this run didn't checkpoint, it essentially just re-uploaded run #0's work.)
(I ended up canceling run 2 to avoid wasting resources.)
Additional interpretation of the logs and artifacts is welcome; I'm also happy to do more simulated test runs if that's useful.
Landing this depends on an update to the GPU images that includes the latest spot termination handling code. I expect this to land in the next day.