These scripts are used during the E2E test GitHub action to run some models and validate the results.
The workflow in `e2e_test.yml` does a few things:

- Set up `gcloud` credentials from a Service Account key managed in repo secrets.
- Install `torchprime`.
- Test `tp use` and point it to an XPK cluster hosted internally.
- Test `tp run` on a few models.
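For reference, the setup portion of the workflow might look roughly like the sketch below. The secret name, action versions, and elided `tp` flags are illustrative assumptions, not the actual contents of `e2e_test.yml`.

```yaml
# Hypothetical sketch of e2e_test.yml's setup; GCP_SA_KEY and the
# action versions are assumptions, not the real workflow contents.
jobs:
  e2e:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - name: Set up gcloud credentials
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}  # assumed secret name
      - name: Install torchprime
        run: pip install -e .
      - name: Point tp at the internal XPK cluster
        run: tp use  # cluster-selection flags elided
      - name: Kick off training on a few models
        run: tp run  # model and config flags elided
```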
After kicking off the training of some models, it starts a parallel job for each model and runs a few checks (sketched below). This is implemented in `reusable_e2e_check.yml`:

- Stream the logs.
- Check the workload exit code.
- Check for specific log strings that indicate training success.
- Check that there is a profile `.pb` file.
- Check that the step time is within a reasonable range.
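Condensed into a single reusable job, those checks could look something like the following; the helper script, input name, success string, and paths are assumptions rather than the actual contents of `reusable_e2e_check.yml`.

```yaml
# Hypothetical condensed version of the per-model checks; the helper
# script, marker string, and profile path are illustrative only.
jobs:
  check:
    runs-on: ubuntu-22.04
    steps:
      - name: Stream logs and check workload exit code
        env:
          WORKLOAD_NAME: ${{ inputs.workload_name }}  # assumed workflow_call input
        run: |
          # pipefail so a helper failure isn't masked by tee.
          set -euo pipefail
          # Assumed helper that follows the workload's logs and exits
          # with the workload's exit code.
          ./scripts/stream_workload_logs.sh "$WORKLOAD_NAME" | tee logs.txt
      - name: Check success marker and profile artifact
        run: |
          set -euo pipefail
          grep -q "Finished training" logs.txt    # assumed success string
          find profile -name '*.pb' | grep -q .   # require a profile .pb file
```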
As of #208, we check that the step time of model training E2E tests falls within experimentally derived bounds.

If the step time falls below the lower bound:

- If this is the result of an optimization or updated dependencies, you may re-center the bounds at the latest step time, keeping the confidence intervals unchanged.
- If there were no code changes and the step time is still below the lower bound, consider growing the confidence interval.

If the step time goes above the upper bound:

- If this is the result of a code change, you may have introduced a regression. Investigate the root cause and fix the slowdown before landing the PR.
- If there were no code changes and the step time is still above the upper bound, discuss with the hardware teams, since it may be the result of hardware changes.
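To illustrate the bounds check itself, a step along the following lines would fail the job whenever the measured step time leaves the interval. The metrics file, its field name, and the bound values are made-up examples, not the real ones used in CI.

```yaml
# Hypothetical bounds check; metrics.json, its field, and the bound
# values below are examples only.
- name: Check step time is within bounds
  run: |
    set -euo pipefail
    LOWER=1.10   # assumed lower confidence bound, in seconds
    UPPER=1.30   # assumed upper confidence bound, in seconds
    STEP_TIME=$(jq -r '.average_step_time_seconds' metrics.json)
    # Exit non-zero (failing the job) when outside [LOWER, UPPER].
    awk -v t="$STEP_TIME" -v lo="$LOWER" -v hi="$UPPER" \
      'BEGIN { exit !(t >= lo && t <= hi) }'
```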
E2E tests are launched onto an XPK cluster named `tpu-v6e-ci`.
(Googlers only) To heal or re-create the cluster, use the following:

```sh
xpk cluster create \
    --tpu-type v6e-4 \
    --cluster tpu-v6e-ci \
    --num-slices 64 \
    --on-demand \
    --zone us-central2-b \
    --project tpu-pytorch \
    --default-pool-cpu-machine-type=n2-standard-32
```
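Once the command completes, `xpk cluster list --zone=us-central2-b --project=tpu-pytorch` should show `tpu-v6e-ci` among the clusters (assuming a current `xpk` CLI; check `xpk cluster list --help` for the exact flags).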