These scripts are used during the E2E test GitHub action to run some models and validate the results.
The workflow in `e2e_test.yml` does a few things:

- Set up `gcloud` credentials from a Service Account key managed in repo secrets.
- Install `torchprime`.
- Test `tp use` and point it to an XPK cluster hosted internally.
- Test `tp run` on a few models.
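For reference, the setup portion of the workflow might look roughly like the sketch below. The secret name, action versions, and elided `tp` flags are illustrative assumptions, not the actual contents of `e2e_test.yml`.

```yaml
# Hypothetical sketch of e2e_test.yml's setup; GCP_SA_KEY and the
# action versions are assumptions, not the real workflow contents.
jobs:
  e2e:
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
      - name: Set up gcloud credentials
        uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}  # assumed secret name
      - name: Install torchprime
        run: pip install -e .
      - name: Point tp at the internal XPK cluster
        run: tp use  # cluster-selection flags elided
      - name: Kick off training on a few models
        run: tp run  # model and config flags elided
```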
After kicking off the training of some models, it starts a parallel job for each model and runs a few checks (sketched below). This is implemented in `reusable_e2e_check.yml`:

- Stream the logs.
- Check the workload exit code.
- Check for specific log strings that indicate training success.
- Check that there is a profile `.pb` file.
- Check that the step time is within a reasonable range.
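Condensed into a single reusable job, those checks could look something like the following; the helper script, input name, success string, and paths are assumptions rather than the actual contents of `reusable_e2e_check.yml`.

```yaml
# Hypothetical condensed version of the per-model checks; the helper
# script, marker string, and profile path are illustrative only.
jobs:
  check:
    runs-on: ubuntu-22.04
    steps:
      - name: Stream logs and check workload exit code
        env:
          WORKLOAD_NAME: ${{ inputs.workload_name }}  # assumed workflow_call input
        run: |
          # pipefail so a helper failure isn't masked by tee.
          set -euo pipefail
          # Assumed helper that follows the workload's logs and exits
          # with the workload's exit code.
          ./scripts/stream_workload_logs.sh "$WORKLOAD_NAME" | tee logs.txt
      - name: Check success marker and profile artifact
        run: |
          set -euo pipefail
          grep -q "Finished training" logs.txt    # assumed success string
          find profile -name '*.pb' | grep -q .   # require a profile .pb file
```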
As of #208, we check that the step time of model training E2E tests falls within experimentally derived bounds.

If the step time falls below the lower bound:

- If this is the result of an optimization or updated dependencies, you may re-center the bounds at the latest step time, keeping the confidence intervals unchanged.
- If there were no code changes and the step time is still below the lower bound, consider growing the confidence interval.

If the step time goes above the upper bound:

- If this is the result of a code change, you may have introduced a regression. Investigate the root cause and fix the slowdown before landing the PR.
- If there were no code changes and the step time is still above the upper bound, discuss with the hardware teams, since it may be the result of hardware changes.
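To illustrate the bounds check itself, a step along the following lines would fail the job whenever the measured step time leaves the interval. The metrics file, its field name, and the bound values are made-up examples, not the real ones used in CI.

```yaml
# Hypothetical bounds check; metrics.json, its field, and the bound
# values below are examples only.
- name: Check step time is within bounds
  run: |
    set -euo pipefail
    LOWER=1.10   # assumed lower confidence bound, in seconds
    UPPER=1.30   # assumed upper confidence bound, in seconds
    STEP_TIME=$(jq -r '.average_step_time_seconds' metrics.json)
    # Exit non-zero (failing the job) when outside [LOWER, UPPER].
    awk -v t="$STEP_TIME" -v lo="$LOWER" -v hi="$UPPER" \
      'BEGIN { exit !(t >= lo && t <= hi) }'
```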
E2E tests are launched onto an XPK cluster named `tpu-v6e-ci`.
(Googlers only) To heal or re-create the cluster, use the following:

```sh
xpk cluster create \
    --tpu-type v6e-4 \
    --cluster tpu-v6e-ci \
    --num-slices 64 \
    --on-demand \
    --zone us-central2-b \
    --project tpu-pytorch \
    --default-pool-cpu-machine-type=n2-standard-32
```
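Once the command completes, `xpk cluster list --zone=us-central2-b --project=tpu-pytorch` should show `tpu-v6e-ci` among the clusters (assuming a current `xpk` CLI; check `xpk cluster list --help` for the exact flags).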