Custom training job on Vertex AI failed #2880

Open
jieshi1010 opened this issue Apr 12, 2024 · 0 comments

@jieshi1010
Contributor

Expected Behavior

The custom training job completes successfully.

Actual Behavior

The job fails with the error "Replica exited with a non-zero status code 1".

Steps to Reproduce the Problem

  1. Follow this notebook
  2. Try to run the custom job on Vertex AI Training with either a pre-built or a custom container
  3. The job fails with the error "Replica exited with a non-zero status code 1"

Specifications

When running the notebook https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/community-content/pytorch_text_classification_using_vertex_sdk_and_gcloud/pytorch-text-classification-vertex-ai-train-tune-deploy.ipynb, the custom training job always fails with the "Replica exited with a non-zero status code 1" error. This error code is potentially caused by a problem in the training code.
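
Since the replica only reports exit code 1, one way to narrow this down might be to wrap the training entry point so the full Python traceback is written to the job logs before the process exits. The following is a minimal sketch, not code from the notebook: `run_with_diagnostics` and `train` are hypothetical names, and `train` stands in for whatever the training script actually calls.

```python
import logging
import traceback


def run_with_diagnostics(train_fn):
    """Run train_fn and return a process exit code.

    Returns 0 on success. On any exception, logs the full traceback
    and returns 1, so the cause is visible in the job logs.
    """
    try:
        train_fn()
    except Exception:
        logging.error("Training failed:\n%s", traceback.format_exc())
        return 1
    return 0


def train():
    # Hypothetical placeholder for the real training logic.
    raise RuntimeError("example failure")
```

In the training script's entry point this could be called as `sys.exit(run_with_diagnostics(train))`, which keeps the non-zero exit status while leaving a traceback behind to inspect.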
