Expected Behavior
I ran the notebook, and the only significant change I made was switching the accelerator to NVIDIA_TESLA_T4. When calling job.run(), I expected it to train the model successfully.
Actual Behavior
Once the custom job started, it quickly gave an error and stopped:
The replica workerpool0-0 exited with a non-zero status of 1.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 91, in <module>
    main()
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 87, in main
    experiment.run(args)
  File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 110, in run
    trainer = train(args, text_classifier, train_dataset, test_dataset)
  File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 69, in train
    output_dir=os.path.join("/tmp", args.model_name)
  File "<string>", line 111, in __init__
  File "/root/.local/lib/python3.7/site-packages/transformers/training_args.py", line 1340, in __post_init__
    and (self.device.type != "cuda")
  File "/root/.local/lib/python3.7/site-packages/transformers/training_args.py", line 1764, in device
    return self._setup_devices
  File "/root/.local/lib/python3.7/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/root/.local/lib/python3.7/site-packages/transformers/training_args.py", line 1673, in _setup_devices
    "Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`"
ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`
So I added accelerate to REQUIRED_PACKAGES, and the job no longer outputs any errors. But the training still doesn't seem to work.
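For reference, the change looks roughly like the sketch below. Only the accelerate entry and its minimum version come from the error message above; the other package names are assumptions about what the notebook's training package lists, and the exact layout of its setup.py may differ.

```python
# Sketch of the dependency list in the training package's setup.py.
# Only the "accelerate" entry is taken from the ImportError above;
# the other entries are assumed placeholders for the existing list.
REQUIRED_PACKAGES = [
    "transformers",        # assumed existing entry
    "datasets",            # assumed existing entry
    "accelerate>=0.20.1",  # minimum version named in the ImportError
]

# The list is normally handed to setuptools, along the lines of:
#   setup(name="trainer", install_requires=REQUIRED_PACKAGES, ...)
```

Pinning the lower bound rather than adding a bare "accelerate" makes it less likely that the container resolves to an older release than the one the error asks for.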
When I run the training script on Colab, it finishes in a few minutes (using a smaller model, nlpaueb/legal-bert-small-uncased). But when I submit it to Vertex from this notebook with the same small model, the job just keeps running, reporting "Job is running" once a minute, and even after an hour it is still not done.
I added a TrainerCallback that logs a message after every step, as well as one message right before trainer.train(). When I run the script in Colab, all the logging messages appear. But in the custom job, only the one before trainer.train() appears. The ones from the callback never show up in the Logs Explorer, so it seems that the training loop itself never starts for some reason.
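The callback is along these lines. This is a minimal sketch rather than the exact code I ran: the class name, logger name, and message text are made up for illustration, and the fallback base class exists only so the snippet can be run without transformers installed.

```python
import logging

try:
    from transformers import TrainerCallback
except ImportError:
    # Fallback so this sketch runs standalone; in the real job the
    # transformers base class is used.
    TrainerCallback = object

# Hypothetical logger name, chosen for illustration.
logger = logging.getLogger("trainer.progress")


class StepLoggingCallback(TrainerCallback):
    """Logs one message when training begins and one after every step."""

    def on_train_begin(self, args, state, control, **kwargs):
        logger.info("training loop started")

    def on_step_end(self, args, state, control, **kwargs):
        # state.global_step and state.max_steps are set by the Trainer.
        logger.info("finished step %d of %d", state.global_step, state.max_steps)
```

It is attached with `Trainer(..., callbacks=[StepLoggingCallback()])`. In the custom job, neither of these two messages appears, only the one logged directly before trainer.train().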
pytorch-text-sentiment-classification-custom-train-deploy.ipynb
Steps to Reproduce the Problem
See above.
Specifications