"Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`" when running "pytorch_text_sentiment_classification_custom_train_deploy.ipynb" #2769

ruvilonix · 2024-03-11T02:53:17Z

pytorch-text-sentiment-classification-custom-train-deploy.ipynb

Expected Behavior

I ran the notebook with one significant change: I set the accelerator to NVIDIA_TESLA_T4. I expected job.run() to train the model successfully.

Actual Behavior

Once the custom job started, it quickly gave an error and stopped:

The replica workerpool0-0 exited with a non-zero status of 1. 
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 91, in <module>
    main()
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 87, in main
    experiment.run(args)
  File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 110, in run
    trainer = train(args, text_classifier, train_dataset, test_dataset)
  File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 69, in train
    output_dir=os.path.join("/tmp", args.model_name)
  File "<string>", line 111, in __init__
  File "/root/.local/lib/python3.7/site-packages/transformers/training_args.py", line 1340, in __post_init__
    and (self.device.type != "cuda")
  File "/root/.local/lib/python3.7/site-packages/transformers/training_args.py", line 1764, in device
    return self._setup_devices
  File "/root/.local/lib/python3.7/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/root/.local/lib/python3.7/site-packages/transformers/training_args.py", line 1673, in _setup_devices
    "Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`"
ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`

Steps to Reproduce the Problem

Run the notebook with the accelerator set to NVIDIA_TESLA_T4 and call job.run(), as described above.

Specifications

Python 3.7

PyTorch 1.11

datasets-2.13.2
dill-0.3.6 
filelock-3.12.2 
huggingface-hub-0.16.4 
multiprocess-0.70.14 
regex-2023.12.25 
safetensors-0.4.2 
tokenizers-0.13.3 
trainer-0.1 
transformers-4.30.2 
xxhash-3.4.1


gericdong · 2024-03-11T13:45:10Z

@katiemn could you please check? Thanks.

ruvilonix · 2024-03-16T16:03:49Z

I added accelerate to REQUIRED_PACKAGES, and the job no longer outputs any errors. However, the training still doesn't seem to work.
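For reference, the change described above might look roughly like the following fragment of the training package's `setup.py`. This is only a sketch: the entries other than `accelerate` are assumed placeholders rather than the notebook's actual list, `SETUP_KWARGS` is a hypothetical name, and the version bound is taken from the error message.

```python
# Sketch of the dependency list for the training package's setup.py.
# Only the accelerate entry reflects the change; the other entries are
# assumed placeholders, not copied from the notebook.
REQUIRED_PACKAGES = [
    "transformers",
    "datasets",
    "accelerate>=0.20.1",  # minimum version named in the ImportError
]

# Keyword arguments that setup.py would pass to setuptools.setup(...).
# The name and version match the "trainer-0.1" package listed in the specs.
SETUP_KWARGS = {
    "name": "trainer",
    "version": "0.1",
    "install_requires": REQUIRED_PACKAGES,
}
```

In `setup.py` itself, this would typically be followed by a call such as `setup(packages=find_packages(), **SETUP_KWARGS)` using `setuptools`.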

When I run the training script on Colab with a smaller model (nlpaueb/legal-bert-small-uncased), it finishes in a few minutes. When I submit it to Vertex through this notebook with the same small model, the job just keeps running, printing "Job is running" about once a minute, and it is still not done after an hour.

I added a TrainerCallback that logs a message after every step, plus one message right before trainer.train(). When I run the script in Colab, all of the log messages appear. In the custom job, only the message before trainer.train() appears. The callback messages never show up in the Logs Explorer, so it seems that the training loop never actually starts.
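A minimal sketch of that kind of per-step logging, assuming the Hugging Face `TrainerCallback` interface. The class name `StepLoggingCallback` and the logger name are hypothetical and not taken from the original script. The import fallback is there only so the snippet can be dry-run on a machine without `transformers` installed.

```python
import logging

try:
    from transformers import TrainerCallback
except ImportError:
    # Fallback for dry runs without transformers; a real job needs the import.
    TrainerCallback = object

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("trainer.step_logger")  # hypothetical logger name


class StepLoggingCallback(TrainerCallback):
    """Log one line after every training step."""

    def on_step_end(self, args, state, control, **kwargs):
        # state.global_step and state.max_steps are set by the Trainer.
        logger.info("finished step %d of %d", state.global_step, state.max_steps)
```

The callback would be passed as `Trainer(..., callbacks=[StepLoggingCallback()])`, with a separate `logger.info(...)` call placed right before `trainer.train()` to mark the point where training is about to begin.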

ruvilonix · 2024-03-16T21:56:12Z

I haven't tried it yet, but it looks like the notebook pytorch-text-classification-vertex-ai-train-tune-deploy.ipynb would have the same issue, as its code is similar and it does not list accelerate as a dependency either.

ruvilonix · 2024-03-17T19:33:25Z

Regarding the issue with trainer.train() not working, this bug report looks related: https://issuetracker.google.com/issues/243267023

gericdong assigned katiemn Mar 11, 2024
