"Using the Trainer with PyTorch requires accelerate>=0.20.1" when running "pytorch_text_sentiment_classification_custom_train_deploy.ipynb" #2769

Open
ruvilonix opened this issue Mar 11, 2024 · 4 comments

@ruvilonix

ruvilonix commented Mar 11, 2024

pytorch-text-sentiment-classification-custom-train-deploy.ipynb

Expected Behavior

I ran the notebook. The only significant change I made was setting the accelerator to NVIDIA_TESLA_T4. I expected job.run() to train the model successfully.

Actual Behavior

Once the custom job started, it quickly gave an error and stopped:

The replica workerpool0-0 exited with a non-zero status of 1. 
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 91, in <module>
    main()
  File "/root/.local/lib/python3.7/site-packages/trainer/task.py", line 87, in main
    experiment.run(args)
  File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 110, in run
    trainer = train(args, text_classifier, train_dataset, test_dataset)
  File "/root/.local/lib/python3.7/site-packages/trainer/experiment.py", line 69, in train
    output_dir=os.path.join("/tmp", args.model_name)
  File "<string>", line 111, in __init__
  File "/root/.local/lib/python3.7/site-packages/transformers/training_args.py", line 1340, in __post_init__
    and (self.device.type != "cuda")
  File "/root/.local/lib/python3.7/site-packages/transformers/training_args.py", line 1764, in device
    return self._setup_devices
  File "/root/.local/lib/python3.7/site-packages/transformers/utils/generic.py", line 54, in __get__
    cached = self.fget(obj)
  File "/root/.local/lib/python3.7/site-packages/transformers/training_args.py", line 1673, in _setup_devices
    "Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`"
ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`
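
The ImportError above means the training environment either has no accelerate package or has one older than 0.20.1. A minimal sketch of a check that could be run inside the training environment before constructing TrainingArguments follows. The helper names (parse_version, meets_minimum, check_accelerate) are hypothetical, the version parsing is deliberately simplified (it ignores pre-release suffixes), and the importlib_metadata fallback assumes that backport is installed on Python 3.7.

```python
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:
    # Assumption: the importlib_metadata backport is available on Python 3.7.
    from importlib_metadata import PackageNotFoundError, version

MIN_ACCELERATE = "0.20.1"  # minimum named in the ImportError message


def parse_version(text):
    """Turn '0.20.1' into (0, 20, 1); stops at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_ACCELERATE):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_accelerate():
    """Describe whether accelerate is present and new enough."""
    try:
        installed = version("accelerate")
    except PackageNotFoundError:
        return "accelerate is not installed"
    if meets_minimum(installed):
        return "accelerate {} satisfies >={}".format(installed, MIN_ACCELERATE)
    return "accelerate {} is older than {}".format(installed, MIN_ACCELERATE)


if __name__ == "__main__":
    print(check_accelerate())
```

Printing the result at the top of the training script would show in the job logs whether the dependency is missing or merely outdated.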

Steps to Reproduce the Problem

See above.

Specifications

Python 3.7

PyTorch 1.11

datasets-2.13.2
dill-0.3.6
filelock-3.12.2
huggingface-hub-0.16.4
multiprocess-0.70.14
regex-2023.12.25
safetensors-0.4.2
tokenizers-0.13.3
trainer-0.1
transformers-4.30.2
xxhash-3.4.1

@gericdong
Contributor

@katiemn could you please check? Thanks.

@ruvilonix
Author

ruvilonix commented Mar 16, 2024

I added accelerate to REQUIRED_PACKAGES, and the job no longer reports the ImportError. However, the training still doesn't seem to work.
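
For reference, the dependency change described above might look roughly like the sketch below. This is a hypothetical excerpt, not the notebook's actual setup.py: the package name and version are taken from the "trainer-0.1" entry in the specifications, the other list entries are assumptions, and SETUP_KWARGS is an illustrative name. The error message also suggests `pip install transformers[torch]` as an alternative way to pull in accelerate.

```python
# Hypothetical excerpt of the training package's setup.py.
REQUIRED_PACKAGES = [
    "transformers",        # assumption: already listed by the notebook
    "datasets",            # assumption: already listed by the notebook
    "accelerate>=0.20.1",  # added; minimum taken from the ImportError message
]

SETUP_KWARGS = {
    "name": "trainer",
    "version": "0.1",
    "install_requires": REQUIRED_PACKAGES,
}

# In setup.py these values would be handed to setuptools, for example:
#   setuptools.setup(packages=setuptools.find_packages(), **SETUP_KWARGS)
```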

When I run the training script on Colab with a smaller model (nlpaueb/legal-bert-small-uncased), it finishes in a few minutes. But when I submit it to Vertex through this notebook with the same small model, the job just keeps running: it reports "Job is running" once a minute, and even after an hour it still has not finished.

I added a TrainerCallback that logs a message after every step, plus one log message right before trainer.train(). When I run the script in Colab, all of the log messages appear. In the custom job, only the message before trainer.train() appears. The callback messages never show up in the Logs Explorer, so it seems the training loop never actually starts for some reason.
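
A minimal sketch of the kind of callback described above is shown here. It is not the exact code used in this report: the class name StepLoggingCallback and the logger name are hypothetical, and the fallback to `object` exists only so the snippet can be run where transformers is not installed.

```python
import logging

try:
    from transformers import TrainerCallback
except ImportError:
    # Fallback so the sketch runs without transformers installed.
    TrainerCallback = object

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("trainer.progress")  # hypothetical logger name


class StepLoggingCallback(TrainerCallback):
    """Log once when training begins and once after every optimizer step."""

    def on_train_begin(self, args, state, control, **kwargs):
        logger.info("training loop started")

    def on_step_end(self, args, state, control, **kwargs):
        # state.global_step is the number of update steps completed so far.
        logger.info("finished step %d", state.global_step)


# Usage inside the training script (assumes an existing Trainer setup):
#   trainer = Trainer(..., callbacks=[StepLoggingCallback()])
#   logger.info("about to call trainer.train()")
#   trainer.train()
```

If the "about to call" message appears but neither callback message does, the job is stalling between entering trainer.train() and the start of the training loop.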

@ruvilonix
Author

ruvilonix commented Mar 16, 2024

I haven't tried it yet, but the notebook pytorch-text-classification-vertex-ai-train-tune-deploy.ipynb looks like it would have the same issue: its code is similar, and it does not include the accelerate dependency either.

@ruvilonix
Author

Regarding the issue with trainer.train() not working, this bug report looks related: https://issuetracker.google.com/issues/243267023
