
Error while trying to fine-tune using FSDP on a TPU #2775

Closed

tengomucho opened this issue May 14, 2024 · 2 comments
tengomucho commented May 14, 2024

System Info

accelerate version: 0.30.1 and current master (27a607e)
torch/torch_xla: 2.3.0

- `Accelerate` version: 0.30.1.dev0
- Platform: Linux-5.19.0-1030-gcp-x86_64-with-glibc2.35
- `accelerate` bash location: /home/amoran/Dev/venv/pt23/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0+cu121 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 377.82 GB
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I followed the example on how to use Accelerate with FSDP via SPMD on a TPU. When I did so, I ended up with an error raised from the dataloader:

  0%|                                        | 0/100 [00:00<?, ?it/s]
There seems to be not a single sample in your epoch_iterator, stopping training at step 0! This is expected if you're using an IterableDataset and set num_steps (100) higher than the number of available samples.
{'train_runtime': 0.0043, 'train_samples_per_second': 0.0, 'train_steps_per_second': 23093.844, 'train_loss': 0.0, 'epoch': 0}

Exception in thread Thread-3 (_loader_worker):
Traceback (most recent call last):
  File "/home/amoran/Dev/venv/pt23/lib/python3.10/site-packages/accelerate/data_loader.py", line 464, in __iter__
    next_batch = next(dataloader_iter)
  File "/home/amoran/Dev/venv/pt23/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/amoran/Dev/venv/pt23/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "/home/amoran/Dev/venv/pt23/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 621, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/amoran/Dev/venv/pt23/lib/python3.10/site-packages/torch_xla/distributed/parallel_loader.py", line 152, in _loader_worker
    _, data = next(data_iter)
  File "/home/amoran/Dev/venv/pt23/lib/python3.10/site-packages/accelerate/data_loader.py", line 472, in __iter__
    yield current_batch
UnboundLocalError: local variable 'current_batch' referenced before assignment
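
The UnboundLocalError at the end is the typical symptom of a generator that only binds its batch variable once the underlying iterator actually yields something. This is not accelerate's actual code, just a minimal standalone sketch of the pattern that produces the same error when the dataloader comes up empty:

def batches(dataloader):
    dataloader_iter = iter(dataloader)
    try:
        # On an empty dataloader this first fetch raises StopIteration,
        # so current_batch is never assigned.
        current_batch = next(dataloader_iter)
        while True:
            next_batch = next(dataloader_iter)
            yield current_batch
            current_batch = next_batch
    except StopIteration:
        # Reaching here without a successful first fetch references a name
        # that was never bound.
        yield current_batch

list(batches([]))  # UnboundLocalError: local variable 'current_batch' referenced before assignment

In other words, the traceback is a side effect of the dataloader producing no samples at all, which matches the "not a single sample in your epoch_iterator" warning above.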

For clarity, I used this script to reproduce the issue.
Any hint on how to fix this?
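
For reference, the kind of setup that example walks through looks roughly like the sketch below; the model, dataset, and hyperparameters here are illustrative stand-ins, not the exact contents of the reproduction script:

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Illustrative dataset, tokenized for causal LM training.
data = load_dataset("Abirate/english_quotes", split="train")
data = data.map(lambda s: tokenizer(s["quote"], truncation=True, max_length=128), batched=True)

# FSDP via SPMD (FSDP v2) on XLA, with the layer-wrapping key spelled as in the example.
fsdp_config = {
    "fsdp_transformer_layer_cls_to_wrap": ["GemmaDecoderLayer"],
    "xla": True,
    "xla_fsdp_v2": True,
    "xla_fsdp_grad_ckpt": True,
}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=8,
        max_steps=100,
        optim="adafactor",
        dataloader_drop_last=True,  # keep batch shapes uniform across steps for SPMD
        fsdp="full_shard",
        fsdp_config=fsdp_config,
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    train_dataset=data,
)
trainer.train()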

Expected behavior

I would expect the trainer to start fine-tuning the model.

@alanwaketan

Just to cc myself.

@tengomucho (Author)

It seems there is an error in the config in the example:

fsdp_config = {
-    "fsdp_transformer_layer_cls_to_wrap": ["GemmaDecoderLayer"],
+    "transformer_layer_cls_to_wrap": ["GemmaDecoderLayer"],
    "xla": True,
    "xla_fsdp_v2": True,
    "xla_fsdp_grad_ckpt": True
}

With this change, and after splitting the dataset to get the train split, the problem disappears. Closing this.
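
For anyone hitting the same thing, the fix boils down to two changes: the corrected fsdp_config key above, and making sure the trainer receives the actual train split rather than the unsplit dataset. A minimal sketch of the corrected pieces (the dataset name is illustrative):

from datasets import load_dataset

# Correct key name: "transformer_layer_cls_to_wrap", without the "fsdp_" prefix.
fsdp_config = {
    "transformer_layer_cls_to_wrap": ["GemmaDecoderLayer"],
    "xla": True,
    "xla_fsdp_v2": True,
    "xla_fsdp_grad_ckpt": True,
}

# Select the train split explicitly; passing the unsplit dataset appears to be
# what left the dataloader empty and produced the StopIteration above.
data = load_dataset("Abirate/english_quotes")
train_dataset = data["train"]

These then go into TrainingArguments(fsdp="full_shard", fsdp_config=fsdp_config) and Trainer(train_dataset=train_dataset, ...) as in the sketch under Reproduction.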
