
Error while trying to fine-tune using FSDP on a TPU #2775

Closed

tengomucho opened this issue May 14, 2024 · 2 comments
tengomucho commented May 14, 2024

System Info

accelerate version: 0.30.1 and current master (27a607e)
torch/torch_xla: 2.3.0

- `Accelerate` version: 0.30.1.dev0
- Platform: Linux-5.19.0-1030-gcp-x86_64-with-glibc2.35
- `accelerate` bash location: /home/amoran/Dev/venv/pt23/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0+cu121 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 377.82 GB
- `Accelerate` default config:
        Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I followed the example on how to use Accelerate with FSDP via SPMD on a TPU. When I did so, I ended up with an error raised from the dataloader:

  0%|                                        | 0/100 [00:00<?, ?it/s]
There seems to be not a single sample in your epoch_iterator, stopping training at step 0! This is expected if you're using an IterableDataset and set num_steps (100) higher than the number of available samples.
{'train_runtime': 0.0043, 'train_samples_per_second': 0.0, 'train_steps_per_second': 23093.844, 'train_loss': 0.0, 'epoch': 0}

Exception in thread Thread-3 (_loader_worker):
Traceback (most recent call last):
  File "/home/amoran/Dev/venv/pt23/lib/python3.10/site-packages/accelerate/data_loader.py", line 464, in __iter__
    next_batch = next(dataloader_iter)
  File "/home/amoran/Dev/venv/pt23/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/home/amoran/Dev/venv/pt23/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "/home/amoran/Dev/venv/pt23/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 621, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/amoran/Dev/venv/pt23/lib/python3.10/site-packages/torch_xla/distributed/parallel_loader.py", line 152, in _loader_worker
    _, data = next(data_iter)
  File "/home/amoran/Dev/venv/pt23/lib/python3.10/site-packages/accelerate/data_loader.py", line 472, in __iter__
    yield current_batch
UnboundLocalError: local variable 'current_batch' referenced before assignment
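
The UnboundLocalError at the end is the typical symptom of a generator that only binds its batch variable once the underlying iterator actually yields something. This is not accelerate's actual code, just a minimal standalone sketch of the pattern that produces the same error when the dataloader comes up empty:

def batches(dataloader):
    dataloader_iter = iter(dataloader)
    try:
        # On an empty dataloader this first fetch raises StopIteration,
        # so current_batch is never assigned.
        current_batch = next(dataloader_iter)
        while True:
            next_batch = next(dataloader_iter)
            yield current_batch
            current_batch = next_batch
    except StopIteration:
        # Reaching here without a successful first fetch references a name
        # that was never bound.
        yield current_batch

list(batches([]))  # UnboundLocalError: local variable 'current_batch' referenced before assignment

In other words, the traceback is a side effect of the dataloader producing no samples at all, which matches the "not a single sample in your epoch_iterator" warning above.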

For clarity, I used this script to reproduce the issue.
Any hint on how to fix this?
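
For reference, the kind of setup that example walks through looks roughly like the sketch below; the model, dataset, and hyperparameters here are illustrative stand-ins, not the exact contents of the reproduction script:

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Illustrative dataset, tokenized for causal LM training.
data = load_dataset("Abirate/english_quotes", split="train")
data = data.map(lambda s: tokenizer(s["quote"], truncation=True, max_length=128), batched=True)

# FSDP via SPMD (FSDP v2) on XLA, with the layer-wrapping key spelled as in the example.
fsdp_config = {
    "fsdp_transformer_layer_cls_to_wrap": ["GemmaDecoderLayer"],
    "xla": True,
    "xla_fsdp_v2": True,
    "xla_fsdp_grad_ckpt": True,
}

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=8,
        max_steps=100,
        optim="adafactor",
        dataloader_drop_last=True,  # keep batch shapes uniform across steps for SPMD
        fsdp="full_shard",
        fsdp_config=fsdp_config,
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    train_dataset=data,
)
trainer.train()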

Expected behavior

I would expect the trainer to start fine-tuning the model.

@alanwaketan

Just to cc myself.

@tengomucho (Author)

It seems there is an error in the config in the example:

fsdp_config = {
-    "fsdp_transformer_layer_cls_to_wrap": ["GemmaDecoderLayer"],
+    "transformer_layer_cls_to_wrap": ["GemmaDecoderLayer"],
    "xla": True,
    "xla_fsdp_v2": True,
    "xla_fsdp_grad_ckpt": True
}

With this change, and after splitting the dataset to get the train split, the problem disappears. Closing this.
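
For anyone hitting the same thing, the fix boils down to two changes: the corrected fsdp_config key above, and making sure the trainer receives the actual train split rather than the unsplit dataset. A minimal sketch of the corrected pieces (the dataset name is illustrative):

from datasets import load_dataset

# Correct key name: "transformer_layer_cls_to_wrap", without the "fsdp_" prefix.
fsdp_config = {
    "transformer_layer_cls_to_wrap": ["GemmaDecoderLayer"],
    "xla": True,
    "xla_fsdp_v2": True,
    "xla_fsdp_grad_ckpt": True,
}

# Select the train split explicitly; passing the unsplit dataset appears to be
# what left the dataloader empty and produced the StopIteration above.
data = load_dataset("Abirate/english_quotes")
train_dataset = data["train"]

These then go into TrainingArguments(fsdp="full_shard", fsdp_config=fsdp_config) and Trainer(train_dataset=train_dataset, ...) as in the sketch under Reproduction.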
