Hi,

I am using Accelerate FSDP to align a Pythia-2.8B LLM with `DPOTrainer` from the TRL library. The code works well in a plain multi-GPU setting (the config in the custom-configuration section here). For context, I am using 4 A100s on a single node of a compute cluster. But when I try to use FSDP with the same configuration provided in the Hugging Face examples and run

```
accelerate launch --config_file=fsdp.yaml --num_processes=4 dpo.py
```

I run into the following error:

```
RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32/64 notwithstanding
```
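For reference, my `fsdp.yaml` follows the FSDP example from the Accelerate docs and looks roughly like this (a trimmed sketch from memory; the exact keys and values are assumptions, not a verbatim copy of my file):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
mixed_precision: bf16
num_machines: 1
num_processes: 4
```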
I have looked around a bit and came across this question on the PyTorch discussion forum. The suggested workarounds from the thread are:
- use `foreach=False` with the optimizer (you'd be missing out on performance here)
- avoid `torch.set_default_dtype(torch.float64)`
The first workaround involves passing a custom optimizer to the trainer, which is not allowed when using FSDP; the closest I can get is sketched below. What could be causing this error, and how do I resolve it?
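For illustration, the only way I can see to sneak `foreach=False` in without handing the trainer an optimizer instance up front would be to override `create_optimizer` (an untested sketch; `NoForeachDPOTrainer` is a made-up name, and it skips the weight-decay parameter grouping that the stock `Trainer.create_optimizer` performs):

```python
import torch
from trl import DPOTrainer


class NoForeachDPOTrainer(DPOTrainer):
    """Hypothetical subclass forcing the single-tensor optimizer path."""

    def create_optimizer(self):
        if self.optimizer is None:
            params = [p for p in self.model.parameters() if p.requires_grad]
            # foreach=False disables the multi-tensor implementation,
            # which is what raises the device/dtype grouping error above
            self.optimizer = torch.optim.AdamW(
                params,
                lr=self.args.learning_rate,
                weight_decay=self.args.weight_decay,
                foreach=False,
            )
        return self.optimizer
```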
Also, some additional package information: Python 3.10.14, trl 0.8.6, transformers 4.40.1, torch 2.3.0, peft 0.10.0, CUDA 12.1, and accelerate 0.29.3. I have also provided my script here for the sake of completeness. Note that I have observed this error with TRL's SFTTrainer and RewardTrainer as well.
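For context, a stripped-down version of `dpo.py` looks roughly like this (the dataset name and hyperparameters are placeholders, not my actual values):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "EleutherAI/pythia-2.8b"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Pythia has no pad token by default

# placeholder: any preference dataset with prompt/chosen/rejected columns
dataset = load_dataset("my-org/my-preference-dataset", split="train")

training_args = TrainingArguments(
    output_dir="pythia-2.8b-dpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,
    bf16=True,
)

trainer = DPOTrainer(
    model,
    ref_model=None,  # DPOTrainer clones the policy as the reference model
    args=training_args,
    beta=0.1,
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()
```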