Accelerate FSDP RuntimeError: Tensors of the same index must be on the same device and the same dtype #2764

Open
yaswanthchittepu opened this issue May 10, 2024 · 0 comments


yaswanthchittepu commented May 10, 2024

Hi,

I am using Accelerate FSDP to align a Pythia-2.8B LLM with the DPOTrainer from the TRL library. The code works well in a multi-GPU setting (the config in the custom-configuration section here). For context, I am using 4 A100s on a single node of a compute cluster. But when I try to use FSDP with the same configuration provided in the Hugging Face examples and execute `accelerate launch --config_file=fsdp.yaml --num_processes=4 dpo.py`, I run into the following error: `RuntimeError: Tensors of the same index must be on the same device and the same dtype except step tensors that can be CPU and float32/64 notwithstanding`.

I have looked around a bit and came across this question on the PyTorch discussion forum. The suggested workarounds from the thread are:

  • use foreach=False with the optimizer (you’d be missing out on performance here)
  • avoid torch.set_default_dtype(torch.float64)

The first workaround involves passing in an optimizer, which is not allowed when using FSDP. What could be causing this, and how do I resolve it?
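For reference, this is roughly what the first workaround would look like in a setup where a pre-built optimizer can be passed in (just a sketch; AdamW and the hyperparameters are placeholders mirroring my TrainingArguments below, and as noted this path is not available to me under FSDP):

```python
from torch.optim import AdamW

# Sketch of workaround 1: build the optimizer with the multi-tensor (foreach)
# path disabled and hand it to the trainer explicitly via the Trainer-level
# `optimizers=(optimizer, lr_scheduler)` argument. `model` refers to the
# script further down; the learning rate mirrors the TrainingArguments there.
optimizer = AdamW(model.parameters(), lr=1e-4, foreach=False)

# e.g. DPOTrainer(..., args=training_args, optimizers=(optimizer, None))
```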

Also, some additional package information: I am using Python 3.10.14, trl 0.8.6, transformers 4.40.1, torch 2.3.0, peft 0.10.0, CUDA 12.1, and accelerate 0.29.3. I have also provided my script here for the sake of completeness. Additionally, I have observed this error even when using the SFTTrainer and RewardTrainer in TRL.

```python
import datetime

import torch
from accelerate import Accelerator
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# get_tldr / to_hf_format are local helpers that load the TL;DR data
# and convert it into Hugging Face dataset format.
train_data = get_tldr(datapath, 'train')
test_data = get_tldr(datapath, 'test')
train_data, test_data = map(to_hf_format, [train_data, test_data])

######################### ARGS ###########################
model_name = "EleutherAI/pythia-2.8b"
sft_ckpt_name = "EleutherAI/pythia-2.8b"
batch_size = 16
gradient_accumulation_steps = 16
gradient_checkpoint = True
##########################################################

# Explicitly initialize the default process group with a long NCCL timeout
torch.distributed.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=36000))

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    sft_ckpt_name,
    device_map={"": Accelerator().local_process_index},
)

if gradient_checkpoint:
    # Needed when we use gradient checkpointing;
    # otherwise it complains that there are no gradients to compute.
    if hasattr(model, "enable_input_require_grads"):
        model.enable_input_require_grads()
    else:
        def make_inputs_require_grad(module, input, output):
            output.requires_grad_(True)
        model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config, adapter_name='__train__')

training_args = TrainingArguments(
    output_dir=<out_dir>,
    report_to='none',
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=8,
    bf16=True, # Need A100s
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=gradient_checkpoint,
    gradient_checkpointing_kwargs={"use_reentrant":False},
    evaluation_strategy="steps",
    eval_steps=10,
    num_train_epochs=4,
    # logging strategies 
    logging_strategy="steps",
    logging_steps=1,
    save_strategy="steps",
    save_steps=20,
    save_total_limit=10,
    load_best_model_at_end=True,
    max_grad_norm=1.,
    remove_unused_columns=False,
)

# Initialize the trainer, without a ref_model param.
dpo_trainer = DPOTrainer(
    model,
    ref_model=None,
    beta=0.1,
    train_dataset=train_data,
    eval_dataset=test_data,
    tokenizer=tokenizer,
    max_length=512,
    max_prompt_length=256,
    args=training_args,
)

dpo_trainer.train()
```
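
For debugging, here is a small inspection sketch (not part of the script above) that prints the device and dtype of each parameter and its optimizer state; mismatched entries should be the ones the foreach optimizer kernels complain about. It assumes the trainer's optimizer already exists, e.g. `dpo_trainer.optimizer` once training has started.

```python
import torch

def inspect_optimizer_state(optimizer):
    # Print device/dtype for every parameter and its state tensors
    # (exp_avg, exp_avg_sq, step, ...). Rows that disagree are candidates
    # for the "same device and the same dtype" RuntimeError.
    for group_idx, group in enumerate(optimizer.param_groups):
        for param in group["params"]:
            state = optimizer.state.get(param, {})
            state_info = {
                name: (str(t.device), str(t.dtype))
                for name, t in state.items()
                if torch.is_tensor(t)
            }
            print(group_idx, tuple(param.shape), param.device, param.dtype, state_info)

# Usage, once the optimizer has been created:
# inspect_optimizer_state(dpo_trainer.optimizer)
```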