Loading the trained model for inference is very slow #2727

YamingZhang · 2024-04-30T14:56:04Z

deepspeed 0.13.0
accelerate 0.24.1
peft 0.7.1

I use accelerate on a single GPU, with DeepSpeed configured through accelerate, to run LoRA fine-tuning and inference in bf16 precision. However, when I save the model and reload it for inference, inference becomes very slow (from 7 min to 40 min), the results differ from the original results, and they cannot be reproduced. The same random seed is set for training and for inference with the loaded model.

lora:

peft_config = LoraConfig(task_type=task_type, inference_mode=False,
                         r=args.lora_r, lora_alpha=args.lora_alpha,
                         lora_dropout=0.1, target_modules=target_modules)
model = get_peft_model(model, peft_config)

save model:

self.accelerator.wait_for_everyone()
unwrapped_model = self.accelerator.unwrap_model(self.model)
self.accelerator.save(unwrapped_model.state_dict(), self.args.save_path+"/model.pkl")

load model:

unwrapped_model = self.accelerator.unwrap_model(self.model)
unwrapped_model.bfloat16()
state_dict = torch.load(self.args.load_path)
unwrapped_model.load_state_dict(state_dict)

deepspeed config:

{
    "bf16": {
        "enabled": true
    },
    "fp16": {
        "enabled": false,
        "hparams": "fp16",
        "loss_scale": 0,
        "loss_scale_window": 90000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "contiguous_gradients": true,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "none"
        }
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 200000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
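
One thing worth checking against the load snippet above: it never switches the model to evaluation mode, and the LoRA config sets lora_dropout=0.1. If dropout is still active at inference time, outputs can vary from run to run. This is only a possible contributor and is not confirmed anywhere in this thread. The following is a minimal sketch in plain PyTorch, assuming a hypothetical stand-in model (`TinyModel`) and a hypothetical helper (`load_for_inference`); neither name comes from the original code. It shows a state dict being loaded with an explicit `map_location`, followed by `eval()` so that dropout is disabled.

```python
import io

import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for the LoRA-wrapped model; dropout mimics lora_dropout=0.1."""

    def __init__(self):
        super().__init__()
        self.dropout = nn.Dropout(p=0.1)
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        return self.linear(self.dropout(x))


def load_for_inference(model, checkpoint, device="cpu"):
    """Hypothetical helper: load weights onto `device` and disable dropout."""
    # map_location makes the target device explicit instead of relying on
    # whatever device the tensors were saved from.
    state_dict = torch.load(checkpoint, map_location=device)
    model.load_state_dict(state_dict)
    model.to(device)
    model.eval()  # turns off dropout so repeated forward passes agree
    return model


torch.manual_seed(0)
trained = TinyModel()

# Save to an in-memory buffer so the sketch does not touch the filesystem.
buffer = io.BytesIO()
torch.save(trained.state_dict(), buffer)
buffer.seek(0)

restored = load_for_inference(TinyModel(), buffer)

x = torch.randn(4, 8)
with torch.no_grad():
    out_a = restored(x)
    out_b = restored(x)
    reference = trained.eval()(x)
```

With `eval()` in place, `out_a` and `out_b` are identical and match the output of the original weights. Whether this also explains the slowdown from 7 min to 40 min is unknown; device placement after loading is a separate thing to verify.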

github-actions · 2024-05-30T15:06:22Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Jun 8, 2024
