System Info

transformers version: 4.41.0.dev0

Who can help?

@pacman100

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
When fine-tuning a quantized model with the Trainer under ZeRO-3, the quantized model is first loaded entirely onto the GPU, and only afterwards are its parameters partitioned across the data-parallel processes. What if there is not enough memory to load the whole quantized model?

The code that loads the whole quantized model is in deepspeed/runtime/engine.py, around line 262:

self._configure_distributed_model(model)

It is reached from the transformers Trainer, in the inner training loop (_inner_training_loop), around line 1082. A hypothetical reproduction sketch follows.
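A minimal sketch of the setup being described, assuming a bitsandbytes 4-bit model with LoRA adapters (the Trainer expects trainable adapters on a quantized base model) and a ZeRO stage-3 config file. The checkpoint, dataset, and file names are placeholders, not taken from the original report:

```python
# Hypothetical reproduction sketch (not from the original report): fine-tuning a
# 4-bit quantized model with the Trainer under a ZeRO-3 DeepSpeed config.
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "facebook/opt-350m"  # placeholder checkpoint

# 4-bit quantization via bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# With a quantized model, from_pretrained materializes the whole model on the GPU;
# ZeRO-3 partitioning only happens later, inside the DeepSpeed engine built by the Trainer.
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Attach LoRA adapters (QLoRA-style), since the Trainer refuses to fine-tune a
# purely quantized model without trainable adapters.
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM", r=8))

# Tiny dummy dataset, just enough for train() to run.
texts = Dataset.from_dict({"text": ["hello world"] * 8})
train_dataset = texts.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=32),
    batched=True,
    remove_columns=["text"],
)

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    num_train_epochs=1,
    deepspeed="ds_zero3.json",  # ZeRO stage-3 config file (placeholder)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# deepspeed.initialize -> engine._configure_distributed_model runs in here,
# after the full quantized model is already resident on the GPU.
trainer.train()
```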
Expected behavior
How can the parameters be partitioned while the model is being loaded in from_pretrained, instead of inside the Trainer, the same way it already works for a float (non-quantized) model?
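For reference, a minimal sketch of how partition-at-load already works for float models: keeping a live HfDeepSpeedConfig built from a ZeRO stage-3 config before calling from_pretrained makes transformers construct the model under deepspeed.zero.Init, so each rank only materializes its own shard. The checkpoint name and config values are placeholders, and the script assumes it is started with the deepspeed launcher; when using the Trainer, constructing TrainingArguments with the ZeRO-3 config before from_pretrained has the same effect.

```python
# Sketch only: ZeRO-3 partition-at-load for a non-quantized (float) model.
# Checkpoint name and config values are placeholders; launch with the
# `deepspeed` launcher so torch.distributed is initialized.
from transformers import AutoModelForCausalLM
from transformers.integrations import HfDeepSpeedConfig

ds_config = {
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
    "train_batch_size": 1,  # assumes a single process; scale with world size
}

# Must be created (and kept alive) before from_pretrained so that transformers
# detects ZeRO-3 and wraps model creation in deepspeed.zero.Init.
dschf = HfDeepSpeedConfig(ds_config)  # noqa: F841

# Parameters are partitioned across ranks as they are created/loaded, instead of
# the full model being materialized on a single GPU first.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
```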