Performance on single GPU is much better than on Multi-GPUs #2754
Have you also tried scaling the learning rate according to the number of GPUs? (What I mean by this is that in multi-GPU training the scheduler is stepped N times per optimizer step, which could account for some of this.)
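The scheduler-stepping effect mentioned above can be sketched with some back-of-the-envelope arithmetic (the function name and sample counts are mine, purely illustrative): a scheduler prepared by `accelerate` is stepped once in every process for each optimizer step, so over the same data the 4-GPU run advances the schedule 4x as often as the single-GPU run.

```python
# Illustrative sketch (hypothetical numbers): how often an accelerate-prepared
# LR scheduler advances per epoch for the two setups in this issue.
def scheduler_steps_per_epoch(num_samples, per_device_batch_size, num_processes):
    # Each process handles num_samples / num_processes samples per epoch.
    optimizer_steps = num_samples // (per_device_batch_size * num_processes)
    # The prepared scheduler is stepped in every process, so it advances
    # num_processes times per optimizer step.
    return optimizer_steps * num_processes

single_gpu = scheduler_steps_per_epoch(1000, 4, 1)  # batch_size=4, 1 GPU
four_gpu = scheduler_steps_per_epoch(1000, 1, 4)    # batch_size=1, 4 GPUs
print(single_gpu, four_gpu)  # 250 vs 1000: the schedule decays 4x faster
```

With the scheduler enabled, the 4-GPU run would therefore see a faster-decaying learning rate even though both runs have the same effective batch size.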
Hi @muellerzr, thanks for the response! For the experiments above, I had already disabled the learning rate scheduler. In addition, I have tried adjusting the learning rate according to that suggestion. FYI, here is the result after doing so:
Thanks, let me try running this today and see what happens.
System Info
Information
Tasks
A `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
According to the performance comparison guideline, if we use `batch_size_multi_gpu` in the multi-GPU scenario, then we should get similar performance by using `batch_size_single_gpu = batch_size_multi_gpu * num_GPUs` in the single-GPU scenario.

But when I tested the official example code, setting `batch_size=4` for single-GPU training gives much better performance than setting `batch_size=1` for 4-GPU training. I disabled the learning rate scheduler in case the learning rate steps differently in distributed training.

What I changed in the official example script: use `batch_size=1` for training with 4 GPUs and `batch_size=4` for training with a single GPU.

FYI, below are the config and training output.
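For what it's worth, the equivalence the guideline relies on can be checked with a toy example (all names and numbers here are mine, not from the script): for a mean-reduced loss, averaging the gradients of four micro-batches of size 1 (what DDP does across four processes) equals the gradient of one batch of 4, so with the same effective batch size and learning rate the two setups should match.

```python
# Toy check: gradient of mean MSE over a batch equals the average of
# per-sample gradients, which is what DDP computes across processes.
def grad_mse(w, xs, ys):
    # d/dw mean((w*x - y)^2) = mean(2*x*(w*x - y))
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

xs, ys, w = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], 0.5

full = grad_mse(w, xs, ys)                                 # 1 GPU, batch_size=4
shards = [grad_mse(w, [x], [y]) for x, y in zip(xs, ys)]   # 4 GPUs, batch_size=1
ddp = sum(shards) / len(shards)                            # DDP averages gradients

assert abs(full - ddp) < 1e-12  # identical update per optimizer step
```

So, absent scheduler drift or a data-loading difference, the per-step weight update should be the same in both configurations.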
Single GPU:
4 GPUs:
Here is the script I modified:
Expected behavior
When setting the batch size according to `batch_size_single_gpu = batch_size_multi_gpu * num_GPUs`, training with a single GPU should give similar performance to training with multiple GPUs.