
all_gather with gloo backend does not work in inference mode #126032

Open
youkaichao opened this issue May 12, 2024 · 1 comment
Labels
module: c10d (Issues/PRs related to collective communications and process groups)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@youkaichao
Collaborator

youkaichao commented May 12, 2024

🐛 Describe the bug

A minimal reproducible example:

import torch
import torch.distributed as dist
dist.init_process_group(backend='gloo')
# dist.init_process_group(backend='nccl')
# torch.cuda.set_device(dist.get_rank())
with torch.inference_mode():
    data = [torch.ones((3, 3))] * dist.get_world_size()
    obj = data[dist.get_rank()]
    dist.all_gather(data, obj)
    # dist.broadcast(obj, src=0)

The error is:

E RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See pytorch/rfcs#17 for more details.

It looks strange: the nccl backend works in this case, and broadcast works too. Only all_gather fails.
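The error message suggests that all_gather's in-place write targets an inference tensor from outside an InferenceMode context. A possible workaround, sketched below and not verified as the intended fix, is to allocate the output tensors outside of torch.inference_mode() so they are ordinary (non-inference) tensors:

import torch
import torch.distributed as dist

dist.init_process_group(backend='gloo')

# Output buffers created outside inference mode are normal tensors, so the
# in-place writes performed by all_gather are allowed even if they happen
# on a thread where inference mode is not enabled.
data = [torch.empty((3, 3)) for _ in range(dist.get_world_size())]

with torch.inference_mode():
    obj = torch.ones((3, 3))  # local contribution; only read by the collective
    dist.all_gather(data, obj)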

Versions

PyTorch 2.3.0

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

@mikaylagawarecki added the oncall: distributed label on May 14, 2024
@yf225 added the module: c10d label on May 20, 2024
@wconstab
Contributor

wconstab commented Jun 5, 2024

I can reproduce this issue.

It may be that gloo implements all_gather on another CPU thread, and the thread-local inference-mode state is not attached to that thread.

cc @albanD, do you know if it is the case and if we can easily attach the inference mode context onto the other thread?

Note: I also tried adding .wait() on the all_gather op to ensure the operation completes before the inference mode context exits; it did not help.
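As an illustration of that hypothesis (assuming this is indeed the mechanism, which is not confirmed here), inference mode is a thread-local setting, so a worker thread does not see the context manager entered on the main thread:

import threading
import torch

def check():
    # Runs on a separate thread: is_inference_mode_enabled() reports this
    # thread's own state (disabled by default), not the caller's context.
    print("worker thread:", torch.is_inference_mode_enabled())  # False

with torch.inference_mode():
    print("main thread:", torch.is_inference_mode_enabled())  # True
    t = threading.Thread(target=check)
    t.start()
    t.join()

If gloo runs the collective on such a worker thread, the in-place write into an inference tensor would happen with inference mode disabled on that thread, which matches the error above.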

@wconstab added the triaged label on Jun 5, 2024