Torch distribution blocking ports and failing to connect #126008
Labels: module: c10d, module: elastic, oncall: distributed, triaged
🐛 Describe the bug
I am stuck on what seems to be a PyTorch bug. I isolated it down, and the following is example code reproducing the hiccup:
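(The original snippet was not preserved in this copy of the issue. A minimal sketch of what such a gloo repro typically looks like, assuming the standard `torchrun` environment variables; the helper name and the fallback port are illustrative, not from the report.)

```python
# Hypothetical minimal repro sketch; MASTER_ADDR/MASTER_PORT, RANK, and
# WORLD_SIZE are set by torchrun in each worker's environment. On the
# reporter's machine MASTER_ADDR resolved to kubernetes.docker.internal,
# which triggered the connect failure below.
import os


def rendezvous_addr() -> tuple:
    """Return the (host, port) pair torchrun exported for the c10d store,
    with illustrative local defaults for a single-process run."""
    return (
        os.environ.get("MASTER_ADDR", "127.0.0.1"),
        os.environ.get("MASTER_PORT", "29500"),
    )


if __name__ == "__main__":
    import torch.distributed as dist

    host, port = rendezvous_addr()
    # Windows has no NCCL build, so the gloo backend is used instead.
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://{host}:{port}",
        rank=int(os.environ.get("RANK", "0")),
        world_size=int(os.environ.get("WORLD_SIZE", "1")),
    )
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")
    dist.destroy_process_group()
```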
and when I use the command:
torchrun --nproc_per_node 8 example_chat_completion.py --ckpt_dir Meta-Llama-3-70B-Instruct/ --tokenizer_path Meta-Llama-3-70B-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 6
i get the error:
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:18355 (system error: 10049 - The requested address is not valid in its context.).
I have CUDA 12.1 and the latest PyTorch installed. I am on Windows, hence the backend change to gloo. I have tried it on my other machines and hit the same issue. I disconnected the internet, and it still persists. Eventually, I tried it on a friend's machine nearby, and he faced the same issue as well.
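Error 10049 is Winsock's WSAEADDRNOTAVAIL, and the hostname in the log (`kubernetes.docker.internal`) is a Docker Desktop hosts-file entry, which suggests the rendezvous host resolves to an address the client cannot reach. A small stdlib-only diagnostic (not from the issue) to inspect what a given host actually resolves to:

```python
# Diagnostic sketch: print every address a hostname resolves to, to check
# whether the c10d rendezvous host (e.g. kubernetes.docker.internal) maps
# to an address that is reachable from this machine.
import socket


def resolve(host: str) -> list:
    """Return the sorted, de-duplicated addresses for a hostname,
    or an empty list if resolution fails."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    for host in ("localhost", socket.gethostname(), "kubernetes.docker.internal"):
        print(host, "->", resolve(host) or "unresolvable")
```

If the offending name resolves to an unreachable address, a common workaround (a guess here, not a confirmed fix from the thread) is to point the rendezvous at the loopback address, e.g. `set MASTER_ADDR=127.0.0.1` before launching, or `torchrun --master_addr 127.0.0.1 ...`.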
Versions
Output of `python -m torch.utils.collect_env`:
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @dzhulgakov