
Torch distribution blocking ports and failing to connect #126008

Open
nightsSeeker opened this issue May 11, 2024 · 1 comment
Labels
module: c10d (Issues/PRs related to collective communications and process groups)
module: elastic (Related to torch.distributed.elastic)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


nightsSeeker commented May 11, 2024

🐛 Describe the bug

I am stuck on what seems to be a PyTorch bug. I isolated it down, and the following is a minimal example of the problem:

if not torch.distributed.is_initialized():
    # this line of code hangs and causes the error below
    torch.distributed.init_process_group(backend='gloo', init_method='tcp://localhost:18355',
                                         rank=torch.cuda.device_count(), world_size=8)
if not model_parallel_is_initialized():
    if model_parallel_size is None:
        model_parallel_size = int(os.environ.get("WORLD_SIZE", 8))
    initialize_model_parallel(model_parallel_size)

and when I use the command:
torchrun --nproc_per_node 8 example_chat_completion.py --ckpt_dir Meta-Llama-3-70B-Instruct/ --tokenizer_path Meta-Llama-3-70B-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 6

I get the error:
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:18355 (system error: 10049 - The requested address is not valid in its context.).

I have CUDA 12.1 and the latest PyTorch installed. I am on Windows, hence the backend change to gloo. I have tried it on my other machines with the same issue. I disconnected the internet, and it still persists. Eventually, I tried it on a friend's machine nearby, and he also faced the same issue.

Versions

Output of collect_env:

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA RTX A3000 Laptop GPU
Nvidia driver version: 551.78
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=2496
DeviceID=CPU0
Family=198
L2CacheSize=10240
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2496
Name=11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
ProcessorType=3
Revision=

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.0+cu121
[pip3] torchaudio==2.3.0+cu121
[pip3] torchvision==0.18.0+cu121

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @dzhulgakov

@mikaylagawarecki added the oncall: distributed label May 14, 2024
wconstab (Contributor) commented Jun 5, 2024

I don't fully understand the tcp rendezvous method. CCing @kurman to add more details.

But I found two workarounds, and I suspect you were passing the arguments incorrectly.

  1. If you DO want to use torchrun to launch your program, you can simply omit the 'init_method', 'rank', and 'world_size' arguments.

e.g. doesnt_hang.py:

import torch
torch.distributed.init_process_group(backend='gloo')

launched with:

torchrun --nproc-per-node 8 doesnt_hang.py

  2. If you want to use the 'tcp' init method, you need to specify the rank correctly. rank=torch.cuda.device_count() gives '8' on every process, which is wrong; you should pass a unique value per rank (see the sketch below).
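For concreteness, here is a minimal sketch of workaround 2 (not the reporter's exact code), assuming the script is still launched by torchrun, which exports RANK and WORLD_SIZE to every worker; the port is carried over from the report. (When the arguments are omitted entirely, as in workaround 1, init_process_group falls back to the env:// rendezvous and reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE that torchrun sets.)

# A minimal sketch, assuming launch via: torchrun --nproc-per-node 8 this_script.py
import os
import torch

rank = int(os.environ["RANK"])              # unique per process, set by torchrun
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes, set by torchrun

torch.distributed.init_process_group(
    backend="gloo",                       # gloo, since the report is on Windows
    init_method="tcp://localhost:18355",  # port taken from the original report
    rank=rank,                            # per-process rank, not torch.cuda.device_count()
    world_size=world_size,
)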

@wconstab added the triaged, module: c10d, and module: elastic labels Jun 5, 2024