
Torch distribution blocking ports and failing to connect #126008

Open
nightsSeeker opened this issue May 11, 2024 · 1 comment
Labels
module: c10d (Issues/PRs related to collective communications and process groups)
module: elastic (Related to torch.distributed.elastic)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


nightsSeeker commented May 11, 2024

🐛 Describe the bug

I am stuck on what seems to be a PyTorch bug. I isolated it down, and the following is a minimal example of the problem:

if not torch.distributed.is_initialized():
    # this line of code hangs and causes the error below
    torch.distributed.init_process_group(backend='gloo', init_method='tcp://localhost:18355',
                                         rank=torch.cuda.device_count(), world_size=8)
if not model_parallel_is_initialized():
    if model_parallel_size is None:
        model_parallel_size = int(os.environ.get("WORLD_SIZE", 8))
    initialize_model_parallel(model_parallel_size)

and when I use the command:
torchrun --nproc_per_node 8 example_chat_completion.py --ckpt_dir Meta-Llama-3-70B-Instruct/ --tokenizer_path Meta-Llama-3-70B-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 6

I get the error:
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:18355 (system error: 10049 - The requested address is not valid in its context.).

I have CUDA 12.1 and the latest PyTorch installed. I am on Windows, hence the backend change to gloo. I have tried it on my other machines with the same issue. I disconnected the internet, and it still persists. Eventually, I tried it on a friend's machine nearby, and he also faced the same issue.

Versions

Output of collect_env:

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Enterprise
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.4 (tags/v3.11.4:d2340ef, Jun  7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA RTX A3000 Laptop GPU
Nvidia driver version: 551.78
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=2496
DeviceID=CPU0
Family=198
L2CacheSize=10240
L2CacheSpeed=
Manufacturer=GenuineIntel
MaxClockSpeed=2496
Name=11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
ProcessorType=3
Revision=

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.3.0+cu121
[pip3] torchaudio==2.3.0+cu121
[pip3] torchvision==0.18.0+cu121

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @dzhulgakov

@mikaylagawarecki added the oncall: distributed label May 14, 2024
wconstab (Contributor) commented Jun 5, 2024

I don't fully understand the tcp rendezvous method. CCing @kurman to add more details.

But I found two workarounds, and I suspect you were passing the arguments incorrectly.

  1. If you DO want to use torchrun to launch your program, you can simply omit the 'init_method', 'rank', and 'world_size' arguments.

e.g. doesnt_hang.py:

import torch
torch.distributed.init_process_group(backend='gloo')

launched with:

torchrun --nproc-per-node 8 doesnt_hang.py

  2. If you want to use the 'tcp' init method, you need to specify the rank correctly. rank=torch.cuda.device_count() gives '8' on every process, which is wrong; you should pass a unique value per rank (see the sketch below).
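For concreteness, here is a minimal sketch of workaround 2 (not the reporter's exact code), assuming the script is still launched by torchrun, which exports RANK and WORLD_SIZE to every worker; the port is carried over from the report. (When the arguments are omitted entirely, as in workaround 1, init_process_group falls back to the env:// rendezvous and reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE that torchrun sets.)

# A minimal sketch, assuming launch via: torchrun --nproc-per-node 8 this_script.py
import os
import torch

rank = int(os.environ["RANK"])              # unique per process, set by torchrun
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes, set by torchrun

torch.distributed.init_process_group(
    backend="gloo",                       # gloo, since the report is on Windows
    init_method="tcp://localhost:18355",  # port taken from the original report
    rank=rank,                            # per-process rank, not torch.cuda.device_count()
    world_size=world_size,
)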

@wconstab added the triaged, module: c10d, and module: elastic labels Jun 5, 2024