
Intel Gaudi bootc container is missing InfiniBand #483

Open
tiran opened this issue May 7, 2024 · 3 comments

tiran commented May 7, 2024

The container file https://github.com/containers/ai-lab-recipes/blob/main/training/intel-bootc/Containerfile does not contain the necessary bits and pieces to set up InfiniBand for Intel Gaudi devices. Without IB, it is not possible to use Intel Gaudi 2 cards for training. I'm not entirely sure whether this affects only servers with multiple Intel Gaudi 2 cards or also servers with a single card. My test systems have 8 Intel Gaudi 2 cards each.

The server with the bootc image did not have the habanalabs_ib module loaded, and manually loading the module doesn't make a difference: rdma dev does not show any hlib (Habana Labs InfiniBand) devices. Without the devices, PyTorch and Habana's PyTorch plugin habana_frameworks fail to initialize the devices: hcl_ibverbs_t::init failed to find matching Habana IB device (hlib_6)
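
For reference, this is roughly how I checked the state on the bootc host (a sketch using standard kmod/iproute2 tools; nothing here is Habana-specific beyond the module and device names already mentioned above):

#!/bin/sh
# Is the Habana InfiniBand module loaded? Try loading it if not.
lsmod | grep habanalabs_ib || modprobe habanalabs_ib

# List RDMA devices; on a healthy 8-card system this should
# include hlib_0 through hlib_7 (compare the output below).
rdma dev

# Registered IB devices are also visible in sysfs.
ls /sys/class/infiniband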

On the other bare-metal server, which runs regular RHEL 9 with Habana's packages, rdma dev shows a node for each card and rdma link shows over 160 active connections with LINK_UP.

# rdma dev
0: mlx5_0: node_type ca fw 20.32.2004 node_guid b83f:d203:004a:5242 sys_image_guid b83f:d203:004a:5242 
1: mlx5_1: node_type ca fw 20.32.2004 node_guid b83f:d203:004a:522e sys_image_guid b83f:d203:004a:522e 
18: hlib_4: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000 
19: hlib_6: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000 
20: hlib_2: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000 
21: hlib_7: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000 
22: hlib_1: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000 
23: hlib_3: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000 
24: hlib_5: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000 
25: hlib_0: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000

The journal shows that the habanalabs kernel module registers a Habana Labs InfiniBand device for each card:

# journalctl -o cat | grep hlib
habanalabs 0000:db:00.0 hlib_4: IB device registered
habanalabs 0000:bb:00.0 hlib_2: IB device registered
habanalabs 0000:19:00.0 hlib_6: IB device registered
habanalabs 0000:5d:00.0 hlib_7: IB device registered
habanalabs 0000:9b:00.0 hlib_1: IB device registered
habanalabs 0000:cb:00.0 hlib_3: IB device registered
habanalabs 0000:4c:00.0 hlib_5: IB device registered
habanalabs 0000:3b:00.0 hlib_0: IB device registered

According to https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html, we also need the habanalabs-thunk and habanalabs-rdma-core packages in the bootc image.
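
For the Containerfile, something along these lines should pull them in (just a sketch; it assumes the Habana Vault RPM repository is already configured in the image, as the bare-metal installation guide describes):

# Userspace packages needed for the hlib_* InfiniBand devices
RUN dnf install -y \
        habanalabs-thunk \
        habanalabs-rdma-core \
    && dnf clean all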

Reproducer

#!/usr/bin/env python3
import os

# Enable Habana console logging; set before importing
# habana_frameworks so the runtime picks the settings up.
os.environ["ENABLE_CONSOLE"] = "true"
os.environ["LOG_LEVEL_ALL"] = "1"

from habana_frameworks.torch import hpu
hpu.init()

Output:
[19:18:28.773286][HCL       ][info ][tid:55][C:host[f6fa5532e33d] device[0]] hcl::IntermediateBufferContainer::IntermediateBufferContainer Allocated device memory. Address: 0x100160016d800000, Size: 168MB
[19:18:28.774092][HCL_IBV   ][info ][tid:55][C:host[f6fa5532e33d] device[0]] loaded: /opt/habanalabs/rdma-core/src/build/lib/libhlib.so
[19:18:28.774101][HCL_IBV   ][info ][tid:55][C:host[f6fa5532e33d] device[0]] loaded: /opt/habanalabs/rdma-core/src/build/lib/libibverbshl.so.1
[19:18:28.775252][HCL_IBV   ][error][tid:55][C:host[f6fa5532e33d] device[0]] hcl_ibverbs_t::init failed to find matching Habana IB device (hlib_6)
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi2/hcl_device.cpp::68(HclDeviceGaudi2): The condition [ hcclSuccess == ret ] failed. ibverb init returned 6[hcclInternalError] 
[19:18:28.775294][HCL       ][critical][tid:55][C:host[f6fa5532e33d] device[0]] HclDeviceGaudi2: The condition [ hcclSuccess == ret ] failed. ibverb init returned 6[hcclInternalError]
[19:18:28.775409][HCL       ][info ][tid:55][C:host[f6fa5532e33d] device[0]] ----------------- interface counters (for 0 interfaces) -------------
[19:18:28.775435][SYN_DEVICE    ][error][tid:55][C:host[f6fa5532e33d] device[0]] _acquire: Failed to initialize HCCL device for device

rhatdan commented May 21, 2024

Care to open a PR to fix?


tiran commented May 21, 2024

@enriquebelarte solved the issue in ec05b07

Enrique, is there anything left to do here?

enriquebelarte commented

Could not test upstream because the build checks failed due to the kernel version in the runners. We still need to find a solution for that, but apart from it, the fix adds the packages that made the bootc image work with Gaudi2 hardware, so I guess this issue can be closed now.
