vfio_mode="guest-kernel" causes StartContainer failure #9614

Open
l8huang opened this issue May 9, 2024 · 6 comments · May be fixed by #9687
Labels
bug Incorrect behaviour needs-review Needs to be assessed by the team.

Comments

@l8huang

l8huang commented May 9, 2024

In the NVIDIA DPU VFIO passthrough case, the VFIO device should be claimed by the mlx5_core driver in the guest VM as a network interface (eth0), so vfio_mode is set to guest-kernel:

#  - guest-kernel
#    This is a Kata-specific behaviour that's useful in certain cases.
#    The VFIO device is managed by whatever driver in the VM kernel
#    claims it.  This means it will appear as one or more device nodes
#    or network interfaces depending on the nature of the device.
#    Using this mode requires specially built workloads that know how
#    to locate the relevant device interfaces within the VM.
#
vfio_mode="guest-kernel"

With this setting the VFIO device type is vfio-pci-gk, so the code below is not executed to override its driver to vfio-pci and add it to the PCI map:

https://github.com/kata-containers/kata-containers/blob/main/src/agent/src/device.rs#L856-L872

        if vfio_in_guest {
            pci_driver_override(SYSFS_BUS_PCI_PATH, guestdev, "vfio-pci")?;

            // Devices must have an IOMMU group to be usable via VFIO
            let devgroup = pci_iommu_group(SYSFS_BUS_PCI_PATH, guestdev)?
                .ok_or_else(|| anyhow!("{} has no IOMMU group", guestdev))?;

            if let Some(g) = group {
                if g != devgroup {
                    return Err(anyhow!("{} is not in guest IOMMU group {}", guestdev, g));
                }
            }

            group = Some(devgroup);

            pci_fixups.push((host, guestdev));
        }

Later, when update_env_pci() is called, sandbox.pcimap doesn't contain the device mapping for the vfio-pci-gk device, which causes this error:

level=error 
msg="StartContainer for "896c10aad9ead12c8b758563f2bbee36a1691db501a3ca323207b01b1fbeb0d7" failed" 
error="failed to create containerd task: failed to create shim task: Unable to translate host PCI address 0000:84:02.0: unknown"

Should update_env_pci() ignore devices of type vfio-pci-gk?
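
To make the failure concrete, here is a minimal sketch of the lookup that fails, assuming the sandbox keeps a simple host-to-guest PCI address map (the names and types here are illustrative, not the actual agent code):

use std::collections::HashMap;

// Illustrative sketch only: the real agent uses its own pci::Address type and
// stores the map on the sandbox; plain strings are used here for brevity.
fn translate_host_pci(
    pcimap: &HashMap<String, String>,
    host_addr: &str,
) -> Result<String, String> {
    // For vfio-pci devices the map is filled in during device handling
    // (the pci_fixups push above), e.g. "0000:84:02.0" -> "0000:02:00.0".
    // For vfio-pci-gk devices nothing is pushed, so this lookup fails with
    // the "Unable to translate host PCI address" error shown above.
    pcimap.get(host_addr).cloned().ok_or_else(|| {
        format!("Unable to translate host PCI address {}: unknown", host_addr)
    })
}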

@l8huang l8huang added bug Incorrect behaviour needs-review Needs to be assessed by the team. labels May 9, 2024
@Apokleos
Contributor

Apokleos commented May 9, 2024

Hi @l8huang, what's the relationship between this issue and PR #9605?

@l8huang
Author

l8huang commented May 10, 2024

@Apokleos they are separate issues. I found both while trying to pass through an NVIDIA DPU VFIO device to the guest VM; PR #9605 came up first, then this issue.

@zvonkok
Contributor

zvonkok commented May 10, 2024

Who owns the VFIO device, the DPU or the Host?
Can you show us the IOMMU group the device is in, and which other devices may be in there?
I suppose you're passing through a Mellanox NIC; are you passing through the PF, or have you created VFs?

@l8huang
Author

l8huang commented May 10, 2024

The VFIO device is owned by the Host.
We pass through VFs. VFIO passthrough works the same for NVIDIA BlueField and ConnectX, so there is no need to distinguish them here.

On the Host, the VFIO devices are:

# lspci | grep -i eth
...
84:01.7 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function (rev 01)
84:02.0 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function (rev 01)
84:02.1 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function (rev 01)
84:02.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function (rev 01)

The corresponding IOMMU groups for these VFIO devices are:

# find /sys/kernel/iommu_groups/ -type l | grep 84 | sort
...
/sys/kernel/iommu_groups/128/devices/0000:84:01.7
/sys/kernel/iommu_groups/129/devices/0000:84:02.0
/sys/kernel/iommu_groups/130/devices/0000:84:02.1
/sys/kernel/iommu_groups/131/devices/0000:84:02.2

After working around a bunch of issues, the VFIO device can be plugged into the guest VM and claimed by the NIC driver (mlx5_core):

$ dmesg
[    0.750793] pci 0000:02:00.0: [15b3:101e] type 00 class 0x020000
[    0.751438] pci 0000:02:00.0: reg 0x10: [mem 0x00000000-0x001fffff 64bit pref]
[    0.751803] pci 0000:02:00.0: enabling Extended Tags
[    0.752748] pci 0000:02:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:05.0 (capable of 126.024 Gb/s with 16.0 GT/s PCIe x8 link)
[    0.753315] pci 0000:02:00.0: Adding to iommu group 10
[    1.232441] pci 0000:02:00.0: BAR 0: assigned [mem 0xfe600000-0xfe7fffff 64bit pref]
[    1.232584] pcieport 0000:00:05.0: PCI bridge to [bus 02]
[    1.232594] pcieport 0000:00:05.0:   bridge window [io  0x1000-0x1fff]
[    1.233811] pcieport 0000:00:05.0:   bridge window [mem 0xfdc00000-0xfdffffff]
[    1.234660] pcieport 0000:00:05.0:   bridge window [mem 0xfe600000-0xfe7fffff 64bit pref]
[    1.238315] mlx5_core 0000:02:00.0: enabling device (0000 -> 0002)
[    1.239530] mlx5_core 0000:02:00.0: firmware version: 24.35.3502
[    1.405679] mlx5_core 0000:02:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
[    1.418013] mlx5_core 0000:02:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)

An eth0 interface was created; after manually configuring an IP address on it, the network works.

In the guest VM, the VFIO device looks like this:

root@localhost:/# lspci -tv
-[0000:00]-+-00.0  Device 8086:29c0
           +-01.0  Device 1af4:1003
           +-02.0-[01]--
           +-03.0  Device 1af4:1004
           +-04.0  Device 1af4:1005
           +-05.0-[02]----00.0  Device 15b3:101e  # VFIO
           +-06.0-[03]--

root@localhost:/# lspci  -nnk -s 0000:02:00.0
02:00.0 Class [0200]: Device [15b3:101e] (rev 01)
	Subsystem: Device [15b3:0063]
	Kernel driver in use: mlx5_core

For this issue: since the VFIO device in the guest VM is claimed by the expected driver, and going by the existing code, it looks like update_env_pci() should skip vfio-pci-gk devices. Please correct me if I'm misunderstanding.

@lifupan
Member

lifupan commented May 11, 2024

Hi @l8huang

Thanks for this report. I think we can fix this issue by executing "pci_fixups.push((host, guestdev));" even for vfio-pci-gk devices, so that the subsequent update_env_pci() call succeeds.

Could you submit a patch to address this issue? Thanks.
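
A rough sketch of that suggestion, based on the snippet quoted above (not the merged patch): keep the driver override and IOMMU-group checks for vfio-pci devices only, but record the host-to-guest mapping unconditionally so update_env_pci() can also translate addresses for vfio-pci-gk devices.

        if vfio_in_guest {
            pci_driver_override(SYSFS_BUS_PCI_PATH, guestdev, "vfio-pci")?;

            // Devices must have an IOMMU group to be usable via VFIO
            let devgroup = pci_iommu_group(SYSFS_BUS_PCI_PATH, guestdev)?
                .ok_or_else(|| anyhow!("{} has no IOMMU group", guestdev))?;

            if let Some(g) = group {
                if g != devgroup {
                    return Err(anyhow!("{} is not in guest IOMMU group {}", guestdev, g));
                }
            }

            group = Some(devgroup);
        }

        // Collect the host -> guest PCI mapping for both vfio-pci and
        // vfio-pci-gk devices, so update_env_pci() can translate either.
        pci_fixups.push((host, guestdev));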

@l8huang
Author

l8huang commented May 15, 2024

Will do.

l8huang pushed a commit to l8huang/kata-containers that referenced this issue May 24, 2024
…device

The `update_env_pci()` function needs the PCI address mapping to
translate host PCI addresses to guest PCI addresses in the
environment variables below:
- PCIDEVICE_<prefix>_<resource-name>_INFO
- PCIDEVICE_<prefix>_<resource-name>

So collect the PCI address mapping for both vfio-pci-gk and
vfio-pci devices.

Fixes kata-containers#9614

Signed-off-by: Lei Huang <leih@nvidia.com>