How to use Sandbox Hotplug API #9619

Open
l8huang opened this issue May 10, 2024 · 12 comments
Labels
question Requires an answer


l8huang commented May 10, 2024

We have a VFIO network device hot-plugged into the guest VM, and the guest network driver claims it. After manually configuring an IP address on it, the network works.

Now I'm looking for a way to configure the interface through an API. I found that the Sandbox Hotplug API and the Sandbox Connection Plugin Workflow could be used for this, but the FetchSandbox() function is no longer available; it was deleted by PR #518. It looks like the doc is outdated.

Please kindly let me know how to use the Sandbox Hotplug API.

l8huang added the question label May 10, 2024
Member

lifupan commented May 14, 2024

First of all, is your network device for the container or for the sandbox? If it is for the sandbox, how is the device passed to kata? A network device should theoretically be placed in the pod's netns by the CNI, together with its network address. If it is for the container, is it passed through the devices section of the OCI spec? And if it is passed that way, who is responsible for setting its network address?

Author

l8huang commented May 15, 2024

The VFIO network device is for the k8s Pod; the ovn-k8s CNI is supposed to configure the network interfaces through the Sandbox Hotplug API.

Isn't the Sandbox Hotplug API meant for this kind of use case?

Member

lifupan commented May 15, 2024

> The VFIO network device is for the k8s Pod; the ovn-k8s CNI is supposed to configure the network interfaces through the Sandbox Hotplug API.
>
> Isn't the Sandbox Hotplug API meant for this kind of use case?

Hi @l8huang

Yes, the sandbox hotplug API was used for device hotplug. But kata doesn't support network device hotplug yet. For other devices such as block or GPU devices, kata only needs the device type to choose the proper operation in the guest. But for network devices, kata not only needs to obtain the type of device, but also needs to know the network address and other information of the device. Therefore, kata currently only enters the pod netns to collect all network endpoints when starting a pod, and the type of each endpoint determines how the network device is added to kata.

Regarding your question, what I want to confirm is how the VFIO network device is passed to the kata runtime. Only by knowing how your network device information is transmitted can we decide how to hot-plug your network device.

Member

lifupan commented May 15, 2024

@l8huang BTW, do you mean that the network device info is passed to the kata runtime through an ENV variable, as described in #9605?

Author

l8huang commented May 16, 2024

Thanks for your reply.

> how the VFIO network device here is passed to kata runtime?

Long story short, the VFIO network device info is in the OCI spec config.json when Sandbox.CreateContainer() is called, e.g.:

  "linux": {
    "devices": [
      {
        "path": "/dev/vfio/vfio",
        "type": "c",
        "major": 10,
        "minor": 196,
        "fileMode": 438,
        "uid": 0,
        "gid": 0
      },
      {
        "path": "/dev/vfio/130",
        "type": "c",
        "major": 238,
        "minor": 3,
        "fileMode": 384,
        "uid": 0,
        "gid": 0
      }
    ],

Then GetAllVFIODevicesFromIOMMUGroup() is called to get the VFIODev when attaching the device. The env PCIDEVICE_<prefix>_<resource-name>_INFO from #9605 is not relevant here.
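As a rough illustration of the first step, the sketch below picks the VFIO group nodes out of the `linux.devices` excerpt above. It is written in Python for brevity and is not Kata's Go code; `vfio_groups` is a hypothetical helper name. The only outside fact it relies on is that `/dev/vfio/<N>` is the node for IOMMU group N, whose member devices the kernel lists under `/sys/kernel/iommu_groups/<N>/devices`.

```python
import re

# Excerpt of the `linux.devices` list from the config.json shown above.
OCI_DEVICES = [
    {"path": "/dev/vfio/vfio", "type": "c", "major": 10, "minor": 196,
     "fileMode": 438, "uid": 0, "gid": 0},
    {"path": "/dev/vfio/130", "type": "c", "major": 238, "minor": 3,
     "fileMode": 384, "uid": 0, "gid": 0},
]

# /dev/vfio/<N> is the node for IOMMU group N. /dev/vfio/vfio is the shared
# container node and does not identify a group, so it must not match.
_GROUP_NODE = re.compile(r"^/dev/vfio/(\d+)$")


def vfio_groups(devices):
    """Return (group_id, sysfs_dir, octal_mode) for each VFIO group node.

    Hypothetical helper for illustration only; it does not touch the
    filesystem, it only derives the sysfs path a runtime would inspect.
    """
    found = []
    for dev in devices:
        match = _GROUP_NODE.match(dev.get("path", ""))
        if match is None:
            continue
        group = int(match.group(1))
        sysfs_dir = f"/sys/kernel/iommu_groups/{group}/devices"
        # fileMode is stored as a decimal integer in the OCI spec.
        found.append((group, sysfs_dir, oct(dev["fileMode"])))
    return found


print(vfio_groups(OCI_DEVICES))
```

Only the second entry yields a group; a runtime would then enumerate that sysfs directory on the host to find the PCI functions to attach.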

> But for network devices, kata not only needs to obtain the type of device, but also needs to know the network address and other information of the device.

The network address and route config for the VFIO interface are in the Pod's annotations:

    k8s.ovn.org/pod-networks: '{
      "default":{
        "ip_addresses":["10.1.2.62/26"],
        "mac_address":"0a:58:0a:c0:02:3e",
        "gateway_ips":["10.1.2.1"],
        "routes":[
          {"dest":"10.1.0.0/16","nextHop":"10.1.2.1"},
          {"dest":"10.2.0.0/16","nextHop":"10.1.2.1"},
          {"dest":"10.3.0.0/16","nextHop":"10.1.2.1"}],
        "mtu":"1500",
        "ip_address":"10.1.2.62/26",
        "gateway_ip":"10.1.2.1"}}'

    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "default/ovn-primary-vfio",
          "interface": "eth0",
          "ips": [
              "10.1.2.62"
          ],
          "mac": "0a:0a:0a:c0:02:3e",
          "default": true,
          "dns": {},
          "device-info": {
              "type": "pci",
              "version": "1.0.0",
              "pci": {
                  "pci-address": "0000:84:02.1"
              }
          }
      }]

If the Sandbox Hotplug API is no longer exposed to external controllers, then the kata runtime needs to interpret the above annotations and invoke the corresponding APIs to configure the network in the guest VM. In this case, the problem becomes determining which annotations the Kata runtime should look at to gather the network configuration (k8s.ovn.org/pod-networks + k8s.v1.cni.cncf.io/network-status when using ovn-kubernetes).
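To make the annotation-driven option concrete, here is a minimal Python sketch of how a runtime could merge the two annotations above into a single interface description. The function name and the shape of the returned dict are hypothetical, not an existing Kata API, and the sketch assumes that the `"default"` network in `k8s.ovn.org/pod-networks` corresponds to the network-status entry flagged `"default": true`.

```python
import ipaddress
import json

# Annotation values copied from the Pod shown above.
ANNOTATIONS = {
    "k8s.ovn.org/pod-networks": """{
      "default": {
        "ip_addresses": ["10.1.2.62/26"],
        "mac_address": "0a:58:0a:c0:02:3e",
        "gateway_ips": ["10.1.2.1"],
        "routes": [
          {"dest": "10.1.0.0/16", "nextHop": "10.1.2.1"},
          {"dest": "10.2.0.0/16", "nextHop": "10.1.2.1"},
          {"dest": "10.3.0.0/16", "nextHop": "10.1.2.1"}],
        "mtu": "1500",
        "ip_address": "10.1.2.62/26",
        "gateway_ip": "10.1.2.1"}}""",
    "k8s.v1.cni.cncf.io/network-status": """[{
      "name": "default/ovn-primary-vfio",
      "interface": "eth0",
      "ips": ["10.1.2.62"],
      "mac": "0a:0a:0a:c0:02:3e",
      "default": true,
      "dns": {},
      "device-info": {"type": "pci", "version": "1.0.0",
                      "pci": {"pci-address": "0000:84:02.1"}}}]""",
}


def guest_interface_config(annotations, network="default"):
    """Merge both annotations into one dict (hypothetical layout)."""
    ovn = json.loads(annotations["k8s.ovn.org/pod-networks"])[network]
    statuses = json.loads(annotations["k8s.v1.cni.cncf.io/network-status"])
    # Assumption: the entry flagged "default" is the Pod's primary interface.
    primary = next(s for s in statuses if s.get("default"))
    return {
        "name": primary["interface"],
        "pci_address": primary["device-info"]["pci"]["pci-address"],
        # Assumption: the MAC is taken from the OVN annotation.
        "mac": ovn["mac_address"],
        "addresses": [ipaddress.ip_interface(a) for a in ovn["ip_addresses"]],
        "mtu": int(ovn["mtu"]),
        "routes": [
            (ipaddress.ip_network(r["dest"]), ipaddress.ip_address(r["nextHop"]))
            for r in ovn["routes"]
        ],
    }


config = guest_interface_config(ANNOTATIONS)
print(config["name"], config["pci_address"], config["addresses"][0])
```

The PCI address identifies which hot-plugged device to configure in the guest, while the addresses, MTU and routes come from the OVN annotation. A different CNI would need a different parser, which is the open question raised above.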

Another way is to expose the Sandbox Hotplug API so that the CNI can call it to configure the network in the guest VM.

Please kindly let me know how to proceed, thanks.

Member

lifupan commented May 17, 2024

Hi @l8huang

I roughly understand your needs.

The CNIs currently supported by Kata create the network during the sandbox creation stage and place it in the pod netns. Kata then scans the netns for network devices when creating the pause container and cold-plugs the scanned devices into the hypervisor.

What I need to confirm is whether the VFIO network device, its network address, and the other information produced by your CNI are all generated when the business container is created, rather than when the sandbox is created.

Author

l8huang commented May 17, 2024

Indeed, the allocation of the VFIO network device (done by kubelet and the SR-IOV network device plugin) and of the corresponding network address (done by the CNI) occurs before containerd creates the sandbox. But the VFIO device info is only present in the first container's OCI spec when that container is being created.

We have a mutating webhook that checks the VFIO device resource requirement based on a Pod's network config. For example, the annotation below sets the Pod's primary network interface to a VFIO device:

    v1.multus-cni.io/default-network: default/ovn-primary-vfio

The mutating webhook parses the annotation and patches the first container's resources as below:

    resources:
      limits:
        nvidia.com/bf2_vfio: "1" 
      requests:
        nvidia.com/bf2_vfio: "1"

There are no resource settings at the Pod level, so the resources of the first container are patched, even though the VFIO network device is intended for the Pod's network interface.
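The patching step described above could look roughly like the following Python sketch, which builds the JSON Patch operations a mutating webhook would return. The network-to-resource lookup table and the function names are hypothetical; the real webhook's logic is not shown in this thread. The one detail worth noting is that the `/` in the resource name must be escaped as `~1` when it appears in a JSON Pointer path (RFC 6901).

```python
DEFAULT_NETWORK_ANNOTATION = "v1.multus-cni.io/default-network"

# Hypothetical lookup table: network name -> device-plugin resource name.
NETWORK_TO_RESOURCE = {"default/ovn-primary-vfio": "nvidia.com/bf2_vfio"}


def escape_pointer(token):
    """Escape one JSON Pointer reference token (RFC 6901)."""
    return token.replace("~", "~0").replace("/", "~1")


def vfio_resource_patch(pod):
    """Return JSON Patch ops requesting one VFIO device for container 0."""
    annotations = pod.get("metadata", {}).get("annotations", {})
    network = annotations.get(DEFAULT_NETWORK_ANNOTATION)
    resource = NETWORK_TO_RESOURCE.get(network)
    if resource is None:
        return []  # The Pod does not use a VFIO-backed primary network.

    base = "/spec/containers/0/resources"
    resources = pod["spec"]["containers"][0].get("resources")
    if resources is None:
        value = {"limits": {resource: "1"}, "requests": {resource: "1"}}
        return [{"op": "add", "path": base, "value": value}]

    ops = []
    for section in ("limits", "requests"):
        if resources.get(section):
            # The map already has entries: add a single key to it.
            path = f"{base}/{section}/{escape_pointer(resource)}"
            ops.append({"op": "add", "path": path, "value": "1"})
        else:
            # The map is missing or empty: add it as a whole.
            ops.append({"op": "add", "path": f"{base}/{section}",
                        "value": {resource: "1"}})
    return ops


pod = {
    "metadata": {"annotations": {
        DEFAULT_NETWORK_ANNOTATION: "default/ovn-primary-vfio"}},
    "spec": {"containers": [{"name": "app",
                             "resources": {"limits": {"cpu": "1"}}}]},
}
for op in vfio_resource_patch(pod):
    print(op)
```

Because the device plugin only acts on container-level resources, the request ends up on the first container, which is why the VFIO device shows up in that container's OCI spec rather than at sandbox creation.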

Member

lifupan commented May 20, 2024

> Indeed, the allocation of the VFIO network device (done by kubelet and the SR-IOV network device plugin) and of the corresponding network address (done by the CNI) occurs before containerd creates the sandbox. But the VFIO device info is only present in the first container's OCI spec when that container is being created.

Sounds like you need a network model like the one in #7383.
With this directly attachable network (DAN) feature, kata supports attaching network devices directly instead of scanning them from the pod netns. You can therefore create your own CNI that writes your VFIO device info, including the network info, into a DAN configuration file; the kata runtime parses this file and plugs the network device into the hypervisor.

BTW, this feature is only supported in kata runtime-rs. If you think you can use runtime-rs, then we can also support VFIO network devices there to meet your needs.

Author

l8huang commented May 20, 2024

What's the long-term plan for kata runtime-rs? Will it eventually replace the Go version of the Kata runtime? Or will feature parity be added to the Go version?

runtime-rs is still under heavy development, you should avoid using it in critical system.

We can't make this decision based solely on the networking perspective.

@zvonkok which runtime are you using?

Member

lifupan commented May 21, 2024

Hi @l8huang

Yeah, we definitely will replace the Go runtime with the Rust version, and we're working on it. But it's true that the Rust version is still under heavy development, especially regarding support for the QEMU hypervisor.

Contributor

zvonkok commented May 21, 2024

@l8huang go-runtime, since we have pushed all significant fixes for the GPU there, as well as the "full" support for QEMU.

We have a KEP to make VFIO devices available at sandbox creation time, see: kubernetes/enhancements#4113

See also: https://docs.google.com/presentation/d/13TDKyASpMfDrVBSRj4JiU6gFeChx0ws4DTenBN1qUnA/edit#slide=id.p. It was presented in the k8s sig-node meeting; there is a recording.

I added cold-plug to Kata some time ago, but it will not work properly in k8s until we have the KEP and the containerd patches. This is independent of go-runtime and runtime-rs.

Author

l8huang commented May 21, 2024

@lifupan @zvonkok thank you for your quick reply. The information you shared is very helpful for understanding the situation.

> go-runtime since we have pushed all significant fixes for the GPU there and the "full" support for QEMU.

Looks like I need to continue using the Go Kata runtime for now -- @zvonkok and I work at the same company but on different teams.

@lifupan do you have a plan to add the directly attachable network (DAN) feature to the Go kata-runtime? I need to evaluate the effort required to enable VFIO device support with DAN in the Go Kata runtime.

Projects
Issue backlog
  
To do

3 participants