Fail to install driver on openshift 4.12 #720

jf1cloutier · 2023-12-22T13:42:40Z

What happened:
I followed the instructions on this page to install the operator on OpenShift:
https://docs.nvidia.com/networking/display/kubernetes2310/network+operator

(I used the nvidia-network-operator from the certified community operators, together with the NFD operator.)

I then created a NicClusterPolicy with the following spec to have the OFED driver installed on the worker.

spec:
  ofedDriver:
    image: mofed
    livenessProbe:
      initialDelaySeconds: 3000
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 1000
      periodSeconds: 30
    repository: nvcr.io/nvidia/mellanox
    startupProbe:
      initialDelaySeconds: 1000
      periodSeconds: 20
    terminationGracePeriodSeconds: 300
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        podSelector: ''
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      waitForCompletion:
        timeoutSeconds: 0
    version: 23.07-0.5.0.0

What you expected to happen:
I expect a new pod to start on the worker that has a Mellanox card and to compile the driver for the kernel used on CoreOS. The pod does start on the appropriate worker, but the driver fails to compile, with these logs from the pod:

Unsetting driver ready state
No OFED driver found for kernel 4.18.0-372.41.1.rt7.198.el8_6.x86_64
Enabling RHOCP and EUS RPM repos...
ID="rhcos"
VERSION_ID="4.12"
RHEL_VERSION="8.6"
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
cuda                                            7.8 kB/s | 3.5 kB     00:00
cuda                                            2.6 MB/s | 2.9 MB     00:01
Red Hat OpenShift Container Platform 4.12 for R 9.5 MB/s |  31 MB     00:03
Red Hat Enterprise Linux 8 for x86_64 - AppStre  12 MB/s |  47 MB     00:03
Red Hat Enterprise Linux 8 for x86_64 - BaseOS   14 MB/s |  53 MB     00:03
Red Hat Universal Base Image 8 (RPMs) - BaseOS  6.3 kB/s | 3.8 kB     00:00
Red Hat Universal Base Image 8 (RPMs) - BaseOS  466 kB/s | 717 kB     00:01
Red Hat Universal Base Image 8 (RPMs) - AppStre 9.1 kB/s | 4.2 kB     00:00
Red Hat Universal Base Image 8 (RPMs) - AppStre 1.6 MB/s | 3.0 MB     00:01
Red Hat Universal Base Image 8 (RPMs) - CodeRea 8.3 kB/s | 3.8 kB     00:00
Red Hat Universal Base Image 8 (RPMs) - CodeRea  74 kB/s | 102 kB     00:01
Metadata cache created.
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
Updating Subscription Management repositories.
Unable to read consumer identity
Subscription Manager is operating in container mode.
cuda                                             11 kB/s | 3.5 kB     00:00
Red Hat OpenShift Container Platform 4.12 for R 6.4 kB/s | 4.1 kB     00:00
Red Hat Enterprise Linux 8 for x86_64 - BaseOS   19 MB/s |  64 MB     00:03
Red Hat Enterprise Linux 8 for x86_64 - AppStre 7.7 kB/s | 4.5 kB     00:00
Red Hat Enterprise Linux 8 for x86_64 - BaseOS  8.1 kB/s | 4.1 kB     00:00
Red Hat Universal Base Image 8 (RPMs) - BaseOS  7.0 kB/s | 3.8 kB     00:00
Red Hat Universal Base Image 8 (RPMs) - AppStre 6.7 kB/s | 4.2 kB     00:00
Red Hat Universal Base Image 8 (RPMs) - CodeRea 4.9 kB/s | 3.8 kB     00:00
Metadata cache created.
Installing dependencies
Error: Unable to find a match: kernel-4.18.0-372.41.1.rt7.198.el8_6.x86_64

Command "dnf -q -y --releasever=8.6 install kernel-4.18.0-372.41.1.rt7.198.el8_6.x86_64" failed with exit code: 1
Terminate event caught
Terminating container
Unsetting driver ready state
Deleting udev rules
rm: cannot remove '/host/etc/udev/rules.d/82-net-setup-link.rules': No such file or directory
rm: cannot remove '/host/etc/udev/mlnx_bf_udev': No such file or directory
rm: cannot remove '/host/etc/infiniband/vf-net-link-name.sh': No such file or directory
Unmounting Mellanox OFED driver rootfs

How to reproduce it (as minimally and precisely as possible):
Try to install the driver on OpenShift 4.12.2. I also tried updating to OpenShift 4.12.45 and hit the same issue (with a different kernel in use).

Anything else we need to know?:
I ran a manual dnf search for the kernel package the container tries to install, and it finds no match. DNF seems to have kernel patch (kpatch) packages available for a few kernels, but not the base kernel package for this version:
kpatch-patch-4_18_0-372_13_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.13.1.el8_6.x86_64
kpatch-patch-4_18_0-372_16_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.16.1.el8_6.x86_64
kpatch-patch-4_18_0-372_19_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.19.1.el8_6.x86_64
kpatch-patch-4_18_0-372_26_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.26.1.el8_6.x86_64
kpatch-patch-4_18_0-372_32_1.x86_64 : Initial empty kpatch-patch for kernel-4.18.0-372.32.1.el8_6.x86_64
kpatch-patch-4_18_0-372_9_1.x86_64 : Live kernel patching module for kernel-4.18.0-372.9.1.el8.x86_64
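
One detail worth checking here: the kernel release in the pod log contains an `rt7.198` component and `uname -a` reports `PREEMPT_RT`, while every kernel in the search results above is a standard (non-RT) build. The following is a minimal sketch, not part of the operator, that classifies a release string and builds the package spec one might search for. It assumes, without confirmation from this report, that real-time kernels are packaged under the name `kernel-rt` rather than `kernel`; the function names are hypothetical.

```python
import re

# Matches the real-time marker inside a kernel release, e.g. ".rt7.198."
RT_MARKER = re.compile(r"\.rt\d+\.\d+\.")


def is_realtime_kernel(release: str) -> bool:
    """Return True if the release string carries an rt<N>.<M> component."""
    return RT_MARKER.search(release) is not None


def candidate_package(release: str) -> str:
    """Build a package spec to search for with dnf.

    Assumption (not confirmed by this report): real-time kernels are
    packaged as kernel-rt-<release>, standard kernels as kernel-<release>.
    """
    prefix = "kernel-rt" if is_realtime_kernel(release) else "kernel"
    return f"{prefix}-{release}"


if __name__ == "__main__":
    # Release string taken from the pod log above.
    print(candidate_package("4.18.0-372.41.1.rt7.198.el8_6.x86_64"))
```

If the assumption holds, searching for the `kernel-rt` spec instead of the `kernel` spec would show whether the enabled repositories carry the real-time kernel at all.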

I tried different entitlements and always got the same result. The last attempt used a "Red Hat Developer Subscription for Individuals" entitlement on the cluster.

Logs:

NicClusterPolicy CR spec and state:

  k get NicClusterPolicy nic-cluster-policy -o yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  creationTimestamp: "2023-12-21T19:33:13Z"
  generation: 1
  name: nic-cluster-policy
  resourceVersion: "280143"
  uid: 7d28443a-0a48-486f-92ef-7509c3678842
spec:
  ofedDriver:
    image: mofed
    livenessProbe:
      initialDelaySeconds: 3000
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 1000
      periodSeconds: 30
    repository: nvcr.io/nvidia/mellanox
    startupProbe:
      initialDelaySeconds: 1000
      periodSeconds: 20
    terminationGracePeriodSeconds: 300
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        podSelector: ""
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      waitForCompletion:
        timeoutSeconds: 0
    version: 23.07-0.5.0.0
status:
  appliedStates:
  - name: state-pod-security-policy
    state: ignore
  - name: state-multus-cni
    state: ignore
  - name: state-container-networking-plugins
    state: ignore
  - name: state-ipoib-cni
    state: ignore
  - name: state-whereabouts-cni
    state: ignore
  - name: state-OFED
    state: notReady
  - name: state-SRIOV-device-plugin
    state: ignore
  - name: state-RDMA-device-plugin
    state: ignore
  - name: state-ib-kubernetes
    state: ignore
  - name: state-nv-ipam-cni
    state: ignore
  state: notReady
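
To make the status above easier to read, here is a small sketch that picks out the applied states that are actually blocking. It works on a plain dictionary shaped like the `status` section (abbreviated to three entries here) and treats `ignore` as "not configured". The helper name is hypothetical, and the `ready` value is an assumption, since only `ignore` and `notReady` appear in this output.

```python
from typing import Dict, List


def blocking_states(status: Dict) -> List[str]:
    """Return names of applied states that are neither 'ready' nor 'ignore'.

    Assumption: 'ready' is the healthy value; this report only shows
    'ignore' and 'notReady'.
    """
    return [
        entry["name"]
        for entry in status.get("appliedStates", [])
        if entry.get("state") not in ("ready", "ignore")
    ]


# Abbreviated copy of the status section shown above.
status = {
    "appliedStates": [
        {"name": "state-multus-cni", "state": "ignore"},
        {"name": "state-OFED", "state": "notReady"},
        {"name": "state-RDMA-device-plugin", "state": "ignore"},
    ],
    "state": "notReady",
}

if __name__ == "__main__":
    print(blocking_states(status))
```

For this policy the only entry that is not ignored is `state-OFED`, which matches the failing driver pod.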

Output of: kubectl -n nvidia-network-operator get -A:

kubectl -n nvidia-network-operator get -A
You must specify the type of resource to get. Use "kubectl api-resources" for a complete list of supported resources.

error: Required resource not specified.
Use "kubectl explain <resource>" for a detailed description of that resource (e.g. kubectl explain pods).
See 'kubectl get -h' for help and examples

Network Operator version: 23.7.0
Logs of Network Operator controller:
Logs of the various Pods in nvidia-network-operator namespace: Provided above
Helm Configuration (if applicable): N/A
Kubernetes' nodes information (labels, annotations and status): kubectl get node -o yaml: I do not think it is pertinent to this issue

Environment:

Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4+a34b9e9", GitCommit:"b6d1f054747e9886f61dd85316deac3415e2726f", GitTreeState:"clean", BuildDate:"2023-01-10T15:55:28Z", GoVersion:"go1.19.4", Compiler:"gc", Platform:"linux/amd64"}
Hardware configuration:
4b:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
4b:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
firmware-version: 22.32.2004 (MT_0000000437)
OS (e.g: cat /etc/os-release):
cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="412.86.202301311551-0"
VERSION_ID="4.12"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 412.86.202301311551-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.12/"
BUG_REPORT_URL="https://access.redhat.com/labs/rhir/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.12"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.12"
OPENSHIFT_VERSION="4.12"
RHEL_VERSION="8.6"
OSTREE_VERSION="412.86.202301311551-0"
Kernel (e.g. uname -a): Linux ocp1-worker3 4.18.0-372.41.1.rt7.198.el8_6.x86_64 #1 SMP PREEMPT_RT Fri Jan 6 15:08:19 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

jf1cloutier added the bug Something isn't working label Dec 22, 2023
