MOFED pods: driver install fails on RHEL8.8 #888

Open
gseidlerhpe opened this issue Apr 10, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@gseidlerhpe

What happened:
Deploy the network operator on RHEL 8.8 hosts with the option ofedDriver.deploy: true.

The OFED driver pods fail due to the error:

Error: Unable to find a match: kernel-4.18.0-477.10.1.el8_8.x86_64

Command "dnf -q -y --releasever=8.8 install kernel-4.18.0-477.10.1.el8_8.x86_64" failed with exit code: 1

What you expected to happen:
OFED driver install succeeds on RHEL 8.8.
Release notes for network operator v23.10.0 state that RHEL 8.8 is supported: https://docs.nvidia.com/networking/display/kubernetes2310/release+notes

How to reproduce it (as minimally and precisely as possible):
Deploy the network operator on RHEL 8.8 hosts with a valid RHEL subscription.

Anything else we need to know?:
Tried the option to specify a private repo via ofedDriver.repoConfig.name.
The network-operator pod log shows this error:

2024-04-10T16:39:52Z ERROR Error while syncing state {"controller": "nicclusterpolicy", "controllerGroup": "mellanox.com", "controllerKind": "NicClusterPolicy", "NicClusterPolicy": {"name":"nic-cluster-policy"}, "namespace": "", "name": "nic-cluster-policy", "reconcileID": "d09bbc74-ce62-4fe4-9ccc-99838b245ed3", "error": "failed to create k8s objects from manifest: failed to get destination directory for custom repo config: distribution not supported", "errorVerbose": "failed to get destination directory for custom repo config: distribution not supported\nfailed to create k8s objects from manifest\ngithub.com/Mellanox/network-operator/pkg/state.(*stateOFED).Sync\n\t/workspace/pkg/state/state_ofed.go:270\ngithub.com/Mellanox/network-operator/pkg/state.(*stateManager).SyncState\n\t/workspace/pkg/state/manager.go:92\ngithub.com/Mellanox/network-operator/controllers.(*NicClusterPolicyReconciler).Reconcile\n\t/workspace/controllers/nicclusterpolicy_controller.go:144\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"}
github.com/Mellanox/network-operator/pkg/state.(*stateManager).SyncState
/workspace/pkg/state/manager.go:101
github.com/Mellanox/network-operator/controllers.(*NicClusterPolicyReconciler).Reconcile
/workspace/controllers/nicclusterpolicy_controller.go:144
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
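
For reference, the ofedDriver.repoConfig.name option expects a ConfigMap in the operator namespace that carries a yum .repo file to be mounted into the driver container. A minimal sketch, where the repository id, baseurl and file name are placeholders rather than values from this report:

# custom.repo - placeholder repository definition pointing at an internal mirror
[rhel-8-baseos-mirror]
name=RHEL 8 BaseOS (internal mirror)
baseurl=https://mirror.example.com/rhel8/BaseOS/x86_64/os/
enabled=1
gpgcheck=0

# Create the ConfigMap referenced by ofedDriver.repoConfig.name
kubectl -n nvidia-network-operator create configmap repo-config --from-file=custom.repo

The "distribution not supported" message above appears to come from the operator itself, i.e. this code path did not recognize RHEL in this operator version, independent of the ConfigMap contents.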

Got the OFED driver to install successfully by patching the mofed-rhel8.8-ds daemonset and adding these volumeMounts/volumes entries:

volumeMounts:
- mountPath: /run/secrets/etc-pki-entitlement
  name: subscription-config-0
  readOnly: true
- mountPath: /run/secrets/redhat.repo
  name: subscription-config-1
  readOnly: true
- mountPath: /run/secrets/rhsm
  name: subscription-config-2
  readOnly: true

volumes:
- hostPath:
    path: /etc/pki/entitlement
    type: Directory
  name: subscription-config-0
- hostPath:
    path: /etc/yum.repos.d/redhat.repo
    type: File
  name: subscription-config-1
- hostPath:
    path: /etc/rhsm
    type: Directory
  name: subscription-config-2
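
For anyone hitting the same issue, the same change can be applied as a strategic merge patch instead of editing the DaemonSet by hand. A sketch; the container name mofed-container is an assumption, verify it with kubectl -n nvidia-network-operator get ds mofed-rhel8.8-ds -o yaml before patching:

# entitlement-patch.yaml - strategic merge patch adding the RHEL subscription
# mounts to the MOFED driver DaemonSet (container name is assumed, verify first)
spec:
  template:
    spec:
      containers:
      - name: mofed-container
        volumeMounts:
        - mountPath: /run/secrets/etc-pki-entitlement
          name: subscription-config-0
          readOnly: true
        - mountPath: /run/secrets/redhat.repo
          name: subscription-config-1
          readOnly: true
        - mountPath: /run/secrets/rhsm
          name: subscription-config-2
          readOnly: true
      volumes:
      - hostPath:
          path: /etc/pki/entitlement
          type: Directory
        name: subscription-config-0
      - hostPath:
          path: /etc/yum.repos.d/redhat.repo
          type: File
        name: subscription-config-1
      - hostPath:
          path: /etc/rhsm
          type: Directory
        name: subscription-config-2

# Apply the patch to the generated DaemonSet
kubectl -n nvidia-network-operator patch daemonset mofed-rhel8.8-ds \
  --type strategic --patch-file entitlement-patch.yaml

Note that the operator may reconcile the DaemonSet back to its generated spec, so this is a temporary workaround rather than a persistent fix.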

Logs:

  • NicClusterPolicy CR spec and state:
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  annotations:
    meta.helm.sh/release-name: network-operator
    meta.helm.sh/release-namespace: nvidia-network-operator
  creationTimestamp: "2024-04-10T16:33:52Z"
  generation: 2
  labels:
    app.kubernetes.io/managed-by: Helm
  name: nic-cluster-policy
  resourceVersion: "6751693"
  uid: 2896d0ef-de46-44bd-8819-2a6040dc6ff6
spec:
  nicFeatureDiscovery:
    image: nic-feature-discovery
    imagePullSecrets: []
    repository: ghcr.io/mellanox
    version: v0.0.1
  nvIpam:
    enableWebhook: false
    image: nvidia-k8s-ipam
    imagePullSecrets: []
    repository: ghcr.io/mellanox
    version: v0.1.1
  ofedDriver:
    env:
    - name: HTTPS_PROXY
      value: http://proxy-de.its.hpecorp.net:443
    - name: HTTP_PROXY
      value: http://proxy-de.its.hpecorp.net:443
    - name: https_proxy
      value: http://proxy-de.its.hpecorp.net:443
    - name: http_proxy
      value: http://proxy-de.its.hpecorp.net:443
    image: mofed
    imagePullSecrets: []
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    repoConfig:
      name: repo-config
    repository: nvcr.io/nvidia/mellanox
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    terminationGracePeriodSeconds: 300
    upgradePolicy:
      autoUpgrade: false
      drain:
        deleteEmptyDir: false
        enable: true
        force: false
        podSelector: ""
        timeoutSeconds: 300
      maxParallelUpgrades: 1
    version: 23.10-0.5.5.0
  psp:
    enabled: false
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": {
              "vendors": [],
              "deviceIDs": [],
              "drivers": [],
              "ifNames": ["ens2f0","ens5f0"],
              "linkTypes": []
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    imagePullSecrets: []
    repository: ghcr.io/mellanox
    version: sha-fe7f371c7e1b8315bf900f71cd25cfc1251dc775
  secondaryNetwork:
    cniPlugins:
      image: plugins
      imagePullSecrets: []
      repository: ghcr.io/k8snetworkplumbingwg
      version: v1.2.0-amd64
    multus:
      image: multus-cni
      imagePullSecrets: []
      repository: ghcr.io/k8snetworkplumbingwg
      version: v3.9.3
status:
  appliedStates:
  - name: state-pod-security-policy
    state: ignore
  - name: state-multus-cni
    state: ready
  - name: state-container-networking-plugins
    state: ready
  - name: state-ipoib-cni
    state: ignore
  - name: state-whereabouts-cni
    state: ignore
  - name: state-OFED
    state: notReady
  - name: state-SRIOV-device-plugin
    state: ignore
  - name: state-RDMA-device-plugin
    state: ready
  - name: state-ib-kubernetes
    state: ignore
  - name: state-nv-ipam-cni
    state: ready
  - name: state-nic-feature-discovery
    state: ready
  state: notReady
  • Output of: kubectl -n nvidia-network-operator get -A:
kubectl -n nvidia-network-operator get all
NAME                                      READY   STATUS    RESTARTS   AGE
pod/cni-plugins-ds-clll5                  1/1     Running   0          45m
pod/cni-plugins-ds-f7jc5                  1/1     Running   0          45m
pod/cni-plugins-ds-gz5m2                  1/1     Running   0          45m
pod/cni-plugins-ds-lmk7x                  1/1     Running   0          45m
pod/kube-multus-ds-62x59                  1/1     Running   0          45m
pod/kube-multus-ds-fbxkv                  1/1     Running   0          45m
pod/kube-multus-ds-frpfg                  1/1     Running   0          45m
pod/kube-multus-ds-sh7x7                  1/1     Running   0          45m
pod/mofed-rhel8.8-ds-4rxvb                1/1     Running   0          34m
pod/mofed-rhel8.8-ds-cbwh4                1/1     Running   0          34m
pod/mofed-rhel8.8-ds-h5bcr                1/1     Running   0          34m
pod/network-operator-6444bc476f-g22tf     1/1     Running   0          40m
pod/nic-feature-discovery-ds-h2v99        1/1     Running   0          45m
pod/nic-feature-discovery-ds-kv5rx        1/1     Running   0          45m
pod/nic-feature-discovery-ds-r8sxw        1/1     Running   0          45m
pod/nic-feature-discovery-ds-x2kwt        1/1     Running   0          45m
pod/nv-ipam-controller-64c89dcfd5-lfznx   1/1     Running   0          45m
pod/nv-ipam-controller-64c89dcfd5-p87wp   1/1     Running   0          45m
pod/nv-ipam-node-fzrfk                    1/1     Running   0          45m
pod/nv-ipam-node-hv9cp                    1/1     Running   0          45m
pod/nv-ipam-node-mnmfp                    1/1     Running   0          45m
pod/nv-ipam-node-wjk4z                    1/1     Running   0          45m
pod/rdma-shared-dp-ds-6f79n               1/1     Running   0          25m
pod/rdma-shared-dp-ds-jzk6n               1/1     Running   0          25m
pod/rdma-shared-dp-ds-nk8td               1/1     Running   0          25m

NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                                                                       AGE
daemonset.apps/cni-plugins-ds             4         4         4       4            4           <none>                                                                                                                                                              45m
daemonset.apps/kube-multus-ds             4         4         4       4            4           <none>                                                                                                                                                              45m
daemonset.apps/mofed-rhel8.8-ds           3         3         3       3            3           feature.node.kubernetes.io/pci-15b3.present=true,feature.node.kubernetes.io/system-os_release.ID=rhel,feature.node.kubernetes.io/system-os_release.VERSION_ID=8.8   45m
daemonset.apps/nic-feature-discovery-ds   4         4         4       4            4           <none>                                                                                                                                                              45m
daemonset.apps/nv-ipam-node               4         4         4       4            4           <none>                                                                                                                                                              45m
daemonset.apps/rdma-shared-dp-ds          3         3         3       3            3           feature.node.kubernetes.io/pci-15b3.present=true,network.nvidia.com/operator.mofed.wait=false                                                                       45m

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/network-operator     1/1     1            1           4d19h
deployment.apps/nv-ipam-controller   2/2     2            2           45m

NAME                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/network-operator-5cbb6ccd74     0         0         0       4d19h
replicaset.apps/network-operator-6444bc476f     1         1         1       4d15h
replicaset.apps/network-operator-76b9994f84     0         0         0       4d19h
replicaset.apps/nv-ipam-controller-64c89dcfd5   2         2         2       45m
  • Helm chart values used for deployment:

nfd:
  enabled: false
  deployNodeFeatureRules: true

operator:
  tolerations: []
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 10
          preference:
            matchExpressions:
              - key: "node-role.kubernetes.io/worker"
                operator: In
                values: [""]
        - weight: 1
          preference:
            matchExpressions:
              - key: "hpe.com/dataplatform"
                operator: NotIn
                values: ["true"]
        - weight: 1
          preference:
            matchExpressions:
              - key: "node-role.kubernetes.io/control-plane"
                operator: In
                values: [ "" ]

sriovNetworkOperator:
  enabled: false

# NicClusterPolicy CR values:
deployCR: true

nvPeerDriver:
  deploy: false

rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      ifNames: [ens2f0, ens5f0]

secondaryNetwork:
  deploy: true
  multus:
    deploy: true
  cniPlugins:
    deploy: true
  ipamPlugin:
    deploy: false

nvIpam:
  deploy: true

sriovDevicePlugin:
  deploy: false

ofedDriver:
  deploy: true
  repoConfig:
    name: repo-config
  env:
  - name: HTTPS_PROXY
    value: http://proxy-de.its.hpecorp.net:443
  - name: HTTP_PROXY
    value: http://proxy-de.its.hpecorp.net:443
  - name: https_proxy
    value: http://proxy-de.its.hpecorp.net:443
  - name: http_proxy
    value: http://proxy-de.its.hpecorp.net:443

nicFeatureDiscovery:
  deploy: true
  • Kubernetes' nodes information (labels, annotations and status): kubectl get node -o yaml:

Environment:

  • Kubernetes version (use kubectl version): v1.27.10
  • Hardware configuration:
    • Network adapter model and firmware version:

26:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
26:00.1 DMA controller [0801]: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface [15b3:c2d5] (rev 01)
9f:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
9f:00.1 DMA controller [0801]: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface [15b3:c2d5] (rev 01)
b4:00.0 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
b4:00.1 Ethernet controller [0200]: Mellanox Technologies MT43244 BlueField-3 integrated ConnectX-7 network controller [15b3:a2dc] (rev 01)
b4:00.2 DMA controller [0801]: Mellanox Technologies MT43244 BlueField-3 SoC Management Interface [15b3:c2d5] (rev 01)

  • OS (e.g: cat /etc/os-release):

NAME="Red Hat Enterprise Linux"
VERSION="8.8 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.8 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.8
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.8"

  • Kernel (e.g. uname -a):
    4.18.0-477.10.1.el8_8.x86_64
  • Others:
gseidlerhpe added the bug label on Apr 10, 2024
@rollandf
Member

Thanks for the report.

Which CRI are you using?
For RHEL8/RHEL9, only CRI-O is supported.
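
For reference, the runtime in use on each node shows up in the CONTAINER-RUNTIME column:

# e.g. cri-o://1.27.x or containerd://1.7.x per node
kubectl get nodes -o wide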

@krembu

krembu commented Apr 15, 2024

On RHEL you will have to use CRI-O with containers-common installed to have the entitlement mounted.
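
For reference, on a registered RHEL host the containers-common package ships the default secret-mounts configuration that CRI-O uses to inject the host entitlement into containers. A quick check, assuming a subscribed RHEL 8.8 node:

# Install the package providing the default secret mount configuration
sudo dnf install -y containers-common

# The default mounts file maps host secrets (including the entitlement)
# into /run/secrets of containers; on RHEL it typically contains:
#   /usr/share/rhel/secrets:/run/secrets
cat /usr/share/containers/mounts.conf

# Verify the host entitlement certificates exist
ls /etc/pki/entitlement/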
