MOFED Pod CrashLoopBackOff State #830

Open
ruta-04 opened this issue Feb 23, 2024 · 8 comments

@ruta-04

ruta-04 commented Feb 23, 2024

I am using NVIDIA Network Operator 23.10.0 on OpenShift Container Platform (RHCOS 4.14).

When I create a NicClusterPolicy, it spins up MOFED pods that crash continuously (they are Running but never become Ready).

`
[ansible@csah-pri entitlement]$ oc logs mofed-rhcos4.14-ds-ncj2n
Unsetting driver ready state
No OFED driver found for kernel 5.14.0-284.40.1.el9_2.x86_64
Enabling RHOCP and EUS RPM repos...
ID="rhcos"
VERSION_ID="4.14"
RHEL_VERSION="9.2"
Updating Subscription Management repositories.
Unable to read consumer identity
subscription-manager is operating in container mode.
Updating Subscription Management repositories.
Unable to read consumer identity
subscription-manager is operating in container mode.
cuda 589 B/s | 3.5 kB 00:06
cuda 294 kB/s | 1.2 MB 00:04
Red Hat Enterprise Linux 9 for x86_64 - AppStre 0.0 B/s | 0 B 00:10
Errors during downloading metadata for repository 'rhel-9-for-x86_64-appstream-rpms':

  - Curl error (6): Couldn't resolve host name for https://cdn.redhat.com/content/dist/rhel9/9.2/x86_64/appstream/os/repodata/repomd.xml [Could not resolve host: cdn.redhat.com]
Error: Failed to download metadata for repo 'rhel-9-for-x86_64-appstream-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
Updating Subscription Management repositories.
Unable to read consumer identity
subscription-manager is operating in container mode.
Updating Subscription Management repositories.
Unable to read consumer identity
subscription-manager is operating in container mode.
Updating Subscription Management repositories.
Unable to read consumer identity
subscription-manager is operating in container mode.
cuda 589 B/s | 3.5 kB 00:06
Red Hat Enterprise Linux 9 for x86_64 - AppStre 5.0 MB/s | 24 MB 00:04
Red Hat Enterprise Linux 9 for x86_64 - BaseOS 13 MB/s | 16 MB 00:01
Red Hat Enterprise Linux 9 for x86_64 - BaseOS 11 MB/s | 14 MB 00:01
Red Hat Universal Base Image 9 (RPMs) - BaseOS 3.1 kB/s | 3.8 kB 00:01
Red Hat Universal Base Image 9 (RPMs) - BaseOS 802 kB/s | 515 kB 00:00
Red Hat Universal Base Image 9 (RPMs) - AppStre 22 kB/s | 4.2 kB 00:00
Red Hat Universal Base Image 9 (RPMs) - AppStre 2.3 MB/s | 1.8 MB 00:00
Red Hat Universal Base Image 9 (RPMs) - CodeRea 22 kB/s | 4.2 kB 00:00
Red Hat Universal Base Image 9 (RPMs) - CodeRea 345 kB/s | 192 kB 00:00
Metadata cache created.
Installing dependencies
Error in POSTTRANS scriptlet in rpm package kernel-core

Installed:
cryptsetup-libs-2.6.0-3.el9.x86_64
device-mapper-9:1.02.195-3.el9.x86_64
device-mapper-libs-9:1.02.195-3.el9.x86_64
dracut-057-44.git20230822.el9.x86_64
kbd-2.4.0-9.el9.x86_64
kbd-legacy-2.4.0-9.el9.noarch
kbd-misc-2.4.0-9.el9.noarch
kernel-5.14.0-284.40.1.el9_2.x86_64
kernel-core-5.14.0-284.40.1.el9_2.x86_64
kernel-modules-5.14.0-284.40.1.el9_2.x86_64
kernel-modules-core-5.14.0-284.40.1.el9_2.x86_64
kpartx-0.8.7-22.el9.x86_64
libkcapi-1.3.1-3.el9.x86_64
libkcapi-hmaccalc-1.3.1-3.el9.x86_64
linux-firmware-20230310-138.el9_2.noarch
linux-firmware-whence-20230310-138.el9_2.noarch
pigz-2.5-4.el9.x86_64
systemd-udev-252-18.el9.x86_64

Downgraded:
elfutils-0.188-3.el9.x86_64
elfutils-debuginfod-client-0.188-3.el9.x86_64
elfutils-libelf-0.188-3.el9.x86_64
elfutils-libs-0.188-3.el9.x86_64
Installed:
createrepo_c-0.20.1-1.el9.x86_64
createrepo_c-libs-0.20.1-1.el9.x86_64
elfutils-libelf-devel-0.188-3.el9.x86_64
kernel-rpm-macros-185-12.el9.noarch
numactl-libs-2.0.16-1.el9.x86_64
zlib-devel-1.2.11-40.el9.x86_64
Installing Linux kernel headers...
Error: Unable to find a match: kernel-headers-5.14.0-284.40.1.el9_2.x86_64 kernel-devel-5.14.0-284.40.1.el9_2.x86_64

Command "dnf -q -y --releasever=9.2 install kernel-headers-5.14.0-284.40.1.el9_2.x86_64 kernel-devel-5.14.0-284.40.1.el9_2.x86_64" failed with exit code: 1
Terminate event caught
Terminating container
Unsetting driver ready state
Keeping currently loaded Mellanox OFED Driver...`

I can't seem to pinpoint the issue here. Is it that the OS kernel version is not currently supported by the OFED driver?
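
For triage, the failing step in a log like the one above can be isolated by extracting the packages dnf could not match. The following is a minimal sketch, assuming the log has been saved to a string; the function name `missing_packages` is hypothetical and not part of the Network Operator or of dnf.

```python
import re

# Hypothetical helper: collect the package names that dnf reported as unmatched.
def missing_packages(log_text: str) -> list[str]:
    pkgs: list[str] = []
    for line in log_text.splitlines():
        # dnf prints all unmatched package specs on a single line.
        m = re.search(r"Error: Unable to find a match:\s*(.+)", line)
        if m:
            pkgs.extend(m.group(1).split())
    return pkgs

# Two lines copied from the MOFED pod log in this issue.
sample_log = (
    "Installing Linux kernel headers...\n"
    "Error: Unable to find a match: "
    "kernel-headers-5.14.0-284.40.1.el9_2.x86_64 "
    "kernel-devel-5.14.0-284.40.1.el9_2.x86_64\n"
)

for pkg in missing_packages(sample_log):
    print(pkg)
```

In this log, both unmatched packages are tied to the node's running kernel, 5.14.0-284.40.1.el9_2, so the question becomes whether the enabled repositories carry that exact build.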

@rollandf
Member

@ruta-04
Author

ruta-04 commented Feb 26, 2024

@rollandf

Yes, I added the cluster-wide entitlement and ran the test pod from the instructions, which gave the following output. It matches the example output.

[ansible@csah-pri entitlement]$ oc logs cluster-entitled-build-pod -n default
Updating Subscription Management repositories.
Unable to read consumer identity
subscription-manager is operating in container mode.
Red Hat Enterprise Linux 9 for x86_64 - BaseOS 14 MB/s | 17 MB 00:01
Red Hat Enterprise Linux 9 for x86_64 - AppStre 25 MB/s | 29 MB 00:01
Red Hat Universal Base Image 9 (RPMs) - BaseOS 481 kB/s | 515 kB 00:01
Red Hat Universal Base Image 9 (RPMs) - AppStre 2.2 MB/s | 1.8 MB 00:00
Red Hat Universal Base Image 9 (RPMs) - CodeRea 321 kB/s | 192 kB 00:00
====================== Name Exactly Matched: kernel-devel ======================
kernel-devel-5.14.0-70.13.1.el9_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-70.17.1.el9_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-70.22.1.el9_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-70.26.1.el9_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-70.30.1.el9_0.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-162.6.1.el9_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-162.23.1.el9_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-162.12.1.el9_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-162.22.2.el9_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-162.18.1.el9_1.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-284.11.1.el9_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-284.25.1.el9_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-284.18.1.el9_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-284.30.1.el9_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-362.8.1.el9_3.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-362.13.1.el9_3.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-362.18.1.el9_3.x86_64 : Development package for building kernel modules to match the kernel
========================== Name Matched: kernel-devel ==========================
kernel-devel-matched-5.14.0-70.13.1.el9_0.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-70.17.1.el9_0.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-70.26.1.el9_0.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-70.22.1.el9_0.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-70.30.1.el9_0.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-162.6.1.el9_1.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-162.23.1.el9_1.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-162.22.2.el9_1.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-162.12.1.el9_1.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-162.18.1.el9_1.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-284.11.1.el9_2.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-284.25.1.el9_2.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-284.18.1.el9_2.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-362.8.1.el9_3.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-284.30.1.el9_2.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-362.13.1.el9_3.x86_64 : Meta package to install matching core and devel packages for a given kernel
kernel-devel-matched-5.14.0-362.18.1.el9_3.x86_64 : Meta package to install matching core and devel packages for a given kernel
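
The search output above can be checked mechanically against the kernel the MOFED pod tried to build for. The listing does not appear to include a `kernel-devel` entry for 5.14.0-284.40.1.el9_2, the version named in the "Unable to find a match" error. Below is a minimal sketch of that check, assuming the search output is available as a string; both function names are hypothetical.

```python
import re

# Hypothetical helper: versions of kernel-devel packages found in the search text.
# Requiring a digit after "kernel-devel-" skips the kernel-devel-matched meta packages.
def available_kernel_devel(search_output: str) -> set[str]:
    pattern = re.compile(r"\bkernel-devel-(\d[\w.\-]*?)\.x86_64\b")
    return set(pattern.findall(search_output))

# Hypothetical helper: is there a kernel-devel build for this kernel release?
def kernel_devel_listed(kernel_release: str, search_output: str) -> bool:
    suffix = ".x86_64"
    if kernel_release.endswith(suffix):
        kernel_release = kernel_release[: -len(suffix)]
    return kernel_release in available_kernel_devel(search_output)

# A few el9_2 entries copied from the output above.
sample_search = """
kernel-devel-5.14.0-284.11.1.el9_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-5.14.0-284.30.1.el9_2.x86_64 : Development package for building kernel modules to match the kernel
kernel-devel-matched-5.14.0-284.30.1.el9_2.x86_64 : Meta package to install matching core and devel packages for a given kernel
"""

print(kernel_devel_listed("5.14.0-284.40.1.el9_2.x86_64", sample_search))
print(kernel_devel_listed("5.14.0-284.30.1.el9_2.x86_64", sample_search))
```

A passing entitlement test only shows that the repositories are reachable; it does not by itself confirm that the build matching the node's kernel is published in them.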

@ruta-04
Author

ruta-04 commented Mar 6, 2024

Any update on this?

@rollandf
Member

It looks like the entitlement is set up correctly. Did you restart the MOFED pods after setting up the entitlement?

BTW, in version 24.1, entitlement is not needed because compilation is done with DTK (the OpenShift Driver Toolkit).

@ruta-04
Author

ruta-04 commented Mar 11, 2024 via email

@rollandf
Member

@e0ne
I remember that we encountered an issue where the Red Hat subscription had to be of a certain type.
Do you recall anything related?

@ruta-04
Author

ruta-04 commented Apr 16, 2024 via email

@rollandf
Member

Would you be able to use a more recent Network Operator release?
Starting with 24.1, we no longer require cluster-wide entitlement in OpenShift.
Latest:
https://docs.nvidia.com/networking/display/kubernetes2411
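
The version cutoff mentioned in this thread can be expressed as a small check. This is only a sketch based on the statement above that releases from 24.1 onward do not need cluster-wide entitlement; the function names are hypothetical and the version strings are assumed to be dot-separated integers such as `23.10.0`.

```python
# Hypothetical helper: turn "23.10.0" into (23, 10, 0) for numeric comparison.
def parse_release(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))

# Hypothetical helper: True when the release predates 24.1, the cutoff stated in this thread.
def needs_cluster_entitlement(operator_version: str) -> bool:
    return parse_release(operator_version)[:2] < (24, 1)

print(needs_cluster_entitlement("23.10.0"))  # release used by the issue author
print(needs_cluster_entitlement("24.1.0"))
```

Comparing integer tuples rather than strings avoids ordering mistakes such as treating "23.10" as smaller than "23.4".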
