DCGM exporter doesn't work on the latest version of Bottlerocket AMI #34

Open
peter-volkov opened this issue May 8, 2024 · 2 comments

@peter-volkov

I'm not sure this is the right place to report this; please redirect me if it is not.

Goal:
I want an EKS cluster with working observability, the Bottlerocket AMI, and GPU nodes (g5* instances).
I use this Helm chart by enabling the amazon-cloudwatch-observability EKS add-on for my cluster.

Steps to reproduce:

  1. I create an EKS cluster on the latest version, with GPU nodes running the latest version of the Bottlerocket AMI.
  2. I use the latest nvidia-device-plugin (https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.15.0).
  3. I enable the latest version of the amazon-cloudwatch-observability EKS add-on (it uses the dcgm-exporter image 602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/observability/dcgm-exporter:3.3.3-3.3.1-ubuntu22.04).
  4. All related daemonsets except dcgm-exporter work well.
  5. The dcgm-exporter containers log the following:
time="2024-05-08T11:22:02Z" level=info msg="Starting dcgm-exporter"
Error: Failed to initialize NVML
time="2024-05-08T11:22:02Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"
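
For anyone triaging the same symptom across many pods, the output above can be checked mechanically. The following is a minimal Python sketch, not part of the add-on or of dcgm-exporter: the helper names `parse_logfmt` and `summarize` are hypothetical, and it assumes the log lines follow the `key=value` layout shown above, with the NVML error appearing as a plain unstructured line.

```python
import re

# Matches key=value pairs where the value is either double-quoted or a bare token.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Return the key/value pairs of one logfmt-style line as a dict.

    Lines without any key=value pair (such as the plain NVML error) yield {}.
    """
    fields = {}
    for key, value in _PAIR.findall(line):
        # Strip the surrounding quotes from quoted values.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def summarize(lines):
    """Collect fatal messages and note whether NVML initialization failed."""
    fatal = []
    nvml_failed = False
    for line in lines:
        if "Failed to initialize NVML" in line:
            nvml_failed = True
        fields = parse_logfmt(line)
        if fields.get("level") == "fatal":
            fatal.append(fields.get("msg", ""))
    return {"nvml_failed": nvml_failed, "fatal": fatal}


# The three lines reported in this issue.
LOG = [
    'time="2024-05-08T11:22:02Z" level=info msg="Starting dcgm-exporter"',
    'Error: Failed to initialize NVML',
    'time="2024-05-08T11:22:02Z" level=fatal msg="Error starting nv-hostengine: DCGM initialization error"',
]

print(summarize(LOG))
```

The sketch only classifies the output; it does not tell you why NVML failed to initialize on the node.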

I suspect a version incompatibility between DCGM and the NVIDIA driver (installed on the nodes via k8s-device-plugin).
What should I do to make the DCGM exporter work?

@mitali-salvi
Contributor

Hey @peter-volkov,
could you provide the full Bottlerocket AMI details so that we can re-create this issue on our end?

@peter-volkov
Author

peter-volkov commented May 21, 2024

I appreciate your help.
I'm just creating a node group with ami_type = "BOTTLEROCKET_x86_64_NVIDIA"
specified in the Terraform config. It takes the latest version of the image family at initial creation.
Currently I have release_version = "1.19.5-64049ba8"
imageId=ami-0f3f964e4f939bbd0
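
When comparing node groups, it can help to split the release_version string into its parts. This is a small Python sketch under one assumption that the thread does not confirm: that the value has the form `<major>.<minor>.<patch>-<commit>`, as the one above appears to. The function name `split_release_version` is hypothetical.

```python
def split_release_version(release_version):
    """Split '<major>.<minor>.<patch>-<commit>' into a version tuple and a commit.

    The format is an assumption based on the single value quoted in this issue.
    """
    version, sep, commit = release_version.partition("-")
    if not sep or not version or not commit:
        raise ValueError(f"unexpected release_version: {release_version!r}")
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"unexpected version part: {version!r}")
    major, minor, patch = (int(p) for p in parts)
    return {"version": (major, minor, patch), "commit": commit}


info = split_release_version("1.19.5-64049ba8")
print(info)
```

Keeping the version as a tuple of integers lets two releases be compared numerically rather than as strings.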

But I don't really care about the version. If you can successfully run the DCGM exporter as part of the amazon-cloudwatch-observability EKS add-on on a g5.xlarge with any Bottlerocket image, that will be enough for me. I will then treat the issue as my own problem and debug it myself.


2 participants