inputs.intel_powerstat probably not honoring HOST_MOUNT_PREFIX #14881

Open
divStar opened this issue Feb 22, 2024 · 7 comments
Labels: bug (unexpected problem or unintended behavior), upstream (bug or issues that rely on dependency fixes)

Comments

divStar commented Feb 22, 2024

Relevant telegraf.conf

[[inputs.intel_powerstat]]
  interval = "10s"
  ## The user can choose which package metrics are monitored by the plugin with
  ## the package_metrics setting:
  ## - The default, will collect "current_power_consumption",
  ##   "current_dram_power_consumption" and "thermal_design_power"
  ## - Leaving this setting empty means no package metrics will be collected
  ## - Finally, a user can specify individual metrics to capture from the
  ##   supported options list
  ## Supported options:
  ##   "current_power_consumption", "current_dram_power_consumption",
  ##   "thermal_design_power", "max_turbo_frequency", "uncore_frequency",
  ##   "cpu_base_frequency"
  package_metrics = ["current_power_consumption", "current_dram_power_consumption"]

  ## The user can choose which per-CPU metrics are monitored by the plugin in
  ## cpu_metrics array.
  ## Empty or missing array means no per-CPU specific metrics will be collected
  ## by the plugin.
  ## Supported options:
  ##   "cpu_frequency", "cpu_c0_state_residency", "cpu_c1_state_residency",
  ##   "cpu_c6_state_residency", "cpu_busy_cycles", "cpu_temperature",
  ##   "cpu_busy_frequency"
  ## ATTENTION: cpu_busy_cycles is DEPRECATED - use cpu_c0_state_residency
  cpu_metrics = ["cpu_frequency", "cpu_c0_state_residency", "cpu_c1_state_residency", "cpu_c6_state_residency", "cpu_busy_frequency"]

Logs from Telegraf

2024-02-22T15:53:22Z I! Loading config: /etc/telegraf/telegraf.conf
2024-02-22T15:53:22Z I! Starting Telegraf 1.29.5 brought to you by InfluxData the makers of InfluxDB
2024-02-22T15:53:22Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 60 outputs, 6 secret-stores
2024-02-22T15:53:22Z I! Loaded inputs: intel_powerstat mqtt_consumer
2024-02-22T15:53:22Z I! Loaded aggregators: 
2024-02-22T15:53:22Z I! Loaded processors: 
2024-02-22T15:53:22Z I! Loaded secretstores: 
2024-02-22T15:53:22Z I! Loaded outputs: influxdb_v2
2024-02-22T15:53:22Z I! Tags enabled: host=233d22a54fc0
2024-02-22T15:53:22Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"233d22a54fc0", Flush Interval:10s
2024-02-22T15:53:52Z I! [inputs.mqtt_consumer] Connected [tcp://mqtt.my.family:1883]
2024-02-22T15:53:52Z W! [inputs.intel_powerstat] Plugin started with errors: PowerTelemetry instance initialized with errors: failed to initialize msr: invalid MSR base path "/dev/cpu": file "/dev/cpu" does not exist; failed to initialize rapl: invalid base path of rapl control zone: file "/sys/devices/virtual/powercap/intel-rapl" does not exist
2024-02-22T15:54:00Z E! [inputs.intel_powerstat] Error in plugin: failed to update MSR time-related metrics: module "msr" is not initialized
2024-02-22T15:54:00Z E! [inputs.intel_powerstat] Error in plugin: failed to get "current_power_consumption": module "rapl" is not initialized
2024-02-22T15:54:00Z E! [inputs.intel_powerstat] Error in plugin: failed to get "current_dram_power_consumption": module "rapl" is not initialized

System info

Ubuntu 22.04.03, Telegraf 1.29.5, Docker (Server Version) 25.0.3

Docker

version: '3'

services:
  telegraf:
    image: telegraf:latest
    container_name: telegraf
    restart: unless-stopped
    environment:
      INFLUX_TOKEN: "<token redacted>"
      HOST_ETC: "/hostfs/etc"
      HOST_PROC: "/hostfs/proc"
      HOST_SYS: "/hostfs/sys"
      HOST_VAR: "/hostfs/var"
      HOST_RUN: "/hostfs/run"
      HOST_MOUNT_PREFIX: "/hostfs"
    volumes:
      - '<host-path>/telegraf.conf:/etc/telegraf/telegraf.conf'
      - '/:/hostfs:ro'
    # depends_on:
    #  - influxdb
    networks:
      - services-network

networks:
  services-network:
    external: true

Steps to reproduce

  1. Ensure your system supports Intel MSR and/or RAPL and that the appropriate kernel modules have been loaded (e.g. using lsmod | grep rapl).
  2. Ensure your system has cpuid installed (sudo apt-get install -y cpuid).
  3. Set up a network (in my example it's called services-network and is a bridge-type network).
  4. Create a docker-compose.yaml with just Telegraf, as shown in the Docker section above.
  5. Configure it to use inputs.intel_powerstat.
  6. Run the docker-compose.yaml file.
  7. Wait about 20 seconds.

Expected behavior

I expect the plugin to look for the PowerTelemetry paths inside /hostfs/sys/... or /hostfs/dev/... etc., to start without errors, and ultimately to collect the corresponding values.

Actual behavior

As the warning 2024-02-22T15:53:52Z W! [inputs.intel_powerstat] Plugin started with errors: PowerTelemetry instance initialized with errors: failed to initialize msr: invalid MSR base path "/dev/cpu": file "/dev/cpu" does not exist; failed to initialize rapl: invalid base path of rapl control zone: file "/sys/devices/virtual/powercap/intel-rapl" does not exist states, the plugin does not find the corresponding directories.
/hostfs/dev/cpu and /hostfs/sys/devices/virtual/powercap/intel-rapl do exist inside the container, but the plugin only checks the unprefixed paths /dev/cpu and /sys/devices/virtual/powercap/intel-rapl.

Additional info

I've checked out the project and tried looking around, but I cannot find where (if at all) HOST_MOUNT_PREFIX or any of the HOST_* environment variables would be used. They are used to some extent in other plugins it seems, but not in this one.

Edit: I also found the following: when running Telegraf installed directly on the host, even though MSR and RAPL are available, I had to do a couple of things before the plugin worked, namely this:

sudo chmod -R a+r /sys/devices/virtual/powercap/
sudo setcap cap_sys_rawio=ep /usr/bin/telegraf
sudo systemctl restart telegraf

After that, Telegraf started working locally and sending values to my InfluxDB in the container as I'd expect it to.

In the containerized Telegraf instance though, even mounting to /sys and /dev directly (not /hostfs/sys and /hostfs/dev) and even using privileged: true and user: "0:0", I could not get it to work.

I could probably help create a PR if someone were to show me where to dig.

divStar added the bug label Feb 22, 2024

powersj (Contributor) commented Feb 22, 2024

I believe the plugin now depends on github.com/intel/powertelemetry for reading and collecting all the necessary data points. That means that it would need to learn about this environment variable.

@zak-pawel - is my understanding correct? If so, is this something you would consider adding?

powersj added the waiting for response and upstream labels Feb 22, 2024

divStar (Author) commented Feb 22, 2024

Yeah, I just figured that out looking deeper into the code. The inputs.intel_powerstat plugin depends on github.com/shirou/gopsutil/v3, which in turn depends on github.com/intel/powertelemetry, which isn't aware of the environment variables and does not seem to have any setters for such (in order to prepend the path to e.g. /dev/cpu).

Edit: I'm sorry the waiting for response label got removed :(.
Edit2: actually the dependency you mentioned is directly imported. Sorry for the confusion.

telegraf-tiger bot removed the waiting for response label Feb 22, 2024

p-zak (Collaborator) commented Feb 22, 2024

@divStar Right, in the initial version of https://github.com/intel/powertelemetry, we did not implement support for HOST_* environment variables, but it is in our backlog, and as soon as we implement it in the library, the Telegraf Powerstat plugin will be updated.

divStar (Author) commented Feb 22, 2024

@p-zak - thank you very much! For now I have found a workaround: running one instance of Telegraf directly on the host and submitting the values to the same InfluxDB. It's not a pretty solution, but it works.

Is there anything I can help with? I am a software engineer with some experience, mostly in Java (though I have also worked in many other languages).

Also, anyone: please feel free to close this issue, especially if there is another issue we could link to.

p-zak (Collaborator) commented Feb 22, 2024

@divStar
You can try to mount:

"/sys/devices:/sys/devices:ro"
"/dev/cpu:/dev/cpu:ro"

In the newest versions of Docker, you will need the "privileged" flag as well.

powersj (Contributor) commented May 7, 2024

@divStar can you please open an upstream issue in the https://github.com/intel/powertelemetry repo to add this, and link that issue here?

divStar (Author) commented May 10, 2024

I've created the issue, mostly copying the issue description from here (I am not sure if I should have reduced the description, but I added a TL;DR section).

telegraf-tiger bot removed the waiting for response label May 10, 2024