
Create instant_per_process_cpu_mem_usage.sh #157

Open · deajan wants to merge 3 commits into master

Conversation

@deajan commented Jun 3, 2023

Hello,

I've built this (very tiny footprint) script that reports per-process CPU metrics without requiring processes to be named in advance.
Most other tools out there require setting up process group names in order to capture process metrics; other solutions provide non-instant CPU metrics as given by ps.

Getting instant per-process CPU usage is really useful for admins who need to quickly find a culprit.

I understand that you discourage shell scripts in favor of the Python client, but this one is mostly a big one-liner, and would only lose its tiny footprint if rewritten in Python.
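
To give an idea, here is a simplified sketch of the approach (not the exact script from this PR; the metric name and output path below are illustrative). top has to be sampled twice because its first frame reports cumulative %CPU; only the second frame is instant:

#!/usr/bin/env bash
# Sketch only: sample top twice and emit the second, instant frame as
# textfile-collector metrics. Metric name and output path are illustrative.
OUT=/var/lib/node_exporter/textfile_collector/top_process_cpu.prom

LC_ALL=C top -b -n 2 -d 1 -w 512 | awk '
    /^top -/ { block++ }                        # each refresh starts with "top - ..."
    block == 2 && $1 ~ /^[0-9]+$/ && $9+0 > 0 { # 2nd frame, numeric PID, nonzero %CPU
        # default batch columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        printf "top_process_cpu_usage{process=\"%s\"} %s\n", $12, $9
    }' > "${OUT}.$$" && mv "${OUT}.$$" "${OUT}"  # atomic replace for node_exporter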

Would you mind merging this one?
I've found no other solution out there that achieves the same, so I don't think I'm reinventing the wheel ;)

I can also provide the corresponding Grafana dashboard of course:
[screenshot: Grafana dashboard]

Hope this will help other admins ;)
Best regards.

This file adds a quick wrapper around top.
Most process CPU time collecting tools revolve around ps, which does not provide instant CPU metrics, but rather the % of CPU spent since the process was launched.

Signed-off-by: Orsiris de Jong <ozy@netpower.fr>
@dswarbrick (Member) commented

How does this compare to what process_exporter can do?

@deajan (Author) commented Jun 3, 2023

process_exporter needs to be configured to pick up processes by name or group, so you have to know beforehand which processes you want to monitor.

This one just picks up whatever uses CPU or RAM.
So if a new process shows up, it will be reported by the script as long as it uses resources.

@dswarbrick (Member) commented Jun 3, 2023

process_exporter matches process names by regular expression, which can be as concise or as vague as you like. The example in the README would match any process:

process_names:
  - name: "{{.Comm}}"
    cmdline:
      - '.+'

Generally, a textfile collector should not overlap with functionality provided by PromQL. That includes topk-like behaviour, which could lead to metrics appearing and disappearing as they oscillate in and out of the collector's selection criteria (e.g. cpu-hungry, memory-hungry). This tends to cause issues with Prometheus' default look-behind interval of 5m, resulting in apparently stale metrics.
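
For example, top-N selection can be done at query time over process_exporter's metrics instead of inside the collector (hypothetical query):

# top 5 process groups by CPU usage, selected at query time rather than
# inside the collector
topk(5, sum by (groupname) (rate(namedprocess_namegroup_cpu_seconds_total[1m])))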

Another thing that I would consider an absolute no-no is including process IDs as labels, since they are pretty much by definition high entropy, and would also result in problems similar to those described above. The process_exporter README also highlights this:

Using PID or StartTime is discouraged: this is almost never what you want, and is likely to result in high cardinality metrics which Prometheus will have trouble with.

@deajan (Author) commented Jun 4, 2023

process_exporter matches process names by regular expression, which can be as concise or as vague as you like

I've actually played with process_exporter before trying to reinvent the wheel.

On the good side of process_exporter:

  • It shows details per thread
  • Can be triggered by simply requesting /metrics
  • It has way more functionality, like I/O, context switches, page faults ...

Caveats I found:

  • You don't get to know the process' arguments
    • Example: if you run a python script (or cockpit, or ansible, or whatever runs as a python-invoked script), you will not know which python program creates the CPU usage spike; you'll only get to know that it's python
    • Example for setroubleshootd eating 100% CPU as shown by my script:
top_process_cpu_usage{pid="2584",process="/usr/bin/python3",sanitized_args=" -Es /usr/sbin/tuned -l -P"} 0.2
top_process_cpu_usage{pid="15501",process="/usr/bin/python3",sanitized_args=" -s /usr/sbin/firewalld --nofork --nopid"} 0.1
top_process_cpu_usage{pid="23299",process="python3",sanitized_args=" test.py"} 5.0
top_process_cpu_usage{pid="45921",process="/usr/bin/python3",sanitized_args="  -Es /usr/sbin/setroubleshootd -f"} 99.7
    • Example for setroubleshootd eating 100% CPU as shown by process_exporter:
namedprocess_namegroup_context_switches_total{ctxswitchtype="voluntary",groupname="python3"} 525551
namedprocess_namegroup_cpu_seconds_total{groupname="python3",mode="system"} 27.43
namedprocess_namegroup_cpu_seconds_total{groupname="python3",mode="user"} 18.41
namedprocess_namegroup_major_page_faults_total{groupname="python3"} 0
namedprocess_namegroup_memory_bytes{groupname="python3",memtype="proportionalResident"} 4.164608e+06
namedprocess_namegroup_memory_bytes{groupname="python3",memtype="proportionalSwapped"} 0
namedprocess_namegroup_memory_bytes{groupname="python3",memtype="resident"} 7.090176e+06
namedprocess_namegroup_memory_bytes{groupname="python3",memtype="swapped"} 0
namedprocess_namegroup_memory_bytes{groupname="python3",memtype="virtual"} 9.658368e+06
namedprocess_namegroup_minor_page_faults_total{groupname="python3"} 877
namedprocess_namegroup_num_procs{groupname="python3"} 1
namedprocess_namegroup_num_threads{groupname="python3"} 1
namedprocess_namegroup_oldest_start_time_seconds{groupname="python3"} 1.685868499e+09
namedprocess_namegroup_open_filedesc{groupname="python3"} 3
namedprocess_namegroup_read_bytes_total{groupname="python3"} 0
namedprocess_namegroup_states{groupname="python3",state="Other"} 0
namedprocess_namegroup_states{groupname="python3",state="Running"} 0
namedprocess_namegroup_states{groupname="python3",state="Sleeping"} 1
namedprocess_namegroup_states{groupname="python3",state="Waiting"} 0
namedprocess_namegroup_states{groupname="python3",state="Zombie"} 0
namedprocess_namegroup_threads_wchan{groupname="python3",wchan="do_select"} 1
namedprocess_namegroup_worst_fd_ratio{groupname="python3"} 0.0029296875
namedprocess_namegroup_write_bytes_total{groupname="python3"} 0
  • In process_exporter, CPU usage is shown in seconds since process start (condensed into one metric for all similarly named processes). While you can calculate a CPU usage percentage from that, it involves knowing two more variables: total available CPU time in seconds and the number of CPU cores, the latter not being easy to obtain. Using something like irate(namedprocess_namegroup_cpu_seconds_total[5m]) * 100 would just give CPU usage relative to a single core, not to total capacity (see the sketch below).
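
A hypothetical query that does account for core count, assuming node_exporter's node_cpu_seconds_total is scraped from the same, single instance:

# per-group CPU usage as a percentage of total capacity; count() of the
# idle-mode series yields the number of cores on a single instance
sum by (groupname) (rate(namedprocess_namegroup_cpu_seconds_total[5m]))
  / scalar(count(node_cpu_seconds_total{mode="idle"})) * 100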

Therefore the two tools are not meant for the same job; mine is a "record my top command output" kind of tool, with process name and command line, without grouping anything, which is exactly what some people may need for diagnostics.

Generally, a textfile collector should not overlap with functionality provided by promql. That includes topk-like behaviour

What do you mean? My script doesn't "aggregate" anything the way topk would.
Do you mean it should keep zero values to avoid stale metrics?

Another thing that I would consider an absolute no-no is including process IDs as labels, since they are pretty much by definition high entropy, and would also result in similar problems as described above. The process_exporter README highlights that also:

Makes sense. I'll have the PIDs removed, even though I still think that, at an admin's diagnostic level, it is useful to know whether the process python /some/script.py is the same process as the python /some/script.py from ten minutes earlier, or a new instance with a different PID.
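
For what it's worth, dropping the pid label could be a one-line post-processing step on the generated file (hypothetical sketch; any series whose remaining labels then collide would have to be summed before exposition, or the output is invalid):

# hypothetical: strip the pid label from the generated metrics
sed -E 's/pid="[0-9]+",//g' top_process_cpu.prom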
