[receiver/hostmetrics/cpuscraper] Windows - CTX timeout, use CountsWithContext instead and make it configurable #32133

Open
dloucasfx opened this issue Apr 3, 2024 · 4 comments
Labels: bug, receiver/hostmetrics

Comments

@dloucasfx
Contributor

dloucasfx commented Apr 3, 2024

Component(s)

receiver/hostmetrics

What happened?

Description

The gopsutil function cpu.Counts, which the cpu scraper calls, does not set a deadline or timeout on its context. As a result, WMIQueryWithContext falls back to its hardcoded timeout of 3 seconds.
In large, busy, or under-resourced environments, the WMI call can take longer than 3 seconds. The call then fails with a context deadline exceeded error and the CPU counts are not collected.

Steps to Reproduce

Find a Windows host where WMI calls take longer than 3 seconds, then run the hostmetrics receiver with the cpu scraper enabled.

Expected Result

All metrics are collected, including the physical and logical CPU counts.

Actual Result

The CPU counts are missing, and the following error appears in the logs:

4670103 Mar 29 00:13 Error splunk-otel-collector 3 1.7116855975244713e+09 error scraperhelper/scrapercontroller.go:200 Error scraping metrics {"kind": "receiver", "name": "hostmetrics", "data_type": "metrics", "error": "context deadline exceeded", "scraper": "cpu"}
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
        go.opentelemetry.io/collector/receiver@v0.95.0/scraperhelper/scrapercontroller.go:200
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1
        go.opentelemetry.io/collector/receiver@v0.95.0/scraperhelper/scrapercontroller.go:176

Collector version

v0.95.0

Environment information

Environment

host_cpu_cores:"2"
host_cpu_model:"Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz"
host_mem_total:"8080924"
host_os_name:"Microsoft Windows Server 2016 Datacenter"

OpenTelemetry Collector configuration

No response

Log output

4670103 Mar 29 00:13 Error splunk-otel-collector 3 1.7116855975244713e+09 error scraperhelper/scrapercontroller.go:200 Error scraping metrics {"kind": "receiver", "name": "hostmetrics", "data_type": "metrics", "error": "context deadline exceeded", "scraper": "cpu"}
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
        go.opentelemetry.io/collector/receiver@v0.95.0/scraperhelper/scrapercontroller.go:200
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1
        go.opentelemetry.io/collector/receiver@v0.95.0/scraperhelper/scrapercontroller.go:176

Additional context

The suggestion is to use CountsWithContext instead of Counts and to introduce a wmi_timeout option for the cpuscraper.

cc: @atoulme who helped with the RCA

dloucasfx added the bug and needs triage labels Apr 3, 2024
Contributor

github-actions bot commented Apr 3, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dloucasfx
Contributor Author

This PR does not fix the main ask here, but it improves the situation by avoiding the issue when the metric is not enabled.

dmitryax pushed a commit that referenced this issue Apr 11, 2024
…etric is enabled (#32173)

**Description:**
As described in #32133, on Windows the CPU count comes from a WMI call with a hardcoded context timeout of 3 seconds.
This leads to an error when WMI is slow or the system is under heavy load, which prevents all of the collected metrics from being emitted.

The CPU count metrics, logical and physical, are not enabled by default,
and there is no reason to calculate them unless they are enabled.

**Link to tracking Issue:** #32133

**Testing:** unit test has been validated
**Documentation:** <Describe the documentation added.>

Signed-off-by: Dani Louca <dlouca@splunk.com>
Co-authored-by: Antoine Toulme <antoine@lunar-ocean.com>
@crobert-1
Member

Removing needs triage because the original PR has been merged. I'll defer to code owners for more discussion on potentially adding configuration options here.

crobert-1 removed the needs triage label Apr 15, 2024
@cforce

cforce commented May 3, 2024

Is this only an issue on Windows, or does it also affect Linux?

rimitchell pushed a commit to rimitchell/opentelemetry-collector-contrib that referenced this issue May 8, 2024
…etric is enabled (open-telemetry#32173)
