Log warning when failing to parse openmetrics response #17514

Nevon · 2024-05-03T14:48:10Z

What does this PR do?

Log a warning when the plain text prometheus metric parsing fails, including the line that it failed to parse. For example:

2024-05-15 14:58:11 UTC | CORE | WARN | (pkg/collector/python/datadog_agent.go:131 in LogMessage) | openmetrics:c2c_prometheus:21fa576d88f0d1f8 | (mixins.py:490) | Failed to parse metric response 'histogram_bucket{le="0.01"💣} 0': Invalid metric name: histogram_bucket{le="0.01"💣}

Motivation

This issue

Currently the text_fd_to_metric_families just raises generic errors that don't provide enough context to know what it was that couldn't be parsed, which is problematic when deployed into environments where a single agent is parsing metrics from many different sources via autodiscovery.

Additional Notes

I'm very much not a Python developer, so let me know if what I've done is very unidiomatic. I considered wrapping the exception in my own ParserException, but from reading the other checks that didn't seem like a common thing to do. I also wasn't able to actually run the integration tests locally, because they seem to only work with Docker and not docker-compatible clients like nerdctl, but I'm assuming the CI pipeline will.

Review checklist (to be filled by reviewers)

Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
Changelog entries must be created for modifications to shipped code
Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

codecov · 2024-05-03T14:53:48Z

Codecov Report

Attention: Patch coverage is 70.00000% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 89.95%. Comparing base (2a6ec26) to head (9663a00).
Report is 87 commits behind head on master.

Additional details and impacted files

Flag	Coverage Δ
active_directory	`100.00% <ø> (+17.64%)`	⬆️
activemq_xml	`?`
aerospike	`87.17% <ø> (+0.32%)`	⬆️
airflow	`92.20% <ø> (ø)`
amazon_msk	`88.91% <ø> (+37.86%)`	⬆️
ambari	`?`
apache	`95.08% <ø> (?)`
arangodb	`?`
aspdotnet	`100.00% <ø> (?)`
avi_vantage	`?`
azure_iot_edge	`?`
cacti	`87.90% <ø> (?)`
cassandra_nodetool	`93.16% <ø> (?)`
ceph	`91.07% <ø> (ø)`
cert_manager	`77.41% <ø> (?)`
cilium	`77.72% <ø> (?)`
cisco_aci	`?`
citrix_hypervisor	`?`
clickhouse	`94.18% <ø> (-1.47%)`	⬇️
cloudera	`?`
cockroachdb	`93.19% <ø> (ø)`
confluent_platform	`81.96% <ø> (?)`
consul	`91.82% <ø> (?)`
coredns	`94.61% <ø> (?)`
couch	`94.50% <ø> (+10.28%)`	⬆️
couchbase	`84.89% <ø> (ø)`
crio	`?`
datadog_checks_base	`89.63% <70.00%> (+1.34%)`	⬆️
datadog_checks_dev	`74.79% <ø> (?)`
datadog_checks_downloader	`80.78% <ø> (-0.62%)`	⬇️
datadog_cluster_agent	`90.19% <ø> (ø)`
dcgm	`?`
ddev	`87.28% <ø> (-0.59%)`	⬇️
disk	`82.30% <ø> (?)`
dns_check	`90.97% <ø> (-2.36%)`	⬇️
dotnetclr	`94.33% <ø> (+35.19%)`	⬆️
druid	`97.70% <ø> (ø)`
ecs_fargate	`?`
elastic	`93.36% <ø> (ø)`
envoy	`?`
etcd	`95.56% <ø> (?)`
exchange_server	`96.96% <ø> (+48.15%)`	⬆️
external_dns	`?`
fluentd	`94.48% <ø> (?)`
fluxcd	`88.31% <ø> (?)`
foundationdb	`83.83% <ø> (?)`
gearmand	`78.26% <ø> (+1.24%)`	⬆️
gitlab	`91.95% <ø> (?)`
gitlab_runner	`92.10% <ø> (ø)`
glusterfs	`80.09% <ø> (+0.90%)`	⬆️
go_expvar	`92.73% <ø> (?)`
gunicorn	`92.10% <ø> (+5.63%)`	⬆️
haproxy	`95.13% <ø> (+10.53%)`	⬆️
harbor	`89.65% <ø> (ø)`
hdfs_datanode	`89.74% <ø> (?)`
hdfs_namenode	`86.72% <ø> (?)`
http_check	`95.32% <ø> (+2.02%)`	⬆️
hudi	`73.91% <ø> (?)`
ibm_ace	`92.25% <ø> (?)`
ibm_db2	`86.87% <ø> (-8.29%)`	⬇️
ibm_mq	`91.28% <ø> (?)`
ibm_was	`96.08% <ø> (ø)`
iis	`93.98% <ø> (+39.04%)`	⬆️
istio	`78.14% <ø> (?)`
kafka	`64.70% <ø> (?)`
kafka_consumer	`93.07% <ø> (?)`
karpenter	`?`
kong	`87.62% <ø> (ø)`
kube_apiserver_metrics	`?`
kube_controller_manager	`?`
kube_dns	`?`
kube_metrics_server	`?`
kube_proxy	`?`
kube_scheduler	`97.92% <ø> (ø)`
kubernetes_state	`89.50% <ø> (ø)`
kyototycoon	`?`
lighttpd	`83.64% <ø> (?)`
linkerd	`85.22% <ø> (?)`
linux_proc_extras	`96.22% <ø> (ø)`
mapr	`82.42% <ø> (ø)`
mapreduce	`82.12% <ø> (?)`
marathon	`?`
marklogic	`96.09% <ø> (ø)`
mcache	`?`
mesos_master	`89.75% <ø> (?)`
mesos_slave	`93.33% <ø> (ø)`
mongo	`95.60% <ø> (?)`
mysql	`87.84% <ø> (?)`
nagios	`?`
network	`?`
nfsstat	`95.20% <ø> (+2.99%)`	⬆️
nginx	`95.07% <ø> (+0.53%)`	⬆️
nginx_ingress_controller	`?`
nvidia_triton	`?`
openldap	`96.33% <ø> (?)`
openmetrics	`98.08% <ø> (+28.66%)`	⬆️
openstack	`55.19% <ø> (?)`
openstack_controller	`93.96% <ø> (?)`
oracle	`88.33% <ø> (ø)`
pdh_check	`97.36% <ø> (+1.71%)`	⬆️
pgbouncer	`91.35% <ø> (ø)`
php_fpm	`90.53% <ø> (+9.46%)`	⬆️
postfix	`?`
postgres	`92.76% <ø> (-0.03%)`	⬇️
powerdns_recursor	`96.65% <ø> (?)`
process	`?`
prometheus	`94.17% <ø> (ø)`
proxysql	`98.97% <ø> (?)`
rabbitmq	`95.73% <ø> (+20.95%)`	⬆️
ray	`96.45% <ø> (ø)`
redisdb	`88.07% <ø> (+2.98%)`	⬆️
rethinkdb	`97.93% <ø> (?)`
riak	`99.21% <ø> (?)`
riakcs	`93.61% <ø> (ø)`
sap_hana	`?`
scylla	`98.29% <ø> (ø)`
silk	`?`
snmp	`95.76% <ø> (+3.81%)`	⬆️
snowflake	`?`
solr	`56.25% <ø> (?)`
sonarqube	`98.24% <ø> (ø)`
spark	`94.14% <ø> (+0.27%)`	⬆️
sqlserver	`88.82% <ø> (+0.61%)`	⬆️
squid	`100.00% <ø> (?)`
ssh_check	`91.58% <ø> (+6.43%)`	⬆️
statsd	`?`
strimzi	`89.78% <ø> (ø)`
supervisord	`89.78% <ø> (?)`
system_core	`?`
system_swap	`?`
tcp_check	`92.92% <ø> (+1.34%)`	⬆️
teamcity	`88.74% <ø> (+18.81%)`	⬆️
tekton	`82.30% <ø> (ø)`
temporal	`100.00% <ø> (ø)`
tomcat	`60.41% <ø> (?)`
twemproxy	`?`
twistlock	`?`
varnish	`84.39% <ø> (+0.26%)`	⬆️
vault	`95.53% <ø> (+25.76%)`	⬆️
vertica	`98.34% <ø> (+12.56%)`	⬆️
voltdb	`96.85% <ø> (+6.07%)`	⬆️
vsphere	`96.00% <ø> (+5.68%)`	⬆️
weaviate	`76.27% <ø> (ø)`
win32_event_log	`82.56% <ø> (-3.81%)`	⬇️
windows_performance_counters	`100.00% <ø> (+1.63%)`	⬆️
windows_service	`93.86% <ø> (?)`
wmi_check	`97.50% <ø> (?)`
yarn	`89.52% <ø> (?)`
zk	`82.64% <ø> (+1.08%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

github-actions · 2024-05-03T15:51:47Z

Test Results

1 046 files 1 046 suites 6h 38m 45s ⏱️
7 053 tests 6 976 ✅ 75 💤 1 ❌ 1 🔥
32 084 runs 26 588 ✅ 5 494 💤 1 ❌ 1 🔥

For more details on these failures and errors, see this check.

Results for commit 9663a00.

♻️ This comment has been updated with latest results.

iliakur

@Nevon thanks a bunch for your contribution!!

Apologies for the delay in response. Here are my thoughts.

Let's bake the logging functionality into text_fd_to_metric_families.
We'd need to make the following changes.
Move the parsing logic as-is to some private function in the same module as text_fd_to_metric_families, eg _parse_payload.

Then change text_fd_to_metric_families to be something along the following lines:

from itertools import tee
...
def text_fd_to_metric_families(fd):
    raw_lines, input_lines = tee(fd, n=2)
    try:
        for raw_line, metric_family in zip(raw_lines, _parse_payloads(input_lines):
            yield metric_family
    except Exception as e:
        raise ValueError("Failed to parse the metric response {}: {}".format(raw_line, e))

A few notes:

using itertools.tee is less conceputal overhead imo than a custom class
Let's just raise an exception with the relevant info. This is a crash, logging can be taken care of.

Nevon · 2024-05-17T16:37:03Z

Let's bake the logging functionality into text_fd_to_metric_families.

Are you sure? The reason I didn't do that in there, which would have been simpler to do, is because it's a straight up copy from the Python prometheus client, with some minor change to avoid a backwards incompatible change made in the upstream (I didn't look much into it). If I change more things in there it'll become harder to get rid of this technical debt and use the upstream directly.

I'm fine with making those changes, but I just want confirmation that you're aware of this before I do.

iliakur · 2024-05-20T09:14:21Z

@Nevon good point! I still think we can proceed with the "baked in" approach:

It's been 4 years since that code was committed. Not sure we're ever getting rid of that debt.
Note how in the sketch I posted we don't modify the parsing logic but rather wrap it? This is on purpose, I too don't want to mix the book-keeping with the parsing. We can put a nice fat comment next to or inside the _parse function to make it explicit that we don't want to touch this parsing logic.

Nevon requested review from a team as code owners May 3, 2024 14:48

datadog-agent-integrations-bot bot added base_package team/container-integrations team/agent-integrations labels May 3, 2024

Nevon force-pushed the log-openmetrics-parsing-error branch from 7ba302a to f64f690 Compare May 3, 2024 14:48

Nevon force-pushed the log-openmetrics-parsing-error branch 3 times, most recently from 1815a41 to 40ac319 Compare May 6, 2024 08:18

Log warning when failing to parse openmetrics response

9663a00

Nevon force-pushed the log-openmetrics-parsing-error branch from 40ac319 to 9663a00 Compare May 6, 2024 08:41

This was referenced May 16, 2024

Openmetrics check fails to parse metrics with spaces in label values #17582

Closed

prometheus_client/parser.py unhelpful stacktrace #13037

Open

iliakur requested changes May 17, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Log warning when failing to parse openmetrics response #17514

Log warning when failing to parse openmetrics response #17514

Nevon commented May 3, 2024 •

edited

codecov bot commented May 3, 2024 •

edited

github-actions bot commented May 3, 2024 •

edited

iliakur left a comment

Nevon commented May 17, 2024

iliakur commented May 20, 2024 •

edited

Log warning when failing to parse openmetrics response #17514

Are you sure you want to change the base?

Log warning when failing to parse openmetrics response #17514

Conversation

Nevon commented May 3, 2024 • edited

What does this PR do?

Motivation

Additional Notes

Review checklist (to be filled by reviewers)

codecov bot commented May 3, 2024 • edited

Codecov Report

github-actions bot commented May 3, 2024 • edited

Test Results

iliakur left a comment

Choose a reason for hiding this comment

Nevon commented May 17, 2024

iliakur commented May 20, 2024 • edited

Nevon commented May 3, 2024 •

edited

codecov bot commented May 3, 2024 •

edited

github-actions bot commented May 3, 2024 •

edited

iliakur commented May 20, 2024 •

edited