Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log warning when failing to parse openmetrics response #17514

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Nevon
Copy link
Contributor

@Nevon Nevon commented May 3, 2024

What does this PR do?

Log a warning when the plain text prometheus metric parsing fails, including the line that it failed to parse. For example:

2024-05-15 14:58:11 UTC | CORE | WARN | (pkg/collector/python/datadog_agent.go:131 in LogMessage) | openmetrics:c2c_prometheus:21fa576d88f0d1f8 | (mixins.py:490) | Failed to parse metric response 'histogram_bucket{le="0.01"💣} 0': Invalid metric name: histogram_bucket{le="0.01"💣}

Motivation

This issue

Currently the text_fd_to_metric_families just raises generic errors that don't provide enough context to know what it was that couldn't be parsed, which is problematic when deployed into environments where a single agent is parsing metrics from many different sources via autodiscovery.

Additional Notes

I'm very much not a Python developer, so let me know if what I've done is very unidiomatic. I considered wrapping the exception in my own ParserException, but from reading the other checks that didn't seem like a common thing to do. I also wasn't able to actually run the integration tests locally, because they seem to only work with Docker and not docker-compatible clients like nerdctl, but I'm assuming the CI pipeline will.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Changelog entries must be created for modifications to shipped code
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

Copy link

codecov bot commented May 3, 2024

Codecov Report

Attention: Patch coverage is 70.00000% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 89.95%. Comparing base (2a6ec26) to head (9663a00).
Report is 87 commits behind head on master.

Additional details and impacted files
Flag Coverage Δ
active_directory 100.00% <ø> (+17.64%) ⬆️
activemq_xml ?
aerospike 87.17% <ø> (+0.32%) ⬆️
airflow 92.20% <ø> (ø)
amazon_msk 88.91% <ø> (+37.86%) ⬆️
ambari ?
apache 95.08% <ø> (?)
arangodb ?
aspdotnet 100.00% <ø> (?)
avi_vantage ?
azure_iot_edge ?
cacti 87.90% <ø> (?)
cassandra_nodetool 93.16% <ø> (?)
ceph 91.07% <ø> (ø)
cert_manager 77.41% <ø> (?)
cilium 77.72% <ø> (?)
cisco_aci ?
citrix_hypervisor ?
clickhouse 94.18% <ø> (-1.47%) ⬇️
cloudera ?
cockroachdb 93.19% <ø> (ø)
confluent_platform 81.96% <ø> (?)
consul 91.82% <ø> (?)
coredns 94.61% <ø> (?)
couch 94.50% <ø> (+10.28%) ⬆️
couchbase 84.89% <ø> (ø)
crio ?
datadog_checks_base 89.63% <70.00%> (+1.34%) ⬆️
datadog_checks_dev 74.79% <ø> (?)
datadog_checks_downloader 80.78% <ø> (-0.62%) ⬇️
datadog_cluster_agent 90.19% <ø> (ø)
dcgm ?
ddev 87.28% <ø> (-0.59%) ⬇️
disk 82.30% <ø> (?)
dns_check 90.97% <ø> (-2.36%) ⬇️
dotnetclr 94.33% <ø> (+35.19%) ⬆️
druid 97.70% <ø> (ø)
ecs_fargate ?
elastic 93.36% <ø> (ø)
envoy ?
etcd 95.56% <ø> (?)
exchange_server 96.96% <ø> (+48.15%) ⬆️
external_dns ?
fluentd 94.48% <ø> (?)
fluxcd 88.31% <ø> (?)
foundationdb 83.83% <ø> (?)
gearmand 78.26% <ø> (+1.24%) ⬆️
gitlab 91.95% <ø> (?)
gitlab_runner 92.10% <ø> (ø)
glusterfs 80.09% <ø> (+0.90%) ⬆️
go_expvar 92.73% <ø> (?)
gunicorn 92.10% <ø> (+5.63%) ⬆️
haproxy 95.13% <ø> (+10.53%) ⬆️
harbor 89.65% <ø> (ø)
hdfs_datanode 89.74% <ø> (?)
hdfs_namenode 86.72% <ø> (?)
http_check 95.32% <ø> (+2.02%) ⬆️
hudi 73.91% <ø> (?)
ibm_ace 92.25% <ø> (?)
ibm_db2 86.87% <ø> (-8.29%) ⬇️
ibm_mq 91.28% <ø> (?)
ibm_was 96.08% <ø> (ø)
iis 93.98% <ø> (+39.04%) ⬆️
istio 78.14% <ø> (?)
kafka 64.70% <ø> (?)
kafka_consumer 93.07% <ø> (?)
karpenter ?
kong 87.62% <ø> (ø)
kube_apiserver_metrics ?
kube_controller_manager ?
kube_dns ?
kube_metrics_server ?
kube_proxy ?
kube_scheduler 97.92% <ø> (ø)
kubernetes_state 89.50% <ø> (ø)
kyototycoon ?
lighttpd 83.64% <ø> (?)
linkerd 85.22% <ø> (?)
linux_proc_extras 96.22% <ø> (ø)
mapr 82.42% <ø> (ø)
mapreduce 82.12% <ø> (?)
marathon ?
marklogic 96.09% <ø> (ø)
mcache ?
mesos_master 89.75% <ø> (?)
mesos_slave 93.33% <ø> (ø)
mongo 95.60% <ø> (?)
mysql 87.84% <ø> (?)
nagios ?
network ?
nfsstat 95.20% <ø> (+2.99%) ⬆️
nginx 95.07% <ø> (+0.53%) ⬆️
nginx_ingress_controller ?
nvidia_triton ?
openldap 96.33% <ø> (?)
openmetrics 98.08% <ø> (+28.66%) ⬆️
openstack 55.19% <ø> (?)
openstack_controller 93.96% <ø> (?)
oracle 88.33% <ø> (ø)
pdh_check 97.36% <ø> (+1.71%) ⬆️
pgbouncer 91.35% <ø> (ø)
php_fpm 90.53% <ø> (+9.46%) ⬆️
postfix ?
postgres 92.76% <ø> (-0.03%) ⬇️
powerdns_recursor 96.65% <ø> (?)
process ?
prometheus 94.17% <ø> (ø)
proxysql 98.97% <ø> (?)
rabbitmq 95.73% <ø> (+20.95%) ⬆️
ray 96.45% <ø> (ø)
redisdb 88.07% <ø> (+2.98%) ⬆️
rethinkdb 97.93% <ø> (?)
riak 99.21% <ø> (?)
riakcs 93.61% <ø> (ø)
sap_hana ?
scylla 98.29% <ø> (ø)
silk ?
snmp 95.76% <ø> (+3.81%) ⬆️
snowflake ?
solr 56.25% <ø> (?)
sonarqube 98.24% <ø> (ø)
spark 94.14% <ø> (+0.27%) ⬆️
sqlserver 88.82% <ø> (+0.61%) ⬆️
squid 100.00% <ø> (?)
ssh_check 91.58% <ø> (+6.43%) ⬆️
statsd ?
strimzi 89.78% <ø> (ø)
supervisord 89.78% <ø> (?)
system_core ?
system_swap ?
tcp_check 92.92% <ø> (+1.34%) ⬆️
teamcity 88.74% <ø> (+18.81%) ⬆️
tekton 82.30% <ø> (ø)
temporal 100.00% <ø> (ø)
tomcat 60.41% <ø> (?)
twemproxy ?
twistlock ?
varnish 84.39% <ø> (+0.26%) ⬆️
vault 95.53% <ø> (+25.76%) ⬆️
vertica 98.34% <ø> (+12.56%) ⬆️
voltdb 96.85% <ø> (+6.07%) ⬆️
vsphere 96.00% <ø> (+5.68%) ⬆️
weaviate 76.27% <ø> (ø)
win32_event_log 82.56% <ø> (-3.81%) ⬇️
windows_performance_counters 100.00% <ø> (+1.63%) ⬆️
windows_service 93.86% <ø> (?)
wmi_check 97.50% <ø> (?)
yarn 89.52% <ø> (?)
zk 82.64% <ø> (+1.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Copy link

github-actions bot commented May 3, 2024

Test Results

 1 046 files   1 046 suites   6h 38m 45s ⏱️
 7 053 tests  6 976 ✅    75 💤 1 ❌ 1 🔥
32 084 runs  26 588 ✅ 5 494 💤 1 ❌ 1 🔥

For more details on these failures and errors, see this check.

Results for commit 9663a00.

♻️ This comment has been updated with latest results.

@Nevon Nevon force-pushed the log-openmetrics-parsing-error branch 3 times, most recently from 1815a41 to 40ac319 Compare May 6, 2024 08:18
Copy link
Contributor

@iliakur iliakur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nevon thanks a bunch for your contribution!!

Apologies for the delay in response. Here are my thoughts.

Let's bake the logging functionality into text_fd_to_metric_families.
We'd need to make the following changes.
Move the parsing logic as-is to some private function in the same module as text_fd_to_metric_families, eg _parse_payload.

Then change text_fd_to_metric_families to be something along the following lines:

from itertools import tee
...
def text_fd_to_metric_families(fd):
    raw_lines, input_lines = tee(fd, n=2)
    try:
        for raw_line, metric_family in zip(raw_lines, _parse_payloads(input_lines):
            yield metric_family
    except Exception as e:
        raise ValueError("Failed to parse the metric response {}: {}".format(raw_line, e))

A few notes:

  • using itertools.tee is less conceputal overhead imo than a custom class
  • Let's just raise an exception with the relevant info. This is a crash, logging can be taken care of.

@Nevon
Copy link
Contributor Author

Nevon commented May 17, 2024

Let's bake the logging functionality into text_fd_to_metric_families.

Are you sure? The reason I didn't do that in there, which would have been simpler to do, is because it's a straight up copy from the Python prometheus client, with some minor change to avoid a backwards incompatible change made in the upstream (I didn't look much into it). If I change more things in there it'll become harder to get rid of this technical debt and use the upstream directly.

I'm fine with making those changes, but I just want confirmation that you're aware of this before I do.

@iliakur
Copy link
Contributor

iliakur commented May 20, 2024

@Nevon good point! I still think we can proceed with the "baked in" approach:

  1. It's been 4 years since that code was committed. Not sure we're ever getting rid of that debt.
  2. Note how in the sketch I posted we don't modify the parsing logic but rather wrap it? This is on purpose, I too don't want to mix the book-keeping with the parsing. We can put a nice fat comment next to or inside the _parse function to make it explicit that we don't want to touch this parsing logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants