Reflect agent scrape problems in pipeline status #976

Open
9 tasks
a-thaler opened this issue Apr 16, 2024 · 0 comments
Labels
area/metrics MetricPipeline kind/feature Categorizes issue or PR as related to a new feature.

a-thaler (Collaborator) commented Apr 16, 2024

Description
Following up on #425, user problems in the metric agent are currently not reflected in the pipeline status. The typical user problems happening in the agent are:

  • A scrape target for the prometheus or istio input is down (not reachable, for example because of a NetworkPolicy)
  • A scrape target for the prometheus or istio input returns too much data, exceeding the configured sample limit

Goals:

  • Reflect these problems as warnings in the pipeline status
  • Make the diagnostic metrics indicating scrape problems accessible for operations

Criteria

  • The MetricPipeline status reflects the two problems mentioned above, either in the existing dataFlow condition or in a new agent-specific ScrapeHealthiness condition (see the status sketch after this list)
  • There is a troubleshooting section describing the typical reasons for these problems
  • The condition reason lists the top 5 scrape problems, not only a single one
  • The status reflects the situation with some delay, but dynamically and without requiring any restart
  • All diagnostic metrics of a scrape loop are accessible via a port of the agent for troubleshooting by operations, but are not exposed to the user
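A rough sketch of how such a condition could appear in the MetricPipeline status, assuming the standard Kubernetes condition shape; the reason, message wording, and counts are illustrative assumptions, not the final API:

```yaml
# Hypothetical excerpt of a MetricPipeline status; the condition type follows
# the ScrapeHealthiness proposal above, reason/message are placeholders.
status:
  conditions:
    - type: ScrapeHealthiness
      status: "False"
      reason: ScrapeTargetsUnhealthy
      message: >-
        Some scrape targets are unhealthy: 3 targets are down,
        2 targets exceed the configured sample limit.
      lastTransitionTime: "2024-04-16T00:00:00Z"
```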

Implementation Ideas
The prometheusreceiver in use already provides diagnostic metrics which can be enabled by the user. However, they are not available for operations and are also not accessible by the self-monitor. So we could introduce a new otel-collector pipeline in the metric agent (enabled only if there is a prometheusreceiver) which has all prometheusreceivers as input, filters for the relevant metrics only (maybe even only the unhealthy ones, to save time series), and exposes them on a new dedicated port using the prometheusexporter. Then configure the self-monitor to scrape the new endpoint. For troubleshooting, the self-monitor dashboard can be used to inspect the selected metrics, or the new port can be accessed directly to inspect all scrape-related metrics. A configuration sketch follows below.
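A minimal sketch of what the additional agent pipeline could look like, using the stock filter processor and prometheusexporter of the OpenTelemetry Collector; the receiver names, pipeline name, port, and metric selection are placeholder assumptions:

```yaml
# Hypothetical additional pipeline in the metric agent config; component
# names follow the upstream collector, names and ports are placeholders.
processors:
  filter/scrape-diagnostics:
    metrics:
      include:
        match_type: strict
        metric_names:
          - up
          - scrape_samples_scraped
          - scrape_samples_post_metric_relabeling
          - scrape_series_added

exporters:
  prometheus/scrape-diagnostics:
    endpoint: "0.0.0.0:8889"   # placeholder port, scraped by the self-monitor

service:
  pipelines:
    metrics/scrape-diagnostics:
      # reuse the existing prometheus receivers as input (names are placeholders)
      receivers: [prometheus/app-pods, prometheus/istio]
      processors: [filter/scrape-diagnostics]
      exporters: [prometheus/scrape-diagnostics]
```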

Potential metrics for realizing the goal, which the self-monitor could evaluate in alert rules (see the sketch after this list), are:

  • scrape_samples_scraped: The number of samples the target exposed
  • scrape_samples_post_metric_relabeling: The number of samples remaining after metric relabeling was applied
  • scrape_series_added: The approximate number of new series in this scrape
  • up: Whether the scrape of the target was successful (1) or failed (0)
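A rough sketch of how the self-monitor could alert on these metrics, assuming standard Prometheus alerting rules; the rule names, thresholds, durations, and the sample limit value are illustrative assumptions:

```yaml
# Hypothetical alerting rules for the self-monitor; names and thresholds
# are placeholders, not the final rule set.
groups:
  - name: metric-agent-scrape
    rules:
      - alert: MetricAgentScrapeTargetDown
        expr: up == 0
        for: 5m
        annotations:
          summary: "A scrape target of the metric agent is down"
      - alert: MetricAgentSampleLimitExceeded
        # assumes a sample limit of 1500 configured on the scrape job
        expr: scrape_samples_scraped > 1500
        for: 5m
        annotations:
          summary: "A scrape target exposes more samples than the configured limit"
```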

Items

  • Preparation
    • Understand which metrics need to be collected and what the alert rules must look like to have the status available for the described situations
    • Have a PoC in place proving the idea E2E
  • Implementation

Reasons

Attachments

Release Notes

