Grafana dashboards do not work on a fresh helm install #7120

tmacam · 2023-10-30T23:47:42Z

In what area(s)?

/area test-and-release

What version of Dapr?

1.12.0

Expected Behavior

Grafana dashboards work out of the box when imported.

Actual Behavior

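The current Grafana dashboards do not work in a fresh cluster where
Prometheus and Grafana are installed using Helm following the Dapr docs
(see [1](https://docs.dapr.io/operations/observability/metrics/prometheus/#setup-prometheus-on-kubernetes), [2](https://docs.dapr.io/operations/observability/metrics/grafana/#setup-on-kubernetes)). They refer to metric labels that are not available in
such an install.

Namely, they expect the following labels to be renamed: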

| Name expected by templates | Existing label name |
|---|---|
| `kubernetes_name` | `service` |
| `kubernetes_namespace` | `namespace` |
| `kubernetes_node` | `node` |
| `kubernetes_pod_name` | `pod` |
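
The mapping above can be applied mechanically to a dashboard's JSON text. The following is a minimal Python sketch of that idea, not the actual fix; the function name `rename_labels` is made up for illustration. It matches whole identifiers only, so `kubernetes_name` is not rewritten inside `kubernetes_namespace`.

```python
import re

# Label names the dashboards expect, mapped to the names that exist in a
# fresh Helm install of Prometheus (taken from the table above).
LABEL_RENAMES = {
    "kubernetes_name": "service",
    "kubernetes_namespace": "namespace",
    "kubernetes_node": "node",
    "kubernetes_pod_name": "pod",
}

# A single alternation wrapped in \b anchors. "_" counts as a word
# character, so only complete label names match.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in LABEL_RENAMES) + r")\b"
)


def rename_labels(dashboard_text: str) -> str:
    """Return dashboard_text with every old label name replaced."""
    return _PATTERN.sub(lambda m: LABEL_RENAMES[m.group(1)], dashboard_text)
```

For example, `rename_labels('by (kubernetes_pod_name, kubernetes_node)')` returns `'by (pod, node)'`. Doing the replacement in one pass avoids any ordering concerns between overlapping names.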

This issue is also mentioned in dapr/test-infra#204

Steps to Reproduce the Problem

Install Dapr on a new cluster, install the test applications from dapr/test-infra, install Prometheus and Grafana following the Dapr documentation, and import grafana-sidecar-dashboard.json. No metrics are available.
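
One way to confirm the mismatch after following these steps is to ask Prometheus which label names it knows about and compare them with the ones the dashboards use. The sketch below is only an illustration: the helper names are hypothetical, and the only external interface it relies on is the Prometheus HTTP API endpoint `GET /api/v1/labels`.

```python
import json
import urllib.request

# Label names used by the dashboard templates, per this issue.
EXPECTED_BY_DASHBOARDS = (
    "kubernetes_name",
    "kubernetes_namespace",
    "kubernetes_node",
    "kubernetes_pod_name",
)


def missing_labels(available, expected=EXPECTED_BY_DASHBOARDS):
    """Return the expected label names that are absent from `available`."""
    known = set(available)
    return [name for name in expected if name not in known]


def fetch_label_names(base_url):
    """List all label names via the Prometheus HTTP API (/api/v1/labels)."""
    url = base_url.rstrip("/") + "/api/v1/labels"
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    return payload["data"]
```

With the Prometheus server reachable at an assumed address such as `http://localhost:9090` (for instance through a port-forward), `missing_labels(fetch_label_names("http://localhost:9090"))` would list the labels the dashboards need but the install does not provide.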

Release Note

RELEASE NOTE: FIX Broken grafana dashboards

The current grafana dashboards do not work in fresh cluster where prometheus and grafana are installed using helm following Dapr Docs (see [1], [2]). They refer to metrics that are not available in such install.

Fixes dapr#7120

[1]: https://docs.dapr.io/operations/observability/metrics/prometheus/#setup-prometheus-on-kubernetes
[2]: https://docs.dapr.io/operations/observability/metrics/grafana/#setup-on-kubernetes

Signed-off-by: Tiago Alves Macambira <tmacam@burocrata.org>

The current grafana dashboards do not work in a fresh cluster where prometheus and grafana are installed using helm following Dapr Docs (see [1], [2]). They refer to metrics that are not available in such install.

In short, based on bug-report from dapr/test-infra#204, the proposed fix can be summed by:

```bash
sed -i \
  -e 's/\bkubernetes_name\b/service/g' \
  -e 's/\bkubernetes_namespace\b/namespace/g' \
  -e 's/\bkubernetes_node\b/node/g' \
  -e 's/\bkubernetes_pod_name\b/pod/g' \
  *.json
```

Additionally:

* Removes refresh rates smaller than 1 minute.
* Sets default interval range to 14 days in the past to now
* Sets default template values to match the longhaul clusters.

Fixes dapr#7120

[1]: https://docs.dapr.io/operations/observability/metrics/prometheus/#setup-prometheus-on-kubernetes
[2]: https://docs.dapr.io/operations/observability/metrics/grafana/#setup-on-kubernetes

Signed-off-by: Tiago Alves Macambira <tmacam@burocrata.org>
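
The first two items under "Additionally" (dropping sub-minute refresh rates and defaulting to a 14-day range) could be sketched in Python as follows. This is an illustration rather than the change made in the commit: it assumes the dashboard JSON keeps its refresh choices in `timepicker.refresh_intervals` and its default range in `time`, and the function names are hypothetical.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def interval_seconds(interval: str) -> int:
    """Convert an interval such as '30s' or '5m' to seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", interval)
    if match is None:
        raise ValueError(f"unsupported interval: {interval!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def apply_defaults(dashboard: dict) -> dict:
    """Return a copy of the dashboard with adjusted time settings."""
    updated = dict(dashboard)

    # Keep only refresh intervals of one minute or longer.
    picker = dict(updated.get("timepicker", {}))
    picker["refresh_intervals"] = [
        interval
        for interval in picker.get("refresh_intervals", [])
        if interval_seconds(interval) >= 60
    ]
    updated["timepicker"] = picker

    # Default range: the last 14 days up to now.
    updated["time"] = {"from": "now-14d", "to": "now"}
    return updated
```

The function copies the dictionaries it changes, so the dashboard loaded from disk stays untouched until the result is written back explicitly.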

tmacam · 2023-10-31T22:13:35Z

Reproducing the bug-report from dapr/test-infra#204

Regarding item 3 (missing Prometheus metrics), there seems to be a major difference between how Prometheus is configured out of the box (be it the Azure managed one or a fresh Helm setup) and how it is currently configured in the release clusters. This distinction is also encoded in the Grafana dashboards we saved in dapr/dapr, which refer to metrics by names that only exist in the release longhaul Prometheus setup.

As an example, I am pasting a diff between the Prometheus configuration one would find in a Helm-installed setup and the one we have in release longhaul:

--- fresh-from-helm-prometheus.yaml	2023-10-01 14:33:45.782910959 -0700
+++ release-prometheus.yaml	2023-10-01 14:33:45.793744284 -0700
@@ -1,4 +1,4 @@
-issue6946-prometheus.yml
+release-prometheus.yml
 global:
   evaluation_interval: 1m
   scrape_interval: 1m
@@ -64,8 +64,7 @@
   tls_config:
     ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     insecure_skip_verify: true
-- honor_labels: true
-  job_name: kubernetes-service-endpoints
+- job_name: kubernetes-service-endpoints
   kubernetes_sd_configs:
   - role: endpoints
   relabel_configs:
@@ -73,10 +72,6 @@
     regex: true
     source_labels:
     - __meta_kubernetes_service_annotation_prometheus_io_scrape
-  - action: drop
-    regex: true
-    source_labels:
-    - __meta_kubernetes_service_annotation_prometheus_io_scrape_slow
   - action: replace
     regex: (https?)
     source_labels:
@@ -88,7 +83,7 @@
     - __meta_kubernetes_service_annotation_prometheus_io_path
     target_label: __metrics_path__
   - action: replace
-    regex: (.+?)(?::\d+)?;(\d+)
+    regex: ([^:]+)(?::\d+)?;(\d+)
     replacement: $1:$2
     source_labels:
     - __address__
@@ -102,17 +97,16 @@
   - action: replace
     source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - action: replace
     source_labels:
     - __meta_kubernetes_service_name
-    target_label: service
+    target_label: kubernetes_name
   - action: replace
     source_labels:
     - __meta_kubernetes_pod_node_name
-    target_label: node
-- honor_labels: true
-  job_name: kubernetes-service-endpoints-slow
+    target_label: kubernetes_node
+- job_name: kubernetes-service-endpoints-slow
   kubernetes_sd_configs:
   - role: endpoints
   relabel_configs:
@@ -131,7 +125,7 @@
     - __meta_kubernetes_service_annotation_prometheus_io_path
     target_label: __metrics_path__
   - action: replace
-    regex: (.+?)(?::\d+)?;(\d+)
+    regex: ([^:]+)(?::\d+)?;(\d+)
     replacement: $1:$2
     source_labels:
     - __address__
@@ -145,15 +139,15 @@
   - action: replace
     source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - action: replace
     source_labels:
     - __meta_kubernetes_service_name
-    target_label: service
+    target_label: kubernetes_name
   - action: replace
     source_labels:
     - __meta_kubernetes_pod_node_name
-    target_label: node
+    target_label: kubernetes_node
   scrape_interval: 5m
   scrape_timeout: 30s
 - honor_labels: true
@@ -165,8 +159,7 @@
     regex: pushgateway
     source_labels:
     - __meta_kubernetes_service_annotation_prometheus_io_probe
-- honor_labels: true
-  job_name: kubernetes-services
+- job_name: kubernetes-services
   kubernetes_sd_configs:
   - role: service
   metrics_path: /probe
@@ -190,12 +183,11 @@
     regex: __meta_kubernetes_service_label_(.+)
   - source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - source_labels:
     - __meta_kubernetes_service_name
-    target_label: service
-- honor_labels: true
-  job_name: kubernetes-pods
+    target_label: kubernetes_name
+- job_name: kubernetes-pods
   kubernetes_sd_configs:
   - role: pod
   relabel_configs:
@@ -203,10 +195,6 @@
     regex: true
     source_labels:
     - __meta_kubernetes_pod_annotation_prometheus_io_scrape
-  - action: drop
-    regex: true
-    source_labels:
-    - __meta_kubernetes_pod_annotation_prometheus_io_scrape_slow
   - action: replace
     regex: (https?)
     source_labels:
@@ -218,18 +206,11 @@
     - __meta_kubernetes_pod_annotation_prometheus_io_path
     target_label: __metrics_path__
   - action: replace
-    regex: (\d+);(([A-Fa-f0-9]{1,4}::?){1,7}[A-Fa-f0-9]{1,4})
-    replacement: '[$2]:$1'
-    source_labels:
-    - __meta_kubernetes_pod_annotation_prometheus_io_port
-    - __meta_kubernetes_pod_ip
-    target_label: __address__
-  - action: replace
-    regex: (\d+);((([0-9]+?)(\.|$)){4})
-    replacement: $2:$1
+    regex: ([^:]+)(?::\d+)?;(\d+)
+    replacement: $1:$2
     source_labels:
+    - __address__
     - __meta_kubernetes_pod_annotation_prometheus_io_port
-    - __meta_kubernetes_pod_ip
     target_label: __address__
   - action: labelmap
     regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+)
@@ -239,21 +220,16 @@
   - action: replace
     source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - action: replace
     source_labels:
     - __meta_kubernetes_pod_name
-    target_label: pod
+    target_label: kubernetes_pod_name
   - action: drop
     regex: Pending|Succeeded|Failed|Completed
     source_labels:
     - __meta_kubernetes_pod_phase
-  - action: replace
-    source_labels:
-    - __meta_kubernetes_pod_node_name
-    target_label: node
-- honor_labels: true
-  job_name: kubernetes-pods-slow
+- job_name: kubernetes-pods-slow
   kubernetes_sd_configs:
   - role: pod
   relabel_configs:
@@ -272,18 +248,11 @@
     - __meta_kubernetes_pod_annotation_prometheus_io_path
     target_label: __metrics_path__
   - action: replace
-    regex: (\d+);(([A-Fa-f0-9]{1,4}::?){1,7}[A-Fa-f0-9]{1,4})
-    replacement: '[$2]:$1'
-    source_labels:
-    - __meta_kubernetes_pod_annotation_prometheus_io_port
-    - __meta_kubernetes_pod_ip
-    target_label: __address__
-  - action: replace
-    regex: (\d+);((([0-9]+?)(\.|$)){4})
-    replacement: $2:$1
+    regex: ([^:]+)(?::\d+)?;(\d+)
+    replacement: $1:$2
     source_labels:
+    - __address__
     - __meta_kubernetes_pod_annotation_prometheus_io_port
-    - __meta_kubernetes_pod_ip
     target_label: __address__
   - action: labelmap
     regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+)
@@ -293,19 +262,15 @@
   - action: replace
     source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - action: replace
     source_labels:
     - __meta_kubernetes_pod_name
-    target_label: pod
+    target_label: kubernetes_pod_name
   - action: drop
     regex: Pending|Succeeded|Failed|Completed
     source_labels:
     - __meta_kubernetes_pod_phase
-  - action: replace
-    source_labels:
-    - __meta_kubernetes_pod_node_name
-    target_label: node
   scrape_interval: 5m
   scrape_timeout: 30s
 alerting:
@@ -319,12 +284,15 @@
     - source_labels: [__meta_kubernetes_namespace]
       regex: dapr-monitoring
       action: keep
-    - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
-      regex: dapr-prom
+    - source_labels: [__meta_kubernetes_pod_label_app]
+      regex: prometheus
       action: keep
-    - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
+    - source_labels: [__meta_kubernetes_pod_label_component]
       regex: alertmanager
       action: keep
+    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
+      regex: .*
+      action: keep
     - source_labels: [__meta_kubernetes_pod_container_port_number]
       regex: "9093"
       action: keep
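
The label renames are scattered among unrelated changes in the diff above. As a rough aid for reading it, the Python sketch below pairs each removed `target_label` with the one added in the same change block. It is a heuristic written for this illustration (the function name is hypothetical), and it skips blocks where the numbers of removed and added labels differ.

```python
import re

# Matches changed lines such as "-    target_label: namespace".
_TARGET = re.compile(r"^([-+])\s+target_label:\s*(\S+)\s*$")


def label_renames(diff_text: str) -> dict:
    """Map each removed target_label to the one added in the same block."""
    renames = {}
    removed, added = [], []

    def flush():
        # Pair labels only when a block removes and adds the same number.
        if len(removed) == len(added):
            renames.update(zip(removed, added))
        removed.clear()
        added.clear()

    for line in diff_text.splitlines():
        if line.startswith(("---", "+++")):
            continue  # file headers, not changes
        match = _TARGET.match(line)
        if match:
            (removed if match.group(1) == "-" else added).append(match.group(2))
        elif not line.startswith(("-", "+")):
            flush()  # a context or hunk-header line ends the block
    flush()
    return renames
```

Since the diff goes from the fresh Helm configuration to the release one, the keys of the result are the names a fresh install produces and the values are the names the saved dashboards expect.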

* Fix Grafana dashboards.

The current grafana dashboards do not work in a fresh cluster where prometheus and grafana are installed using helm following Dapr Docs (see [1], [2]). They refer to metrics that are not available in such install.

In short, based on bug-report from dapr/test-infra#204, the proposed fix can be summed by:

```bash
sed -i \
  -e 's/\bkubernetes_name\b/service/g' \
  -e 's/\bkubernetes_namespace\b/namespace/g' \
  -e 's/\bkubernetes_node\b/node/g' \
  -e 's/\bkubernetes_pod_name\b/pod/g' \
  *.json
```

Additionally:

* Removes refresh rates smaller than 1 minute.
* Sets default interval range to 14 days in the past to now
* Sets default template values to match the longhaul clusters.

Fixes #7120

[1]: https://docs.dapr.io/operations/observability/metrics/prometheus/#setup-prometheus-on-kubernetes
[2]: https://docs.dapr.io/operations/observability/metrics/grafana/#setup-on-kubernetes

Signed-off-by: Tiago Alves Macambira <tmacam@burocrata.org>

* Remove longhaul related settings.

Signed-off-by: Tiago Alves Macambira <tmacam@burocrata.org>

---------

Signed-off-by: Tiago Alves Macambira <tmacam@burocrata.org>

tmacam added the kind/bug Something isn't working label Oct 30, 2023

tmacam mentioned this issue Oct 30, 2023

Fix Grafana dashboards. #7121

Merged

7 tasks

mukundansundar added this to the v1.13 milestone Nov 4, 2023

mukundansundar assigned tmacam Nov 4, 2023

mukundansundar closed this as completed in #7121 Nov 4, 2023
