Grafana dashboards do not work on a fresh helm install #7120

Closed
tmacam opened this issue Oct 30, 2023 · 1 comment · Fixed by #7121
Assignees
Labels
kind/bug Something isn't working
Milestone

Comments

@tmacam
Contributor

tmacam commented Oct 30, 2023

In what area(s)?

/area test-and-release

What version of Dapr?

1.12.0

Expected Behavior

Grafana dashboards work out of the box when imported.

Actual Behavior

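The current Grafana dashboards do not work in a fresh cluster where
Prometheus and Grafana are installed using Helm following the Dapr docs
(see [1](https://docs.dapr.io/operations/observability/metrics/prometheus/#setup-prometheus-on-kubernetes), [2](https://docs.dapr.io/operations/observability/metrics/grafana/#setup-on-kubernetes)). They refer to metrics that are not available in
such an install.

Namely, they expect the following metric labels to be renamed: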

| Name expected on templates | Existing metric name |
| --- | --- |
| kubernetes_name | service |
| kubernetes_namespace | namespace |
| kubernetes_node | node |
| kubernetes_pod_name | pod |
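
For illustration, the mapping above can be applied to a query string as follows. This is only a sketch: `rename_labels` and `my_metric` are hypothetical names, and it assumes the legacy names only ever appear as whole words in the dashboard queries.

```python
import re

# Legacy label name (expected by the dashboard templates) mapped to the
# label name that exists in a Helm-installed Prometheus.
LABEL_RENAMES = {
    "kubernetes_name": "service",
    "kubernetes_namespace": "namespace",
    "kubernetes_node": "node",
    "kubernetes_pod_name": "pod",
}

# Match whole words only, so "kubernetes_name" does not match inside
# "kubernetes_namespace" or "__meta_kubernetes_namespace".
_LEGACY_LABEL = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in LABEL_RENAMES) + r")\b"
)


def rename_labels(text: str) -> str:
    """Replace every legacy label name in `text` with its current name."""
    return _LEGACY_LABEL.sub(lambda match: LABEL_RENAMES[match.group(1)], text)


# "my_metric" is a made-up metric name used only for this example.
query = 'sum(my_metric{kubernetes_namespace="$ns"}) by (kubernetes_name)'
print(rename_labels(query))
```

The word-boundary anchors matter here because one legacy name is a prefix of another.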

This issue is also mentioned in dapr/test-infra#204

Steps to Reproduce the Problem

Install Dapr on a new cluster, install the test applications from dapr/test-infra, install Prometheus and Grafana following the Dapr documentation, and import grafana-sidecar-dashboard.json. No metrics are available.
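
A quick way to check whether a given dashboard file is affected is to count how often it mentions the legacy label names. The sketch below makes no assumption about the dashboard JSON layout and simply scans every string value; `count_legacy_labels` and `iter_strings` are hypothetical helper names.

```python
import json
import re
from collections import Counter

LEGACY_LABELS = (
    "kubernetes_name",
    "kubernetes_namespace",
    "kubernetes_node",
    "kubernetes_pod_name",
)
_LEGACY_LABEL = re.compile(r"\b(" + "|".join(LEGACY_LABELS) + r")\b")


def iter_strings(node):
    """Yield every string value found anywhere in a decoded JSON document."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def count_legacy_labels(dashboard_json: str) -> Counter:
    """Count whole-word occurrences of each legacy label in a dashboard."""
    counts = Counter()
    for text in iter_strings(json.loads(dashboard_json)):
        counts.update(_LEGACY_LABEL.findall(text))
    return counts
```

Passing the contents of grafana-sidecar-dashboard.json to `count_legacy_labels` should return non-zero counts for a dashboard that still uses the old names, and an empty counter once it has been fixed.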

Release Note

RELEASE NOTE: FIX Broken grafana dashboards

@tmacam tmacam added the kind/bug Something isn't working label Oct 30, 2023
tmacam added a commit to tmacam/dapr that referenced this issue Oct 30, 2023
The current grafana dashboards do not work in fresh cluster where
prometheus and grafana are installed using helm following Dapr Docs
(see [1], [2]). They refer to metrics that are not available in
such install.

Fixes dapr#7120

[1]: https://docs.dapr.io/operations/observability/metrics/prometheus/#setup-prometheus-on-kubernetes
[2]: https://docs.dapr.io/operations/observability/metrics/grafana/#setup-on-kubernetes

Signed-off-by: Tiago Alves Macambira <tmacam@burocrata.org>
@tmacam tmacam mentioned this issue Oct 30, 2023
7 tasks
tmacam added a commit to tmacam/dapr that referenced this issue Oct 31, 2023
The current grafana dashboards do not work in a fresh cluster where
prometheus and grafana are installed using helm following Dapr Docs
(see [1], [2]). They refer to metrics that are not available in
such install.

In short, based on bug-report from dapr/test-infra#204, the proposed
fix can be summed by:

```bash
sed -i \
    -e 's/\bkubernetes_name\b/service/g' \
    -e 's/\bkubernetes_namespace\b/namespace/g' \
    -e 's/\bkubernetes_node\b/node/g' \
    -e 's/\bkubernetes_pod_name\b/pod/g' \
    *.json
```

Additionally:

* Removes refresh rates smaller than 1 minute.
* Sets default interval range to 14 days in the past to now
* Sets default template values to match the longhaul clusters.

Fixes dapr#7120

[1]: https://docs.dapr.io/operations/observability/metrics/prometheus/#setup-prometheus-on-kubernetes
[2]: https://docs.dapr.io/operations/observability/metrics/grafana/#setup-on-kubernetes

Signed-off-by: Tiago Alves Macambira <tmacam@burocrata.org>
@tmacam
Contributor Author

tmacam commented Oct 31, 2023

Reproducing the bug-report from dapr/test-infra#204


Regarding item 3 (missing Prometheus metrics), there seems to be a major difference between how Prometheus is configured out of the box (be it the Azure managed one or a fresh Helm setup) and how it is currently configured in the release clusters. This distinction is also encoded in the Grafana dashboards we saved in dapr/dapr, which refer to metrics by names that only exist in the release longhaul Prometheus setup.

As an example, I am pasting a diff between the Prometheus configuration one would find in a Helm install and what we have in release longhaul:

--- fresh-from-helm-prometheus.yaml	2023-10-01 14:33:45.782910959 -0700
+++ release-prometheus.yaml	2023-10-01 14:33:45.793744284 -0700
@@ -1,4 +1,4 @@
-issue6946-prometheus.yml
+release-prometheus.yml
 global:
   evaluation_interval: 1m
   scrape_interval: 1m
@@ -64,8 +64,7 @@
   tls_config:
     ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
     insecure_skip_verify: true
-- honor_labels: true
-  job_name: kubernetes-service-endpoints
+- job_name: kubernetes-service-endpoints
   kubernetes_sd_configs:
   - role: endpoints
   relabel_configs:
@@ -73,10 +72,6 @@
     regex: true
     source_labels:
     - __meta_kubernetes_service_annotation_prometheus_io_scrape
-  - action: drop
-    regex: true
-    source_labels:
-    - __meta_kubernetes_service_annotation_prometheus_io_scrape_slow
   - action: replace
     regex: (https?)
     source_labels:
@@ -88,7 +83,7 @@
     - __meta_kubernetes_service_annotation_prometheus_io_path
     target_label: __metrics_path__
   - action: replace
-    regex: (.+?)(?::\d+)?;(\d+)
+    regex: ([^:]+)(?::\d+)?;(\d+)
     replacement: $1:$2
     source_labels:
     - __address__
@@ -102,17 +97,16 @@
   - action: replace
     source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - action: replace
     source_labels:
     - __meta_kubernetes_service_name
-    target_label: service
+    target_label: kubernetes_name
   - action: replace
     source_labels:
     - __meta_kubernetes_pod_node_name
-    target_label: node
-- honor_labels: true
-  job_name: kubernetes-service-endpoints-slow
+    target_label: kubernetes_node
+- job_name: kubernetes-service-endpoints-slow
   kubernetes_sd_configs:
   - role: endpoints
   relabel_configs:
@@ -131,7 +125,7 @@
     - __meta_kubernetes_service_annotation_prometheus_io_path
     target_label: __metrics_path__
   - action: replace
-    regex: (.+?)(?::\d+)?;(\d+)
+    regex: ([^:]+)(?::\d+)?;(\d+)
     replacement: $1:$2
     source_labels:
     - __address__
@@ -145,15 +139,15 @@
   - action: replace
     source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - action: replace
     source_labels:
     - __meta_kubernetes_service_name
-    target_label: service
+    target_label: kubernetes_name
   - action: replace
     source_labels:
     - __meta_kubernetes_pod_node_name
-    target_label: node
+    target_label: kubernetes_node
   scrape_interval: 5m
   scrape_timeout: 30s
 - honor_labels: true
@@ -165,8 +159,7 @@
     regex: pushgateway
     source_labels:
     - __meta_kubernetes_service_annotation_prometheus_io_probe
-- honor_labels: true
-  job_name: kubernetes-services
+- job_name: kubernetes-services
   kubernetes_sd_configs:
   - role: service
   metrics_path: /probe
@@ -190,12 +183,11 @@
     regex: __meta_kubernetes_service_label_(.+)
   - source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - source_labels:
     - __meta_kubernetes_service_name
-    target_label: service
-- honor_labels: true
-  job_name: kubernetes-pods
+    target_label: kubernetes_name
+- job_name: kubernetes-pods
   kubernetes_sd_configs:
   - role: pod
   relabel_configs:
@@ -203,10 +195,6 @@
     regex: true
     source_labels:
     - __meta_kubernetes_pod_annotation_prometheus_io_scrape
-  - action: drop
-    regex: true
-    source_labels:
-    - __meta_kubernetes_pod_annotation_prometheus_io_scrape_slow
   - action: replace
     regex: (https?)
     source_labels:
@@ -218,18 +206,11 @@
     - __meta_kubernetes_pod_annotation_prometheus_io_path
     target_label: __metrics_path__
   - action: replace
-    regex: (\d+);(([A-Fa-f0-9]{1,4}::?){1,7}[A-Fa-f0-9]{1,4})
-    replacement: '[$2]:$1'
-    source_labels:
-    - __meta_kubernetes_pod_annotation_prometheus_io_port
-    - __meta_kubernetes_pod_ip
-    target_label: __address__
-  - action: replace
-    regex: (\d+);((([0-9]+?)(\.|$)){4})
-    replacement: $2:$1
+    regex: ([^:]+)(?::\d+)?;(\d+)
+    replacement: $1:$2
     source_labels:
+    - __address__
     - __meta_kubernetes_pod_annotation_prometheus_io_port
-    - __meta_kubernetes_pod_ip
     target_label: __address__
   - action: labelmap
     regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+)
@@ -239,21 +220,16 @@
   - action: replace
     source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - action: replace
     source_labels:
     - __meta_kubernetes_pod_name
-    target_label: pod
+    target_label: kubernetes_pod_name
   - action: drop
     regex: Pending|Succeeded|Failed|Completed
     source_labels:
     - __meta_kubernetes_pod_phase
-  - action: replace
-    source_labels:
-    - __meta_kubernetes_pod_node_name
-    target_label: node
-- honor_labels: true
-  job_name: kubernetes-pods-slow
+- job_name: kubernetes-pods-slow
   kubernetes_sd_configs:
   - role: pod
   relabel_configs:
@@ -272,18 +248,11 @@
     - __meta_kubernetes_pod_annotation_prometheus_io_path
     target_label: __metrics_path__
   - action: replace
-    regex: (\d+);(([A-Fa-f0-9]{1,4}::?){1,7}[A-Fa-f0-9]{1,4})
-    replacement: '[$2]:$1'
-    source_labels:
-    - __meta_kubernetes_pod_annotation_prometheus_io_port
-    - __meta_kubernetes_pod_ip
-    target_label: __address__
-  - action: replace
-    regex: (\d+);((([0-9]+?)(\.|$)){4})
-    replacement: $2:$1
+    regex: ([^:]+)(?::\d+)?;(\d+)
+    replacement: $1:$2
     source_labels:
+    - __address__
     - __meta_kubernetes_pod_annotation_prometheus_io_port
-    - __meta_kubernetes_pod_ip
     target_label: __address__
   - action: labelmap
     regex: __meta_kubernetes_pod_annotation_prometheus_io_param_(.+)
@@ -293,19 +262,15 @@
   - action: replace
     source_labels:
     - __meta_kubernetes_namespace
-    target_label: namespace
+    target_label: kubernetes_namespace
   - action: replace
     source_labels:
     - __meta_kubernetes_pod_name
-    target_label: pod
+    target_label: kubernetes_pod_name
   - action: drop
     regex: Pending|Succeeded|Failed|Completed
     source_labels:
     - __meta_kubernetes_pod_phase
-  - action: replace
-    source_labels:
-    - __meta_kubernetes_pod_node_name
-    target_label: node
   scrape_interval: 5m
   scrape_timeout: 30s
 alerting:
@@ -319,12 +284,15 @@
     - source_labels: [__meta_kubernetes_namespace]
       regex: dapr-monitoring
       action: keep
-    - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_instance]
-      regex: dapr-prom
+    - source_labels: [__meta_kubernetes_pod_label_app]
+      regex: prometheus
       action: keep
-    - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
+    - source_labels: [__meta_kubernetes_pod_label_component]
       regex: alertmanager
       action: keep
+    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
+      regex: .*
+      action: keep
     - source_labels: [__meta_kubernetes_pod_container_port_number]
       regex: "9093"
       action: keep
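
To make the effect of the renamed `target_label` entries concrete, here is a simplified sketch of what the plain `replace` rules in the `kubernetes-service-endpoints` job do to a scraped target. It only models rules that have `source_labels` and `target_label` and nothing else (so it ignores `regex`, `replacement`, and every other action), and the helper name and sample values are made up for illustration.

```python
# (source label, target label) pairs taken from the diff above.
HELM_RULES = [
    ("__meta_kubernetes_namespace", "namespace"),
    ("__meta_kubernetes_service_name", "service"),
    ("__meta_kubernetes_pod_node_name", "node"),
]
RELEASE_RULES = [
    ("__meta_kubernetes_namespace", "kubernetes_namespace"),
    ("__meta_kubernetes_service_name", "kubernetes_name"),
    ("__meta_kubernetes_pod_node_name", "kubernetes_node"),
]


def apply_replace_rules(meta_labels, rules):
    """Copy each source label value to its target label (simplified)."""
    labels = {}
    for source_label, target_label in rules:
        value = meta_labels.get(source_label, "")
        if value:  # an empty value would not produce a label
            labels[target_label] = value
    return labels


# Hypothetical discovered target.
meta = {
    "__meta_kubernetes_namespace": "default",
    "__meta_kubernetes_service_name": "my-app-dapr",
    "__meta_kubernetes_pod_node_name": "node-1",
}
print(apply_replace_rules(meta, HELM_RULES))
print(apply_replace_rules(meta, RELEASE_RULES))
```

The same target ends up with the same values under different label names, which is why a dashboard that filters on `kubernetes_namespace` finds no series in a Helm-installed Prometheus.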

tmacam added a commit to tmacam/dapr that referenced this issue Nov 3, 2023
@mukundansundar mukundansundar added this to the v1.13 milestone Nov 4, 2023
mukundansundar pushed a commit that referenced this issue Nov 4, 2023
* Fix Grafana dashboards.

The current grafana dashboards do not work in a fresh cluster where
prometheus and grafana are installed using helm following Dapr Docs
(see [1], [2]). They refer to metrics that are not available in
such install.

In short, based on bug-report from dapr/test-infra#204, the proposed
fix can be summed by:

```bash
sed -i \
    -e 's/\bkubernetes_name\b/service/g' \
    -e 's/\bkubernetes_namespace\b/namespace/g' \
    -e 's/\bkubernetes_node\b/node/g' \
    -e 's/\bkubernetes_pod_name\b/pod/g' \
    *.json
```

Additionally:

* Removes refresh rates smaller than 1 minute.
* Sets default interval range to 14 days in the past to now
* Sets default template values to match the longhaul clusters.

Fixes #7120

[1]: https://docs.dapr.io/operations/observability/metrics/prometheus/#setup-prometheus-on-kubernetes
[2]: https://docs.dapr.io/operations/observability/metrics/grafana/#setup-on-kubernetes

Signed-off-by: Tiago Alves Macambira <tmacam@burocrata.org>

* Remove longhaul related settings.

Signed-off-by: Tiago Alves Macambira <tmacam@burocrata.org>

---------

Signed-off-by: Tiago Alves Macambira <tmacam@burocrata.org>