Summarize pagerduty alerts with AlertmanagerConfig CRD. #6463

Open
dezka opened this issue Apr 2, 2024 · 1 comment

Comments

@dezka

dezka commented Apr 2, 2024

What happened?

Description

The summary of our alerts (shown in the title and as the Slack message) includes all of the labels in the message, making it very hard to see what the real problem is. An example:

[FIRING:1] NodeMemoryMajorPagesFaults dev node-exporter http-metrics 10.92.10.18:9100 node-exporter monitoring kube-prometheus-stack-prometheus-node-exporter-n4gq8 monitoring/kube-prometheus-stack-prometheus kube-prometheus-stack-prometheus-node-exporter warning infra
Labels:
 - alertname = NodeMemoryMajorPagesFaults
 - cluster_id = dev
 - container = node-exporter
 - endpoint = http-metrics
 - instance = 10.92.10.18:9100
 - job = node-exporter
 - namespace = monitoring
 - pod = kube-prometheus-stack-prometheus-node-exporter-n4gq8
 - prometheus = monitoring/kube-prometheus-stack-prometheus
 - service = kube-prometheus-stack-prometheus-node-exporter
 - severity = warning
 - team = infra
Annotations:
 - description = Memory major pages are occurring at very high rate at 10.92.10.18:9100, 500 major page faults per second for the last 15 minutes, is currently at 1400.70.
Please check that there is enough memory available at this instance.

 - runbook_url = https://runbooks.prometheus-operator.dev/runbooks/node/nodememorymajorpagesfaults
 - summary = Memory major page faults are occurring at very high rate.
Source: https://prometheus.ourdomain/graph?g0.expr=rate%28node_vmstat_pgmajfault%7Bjob%3D%22node-exporter%22%7D%5B5m%5D%29+%3E+500&g0.tab=1

I've tried to set the summary in the AlertmanagerConfig like so:

        - details:
          - key: summary
            value: '{{ `{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}` }}'

Full CRD:

---
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: {{ .Release.Name }}-global
spec:
  inhibitRules:
    - equal:
        - namespace
        - alertname
      sourceMatch:
        - name: severity
          matchType: =
          value: critical
      targetMatch:
        - name: severity
          matchType: =~
          value: warning|info
    - equal:
        - namespace
        - alertname
      sourceMatch:
        - name: severity
          matchType: =
          value: warning
      targetMatch:
        - name: severity
          matchType: =
          value: info
    - equal:
        - namespace
      sourceMatch:
        - name: alertname
          matchType: =
          value: InfoInhibitor
      targetMatch:
        - name: severity
          matchType: =
          value: info
  route:
    groupBy: ["..."]
    groupWait: 2s
    repeatInterval: 12h
    receiver: "null"
    routes:
      - matchers:
          - name: severity
            matchType: =
            value: warning
          - name: team
            matchType: =
            value: infra
        receiver: {{ .Values.alertmanager.routes.cluster_warning }}
      - matchers:
          - name: severity
            matchType: =
            value: critical
          - name: team
            matchType: =
            value: infra
        receiver: {{ .Values.alertmanager.routes.cluster_critical }}
      - matchers:
          - name: alertname
            matchType: =~
            value: InfoInhibitor|Watchdog
        receiver: "null"
  receivers:
    - name: "null"
  {{- range .Values.alertmanager.receivers.pagerduty }}
    - name: {{ . }}
      pagerdutyConfigs:
        - routingKey:
            name: alertmanager-{{ $.Release.Name }}-pagerduty
            key: {{ . }}
        - details:
          - key: summary
            value: '{{ `{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}` }}'
  {{- end }}

This doesn't change anything, however. I'm pretty much out of ideas, and I would like to keep using this new CRD rather than our older method. Any help is appreciated.
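
For reference, in the block above details is indented as a second item of the pagerdutyConfigs list, separate from the entry that holds routingKey. A sketch of the same receiver with details nested inside that same entry instead (an assumption on my part, not something I have confirmed fixes it):

  receivers:
    - name: "null"
  {{- range .Values.alertmanager.receivers.pagerduty }}
    - name: {{ . }}
      pagerdutyConfigs:
        - routingKey:
            name: alertmanager-{{ $.Release.Name }}-pagerduty
            key: {{ . }}
          # details kept as a field of the same pagerdutyConfigs entry as routingKey,
          # i.e. under the same "-" item rather than starting a new list item
          details:
            - key: summary
              value: '{{ `{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}` }}'
  {{- end }}

If the goal is to change the incident title itself, the pagerdutyConfigs description field may also be relevant, since details only attaches extra key/value pairs to the incident; I have not verified this against the CRD version shipped with the chart.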

Steps to Reproduce

Expected Result

The summary displayed on PagerDuty and in Slack should be "Memory major page faults are occurring at very high rate.", as shown above.

Actual Result

Prometheus Operator Version

Name:                   kube-prometheus-stack-operator
Namespace:              monitoring
CreationTimestamp:      Wed, 19 Jan 2022 20:16:22 -0500
Labels:                 app=kube-prometheus-stack-operator
                        app.kubernetes.io/instance=kube-prometheus-stack
                        app.kubernetes.io/managed-by=Helm
                        app.kubernetes.io/part-of=kube-prometheus-stack
                        app.kubernetes.io/version=51.0.3
                        chart=kube-prometheus-stack-51.0.3
                        heritage=Helm
                        release=kube-prometheus-stack
Annotations:            argocd.argoproj.io/tracking-id: dev-kube-prometheus-stack:apps/Deployment:monitoring/kube-prometheus-stack-operator
                        deployment.kubernetes.io/revision: 22
Selector:               app=kube-prometheus-stack-operator,release=kube-prometheus-stack
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:           app=kube-prometheus-stack-operator
                    app.kubernetes.io/instance=kube-prometheus-stack
                    app.kubernetes.io/managed-by=Helm
                    app.kubernetes.io/part-of=kube-prometheus-stack
                    app.kubernetes.io/version=51.0.3
                    chart=kube-prometheus-stack-51.0.3
                    heritage=Helm
                    release=kube-prometheus-stack
  Service Account:  kube-prometheus-stack-operator
  Containers:
   kube-prometheus-stack:
    Image:      584508078187.dkr.ecr.us-east-1.amazonaws.com/quay/prometheus-operator/prometheus-operator:v0.68.0
    Port:       10250/TCP
    Host Port:  0/TCP
    Args:
      --kubelet-service=kube-system/kube-prometheus-stack-kubelet
      --localhost=127.0.0.1
      --prometheus-config-reloader=584508078187.dkr.ecr.us-east-1.amazonaws.com/quay/prometheus-operator/prometheus-config-reloader:v0.68.0
      --config-reloader-cpu-request=200m
      --config-reloader-cpu-limit=200m
      --config-reloader-memory-request=50Mi
      --config-reloader-memory-limit=50Mi
      --thanos-default-base-image=quay.io/thanos/thanos:v0.32.2
      --secret-field-selector=type!=kubernetes.io/dockercfg,type!=kubernetes.io/service-account-token,type!=helm.sh/release.v1
      --web.enable-tls=true
      --web.cert-file=/cert/cert
      --web.key-file=/cert/key
      --web.listen-address=:10250
      --web.tls-min-version=VersionTLS13
    Environment:  <none>
    Mounts:
      /cert from tls-secret (ro)
  Volumes:
   tls-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-prometheus-stack-admission
    Optional:    false

Kubernetes Version

clientVersion:
  buildDate: "2023-10-18T11:33:16Z"
  compiler: gc
  gitCommit: a8a1abc25cad87333840cd7d54be2efaf31a3177
  gitTreeState: clean
  gitVersion: v1.28.3
  goVersion: go1.20.10
  major: "1"
  minor: "28"
  platform: darwin/arm64
kustomizeVersion: v5.0.4-0.20230601165947-6ce0bf390ce3
serverVersion:
  buildDate: "2024-01-29T20:59:05Z"
  compiler: gc
  gitCommit: e99f7c75641f738090d483d988dc4a70001e01cf
  gitTreeState: clean
  gitVersion: v1.27.10-eks-508b6b3
  goVersion: go1.20.13
  major: "1"
  minor: 27+
  platform: linux/amd64

Kubernetes Cluster Type

EKS

How did you deploy Prometheus-Operator?

helm chart: prometheus-community/kube-prometheus-stack

Manifests

No response

prometheus-operator log output

No output related to this is displayed.

Anything else?

No response

@dezka dezka added kind/support needs-triage Issues that haven't been triaged yet labels Apr 2, 2024
@dezka dezka changed the title Summarize pagerduty alerts with AlertManagerConfig CRD. Summarize pagerduty alerts with AlertmanagerConfig CRD. Apr 2, 2024
@simonpasquier
Contributor

Can you share the rendered AlertmanagerConfig manifest (not the Helm version)?
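
One way to pull it as applied to the cluster, assuming the resource keeps the templated name and lives in the monitoring namespace (both assumptions on my part):

kubectl get alertmanagerconfig <release>-global -n monitoring -o yaml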

@simonpasquier simonpasquier removed the needs-triage Issues that haven't been triaged yet label Apr 5, 2024