
One AlertmanagerConfig failing to sync, blocks all others #6532

Open
Daniel-Vaz opened this issue Apr 19, 2024 · 4 comments · May be fixed by #6585

Comments

@Daniel-Vaz

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Description

In our multi-tenant clusters, we have many users deploying their own AlertmanagerConfig objects in their namespaces, and they all get synced to a central Alertmanager.

If we create an AlertmanagerConfig object in a namespace that for some reason fails to sync to the managed Alertmanager, the Operator does NOT ignore it and continue loading other newly created AlertmanagerConfig objects coming in from other namespaces.

For example, we work heavily with Slack receivers. If the value of the URL secret has any issue, the Prometheus Operator fails to sync the AlertmanagerConfig and reports the following error:

level=error ts=2024-04-19T07:26:08.458451375Z caller=klog.go:126 component=k8s_client_runtime func=ErrorDepth msg="sync \"monitoring-system/kps-alertmanager\" failed: provision alertmanager configuration: failed to generate Alertmanager configuration: AlertmanagerConfig test/slack-receiver: SlackConfig[0]: invalid URL \"'https://hooks.slack.com/services/XXX/XXX/XXX'\" in key \"url\" from secret \"slackapiurl-secret\": validate url from string failed for 'https://hooks.slack.com/services/XXX/XXX/XXX': parse \"'https://hooks.slack.com/services/XXX/XXX/XXX'\": first path segment in URL cannot contain colon"

This error seems to block the Prometheus Operator from syncing any other valid AlertmanagerConfig objects.
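
(Aside on the error itself: the kubectl command in the reproduction steps below stores literal single quotes in the secret value, and Go's URL parser rejects a quoted URL. A quick standalone check, plain Go and not operator code, reproduces the same inner parse error:)

package main

import (
	"fmt"
	"net/url"
)

func main() {
	// The escaped quotes in --from-literal=url=\'https://...\' end up stored
	// literally in the secret, so the operator is asked to parse a quoted string.
	_, err := url.Parse("'https://hooks.slack.com/services/XXX/XXX/XXX'")
	fmt.Println(err) // parse "'https://...'": first path segment in URL cannot contain colon

	// Without the surrounding quotes, the same value parses fine.
	_, err = url.Parse("https://hooks.slack.com/services/XXX/XXX/XXX")
	fmt.Println(err) // <nil>
}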

Steps to Reproduce

  1. Create namespace "test":
kubectl create ns test
  2. Create a secret inside that namespace with an improperly formatted Slack URL endpoint:
kubectl -n test create secret generic  slackapiurl-secret --from-literal=url=\'https://hooks.slack.com/services/XXX/XXX/XXX\'
  3. Create a standard AlertmanagerConfig with a Slack receiver/route using the secret created above:
cat <<EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: slack-receiver
  namespace: test
spec:
  inhibitRules:
  - equal:
    - alertname
    sourceMatch:
    - name: 'severity'
      value: 'critical'
      matchType: '='
    targetMatch:
    - name: 'severity'
      value: 'warning'
      matchType: '='
  receivers:
  - name: SlackAlerts
    slackConfigs:
      - channel: '#slack-channel-example'
        apiURL:
          name: slackapiurl-secret
          key: url
        sendResolved: true
  route:
    receiver: 'SlackAlerts'
    groupBy: [cluster_short, alertname]
    groupWait: 60s
    groupInterval: 15m
    repeatInterval: 4h
    continue: true
EOF
  4. Check the Operator logs and confirm that it is indeed failing to sync this new config:
kubectl -n monitoring-system logs -l app.kubernetes.io/component=prometheus-operator | grep level=error

level=error ts=2024-04-19T07:26:08.458451375Z caller=klog.go:126 component=k8s_client_runtime func=ErrorDepth msg="sync \"monitoring-system/kps-alertmanager\" failed: provision alertmanager configuration: failed to generate Alertmanager configuration: AlertmanagerConfig test/slack-receiver: SlackConfig[0]: invalid URL \"'https://hooks.slack.com/services/XXX/XXX/XXX'\" in key \"url\" from secret \"slackapiurl-secret\": validate url from string failed for 'https://hooks.slack.com/services/XXX/XXX/XXX': parse \"'https://hooks.slack.com/services/XXX/XXX/XXX'\": first path segment in URL cannot contain colon"
  5. Repeat steps 1 to 3 in a new namespace, but this time, in step 2, create a valid secret. For example:
kubectl create ns test2 
kubectl -n test2 create secret generic  slackapiurl-secret --from-literal=url=https://hooks.slack.com/services/XXX/XXX/XXX
  6. Check the Operator logs: there is no reference to the valid AlertmanagerConfig object being synced to Alertmanager. The web UI status page also confirms that no new configuration was generated.

Expected Result

The Operator should "ignore" or bypass the failing AlertmanagerConfig object and proceed with syncing the other valid resources.

Actual Result

The Operator fails to sync any other AlertmanagerConfig object as soon as a single one fails to sync properly into Alertmanager.

Prometheus Operator Version

**Operator Image being used**: quay.io/prometheus-operator/prometheus-operator:v0.73.1

Kubernetes Version

v1.28.5

Kubernetes Cluster Type

kubeadm

How did you deploy Prometheus-Operator?

helm chart: prometheus-community/kube-prometheus-stack

Manifests

No response

prometheus-operator log output

level=error ts=2024-04-19T07:26:08.458451375Z caller=klog.go:126 component=k8s_client_runtime func=ErrorDepth msg="sync \"monitoring-system/kps-alertmanager\" failed: provision alertmanager configuration: failed to generate Alertmanager configuration: AlertmanagerConfig test/slack-receiver: SlackConfig[0]: invalid URL \"'https://hooks.slack.com/services/XXX/XXX/XXX'\" in key \"url\" from secret \"slackapiurl-secret\": validate url from string failed for 'https://hooks.slack.com/services/XXX/XXX/XXX': parse \"'https://hooks.slack.com/services/XXX/XXX/XXX'\": first path segment in URL cannot contain colon"

Anything else?

No response

@Daniel-Vaz Daniel-Vaz added kind/bug needs-triage Issues that haven't been triaged yet labels Apr 19, 2024
@simonpasquier simonpasquier added help wanted and removed needs-triage Issues that haven't been triaged yet labels Apr 29, 2024
@simonpasquier
Contributor

It should be a bug in the operator then: the expectation is that invalid AlertmanagerConfig objects are rejected before generating the final config.

@codeknight03
Contributor

Been exploring the Alertmanager side of things on the operator for the past month. I'll check this one out.

@simonpasquier
Contributor

In practice, the object with the invalid reference should be detected (and rejected) here:

// checkAlertmanagerConfigResource verifies that an AlertmanagerConfig object is valid
// for the given Alertmanager version and has no missing references to other objects.
func checkAlertmanagerConfigResource(ctx context.Context, amc *monitoringv1alpha1.AlertmanagerConfig, amVersion semver.Version, store *assets.StoreBuilder) error {
	if err := validationv1alpha1.ValidateAlertmanagerConfig(amc); err != nil {
		return err
	}

	if err := checkReceivers(ctx, amc, store, amVersion); err != nil {
		return err
	}

	if err := checkRoute(ctx, amc.Spec.Route, amVersion); err != nil {
		return err
	}

	return checkInhibitRules(amc, amVersion)
}
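
A minimal sketch of the behaviour being discussed (illustrative only: the types, the validate function, and the logger below are stand-ins, not the operator's actual code), where an object that fails validation is skipped and logged instead of aborting the whole sync:

package main

import (
	"errors"
	"fmt"
	"log"
)

// alertmanagerConfig stands in for monitoringv1alpha1.AlertmanagerConfig.
type alertmanagerConfig struct {
	Namespace, Name, SlackURL string
}

// validate stands in for checkAlertmanagerConfigResource.
func validate(amc alertmanagerConfig) error {
	if amc.SlackURL == "" {
		return errors.New("invalid URL in Slack receiver")
	}
	return nil
}

// selectConfigs keeps only the objects that pass validation and logs the
// rejected ones, so a single bad object cannot block the others from syncing.
func selectConfigs(candidates []alertmanagerConfig) []alertmanagerConfig {
	var selected []alertmanagerConfig
	for _, amc := range candidates {
		if err := validate(amc); err != nil {
			log.Printf("skipping AlertmanagerConfig %s/%s: %v", amc.Namespace, amc.Name, err)
			continue
		}
		selected = append(selected, amc)
	}
	return selected
}

func main() {
	candidates := []alertmanagerConfig{
		{Namespace: "test", Name: "slack-receiver"},                                                    // invalid: empty URL
		{Namespace: "test2", Name: "slack-receiver", SlackURL: "https://hooks.slack.com/services/XXX"}, // valid
	}
	for _, amc := range selectConfigs(candidates) {
		fmt.Printf("including %s/%s in the generated configuration\n", amc.Namespace, amc.Name)
	}
}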

@codeknight03
Contributor

codeknight03 commented May 1, 2024

Approach to solving this issue and status:

  • Test the steps to reproduce on a local kind cluster to understand exactly when the issue comes up.
  • Working through the functions in validation to find out which one is causing the sync to fail.
  • Finding the fix for the particular function.

Just to give you guys an outline of what I am doing and where I am with this.
