
Webhooks for firing alerts contain zeroed EndsAt #3351

Open
grobinson-grafana opened this issue May 3, 2023 · 9 comments


@grobinson-grafana
Contributor

What did you do?

It appears that webhooks for firing alerts contain a zeroed EndsAt timestamp (0001-01-01T00:00:00Z) instead of the actual EndsAt timestamp of the alert as can be seen in /v2/alerts.

Having looked into this further, the code that causes this can be found on line 507 of dispatch.go:

for _, alert := range alerts {
	a := *alert
	// Ensure that alerts don't resolve as time move forwards.
	if !a.ResolvedAt(now) {
		a.EndsAt = time.Time{}
	}
	alertsSlice = append(alertsSlice, &a)
}

I think the intention of this code is to prevent alerts from becoming resolved during subsequent stages of the receiver such as WaitStage and DedupStage. This can happen when a firing alert enters the stages at time t1 but doesn't finish the stages until time t2 > EndsAt.
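
To make the failure mode concrete, here is a minimal Go sketch (using the model.Alert type from prometheus/common for brevity, with made-up alert values) showing how the same alert reads as firing when the flush begins but as resolved a few seconds later, once the subsequent stages have run:

package main

import (
	"fmt"
	"time"

	"github.com/prometheus/common/model"
)

func main() {
	flushTime := time.Now()
	a := &model.Alert{
		Labels:   model.LabelSet{"alertname": "HighLatency"},
		StartsAt: flushTime.Add(-time.Minute),
		// EndsAt falls between the flush and the moment the notification is sent.
		EndsAt: flushTime.Add(2 * time.Second),
	}

	// Firing when the dispatcher flushes the group...
	fmt.Println(a.ResolvedAt(flushTime)) // false

	// ...but "resolved" by the time WaitStage and DedupStage have finished.
	fmt.Println(a.ResolvedAt(flushTime.Add(5 * time.Second))) // true
}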

What did you expect to see?

I expect either EndsAt to contain a non-zero timestamp or the EndsAt field to be omitted from the JSON for firing alerts.

What did you see instead? Under which circumstances?

Related: #3341

@grobinson-grafana
Contributor Author

I found the original commit by Brian here.

I think the need to do this is a consequence of how Alertmanager always uses EndsAt - now to determine if an alert is firing or resolved, so as time moves forward the state of the alert can change between successive retries. This shouldn't be possible, but preventing it would require refactoring how the state of the alert is propagated to notifiers.

An immediate fix would be to remove the EndsAt field from the JSON for firing alerts, only including it for resolved alerts. This would at least remove some confusion from the user's perspective, even if it doesn't fix the bigger issue.
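
For illustration only, here is a sketch of what omitting the field could look like on the serialization side. The type and function below are hypothetical (not the actual template.Alert or webhook code, and they assume the time and github.com/prometheus/common/model imports); note that encoding/json's omitempty does not suppress a zero time.Time value, so a pointer (or custom marshalling) would be needed:

// Hypothetical payload type, for illustration only.
type webhookAlert struct {
	Status   string     `json:"status"`
	StartsAt time.Time  `json:"startsAt"`
	EndsAt   *time.Time `json:"endsAt,omitempty"` // nil for firing alerts, so the key is omitted
}

func toWebhookAlert(a *model.Alert, now time.Time) webhookAlert {
	w := webhookAlert{Status: "firing", StartsAt: a.StartsAt}
	if a.ResolvedAt(now) {
		w.Status = "resolved"
		endsAt := a.EndsAt
		w.EndsAt = &endsAt
	}
	return w
}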

@gotjosh
Member

gotjosh commented May 3, 2023

The more I think about it, the more I feel like the semantics of flushing and time are broken.

Once we've determined that an alert needs to flush and it enters the subsequent stages, should it really be time-constrained? I'd argue it should just be force-flushed, and that modifying endsAt is not something we want to do.

@grobinson-grafana
Contributor Author

I discussed this with Josh and we agreed on an appropriate fix: instead of calculating whether the alert is firing by checking Resolved() indirectly in template.NewData(), the status is represented as a new Status field in the alert that is passed to template.NewData().

This means that instead of calculating whether the alert is firing based on the end time, it just reads the new field called Status. The timestamps at this point are purely metadata and have no observable effect on the state of the alert.
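
A rough sketch of the idea (the type and function names here are illustrative only, not the final API, and assume the time and github.com/prometheus/common/model imports):

// Illustrative only: the status is computed once, at flush time, and carried
// alongside the alert so that templating reads the field instead of calling
// Resolved()/time.Now() again later.
type alertWithStatus struct {
	model.Alert
	Status model.AlertStatus // "firing" or "resolved", fixed at flush time
}

func withStatus(a *model.Alert, flushTime time.Time) alertWithStatus {
	status := model.AlertFiring
	if a.ResolvedAt(flushTime) {
		status = model.AlertResolved
	}
	return alertWithStatus{Alert: *a, Status: status}
}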

@vLabayen

vLabayen commented May 8, 2023

Given that an alert can reach the notifier before its resolve date, so its state is computed as firing, but the notification is not guaranteed to reach the user in time (for example, an SMTP server can cause some delay), I think making a snapshot of the alert's data at the time of determining whether the alert should be flushed is perfectly fine.

I discussed this with Josh and we agreed on an appropriate fix: instead of calculating whether the alert is firing by checking Resolved() indirectly in template.NewData(), the status is represented as a new Status field in the alert that is passed to template.NewData().

This means that instead of calculating whether the alert is firing based on the end time, it just reads the new field called Status. The timestamps at this point are purely metadata and have no observable effect on the state of the alert.

That should work for me.

@simonpasquier
Member

I discussed this with Josh and we agreed on an appropriate fix: instead of calculating whether the alert is firing by checking Resolved() indirectly in template.NewData(), the status is represented as a new Status field in the alert that is passed to template.NewData().

This looks like the right approach to me.

@grobinson-grafana
Contributor Author

I started on a fix for this issue and it appears that EndsAt is zeroed in two places, not one as I had first thought!

  1. Line 507 of dispatch.go
for _, alert := range alerts {
	a := *alert
	// Ensure that alerts don't resolve as time move forwards.
	if !a.ResolvedAt(now) {
		a.EndsAt = time.Time{}
	}
	alertsSlice = append(alertsSlice, &a)
}
  2. Line 338 of types.go
func Alerts(alerts ...*Alert) model.Alerts {
	res := make(model.Alerts, 0, len(alerts))
	for _, a := range alerts {
		v := a.Alert
		// If the end timestamp is not reached yet, do not expose it.
		if !a.Resolved() {
			v.EndsAt = time.Time{}
		}
		res = append(res, &v)
	}
	return res
}

If I remove just one of them, the webhook still contains a zeroed EndsAt, which makes me believe that the code in dispatch.go is redundant given that the same logic is also applied in types.go.

Second, I was thinking about changing the code in dispatch.go to copy the alert into a new struct called AlertWithStatus that looks like this:

type AlertWithStatus struct {
	Alert
	Status string
}

for _, alert := range alerts {
	a := types.AlertWithStatus{
		Alert: *alert,
		Status: "", // status goes here
	}
	alertsSlice = append(alertsSlice, &a)
}

This seems like a reasonable approach to changing alerts from dispatch onwards, without having to make changes to the data before that point. The downside is that a lot of function signatures will need to be changed, so expect a lot of noise in the diff.

@vcill

vcill commented Apr 9, 2024

Hi guys, any progress on this? I am also experiencing this issue: I am forwarding alerts to Alerta, and Alerta immediately marks them as closed because of the invalid endsAt time.
My alerts contain a valid endsAt but Alertmanager overwrites it, which is a plain bug - I don't see why this is labeled as an enhancement.

grobinson-grafana added a commit to grobinson-grafana/prometheus-common that referenced this issue Apr 12, 2024
This commit adds the StatusAt method to the Alert struct. It calls
the ResolvedAt method, just as Status calls the Resolved method.

This method will be used in Alertmanager to fix issue
prometheus/alertmanager#3351.

Signed-off-by: George Robinson <george.robinson@grafana.com>
@grobinson-grafana
Contributor Author

I think I found a less intrusive fix for this. Instead of calling Resolved() and Status(), which use time.Now(), I can call ResolvedAt(ts time.Time) and StatusAt(ts time.Time) (see prometheus/common#618), where ts is the timestamp of the flush. This means we can remove the following code:

// If the end timestamp is not reached yet, do not expose it.
if !a.Resolved() {
	v.EndsAt = time.Time{}
}

and also solve the issue that #1686 was intended to fix.
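
A minimal sketch of the idea, assuming StatusAt lands with the shape proposed in prometheus/common#618, i.e. StatusAt(ts time.Time) model.AlertStatus:

// Evaluate the alert against the flush timestamp rather than time.Now(),
// so the answer is the same on every retry of the notification.
func snapshotAt(a *model.Alert, flushTime time.Time) (model.AlertStatus, bool) {
	resolved := a.ResolvedAt(flushTime) // stable across retries
	status := a.StatusAt(flushTime)     // "firing" or "resolved" as of the flush
	return status, resolved
}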

@grobinson-grafana
Contributor Author

Here is the PR #3805.
