
Webhooks for firing alerts contain zeroed EndsAt #3351

Open
grobinson-grafana opened this issue May 3, 2023 · 9 comments


@grobinson-grafana
Contributor

What did you do?

It appears that webhooks for firing alerts contain a zeroed EndsAt timestamp (0001-01-01T00:00:00Z) instead of the actual EndsAt timestamp of the alert as can be seen in /v2/alerts.

Having looked into this further, the code that causes this can be found on line 507 of dispatch.go:

for _, alert := range alerts {
	a := *alert
	// Ensure that alerts don't resolve as time move forwards.
	if !a.ResolvedAt(now) {
		a.EndsAt = time.Time{}
	}
	alertsSlice = append(alertsSlice, &a)
}

I think the intention of this code is to prevent alerts from becoming resolved during subsequent stages of the receiver such as WaitStage and DedupStage. This can happen when a firing alert enters the stages at time t1 but doesn't finish the stages until time t2 > EndsAt.
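
To make the failure mode concrete, here is a minimal Go sketch (using the model.Alert type from prometheus/common for brevity, with made-up alert values) showing how the same alert reads as firing when the flush begins but as resolved a few seconds later, once the subsequent stages have run:

package main

import (
	"fmt"
	"time"

	"github.com/prometheus/common/model"
)

func main() {
	flushTime := time.Now()
	a := &model.Alert{
		Labels:   model.LabelSet{"alertname": "HighLatency"},
		StartsAt: flushTime.Add(-time.Minute),
		// EndsAt falls between the flush and the moment the notification is sent.
		EndsAt: flushTime.Add(2 * time.Second),
	}

	// Firing when the dispatcher flushes the group...
	fmt.Println(a.ResolvedAt(flushTime)) // false

	// ...but "resolved" by the time WaitStage and DedupStage have finished.
	fmt.Println(a.ResolvedAt(flushTime.Add(5 * time.Second))) // true
}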

What did you expect to see?

I expect either EndsAt to contain a non-zero timestamp or the EndsAt field to be omitted from the JSON for firing alerts.

What did you see instead? Under which circumstances?

Related: #3341

@grobinson-grafana
Contributor Author

I found the original commit by Brian here.

I think the need to do this is a consequence of how Alertmanager always uses EndsAt - now to determine if an alert is firing or resolved, so as time moves forward the state of the alert can change between successive retries. This shouldn't be possible, but preventing it would require refactoring how the state of the alert is propagated to notifiers.

An immediate fix would be to remove the EndsAt field from the JSON for firing alerts, only including it for resolved alerts. This would at least remove some confusion from the user's perspective, even if it doesn't fix the bigger issue.
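
For illustration only, here is a sketch of what omitting the field could look like on the serialization side. The type and function below are hypothetical (not the actual template.Alert or webhook code, and they assume the time and github.com/prometheus/common/model imports); note that encoding/json's omitempty does not suppress a zero time.Time value, so a pointer (or custom marshalling) would be needed:

// Hypothetical payload type, for illustration only.
type webhookAlert struct {
	Status   string     `json:"status"`
	StartsAt time.Time  `json:"startsAt"`
	EndsAt   *time.Time `json:"endsAt,omitempty"` // nil for firing alerts, so the key is omitted
}

func toWebhookAlert(a *model.Alert, now time.Time) webhookAlert {
	w := webhookAlert{Status: "firing", StartsAt: a.StartsAt}
	if a.ResolvedAt(now) {
		w.Status = "resolved"
		endsAt := a.EndsAt
		w.EndsAt = &endsAt
	}
	return w
}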

@gotjosh
Member

gotjosh commented May 3, 2023

The more I think about it, the more I feel like the semantics of flushing and time are broken.

Once we've determined that an alert needs to flush and it enters the subsequent stages, should it really be time-constrained? I'd argue it should just be force-flushed, and that modifying endsAt is not something we want to do.

@grobinson-grafana
Contributor Author

I discussed this with Josh and we agreed on an appropriate fix: instead of calculating whether the alert is firing by checking Resolved() indirectly in template.NewData(), the status is represented as a new Status field in the alert that is passed to template.NewData().

This means that instead of calculating whether the alert is firing based on the end time, it just reads the new field called Status. The timestamps at this point are purely metadata and have no observable effect on the state of the alert.
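
A rough sketch of the idea (the type and function names here are illustrative only, not the final API, and assume the time and github.com/prometheus/common/model imports):

// Illustrative only: the status is computed once, at flush time, and carried
// alongside the alert so that templating reads the field instead of calling
// Resolved()/time.Now() again later.
type alertWithStatus struct {
	model.Alert
	Status model.AlertStatus // "firing" or "resolved", fixed at flush time
}

func withStatus(a *model.Alert, flushTime time.Time) alertWithStatus {
	status := model.AlertFiring
	if a.ResolvedAt(flushTime) {
		status = model.AlertResolved
	}
	return alertWithStatus{Alert: *a, Status: status}
}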

@vLabayen

vLabayen commented May 8, 2023

Given that an alert can reach the notifier before its resolve date, so its state is computed as firing, but the notification is not guaranteed to reach the user in time (for example, an SMTP server can cause some delay), I think making a snapshot of the alert's data at the time of determining whether the alert should be flushed is perfectly fine.

I discussed this with Josh and we agreed on an appropriate fix: instead of calculating whether the alert is firing by checking Resolved() indirectly in template.NewData(), the status is represented as a new Status field in the alert that is passed to template.NewData().

This means that instead of calculating whether the alert is firing based on the end time, it just reads the new field called Status. The timestamps at this point are purely metadata and have no observable effect on the state of the alert.

That should work for me.

@simonpasquier
Member

I discussed this with Josh and we agreed on an appropriate fix: instead of calculating whether the alert is firing by checking Resolved() indirectly in template.NewData(), the status is represented as a new Status field in the alert that is passed to template.NewData().

This looks like the right approach to me.

@grobinson-grafana
Contributor Author

I started on a fix for this issue and it appears that EndsAt is zeroed in two places, not one as I had first thought!

  1. Line 507 of dispatch.go
for _, alert := range alerts {
	a := *alert
	// Ensure that alerts don't resolve as time move forwards.
	if !a.ResolvedAt(now) {
		a.EndsAt = time.Time{}
	}
	alertsSlice = append(alertsSlice, &a)
}
  2. Line 338 of types.go
func Alerts(alerts ...*Alert) model.Alerts {
	res := make(model.Alerts, 0, len(alerts))
	for _, a := range alerts {
		v := a.Alert
		// If the end timestamp is not reached yet, do not expose it.
		if !a.Resolved() {
			v.EndsAt = time.Time{}
		}
		res = append(res, &v)
	}
	return res
}

If I remove just one of them, the webhook still contains a zeroed EndsAt, which makes me believe that the code in dispatch.go is redundant given that the same logic is also applied in types.go.

Second, I was thinking about changing the code in dispatch.go to copy the alert into a new struct called AlertWithStatus that looks like this:

type AlertWithStatus struct {
	Alert
	Status string
}

for _, alert := range alerts {
	a := types.AlertWithStatus{
		Alert: *alert,
		Status: "", // status goes here
	}
	alertsSlice = append(alertsSlice, &a)
}

This seems like a reasonable approach to changing alerts from dispatch onwards, without having to make changes to the data before that point. The downside is that a lot of function signatures will need to be changed, so expect a lot of noise in the diff.

@vcill

vcill commented Apr 9, 2024

Hi guys, any progress on this? I am also experiencing this issue: I am forwarding alerts to Alerta, and Alerta immediately marks them as closed because of the invalid endsAt time.
My alerts contain a valid endsAt but Alertmanager overwrites it, which is a plain bug - I don't see why this is labeled as an enhancement.

grobinson-grafana added a commit to grobinson-grafana/prometheus-common that referenced this issue Apr 12, 2024
This commit adds the StatusAt method to the Alert struct. It calls
the ResolvedAt method, just as Status calls the Resolved method.

This method will be used in Alertmanager to fix issue
prometheus/alertmanager#3351.

Signed-off-by: George Robinson <george.robinson@grafana.com>
@grobinson-grafana
Contributor Author

I think I found a less intrusive fix for this. Instead of calling Resolved() and Status(), which use time.Now(), I can call ResolvedAt(ts time.Time) and StatusAt(ts time.Time) (see prometheus/common#618), where ts is the timestamp of the flush. This means we can remove the following code:

// If the end timestamp is not reached yet, do not expose it.
if !a.Resolved() {
	v.EndsAt = time.Time{}
}

and also solve the issue that #1686 was intended to fix.
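
A minimal sketch of the idea, assuming StatusAt lands with the shape proposed in prometheus/common#618, i.e. StatusAt(ts time.Time) model.AlertStatus:

// Evaluate the alert against the flush timestamp rather than time.Now(),
// so the answer is the same on every retry of the notification.
func snapshotAt(a *model.Alert, flushTime time.Time) (model.AlertStatus, bool) {
	resolved := a.ResolvedAt(flushTime) // stable across retries
	status := a.StatusAt(flushTime)     // "firing" or "resolved" as of the flush
	return status, resolved
}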

@grobinson-grafana
Contributor Author

Here is the PR #3805.
