Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Timestamps to Health Status #16972

Closed
mkieweg opened this issue Jan 24, 2024 · 6 comments · Fixed by #18660
Closed

Add Timestamps to Health Status #16972

mkieweg opened this issue Jan 24, 2024 · 6 comments · Fixed by #18660
Labels
bug/enhancement component:application-sets Bulk application management related enhancement New feature or request type:enhancement

Comments

@mkieweg
Copy link
Contributor

mkieweg commented Jan 24, 2024

Summary

Currently there is no way to notify on a certain health status being active longer than a set period of time. I propose to add a lastUpdateTime timestamp to app.status.health indicating how long the current status is active for.

Motivation

Since an Application with an erroneous configuration might be Progressing forever, I want to be able to leverage ArgoCD notifications to solve this issue. Currently the only workaround I'm aware (Slack discussion) of is to cross reference app.status.operationState.startedAt and app.status.health.status like this:

trigger.on-progressing-stuck: |
  - description: Application stuck progressing
    send:
    - app-stuck-progressing
    when: app.status.health.status == 'Progressing'
      and time.Now().Sub(time.Parse(app.status.operationState.startedAt)).Minutes() > 60 and time.Now().Sub(time.Parse(app.status.operationState.startedAt)).Minutes() < 62

which is prone to over-alerting as it triggers any instance of Progressing within this time span. So it's impossible to distinguish bad timing of a pod eviction from an app actually stuck on Progressing

Proposal

The lastUpdateTime field should be added to the respective struct and be set to the current time when the status changes.

I'm happy to take on the implementation.

@mkieweg mkieweg added the enhancement New feature or request label Jan 24, 2024
@svghadi
Copy link
Contributor

svghadi commented Jan 25, 2024

This would also help to stablize on-deployed notification trigger. I am looking for something similar to address false positive for on-deployed notifications. Having a field like since or lastUpdateTime under health status could help minimize false positives.
Ref: #9070 (comment)

@mkieweg
Copy link
Contributor Author

mkieweg commented Jan 25, 2024

Having a field like since or lastUpdateTime under health status could help minimize false positives.

That sentence made me realise that since might be confusing because of time.Since() having a different semantic. Changed the proposal accordingly.

@svghadi
Copy link
Contributor

svghadi commented Feb 1, 2024

Hi @mkieweg, I discussed this enhancement in today's contributors meeting, and there were no objections. It's a green light for implementation. Please let me know if you have the bandwidth to take this up, otherwise, I am happy to proceed with it.

@mkieweg
Copy link
Contributor Author

mkieweg commented Feb 1, 2024

Hi @svghadi, happy to hear that. I'll take on the implementation, I've allocated time for it next week :)

@svghadi
Copy link
Contributor

svghadi commented Feb 1, 2024

Cool. Here is something I was trying out last week for lastUpdateTime. I have not tested it out extensively. Just sharing for reference.
svghadi@d8a5321

@hiddeco
Copy link

hiddeco commented Jun 25, 2024

While I am super supportive of this being added, I have an edge case that I think falls into this category but is not covered by just this addition.

Within Argo CD, the sync operations are completely separated from the application health. However, in certain scenarios, you do want to know if the current state of the health check is the outcome of a previous sync operation.

With the proposal, this would become possible by comparing the .finishedAt from the .status.operationState against the .lastUpdateTime. This, however, only works when you sync workload resources, which means that it will not help in scenarios where you only have non-workload resources, which will always end up Healthy without causing a transition.

To also cover this, and while taking into account the comments[1][2], I wonder if it would be an idea to introduce a "heartbeat" kind of timestamp that updates at most every 30 seconds (to not cause any API pressure) to indicate the "freshness" of the observation even without it transitioning.

ishitasequeira pushed a commit that referenced this issue Dec 6, 2024
* add lastTransitionTime to health status

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* address first feedback

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* set transition time if health status is unknown

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* extend health improvement tests

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* add apoplication controller test

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* use require for NoError

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* more extensive tests for health state changes

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* Apply suggestions from code review

Co-authored-by: Blake Pettersson <blake.pettersson@gmail.com>
Signed-off-by: Manuel Kieweg <2939765+mkieweg@users.noreply.github.com>

* Code review suggestions

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* remove obsolete assert

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* Change LastTransitionTime field to pointer type

Due to implementation limitations, setting LastTransitionTime at the resource level is challenging.
Converting it to a pointer type allows it to be skipped at the resource level and prevents it from appearing
in .status.resources of the Application CR. Additionally, it doesn’t provide much value or have a known
use case right now.

Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>

* Resolve rebase conflicts

Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>

* Address review comment

Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>

* Trigger CI

Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>

---------

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <2939765+mkieweg@users.noreply.github.com>
Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>
Co-authored-by: Blake Pettersson <blake.pettersson@gmail.com>
Co-authored-by: Siddhesh Ghadi <sghadi1203@gmail.com>
revitalbarletz pushed a commit to revitalbarletz/argo-cd that referenced this issue Jan 20, 2025
* add lastTransitionTime to health status

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* address first feedback

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* set transition time if health status is unknown

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* extend health improvement tests

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* add apoplication controller test

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* use require for NoError

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* more extensive tests for health state changes

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* Apply suggestions from code review

Co-authored-by: Blake Pettersson <blake.pettersson@gmail.com>
Signed-off-by: Manuel Kieweg <2939765+mkieweg@users.noreply.github.com>

* Code review suggestions

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* remove obsolete assert

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>

* Change LastTransitionTime field to pointer type

Due to implementation limitations, setting LastTransitionTime at the resource level is challenging.
Converting it to a pointer type allows it to be skipped at the resource level and prevents it from appearing
in .status.resources of the Application CR. Additionally, it doesn’t provide much value or have a known
use case right now.

Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>

* Resolve rebase conflicts

Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>

* Address review comment

Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>

* Trigger CI

Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>

---------

Signed-off-by: Manuel Kieweg <mail@manuelkieweg.de>
Signed-off-by: Manuel Kieweg <2939765+mkieweg@users.noreply.github.com>
Signed-off-by: Siddhesh Ghadi <sghadi1203@gmail.com>
Co-authored-by: Blake Pettersson <blake.pettersson@gmail.com>
Co-authored-by: Siddhesh Ghadi <sghadi1203@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug/enhancement component:application-sets Bulk application management related enhancement New feature or request type:enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants