
Exit code whitelisting #809

Open
quentinus95 opened this issue Mar 24, 2023 · 6 comments

@quentinus95

Hello, would it be possible to have a feature that allows some non-zero exit codes to be whitelisted and considered a success (or a warning, for instance)?

I have some scripts that can end with a non-zero exit code that is not critical. It would be nice to be able to allow those codes and still consider the execution successful.

@cuu508
Member

cuu508 commented Jul 14, 2023

Thanks for the suggestion. Technically possible of course, but I'm not sure how widely applicable this would be – is it common to have scripts that return non-zero exit codes in success scenarios, with no way to influence this by passing parameters, editing the scripts, or using wrapper scripts with additional conditional logic?
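For example, roughly something like this – just a sketch in Python, where the wrapped command, the allow-listed codes and the ping URL are all placeholders:

# wrapper.py – sketch of allow-listing specific exit codes before pinging
import subprocess
import urllib.request

PING_URL = "https://hc.example.com/my-check"   # placeholder check URL
ALLOWED_EXIT_CODES = {0, 3}                    # non-zero codes considered benign for this job

result = subprocess.run(["/usr/local/bin/backup.sh"], capture_output=True, text=True)

# Map allow-listed exit codes to a success ping, everything else to /fail.
endpoint = PING_URL if result.returncode in ALLOWED_EXIT_CODES else PING_URL + "/fail"

# Include the exit code and the tail of stderr in the request body, so an
# "allowed but non-zero" run can still be reviewed later.
body = f"exit code: {result.returncode}\n{result.stderr[-1000:]}".encode()
urllib.request.urlopen(urllib.request.Request(endpoint, data=body), timeout=10)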

@quentinus95
Author

Hello @cuu508, here is one example I have in mind: when rsync copies a folder (e.g., for a backup) and some files are deleted before they are copied (rsync first scans the files, then runs the transfer), it may return a non-zero exit code. In some (most?) scenarios it is fine to ignore that specific error code, because it can be caused by logs that were rotated or by lock files that were removed (which is fine when taking a snapshot).

In such situations it would be nice to have a warning state rather than a failure. In the rsync example it would allow saying "maybe you're backing up some folders or files that should be ignored". Those situations are fine and can be investigated later (very different from a backup that failed to execute and might require immediate action).

@davidtorosyan

I have a similar use case, also backup related.

I expect my backup script to run successfully once a day. However, if it runs more frequently (say due to manual triggers), it'll bail out without actually doing anything.

I don't want to count this as success, but I don't want to alert on the failure either. So right now the only thing I can think to do is omit the "start" ping.

If I want to retain "start", then I'd need a way to signal that a run is canceled. Using an allow-listed non-zero status code could work for that.

@cuu508
Member

cuu508 commented Aug 28, 2023

@davidtorosyan a couple of questions, so I understand your use case:

I expect my backup script to run successfully once a day. However, if it runs more frequently (say due to manual triggers), it'll bail out without actually doing anything.

Why does it bail out on manual triggers? Do manual and automatic triggers launch the job differently? Or does the backup job somehow recognize that "it's not the right time for me to run"?

I don't want to count this as success, but I don't want to alert on the failure either. So right now the only thing I can think to do is omit the "start" ping.

If the job does what it is supposed to do (which may be "nothing" in some cases), why not count it as success?

If I want to retain "start", then I'd need a way to signal that a run is canceled.

At the time when you send the "start" signal, you do not yet know if the job will be cancelled / bail out, correct? Like, the script starts up, then recognizes that some condition is not met, and bails out? What is that condition?

If you could detect the bail out condition near the start of the script, perhaps you could send the "start" signal only after it is clear the script will [attempt to] run fully?
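Roughly this shape, perhaps – just a sketch in Python, assuming the bail-out condition is cheap to check up front; the ping URL is a placeholder and the condition/backup functions are hypothetical stand-ins:

# sketch: send "start" only once it is clear the run will actually do work
import urllib.request

CHECK_URL = "https://hc.example.com/backup"   # placeholder check URL

def ping(suffix=""):
    urllib.request.urlopen(CHECK_URL + suffix, timeout=10)

def work_is_needed() -> bool:
    ...  # hypothetical cheap pre-check, e.g. "has the data changed?"

def run_backup() -> None:
    ...  # hypothetical backup work

if not work_is_needed():
    raise SystemExit(0)   # bail out quietly: no start, no success, no failure

ping("/start")            # timing begins only for runs that will do real work
try:
    run_backup()
except Exception:
    ping("/fail")
    raise
ping()                    # success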

@davidtorosyan

@cuu508 good questions! Let me try and answer with pseudocode:

/* backup script, to be run daily */

// start for timing
http.post("hc.com/backup/start")

// expensive call, ideally happens after start
data = readData()

// the data only changes every 6 hours, so this will bail out if we run more frequently
// this is neither success nor failure, but a no-op.
// if we count this as success, then we won't be alerted if the data starts never changing (which is unexpected)
if ! data.changedSinceLastBackup {
  exit
}

try {
  data.backup()
  http.post("hc.com/backup/success")
} catch {
  http.post("hc.com/backup/fail")
}

I see an additional solution I hadn't considered before: solving this with two health checks, one for the backup script and one for the successful backup itself. That way I'd have a signal for the backup script running (and succeeding even in the bail-out case) and for an actual backup being done with a daily frequency.
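Something like this, roughly – a sketch in Python with placeholder URLs, where the condition and backup steps stand in for the pseudocode above:

# sketch: one check for "the script ran", another for "a backup was made"
import urllib.request

SCRIPT_CHECK = "https://hc.example.com/backup-script"  # pinged on every run (placeholder)
BACKUP_CHECK = "https://hc.example.com/backup-done"    # pinged only for real backups (placeholder)

def ping(url, suffix=""):
    urllib.request.urlopen(url + suffix, timeout=10)

def data_changed_since_last_backup() -> bool:
    ...  # hypothetical: same bail-out condition as above

def run_backup() -> None:
    ...  # hypothetical backup work

ping(SCRIPT_CHECK, "/start")
try:
    if data_changed_since_last_backup():
        run_backup()
        ping(BACKUP_CHECK)     # the daily-frequency check only sees actual backups
    ping(SCRIPT_CHECK)         # the script itself succeeded, even in the bail-out case
except Exception:
    ping(SCRIPT_CHECK, "/fail")
    raise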

@davidtorosyan

After thinking about it more, I think I might be doing too much with healthchecks.

From what I can tell, healthchecks is best at making sure that a job is running on a given schedule (i.e. the backup job runs daily), not at validating arbitrary conditions (i.e. the data that's backed up is the data I want).

That said, I still have a need for the latter, so maybe what I'll do is something like this:

/* append this to backup script described in previous comment */

backups = getBackups()
if backups.latest > ago(1d) {
  http.post("hc.com/backups_healthy/success")
} else {
  http.post("hc.com/backups_healthy/fail")
}
