
Exit code whitelisting #809

Open
quentinus95 opened this issue Mar 24, 2023 · 6 comments

@quentinus95

Hello, would it be possible to have a feature that allows some non-zero exit codes to be whitelisted and considered a success (or a warning, for instance)?

I have some scripts that can end with a non-zero exit code that is not critical. It would be nice to be able to allow those codes and still consider the execution successful.

@cuu508
Member

cuu508 commented Jul 14, 2023

Thanks for the suggestion. Technically possible of course, but I'm not sure how widely applicable this would be – is it common to have scripts that return non-zero exit codes in success scenarios, with no way to influence this by passing parameters, editing the scripts, or using wrapper scripts with additional conditional logic?
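For example, roughly something like this – just a sketch in Python, where the wrapped command, the allow-listed codes and the ping URL are all placeholders:

# wrapper.py – sketch of allow-listing specific exit codes before pinging
import subprocess
import urllib.request

PING_URL = "https://hc.example.com/my-check"   # placeholder check URL
ALLOWED_EXIT_CODES = {0, 3}                    # non-zero codes considered benign for this job

result = subprocess.run(["/usr/local/bin/backup.sh"], capture_output=True, text=True)

# Map allow-listed exit codes to a success ping, everything else to /fail.
endpoint = PING_URL if result.returncode in ALLOWED_EXIT_CODES else PING_URL + "/fail"

# Include the exit code and the tail of stderr in the request body, so an
# "allowed but non-zero" run can still be reviewed later.
body = f"exit code: {result.returncode}\n{result.stderr[-1000:]}".encode()
urllib.request.urlopen(urllib.request.Request(endpoint, data=body), timeout=10)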

@quentinus95
Author

Hello @cuu508, here is one example I have in mind: when rsync copies a folder (e.g., for a backup) and some files are deleted before they are copied (rsync first scans the files, then runs the transfer), it may return a non-zero exit code. In some (most?) scenarios it is fine to ignore that specific error code, because it can be caused by logs that were rotated or by lock files that were removed (which is fine when taking a snapshot).

In such situations it would be nice to have a warning state rather than a failure. In the rsync example it would allow saying "maybe you're backing up some folders or files that should be ignored". Those situations are fine and can be investigated later (very different from a backup that failed to execute and might require immediate action).

@davidtorosyan

I have a similar use case, also backup related.

I expect my backup script to run successfully once a day. However, if it runs more frequently (say due to manual triggers), it'll bail out without actually doing anything.

I don't want to count this as success, but I don't want to alert on the failure either. So right now the only thing I can think to do is omit the "start" ping.

If I want to retain "start", then I'd need a way to signal that a run is canceled. Using an allow-listed non-zero status code could work for that.

@cuu508
Member

cuu508 commented Aug 28, 2023

@davidtorosyan a couple of questions, so I understand your use case:

I expect my backup script to run successfully once a day. However, if it runs more frequently (say due to manual triggers), it'll bail out without actually doing anything.

Why does it bail out on manual triggers? Do manual and automatic triggers launch the job differently? Or does the backup job somehow recognize that "it's not the right time for me to run"?

I don't want to count this as success, but I don't want to alert on the failure either. So right now the only thing I can think to do is omit the "start" ping.

If the job does what it is supposed to do (which may be "nothing" in some cases), why not count it as success?

If I want to retain "start", then I'd need a way to signal that a run is canceled.

At the time when you send the "start" signal, you do not yet know if the job will be cancelled / bail out, correct? Like, the script starts up, then recognizes that some condition is not met, and bails out? What is that condition?

If you could detect the bail out condition near the start of the script, perhaps you could send the "start" signal only after it is clear the script will [attempt to] run fully?
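Roughly this shape, perhaps – just a sketch in Python, assuming the bail-out condition is cheap to check up front; the ping URL is a placeholder and the condition/backup functions are hypothetical stand-ins:

# sketch: send "start" only once it is clear the run will actually do work
import urllib.request

CHECK_URL = "https://hc.example.com/backup"   # placeholder check URL

def ping(suffix=""):
    urllib.request.urlopen(CHECK_URL + suffix, timeout=10)

def work_is_needed() -> bool:
    ...  # hypothetical cheap pre-check, e.g. "has the data changed?"

def run_backup() -> None:
    ...  # hypothetical backup work

if not work_is_needed():
    raise SystemExit(0)   # bail out quietly: no start, no success, no failure

ping("/start")            # timing begins only for runs that will do real work
try:
    run_backup()
except Exception:
    ping("/fail")
    raise
ping()                    # success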

@davidtorosyan

@cuu508 good questions! Let me try and answer with pseudocode:

/* backup script, to be run daily */

// start for timing
http.post("hc.com/backup/start")

// expensive call, ideally happens after start
data = readData()

// the data only changes every 6 hours, so this will bail out if we run more frequently
// this is neither success nor failure, but a no-op.
// if we count this as success, then we won't be alerted if the data starts never changing (which is unexpected)
if ! data.changedSinceLastBackup {
  exit
}

try {
  data.backup()
  http.post("hc.com/backup/success")
} catch {
  http.post("hc.com/backup/fail")
}

I see an additional solution I hadn't considered before: solving this with two health checks, one for the backup script and one for the successful backup itself. That way I'd have a signal for the backup script running (and succeeding even in the bail-out case) and for an actual backup being done with a daily frequency.
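Something like this, roughly – a sketch in Python with placeholder URLs, where the condition and backup steps stand in for the pseudocode above:

# sketch: one check for "the script ran", another for "a backup was made"
import urllib.request

SCRIPT_CHECK = "https://hc.example.com/backup-script"  # pinged on every run (placeholder)
BACKUP_CHECK = "https://hc.example.com/backup-done"    # pinged only for real backups (placeholder)

def ping(url, suffix=""):
    urllib.request.urlopen(url + suffix, timeout=10)

def data_changed_since_last_backup() -> bool:
    ...  # hypothetical: same bail-out condition as above

def run_backup() -> None:
    ...  # hypothetical backup work

ping(SCRIPT_CHECK, "/start")
try:
    if data_changed_since_last_backup():
        run_backup()
        ping(BACKUP_CHECK)     # the daily-frequency check only sees actual backups
    ping(SCRIPT_CHECK)         # the script itself succeeded, even in the bail-out case
except Exception:
    ping(SCRIPT_CHECK, "/fail")
    raise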

@davidtorosyan

After thinking about it more, I think I might be doing too much with healthchecks.

From what I can tell, healthchecks is best at making sure that a job is running on a given schedule (i.e. the backup job runs daily), not at validating arbitrary conditions (i.e. the data that's backed up is the data I want).

That said, I still have a need for the latter, so maybe what I'll do is something like this:

/* append this to backup script described in previous comment */

backups = getBackups()
if backups.latest > ago(1d) {
  http.post("hc.com/backups_healthy/success")
} else {
  http.post("hc.com/backups_healthy/fail")
}
