Don't fail if rds files are corrupted #471

richfitz · 2021-04-16T17:20:04Z

This is a somewhat unsatisfactory PR as I can't replicate this reliably.

As seen in #451 and #315 if the R process is killed when writing the Rds file it will create a corrupted rds file that will break the covr::package_coverage(). I've seen this rarely on gha builds (e.g. https://github.com/mrc-ide/rrq/pull/40/checks?check_run_id=2363889569) when using callr::r_bg fairly extensively, but have been unable to trigger in a MWE.

The downside is that the coverage report may be somewhat partial (and a message could be added during reading to indicate that) but continuing despite failure might be more graceful for relatively little additional complexity.

CLAassistant · 2021-04-16T17:20:09Z

All committers have signed the CLA.

jimhester · 2021-04-16T17:29:22Z

I am pretty sure this is due to processx using SIGKILL in a number of places and the finalizer that is writing the coverage is interrupted.

@gaborcsardi would know better than I if there is a way to have the improved behavior for this use case, the current behavior is unfortunate.

gaborcsardi · 2021-04-16T17:40:01Z

Sometimes you have to use sigkill, because R in the subprocess is not interruptible.

jimhester · 2021-04-16T18:35:34Z

Sure, but if the process is writing the coverage out in an exit finalizer the process is already in the process of exiting, so there must be some case where the SIGKILL is too eager.

gaborcsardi · 2021-04-16T19:03:39Z

@richfitz You can add some $signal() calls to your event loop layer if you want a graceful shutdown. processx is a bit too low level to implement this, without a standard event loop.

Or, if you don't mind sequential shutdown, then you can probably add a finalizer to your processes, that sends a SIGTERM first and then after a timeout a SIGKILL.

On Windows there is no SIGTERM, but you can $interrupt() first.

richfitz · 2021-04-19T09:44:33Z

@richfitz You can add some $signal() calls to your event loop layer if you want a graceful shutdown. processx is a bit too low level to implement this, without a standard event loop.

I don't think that this is the high level parts, unfortunately - I think that this is probably where a task has been spawned and is later cleaned up? I can't easily replicate unfortunately, and after replacing all the $kill calls I had (which I see uses SIGKILL after SIGTERM + grace period) I am still seeing intermittent failures - perhaps 1/3 to 1/4 of the time, so enough to be hard to work with but not enough to know if random fixes are helping!

I've not seen anything running covr::package_coverage() locally so I expect this is an issue with the speed at which the process is closed in the context of fairly slow GHA runners?

I do think that the current PR as it stands may be of use, but as always up to you for if you agree :)

Don't fail if rds files are corrupt

460f695

richfitz marked this pull request as ready for review April 19, 2021 07:48

M-Kusumgar mentioned this pull request Mar 1, 2024

Add basic env variables to every worker mrc-ide/rrq#110

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't fail if rds files are corrupted #471

Don't fail if rds files are corrupted #471

richfitz commented Apr 16, 2021

CLAassistant commented Apr 16, 2021 •

edited

jimhester commented Apr 16, 2021

gaborcsardi commented Apr 16, 2021

jimhester commented Apr 16, 2021

gaborcsardi commented Apr 16, 2021

richfitz commented Apr 19, 2021

Don't fail if rds files are corrupted #471

Are you sure you want to change the base?

Don't fail if rds files are corrupted #471

Conversation

richfitz commented Apr 16, 2021

CLAassistant commented Apr 16, 2021 • edited

jimhester commented Apr 16, 2021

gaborcsardi commented Apr 16, 2021

jimhester commented Apr 16, 2021

gaborcsardi commented Apr 16, 2021

richfitz commented Apr 19, 2021

CLAassistant commented Apr 16, 2021 •

edited