
Avoid SpoolReader/SpoolWriter races with synchronized (atomic-like) updates to spool files #983

Open
wants to merge 1 commit into base: stable-2.0

Conversation

zmousm
Contributor

@zmousm zmousm commented Jul 6, 2018

munin-asyncd by default queries munin-node in turn for each plugin and writes to spool files. With --fork this happens in parallel. On the other hand munin-async iterates through spool files and scans them for new data.

Nothing stops the spool reader from reading a file while (or before) the spool writer writes to it in a particular round of updates. After such a spoolfetch, the master will have missed any updates written to spool files after the spool reader visited them. If it did fetch data that had already been written in that round, the master will also have bumped the spoolfetch timestamp, and will therefore skip over the missed data in subsequent spoolfetches, which means data loss.

This race condition is particularly evident when spoolfetch and the spool writer coincide. For example, if munin-update runs on the master every 5 minutes and munin-asyncd wakes up at the same time, the nodes visited by munin-update in the first 30-60 seconds are the most likely to exhibit this problem. munin-asyncd --fork does help, but again, if some munin plugins take a long time to run, the respective services will be susceptible to data loss.

The proposed patch addresses the problem by emulating atomic-like updates to spool files: SpoolWriter writes updates to copies of the spool files and moves (renames) them over in one go at the end. In fork mode, this involves pipe IPC, where munin-asyncd reads the list of files to be committed from the stdout of its child processes.

This serves to protect against SpoolWriter/SpoolReader race
conditions: Copy spool files, update them and move (rename) them over
in one go at the end. In fork mode, this involves pipe IPC.
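The copy-update-rename pattern described above can be sketched as follows. This is a hypothetical illustration in Python, not the actual Perl implementation from the patch; `update_spool_atomically` and the sample file layout are assumptions. It relies on rename() being atomic within a filesystem, so a concurrent reader sees either the old or the new spool file, never a half-written one.

```python
import os
import shutil
import tempfile

def update_spool_atomically(spool_path, new_lines):
    """Append new samples to a spool file without exposing partial writes.

    Sketch of the approach in this PR: copy the spool file aside, append
    the updates to the copy, then rename the copy over the original in
    one go. rename() is atomic on POSIX, so a reader never observes a
    half-written dataset.
    """
    dirname = os.path.dirname(spool_path) or "."
    # Create the temporary copy in the same directory, so the final
    # rename stays within one filesystem and remains atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix=".spool-")
    os.close(fd)
    if os.path.exists(spool_path):
        shutil.copy2(spool_path, tmp_path)  # start from current contents
    with open(tmp_path, "a") as f:
        for line in new_lines:
            f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())  # make sure data is on disk before committing
    os.replace(tmp_path, spool_path)  # atomic commit over the original
```

In fork mode, each child would write its temporary copies and report their paths on stdout, and the parent would perform the final renames for all files in one pass.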
@steveschnepp
Member

While I do agree with the issue, I'm not really at ease with heavy file copying.

It would be much more convenient to add some kind of locking via flock().

I also thought about enhancing the protocol inside the spoolfetch files so that the reader is able to detect a half-written dataset.
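The flock() alternative mentioned here could look like the following. This is a minimal sketch, not code from munin; the function names are hypothetical, and it assumes POSIX advisory locks (all parties must cooperate by taking the lock). The writer holds an exclusive lock while appending, so a reader taking a shared lock never sees a partial write.

```python
import fcntl
import os

def locked_spool_update(spool_path, lines):
    """Writer side: exclusive advisory lock around the append."""
    with open(spool_path, "a") as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)  # blocks until readers release
        try:
            for line in lines:
                f.write(line + "\n")
            f.flush()
            os.fsync(f.fileno())
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)

def locked_spool_read(spool_path):
    """Reader side: shared lock allows concurrent readers, excludes writers."""
    with open(spool_path) as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_SH)
        try:
            return f.read()
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)
```

Compared with copy-and-rename, this avoids duplicating spool files on disk, at the cost of the reader potentially blocking behind a slow writer (e.g. a long-running plugin) instead of reading the last committed snapshot.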

@kimheino
Contributor

Commit a48026a tries to fix this (or similar case?).

Well... it doesn't fix this; it is just a workaround that hides the problem.

sumpfralle added a commit that referenced this pull request May 29, 2019
2.0.48

The following changes are omitted:
* #1176: Ignore malformed lines when parsing "datafile"
* #983: Let the server use max timestamp rather than now for
  non-timestamped dirtyconfig values
    * it should be checked whether this PR still applies to master