
in_tail pos_file_compaction_interval corrupts position files #2918

Closed
juliantaylor opened this issue Mar 27, 2020 · 1 comment · Fixed by #2922

juliantaylor commented Mar 27, 2020

Describe the bug
When using in_tail position file compaction via pos_file_compaction_interval, the position files become corrupted.

To Reproduce
Reproduced on a Kubernetes cluster with compaction running every 24 seconds and several pods in CrashLoopBackOff, so there were entries to compact.

Expected behavior
No corruption of the position files.

Your Environment

Fluentd or td-agent version: 1.9.2 (td-agent 3.6.0)
Kernel version: uname -r

Your Configuration
Standard in_tail configuration with pos_file_compaction_interval 24.
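
For reference, such a configuration looks roughly like the following; the paths, tag, and parser are illustrative, the relevant setting is pos_file_compaction_interval:

<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  pos_file_compaction_interval 24
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
  </parse>
</source>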

Your Error Log
The corrupted position file looks like this; note the ffffffffffffffff entry and the missing separator before the following path:

/var/log/containers/container-1c15005e5e6978f4652c9bcce679f416654fb6d9e304a0d5d3b476b3c4bfd734.log	0000000000031a96	00000000008f589b
ffffffffffffffff448/var/log/containers/container-3f06aac20b6a5b1eadee53f4a891fcfbc3f9c365fcf649c6a457b063bbb73671.log	0000000000016215	00000000008f58ca
/var/log/containers/container-319527df0aaf8a2c698c775340981126709f67e05e29566ab7689d72001a7b43.log	00000000000330ed	00000000008f5879

There are actually many null bytes inserted at the problematic spot:

\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00ffffffffffffffff448/var/log/containers/container-3f06aac20b6a5b1eadee53f4a891fcfbc3f9c365fcf649c6a457b063bbb73671.log\t0000000000016215\t00000000008f58ca\n

The problem is likely caused by a race condition in try_compact during the fetch_compacted_entries call:
https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/in_tail/position_file.rb#L90

That call is performed outside of the lock, but it reads the position file. It can therefore read the file while another thread is modifying it, and the writes to the position file are not atomic at the filesystem level.

A fix could either move the fetch inside the mutex lock or make the position file writes atomic via `rename`.
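
A minimal sketch of the second option, assuming the existing path/offset/inode line format; write_pos_file_atomically is a hypothetical helper for illustration, not fluentd's actual API:

require 'securerandom'

# Hypothetical helper: write the compacted entries to a temporary file and
# atomically replace the position file, so a concurrent reader sees either
# the old file or the complete new one, never a partial write.
def write_pos_file_atomically(path, entries)
  tmp_path = "#{path}.compacting.#{SecureRandom.hex(4)}"
  File.open(tmp_path, 'w') do |tmp|
    entries.each do |target_path, pos, inode|
      # same line format as the position file: path, offset (hex), inode (hex)
      tmp.write("#{target_path}\t#{'%016x' % pos}\t#{'%016x' % inode}\n")
    end
    tmp.fsync # flush to disk before the rename
  end
  File.rename(tmp_path, path) # rename(2) within one filesystem is atomic
end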


ganmacs commented Mar 31, 2020

#2922 should fix this issue.
As I wrote in #2805 (comment), I believe a race condition cannot happen during the fetch_compacted_entries call.
