Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1.12.0: tail plugin crashes worker upon ENOENT (cannot stat() file) #3274

Closed
jgehrcke opened this issue Mar 2, 2021 · 5 comments · Fixed by #3275
Closed

1.12.0: tail plugin crashes worker upon ENOENT (cannot stat() file) #3274

jgehrcke opened this issue Mar 2, 2021 · 5 comments · Fixed by #3275

Comments

@jgehrcke
Copy link
Contributor

jgehrcke commented Mar 2, 2021

Describe the bug

The in_tail plugin might discover a new log file, and then crash the worker upon the stat() system call due to ENOENT:

[info]: gem 'fluentd' version '1.12.0'

...

[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/root_agent.rb:200:in `block in start'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:234:in `start'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:353:in `refresh_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:390:in `start_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:390:in `each_value'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:409:in `block in start_watchers'
[error]: #0 /usr/local/bundle/gems/fluentd-1.12.0/lib/fluent/plugin/in_tail.rb:409:in `stat'
[error]: #0 unexpected error error_class=Errno::ENOENT error="No such file or directory @ rb_file_s_stat - /var/log/containers/foo-55fc795779-l66hb_foo_foo-ee7d580da4f6fac071b1e0fa8533fcada239a88404ed2fcde16480fbcd4a39fb.log"

I think this is a classical race condition. As this is part of the inotify-based watch system, of course there's the chance that between being delivered the signal that a new file is there and the code accessing it the file might already have disappeared again.

To Reproduce
Run in an environment that has log files popping up and quickly disappear thereafter (I guess, no clear repro on our side).

Expected behavior

At the system call boundary between application and OS the individual system calls like stat() / open() should always be wrapped with local error handling, expecting the system call to fail.

This can then be retried in graceful degradation fashion, w/o affecting the other operations performed by the worker process.

An appropriate action would be to log a warning/error showing how the system call failed.

Your Environment

Fluentd 1.12.0 on Linux

Not retried with 1.12.1 -- same problem seems to be there, based on commits and changelog.

Your Configuration

...

Your Error Log

see above

Additional context

The stat() call that crashes the worker is https://github.com/fluent/fluentd/blame/65a9edf4e05cf64c7ed0de56e12b3c9a0774e0d6/lib/fluent/plugin/in_tail.rb#L409

Looks like even before the last refactor this stat() wasn't protected:
https://github.com/fluent/fluentd/pull/3196/files#diff-456fdeb51bc472beb48891caac0d063e0073655dba7ac2b72e6fdc67dc6ac802R409

The worker subsequently restarts. As this affects the super early worker startup phase and results in an immediate crash, potentially a crash loop, I am not sure if this might result in data loss. #3224 has more detail about that.

@jgehrcke
Copy link
Contributor Author

jgehrcke commented Mar 2, 2021

Attempt to fix: #3275 -- quick review and feedback appreciated. Thanks!

@chadlwilson
Copy link

Thanks for your work here @jgehrcke - we noticed lost logs after 1.12 upgrade due to issues with in_tail and immediately reverted. Didn't have time/depth to dig into what the problem was, but race conditions on log file create/delete seems possible.

We have a number of very short-lived cron jobs whose logs I imagine can be created and then disappear before fluentd has time to tail them.

ashie added a commit that referenced this issue Mar 4, 2021
in_tail: expect ENOENT during stat (try to fix #3274 and #3224)
ashie added a commit to ashie/fluentd that referenced this issue Mar 4, 2021
ashie added a commit to ashie/fluentd that referenced this issue Mar 12, 2021
@mk23w-vmware
Copy link

Is there a timeline for when this bug fix will be released in a Fluentd Ruby Gem(https://rubygems.org/gems/fluentd)?

We're hitting this issue regularly and are eager to pick up the fix.

Also, since we're using https://github.com/fluent/fluentd-docker-image, we'd prefer a Fluentd Ruby Gem than simply applying a patch to our existing Fluentd instance.

@ashie
Copy link
Member

ashie commented Mar 18, 2021

We are planning to release 1.12.2 until end of this month.

@ashie
Copy link
Member

ashie commented Apr 2, 2021

We released v1.12.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants