in_tail takes a long time to find rotated files #2478

Closed
metayd opened this issue Jul 3, 2019 · 8 comments · Fixed by #3390


metayd commented Jul 3, 2019

Related to issue #2477.

Will there be any risk if I change read_more = true to read_more = false in my own copy of the in_tail source code, such as losing records or other issues?

We came across a problem here. We have ten log files. Our log generator writes records (0.5 KB per record) to each of them at 1000 lines/s and rotates a file by renaming it to another name (so the inode changes but the original filename stays the same) when the file reaches 100 MB. We use in_tail to collect these files.

We found two problems.

  1. When a file is rotated, it takes a long time for in_tail to discover the new inode of the log file, and after several rotations it seems in_tail simply loses some of the rotated files (rotation is quite fast).
  2. When starting fluentd, I found it takes a long time for in_tail to start_watchers for these files, judging from the fluentd log (I added some debug logging after setup_watcher). Combined with the rotation issue, it seems fluentd eventually loses some of the log files.

I'm not familiar with in_tail's performance characteristics, so maybe it's not in_tail's problem but the way I use it. CPU usage is not high and the flush speed is OK.

For the moment, I have worked around these two problems by changing read_more = true to read_more = false (it takes effect), but I'm not sure whether there is any risk with this change.

It would be very kind of you to give me some advice on these two problems.
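
For reference, a minimal in_tail configuration sketch matching the setup described above; the path, pos_file, tag, and parse settings here are placeholders I have assumed, not values taken from this report:

```
<source>
  @type tail
  # Hypothetical values; the report only says there are ten log files
  # rotated by rename when they reach 100 MB.
  path /var/log/app/*.log
  pos_file /var/log/fluentd/app.log.pos
  tag app.logs
  <parse>
    @type none
  </parse>
</source>
```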

metayd added the bug label Jul 3, 2019
metayd (Author) commented Jul 3, 2019

It seems that if we don't break the read loop at https://github.com/fluent/fluentd/blob/master/lib/fluent/plugin/in_tail.rb#L741, in_tail takes a very long time to setup_watcher and to read content from the other files when the log files are big.
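
For readers following along, here is a self-contained sketch of the control flow being discussed. It is not the actual in_tail source; the names (drain, emit, BATCH_LIMIT) are illustrative, and only the read_more flag and the 8192-byte readpartial chunk size come from the snippets quoted in this thread:

```ruby
# Illustrative sketch only, not the in_tail implementation.
BATCH_LIMIT = 1000  # hypothetical stand-in for read_lines_limit

def emit(lines)
  puts "emitted #{lines.size} lines"  # stand-in for handing lines to the router
end

def drain(io, read_more_enabled)
  loop do
    read_more = false
    lines = []
    begin
      while true
        chunk = io.readpartial(8192)      # same 8 KiB chunk size quoted above
        lines.concat(chunk.split("\n"))   # naive split; chunk-boundary handling omitted
        if lines.size >= BATCH_LIMIT
          # With read_more enabled the outer loop runs again immediately, so one
          # large file is drained to EOF before anything else gets a turn.
          # With it disabled, control returns to the caller after one batch.
          read_more = read_more_enabled
          break
        end
      end
    rescue EOFError
      # no more data in this file for now
    end
    emit(lines) unless lines.empty?
    break unless read_more
  end
end

# Usage with a hypothetical file:
#   File.open("big.log") { |f| drain(f, true)  }  # reads until EOF in one call
#   File.open("big.log") { |f| drain(f, false) }  # reads at most one batch per call
```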

repeatedly (Member) commented Jul 5, 2019

> When a file is rotated, it takes a long time for in_tail to discover the new inode of the log file, and after several rotations it seems in_tail simply loses some of the rotated files (rotation is quite fast).

Does this mean that detecting the new file takes a long time, not reading the data, right?

How about this parameter? https://docs.fluentd.org/input/tail#skip_refresh_on_startup

> For the moment, I have worked around these two problems by changing read_more = true to read_more = false (it takes effect), but I'm not sure whether there is any risk with this change.

I see. I understand the background of the previous issue now. The current implementation mainly aims to hit EOF frequently, so your approach is worth considering.

metayd (Author) commented Jul 5, 2019

@repeatedly

> Does this mean that detecting the new file takes a long time, not reading the data, right?

Yes. But there is also another problem: if read_more = true, in_tail seems to read only one of the log files for a long time (maybe several minutes) before it starts to read another file (I found that by watching the interval between pos file entry updates); it looks like in_tail gets stuck on one of them. If read_more = false, in_tail seems to read from all files evenly.

Changing read_more = true to read_more = false is the only way I have found to solve these two problems; I wish there were another way.

> How about this parameter? https://docs.fluentd.org/input/tail#skip_refresh_on_startup

It is already set to true.

> I see. I understand the background of the previous issue now. The current implementation mainly aims to hit EOF frequently, so your approach is worth considering.

The modified version of in_tail has been running for 2 days and it looks fine (though I'm not sure whether there are potential issues). I want to know what kind of risk changing read_more=true to read_more=false carries: record loss? high CPU or memory usage? anything else?

I don't get the meaning of "the current implementation mainly aims to hit EOF frequently". Does it mean there is a risk of losing some records at the end of a log file when it is rotated? I think in_tail would eventually reach the end of the file if the file were never rotated. How about setting read_more = true only after the file is rotated?

repeatedly (Member) commented Jul 16, 2019

> The modified version of in_tail has been running for 2 days and it looks fine (though I'm not sure whether there are potential issues). I want to know what kind of risk changing read_more=true to read_more=false carries: record loss? high CPU or memory usage? anything else?

Good to hear that there is no problem in your production environment. We will check several situations with your suggestion, and if there is no problem, the change will be included in v1.7.0.

metayd (Author) commented Jul 17, 2019

@repeatedly

> Good to hear that there is no problem in your production environment. We will check several situations with your suggestion, and if there is no problem, the change will be included in v1.7.0.

I'm sorry, but there is bad news.

I found that some records may be lost. I added some logging to print the file size and the fd position when the io handler is closed, and I found that sometimes the io handler is closed before fd.pos reaches the end of the file, which means some records are lost forever.

> The current implementation mainly aims to hit EOF frequently.

I think I get the meaning of that now.

When rotation happens, in_tail waits just 5 seconds according to the default rotate_wait setting. I guess that when read_more = true, handle_notify always hits the end of the file (though I'm not very sure). But when read_more=false, in_tail only continues to read the rotated file for rotate_wait seconds, reading 8192 bytes (according to io.readpartial(8192, @iobuf)) each time. So if fd.pos is far from the file size, or the flush speed of the output is slow, records at the end of the file are lost when it is rotated.

I resolved this by setting rotate_wait to a bigger value (the value was determined by testing in_tail against our log generation speed and output flush speed), but I don't know whether that is suitable for fluentd in general.
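
For illustration, a minimal sketch of that workaround in configuration form; the path, pos_file, and tag are placeholders, and the rotate_wait value is an assumption that would have to be tuned against the actual write and flush rates, as described above:

```
<source>
  @type tail
  path /var/log/app/*.log            # hypothetical path
  pos_file /var/log/fluentd/app.log.pos
  tag app.logs
  # Default is 5s; a larger value keeps watching the rotated (renamed) file
  # longer so in_tail can catch up before the old inode is dropped.
  rotate_wait 60s
  <parse>
    @type none
  </parse>
</source>
```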

> We found two problems.
> 1. When a file is rotated, it takes a long time for in_tail to discover the new inode of the log file, and after several rotations it seems in_tail simply loses some of the rotated files (rotation is quite fast).
> 2. When starting fluentd, I found it takes a long time for in_tail to start_watchers for these files, judging from the fluentd log (I added some debug logging after setup_watcher). Combined with the rotation issue, it seems fluentd eventually loses some of the log files.

These two problems remain, and their root cause is not clear. Changing read_more=true to read_more=false may not be a good solution.

repeatedly added a commit that referenced this issue Jul 26, 2019

Projoke commented Aug 6, 2019

I also modified read_more to false a while ago. In my test environment, this change reduces the CPU usage of the fluentd process but also decreases throughput.

ashie (Member) commented May 20, 2021

#3185 (with #3364 and #3379) has been merged, which is a similar fix to #2525.
read_bytes_limit_per_second 8192 will have the same effect as #2525.
It's included in v1.13.0 which will be released at the end of this month (May 2021).
Please try it.
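
For reference, a minimal sketch of how that parameter might be used (path, pos_file, and tag are placeholders; 8192 is the value suggested above):

```
<source>
  @type tail
  path /var/log/app/*.log            # hypothetical path
  pos_file /var/log/fluentd/app.log.pos
  tag app.logs
  # Available since v1.13.0: caps how many bytes in_tail reads from a file
  # per second, so a single busy file cannot monopolize the read loop.
  read_bytes_limit_per_second 8192
  <parse>
    @type none
  </parse>
</source>
```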

TODO:

> But when read_more=false, in_tail only continues to read the rotated file for rotate_wait seconds, reading 8192 bytes (according to io.readpartial(8192, @iobuf)) each time. So if fd.pos is far from the file size, or the flush speed of the output is slow, records at the end of the file are lost when it is rotated.

It's not addressed in #3185 yet. It should probably keep reading until EOF is reached.
I'm considering adding a fix for it.

ashie added commits to ashie/fluentd that referenced this issue on May 25–26, 2021: Fix fluent#2478
ashie (Member) commented Jun 1, 2021

> I'm considering adding a fix for it.

WIP: #3390

ashie added commits to ashie/fluentd that referenced this issue on Jun 4, Jun 10, and Jul 9, 2021: Fix fluent#2478