
in_tail is not picking up all log files in /var/log/containers/*.log #3357

Closed

davidbhoward opened this issue Apr 27, 2021 · 7 comments

@davidbhoward
Check CONTRIBUTING guideline first and here is the list to help us investigate the problem.

Describe the bug
The in_tail plugin isn't following all files on Fluentd startup.

To Reproduce
I was troubleshooting why we weren't getting logs from a specific pod. I killed the container using the kubectl delete pod command. When it came back up, Fluentd wasn't following all of the files that exist in the /var/log/containers/ directory: it followed 5 of the 43 logs that currently exist there.

Expected behavior
in_tail should follow every log file (symlink) that exists in the /var/log/containers/ directory.

Your Environment

  • Fluentd or td-agent version: fluentd --version or td-agent --version: fluentd 1.12.2
    More specifically, commit 2613fcb (we build from source)
  • Ruby: 2.7.2
  • Operating system: cat /etc/os-release: CentOS 7
  • Kernel version: uname -r: 4.18.0-240.15.1.el8_3.x86_64

If you hit the problem with older fluentd version, try latest version first. Done.

Your Configuration

  <source>
    @type tail
    @id source_tail_container_logs
    @log_level "debug"
    path "/var/log/containers/*.log"
    pos_file "/var/log/fluentd-containers.log.pos"
    tag "kubernetes.*"
    @label @ALL
    read_from_head true
    follow_inodes true
    <parse>
      @type "json"
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      unmatched_lines
      time_type string
    </parse>
  </source>

Your Error Log

2021-04-27 15:38:13 -0700 [info]: starting fluentd worker pid=1 ppid=0 worker=0
2021-04-27 15:38:13 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service1.log
2021-04-27 15:38:13 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 15:38:13 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service2.log
2021-04-27 15:38:13 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service3.log
2021-04-27 15:38:13 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service4.log
2021-04-27 15:38:13 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service5.log
2021-04-27 15:38:13 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 15:38:42 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 2, namespace_cache_miss: 23, pod_cache_api_updates: 2, id_cache_miss: 2
2021-04-27 15:38:52 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 15:39:13 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 48, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:39:43 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 73, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:40:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 97, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:40:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 121, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:41:15 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 147, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:41:45 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 172, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:42:15 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 196, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:42:45 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 220, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:43:15 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 245, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:43:45 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 269, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:44:15 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 293, pod_cache_api_updates: 3, id_cache_miss: 3

Additional context

The last file followed here, service5.log, is one of our larger log files; at the time it was about 350 MB (our rotation limit is 500 MB), and I am wondering whether that is a potential cause of this issue. I can't imagine this is enough information to really dig deep, so please feel free to engage me in the Fluentd Slack channel and I can provide more context or gather whatever information would help here.
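
One thing I plan to check if this happens again is whether the missing files show up in the pos_file at all. As far as I understand, /var/log/fluentd-containers.log.pos contains tab-separated lines of path, offset (hex), and inode (hex), so comparing the paths listed there against the contents of /var/log/containers/ should show which files in_tail has registered versus which it has never picked up.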

@davidbhoward
Author

I have done some additional testing: after evicting the service5 pod, which had the large log file, and then restarting the Fluentd pod, all log files are now picked up. I suspect this comes down to a couple of questions: how big is too big for in_tail, and are there any optimizations that can be made to support this scenario?

I am scouring https://docs.fluentd.org/input/tail, and the only thing I see related to large files is https://docs.fluentd.org/input/tail#read_from_head, which we do have enabled.

I also see https://docs.fluentd.org/input/tail#enable_watch_timer, which I need to research a bit more, but it could potentially help us since we are currently using the default value of true.

Is there anything else that you would recommend here?
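
For reference, this is the kind of variation I was thinking of testing first. It is only a sketch, assuming enable_stat_watcher and enable_watch_timer behave as described in the in_tail docs, with everything else left as in our config above:

  <source>
    @type tail
    @id source_tail_container_logs
    @log_level "debug"
    path "/var/log/containers/*.log"
    pos_file "/var/log/fluentd-containers.log.pos"
    tag "kubernetes.*"
    @label @ALL
    read_from_head true
    follow_inodes true
    # experimental tuning: turn off the additional timer-based watching
    # and rely on the inotify-based stat watcher only
    enable_stat_watcher true
    enable_watch_timer false
    <parse>
      @type "json"
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      time_type string
    </parse>
  </source>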

@ashie
Member

ashie commented Apr 28, 2021

Is it reproducible?
If so, could you try removing follow_inodes?
It was introduced in v1.12, so it might not be proven enough yet.
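
For example, a minimal sketch based on your posted config, with only follow_inodes removed and everything else unchanged:

  <source>
    @type tail
    @id source_tail_container_logs
    @log_level "debug"
    path "/var/log/containers/*.log"
    pos_file "/var/log/fluentd-containers.log.pos"
    tag "kubernetes.*"
    @label @ALL
    read_from_head true
    <parse>
      @type "json"
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      time_type string
    </parse>
  </source>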

@davidbhoward
Author

davidbhoward commented Apr 28, 2021

I went to the K8s node where the evicted service5 pod landed, and everything appears to be working as expected. I have also deleted the Fluentd pod again to see whether I could reproduce the behavior I saw before, but things are still working.

@ashie, if this happens again, is there any better information I could capture that would help troubleshoot this? Just let me know what I can capture if this happens again and I will do so. Thanks for looking at this.

@ashie
Member

ashie commented Apr 28, 2021

Similar issue?: #3239

@davidbhoward
Author

davidbhoward commented Apr 28, 2021

I wanted to see if Fluentd was able to read the full file and move on, and indeed I can see the following log lines in Fluentd:

2021-04-27 19:13:09 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service3.log
2021-04-27 19:13:09 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 19:13:09 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service4.log
2021-04-27 19:13:09 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service5.log
2021-04-27 19:13:09 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 19:13:11 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 19:13:15 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 19:13:19 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 19:13:38 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 23, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:14:08 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 48, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:14:39 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 76, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:15:09 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 101, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:15:40 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 129, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:16:10 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 154, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:16:40 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 181, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:17:10 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 206, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:17:40 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 233, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:18:10 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 258, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:18:41 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 286, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:19:12 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 313, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:19:42 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 339, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:20:12 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 365, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:20:43 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 392, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:21:13 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 418, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:21:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 445, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:22:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 472, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:22:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 498, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:23:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 525, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:23:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 551, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:24:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 577, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:24:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 603, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:25:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 631, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:25:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 658, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:26:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 684, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:26:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 710, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:27:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 736, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:27:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 762, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:28:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 788, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:28:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 814, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:29:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 841, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:29:45 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 867, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:30:15 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 893, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:30:45 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 920, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:31:15 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 946, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:31:45 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 972, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:31:54 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service6.log
2021-04-27 19:31:54 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service7.log
2021-04-27 19:31:54 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service8.log

This would indicate that it took about 18 minutes to read that file; once it finished reading it, it moved on to the remaining files. I suspect the cause here is the large log file, which means I can't reproduce exactly what I saw before. I will dig more into this to see if it is related to #3239. Thanks, @ashie.
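
Rough arithmetic, assuming the file was still around 350 MB at that point: the gap between following service5.log at 19:13:09 and moving on to service6/7/8 at 19:31:54 is about 1,125 seconds, which works out to roughly 300 KB/s of read throughput for that one file.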

@davidbhoward
Author

I am going to close this out and we can talk on #3239 if I find any relationship there. Thank you again, @ashie, for digging into this with me.

@ashie
Member

ashie commented May 11, 2021

I've now found that #2478 is probably almost the same issue as this one.
