
in_tail is not picking up all log files in /var/log/containers/*.log #3357

Closed

davidbhoward opened this issue Apr 27, 2021 · 7 comments

@davidbhoward
Check CONTRIBUTING guideline first and here is the list to help us investigate the problem.

Describe the bug
The in_tail plugin isn't following all files on Fluentd startup.

To Reproduce
I was troubleshooting why we weren't getting logs from a specific pod. I killed the container using the kubectl delete pod command. When it came back up, Fluentd wasn't following all of the files that exist in the /var/log/containers/ directory: it followed 5 of the 43 logs that currently exist there.

Expected behavior
in_tail should follow every log file (symlink) that exists in the /var/log/containers/ directory.

Your Environment

  • Fluentd or td-agent version: fluentd --version or td-agent --version: fluentd 1.12.2
    More specifically, commit 2613fcb (we build from source)
  • Ruby: 2.7.2
  • Operating system: cat /etc/os-release: CentOS 7
  • Kernel version: uname -r: 4.18.0-240.15.1.el8_3.x86_64

If you hit the problem with older fluentd version, try latest version first. Done.

Your Configuration

  <source>
    @type tail
    @id source_tail_container_logs
    @log_level "debug"
    path "/var/log/containers/*.log"
    pos_file "/var/log/fluentd-containers.log.pos"
    tag "kubernetes.*"
    @label @ALL
    read_from_head true
    follow_inodes true
    <parse>
      @type "json"
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      unmatched_lines
      time_type string
    </parse>
  </source>

Your Error Log

2021-04-27 15:38:13 -0700 [info]: starting fluentd worker pid=1 ppid=0 worker=0
2021-04-27 15:38:13 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service1.log
2021-04-27 15:38:13 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 15:38:13 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service2.log
2021-04-27 15:38:13 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service3.log
2021-04-27 15:38:13 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service4.log
2021-04-27 15:38:13 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service5.log
2021-04-27 15:38:13 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 15:38:42 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 2, namespace_cache_miss: 23, pod_cache_api_updates: 2, id_cache_miss: 2
2021-04-27 15:38:52 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 15:39:13 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 48, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:39:43 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 73, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:40:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 97, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:40:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 121, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:41:15 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 147, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:41:45 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 172, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:42:15 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 196, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:42:45 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 220, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:43:15 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 245, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:43:45 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 269, pod_cache_api_updates: 3, id_cache_miss: 3
2021-04-27 15:44:15 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 3, namespace_cache_miss: 293, pod_cache_api_updates: 3, id_cache_miss: 3

Additional context

The last file followed here, service5.log, is one of our larger log files; at the time it was about 350 MB (our rotation limit is 500 MB), and I am wondering whether that is a potential cause of this issue. I can't imagine this is enough information to really dig deep, so please feel free to engage me in the Fluentd Slack channel and I can provide more context or gather whatever information would help here.
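
One thing I plan to check if this happens again is whether the missing files show up in the pos_file at all. As far as I understand, /var/log/fluentd-containers.log.pos contains tab-separated lines of path, offset (hex), and inode (hex), so comparing the paths listed there against the contents of /var/log/containers/ should show which files in_tail has registered versus which it has never picked up.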

@davidbhoward
Author

I have done some additional testing: after evicting the service5 pod, which had the large log file, and then restarting the Fluentd pod, all log files are now picked up. I suspect this comes down to a couple of questions: how big is too big for in_tail, and are there any optimizations that can be made to support this scenario?

I am scouring https://docs.fluentd.org/input/tail, and the only thing I see related to large files is https://docs.fluentd.org/input/tail#read_from_head, which we do have enabled.

I also see https://docs.fluentd.org/input/tail#enable_watch_timer, which I need to research a bit more, but it could potentially help us since we are currently using the default value of true.

Is there anything else that you would recommend here?
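
For reference, this is the kind of variation I was thinking of testing first. It is only a sketch, assuming enable_stat_watcher and enable_watch_timer behave as described in the in_tail docs, with everything else left as in our config above:

  <source>
    @type tail
    @id source_tail_container_logs
    @log_level "debug"
    path "/var/log/containers/*.log"
    pos_file "/var/log/fluentd-containers.log.pos"
    tag "kubernetes.*"
    @label @ALL
    read_from_head true
    follow_inodes true
    # experimental tuning: turn off the additional timer-based watching
    # and rely on the inotify-based stat watcher only
    enable_stat_watcher true
    enable_watch_timer false
    <parse>
      @type "json"
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      time_type string
    </parse>
  </source>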

@ashie
Member

ashie commented Apr 28, 2021

Is it reproducible?
If so, could you try removing follow_inodes?
It was introduced in v1.12, so it might not be proven enough yet.
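
For example, a minimal sketch based on your posted config, with only follow_inodes removed and everything else unchanged:

  <source>
    @type tail
    @id source_tail_container_logs
    @log_level "debug"
    path "/var/log/containers/*.log"
    pos_file "/var/log/fluentd-containers.log.pos"
    tag "kubernetes.*"
    @label @ALL
    read_from_head true
    <parse>
      @type "json"
      time_format "%Y-%m-%dT%H:%M:%S.%NZ"
      time_type string
    </parse>
  </source>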

@davidbhoward
Author

davidbhoward commented Apr 28, 2021

I went to the K8s node where the evicted service5 pod landed, and everything appears to be working as expected. I have also deleted the Fluentd pod again to see whether I could reproduce the behavior I saw before, but things are still working.

@ashie, if this happens again, is there any better information I could capture that would help troubleshoot this? Just let me know what I can capture if this happens again and I will do so. Thanks for looking at this.

@ashie
Member

ashie commented Apr 28, 2021

Similar issue?: #3239

@davidbhoward
Author

davidbhoward commented Apr 28, 2021

I wanted to see if Fluentd was able to read the full file and move on, and indeed I can see the following log lines in Fluentd:

2021-04-27 19:13:09 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service3.log
2021-04-27 19:13:09 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 19:13:09 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service4.log
2021-04-27 19:13:09 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service5.log
2021-04-27 19:13:09 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 19:13:11 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 19:13:15 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 19:13:19 -0700 [info]: disable filter chain optimization because [Fluent::Plugin::KubernetesMetadataFilter, Fluent::Plugin::ConcatFilter] uses `#filter_stream` method.
2021-04-27 19:13:38 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 23, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:14:08 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 48, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:14:39 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 76, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:15:09 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 101, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:15:40 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 129, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:16:10 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 154, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:16:40 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 181, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:17:10 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 206, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:17:40 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 233, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:18:10 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 258, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:18:41 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 286, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:19:12 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 313, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:19:42 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 339, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:20:12 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 365, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:20:43 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 392, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:21:13 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 418, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:21:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 445, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:22:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 472, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:22:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 498, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:23:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 525, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:23:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 551, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:24:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 577, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:24:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 603, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:25:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 631, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:25:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 658, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:26:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 684, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:26:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 710, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:27:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 736, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:27:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 762, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:28:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 788, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:28:44 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 814, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:29:14 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 841, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:29:45 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 867, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:30:15 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 893, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:30:45 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 920, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:31:15 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 946, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:31:45 -0700 [info]: [filter_kube_metadata] stats - namespace_cache_size: 0, pod_cache_size: 9, namespace_cache_miss: 972, pod_cache_api_updates: 9, id_cache_miss: 9
2021-04-27 19:31:54 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service6.log
2021-04-27 19:31:54 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service7.log
2021-04-27 19:31:54 -0700 [info]: [source_tail_container_logs] following tail of /var/log/containers/service8.log

This would indicate that it took about 18 minutes to read that file; once it finished reading it, it moved on to the remaining files. I suspect the cause here is the large log file, which means I can't reproduce exactly what I saw before. I will dig more into this to see if it is related to #3239. Thanks, @ashie.
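
Rough arithmetic, assuming the file was still around 350 MB at that point: the gap between following service5.log at 19:13:09 and moving on to service6/7/8 at 19:31:54 is about 1,125 seconds, which works out to roughly 300 KB/s of read throughput for that one file.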

@davidbhoward
Author

I am going to close this out and we can talk on #3239 if I find any relationship there. Thank you again, @ashie, for digging into this with me.

@ashie
Member

ashie commented May 11, 2021

I've now found that #2478 is probably almost the same issue as this one.
