
in_exec with out_exec in one config create file descriptors leak #3627

Open
LukasJerabek opened this issue Feb 9, 2022 · 22 comments · Fixed by #3844

LukasJerabek commented Feb 9, 2022

Describe the bug

Hello, I think I have found the core issue behind my problems with my fluentd configuration. Could someone have a look at this, please?
I have set up everything according to https://docs.fluentd.org/installation/before-install.
Using the configuration and the two scripts below, I noticed that if I run the following for a while:

cd /proc/<fluentd PID>/ 
find fd | wc -l

Then I can see that the number of file descriptors rises without bound, eventually resulting in memory problems and a "too many open files" error once the fd limit (ulimit -n) is reached. This only occurs if both in_exec and out_exec are used: if I make either of the following swaps, the number of file descriptors stabilizes and the problem disappears.

The file descriptor count stabilizes if I swap the match section for this:

<match hardware.memory>
  @type file
  append true
  path {{ home }}/fluentd/test_bug2.log
  add_path_suffix false
  <buffer []>
    @type memory
    flush_mode immediate
  </buffer>
  <format>
    @type json
  </format>
</match>

or swap the source and filter sections for this:

<source>
  @type sample
  sample {"mem_available":"1.5","swap_total":"1.5","swap_free":"1.2","time_local":"2022-02-09T10:00:00+0100"}
  tag hardware.memory
</source>

To Reproduce

Install fluentd:
gem install fluentd --version 1.14.5 --no-doc
Create a folder "fluentd" in your home folder.
Replace {{ home }} or {{ user }} with your real values in every file below.
Put fluentd.conf (with the configuration below), get_memory_metrics.sh, and psql_memory.sh into {{ home }}/fluentd/
Run chmod +x on get_memory_metrics.sh and psql_memory.sh
Run {{ home }}/bin/fluentd 1.14.5 -c {{ home }}/fluentd/fluentd.conf -vv

Now start checking the file descriptors in /proc/<fluentd PID>/ with:
find fd | wc -l
Keep checking it for some time; you'll see that the count of file descriptors never stops rising.
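
For longer observation runs, here is a minimal polling sketch (an illustration added here, not part of the original report; the PID and interval are placeholders) that logs the total fd count and how many of the fds are pipes:

fd_watch.rb

#!/usr/bin/env ruby
# Hypothetical helper: poll /proc/<pid>/fd and log how many file descriptors
# are open and how many of them are pipes.
# Usage: ruby fd_watch.rb <fluentd worker PID> [interval_seconds]

pid = ARGV.fetch(0) { abort "usage: #{$0} <pid> [interval_seconds]" }
interval = (ARGV[1] || 10).to_i

loop do
  fds = Dir.children("/proc/#{pid}/fd")
  pipes = fds.count do |fd|
    begin
      File.readlink("/proc/#{pid}/fd/#{fd}").start_with?("pipe:")
    rescue SystemCallError
      false # the fd was closed between listing and readlink
    end
  end
  puts "#{Time.now.strftime('%Y-%m-%dT%H:%M:%S%z')} total=#{fds.size} pipes=#{pipes}"
  sleep interval
end

Run it from a second shell against the worker PID; with the leaking configuration the pipes count is the number to watch.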

Hopefully I did not miss anything.

Expected behavior

The number of file descriptors does not rise indefinitely and does not cause memory leaks.

Your Environment

- Fluentd version: 1.14.5 (installed as gem)
- TD Agent version: -
- Operating system: Rocky Linux 8
- Kernel version: 4.18.0-348.12.2.el8_5.x86_64
- Ruby version: 2.5.9

Your Configuration

<system>
  process_name fluentd_monitor
</system>

<source>
  @type exec
  tag hardware.memory
  run_interval 10s
  command {{ home }}/fluentd/get_memory_metrics.sh
  <parse>
    @type json
    types mem_available:float,swap_total:float,swap_free:float
  </parse>
</source>

<filter hardware.memory>
  @type record_transformer
  enable_ruby
  <record>
    time_local ${time.strftime('%Y-%m-%dT%H:%M:%S%z')}
  </record>
</filter>

<match hardware.memory>
  @type exec
  command {{ home }}/fluentd/psql_memory.sh first_argument second_argument
  <format>
    @type csv
    force_quotes false
    delimiter |
    fields time_local,mem_available,swap_total,swap_free
  </format>
  <buffer time>
    @type file
    path {{ home }}/fluentd/hardware.memory.buff
    timekey 50s
    timekey_wait 0s
    flush_mode lazy
    total_limit_size 200MB
    retry_type periodic  # exponential backoff is broken https://github.com/fluent/fluentd/issues/3609
    retry_max_times 0  # need to go to secondary as soon as possible https://github.com/fluent/fluentd/issues/3610
  </buffer>
  <secondary>
    @type secondary_file
    directory {{ home }}/fluentd/secondary/
    basename dump.${chunk_id}
  </secondary>
</match>

Your Error Log

No errors in fluentd.log using --v

Additional context

get_memory_metrics.sh

#!/bin/bash
mem_info=$(awk '{ \
    if (/MemAvailable:/) {mem_available=$2}; \
    if (/SwapTotal:/) {swap_total=$2}; \
    if (/SwapFree:/) {swap_free=$2}; \
    if (mem_available && swap_total && swap_free) \
      {printf "%.3f %.3f %.3f", mem_available/1024/1024, swap_total/1024/1024, swap_free/1024/1024;exit} \
  }' /proc/meminfo)
read -r mem_available swap_total swap_free<<<"${mem_info}"

printf '{"mem_available":"%s","swap_total":"%s","swap_free":"%s"}' \
"$mem_available" \
"$swap_total" \
"$swap_free"

psql_memory.sh

#!/bin/sh
echo $1 $2 $3
LukasJerabek changed the title from "in_exec with out_exec in one config creates file descriptors leak" to "in_exec with out_exec in one config create file descriptors leak" Feb 9, 2022

kenhys commented Feb 15, 2022

It seems to be reproducible with a simpler in_exec-only example. It may also be reproducible on the out_exec side.

<system>
  process_name fluentd_monitor
</system>

<source>
  @type exec
  tag hardware.memory
  run_interval 10s
  command PATH_TO_COMMAND
  <parse>
    @type json
    types mem_available:float,swap_total:float,swap_free:float
  </parse>
</source>

<match hardware.memory>
  @type stdout
</match>


LukasJerabek commented Feb 15, 2022

I fixed get_memory_metrics.sh in my example, which did not work; the printf part was missing. I am going to try your example, but I was not able to reproduce this with only in_exec or only out_exec in the config; both had to be present.


LukasJerabek commented Feb 15, 2022

Yes, I can confirm that your example causes this too. So it's not the combination of both that causes this.

EDIT: wrong, it was too soon to judge.

LukasJerabek commented:

I am sorry, it was too soon to judge; it has stabilized at around 174 FDs. So my first observation still applies: I only see the FD count rise indefinitely with the combination of both in_exec and out_exec.

ashie added the memory label Mar 15, 2022
github-actions bot commented:

This issue has been automatically marked as stale because it has been open 90 days with no activity. Remove stale label or comment or this issue will be closed in 30 days

github-actions bot added the stale label Jun 13, 2022
LukasJerabek commented:

Bump.

ashie removed the stale label Jun 13, 2022

ashie commented Jul 28, 2022

I checked it but can't reproduce it.
I modified your config a little because it didn't work properly for me:

fluent.conf

<system>
  process_name fluentd_monitor
</system>

<source>
  @type exec
  tag hardware.memory
  run_interval 10s
  command /home/aho/Projects/Fluentd/fluentd/issues/3627-exec-memleak/get_memory_metrics.sh
  <parse>
    @type json
    types mem_available:float,swap_total:float,swap_free:float
  </parse>
</source>

<filter hardware.memory>
  @type record_transformer
  enable_ruby
  <record>
    time_local ${time.strftime('%Y-%m-%dT%H:%M:%S%z')}
  </record>
</filter>

<match hardware.memory>
  @type exec
  command /home/aho/Projects/Fluentd/fluentd/issues/3627-exec-memleak/psql_memory.sh
  <format>
    @type csv
    force_quotes false
    delimiter |
    fields time_local,mem_available,swap_total,swap_free
  </format>
  <buffer time>
    @type file
    path /home/aho/Projects/Fluentd/fluentd/issues/3627-exec-memleak/hardware.memory.buff
    timekey 50s
    timekey_wait 0s
    flush_mode lazy
    total_limit_size 200MB
    retry_type periodic  # exponential backoff is broken https://github.com/fluent/fluentd/issues/3609
    retry_max_times 0  # need to go to secondary as soon as possible https://github.com/fluent/fluentd/issues/3610
  </buffer>
</match>

get_memory_metrics.sh:

#!/bin/bash
mem_info=$(awk '{ \
    if (/MemAvailable:/) {mem_available=$2}; \
    if (/SwapTotal:/) {swap_total=$2}; \
    if (/SwapFree:/) {swap_free=$2}; \
    if (mem_available && swap_total && swap_free) \
      {printf "%.3f %.3f %.3f", mem_available/1024/1024, swap_total/1024/1024, swap_free/1024/1024;exit} \
  }' /proc/meminfo)
read -r mem_available swap_total swap_free<<<"${mem_info}"

printf '{"mem_available":"%s","swap_total":"%s","swap_free":"%s"}' \
"$mem_available" \
"$swap_total" \
"$swap_free"

(<<< doesn't work for me so I modified the shebang to use bash explicitly.)

psql_memory.sh:

#!/bin/sh
cat $1 >> /home/aho/Projects/Fluentd/fluentd/issues/3627-exec-memleak/memusage.txt
  • The argument of the command is the path of a buffer chunk file.
  • The contents are dumped to a file to confirm that the command is called as expected.

Environment:

  • Ubuntu 22.04
  • Ruby 3.0.2 & 2.7.5
  • Fluentd 1.15.1

Results:

$ sudo find /proc/2591/fd | wc -l
13
$ cat /home/aho/Projects/Fluentd/fluentd/issues/3627-exec-memleak/memusage.txt
2022-07-28T21:43:28+0900|22.107|15.999|15.983
2022-07-28T21:43:38+0900|22.107|15.999|15.983
2022-07-28T21:43:48+0900|22.107|15.999|15.983
...

The fd count is always 13.


ashie commented Jul 28, 2022

It might depend on the installed gems (low possibility though); please try updating them.


LukasJerabek commented Jul 28, 2022

(<<< doesn't work for me so I modified the shebang to use bash explicitly.)

I don't see any difference in the script you posted; there is a bash shebang in my version of the script as well.

The only difference I see is that you are missing the secondary section. Could you try it with that? The {{ home }} was meant to be replaced according to your environment.

I will try again later and post if I still observe the problem.

LukasJerabek commented:

Also, did you try sudo find /proc/2591/fd | wc -l around 15 minutes after fluentd was started, or only at the beginning?


ashie commented Jul 28, 2022

The only difference I see is that you are missing the secondary section. Could you try it with that?

I also tried with the secondary section but got almost the same result.

Also, did you try sudo find /proc/2591/fd | wc -l around 15 minutes after fluentd was started, or only at the beginning?

I tried it every 1 sec for 2 hours.


LukasJerabek commented Jul 28, 2022

I still see the problem even with version 1.15.1. Roughly 10 minutes in there are 90 FDs in my case, after 20 minutes 160, after 30 minutes 233, and so on...

  1. Install Vagrant: https://www.vagrantup.com/downloads
  2. Create a Vagrantfile:
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "rockylinux/8"
  config.vm.box_check_update = true
  config.vm.hostname = "rocky8"
  config.vm.network "private_network", ip: "192.168.56.121"
  config.vm.define "rocky8"
  config.vm.provider "virtualbox" do |vb|
    vb.name = "rocky8"
    vb.gui = false
    vb.memory = 1024
    vb.cpus = 1
    vb.customize [ "guestproperty", "set", :id, "/VirtualBox/GuestAdd/VBoxService/--timesync-set-threshold", 1000 ]
  end
  config.vm.network :forwarded_port, guest: 22, host: 4444, id: 'ssh'
  config.ssh.extra_args = ["-L", "8000:localhost:8000"]
end
  3. vagrant up
  4. vagrant ssh
  5. sudo dnf -y module enable ruby:3.0
  6. sudo dnf install gcc make cmake ruby ruby-devel
  7. gem install fluentd --version 1.15.1 --no-doc
  8. cd ~
  9. mkdir fluentd
  10. cd fluentd
  11. create fluentd.conf, get_memory_metrics.sh, and psql_memory.sh from my first post above
  12. switch {{ home }} for /home/vagrant in all files from my first post above
  13. chmod +x on both .sh files
  14. cd ~/fluentd
  15. ~/bin/fluentd -c fluentd.conf
  16. start another shell (vagrant ssh from the vagrant folder again)
  17. ps aux | grep fluentd
  18. take the PID of worker:fluentd_monitor
  19. cd /proc/<PID>
  20. find fd | wc -l
  21. observe the forever-rising number of fds

LukasJerabek commented:

I use Vagrant in this example, but I have also seen this on a rented KVM cloud instance.


ashie commented Jul 29, 2022

Thanks, I reproduced it on Rocky Linux 8:

[vagrant@localhost ~]$ ls -l /proc/5844/fd/
total 0
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 0 -> /dev/pts/1
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 1 -> /dev/pts/1
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 10 -> 'pipe:[58579]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 12 -> 'pipe:[58126]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 13 -> 'pipe:[58620]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 14 -> 'pipe:[58589]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 15 -> 'pipe:[58600]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 16 -> 'pipe:[58608]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 17 -> 'pipe:[58273]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 18 -> 'pipe:[58630]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 19 -> 'pipe:[58184]'
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 2 -> /dev/pts/1
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 20 -> 'pipe:[58642]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:20 21 -> 'pipe:[58654]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 22 -> 'pipe:[58674]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 23 -> 'pipe:[58336]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 24 -> 'pipe:[58697]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 25 -> 'pipe:[58679]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 26 -> 'pipe:[58297]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 27 -> 'pipe:[58729]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 28 -> 'pipe:[58739]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 29 -> 'pipe:[58817]'
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 3 -> 'anon_inode:[eventfd]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:22 30 -> 'pipe:[58346]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:21 31 -> 'pipe:[58322]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:22 32 -> 'pipe:[58776]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:22 33 -> 'pipe:[58784]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:22 34 -> 'pipe:[58793]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:23 36 -> 'pipe:[58826]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:23 37 -> 'pipe:[58835]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:22 38 -> 'pipe:[58802]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:23 39 -> 'pipe:[58846]'
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 4 -> 'anon_inode:[eventfd]'
lr-x------. 1 vagrant vagrant 64 Jul 29 12:23 40 -> 'pipe:[58852]'
l-wx------. 1 vagrant vagrant 64 Jul 29 12:23 41 -> /vagrant/3627-exec-memleak/hardware.memory.buff/buffer.b5e4f0b73baa41208d74aab41afc22048.log.meta
lr-x------. 1 vagrant vagrant 64 Jul 29 12:23 43 -> 'pipe:[58861]'
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 5 -> 'anon_inode:[eventpoll]'
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 6 -> 'anon_inode:[eventpoll]'
lrwx------. 1 vagrant vagrant 64 Jul 29 12:20 8 -> /vagrant/3627-exec-memleak/hardware.memory.buff/buffer.b5e4f0b73baa41208d74aab41afc22048.log
l-wx------. 1 vagrant vagrant 64 Jul 29 12:20 9 -> 'pipe:[58532]'


ashie commented Jul 29, 2022

Data analyzed with fluent-plugin-sigdump:
[Screenshot: fluentd-issue-3627-exec-memory-leak, objects linearly incremented]

They aren't decreased even after a major GC is triggered.
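
For reference, a minimal sketch of the kind of check behind this observation (assuming it is run inside the worker process, e.g. from a debug hook; this is not fluent-plugin-sigdump itself): count the live IO objects, force a major GC, and count again. Leaked pipe ends stay referenced, so the count does not drop.

# Count live IO objects and how many of them are still open.
def io_stats
  ios = ObjectSpace.each_object(IO).to_a
  open_ios = ios.reject { |io| io.closed? rescue true } # treat uninitialized IOs as closed
  { total: ios.size, open: open_ios.size }
end

before = io_stats
GC.start(full_mark: true, immediate_sweep: true) # force a full (major) GC cycle
after = io_stats

puts "IO objects before GC: #{before[:total]} (#{before[:open]} open)"
puts "IO objects after  GC: #{after[:total]} (#{after[:open]} open)"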

ashie added a commit that referenced this issue Aug 3, 2022
Pipes for child processes aren't closed explicitly after exiting child
processes.

Fix #3627

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
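
The commit message above names the cause: parent-side pipe ends created for child processes are never closed once the child exits. A minimal sketch of that pattern (a hypothetical helper for illustration, not fluentd's actual child-process code) showing where the explicit close belongs:

# Hypothetical helper illustrating the pattern described in the commit above;
# this is not fluentd's actual child-process code.
def run_command_once(command)
  read_io, write_io = IO.pipe   # pipe for the child's stdout
  pid = spawn(command, out: write_io)
  write_io.close                # parent never writes; close its copy so read sees EOF
  output = read_io.read         # read everything the child produced
  Process.waitpid(pid)          # reap the child process
  output
ensure
  # Without these explicit closes, every invocation leaks a pipe end,
  # matching the 'pipe:[...]' entries piling up in /proc/<pid>/fd above.
  write_io.close if write_io && !write_io.closed?
  read_io.close if read_io && !read_io.closed?
end

# in_exec effectively repeats something like this every run_interval:
10.times { run_command_once("echo hello") }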

ashie commented Aug 4, 2022

With #3844 the file descriptor leak seems fixed, but the following objects are still incremented linearly:

[Screenshot: fluentd-issue-3627-exec-memory-leak, objects linearly incremented (with PR #3844)]

ashie added a commit that referenced this issue Aug 9, 2022
Pipes for child processes aren't closed explicitly after exiting child
processes.

Fix #3627

Signed-off-by: Takuro Ashie <ashie@clear-code.com>

ashie commented Aug 10, 2022

Although the file descriptor leak is resolved, the problem described above is not yet resolved.
Reopening this.

ashie reopened this Aug 10, 2022
LukasJerabek commented:

Hi, any luck on this one?

github-actions bot commented:

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 7 days

github-actions bot added the stale label Apr 12, 2023
LukasJerabek commented:

I think this is still not solved completely...

daipom added the incomplete (Missing version, reproducible steps, etc.) label and removed the stale label Apr 12, 2023
github-actions bot commented:

This issue has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this issue will be closed in 7 days

github-actions bot added the stale label May 13, 2023
LukasJerabek commented:

I think this is still not solved completely...

github-actions bot removed the stale label May 14, 2023
ashie added the bug label and removed the incomplete (Missing version, reproducible steps, etc.) label May 15, 2023