Memory leak in ruby process? #368

Open
flori opened this issue Nov 30, 2021 · 5 comments

flori commented Nov 30, 2021

I recently had one of these ruby processes killed by the Linux OOM killer, after 87 days running on Google Cloud:

[7595615.399139] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=b9bd611bc9ee699608d0bfe919b90f987e94e924fbf57748f078433c8c4b1e20,mems_allowed=0,global_oom,task_memcg=/docker/b9bd611bc9ee699608d0bfe919b90f987e94e924fbf57748f078433c8c4b1e20,task=ruby,pid=1237,uid=0
[7595615.424623] Out of memory: Killed process 1237 (ruby) total-vm:1254608kB, anon-rss:614712kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1940kB oom_score_adj:0
[7595615.528767] oom_reaper: reaped process 1237 (ruby), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

There seems to be a memory leak somewhere.
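
To confirm the growth before the OOM killer fires, one rough approach (a hedged sketch, not something from this thread) is to sample the agent's resident memory on a schedule and graph it. The '[f]luentd' pattern and the output path are assumptions; adjust them to match how the ruby worker appears in ps on the host:

# sample RSS (kB) of the fluentd/ruby workers every 5 minutes, append to a log
while sleep 300; do
  date -u +%FT%TZ
  ps -eo pid,rss,etime,args --sort=-rss | grep '[f]luentd' | head -n 5
done | tee -a /var/tmp/fluentd-rss.log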

@jkschulz

Hey, thanks for reporting this! Can you provide some more detail about your environment and agent so that we can try to reproduce?

flori commented Dec 30, 2021

This is a single Compute Engine VM running in GCP's europe-west1-b zone. I cannot find out which Ruby version it runs or which gems it uses, only that the VM is currently on cos-stable-89-16108-470-16. The leak still seems to be an issue: the fluentd ruby process currently takes up one third of the VM's memory. See this metrics graph I made of the VM's memory usage:

[Screenshot 2021-12-30 at 10:38: metrics graph of VM memory usage]
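
On COS the logging agent runs as a container, so the embedded Ruby and gem versions can usually be read from inside it. A hedged sketch, assuming the container is named stackdriver-logging-agent and uses the usual embedded paths (check docker ps first and adjust):

docker ps --filter name=stackdriver-logging-agent --format '{{.Names}} {{.Image}}'
docker exec stackdriver-logging-agent /opt/google-fluentd/embedded/bin/ruby --version
docker exec stackdriver-logging-agent sh -c "/opt/google-fluentd/embedded/bin/gem list | grep -E 'fluentd|fluent-plugin'"   # plugin versions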

@elbakerino

Just had a failure of the whole cluster because of google-fluentd ruby memory issues.

The infrastructure is fully logged with google-fluentd and prometheus: Ubuntu/CentOS servers as Docker hosts, plus nginx load balancers and some custom apps. The problem was noticed because the Docker hosts were not able to run some containers (OOM); nginx still managed to work with 2% free memory. I don't know when it started; I had a similar issue 2-3 months ago where I thought Docker logs were the problem, and at the time a reboot was all I had time for.

For most servers a sudo systemctl restart google-fluentd fixed it: the ruby process restarted and the memory was freed.
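
While the leak is unfixed, one hedged stop-gap (not an official recommendation) is a systemd drop-in that caps the unit's memory, so the kernel kills and systemd restarts only the agent instead of starving the whole host. The 600M figure is an arbitrary example, and MemoryMax= needs systemd 231+ (older versions use MemoryLimit=):

sudo mkdir -p /etc/systemd/system/google-fluentd.service.d
printf '[Service]\nMemoryMax=600M\nRestart=on-failure\n' | sudo tee /etc/systemd/system/google-fluentd.service.d/memory.conf
sudo systemctl daemon-reload
sudo systemctl restart google-fluentd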

Log from /var/log/google-fluentd/google-fluentd.log up until the restart, on a server where the restart worked:

2022-01-16 03:26:02 +0000 [info]: #0 flushing all buffer forcedly
2022-01-16 03:26:02 +0000 [info]: #0 detected rotation of /var/log/messages; waiting 5 seconds
2022-01-16 03:26:02 +0000 [info]: #0 following tail of /var/log/messages
2022-01-16 13:04:57 +0000 [info]: Received graceful stop
2022-01-16 13:04:58 +0000 [info]: #0 fluentd worker is now stopping worker=0
2022-01-16 13:04:58 +0000 [info]: #0 shutting down fluentd worker worker=0

On some servers the restart/stop took ages, and I just killed the google-fluentd service; the ruby process still existed and I needed to SIGKILL it (kill -9 <pid>) to free the memory. Plain kill without -9 worked on some servers, but not on all.

The log on such a server only shows the latest stop (this nginx server's timezone is somehow set to Germany rather than UTC like the other one); the plugin_id object value is replaced since I don't know whether it might be sensitive:

2022-01-16 03:42:01 +0100 [info]: #0 flushing all buffer forcedly
2022-01-16 10:18:44 +0100 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=85.15702900663018 slow_flush_log_threshold=20.0 plugin_id="object:aaaaaaa"
2022-01-16 11:52:57 +0100 [warn]: #0 buffer flush took longer time than slow_flush_log_threshold: elapsed_time=83.46101826429367 slow_flush_log_threshold=20.0 plugin_id="object:aaaaaaa"
2022-01-16 14:10:53 +0100 [info]: Received graceful stop
2022-01-16 14:10:54 +0100 [info]: #0 fluentd worker is now stopping worker=0
2022-01-16 14:10:54 +0100 [info]: #0 shutting down fluentd worker worker=0

Commands used:

sudo systemctl restart google-fluentd
top # ram was not free
sudo systemctl restart google-fluentd # took ages
sudo systemctl stop google-fluentd # took ages
sudo systemctl kill google-fluentd
sudo systemctl start google-fluentd
top # ram was not free
sudo kill -9 <pid-of-ruby>
top # ram was free
sudo reboot # just to be sure everything was/is running as it should

What further information would help, and how could I get it easily?
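
Not an official checklist, but for a package-based install the details below usually make a leak report reproducible. The /usr/sbin and /opt/google-fluentd/embedded paths are assumptions about the stock package layout and may differ:

/usr/sbin/google-fluentd --version                                            # agent/fluentd version
/opt/google-fluentd/embedded/bin/ruby --version                               # embedded ruby
/opt/google-fluentd/embedded/bin/gem list | grep -E 'fluentd|fluent-plugin'   # plugin versions
ps -eo pid,rss,etime,args --sort=-rss | grep '[f]luentd'                      # current worker memory
grep -vE '^\s*(#|$)' /etc/google-fluentd/google-fluentd.conf                  # active config
ls /etc/google-fluentd/config.d/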

EDeijl commented Feb 21, 2022

We are running into the same issue.

The agent is installed on a VM instance running COS 89 with the metadata tag google-logging-enabled: true.

Restarting the stackdriver-logging-agent mitigates the issue, but it comes back after a while.

The specific image with the memory leak is: gcr.io/stackdriver-agents/stackdriver-logging-agent:1.8.4

An older version of that image does not have the same memory issue: gcr.io/stackdriver-agents/stackdriver-logging-agent:0.2-1.5.33-1-1
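
For anyone comparing the two images, a hedged way to confirm which tag a COS VM is actually running and to check its memory right now (the container name is an assumption; verify against docker ps output):

docker ps --filter name=stackdriver-logging-agent --format '{{.Names}} {{.Image}}'
docker stats --no-stream stackdriver-logging-agent   # one-shot memory/CPU snapshot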

igorpeshansky commented Feb 24, 2022

Is this the same as fluent/fluentd#3401?

[EDIT] Actually, fluent-plugin-systemd was upgraded to 1.0.5 in 1.9.1, but #368 (comment) mentions 1.8.4 leaking as well.

[ANOTHER EDIT] Mystery potentially solved: the fluent-plugin-systemd misbehavior with read_from_head set, diagnosed in the referenced issue, was already present in version 1.0.2 (used by google-fluentd 1.8.4).
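
To check whether a given install is exposed to that fluent-plugin-systemd behavior, a hedged sketch (paths assume a package install of google-fluentd; for the containerized agent, prefix the commands with docker exec):

/opt/google-fluentd/embedded/bin/gem list fluent-plugin-systemd               # installed plugin version
grep -rn 'read_from_head' /etc/google-fluentd/google-fluentd.conf /etc/google-fluentd/config.d/ 2>/dev/null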
