
Possible memory leak in NodeJS / Python services #538

Closed
askmeegs opened this issue May 3, 2021 · 9 comments · Fixed by #630
Labels: lang: nodejs, lang: python, priority: p3, type: bug

Comments

@askmeegs (Contributor) commented on May 3, 2021:

Uptime checks for the production deployment of OnlineBoutique have been failing once every few weeks. Looking at kubectl events timed with an uptime check failure --

38m         Warning   NodeSysctlChange   node/gke-online-boutique-mast-default-pool-65a22575-azeq   {"unmanaged": {"net.ipv4.tcp_fastopen_key": "004baa97-3c3b554d-9bbcccf8-870ced36"}}
43m         Warning   NodeSysctlChange   node/gke-online-boutique-mast-default-pool-65a22575-i6m8   {"unmanaged": {"net.ipv4.tcp_fastopen_key": "706b7d5f-9df4b412-e8eb875e-179c4765"}}
46m         Warning   NodeSysctlChange   node/gke-online-boutique-mast-default-pool-65a22575-jvwz   {"unmanaged": {"net.ipv4.tcp_fastopen_key": "a0f734c5-5c9a56e1-06aeb420-0010498e"}}
39m         Warning   OOMKilling         node/gke-online-boutique-mast-default-pool-65a22575-jvwz   Memory cgroup out of memory: Kill process 569290 (node) score 2181 or sacrifice child
Killed process 569290 (node) total-vm:1418236kB, anon-rss:121284kB, file-rss:33236kB, shmem-rss:0kB
39m         Warning   OOMKilling         node/gke-online-boutique-mast-default-pool-65a22575-jvwz   Memory cgroup out of memory: Kill process 2592522 (grpc_health_pro) score 1029 or sacrifice child
Killed process 2592530 (grpc_health_pro) total-vm:710956kB, anon-rss:1348kB, file-rss:7376kB, shmem-rss:0kB

It looks like memory usage is exceeding the containers' limits. There seems to be plenty of allocatable memory across the prod GKE nodes:
[Screenshot: allocatable memory across the prod GKE nodes]

But as observed by @bourgeoisor, it seems that three of the workloads are using steadily increasing amounts of memory until the pods are killed by GKE.

Currency and payment (NodeJS):

[Screenshot: memory usage of currencyservice and paymentservice steadily increasing]

Recommendation (Python):

[Screenshot: memory usage of recommendationservice steadily increasing]

TODO - investigate possible memory leaks starting with the NodeJS services: figure out why they use an increasing amount of memory over time rather than a constant amount. Then investigate the Python services and see whether other Python services (emailservice, for instance) show the same behavior as recommendationservice.
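As a starting point, one low-tech way to confirm the growth from inside one of the NodeJS services is to periodically log `process.memoryUsage()` and watch whether RSS/heap climbs under a steady request rate. A minimal sketch (not code that exists in the repo; the interval and log format are arbitrary):

```js
// memory-probe.js -- hypothetical debugging aid, not part of the service code.
// Require it near the top of the service's server.js to log memory stats once a minute.
const MB = 1024 * 1024;

setInterval(() => {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  // Under constant load, a steadily climbing rss/heapUsed suggests a leak
  // rather than the normal garbage-collector sawtooth.
  console.log(JSON.stringify({
    ts: new Date().toISOString(),
    rss_mb: Math.round(rss / MB),
    heap_total_mb: Math.round(heapTotal / MB),
    heap_used_mb: Math.round(heapUsed / MB),
    external_mb: Math.round(external / MB),
  }));
}, 60 * 1000).unref(); // unref() so the timer alone doesn't keep the process alive
```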

@askmeegs added labels: type: bug, priority: p2, lang: nodejs, lang: python (May 3, 2021)
@askmeegs changed the title from "Possible memory leak in" to "Possible memory leak in NodeJS / Python services" (May 3, 2021)
@askmeegs self-assigned this (Aug 4, 2021)
@askmeegs removed their assignment (Sep 13, 2021)
@askmeegs (Contributor, Author) commented:

Was unable to find the root cause of the NodeJS memory leak after a few weeks of testing. Needs a Node expert or someone else to further investigate. Internal doc with my notes so far: https://docs.google.com/document/d/1gyc8YvfKwMr86wzY_cz1NICQU48VE-wXifqjDprAafI/edit?resourcekey=0-g04_Kba4MQjeXDFzsp-Bqw

@Shabirmean (Member) commented on Nov 26, 2021:

According to the profiler data for the currencyservice and paymentservice, the request-retry package is the one that seems to be using a lot of memory. It is imported by the google-cloud/common library, which is used by google-cloud/tracing, google-cloud/debug, and google-cloud/profiler.

The same behaviour is reported in the google-cloud/debug nodejs repository. As per this recent comment, the issue seems to have been eradicated after disabling google-cloud/debug.

I have created four PRs to stage 4 clusters with different settings to observe how memory usage behaves over time.

[Screenshot: the four staged cluster configurations]

@Shabirmean (Member) commented on Nov 29, 2021:

So the issue clearly seems to be with any library that uses google-cloud/common, in our case google-cloud/debug and google-cloud/tracing. See the memory graphs for the four cases described in the earlier PRs. So ideally we would have to wait for the fix for googleapis/cloud-debug-nodejs#811.

[Screenshot: memory graphs for the four staged configurations]
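For reference, the kind of toggle used to stage these cases would look roughly like the following in the NodeJS services' entry point. This is a sketch of the approach, not the exact diff in those PRs; the env-var names are illustrative, while the `start()` calls are the standard ones from the `@google-cloud` agent packages.

```js
// Sketch: gate the Cloud Trace and Cloud Debugger agents behind env vars so each
// staged cluster can run with a different combination of agents enabled.
// Env-var names (DISABLE_TRACING, DISABLE_DEBUGGER) are illustrative.
if (!process.env.DISABLE_TRACING) {
  console.log('Tracing enabled.');
  require('@google-cloud/trace-agent').start();
}

if (!process.env.DISABLE_DEBUGGER) {
  console.log('Debugger enabled.');
  require('@google-cloud/debug-agent').start({
    serviceContext: {
      service: 'currencyservice',
    },
  });
}
```

With a toggle like this in place, each staged deployment only needs a different set of env vars in its manifest to produce the memory curves above.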

@Shabirmean (Member) commented on Nov 29, 2021:

One more thing that was noticed is that google-cloud/debug was erroring out with a bunch of insufficient scopes errors:

[Screenshot: insufficient scopes errors from the debug agent]

This is because the Cloud Debugger API access scope is not granted to the online-boutique-pr and online-boutique-master cluster nodes. Thus, we should create the clusters with --scopes=https://www.googleapis.com/auth/cloud_debugger,gke-default in order for the debug agent to be able to connect to the API.

I have created a new cluster, online-boutique-pr-v2, with the above-mentioned scopes and updated the GitHub CI workflows to use the new cluster. The changes can be viewed in #644.

This takes care of all the insufficient scopes errors that were observed but does not seem to fully eradicate the memory issue. This change seems to delay the time it takes for the memory to hit the peak by ~1.5 hours.

[Screenshot: memory usage after the scopes change]

@Shabirmean (Member) commented:

I created two PRs to generate some profiler data in the CI project for this repo.

These PRs had different version tags for the profiler agent in the currencyservice.

You can view the profiler data for these versions under the Profiler view in the CI project. Filter by the following criteria and use it to understand the differences:

[Screenshot: Profiler filter criteria]
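For context, the service/version values that the Profiler UI lets you filter on come from the `serviceContext` the agent is started with, so the two builds can be told apart roughly like this (a sketch; the actual service name and version strings used in those PRs may differ):

```js
// Sketch: start the Cloud Profiler agent with an explicit service name and version
// so profiles from the two PR builds show up as separate versions in the Profiler UI.
// The version string below is illustrative.
require('@google-cloud/profiler').start({
  serviceContext: {
    service: 'currencyservice',
    version: 'profiler-agent-test-1', // one distinct tag per PR under test
  },
});
```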

@NimJay (Collaborator) commented on Jan 12, 2022:

Hi @Shabirmean,

Please correct me if I'm wrong.
We are now just waiting on this issue to be fixed via googleapis/cloud-debug-nodejs#811.
Judging from Ben Coe's comment, this is something they plan to fix.

Let me know if there is any action we need to take in the meantime.

@Shabirmean (Member) commented:

> Please correct me if I'm wrong. We are now just waiting on this issue to be fixed via googleapis/cloud-debug-nodejs#811. Judging from Ben Coe's comment, this is something they plan to fix.
>
> Let me know if there is any action we need to take in the meantime.

Hello @NimJay

There isn't much we can do from our side. I have communicated with Ben and am seeing if we can work with the debug team to get that issue (googleapis/cloud-debug-nodejs#811) fixed. Until then, no action is needed or possible from our side. I suggest we keep this issue open!

@Shabirmean self-assigned this (Feb 8, 2022)
@bourgeoisor added the priority: p3 label and removed the priority: p2 label (May 6, 2022)
@bourgeoisor (Member) commented:

This is still an issue, but I'm bumping the priority down to p3.

@mathieu-benoit (Contributor) commented:

Now that #1281 is merged into the main branch, we can close this issue. Cloud Debugger is now removed from this project.
