
Possible memory leak in NodeJS / Python services #538

Closed
askmeegs opened this issue May 3, 2021 · 9 comments · Fixed by #630
Labels: lang: nodejs, lang: python, priority: p3, type: bug

Comments

@askmeegs (Contributor) commented on May 3, 2021:

Uptime checks for the production deployment of OnlineBoutique have been failing once every few weeks. Looking at kubectl events timed with an uptime check failure --

38m         Warning   NodeSysctlChange   node/gke-online-boutique-mast-default-pool-65a22575-azeq   {"unmanaged": {"net.ipv4.tcp_fastopen_key": "004baa97-3c3b554d-9bbcccf8-870ced36"}}
43m         Warning   NodeSysctlChange   node/gke-online-boutique-mast-default-pool-65a22575-i6m8   {"unmanaged": {"net.ipv4.tcp_fastopen_key": "706b7d5f-9df4b412-e8eb875e-179c4765"}}
46m         Warning   NodeSysctlChange   node/gke-online-boutique-mast-default-pool-65a22575-jvwz   {"unmanaged": {"net.ipv4.tcp_fastopen_key": "a0f734c5-5c9a56e1-06aeb420-0010498e"}}
39m         Warning   OOMKilling         node/gke-online-boutique-mast-default-pool-65a22575-jvwz   Memory cgroup out of memory: Kill process 569290 (node) score 2181 or sacrifice child
Killed process 569290 (node) total-vm:1418236kB, anon-rss:121284kB, file-rss:33236kB, shmem-rss:0kB
39m         Warning   OOMKilling         node/gke-online-boutique-mast-default-pool-65a22575-jvwz   Memory cgroup out of memory: Kill process 2592522 (grpc_health_pro) score 1029 or sacrifice child
Killed process 2592530 (grpc_health_pro) total-vm:710956kB, anon-rss:1348kB, file-rss:7376kB, shmem-rss:0kB

It looks like memory usage is exceeding the containers' limits. There seems to be plenty of allocatable memory across the prod GKE nodes:
[Screenshot: allocatable memory across the prod GKE nodes]

But as observed by @bourgeoisor, it seems that three of the workloads are using steadily increasing amounts of memory until the pods are killed by GKE.

Currency and payment (NodeJS):

[Screenshot: memory usage of currencyservice and paymentservice steadily increasing]

Recommendation (Python):

[Screenshot: memory usage of recommendationservice steadily increasing]

TODO - investigate possible memory leaks starting with the NodeJS services: figure out why they use an increasing amount of memory over time rather than a constant amount. Then investigate the Python services and see whether other Python services (emailservice, for instance) show the same behavior as recommendationservice.
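As a starting point, one low-tech way to confirm the growth from inside one of the NodeJS services is to periodically log `process.memoryUsage()` and watch whether RSS/heap climbs under a steady request rate. A minimal sketch (not code that exists in the repo; the interval and log format are arbitrary):

```js
// memory-probe.js -- hypothetical debugging aid, not part of the service code.
// Require it near the top of the service's server.js to log memory stats once a minute.
const MB = 1024 * 1024;

setInterval(() => {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  // Under constant load, a steadily climbing rss/heapUsed suggests a leak
  // rather than the normal garbage-collector sawtooth.
  console.log(JSON.stringify({
    ts: new Date().toISOString(),
    rss_mb: Math.round(rss / MB),
    heap_total_mb: Math.round(heapTotal / MB),
    heap_used_mb: Math.round(heapUsed / MB),
    external_mb: Math.round(external / MB),
  }));
}, 60 * 1000).unref(); // unref() so the timer alone doesn't keep the process alive
```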

@askmeegs added labels: type: bug, priority: p2, lang: nodejs, lang: python (May 3, 2021)
@askmeegs changed the title from "Possible memory leak in" to "Possible memory leak in NodeJS / Python services" (May 3, 2021)
@askmeegs self-assigned this (Aug 4, 2021)
@askmeegs removed their assignment (Sep 13, 2021)
@askmeegs (Contributor, Author) commented:

Was unable to find the root cause of the NodeJS memory leak after a few weeks of testing. Needs a Node expert or someone else to further investigate. Internal doc with my notes so far: https://docs.google.com/document/d/1gyc8YvfKwMr86wzY_cz1NICQU48VE-wXifqjDprAafI/edit?resourcekey=0-g04_Kba4MQjeXDFzsp-Bqw

@Shabirmean (Member) commented on Nov 26, 2021:

According to the profiler data for the currencyservice and paymentservice, the request-retry package is the one that seems to be using a lot of memory. It is imported by the google-cloud/common library, which is used by google-cloud/tracing, google-cloud/debug, and google-cloud/profiler.

The same behaviour is reported in the google-cloud/debug nodejs repository. As per this recent comment, the issue seems to have been eradicated after disabling google-cloud/debug.

I have created four PRs to stage 4 clusters with different settings to observe how memory usage behaves over time.

[Screenshot: the four staged cluster configurations]

@Shabirmean (Member) commented on Nov 29, 2021:

So the issue clearly seems to be with any library that uses google-cloud/common, in our case google-cloud/debug and google-cloud/tracing. See the memory graphs for the four cases described in the earlier PRs. So ideally we would have to wait for the fix for googleapis/cloud-debug-nodejs#811.

[Screenshot: memory graphs for the four staged configurations]
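For reference, the kind of toggle used to stage these cases would look roughly like the following in the NodeJS services' entry point. This is a sketch of the approach, not the exact diff in those PRs; the env-var names are illustrative, while the `start()` calls are the standard ones from the `@google-cloud` agent packages.

```js
// Sketch: gate the Cloud Trace and Cloud Debugger agents behind env vars so each
// staged cluster can run with a different combination of agents enabled.
// Env-var names (DISABLE_TRACING, DISABLE_DEBUGGER) are illustrative.
if (!process.env.DISABLE_TRACING) {
  console.log('Tracing enabled.');
  require('@google-cloud/trace-agent').start();
}

if (!process.env.DISABLE_DEBUGGER) {
  console.log('Debugger enabled.');
  require('@google-cloud/debug-agent').start({
    serviceContext: {
      service: 'currencyservice',
    },
  });
}
```

With a toggle like this in place, each staged deployment only needs a different set of env vars in its manifest to produce the memory curves above.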

@Shabirmean (Member) commented on Nov 29, 2021:

One more thing that was noticed is that google-cloud/debug was erroring out with a bunch of insufficient scopes errors:

[Screenshot: insufficient scopes errors from the debug agent]

This is because the Cloud Debugger API access scope is not granted to the online-boutique-pr and online-boutique-master cluster nodes. Thus, we should create the clusters with --scopes=https://www.googleapis.com/auth/cloud_debugger,gke-default in order for the debug agent to be able to connect to the API.

I have created a new cluster, online-boutique-pr-v2, with the above-mentioned scopes and updated the GitHub CI workflows to use the new cluster. The changes can be viewed in #644.

This takes care of all the insufficient scopes errors that were observed but does not seem to fully eradicate the memory issue. This change seems to delay the time it takes for the memory to hit the peak by ~1.5 hours.

[Screenshot: memory usage after the scopes change]

@Shabirmean (Member) commented:

I created two PRs to generate some profiler data in the CI project for this repo.

These PRs had different version tags for the profiler agent in the currencyservice.

You can view the profiler data for these versions under the Profiler view in the CI project. Filter by the following criteria and use it to understand the differences:

[Screenshot: Profiler filter criteria]
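For context, the service/version values that the Profiler UI lets you filter on come from the `serviceContext` the agent is started with, so the two builds can be told apart roughly like this (a sketch; the actual service name and version strings used in those PRs may differ):

```js
// Sketch: start the Cloud Profiler agent with an explicit service name and version
// so profiles from the two PR builds show up as separate versions in the Profiler UI.
// The version string below is illustrative.
require('@google-cloud/profiler').start({
  serviceContext: {
    service: 'currencyservice',
    version: 'profiler-agent-test-1', // one distinct tag per PR under test
  },
});
```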

@NimJay (Collaborator) commented on Jan 12, 2022:

Hi @Shabirmean,

Please correct me if I'm wrong.
We are now just waiting on this issue to be fixed via googleapis/cloud-debug-nodejs#811.
Judging from Ben Coe's comment, this is something they plan to fix.

Let me know if there is any action we need to take in the meantime.

@Shabirmean (Member) commented:

> Please correct me if I'm wrong. We are now just waiting on this issue to be fixed via googleapis/cloud-debug-nodejs#811. Judging from Ben Coe's comment, this is something they plan to fix.
>
> Let me know if there is any action we need to take in the meantime.

Hello @NimJay

There isn't much we can do from our side. I have communicated with Ben and am seeing if we can work with the debug team to get that issue (googleapis/cloud-debug-nodejs#811) fixed. Until then, no action is needed or possible from our side. I suggest we keep this issue open!

@Shabirmean self-assigned this (Feb 8, 2022)
@bourgeoisor added the priority: p3 label and removed the priority: p2 label (May 6, 2022)
@bourgeoisor (Member) commented:

This is still an issue, but I'm bumping the priority down to p3.

@mathieu-benoit (Contributor) commented:

Now that #1281 is merged into the main branch, we can close this issue. Cloud Debugger is now removed from this project.
