Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Busyloop on getpid() #42277

Closed
sheepa opened this issue Mar 10, 2022 · 7 comments
Closed

Busyloop on getpid() #42277

sheepa opened this issue Mar 10, 2022 · 7 comments

Comments

@sheepa
Copy link

sheepa commented Mar 10, 2022

Version

v16.14.0

Platform

Linux machine 5.4.0-104-generic #118-Ubuntu SMP Wed Mar 2 19:02:41 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

No response

What steps will reproduce the bug?

No specific code snippets unfortunately my program (express http server) suddenly started to hang on getpid() (from strace).

How often does it reproduce? Is there a required condition?

Usually within 5 minute runtime, could be longer (hours) though.

What is the expected behavior?

No response

What do you see instead?

Busylooping (100% 1 core) on getpid(), making the node process completely unresponsive.

Additional information

Please see strace below, right when getpid() busyloop starts.

epoll_wait(14, [], 1024, 0)             = 0
epoll_wait(14, [], 1024, 35)            = 0
futex(0x4ce71f4, FUTEX_WAKE_PRIVATE, 1) = 1
mprotect(0x7f7778203000, 245760, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
mprotect(0x7f7778203000, 245760, PROT_READ|PROT_EXEC) = 0
mprotect(0x7f7778203000, 245760, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
mprotect(0x7f7778203000, 245760, PROT_READ|PROT_EXEC) = 0
mprotect(0x35922200000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x6c071100000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x36a9986c0000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0xc4b87c40000, 262144, PROT_READ|PROT_WRITE) = 0
futex(0x4ce71f4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71a0, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 55650
mprotect(0xb5a4f200000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x32da82880000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x20416c700000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x1cdea14c0000, 262144, PROT_READ|PROT_WRITE) = 0
futex(0x4ce71f4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71f4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71a0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x5091e60, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x5091de8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x4ce71f4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71a0, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 55650
write(17, "\1\0\0\0\0\0\0\0", 8)        = 8
mprotect(0x35922200000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x6c071100000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x36a9986c0000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0xc4b87c40000, 262144, PROT_READ|PROT_WRITE) = 0
futex(0x4ce71f4, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71a0, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 55650
mprotect(0xb5a4f200000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x32da82880000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x20416c700000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x1cdea14c0000, 262144, PROT_READ|PROT_WRITE) = 0
futex(0x4ce71f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71a0, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 55650
mprotect(0x35922200000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x6c071100000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x36a9986c0000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0xc4b87c40000, 262144, PROT_READ|PROT_WRITE) = 0
futex(0x4ce71f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71a0, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 55650
mprotect(0xb5a4f200000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x32da82880000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x20416c700000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x1cdea14c0000, 262144, PROT_READ|PROT_WRITE) = 0
futex(0x4ce71f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x4ce71a0, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 55650
mprotect(0x35922200000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x6c071100000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0x36a9986c0000, 262144, PROT_READ|PROT_WRITE) = 0
mprotect(0xc4b87c40000, 262144, PROT_READ|PROT_WRITE) = 0
getpid()                                = 55650
getpid()                                = 55650
getpid()                                = 55650
getpid()                                = 55650
getpid()                                = 55650
getpid()                                = 55650
...

I also enabled NODE_DEBUG=timer,http,net,fs,cluster,tls,stream,child_process,module I do use some setTimeouts() and it always ends one the very same line, see below:

2022-03-09 22:08:33 +00:00: TIMER 93324: 1 list empty
2022-03-09 22:08:33 +00:00: TIMER 93324: process timer lists 1040777
2022-03-09 22:08:33 +00:00: TIMER 93324: timeout callback 990
2022-03-09 22:08:33 +00:00: TIMER 93324: 990 list wait because diff is 0
2022-03-09 22:08:33 +00:00: STREAM 93324: readableAddChunk <Buffer 30 ... 83 more bytes>
2022-03-09 22:08:33 +00:00: STREAM 93324: maybeReadMore read 0
2022-03-09 22:08:33 +00:00: STREAM 93324: read 0
2022-03-09 22:08:33 +00:00: STREAM 93324: need readable true
2022-03-09 22:08:33 +00:00: STREAM 93324: length less than watermark true
2022-03-09 22:08:33 +00:00: STREAM 93324: do read
2022-03-09 22:08:33 +00:00: NET 93324: _read
2022-03-09 22:08:33 +00:00: STREAM 93324: readableAddChunk <Buffer 30 ... 81 more bytes>
2022-03-09 22:08:33 +00:00: TIMER 93324: no 1 list was found in insert, creating a new one
2022-03-09 22:08:33 +00:00: STREAM 93324: maybeReadMore read 0
2022-03-09 22:08:33 +00:00: STREAM 93324: read 0
2022-03-09 22:08:33 +00:00: STREAM 93324: need readable true
2022-03-09 22:08:33 +00:00: STREAM 93324: length less than watermark true
2022-03-09 22:08:33 +00:00: STREAM 93324: do read
2022-03-09 22:08:33 +00:00: NET 93324: _read
2022-03-09 22:08:33 +00:00: TIMER 93324: process timer lists 1040817
2022-03-09 22:08:33 +00:00: TIMER 93324: timeout callback 1
2022-03-09 22:08:33 +00:00: TIMER 93324: 1 list empty
2022-03-09 22:08:33 +00:00: STREAM 93324: readableAddChunk <Buffer 30 ... 83 more bytes>
2022-03-09 22:08:33 +00:00: STREAM 93324: maybeReadMore read 0
2022-03-09 22:08:33 +00:00: STREAM 93324: read 0
2022-03-09 22:08:33 +00:00: STREAM 93324: need readable true
2022-03-09 22:08:33 +00:00: STREAM 93324: length less than watermark true
2022-03-09 22:08:33 +00:00: STREAM 93324: do read
2022-03-09 22:08:33 +00:00: NET 93324: _read
2022-03-09 22:08:33 +00:00: TIMER 93324: process timer lists 1040840
2022-03-09 22:08:33 +00:00: TIMER 93324: timeout callback 10000
2022-03-09 22:08:33 +00:00: TIMER 93324: no 1 list was found in insert, creating a new one
@sheepa
Copy link
Author

sheepa commented Mar 10, 2022

This was related to isaacs/node-lru-cache#209

@sheepa sheepa closed this as completed Mar 10, 2022
@palashkulsh
Copy link

@sheepa how did you find this. I am facing the same issue where we are getting busy loop in nodejs process on getpid() after upgrading from node 8 to node 18

@sheepa
Copy link
Author

sheepa commented Mar 17, 2023

@palashkulsh turned out to be an issue with lru-cache.

@paulrutter
Copy link

paulrutter commented Nov 24, 2023

@sheepa @palashkulsh @isaacs states that node-lru-cache doesn't call getpid(), which makes it very unlikely that it's the actual cause of this issue.

We're seeing the same issue since moving from Node16 to Node18. Some Node.js processes are in an endless loop for getpid().

root@ip-aa-aa-aa-aa ~]# strace -p 22774
strace: Process 22774 attached
strace: [ Process PID=22774 runs in x32 mode. ]
strace: [ Process PID=22774 runs in 64 bit mode. ]
getpid()                                = 12903
getpid()                                = 12903
getpid()                                = 12903
getpid()                                = 12903
getpid()                                = 12903
getpid()                                = 12903
getpid()                                = 12903
getpid()                                = 12903
getpid()                                = 12903
getpid()                                = 12903
getpid()                                = 12903
getpid()                                = 12903

We're not using the relevant versions of node-lru-cache.
How was the issue resolved for you?

@palashkulsh
Copy link

@paulrutter from what i vaguely remember from my debuggings. it was enable offline queue functionality of redis package which was getting stuck in infinite loop once the redis command was failing. we disabled the offline queue functionality of redis package in our company internal module and we didn't face the issue again.

somewhere something is busylooping in your case as well. node profiler might be able to help you in this.

@paulrutter
Copy link

Thanks for your response, we will dive into it further.

@paulrutter
Copy link

microsoft/vscode#194477 (comment) Seems related (at least for our usecase).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants