
190% CPU hike on v2.2.4 #1556

Open
Hyjaz opened this issue Mar 20, 2023 · 8 comments · May be fixed by #1667

Comments

@Hyjaz

Hyjaz commented Mar 20, 2023

Describe the bug
When upgrading to v2.2.4 we saw a 190% increase in our CPU usage.

To Reproduce
Not sure how you can reproduce it. We do, however, send thousands of messages per second.

If none of the above are possible to provide, please write down the exact steps to reproduce the behavior:

  1. Run a producer that continuously produces messages to a topic
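
A minimal producer loop along these lines could serve as a starting point for reproducing the load; the broker address, client id, and topic name are placeholders, not values from this report:

```js
const { Kafka } = require('kafkajs')

// Placeholder connection details; point these at a real cluster to reproduce.
const kafka = new Kafka({ clientId: 'cpu-repro', brokers: ['localhost:9092'] })
const producer = kafka.producer()

const run = async () => {
  await producer.connect()
  // Send small messages in a tight loop to approximate a high-throughput producer.
  for (;;) {
    await producer.send({
      topic: 'cpu-repro-topic',
      messages: [{ value: JSON.stringify({ ts: Date.now() }) }],
    })
  }
}

run().catch(console.error)
```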

Expected behavior
No CPU usage hike between v2.2.3 and v2.2.4.

Observed behavior
CPU usage increased by roughly 190% after upgrading from v2.2.3 to v2.2.4.

Environment:

  • OS: N/A
  • KafkaJS version: 2.2.4
  • Kafka version:
  • NodeJS version: 16.18.0


@Hyjaz Hyjaz changed the title CPU hike on v2.2.4 190% CPU hike on v2.2.4 Mar 20, 2023
@Nevon
Collaborator

Nevon commented Mar 21, 2023

Do you have a CPU profile that could show where CPU time is being spent? You can use something like 0x to generate a flame graph if you don't already instrument your application with an APM solution. Ideally with a comparison to 2.2.3.
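
For reference, capturing such a profile with 0x might look like the following; the script name `producer.js` is just a placeholder for whatever entry point drives the workload:

```sh
# 0x wraps the node process, records a CPU profile, and writes an HTML flame graph.
npx 0x -- node producer.js
# Run the same workload once on kafkajs@2.2.3 and once on 2.2.4, then compare the graphs.
```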

@Hyjaz
Author

Hyjaz commented Mar 21, 2023

Hello @Nevon, on the latest kafkajs version it seems to be coming from scheduleCheckPendingRequests. I noticed that there was a change related to this in requestQueue/index.js in the latest release. Let me know if you need more details.

[Screenshot: CPU profile captured 2023-03-21 at 08:41:31]

This is v2.2.3. You can see there is a huge jump in CPU usage between v2.2.4 and this one.

[Screenshot: CPU profile captured 2023-03-21 at 09:06:00]
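
For readers following along, the pattern being pointed at is roughly the following; this is an illustrative sketch only, not the actual kafkajs source, and only the method name comes from the profile above. Rescheduling the pending-request check on a fixed, very short timer keeps the event loop waking up constantly even when the queue is idle:

```js
// Illustrative sketch only, not the kafkajs implementation.
// Rescheduling on a fixed short interval wakes the process up continuously,
// whether or not any pending requests actually need attention.
function scheduleCheckPendingRequests(queue) {
  setTimeout(() => {
    queue.checkPendingRequests()
    scheduleCheckPendingRequests(queue) // unconditional reschedule, roughly every 10ms
  }, 10)
}
```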

@Nevon
Collaborator

Nevon commented Mar 21, 2023

Thank you, that's what I suspected, but it's great to have some data to back it up. For reference, the change was introduced in #1532.

@JoseGoncalves

Also noticed this CPU increase in my app.
When idle, v2.2.3 uses almost no CPU, while v2.2.4 uses around 8%.

@MDSLKTR

MDSLKTR commented May 8, 2023

I kinda want to have the best of both 2.2.3 and 2.2.4, but the CPU spike is way too much to upgrade currently, which is why we pinned it to 2.2.3.

@Nevon I gave it a stab in the linked PR: #1572

MDSLKTR added a commit to MDSLKTR/kafkajs that referenced this issue May 8, 2023
MDSLKTR added a commit to MDSLKTR/kafkajs that referenced this issue May 9, 2023
@siimsams

siimsams commented Jun 1, 2023

We also have this issue after upgrading from kafkajs 1 to the latest version. All services that have upgraded consume way more CPU, and event loop iterations per second have increased 100x. After applying @MDSLKTR's fix as a patch, this issue goes away. Would like to see this get merged ASAP.

Thank you @MDSLKTR !
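
One common way to carry such a fix before it ships is to edit node_modules/kafkajs locally and capture the change with patch-package; the exact workflow below is an assumption, not something the comment above specifies:

```sh
# After manually applying the fix inside node_modules/kafkajs:
npx patch-package kafkajs
# Commit the generated patches/kafkajs+2.2.4.patch and apply it on install
# via a "postinstall": "patch-package" script in package.json.
```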

terebentina added a commit to terebentina/kafkajs that referenced this issue Jun 9, 2023

* fix regression when timeout is marginal (see tulios#1556)
* always use the calculated scheduled timeout or 0

Co-authored-by: MDSLKTR <simon.kunz@fashion-digital.de>
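
The commit message above hints at the shape of the fix: derive the timer delay from when the next pending request actually needs to be checked, clamping marginal or negative values to zero, instead of waking up on a short fixed interval. A rough sketch of that idea follows; the field and method names are assumptions, not the actual patch:

```js
// Rough sketch of the idea in the commit message above, not the actual kafkajs patch.
// Schedule a single check for when the earliest pending request needs attention,
// clamping marginal/negative delays to 0 instead of rescheduling on a tight loop.
function scheduleCheckPendingRequests(queue) {
  if (queue.pendingRequests.length === 0) return // nothing pending, no timer needed

  const earliestDeadline = Math.min(...queue.pendingRequests.map(r => r.expiresAt))
  const timeout = Math.max(earliestDeadline - Date.now(), 0)

  setTimeout(() => queue.checkPendingRequests(), timeout)
}
```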
@1solation

Also noticed this CPU increase in our app. It's using roughly 1.5x to 2x more CPU than previously.

@atiquefiroz

We have been using kafka-node for a very long time, and decided to move to kafkajs for publishing to start with.
We have a very high-throughput logging system (~0.5 million RPM) for a single microservice.
Once we switched to kafkajs 2.2.4, the CPU spike mentioned above was too much to handle on the resource side.
So I can validate that @MDSLKTR's findings affect the system in the expected way. We switched back to 2.2.3 and the resource utilisation came back to normal.
We should think about patching this in the next version. Attaching some system metrics from our production. (The first spike is when we switched from kafka-node to kafkajs 2.2.4; the second drop is when we deployed kafkajs 2.2.3.)
[Screenshots: production system metrics captured 2023-11-11]
