Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance Scaling Issues with KRaft-Based Kafka Clusters #10054

Open
see-quick opened this issue May 2, 2024 · 5 comments
Open

Performance Scaling Issues with KRaft-Based Kafka Clusters #10054

see-quick opened this issue May 2, 2024 · 5 comments

Comments

@see-quick
Copy link
Member

see-quick commented May 2, 2024

Related problem

In performance capacity tests for Topic Operators using KRaft-based Kafka clusters, several scaling limitations were observed. Notably, ARM architecture displayed better scalability than Intel's, with issues arising when the number of KafkaTopics approached 4000. Specific errors related to the LogManager and connection refusals were noted in Kafka broker logs, indicating potential faults in topic ID management and network communications under high-load scenarios.

Here is the problem with the LogManager (full log [1]):

2024-05-02 15:05:46,464 ERROR Encountered fatal fault: Error starting LogManager (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler) [kafka-0-metadata-loader-event-handler]
java.lang.RuntimeException: The log dir Log(dir=/var/lib/kafka/data/kafka-log0/my-topic-23124280-2137229491-2824-10, topic=my-topic-23124280-2137229491-2824, partition=10, highWatermark=0, lastStableOffset=0, logStartOffset=0, logEndOffset=0) does not have a topic ID, which is not allowed when running in KRaft mode.
	at kafka.log.LogManager$.$anonfun$findStrayReplicas$2(LogManager.scala:1559)
	at scala.Option.getOrElse(Option.scala:201)
	at kafka.log.LogManager$.$anonfun$findStrayReplicas$1(LogManager.scala:1558)
	at scala.collection.StrictOptimizedIterableOps.flatMap(StrictOptimizedIterableOps.scala:118)
	at scala.collection.StrictOptimizedIterableOps.flatMap$(StrictOptimizedIterableOps.scala:105)
	at scala.collection.mutable.ArrayBuffer.flatMap(ArrayBuffer.scala:43)
	at kafka.log.LogManager$.findStrayReplicas(LogManager.scala:1556)
	at kafka.log.LogManager.deleteStrayKRaftReplicas(LogManager.scala:577)
	at kafka.server.metadata.BrokerMetadataPublisher.initializeManagers(BrokerMetadataPublisher.scala:298)
	at kafka.server.metadata.BrokerMetadataPublisher.onMetadataUpdate(BrokerMetadataPublisher.scala:127)
	at org.apache.kafka.image.loader.MetadataLoader.initializeNewPublishers(MetadataLoader.java:309)
	at org.apache.kafka.image.loader.MetadataLoader.lambda$scheduleInitializeNewPublishers$0(MetadataLoader.java:266)
	at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
	at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
	at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
	at java.base/java.lang.Thread.run(Thread.java:840)

and this is the Topic Operator container log (full log [2])

2024-05-02 15:06:23,85526 DEBUG [-472234151-pool-3-thread-3] DefaultSharedIndexInformer:236 - Checking for resync at interval for kafka.strimzi.io/v1beta2/namespaces/co-namespace/kafkatopics
2024-05-02 15:06:24,45023 DEBUG [kafka-admin-client-thread | strimzi-topic-operator-4bf40b6e-5f7a-443d-aabd-673ede726ee9] Selector:606 - [AdminClient clientId=strimzi-topic-operator-4bf40b6e-5f7a-443d-aabd-673ede726ee9] Connection with cluster-03ebd4e7-b-cc352780-2.cluster-03ebd4e7-kafka-brokers.co-namespace.svc/10.244.4.26 (channelId=2) disconnected
java.net.ConnectException: Connection refused
	at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
	at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
	at org.apache.kafka.common.network.SslTransportLayer.finishConnect(SslTransportLayer.java:139) ~[org.apache.kafka.kafka-clients-3.7.0.jar:?]
	at org.apache.kafka.common.network.KafkaChannel.finishConnect(KafkaChannel.java:224) ~[org.apache.kafka.kafka-clients-3.7.0.jar:?]
	at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:526) ~[org.apache.kafka.kafka-clients-3.7.0.jar:?]
	at org.apache.kafka.common.network.Selector.poll(Selector.java:481) ~[org.apache.kafka.kafka-clients-3.7.0.jar:?]
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:585) ~[org.apache.kafka.kafka-clients-3.7.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1504) ~[org.apache.kafka.kafka-clients-3.7.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1435) ~[org.apache.kafka.kafka-clients-3.7.0.jar:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
2024-05-02 15:06:24,55200 DEBUG [kafka-admin-client-thread | strimzi-topic-operator-4bf40b6e-5f7a-443d-aabd-673ede726ee9] Selector:606 - [AdminClient clientId=strimzi-topic-operator-4bf40b6e-5f7a-443d-aabd-673ede726ee9] Connection with cluster-03ebd4e7-b-cc352780-0.cluster-03ebd4e7-kafka-brokers.co-namespace.svc/10.244.5.25 (channelId=0) disconnected
java.net.ConnectException: Connection refused
	at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
	at sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[?:?]
	at org.apache.kafka.common.network.SslTransportLayer.finishConnect(SslTransportLayer.java:139) ~[org.apache.kafka.kafka-clients-3.7.0.jar:?]
	at org.apache.kafka.common.network.KafkaChannel.finishConnect(KafkaChannel.java:224) ~[org.apache.kafka.kafka-clients-3.7.0.jar:?]
	at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:526) ~[org.apache.kafka.kafka-clients-3.7.0.jar:?]
	at org.apache.kafka.common.network.Selector.poll(Selector.java:481) ~[org.apache.kafka.kafka-clients-3.7.0.jar:?]
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:585) ~[org.apache.kafka.kafka-clients-3.7.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1504) ~[org.apache.kafka.kafka-clients-3.7.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1435) ~[org.apache.kafka.kafka-clients-3.7.0.jar:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]

[1] - https://artifacts.dev.testing-farm.io/ce820a24-0fe5-41c2-8701-fc0e045f5097/work-performance-topic-operator-capacity025kvr2n/systemtest/tmt/plans/performance-topic-operator-capacity/data/logs/logs/2024-05-02-14-02-54/io.strimzi.systemtest.performance.TopicOperatorPerformance/testCapacity/co-namespace/logs-pod-cluster-03ebd4e7-b-cc352780-0-container-kafka.log

[2] - https://artifacts.dev.testing-farm.io/ce820a24-0fe5-41c2-8701-fc0e045f5097/work-performance-topic-operator-capacity025kvr2n/systemtest/tmt/plans/performance-topic-operator-capacity/data/logs/logs/2024-05-02-14-02-54/io.strimzi.systemtest.performance.TopicOperatorPerformance/testCapacity/co-namespace/logs-pod-cluster-03ebd4e7-entity-operator-d78d79d5c-sjnrg-container-topic-operator.log

Suggested solution

Not sure, we should observe what's the main root cause.

Alternatives

No response

Additional context

The issue was specifically noted in scenarios where the number of KafkaTopics was scaled up to 4000, suggesting a threshold where current configurations begin to falter. Performance logs and detailed test outcomes from both ARM and Intel architecture setups (available in linked artefacts [1]) underline the need for targeted improvements in scalability and error management. All measurements:

And this is my specific findings:
Multi-node (ZK-based)

+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+
 | Experiment | IN: MAX QUEUE SIZE | IN: MAX BATCH SIZE (ms) | IN: MAX BATCH LINGER (ms) | OUT: Successful KafkaTopics Created | jvm_memory_used_megabytes_total.txt | strimzi_update_status_duration_seconds_max.txt | strimzi_reconciliations_max_queue_size.txt | system_load_average_per_core.txt | strimzi_reconciliations_max_batch_size.txt | strimzi_reconciliations_duration_seconds_max.txt | strimzi_create_topics_duration_seconds_max.txt | strimzi_describe_configs_duration_seconds_max.txt | strimzi_total_time_spend_on_uto_event_queue_duration_seconds.txt |
| 1          | 2147483647         | 100                     | 100                       | 3200                                | 215.26356                           | 0.219718797                                    | 56415.0                                    | 0.605                            | 100.0                                      | 22.554448171                                     | 8.737679829                                    | 0.201271031                                       | 102240.315236831                                                 |
| 2          | 2147483647         | 10                      | 1                         | 3200                                | 214.598832                          | 0.220416525                                    | 56232.0                                    | 0.38875                          | 10.0                                       | 2.898720235                                      | 1.531996913                                    | 0.061099615                                       | 12172.831177297001                                               |
| 3          | 2147483647         | 50                      | 100                       | 4800                                | 254.322944                          | 0.206390701                                    | 188946.0                                   | 1.34625                          | 50.0                                       | 62.053237627                                     | 4.362440202                                    | 60.049029957                                      | 63501.666334768                                                  |
| 4          | 2147483647         | 100                     | 500                       | 4000                                | 221.194208                          | 0.361124567                                    | 94535.0                                    | 0.565                            | 100.0                                      | 15.49413541                                      | 7.430776523                                    | 0.247871354                                       | 105783.41873366499                                               |
| 5          | 2147483647         | 500                     | 1000                      | 5500                                | 338.318472                          | 0.279250863                                    | 244963.0                                   | 1.22125                          | 500.0                                      | 76.091719096                                     | 60.006696895                                   | 60.049583745                                      | 663336.367527252                                                 |
| 6          | 2147483647         | 1000                    | 2000                      | 4300                                | 273.048416                          | 0.254702686                                    | 153003.0                                   | 1.29                             | 1000.0                                     | 90.530778363                                     | 10.867806866                                   | 60.07707421                                       | 1109098.767682166                                                |
+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+

Multi-node (KRaft-based)

+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+
| Experiment | IN: MAX QUEUE SIZE | IN: MAX BATCH SIZE (ms) | IN: MAX BATCH LINGER (ms) | OUT: Successful KafkaTopics Created | jvm_memory_used_megabytes_total.txt | strimzi_update_status_duration_seconds_max.txt | strimzi_reconciliations_max_queue_size.txt | system_load_average_per_core.txt | strimzi_reconciliations_max_batch_size.txt | strimzi_reconciliations_duration_seconds_max.txt | strimzi_create_topics_duration_seconds_max.txt | strimzi_describe_configs_duration_seconds_max.txt | strimzi_total_time_spend_on_uto_event_queue_duration_seconds.txt |
| 1          | 2147483647         | 100                     | 100                       | 5600                                | 195.439096                          | 0.22489477                                     | 58228.0                                    | 0.79                             | 100.0                                      | 1.784441551                                      | 0.139120398                                    | 0.063826924                                       | 40307.747651726                                                  |
| 2          | 2147483647         | 10                      | 1                         | 4300                                | 187.411216                          | 0.40710244                                     | 33721.0                                    | 0.765                            | 10.0                                       | 0.52696775                                       | 0.093613257                                    | 0.071502803                                       | 4324.554014178                                                   |
| 3          | 2147483647         | 50                      | 100                       | 4800                                | 220.39176                           | 0.275977841                                    | 42992.0                                    | 0.71                             | 50.0                                       | 1.234386444                                      | 0.248006673                                    | 0.067473604                                       | 20203.399965616998                                               |
| 4          | 2147483647         | 100                     | 500                       | 5700                                | 203.301648                          | 0.127244311                                    | 66151.0                                    | 0.78125                          | 100.0                                      | 2.267170077                                      | 0.168008125                                    | 0.078424406                                       | 42221.078131318005                                               |
| 5          | 2147483647         | 500                     | 1000                      | 5800                                | 374.976736                          | 0.239079959                                    | 129444.0                                   | 1.28125                          | 500.0                                      | 75.606669081                                     | 0.151590669                                    | 60.064203694                                      | 476065.79083824                                                  |
| 6          | 2147483647         | 1000                    | 2000                      | 5400                                | 306.997096                          | 0.224739169                                    | 134374.0                                   | 1.48                             | 1000.0                                     | 92.287175352                                     | 0.15948684                                     | 60.063947644                                      | 835873.344649334                                                 |
+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+

Testing farm ARM:
Use Case: capacityUseCase

+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+
| Experiment | IN: MAX QUEUE SIZE | IN: MAX BATCH SIZE (ms) | IN: MAX BATCH LINGER (ms) | OUT: Successful KafkaTopics Created | jvm_memory_used_megabytes_total.txt | strimzi_update_status_duration_seconds_max.txt | strimzi_reconciliations_max_queue_size.txt | system_load_average_per_core.txt | strimzi_reconciliations_max_batch_size.txt | strimzi_reconciliations_duration_seconds_max.txt | strimzi_create_topics_duration_seconds_max.txt | strimzi_describe_configs_duration_seconds_max.txt | strimzi_total_time_spend_on_uto_event_queue_duration_seconds.txt |
| 1          | 2147483647         | 100                     | 100                       | 4500                                | 731.639096                          | 0.098367982                                    | 78873.0                                    | 0.20421875                       | 100.0                                      | 60.754216112                                     | 60.002203204                                   | 60.000849276                                      | 108018.144774452                                                 |
| 2          | 2147483647         | 10                      | 1                         | 2800                                | 209.412064                          | 0.091291686                                    | 46315.0                                    | 0.216875                         | 10.0                                       | 60.092980774                                     | 0.075462826                                    | 60.000769675                                      | 6332.63210499                                                    |
| 3          | 2147483647         | 50                      | 100                       | 4400                                | 209.128448                          | 0.045046907                                    | 79613.0                                    | 0.2634375                        | 50.0                                       | 60.451420408                                     | 0.113813242                                    | 60.001154181                                      | 53516.134807393006                                               |
| 4          | 2147483647         | 100                     | 500                       | 5300                                | 288.959168                          | 0.101506987                                    | 92527.0                                    | 0.26671875                       | 100.0                                      | 60.828857234                                     | 0.195412376                                    | 60.00111564                                       | 118558.314525059                                                 |
| 5          | 2147483647         | 500                     | 1000                      | 4300                                | 272.103896                          | 0.049253628                                    | 73222.0                                    | 0.27296875                       | 500.0                                      | 63.1228414                                       | 0.190617895                                    | 60.001418996                                      | 350705.097103311                                                 |
| 6          | 2147483647         | 1000                    | 2000                      | 3600                                | 832.115216                          | 0.057186219                                    | 55337.0                                    | 0.2925                           | 1000.0                                     | 67.99075671                                      | 0.122869061                                    | 60.002316816                                      | 289868.954987523                                                 |
+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+

Testing farm Intel
Use Case: capacityUseCase

+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+
| Experiment | IN: MAX QUEUE SIZE | IN: MAX BATCH SIZE (ms) | IN: MAX BATCH LINGER (ms) | OUT: Successful KafkaTopics Created | jvm_memory_used_megabytes_total.txt | strimzi_update_status_duration_seconds_max.txt | strimzi_reconciliations_max_queue_size.txt | system_load_average_per_core.txt | strimzi_reconciliations_max_batch_size.txt | strimzi_reconciliations_duration_seconds_max.txt | strimzi_create_topics_duration_seconds_max.txt | strimzi_describe_configs_duration_seconds_max.txt | strimzi_total_time_spend_on_uto_event_queue_duration_seconds.txt |
| 1          | 2147483647         | 100                     | 100                       | 4100                                | 313.79264                           | 0.270303261                                    | 72879.0                                    | 0.120625                         | 100.0                                      | 60.713250828                                     | 60.001026375                                   | 60.001035712                                      | 23303.511383901                                                  |
| 2          | 2147483647         | 10                      | 1                         | 2800                                | 197.653832                          | 0.143603987                                    | 46974.0                                    | 0.15364583333333334              | 10.0                                       | 60.079330599                                     | 1.024587598                                    | 60.00056892                                       | 6277.153443556                                                   |
| 3          | 2147483647         | 50                      | 100                       | 4100                                | 202.440888                          | 0.086903046                                    | 67232.0                                    | 0.12760416666666666              | 50.0                                       | 60.286231975                                     | 60.001271054                                   | 60.000842754                                      | 50124.335254043                                                  |
| 4          | 2147483647         | 100                     | 500                       | 4900                                | 821.440424                          | 0.099006388                                    | 84448.0                                    | 0.16447916666666665              | 100.0                                      | 60.687474644                                     | 0.078397415                                    | 60.001052049                                      | 112642.51520577799                                               |
| 5          | 2147483647         | 500                     | 1000                      | 3400                                | 209.177632                          | 0.119279889                                    | 52186.0                                    | 0.14885416666666665              | 500.0                                      | 63.252552456                                     | 60.00182004                                    | 60.001381924                                      | 290553.26413272496                                               |
| 6          | 2147483647         | 1000                    | 2000                      | 4100                                | 1101.374072                         | 0.128018699                                    | 64970.0                                    | 0.1259375                        | 1000.0                                     | 66.056154521                                     | 0.082030432                                    | 60.002737845                                      | 406052.291981094                                                 |
+------------+--------------------+-------------------------+---------------------------+-------------------------------------+-------------------------------------+------------------------------------------------+--------------------------------------------+----------------------------------+--------------------------------------------+--------------------------------------------------+------------------------------------------------+---------------------------------------------------+------------------------------------------------------------------+

[1] - https://artifacts.dev.testing-farm.io/ce820a24-0fe5-41c2-8701-fc0e045f5097/work-performance-topic-operator-capacity025kvr2n/systemtest/tmt/plans/performance-topic-operator-capacity/data/logs/logs/2024-05-02-14-02-54/io.strimzi.systemtest.performance.TopicOperatorPerformance/testCapacity/co-namespace/

@scholzj
Copy link
Member

scholzj commented May 16, 2024

Discussed on the community call on 16.5.2024: @mimaison will try to have a look at the Kafka error in more detail and we can get back to it next time.

@showuon
Copy link
Contributor

showuon commented May 17, 2024

@see-quick , do you have full log for the error case? I am thinking if there's some errors while creating topic earlier?

@see-quick
Copy link
Member Author

@showuon
Copy link
Contributor

showuon commented May 22, 2024

Identified a bug in upstream Kafka. Filed KAFKA-16814 for this issue.

@ppatierno
Copy link
Member

Triaged on 30/5/2024: let's keep this open and waiting for the corresponding Kafka issue to be fixed and available in next releases Kafka 3.8.0 and 3.7.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants