Fix stalls in forward_service::dispatch() with large tablet count #18695

raphaelsc · 2024-05-15T19:32:18Z

With a large tablet count, e.g. 128k, forward_service::dispatch() can potentially stall when grouping ranges per endpoint.

Reactor stalled for 4 ms on shard 1. Backtrace: 0x5eb15ea 0x5eb09f5 0x5eb1daf 0x3dbaf 0x2d01e57 0x33f7d1e 0x348255f 0x2d005d4 0x2d3d017 0x2d3d58c 0x2d3d225 0x5e59622 0x5ec328f 0x5ec4577 0x5ee84e0 0x5e8394a 0x8c946 0x11296f

This series also removes inefficient copies: the partition_range_vector for a single endpoint can grow beyond 1 MB.
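To illustrate what removing the copies amounts to, here is a minimal standalone sketch, not the code from this series. `range_stub` and `group_by_endpoint` are hypothetical names; `range_stub` stands in for partition_range and only counts copy constructions. Ranges are moved into the per-endpoint vectors, so the copy counter stays at zero.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for partition_range; counts copy constructions.
struct range_stub {
    static inline std::size_t copies = 0;
    int start = 0;
    int end = 0;

    range_stub(int s, int e) : start(s), end(e) { assert(s <= e); }
    range_stub(const range_stub& o) : start(o.start), end(o.end) { ++copies; }
    // noexcept move lets std::vector move elements when it reallocates.
    range_stub(range_stub&&) noexcept = default;
    range_stub& operator=(const range_stub&) = default;
    range_stub& operator=(range_stub&&) noexcept = default;
};

using range_vector = std::vector<range_stub>;

// Groups ranges per endpoint by moving each range out of the input.
std::map<std::string, range_vector>
group_by_endpoint(std::vector<std::pair<std::string, range_stub>> input) {
    std::map<std::string, range_vector> grouped;
    for (auto& [endpoint, range] : input) {
        grouped[endpoint].push_back(std::move(range));
    }
    return grouped;
}
```

Callers hand the input over with `std::move(input)`; passing it by value without the move would reintroduce one copy per range.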

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Each partition_range_vector might grow to ~9600 elements, assuming 96-shard nodes with 100 tablets per shard. At 120 bytes per element (sizeof(partition_range)), ~9600 elements can result in a vector with a capacity of ~2 MB due to the growth factor of 2. We're copying each range 3x in dispatch(), and we can easily avoid it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
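The arithmetic behind these numbers can be checked with a short sketch. It assumes a vector whose capacity starts at 1 and doubles on every reallocation, which is a simplified model rather than a guarantee of any particular standard library; `doubled_capacity` and `capacity_bytes` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Smallest capacity >= count for a vector that starts at 1 and doubles.
constexpr std::size_t doubled_capacity(std::size_t count) {
    std::size_t cap = 1;
    while (cap < count) {
        cap *= 2;
    }
    return cap;
}

// Bytes reserved by such a vector holding `count` elements of `elem_size` bytes.
std::size_t capacity_bytes(std::size_t count, std::size_t elem_size) {
    assert(elem_size > 0);
    return doubled_capacity(count) * elem_size;
}

constexpr std::size_t shards = 96;
constexpr std::size_t tablets_per_shard = 100;
constexpr std::size_t range_size = 120;  // sizeof(partition_range) as quoted above
constexpr std::size_t elements = shards * tablets_per_shard;  // 9600

static_assert(elements * range_size == 1'152'000, "payload is about 1.1 MB");
static_assert(doubled_capacity(elements) == 16384, "next power of two");
static_assert(doubled_capacity(elements) * range_size == 1'966'080, "about 2 MB");
```

Under this model the payload is about 1.1 MB, but the reserved capacity rounds up to 16384 elements, which is where the ~2 MB figure comes from.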

With a large tablet count, e.g. 128k, forward_service::dispatch() can potentially stall when grouping ranges per endpoint.

Reactor stalled for 4 ms on shard 1. Backtrace: 0x5eb15ea 0x5eb09f5 0x5eb1daf 0x3dbaf 0x2d01e57 0x33f7d1e 0x348255f 0x2d005d4 0x2d3d017 0x2d3d58c 0x2d3d225 0x5e59622 0x5ec328f 0x5ec4577 0x5ee84e0 0x5e8394a 0x8c946 0x11296f

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
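A common way to avoid this kind of stall is to give the reactor a chance to preempt inside the long loop. The sketch below is plain C++ and only models the idea: `group_with_yield_points`, `token_range` and the `yield_point` callback are hypothetical. In a Seastar coroutine the hook would typically be a call such as `co_await seastar::coroutine::maybe_yield()`; that this series uses exactly that call is an assumption, not something stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a partition range.
struct token_range {
    long start;
    long end;
};

// Groups ranges per endpoint and invokes `yield_point` after every element,
// modelling where a preemption check would sit in the real loop.
std::map<std::string, std::vector<token_range>>
group_with_yield_points(std::vector<std::pair<std::string, token_range>> input,
                        const std::function<void()>& yield_point) {
    assert(yield_point);  // an empty std::function would throw when called
    std::map<std::string, std::vector<token_range>> grouped;
    for (auto& [endpoint, range] : input) {
        grouped[endpoint].push_back(std::move(range));
        yield_point();
    }
    return grouped;
}
```

Calling the hook on every iteration is cheap when the check itself is cheap; the point is that no single uninterrupted stretch of work scales with the tablet count.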

scylladb-promoter · 2024-05-16T04:10:21Z

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
✅ - Unit Tests

Build Details:

Duration: 8 hr 0 min
Builder: spider8.cloudius-systems.com

raphaelsc added 4 commits May 15, 2024 16:30

service: coroutinize forward_service::dispatch()

f9d2b9a

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

service: fix indentation in dispatch()

012ba25

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

raphaelsc requested a review from avikivity May 15, 2024 19:32

raphaelsc added the backport/none Backport is not required label May 15, 2024

scylladb-promoter merged commit 6982de6 into scylladb:master May 16, 2024
12 of 13 checks passed

github-actions bot added the promoted-to-master label May 16, 2024
