
Fix stalls in forward_service::dispatch() with large tablet count #18695

raphaelsc
Member

With a large tablet count, e.g. 128k, forward_service::dispatch() can potentially stall when grouping ranges per endpoint.

Reactor stalled for 4 ms on shard 1. Backtrace: 0x5eb15ea 0x5eb09f5 0x5eb1daf 0x3dbaf 0x2d01e57 0x33f7d1e 0x348255f 0x2d005d4 0x2d3d017 0x2d3d58c 0x2d3d225 0x5e59622 0x5ec328f 0x5ec4577 0x5ee84e0 0x5e8394a 0x8c946 0x11296f

This also removes inefficient copies: the partition_range_vector for a single endpoint can grow beyond 1 MB.
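A minimal sketch of grouping ranges per endpoint by moving each range into its bucket instead of copying it. It assumes simplified stand-ins: `partition_range`, `routed_range` and `group_by_endpoint()` are hypothetical and do not reflect Scylla's actual `dht::partition_range` type or the code in `forward_service::dispatch()`.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a partition range; string keys make a move
// cheaper than a copy, as with real ranges that own key data.
struct partition_range {
    std::string start_key;
    std::string end_key;
};
using partition_range_vector = std::vector<partition_range>;
using endpoint = std::string;

// Hypothetical pairing of a range with the replica chosen to serve it.
struct routed_range {
    endpoint ep;
    partition_range range;
};

// Group ranges per endpoint. `routed` is taken by value so the caller can
// hand over ownership, and each range is moved (not copied) into its bucket.
std::unordered_map<endpoint, partition_range_vector>
group_by_endpoint(std::vector<routed_range> routed) {
    std::unordered_map<endpoint, partition_range_vector> per_endpoint;
    for (auto& r : routed) {
        per_endpoint[r.ep].push_back(std::move(r.range));
    }
    return per_endpoint;
}
```

Because the input is consumed, the only allocations left are the per-endpoint vectors themselves.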

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Each partition_range_vector might grow to ~9600 elements, assuming
96-shard nodes with 100 tablets per shard.

~9600 elements of 120 bytes each (sizeof(partition_range)) take
~1.15 MB, and with a vector growth factor of 2 the capacity can
reach ~2 MB (16384 elements x 120 bytes).
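The arithmetic above can be checked with a small compile-time sketch. It assumes a simplified growth model in which capacity starts at 1 and doubles on every reallocation; real `std::vector` growth is implementation-defined, and the constant names are illustrative only.

```cpp
#include <cstddef>

// Capacity reached by a vector that starts at 1 element and doubles on each
// reallocation (simplified model of a growth factor of 2).
constexpr std::size_t doubled_capacity(std::size_t elements) {
    std::size_t cap = 1;
    while (cap < elements) {
        cap *= 2;
    }
    return cap;
}

constexpr std::size_t shards_per_node = 96;
constexpr std::size_t tablets_per_shard = 100;
constexpr std::size_t range_size = 120;  // sizeof(partition_range), per the commit message

// 96 * 100 = 9600 ranges for a single endpoint.
constexpr std::size_t elements = shards_per_node * tablets_per_shard;
// 9600 * 120 = 1,152,000 bytes actually used (~1.15 MB).
constexpr std::size_t used_bytes = elements * range_size;
// Next power of two is 16384, so 16384 * 120 = 1,966,080 bytes reserved (~2 MB).
constexpr std::size_t reserved_bytes = doubled_capacity(elements) * range_size;

static_assert(elements == 9600);
static_assert(doubled_capacity(elements) == 16384);
static_assert(reserved_bytes == 1966080);
```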

We're copying each range three times in dispatch(), which can easily
be avoided.
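To illustrate why those copies matter, here is a minimal sketch with a hypothetical instrumented type, `counted_range`, that counts copies and moves. `forward_by_copy()` and `forward_by_move()` are illustrative helpers, not functions from the Scylla code base; the sketch only contrasts the two hand-off styles.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical instrumented stand-in for a range type: counts how often it is
// copied versus moved, so the cost of each hand-off is visible.
struct counted_range {
    static inline std::size_t copies = 0;
    static inline std::size_t moves = 0;
    int id = 0;

    explicit counted_range(int i) : id(i) {}
    counted_range(const counted_range& o) : id(o.id) { ++copies; }
    counted_range(counted_range&& o) noexcept : id(o.id) { ++moves; }
    counted_range& operator=(const counted_range& o) { id = o.id; ++copies; return *this; }
    counted_range& operator=(counted_range&& o) noexcept { id = o.id; ++moves; return *this; }
};

// Copying hand-off: the source stays intact and every element is duplicated.
std::vector<counted_range> forward_by_copy(const std::vector<counted_range>& in) {
    std::vector<counted_range> out;
    out.reserve(in.size());
    for (const auto& r : in) {
        out.push_back(r);  // one copy per element
    }
    return out;
}

// Moving hand-off: the caller gives up ownership, so only the vector's buffer
// pointer changes hands and no element is copied or moved.
std::vector<counted_range> forward_by_move(std::vector<counted_range> in) {
    return in;  // a by-value parameter is implicitly moved on return
}
```

With 120-byte ranges and thousands of them per endpoint, avoiding a per-element copy at each hand-off saves both CPU time and transient memory.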

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With a large tablet count, e.g. 128k, forward_service::dispatch() can
potentially stall when grouping ranges per endpoint.
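One common way to keep a long grouping loop from stalling a cooperative scheduler is to add preemption points. The sketch below models that idea in standard C++, assuming a caller-supplied `yield` callback invoked every `batch` items; `tagged_range` and `group_with_yields()` are hypothetical, and this is not necessarily how the commit fixes the stall.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical simplified input: a range (as a string) tagged with its endpoint.
struct tagged_range {
    std::string endpoint;
    std::string range;
};

// Group ranges per endpoint, calling `yield` after every `batch` items so the
// scheduler can run other tasks. `batch` must be greater than zero.
std::unordered_map<std::string, std::vector<std::string>>
group_with_yields(std::vector<tagged_range> ranges,
                  std::size_t batch,
                  const std::function<void()>& yield) {
    std::unordered_map<std::string, std::vector<std::string>> per_endpoint;
    std::size_t since_yield = 0;
    for (auto& r : ranges) {
        per_endpoint[r.endpoint].push_back(std::move(r.range));
        if (++since_yield == batch) {
            since_yield = 0;
            yield();  // preemption point
        }
    }
    return per_endpoint;
}
```

In a Seastar coroutine, a comparable preemption point is `co_await seastar::coroutine::maybe_yield();`, which suspends only when preemption is needed, so it can be placed inside the loop body without batching.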

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
raphaelsc requested a review from avikivity May 15, 2024 19:32
raphaelsc added the backport/none (Backport is not required) label May 15, 2024
scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest with topology changes
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 8 hr 0 min
  • Builder: spider8.cloudius-systems.com

scylladb-promoter merged commit 6982de6 into scylladb:master May 16, 2024
12 of 13 checks passed
Labels
backport/none (Backport is not required), promoted-to-master

2 participants