[DTest]: Scale test with 100 nodes failed with Aborting due to allocation failure: failed to refill emergency reserve of 30 (have 12 free segments) #18669
Comments
My comments on the old issue:
Summarizing: we should check whether it's a regression or not. It could indeed be, as @aleksbykov suggested, that the instance is too small (running out of memory), because we have 72 nodes running on a single machine (this is a dtest). If it happens on 5.4 too, there should be nothing to worry about.
Marking as release blocker for now, before checking the above, but there's a chance we will be able to take it off and/or close the issue quickly.
We could compare 3 runs:
From the other issue:
But I see no @aleksbykov you could also check whether it keeps failing with
Job https://jenkins.scylladb.com/job/scylla-staging/job/abykov/job/Dtest/job/master-dtest-with-raft/272/ also failed, also at around 71 nodes. Not all logs are collected yet, but there are also coredumps.
@kbr-scylla , the test was run with the --overprovisioned option enabled (according to the log in the ScyllaNode.start method), because cpuset was not passed in the parameters:
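A minimal sketch of the behavior described above (illustrative only; `build_scylla_args` is a hypothetical helper, not the actual ScyllaNode.start code):

```python
def build_scylla_args(cpuset=None):
    """Sketch: when no cpuset is passed, fall back to
    --overprovisioned so many nodes can share the machine's CPUs,
    which is what the dtest log suggests happened here."""
    args = []
    if cpuset is not None:
        # Pin the node to specific cores when a cpuset is given.
        args += ["--cpuset", cpuset]
    else:
        # No pinning: tell the node it is sharing the machine.
        args += ["--overprovisioned"]
    return args
```

With 100 nodes on one machine and no cpuset per node, every node ends up running with --overprovisioned and competing for the same cores.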
Ok, so there's a chance there's no regression. Let's confirm with a 5.4 run.
Good
It is running now.
Failed by timeout: the configured timeout for the test was not enough. It was increased and the job has been rerun.
Are you saying it failed at node 69?
Issue reproduced with 5.4. The abort looks very similar: it failed on the 87th node.
@yaronkaikov , is it possible to run this custom test (master-dtest-with-raft/269) with 100 nodes on a more powerful instance?
So it's most likely as you said -- the instance is too weak to deal with that many nodes. Let's see if the test passes on
How many instances do you need? We may need to create a new auto-scaling group with more powerful instances for testing. BTW, you have a job that has been running for 5 days now: https://jenkins.scylladb.com/job/scylla-staging/job/abykov/job/Dtest/job/master-dtest-with-raft/272/
At the moment, I'm considering running these tests to verify a theory that the issue lies in the lack of resources on the instance to perform tests with 100 nodes. I want to do this through Jenkins because it will be more convenient to share the results and rerun the tests or reproduce the problem. As far as I know, we currently have one test with such a large configuration. The new auto-scaling groups sound great, and it should be easy to set them up in the current job parameters, if I'm not mistaken.
Moving to 6.1, just to remove it from the 6.0 blockers list for the time being.
@aleksbykov Done, you can use it by using the label
The latest job with the stronger instance failed by timeout (not enough time was configured for the test run). The timeout was increased and the job was rerun.
The job failed on the large instance with raft topology as well. The topology coordinator failed with the same error:
Job logs: link
latest run: m5ad.12xlarge (192.0 GiB RAM, 48 vCPUs, 1800 GB storage: 2 × 900 GB NVMe SSD, 10 Gigabit network)
previous run: c5ad.8xlarge (64.0 GiB RAM, 32 vCPUs, 1200 GB storage: 2 × 600 GB NVMe SSD, 10 Gigabit network)
@aleksbykov we should compare to 5.4 for that larger instance as well.
Can you explain this decision to me? @mykaul if it's not a blocker for 6.0, then we should get rid of How can we pretend that we solved release blockers if we didn't solve them -- just for the sake of reducing some metric (the number of 6.0 release blockers)?
I'll try to explain: this limitation is not one that we'll fix quickly (otherwise we would have fixed it by now), but one we could document and fix later, in 6.0.x, which means we need to begin by fixing it in 6.1 first. I moved it as is, with the 'release blocker' flag, so it'll be a higher priority to fix than other items in the 6.1 queue.
@kbr-scylla , I started it yesterday. 86 nodes were added; the 87th node failed to start because node16 aborted.
I don't think there is a real limitation. This is a dtest which boots 100 nodes on a single machine, which is something we don't officially support. And we know that it also fails in 5.4, after booting a similar number of nodes as in 6.0. So if there is a problem, it was already there. The next steps we should take are:
@kbr-scylla , I think it was planned. I will search for it and trigger it.
I don't think it was ever run. We have a 40-node test which passes. I created the 100-node test just as a POC and a single-run check, as @mykaul requested.
Probably not. We don't need to support 100 nodes running on a single machine. Also now I understand (after Avi pointed it out) that the problem is probably here:
The larger instance you tried:
should be enough: if each node takes 1 GB of memory and we boot 100 nodes, then 100 GB of memory should be enough. We're allocating 1 GB for two shards, so 512 MB per shard, and most likely what happens is that some topology-related metadata is trying to take over 512 MB of memory when we have 100 nodes. This metadata should not be that large. I would guess So @aleksbykov I have one more request: please run the test again on the larger instance, but this time use
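The arithmetic above can be checked directly (a quick sketch using only the figures quoted in this thread: 1 GB per node, 2 shards per node, 100 nodes):

```python
# Memory-budget sketch based on the numbers quoted in the thread.
GIB = 1024 ** 3
nodes = 100
mem_per_node = 1 * GIB        # each dtest node is booted with ~1 GB
shards_per_node = 2

per_shard = mem_per_node // shards_per_node
total = nodes * mem_per_node

print(per_shard // 2**20, "MiB per shard")  # 512 MiB per shard
print(total // GIB, "GiB total")            # 100 GiB total
```

So the cluster as a whole fits comfortably in the 192 GiB m5ad.12xlarge; the tight budget is the 512 MiB available to each individual shard.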
Ok, I will check.
Installation details
Scylla version (or git commit hash): Scylla version 5.5.0~dev-0.20240513.2ce643d06bf0 with build-id 2e2c89cbb469c1231861753c4af823040a31579e
Cluster size: up to 100 nodes
POC dtest [update_cluster_layout_tests.py::TestLargeScaleCluster::test_add_many_nodes_under_load_100_nodes](https://jenkins.scylladb.com/job/scylla-staging/job/abykov/job/Dtest/job/master-dtest-with-raft/269/artifact/logs-full.release.000/1715679351559_update_cluster_layout_tests.py%3A%3ATestLargeScaleCluster%3A%3Atest_add_many_nodes_under_load_100_nodes/) failed upon adding the 72nd node.
With one hundred nodes, the test failed on node 72: startup failed because node1 (the topology coordinator) failed with a core dump:
Node1 reported:
Test is presented by this commit: https://github.com/aleksbykov/scylla-dtest/commit/e9be258e70810650bcf3bda46af06f4b5d720b00