This is a duplicate of #3611, but since that issue has a long history of comments and had gone stale, I decided to create a new one with updated argumentation.
One of the more important changes in the SM 3.2 repair was sticking to the one-job-per-host rule. I believe that for a bigger cluster this can kill any parallelism at the node level. Let's analyze a big cluster: 2 DCs with 30 nodes each, where all nodes have `max_repair_ranges_in_parallel = 7`. By default, each keyspace in such a cluster consists of 60 * 256 = 15360 token ranges (256 vnodes per node). Assuming that the keyspace has replication `{'dc1': 3, 'dc2': 3}`, there are (30! / (3! * 27!))^2 = 4060^2 = 16,483,600 possible replica sets. Assuming that token ranges are distributed uniformly across all possible replica sets, it is rather unlikely that a single repaired replica set owns more than one token range (the expected count is 15360 / 16483600 ≈ 0.001). Combined with the fact that SM sends a repair job for only a single replica set at a time, this results in SM sending only a single token range per repair job despite `max_repair_ranges_in_parallel = 7`.
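For a quick sanity check of the arithmetic above, here is a minimal Go sketch (standalone numbers only, not SM code) that reproduces the counts for the assumed topology of 2 DCs x 30 nodes, 256 vnodes per node, and RF 3 per DC:

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	// Assumed topology: 2 DCs x 30 nodes, 256 vnodes per node, RF 3 per DC.
	const nodesPerDC, vnodes, rf = 30, 256, 3

	tokenRanges := 2 * nodesPerDC * vnodes // 60 * 256 = 15360

	// C(30, 3) replica-set choices per DC, squared for the two DCs.
	perDC := new(big.Int).Binomial(nodesPerDC, rf) // 4060
	replicaSets := new(big.Int).Mul(perDC, perDC)  // 16483600

	fmt.Println("token ranges:", tokenRanges)
	fmt.Println("possible replica sets:", replicaSets)

	// Expected token ranges per replica set under a uniform distribution:
	// 15360 / 16483600 ≈ 0.00093, i.e. almost every repaired replica set
	// owns at most one token range.
	expected := float64(tokenRanges) / float64(replicaSets.Int64())
	fmt.Printf("expected ranges per replica set: %.5f\n", expected)
}
```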
This behavior could be controlled by an additional flag or a repair config option in `scylla-manager.yaml`, for example as sketched below.
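Purely as an illustration of what such an option could look like (no such option exists today; the name and semantics are hypothetical):

```yaml
# scylla-manager.yaml -- hypothetical option, not part of the current config
repair:
  # Allow batching token ranges from multiple replica sets owned by the
  # same host into one repair job, so that max_repair_ranges_in_parallel
  # can actually be utilized on big clusters. Illustrative name only.
  batch_replica_sets_per_host: true
```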
In terms of testing, it would be good to see the performance improvement on a big cluster, e.g. 2 DCs with 15 nodes each, a keyspace with RF 3 in each DC, and a setup in which the repair actually has to do some work (missing rows on some nodes). This bigger setup would definitely require help from QA.