Increase the default batch size for restore #3809
Grooming notes: We need an SCT test covering the restore process, with the job available in Jenkins. The goal is to check and observe the metrics during the restore and measure the impact of different batch size and transfers values on the whole restore process. Using just 1 TB of data on the cluster is enough. There is not much (if any) Scylla Manager development work required; the most important part is to collect the metrics from the SCT run and make sure that this part of SCT is working. @mikliapko to keep this issue on his plate and start working on it soon. @Michal-Leszczynski we need to find the old restore SCT test and put a link.
@karol-kokoszka @Michal-Leszczynski From what I see, the current test for 4 TB restore returns data only about:
Do we need anything else?
@mikliapko it would be good to get both SM and Scylla metrics.
The requested job is ready:
I will add the pipeline to our Jenkins folder after the PR is merged. @Michal-Leszczynski @karol-kokoszka Please check it out in Jenkins/Argus and let me know if you want to have something else in this job.
@mikliapko I validated that the metrics are there. The only strange thing is that making schema changes visible in Grafana doesn't seem to work, but that's not a big deal. Since this job does not provide a way to change restore task params in the Jenkins job params, we will still need to fork the repo, change the restore task params, and provide this fork in the job config, but that's also manageable.
@Michal-Leszczynski |
I think that it would be useful to have:
But even then, if we want to change the backup size or cluster topology, we need to do it by forking the repo and changing the YAML config files, so adding them probably won't solve the whole problem, but it would be a step in a convenient direction. cc: @karol-kokoszka
Maybe it would be good if the Jenkins job allowed specifying all the arguments for
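For context, a hedged sketch of what a fully parameterized restore invocation could look like; the flag names follow sctool conventions but should be verified against the sctool version actually deployed:

```bash
# Hypothetical sketch of the restore task arguments a parameterized
# Jenkins job could expose; values here are placeholders, and the
# exact flag set should be checked against `sctool restore --help`.
sctool restore -c my-cluster \
  --location s3:my-backup-bucket \
  --snapshot-tag sm_20240101000000UTC \
  --batch-size 2 \
  --parallel 0 \
  --restore-tables
```

A job that simply forwards its parameters into such a command line would remove the need to fork the repo for each batch-size/transfers experiment.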
@karol-kokoszka @Michal-Leszczynski Yes, I fully agree about the convenience such a parameterized solution would provide. I'll consider how to address all these points in SCT and will keep you both informed here.
So, from my brief look at how it's done in SCT, we can make
About cluster topology: if we talk only about the number of DB nodes in a single-DC cluster, making this param configurable does not seem like a big deal either. Backup size is a bit harder, as the current approach comes with stress commands hardcoded in the configuration file, where the amount of DB entries is defined. As a workaround, we can introduce a couple of jobs for different amounts of data, where all the other requested parameters would be configurable.
@karol-kokoszka @Michal-Leszczynski About cluster topology: what parameters would you like to have adjustable (number of nodes, multi-DC/single-DC, tablets), and how critical is it?
One option is to define several configuration files for different backup sizes, for example 500 GB, 1 TB, 2 TB, or anything else you'd like to have.
@mikliapko The more we can configure, the better. A multi-DC option would be good to have.
@karol-kokoszka What do you think about backup sizes? |
Those backup sizes look good, but maybe add 5 TB as well.
As mentioned in scylladb/scylladb#18234 (comment), using a batch size smaller than shard_cnt results in not utilizing all available shards on the node.
Although the current implementation allows for better control over the restore batch size (and therefore shard utilization), the default value (2) is definitely too low for a full cluster restore. We should introduce a special batch size value (0) that translates to 2 * shard_cnt for each node, and make it the default.
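The proposed default can be sketched as follows; this is a minimal illustration in Go, and the function and parameter names here are illustrative, not Scylla Manager's actual identifiers:

```go
package main

import "fmt"

// effectiveBatchSize sketches the proposed behavior: a configured
// batch size of 0 is a special "auto" value that scales with the
// node's shard count, so every shard gets work during a full
// cluster restore; any explicit positive value is kept as-is.
func effectiveBatchSize(configured, shardCnt int) int {
	if configured == 0 {
		return 2 * shardCnt
	}
	return configured
}

func main() {
	fmt.Println(effectiveBatchSize(0, 14)) // auto: 2 * 14 = 28
	fmt.Println(effectiveBatchSize(2, 14)) // explicit value, current default: 2
}
```

With this scheme, a node with 14 shards and the auto setting would restore in batches of 28, instead of the fixed default of 2 that leaves most shards idle.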
cc: @karol-kokoszka @tzach