Increase the default batch size for restore #3809

Open
Michal-Leszczynski opened this issue Apr 17, 2024 · 15 comments

@Michal-Leszczynski
Collaborator

As mentioned in scylladb/scylladb#18234 (comment), using a batch size smaller than shard_cnt results in not utilizing all available shards on the node.

Although the current implementation allows for better control over the restore batch size (and therefore shard utilization), the default value (2) is definitely too low for a full-cluster restore. We should introduce a special batch size value (0) that translates to 2 * shard_cnt for each node and make it the default.
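
For illustration, a minimal sketch of the proposed defaulting rule, assuming 0 means "derive from the node's shard count" (the actual change would land in scylla-manager's Go code; the names below are hypothetical):

```python
# A minimal sketch of the proposed special value, not the actual
# scylla-manager implementation (which is written in Go).
def effective_batch_size(batch_size: int, shard_cnt: int) -> int:
    if batch_size == 0:
        # Proposed default: two batches per shard, so every shard
        # on the node has work queued.
        return 2 * shard_cnt
    return batch_size

assert effective_batch_size(0, 8) == 16  # special value derives from shards
assert effective_batch_size(2, 8) == 2   # an explicit value is kept as-is
```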

cc: @karol-kokoszka @tzach

@karol-kokoszka
Collaborator

Grooming notes:

We need an SCT test covering the restore process.
We want to use it as a benchmark for further changes/improvements: to compare different values of batch size (which is what this issue is about) and to compare the behavior when the number of parallel transfers is changed.

We want the job to be available in Jenkins.
We need to be able to choose the manager branch that the test runs against.
We need to be able to select the Scylla version.

The goal is to observe the metrics during the restore and measure the impact of different batch size and transfers values on the whole restore process.

It's enough to use just 1 TB of data on the cluster.


There is not much (if any) Scylla Manager development work required.
The effort will be mostly on testing and building the CI job.

The most important part is to collect the metrics from the SCT run. Make sure that this part of SCT is working.

@mikliapko To keep this issue on his plate and start working on it soon.

@Michal-Leszczynski we need to find the old restore SCT test and put a link.

@mikliapko

@karol-kokoszka @Michal-Leszczynski
Could you please elaborate a bit on the metrics we should collect from such runs?

From what I see, the current test for a 4 TB restore only reports:

  • backup run time;
  • restore run time.

Do we need anything else?

@Michal-Leszczynski
Collaborator Author

@mikliapko it would be good to get both SM and Scylla metrics.
The mentioned run didn't collect them automatically (not sure why), so I had to collect them manually by logging into the monitor node and copying the files. The manually collected metrics can be found here.

@mikliapko

mikliapko commented May 7, 2024

The requested job is ready:

I will add the pipeline in our Jenkins folder after the PR is merged.

@Michal-Leszczynski @karol-kokoszka Please check it out in Jenkins/Argus. Let me know if you want to have something else in this job.

@Michal-Leszczynski
Collaborator Author

@mikliapko I validated that the metrics are there. The only strange thing is that making schema changes visible in Grafana doesn't seem to work, but that's not a big deal. Since this job does not expose restore task params as Jenkins job params, we will still need to fork the repo, change the restore task params, and point the job config at the fork, but that's also manageable.

@mikliapko

mikliapko commented May 8, 2024

> @mikliapko I validated that the metrics are there. The only strange thing is that making schema changes visible in Grafana doesn't seem to work, but that's not a big deal. Since this job does not expose restore task params as Jenkins job params, we will still need to fork the repo, change the restore task params, and point the job config at the fork, but that's also manageable.

@Michal-Leszczynski
Please provide the list of restore parameters you want to be editable in the pipeline. I'll check how to integrate them and how much effort is needed.

@Michal-Leszczynski
Collaborator Author

I think that it would be useful to have:

  • keyspace
  • batch-size
  • parallel

But even then, if we want to change the backup size or cluster topology, we still need to fork the repo and change the YAML config files, so adding these params won't solve the whole problem, but it will be a step in a convenient direction (a sketch of wiring them follows below).
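
For illustration, a hedged sketch (not actual SCT code) of wiring these three parameters into an sctool restore invocation. The SCT_* environment variable names are hypothetical; --keyspace, --batch-size, and --parallel are existing sctool restore flags:

```python
import os

# Hypothetical wiring of Jenkins job params (surfaced here as environment
# variables) into the restore task. Required flags such as --cluster and
# --snapshot-tag are omitted for brevity.
def restore_cmd() -> list[str]:
    batch_size = os.environ.get("SCT_RESTORE_BATCH_SIZE", "2")
    parallel = os.environ.get("SCT_RESTORE_PARALLEL", "0")
    keyspace = os.environ.get("SCT_RESTORE_KEYSPACE", "")
    cmd = ["sctool", "restore", "--batch-size", batch_size, "--parallel", parallel]
    if keyspace:
        cmd += ["--keyspace", keyspace]
    return cmd
```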

cc: @karol-kokoszka

@karol-kokoszka
Collaborator

Maybe it would be good if the Jenkins job allowed specifying all the arguments for sctool restore?
Modifying the backup size and cluster topology would be a great option too; the question is, @mikliapko, how difficult do you think that would be?

@mikliapko

mikliapko commented May 8, 2024

@karol-kokoszka @Michal-Leszczynski

Yes, I fully agree about the convenience such a parameterized solution would provide us. I'll consider how to address all these points in SCT and will keep you both informed here.

@mikliapko

So, from my brief look at how it's done in SCT, we can make the keyspaces_num, batch_size, and parallel parameters configurable from the pipeline.

About cluster topology: if we're talking only about the number of DB nodes in a single-DC cluster, making this param configurable is not a big deal either.

Backup size is a bit harder, as the current approach relies on stress commands hardcoded in the configuration file, where the number of DB entries is defined. As a workaround, we can introduce a couple of jobs for different amounts of data, with all the other requested parameters configurable (see the sketch below).
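
For illustration, a hedged sketch of deriving the stress command from a configurable dataset size instead of a hardcoded one. SCT_DATASET_GB and the row-size assumption are hypothetical; only the cassandra-stress n=<ops> syntax is standard:

```python
import os

ROW_SIZE_BYTES = 1024  # assumed average row size (hypothetical)

def stress_command(dataset_gb: int) -> str:
    # Number of rows needed to reach the target dataset size.
    rows = dataset_gb * 1024**3 // ROW_SIZE_BYTES
    return f"cassandra-stress write cl=QUORUM n={rows} -rate threads=100"

# Example: default to the 1 TB dataset suggested in the grooming notes.
print(stress_command(int(os.environ.get("SCT_DATASET_GB", "1024"))))
```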

@karol-kokoszka @Michal-Leszczynski

@mikliapko mikliapko changed the title Increase the deafult batch size for restore Increase the default batch size for restore May 23, 2024
@mikliapko

@karol-kokoszka @Michal-Leszczynski

About cluster topology: what parameters would you like to be adjustable (number of nodes, multi-DC/single-DC, tablets?) and how critical is it?

> Backup size is a bit harder, as the current approach relies on stress commands hardcoded in the configuration file, where the number of DB entries is defined. As a workaround, we can introduce a couple of jobs for different amounts of data, with all the other requested parameters configurable.

One option could be to define several configuration files for different backup sizes, for example 500 GB, 1 TB, 2 TB, or anything else you'd like to have.
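
A minimal sketch of that idea, assuming a job parameter selects one of several prepared config files (all names and paths below are hypothetical):

```python
import os

# Hypothetical mapping from a backup-size job parameter to an SCT
# configuration file with matching stress commands.
CONFIGS = {
    "500GB": "test-cases/manager/restore-500gb.yaml",
    "1TB": "test-cases/manager/restore-1tb.yaml",
    "2TB": "test-cases/manager/restore-2tb.yaml",
}

config_file = CONFIGS[os.environ.get("SCT_BACKUP_SIZE", "1TB")]
print(config_file)
```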

@karol-kokoszka
Collaborator

@mikliapko The more we can configure, the better.
But it's not a blocker.

A multi-DC option would be good to have.
Number of nodes: not critical, but nice to have.
Tablets: yes, some enum like "VNODES, TABLETS, mixed" (sketched below).
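
A minimal sketch of how those knobs could be bundled as job parameters (illustrative only, not actual SCT code):

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical enum mirroring the "VNODES, TABLETS, mixed" suggestion.
class ReplicationMode(Enum):
    VNODES = "vnodes"
    TABLETS = "tablets"
    MIXED = "mixed"

# Hypothetical bundle of the topology knobs discussed above.
@dataclass
class TopologyParams:
    n_db_nodes: int = 6                                # nice to have
    multi_dc: bool = False                             # good to have
    tablets: ReplicationMode = ReplicationMode.VNODES  # requested enum
```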

@mikliapko

> Backup size is a bit harder, as the current approach relies on stress commands hardcoded in the configuration file, where the number of DB entries is defined. As a workaround, we can introduce a couple of jobs for different amounts of data, with all the other requested parameters configurable.

> One option could be to define several configuration files for different backup sizes, for example 500 GB, 1 TB, 2 TB, or anything else you'd like to have.

@karol-kokoszka What do you think about backup sizes?

@Michal-Leszczynski
Collaborator Author

> One option could be to define several configuration files for different backup sizes, for example 500 GB, 1 TB, 2 TB, or anything else you'd like to have.

Those backup sizes look good, but maybe add 5 TB as well.
