Increase the default batch size for restore #3809

Open
Michal-Leszczynski opened this issue Apr 17, 2024 · 15 comments

@Michal-Leszczynski
Collaborator

As mentioned in scylladb/scylladb#18234 (comment), using a batch size smaller than shard_cnt results in not utilizing all available shards on the node.

Although the current implementation allows for better control over the restore batch size (and therefore shard utilization), the default value (2) is definitely too low for a full-cluster restore. We should introduce a special batch size value (0) that translates to 2 * shard_cnt for each node and make it the default.
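
For illustration, a minimal sketch of the proposed defaulting rule, assuming 0 means "derive from the node's shard count" (the actual change would land in scylla-manager's Go code; the names below are hypothetical):

```python
# A minimal sketch of the proposed special value, not the actual
# scylla-manager implementation (which is written in Go).
def effective_batch_size(batch_size: int, shard_cnt: int) -> int:
    if batch_size == 0:
        # Proposed default: two batches per shard, so every shard
        # on the node has work queued.
        return 2 * shard_cnt
    return batch_size

assert effective_batch_size(0, 8) == 16  # special value derives from shards
assert effective_batch_size(2, 8) == 2   # an explicit value is kept as-is
```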

cc: @karol-kokoszka @tzach

@karol-kokoszka
Collaborator

Grooming notes:

We need an SCT test covering the restore process.
We want to use it as a benchmark for further changes/improvements: to compare different values of batch size (which is what this issue is about) and to compare the behavior when the number of parallel transfers is changed.

We want the job to be available in Jenkins.
We need to be able to choose the manager branch that the test runs against.
We need to be able to select the Scylla version.

The goal is to observe the metrics during the restore and measure the impact of different batch size and transfers values on the whole restore process.

It's enough to use just 1 TB of data on the cluster.


There is not much (if any) Scylla Manager development work required.
The effort will be mostly on testing and building the CI job.

The most important part is to collect the metrics from the SCT run. Make sure that this part of SCT is working.

@mikliapko To keep this issue on his plate and start working on it soon.

@Michal-Leszczynski we need to find the old restore SCT test and put a link.

@mikliapko

@karol-kokoszka @Michal-Leszczynski
Could you please elaborate a bit on the metrics we should collect from such runs?

From what I see, the current test for a 4 TB restore only reports:

  • backup run time;
  • restore run time.

Do we need anything else?

@Michal-Leszczynski
Collaborator Author

@mikliapko it would be good to get both SM and Scylla metrics.
The mentioned run didn't collect them automatically (not sure why), so I had to collect them manually by logging into the monitor node and copying the files. The manually collected metrics can be found here.

@mikliapko

mikliapko commented May 7, 2024

The requested job is ready:

I will add the pipeline in our Jenkins folder after the PR is merged.

@Michal-Leszczynski @karol-kokoszka Please check it out in Jenkins/Argus. Let me know if you want to have something else in this job.

@Michal-Leszczynski
Collaborator Author

@mikliapko I validated that the metrics are there. The only strange thing is that making schema changes visible in Grafana doesn't seem to work, but that's not a big deal. Since this job does not expose restore task params as Jenkins job params, we will still need to fork the repo, change the restore task params, and point the job config at the fork, but that's also manageable.

@mikliapko

mikliapko commented May 8, 2024

> @mikliapko I validated that the metrics are there. The only strange thing is that making schema changes visible in Grafana doesn't seem to work, but that's not a big deal. Since this job does not expose restore task params as Jenkins job params, we will still need to fork the repo, change the restore task params, and point the job config at the fork, but that's also manageable.

@Michal-Leszczynski
Please provide the list of restore parameters you want to be editable in the pipeline. I'll check how to integrate them and how much effort is needed.

@Michal-Leszczynski
Collaborator Author

I think that it would be useful to have:

  • keyspace
  • batch-size
  • parallel

But even then, if we want to change the backup size or cluster topology, we still need to fork the repo and change the YAML config files, so adding these params won't solve the whole problem, but it will be a step in a convenient direction (a sketch of wiring them follows below).
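
For illustration, a hedged sketch (not actual SCT code) of wiring these three parameters into an sctool restore invocation. The SCT_* environment variable names are hypothetical; --keyspace, --batch-size, and --parallel are existing sctool restore flags:

```python
import os

# Hypothetical wiring of Jenkins job params (surfaced here as environment
# variables) into the restore task. Required flags such as --cluster and
# --snapshot-tag are omitted for brevity.
def restore_cmd() -> list[str]:
    batch_size = os.environ.get("SCT_RESTORE_BATCH_SIZE", "2")
    parallel = os.environ.get("SCT_RESTORE_PARALLEL", "0")
    keyspace = os.environ.get("SCT_RESTORE_KEYSPACE", "")
    cmd = ["sctool", "restore", "--batch-size", batch_size, "--parallel", parallel]
    if keyspace:
        cmd += ["--keyspace", keyspace]
    return cmd
```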

cc: @karol-kokoszka

@karol-kokoszka
Collaborator

Maybe it would be good if the Jenkins job allowed specifying all the arguments for sctool restore?
Modifying the backup size and cluster topology would be a great option too; the question is, @mikliapko, how difficult do you think that would be?

@mikliapko

mikliapko commented May 8, 2024

@karol-kokoszka @Michal-Leszczynski

Yes, I fully agree about the convenience such a parameterized solution would provide us. I'll consider how to address all these points in SCT and will keep you both informed here.

@mikliapko

So, from my brief look at how it's done in SCT, we can make the keyspaces_num, batch_size, and parallel parameters configurable from the pipeline.

About cluster topology: if we're talking only about the number of DB nodes in a single-DC cluster, making this param configurable is not a big deal either.

Backup size is a bit harder, as the current approach relies on stress commands hardcoded in the configuration file, where the number of DB entries is defined. As a workaround, we can introduce a couple of jobs for different amounts of data, with all the other requested parameters configurable (see the sketch below).
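
For illustration, a hedged sketch of deriving the stress command from a configurable dataset size instead of a hardcoded one. SCT_DATASET_GB and the row-size assumption are hypothetical; only the cassandra-stress n=<ops> syntax is standard:

```python
import os

ROW_SIZE_BYTES = 1024  # assumed average row size (hypothetical)

def stress_command(dataset_gb: int) -> str:
    # Number of rows needed to reach the target dataset size.
    rows = dataset_gb * 1024**3 // ROW_SIZE_BYTES
    return f"cassandra-stress write cl=QUORUM n={rows} -rate threads=100"

# Example: default to the 1 TB dataset suggested in the grooming notes.
print(stress_command(int(os.environ.get("SCT_DATASET_GB", "1024"))))
```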

@karol-kokoszka @Michal-Leszczynski

@mikliapko mikliapko changed the title Increase the deafult batch size for restore Increase the default batch size for restore May 23, 2024
@mikliapko

@karol-kokoszka @Michal-Leszczynski

About cluster topology: what parameters would you like to be adjustable (number of nodes, multi-DC/single-DC, tablets?) and how critical is it?

> Backup size is a bit harder, as the current approach relies on stress commands hardcoded in the configuration file, where the number of DB entries is defined. As a workaround, we can introduce a couple of jobs for different amounts of data, with all the other requested parameters configurable.

One option could be to define several configuration files for different backup sizes, for example 500 GB, 1 TB, 2 TB, or anything else you'd like to have.
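
A minimal sketch of that idea, assuming a job parameter selects one of several prepared config files (all names and paths below are hypothetical):

```python
import os

# Hypothetical mapping from a backup-size job parameter to an SCT
# configuration file with matching stress commands.
CONFIGS = {
    "500GB": "test-cases/manager/restore-500gb.yaml",
    "1TB": "test-cases/manager/restore-1tb.yaml",
    "2TB": "test-cases/manager/restore-2tb.yaml",
}

config_file = CONFIGS[os.environ.get("SCT_BACKUP_SIZE", "1TB")]
print(config_file)
```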

@karol-kokoszka
Collaborator

@mikliapko The more we can configure, the better.
But it's not a blocker.

A multi-DC option would be good to have.
Number of nodes: not critical, but nice to have.
Tablets: yes, some enum like "VNODES, TABLETS, mixed" (sketched below).
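
A minimal sketch of how those knobs could be bundled as job parameters (illustrative only, not actual SCT code):

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical enum mirroring the "VNODES, TABLETS, mixed" suggestion.
class ReplicationMode(Enum):
    VNODES = "vnodes"
    TABLETS = "tablets"
    MIXED = "mixed"

# Hypothetical bundle of the topology knobs discussed above.
@dataclass
class TopologyParams:
    n_db_nodes: int = 6                                # nice to have
    multi_dc: bool = False                             # good to have
    tablets: ReplicationMode = ReplicationMode.VNODES  # requested enum
```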

@mikliapko

> Backup size is a bit harder, as the current approach relies on stress commands hardcoded in the configuration file, where the number of DB entries is defined. As a workaround, we can introduce a couple of jobs for different amounts of data, with all the other requested parameters configurable.

> One option could be to define several configuration files for different backup sizes, for example 500 GB, 1 TB, 2 TB, or anything else you'd like to have.

@karol-kokoszka What do you think about backup sizes?

@Michal-Leszczynski
Collaborator Author

> One option could be to define several configuration files for different backup sizes, for example 500 GB, 1 TB, 2 TB, or anything else you'd like to have.

Those backup sizes look good, but maybe add 5 TB as well.
