Add separate parallel/intensity control for tablets #3792
Also, it might be confusing for the user that the default value of |
We already have enough parameters for the parallel intensity. I think it is better to use the existing ones but adjust automatically internally for the tablet tables when scheduling. It would be a nightmare for users to figure out the exact meaning of --tablet-intensity, --tablet-parallel, --intensity and --parallel. E.g., for tablet tables, if --intensity is 1, we make sure SM sends 1 range per shard when scheduling. |
Since Tablets are new, we will likely need to update the default instantly for tablets in the field while keeping the vNode intensity as is (or vice versa). So, I suggest having different APIs for Tablets and vNodes as long as we have a mixed cluster with both. It does add complexity, but one does not have to set any of these values. The suggested default above makes sense. |
We already have enough parameters for the parallel intensity. I think it is better to use the existing ones but adjust automatically internally for the tablet tables when scheduling. It would be a nightmare for users to figure out the exact meaning of --tablet-intensity, --tablet-parallel, --intensity and --parallel. E.g., for tablet tables, if --intensity is 1, we make sure SM sends 1 range per shard when scheduling.
With --intensity 0, SM still sends max_repair_ranges_in_parallel ranges per shard, resulting in max_repair_ranges_in_parallel * number_of_shards ranges in total. We know which shard owns each range for tablets, so SM could select the correct ranges for each shard easily.
|
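A tiny worked Go sketch of the arithmetic above (the values are illustrative, not scylla's or SM's actual defaults):

```go
package main

import "fmt"

func main() {
	// Illustrative values only, not scylla's or SM's actual defaults.
	maxRepairRangesInParallel := 8
	numberOfShards := 4

	// With --intensity 0, SM keeps max_repair_ranges_in_parallel ranges in
	// flight per shard, so the total number of ranges in flight is:
	fmt.Println(maxRepairRangesInParallel * numberOfShards) // 32
}
```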
If we expose the new options, users will start to use them. It is confusing to set values differently for vnode and tablet tables. In the long term, scylla core is going to schedule and control the intensity internally once we have a built-in repair scheduler, which works better with tablet migration, e.g., no need to disable migration.
|
The reality is these are two different parameters with two different defaults.
+1 |
No. The defaults are the same for both vnode and tablet tables. The meaning of the intensity is also the same: it controls the number of ranges repaired per shard.
The default --intensity 1 still makes sense for tablets; there is no need to change it. The only difference is internal to SM, which sends one range per shard to scylla core (number_of_shards ranges in total), but this does not change the promise that one range per shard is repaired, which is exactly what we did for vnode tables previously.
|
@asias Just to clarify: Is
Also, can you elaborate on this? Does it mean that when choosing ranges for a repair job, SM should do it so that each shard owns |
Keeping in mind that for some time we'll have both tablets and vnodes in the same cluster, what does it actually mean to run with different intensity? How do we prevent one from hurting the other? |
This is correct. The requested range will be worked on in parallel by all shards. It does what --intensity 1 is supposed to control.
Not multiplied by max_repair_ranges_in_parallel. With --intensity 1, SM will find ranges for each shard and send one such range per shard. With tablets, each range will always be owned by one shard. E.g., there are two shards and 8 tablets (8 ranges) r1 to r8, where r1, r2, r3, r4 are owned by shard 0. With --intensity 1, SM will find that r1, r2, r3, r4 belong to shard 0 and the rest belong to shard 1. SM sends one range (out of r1 to r4) to shard 0 and one range (out of r5 to r8) to shard 1 independently. This ensures that, at any point in time, scylla will repair at most one range per shard. This is exactly what --intensity 1 does for vnode. Does this make sense now?
max_repair_ranges_in_parallel is a per-SHARD limit.
See the example above. |
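To make the example above concrete, here is a minimal Go sketch of the grouping idea (types and names are illustrative, not SM's actual code): ranges are grouped by their owning shard and at most one range per shard is picked at a time.

```go
package main

import "fmt"

// Hypothetical tablet range with a known owner shard; SM's real types differ.
type tabletRange struct {
	name  string
	shard int
}

// pickOnePerShard returns at most one range per shard, mirroring the promise
// of --intensity 1: at any point in time each shard repairs at most one range.
func pickOnePerShard(ranges []tabletRange) map[int]tabletRange {
	picked := make(map[int]tabletRange)
	for _, r := range ranges {
		if _, ok := picked[r.shard]; !ok {
			picked[r.shard] = r
		}
	}
	return picked
}

func main() {
	// Two shards, 8 tablets (ranges): r1-r4 owned by shard 0, r5-r8 by shard 1.
	ranges := []tabletRange{
		{"r1", 0}, {"r2", 0}, {"r3", 0}, {"r4", 0},
		{"r5", 1}, {"r6", 1}, {"r7", 1}, {"r8", 1},
	}
	for shard, r := range pickOnePerShard(ranges) {
		fmt.Printf("shard %d repairs %s\n", shard, r.name)
	}
}
```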
I am actually suggesting not to run different intensities for vnode and tablet tables. SM repairs table after table, so they will not hurt each other. |
@asias, after reading the integration-with-SM part of the tablet repair design doc, I wasn't aware that SM is supposed to calculate range ownership and treat intensity as owned ranges per shard. Right now it's not done, but it can be added. Do I understand correctly that it is important to support this? Also, is this range-to-shard ownership calculated based on Murmur3Partitioner arithmetic (like here)? |
But after this explanation, perhaps it's indeed ok to use the same intensity flag for both vnode and tablet tables (but, for tablet tables, send intensity owned ranges per shard). |
No, the tablet token-range-to-shard ownership does not use the vnode algorithms at all. We would need scylla core to expose this information if needed. However, I have an idea to avoid asking SM to be aware of this shard mapping, which simplifies the request logic on the SM side significantly. We could use the ranges_parallelism option we introduced to control the parallelism.
Without knowing which ranges belong to which shard, SM sends all ranges (or a big portion of them) that need repair per repair job, while specifying ranges_parallelism as 1. This will ensure all shards have work to do and each shard will only work on 1 range in parallel. This avoids asking SM to send only 1 token range per repair job in order to implement --intensity 1. |
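A rough Go sketch of this idea, assuming a repair request that accepts a ranges list and the ranges_parallelism option discussed above (parameter names are illustrative, not SM's actual client code):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// buildRepairQuery sends a large set of ranges in one repair job while capping
// per-shard work with ranges_parallelism=1, so every shard has work to do but
// repairs at most one range at a time. Parameter names are illustrative.
func buildRepairQuery(table string, ranges []string) url.Values {
	q := url.Values{}
	q.Set("columnFamilies", table)
	q.Set("ranges", strings.Join(ranges, ","))
	// SM does not need the tablet range-to-shard mapping to honour --intensity 1.
	q.Set("ranges_parallelism", "1")
	return q
}

func main() {
	q := buildRepairQuery("my_table", []string{"r1", "r2", "r3", "r4"})
	fmt.Println(q.Encode())
}
```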
Yes, the option was introduced exactly for vnode tables in the past. I suggested back then to use ranges_parallelism to implement range parallelism while also sending more ranges in a batch. |
@Michal-Leszczynski FYI. This PR adds ranges_parallelism for tablet. scylladb/scylladb#18385 |
@asias btw, can we expect small_table_optimization to be implemented for tablets in Scylla 6.0? |
No. It is not implemented. We do not need small table optimization for tablet tables.
|
@Michal-Leszczynski FYI, scylladb/scylladb#18385 is now in master. |
Always sending all (or a lot of) ranges belonging to a given replica set in a single repair job hurts task granularity. Is re-repairing a given token range as expensive as the first repair of that token range? Or is it way cheaper and can be ignored (because it just streams hashes and not actual data)? If re-repairing is usually as expensive as the first one, then we would need to somehow control the amount of ranges sent in a single repair job. I see 3 options:
@asias what do you think? |
It should be much cheaper because the difference between nodes is much smaller than it was. When there is no difference, repair only reads data and calculates a checksum. A granularity of 10% of ranges looks good enough, i.e., sending 10% of ranges in a single job. If the job with 10% of ranges fails, we retry it. It is not a big deal. Also, failed repairs are supposed to be rare compared to successful repairs.
It is best to avoid exposing the --batch-size config to users. We have enough options to control it.
If some 10% of ranges fail, we could continue to repair the next 10% of ranges. This prevents some of the ranges from blocking the other ranges indefinitely. We could also add some randomness to how we group them into 10% batches.
In the long run, we could provide an API to return the successfully repaired ranges. |
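A minimal Go sketch of the batching-with-randomness idea above (illustrative only, not SM code):

```go
package main

import (
	"fmt"
	"math/rand"
)

// makeBatches shuffles the ranges and splits them into batches of roughly the
// given percentage, so a failing batch does not keep blocking the exact same
// set of ranges on every attempt.
func makeBatches(ranges []string, percent int) [][]string {
	shuffled := append([]string(nil), ranges...)
	rand.Shuffle(len(shuffled), func(i, j int) { shuffled[i], shuffled[j] = shuffled[j], shuffled[i] })
	size := len(shuffled) * percent / 100
	if size < 1 {
		size = 1
	}
	var batches [][]string
	for start := 0; start < len(shuffled); start += size {
		end := start + size
		if end > len(shuffled) {
			end = len(shuffled)
		}
		batches = append(batches, shuffled[start:end])
	}
	return batches
}

func main() {
	ranges := []string{"r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10"}
	for i, batch := range makeBatches(ranges, 10) {
		fmt.Println("batch", i, batch)
	}
}
```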
I agree.
By default, repair does not end task execution on the first encountered error, but just goes through all ranges, repairs them, and then retries any failed jobs after some backoff. So when I wrote the following:
I meant that perhaps SM should drop this 10% batching (return to the current behavior) when retrying failed ranges after the backoff. So all ranges will be first repaired with 10% batching, and only in case of an error will they be retried without batching. The same could be applied to the situation when the repair task needs to be interrupted because of going outside of its work window.
There is some slight possibility that applying this 10% batching could result in SM not being able to make progress between work windows. So this approach (dropping batching on retry) would be a general safety valve for this implementation that could perhaps be removed in the future. |
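A Go sketch of this fallback (illustrative only, not SM code): repair in batches first, then retry any failed ranges one at a time.

```go
package main

import (
	"context"
	"fmt"
)

// repairWithFallback repairs ranges batch by batch; ranges from a failed batch
// are collected and retried one at a time, so a single bad range does not force
// the whole batch to be re-repaired over and over.
func repairWithFallback(ctx context.Context, batches [][]string, repair func(context.Context, []string) error) error {
	var failed []string
	for _, batch := range batches {
		if err := repair(ctx, batch); err != nil {
			failed = append(failed, batch...)
		}
	}
	// Retry pass: drop batching, one range per repair job.
	for _, r := range failed {
		if err := repair(ctx, []string{r}); err != nil {
			return fmt.Errorf("range %s failed after retry: %w", r, err)
		}
	}
	return nil
}

func main() {
	batches := [][]string{{"r1", "r2"}, {"r3", "r4"}}
	err := repairWithFallback(context.Background(), batches, func(_ context.Context, rs []string) error {
		fmt.Println("repairing", rs)
		return nil
	})
	fmt.Println("done, err:", err)
}
```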
@Michal-Leszczynski I labelled this issue so that we can go through it on Monday's grooming. Let's sum up your discussion with @asias there and prepare the summary for Wednesday's planning and triage it with expected SM version. |
GROOMING NOTES
The initial idea was to introduce completely new flags to control intensity and parallelism for tablet-based tables. @asias suggested that we should use the same flag as we do for regular tables.
Intensity = 1 means that only one range is sent in a single repair job, and the job is going to be handled by all the shards. For tablet-based repair, ranges are mapped to shards, which means that a single range will be repaired by the "owner" shard only. Scylla Manager (SM) doesn't know the mapping and is not aware of which shard owns which ranges. This makes it impossible to find the correct set of ranges to be sent in a single repair job. Due to the facts described above, the proposed solution includes changing the internal meaning of the intensity flag for tablet-based tables.
This is interpreted by SM code as "how many token ranges to send in a single repair job" (see scylla-manager/swagger/scylla_v1.json, lines 10407 to 10412 at 4d169fe).
By implementing this, we improve the CPU/shard utilization during the repair process, which should lead to improved performance. The following design document must be updated to include the chosen approach: https://docs.google.com/document/d/1mBaufbomXU6hDO_25KkhC7CO65AbFgKpHVpB5Mei8xA/edit#heading=h.o646d4880ftd
The proposal from @Michal-Leszczynski is to create a fallback for the approach with 10% of ranges sent in a single repair job. If one range of these 10% fails, then SM would need to repair all these ranges once again. To avoid a loop of retried repairs, we should use 10% only in the first repair job execution; failover should proceed with the old approach (intensity = number of ranges in the repair job).
To evaluate if the given approach is correct, we need to have a Scylla Cluster Test (SCT) allowing us to execute the repair on a large cluster (1TB?) and compare the metrics. (cc: @mikliapko)
cc: @mykaul @tzach @vladzcloudius -> We need to decide if we want to change the repair now and include @asias's suggestions, or if we agree to underutilize the nodes during the repair and send the intensity number of token ranges in a single repair job, maintaining the 3.2 approach. Let's bring this to the planning. |
Retry after failure and resume after pause with less than 10% of ranges per job sounds reasonable. It is possible that 10% of ranges would take more than the specified window. However, it is also possible that even a single range would exceed the window too, which should be rare. Consider: 10% of ranges -> failure -> 1 range at a time until all ranges in the first 10% are repaired. How about the next 10% of ranges: would we use 10% or 1 range at a time? |
I am fine with either sending some percentage of ranges per job (+ retry with 1 job at a time) or sending one job at a time (under-utilizing the cluster for repair) for tablets. But let's not add new options for tablet intensity control. Another point is that I am already working on the in-core automatic repair scheduler that schedules repair jobs inside scylla core, which could cope better with tablet migration and topology changes. |
The following issue is expected to help with measuring/validating the upgraded repair on tablets #3853 |
As per tablet repair design, SM should introduce separate flags for controlling repair parallelism/intensity for tablet tables. Those new flags should also be supported by the sctool repair control command.

As per the mentioned doc, the default for --tablet-intensity should be the amount of shards (even despite max_repair_ranges_in_parallel). I'm not aware of any reasonable limit for this value, so I suppose that SM won't cap it at any point.

I'm wondering whether it makes sense to introduce a --tablet-parallel flag (@tgrabiec please advise). If there is a need for this flag, then I believe that it should behave in the same way as the --parallel flag. This means that it has the same default (1) and the same cap.

cc: @karol-kokoszka @tzach @tgrabiec @asias