BP-65: implement load balance for select bookie #4247

Open · TakaHiR07 opened this issue Mar 27, 2024 · 2 comments

TakaHiR07 (Contributor) commented Mar 27, 2024

BP

This is the master ticket for tracking BP-65:
Proposal PR - #4246

Motivation

One of our clusters has 255 bookies, and we find that the write pressure across bookies is very unbalanced.

Usually several bookies have excessively high write latency, which in turn causes high message publish latency in the Pulsar broker.

Currently, BookKeeper has a quarantine mechanism to deal with this case:

  • When a broker selects an ensemble for a ledger, it uses RackAwarePlacement with a random selection strategy to decide which bookies should be written to.
  • If writes to a bookie fail n times, the bookie is quarantined by the broker.
  • To avoid too many brokers quarantining a bookie at the same time, quarantineRatio can be configured so that quarantine is applied randomly.

However, this mechanism is not good enough to avoid high bookie write latency:

  1. Since we select the bookie ensemble randomly, the random strategy can hardly achieve good balance unless there are many bookies or many ledgers to choose from.
  2. It is hard to define an appropriate write-error count n and quarantineRatio, because it is hard to map bookie write pressure onto these configs.
  3. It is hard to define how long to quarantine a bookie, because we do not know when the bookie's write pressure has dropped. Currently we can only set quarantineTime by config.
  4. We can only quarantine a bookie after it already has write pressure; we cannot prevent the write-pressure problem in advance.

Proposed Changes

To solve this write-pressure problem, we propose to introduce a bookie load balance mechanism, which supplements the current quarantine mechanism.
When we choose an ensemble for a ledger, if we have load information for all bookies, we can prefer low-load bookies for the ensemble.
We can also avoid writing to high-load bookies, which would otherwise perform even worse and cause high write latency.
Therefore, with bookie load information, we can better prevent the high-write-latency problem from occurring.

We notice that BookKeeper already has a DiskWeightBasedPlacement mechanism, which is similar to load balancing. We just need to enhance this mechanism and
replace it with LoadWeightBasedPlacement.

The proposed changes involve:

  1. Implement a BaseMetricMonitor in the bookie server, which collects bookie load information periodically.
  2. The bookie client continues to use the getBookieInfo API to acquire load information from each bookie.
  3. Modify the implementation of RackawareEnsemblePlacementPolicyImpl to support selecting ensembles by LoadWeightBasedPlacement. Since LoadWeightBasedPlacement is an enhancement of DiskWeightBasedPlacement, it replaces DiskWeightBasedPlacement when the feature is enabled.

BaseMetricMonitor

BaseMetricMonitor collects multiple load metrics: journal IOUtil, ledger IOUtil, bookie write bytes per second, CPU usage, free disk space, and total disk space.
We can then define the bookie load pressure from these metrics.
In our cluster, bookie load pressure is mainly influenced by journal IOUtil, because we use HDDs as journal disks and one journal disk serves three bookies.

BaseMetricMonitor collects the metrics every second by default. However, we find that some metrics jitter a lot,
so it is necessary to smooth the collected metrics by averaging them over a 3-second window.
This window size can be changed via the config baseMetricMonitorMetricSlideWindowSize.

If one bookie contains multiple disks, we use the average value across the disks. A sketch of the smoothing is shown below.
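As a rough sketch (the class and field names here are assumptions for illustration, not the actual BaseMetricMonitor code), the per-metric smoothing could look like this in Java:

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sliding-window smoother for one load metric. The real
// BaseMetricMonitor would keep one window per metric (journal IOUtil,
// ledger IOUtil, CPU usage, write bytes per second, ...).
class SlidingWindowAverage {
    private final int windowSize; // e.g. baseMetricMonitorMetricSlideWindowSize = 3
    private final Deque<Double> samples = new ArrayDeque<>();
    private double sum = 0.0;

    SlidingWindowAverage(int windowSize) {
        this.windowSize = windowSize;
    }

    // Record one per-second sample, evicting the oldest once the window is full.
    synchronized void add(double sample) {
        samples.addLast(sample);
        sum += sample;
        if (samples.size() > windowSize) {
            sum -= samples.removeFirst();
        }
    }

    // Smoothed value: the mean of the samples currently in the window; -1 if none yet.
    synchronized double smoothedValue() {
        return samples.isEmpty() ? -1 : sum / samples.size();
    }
}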

modification of the getBookieInfo API

The bookie client continues to use the getBookieInfo API to acquire load information from each bookie.
That means if we enable LoadWeightBasedPlacement, the response contains more information.

  • Previous: GetBookieInfoResponse contains totalDiskCapacity and freeDiskSpace
message GetBookieInfoResponse {
    required StatusCode status = 1;
    optional int64 totalDiskCapacity = 2;
    optional int64 freeDiskSpace = 3;
}
  • Current: GetBookieInfoResponse contains totalDiskCapacity, freeDiskSpace, and the additional load information
message GetBookieInfoResponse {
    required StatusCode status = 1;
    optional int64 totalDiskCapacity = 2;
    optional int64 freeDiskSpace = 3;
    optional int32 journalIoUtil = 4;
    optional int32 ledgerIoUtil = 5;
    optional int32 cpuUsedRate = 6;
    optional int64 writeBytePerSecond = 7;
}

GetBookieInfoResponse in BookkeeperProtocol would be changed accordingly.
If LoadWeightBasedPlacement is disabled, or the getBookieInfo call fails because of a timeout or an exception, the load information is set to -1.
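As a minimal sketch (the helper name is assumed, not from the proposal), a client could normalize a missing field to the -1 sentinel using the hasX()/getX() accessors that proto2 generates for optional fields:

// Hypothetical helper; GetBookieInfoResponse is the proto2-generated class,
// so every optional field comes with hasX()/getX() accessors.
static int journalIoUtilOrUnknown(GetBookieInfoResponse resp) {
    // -1 is the sentinel for "load info unavailable" (feature disabled,
    // timeout, or exception), matching the behavior described above.
    return resp.hasJournalIoUtil() ? resp.getJournalIoUtil() : -1;
}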

We have also tested the pressure this API puts on the cluster. For our cluster, with more than 20 brokers and more than 200 bookies,
the pressure from the API is still acceptable.

modification of RackawareEnsemblePlacementPolicyImpl

Implement a new strategy to select bookies for LoadWeightBasedPlacement.

The targets are:

  • Make the write pressure more balanced across all bookies. (Usually, the higher the throughput, the higher the write pressure.)
  • Prevent individual bookies from hitting the high-write-latency problem. (Usually, the higher the IOUtil, the higher the write latency.)

So the designed strategy is:

  • Use writeBytesPerSecond as the load weight, and use roulette wheel selection, deriving each bookie's selection probability from its load weight: the higher the load weight, the lower the probability of the bookie being selected.
  • Filter out high-load bookies whose journal IOUtil is above a threshold. If a bookie is already under high load, we should not keep writing entries to it; once its IOUtil drops, the bookie becomes writable again.

To handle the corner case where too many bookies are filtered out, we add the config lowLoadBookieRatio. By default, if more than half of the bookies are filtered out, we fall back to random selection.

Note that many bookie clients perform ensemble selection independently, so the selection probability of each bookie should not differ too much, or it would cause a write skew problem.
Therefore we apply a probability smoothing operation in the roulette wheel selection. A sketch of the whole selection is shown below.
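Below is a minimal sketch of such a selection. All class, field, and parameter names, the additive smoothing constant, and the fallback details are assumptions for illustration, not the actual LoadWeightBasedPlacement code:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative load-weighted roulette wheel selection; assumes candidates is non-empty.
class LoadWeightedSelector {
    static class Bookie {
        final String id;
        final long writeBytesPerSecond; // load weight
        final int journalIoUtil;        // 0..100, or -1 when unknown
        Bookie(String id, long wbps, int ioUtil) {
            this.id = id; this.writeBytesPerSecond = wbps; this.journalIoUtil = ioUtil;
        }
    }

    static Bookie select(List<Bookie> candidates, int ioUtilThreshold, double lowLoadBookieRatio) {
        // 1. Filter out high-load bookies (journal IOUtil above the threshold).
        List<Bookie> lowLoad = new ArrayList<>();
        for (Bookie b : candidates) {
            if (b.journalIoUtil >= 0 && b.journalIoUtil <= ioUtilThreshold) {
                lowLoad.add(b);
            }
        }
        // 2. If too few bookies survive the filter, fall back to random selection.
        if (lowLoad.size() < candidates.size() * lowLoadBookieRatio) {
            return candidates.get(ThreadLocalRandom.current().nextInt(candidates.size()));
        }
        // 3. Roulette wheel: weight each bookie inversely to its load, with an
        //    additive smoothing term so that probabilities do not differ too
        //    sharply across bookies (many clients select independently).
        long maxLoad = 1;
        for (Bookie b : lowLoad) {
            maxLoad = Math.max(maxLoad, b.writeBytesPerSecond);
        }
        double[] weights = new double[lowLoad.size()];
        double total = 0;
        for (int i = 0; i < lowLoad.size(); i++) {
            // Smoothing: every bookie keeps a baseline share of the wheel.
            weights[i] = (maxLoad - lowLoad.get(i).writeBytesPerSecond) + maxLoad * 0.5;
            total += weights[i];
        }
        // Spin the wheel: pick the bookie whose weight interval contains r.
        double r = ThreadLocalRandom.current().nextDouble(total);
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r < 0) {
                return lowLoad.get(i);
            }
        }
        return lowLoad.get(lowLoad.size() - 1); // floating-point edge case
    }
}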

Furthermore, different users can implement their own selection strategy based on their production environment.

compatibility

LoadWeightBasedPlacement is a supplemental feature: ledger replica placement must follow the RackAwarePolicy/RegionAwarePolicy first,
and only then try to follow LoadWeightBasedPlacement. The feature can be disabled by configuration.

Because the GetBookieInfo protocol has been changed, this API could return errors when the bookie server and bookie client run different versions. But since this API is used only when diskWeightBasedPlacement is enabled, this should be no problem for most users, who do not enable diskWeightBasedPlacement in the client.

Performance

We have applied LoadWeightBasedPlacement to our production clusters, and the high-write-latency problem no longer happens.

(two screenshots omitted)

thetumbled (Contributor) commented:

Here is some data collected from the production environment after we launched this feature.
A direct way to evaluate the performance of a load balancing algorithm is to look at the indicators we care about, specifically P99 write latency. In addition, we care about the standard deviation of journal disk write throughput, the range of journal disk write throughput, and the top 10 journal disk write throughput.
Let's look at the comparison of these indicators one by one (launched on 1.10):

  • P99 write latency
    (chart omitted)

It can be seen that the P99 write latency has significantly decreased, from peaks of ten to twenty seconds down to less than one second.

  • the standard deviation of journal disk write throughput
    (chart omitted)

The peak standard deviation of journal disk write throughput has decreased from 25MB/s to 21MB/s.

  • the range of journal disk write throughput
    (chart omitted)

The peak write throughput range has decreased from 150MB/s to 130MB/s.

  • Journal disk write throughput top 10
    (chart omitted)

There is also a significant decrease in the top 10 journal disk write throughput.

Could you help to review this BP? @shoothzj @wenbingshen @eolivelli @dlg99 @jiazhai

eolivelli (Contributor) commented:

This is a great idea +1
