This is the master ticket for tracking BP-65:
Proposal PR - #4246
Motivation
One of our clusters has 255 bookies, and we find that write pressure across bookies is very unbalanced.
Usually there are several bookies whose write latency is too high, which also drives up message publish latency in the Pulsar broker.
Currently, the bookie has a quarantine mechanism to deal with this case:
When the broker selects an ensemble for a ledger, it uses RackAwarePlacement with a random selection strategy to decide which bookies should be written to.
If writes to a bookie fail n times, the bookie is quarantined in the broker.
To avoid too many brokers quarantining a bookie at the same time, a quarantineRatio can be defined so that quarantine happens randomly.
However, this mechanism is not good enough to avoid high bookie write latency:
Since we select the bookie ensemble randomly, the random strategy struggles to achieve good balance when there are not many bookies or ledgers to select from.
It is hard to define an appropriate write-error count n and quarantineRatio, because it is hard to map bookie write pressure to these configs.
It is hard to define how long to quarantine a bookie, because we don't know when the bookie's write pressure has dropped. Currently the quarantineTime can only be set by config.
We can only quarantine a bookie once it is already under write pressure; we cannot prevent the write-pressure problem in advance.
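To make the quarantine behavior above concrete, here is a minimal sketch of a ratio-based quarantine decision. The class and method names are illustrative, not the actual BookKeeper implementation; it only shows the idea that after n write errors a bookie is quarantined with probability quarantineRatio, so not every client quarantines the same bookie at once.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of the quarantine decision described above.
public class QuarantineDecision {
    private final int errorThreshold;      // the "n" write errors
    private final double quarantineRatio;  // probability of actually quarantining

    public QuarantineDecision(int errorThreshold, double quarantineRatio) {
        this.errorThreshold = errorThreshold;
        this.quarantineRatio = quarantineRatio;
    }

    public boolean shouldQuarantine(int observedWriteErrors) {
        if (observedWriteErrors < errorThreshold) {
            return false;  // not enough failures yet
        }
        // Randomize so only a fraction of clients quarantine at the same time.
        return ThreadLocalRandom.current().nextDouble() < quarantineRatio;
    }
}
```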
Proposed Changes
To solve this write-pressure problem, we propose to introduce a bookie load-balance mechanism as a supplement to the current quarantine mechanism.
When choosing an ensemble for a ledger, if we have load information for all bookies, we can prefer low-load bookies for the ensemble,
and avoid writing to high-load bookies, which would make them perform worse and cause high write latency.
Therefore, with bookie load information, we can better prevent the high-write-latency problem from occurring.
We notice that the bookie already has a DiskWeightBasedPlacement mechanism, which is similar to load balancing. We just need to enhance this mechanism
and replace it with LoadWeightBasedPlacement.
The proposed changes involve:
- Implement a BaseMetricMonitor in the bookie server, which collects bookie load information periodically.
- The bookie client continues to use the getBookieInfo REST API to acquire load information from each bookie.
- Modify the implementation of RackawareEnsemblePlacementPolicyImpl to support selecting ensembles by LoadWeightBasedPlacement. Since LoadWeightBasedPlacement is an enhancement of DiskWeightBasedPlacement, it takes over from DiskWeightBasedPlacement when the feature is enabled.
BaseMetricMonitor
BaseMetricMonitor collects multiple load metrics, including journal IOUtil, ledger IOUtil, bookie write bytes per second, CPU usage, free disk space, and total disk space.
We can then define the bookie load pressure from these metrics.
For our cluster, bookie load pressure is mainly influenced by journal IOUtil, because we use HDDs as journal disks and one journal disk serves three bookies.
BaseMetricMonitor collects the metrics every second by default. However, we found that some metrics jitter a lot,
so it is necessary to smooth the collected metrics by averaging them over a 3-second window.
This 3-second window can be changed via the config baseMetricMonitorMetricSlideWindowSize.
If one bookie contains multiple disks, we calculate the average value across disks.
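The smoothing described above can be sketched as a simple sliding-window average. The class name is illustrative; the window size (default 3 samples, one per second) would come from the proposed baseMetricMonitorMetricSlideWindowSize config.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the metric smoothing: keep the last N samples and report their
// average; -1 signals "no data yet", matching the sentinel used elsewhere.
public class SlidingWindowAverage {
    private final int windowSize;
    private final Deque<Double> samples = new ArrayDeque<>();
    private double sum = 0.0;

    public SlidingWindowAverage(int windowSize) {
        this.windowSize = windowSize;
    }

    public synchronized void add(double sample) {
        samples.addLast(sample);
        sum += sample;
        if (samples.size() > windowSize) {
            sum -= samples.removeFirst();  // evict the oldest sample
        }
    }

    public synchronized double average() {
        return samples.isEmpty() ? -1.0 : sum / samples.size();
    }
}
```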
Modification of the getBookieInfo REST API
The bookie client continues to use the getBookieInfo REST API to acquire load information from each bookie.
This means that if LoadWeightBasedPlacement is enabled, the API response contains more information.
Previously, getBookieInfo contained totalDiskUsage and freeDiskUsage.
GetBookieInfoResponse in BookkeeperProtocol would be changed accordingly.
If LoadWeightBasedPlacement is disabled, or the API call fails with a timeout or an exception, the load information is set to -1.
We have also tested the pressure this API puts on the cluster: for our cluster, with more than 20 brokers and more than 200 bookies,
the pressure from the API is still acceptable.
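The extended response shape and the -1 fallback could look roughly like the sketch below. The field and method names are assumptions for illustration, not the actual GetBookieInfoResponse schema.

```java
// Illustrative shape of the extended bookie info: the existing disk fields
// plus load fields, with -1 as the sentinel when LoadWeightBasedPlacement
// is disabled or the request times out / throws.
public class BookieLoadInfo {
    public long totalDiskSpace;
    public long freeDiskSpace;
    public double journalIoUtil = -1;        // -1 => unknown or feature disabled
    public double writeBytesPerSecond = -1;  // -1 => unknown or feature disabled

    // On timeout or exception, keep the disk fields but mark load unknown.
    public static BookieLoadInfo unavailable(long total, long free) {
        BookieLoadInfo info = new BookieLoadInfo();
        info.totalDiskSpace = total;
        info.freeDiskSpace = free;
        return info;
    }

    public boolean hasLoadInfo() {
        return journalIoUtil >= 0 && writeBytesPerSecond >= 0;
    }
}
```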
Modification of RackawareEnsemblePlacementPolicyImpl
Implement a new strategy to select bookies for LoadWeightBasedPlacement.
The targets are:
- Make write pressure more balanced across all bookies. (Usually, the more throughput, the higher the write pressure.)
- Avoid high-write-latency problems on individual bookies. (Usually, the higher the IOUtil, the higher the write latency.)
So the designed strategy is:
- Use writeBytesPerSecond as the load weight, and use roulette wheel selection, defining each bookie's selection probability by its load weight: the higher the load weight, the lower the probability that the bookie is selected.
- Filter out high-load bookies whose journal IOUtil is above a threshold. If a bookie is already under high load, we should not continue to write entries to it. Once its IOUtil decreases, the bookie becomes writable again.
To avoid the corner case where too many bookies are filtered out, we add the config lowLoadBookieRatio. By default, if more than half of the bookies are filtered, we fall back to random selection.
Note that many bookie clients perform ensemble selection independently, so the per-bookie selection probabilities should not differ too much, or writes would become skewed.
We therefore apply a probability-smoothing operation in the roulette wheel selection.
Furthermore, users can implement their own selection strategy based on their production environment.
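Putting the strategy above together, here is a minimal sketch of one selection step (illustrative names, not the actual RackawareEnsemblePlacementPolicyImpl code, and without the probability smoothing): filter by journal IOUtil, fall back to uniform random choice when too few bookies pass (lowLoadBookieRatio), and otherwise spin a roulette wheel where higher writeBytesPerSecond means lower weight.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical single-bookie selection for LoadWeightBasedPlacement.
public class LoadWeightedSelector {

    // bookies maps bookie id -> {journalIoUtil, writeBytesPerSecond}.
    public static String selectBookie(Map<String, double[]> bookies,
                                      double ioUtilThreshold,
                                      double lowLoadBookieRatio) {
        // 1. Filter out high-load bookies by journal IOUtil.
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, double[]> e : bookies.entrySet()) {
            if (e.getValue()[0] <= ioUtilThreshold) {
                candidates.add(e.getKey());
            }
        }
        // 2. Corner case: too many bookies filtered -> plain random selection.
        if (candidates.size() < bookies.size() * lowLoadBookieRatio) {
            List<String> all = new ArrayList<>(bookies.keySet());
            return all.get(ThreadLocalRandom.current().nextInt(all.size()));
        }
        // 3. Roulette wheel: weight each candidate by the inverse of its
        // write throughput, so heavily written bookies are chosen less often.
        double[] weights = new double[candidates.size()];
        double total = 0;
        for (int i = 0; i < candidates.size(); i++) {
            double writeBps = bookies.get(candidates.get(i))[1];
            weights[i] = 1.0 / (1.0 + writeBps);
            total += weights[i];
        }
        double spin = ThreadLocalRandom.current().nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            spin -= weights[i];
            if (spin <= 0) {
                return candidates.get(i);
            }
        }
        return candidates.get(candidates.size() - 1);
    }
}
```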
Compatibility
LoadWeightBasedPlacement is a supplemental feature: ledger replication must obey RackAwarePolicy/RegionAwarePolicy first,
and only then try to follow LoadWeightBasedPlacement. The feature can be disabled by configuration.
Because the GetBookieInfo protocol has been changed, this API returns an error when the bookie server and the bookie client are on different versions. But since this API is used only when diskWeightBasedPlacement is enabled, this should be no problem for most users, who do not enable diskWeightBasedPlacement in the client.
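As a sketch, the knobs discussed in this proposal could surface in the configuration as below. Only baseMetricMonitorMetricSlideWindowSize and lowLoadBookieRatio are named in this BP; the enable-flag name is an assumption for illustration.

```properties
# Enable/disable the feature (flag name is an assumption, not from this BP).
# enableLoadWeightBasedPlacement=true

# Sliding window (in seconds) used to smooth collected metrics; default 3.
baseMetricMonitorMetricSlideWindowSize=3

# Fall back to random selection when fewer than this fraction of bookies
# pass the journal-IOUtil filter; default: half.
lowLoadBookieRatio=0.5
```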
Performance
We have applied LoadWeightBasedPlacement to our production clusters, and the high-write-latency problem no longer occurs.
Here is some data collected from the production environment after we launched this feature.
A direct way to evaluate the performance of a load-balancing algorithm is to look at the indicators we care about, specifically P99 write latency. In addition, we look at the standard deviation of journal disk write throughput, the range of journal disk write throughput, and the top-10 journal disk write throughput.
Let's look at the comparison of these indicators one by one (launched on 1.10):
P99 write latency
It can be seen that the P99 write latency has significantly decreased, from peaks of ten to twenty seconds down to less than one second.
Standard deviation of journal disk write throughput
The peak standard deviation of journal disk write throughput has decreased from 25MB/s to 21MB/s.
Range of journal disk write throughput
The peak write throughput range has decreased from 150MB/s to 130MB/s.
Journal disk write throughput top 10
There is also a significant decrease in the top 10 journal disk write throughput.