Active Rebalancing #8877

ledjon-behluli · 2024-02-23T22:49:28Z

Intro

This PR adds "Active Rebalancing", a mechanism that automatically and dynamically migrates grain activations using the locality-aware partitioning algorithm described in section 4 of this paper: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/06/eurosys16loca_camera_ready-1.pdf

Implementation details are documented extensively in comments in the code itself; below we discuss the overarching points and decisions.

Components

`ActiveRebalancerGrain`

This is the main component, which determines both when and how rebalancing occurs. It was meant to be implemented as a GrainService, but a GrainService is a SystemTarget, and those are always reentrant by default, which does not suit this functionality. Therefore this component is a normal grain which is activated locally on each silo at startup via the ActiveRebalancerGateway.

This grain is called periodically by an internal timer, based on the time spans set via the ActiveRebalancingOptions.
We force the timer callback to be non-reentrant by invoking TriggerExchangeRequest through a reference to the grain itself. This stops further edge (communication link) recordings until the protocol finishes, whether it was triggered by this grain or by a rebalancer grain on another silo. The goal is to avoid changing the counters while the protocol is running; this is for logical reasons rather than thread safety, since everything runs under the grain's TaskScheduler.

The rebalancer grain has a mechanism which breaks out of potential deadlocks. A deadlock could happen when a rebalancer grain is performing an AcceptExchangeRequest (the counterpart of TriggerExchangeRequest) with another silo while that same silo is performing the same operation with this grain.

In addition to breaking out of the deadlock, when that happens the timer of the broken-out silo is shifted slightly from the randomly picked due time, to further help avoid these "mutual exchange attempts".

`ActiveRebalancerGateway`

This component sits between the networking layer and the rebalancer grain. It is itself a lifecycle participant and is responsible for 2 main operations:

Quickly filtering out any unwanted messages that come from the networking layer. Usually these are system messages, but there are other cases as well, which are explained in the code comments.
Converting messages of interest into another format which the rebalancer understands and forwarding them into a BoundedChannel. The channel is configured for maximum parallelism and drops the oldest messages when new ones arrive while it is full (it is currently configured to hold up to 100_000 items, but I am open to changing that). The channel is not strictly needed, but since this is an extremely hot path, any optimization is good to have. Dropping messages is fine in our case, since the recorded communication edges are filtered by a probabilistic data structure anyway.
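To illustrate the drop-oldest behaviour described above, here is a minimal sketch using `System.Threading.Channels`. The tiny capacity and the `SingleWriter`/`SingleReader` settings are assumptions made for the demo; the gateway's actual configuration lives in the PR code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;

// Tiny capacity for the demo only; the PR description mentions 100_000.
const int Capacity = 3;

var channel = Channel.CreateBounded<int>(new BoundedChannelOptions(Capacity)
{
    // When the channel is full, evict the oldest queued item to make room.
    FullMode = BoundedChannelFullMode.DropOldest,
    // Assumption: many producers write, a single loop drains.
    SingleWriter = false,
    SingleReader = true,
});

// TryWrite never blocks; with DropOldest it succeeds even when the channel is full.
for (var i = 1; i <= 5; i++)
{
    channel.Writer.TryWrite(i);
}

var drained = new List<int>();
while (channel.Reader.TryRead(out var item))
{
    drained.Add(item);
}

Console.WriteLine(string.Join(", ", drained)); // 3, 4, 5
```

Items 1 and 2 are silently discarded, which is acceptable here because the downstream frequency counters are approximate anyway.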

`ActiveRebalancerExtension`

A GrainExtension which is used to decouple some cross-cutting concerns from the ActiveRebalancerGrain so that the grain stays focused on its main logic. More information about these concerns is given in the code comments. In addition, it is used to swap out the current frequency counters with empty ones, which is used in automated tests.

Note that the extension is called upon starting the rebalancer gateway.

`FrequencySink`

Due to the enormous number of messages anticipated to flow through the system, any attempt to record all of them is destined to fail, for 2 reasons:

Storing that data in memory would quickly exhaust the available memory.
Solving a graph partitioning problem with so many vertices (represented by the communication edges) would take a very long time.

The FrequencySink is a component which implements a modified version of the space-saving algorithm described in section 3.1 of this paper: https://www.cse.ust.hk/~raywong/comp5331/References/EfficientComputationOfFrequentAndTop-kElementsInDataStreams.pdf

In addition to the algorithm itself, a further optimization in the form of an updatable min-heap has been employed to minimize the time spent when the data structure is full and the lowest-valued counters need to be dropped and replaced with new incoming data. While the heap structure has been modified heavily to suit our specific needs, a lot of credit must be given to: https://github.com/DesignEngrLab/TVGL/blob/master/TessellationAndVoxelizationGeometryLibrary/Miscellaneous%20Functions/UpdatablePriorityQueue.cs
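For readers unfamiliar with space-saving, the core idea can be sketched in a few lines. This is the textbook form of the algorithm, not the modified version in the PR: the `Record` helper and the edge keys are hypothetical, and a linear scan stands in for the updatable min-heap.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Records one observation of `key`, keeping at most `capacity` counters in `table`.
void Record(Dictionary<string, long> table, int capacity, string key)
{
    if (table.TryGetValue(key, out var count))
    {
        table[key] = count + 1;   // already monitored: increment
    }
    else if (table.Count < capacity)
    {
        table[key] = 1;           // free slot: start monitoring
    }
    else
    {
        // Full: evict the minimum counter; the newcomer inherits its count + 1.
        // A linear scan keeps the sketch short; a min-heap avoids this O(n) step.
        var min = table.MinBy(kv => kv.Value);
        table.Remove(min.Key);
        table[key] = min.Value + 1;
    }
}

var counters = new Dictionary<string, long>();
foreach (var edge in new[] { "A->B", "A->B", "C->D", "E->F" })
{
    Record(counters, 2, edge);
}

foreach (var (key, value) in counters)
{
    Console.WriteLine($"{key}: {value}");
}
```

With a capacity of 2, the rarely seen `C->D` edge is evicted when `E->F` arrives, and `E->F` starts at the evicted count plus one. This overestimation is what bounds the error of the algorithm while keeping memory fixed.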

`DefaultImbalanceRule`

This is the default implementation of the IImbalanceToleranceRule interface, which is publicly available and is meant to be implemented by users to fit their specific needs.

Before we elaborate on the default rule, we should point out that IImbalanceToleranceRule represents a rule that controls the degree of imbalance in the number of grain activations that is considered tolerable when any pair of silos exchange activations.

This is part of the ActOp paper, and is a key component in making sure that the exchange protocol does not overload a single silo by cutting remote connections and pushing all the affected activations towards that one silo.

Back to the default rule: this is a tolerance rule which is aware of the cluster size (number of active silos). It has been specifically crafted so that the tolerance is higher when the number of silos is low(er), and becomes tighter as the number increases.

Usually, deployed systems are in the range of 1-10 silos, and it is rare that systems go upwards of 100 silos. Based on this "guess", the default rule "rewards" smaller clusters with a higher tolerance and "punishes" the ones which are larger. It is sound to do this because, if the tolerance were fixed, a cluster with more silos could accumulate a greater overall imbalance.

The rule follows a piecewise, inverted, scaled and shifted, sigmoid function which maps the number of active silos to different tolerance levels. Below we can see a graph on how this changes, where the x-axis represents the number of silos, and the y-axis is the percentage deviation from the baseline value (set to 10).

For example: when the number of silos = 2, the tolerance is a difference of ~1000 activations between any pair of silos, and when the number of silos >= 100, the tolerance is a difference of ~100 activations (a 10x reduction).
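The general shape of such a rule can be sketched as follows. This is only an illustration of an inverted, scaled and shifted sigmoid: the constants are guesses chosen to land near the ~1000 and ~100 figures above, the sketch is not piecewise, and it does not reproduce the actual DefaultImbalanceRule or the IImbalanceToleranceRule signature.

```csharp
using System;

// All constants below are illustrative guesses, not the values used in the PR.
double Tolerance(int activeSilos)
{
    const double Floor = 100;      // tolerance approached for large clusters
    const double Ceiling = 1000;   // tolerance approached for small clusters
    const double Midpoint = 30;    // cluster size where the curve is halfway
    const double Steepness = 0.1;  // how quickly the tolerance tightens

    // Inverted sigmoid: close to Ceiling for few silos, close to Floor for many.
    var sigmoid = 1.0 / (1.0 + Math.Exp(Steepness * (activeSilos - Midpoint)));
    return Floor + (Ceiling - Floor) * sigmoid;
}

// An exchange is acceptable only if the activation counts stay within tolerance.
bool IsTolerable(int localActivations, int remoteActivations, int activeSilos) =>
    Math.Abs(localActivations - remoteActivations) <= Tolerance(activeSilos);

foreach (var silos in new[] { 2, 10, 50, 100 })
{
    Console.WriteLine($"{silos} silos -> tolerance {Tolerance(silos):F0}");
}
```

Under these assumed constants, a difference of 500 activations would be tolerated in a 2-silo cluster but rejected in a 100-silo cluster.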

Note that I am open to change this if there is a better suited, generic rule!

`ActiveRebalancingOptions`

This class is used to control the behavior of the rebalancer(s), and while it has been decorated with plentiful comments on the meanings of each of the properties, 2 things are worth pointing out here:

The DueTime is given as a range, as opposed to a single value. The actual due time is picked randomly within that range, so that we do a better job of avoiding so-called "mutual exchange attempts", where 2 silos begin the exchange protocol with each other at the same time. While there is a mechanism (mentioned above in the ActiveRebalancerGrain section) to break those silos out of this state, it does not hurt to avoid it as much as possible.
The default values are picked "to my best reasoning" and I am open to changing them if more sound arguments are presented. Note that the DEFAULT_RECOVERY_PERIOD is picked in accordance with the ActOp paper.
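Picking a due time from a range can be sketched like this. The helper name and the 30 to 60 second range are hypothetical; the real option names and defaults are in ActiveRebalancingOptions.

```csharp
using System;

// Picks a due time uniformly at random between min and max (inclusive bounds).
TimeSpan PickDueTime(TimeSpan min, TimeSpan max)
{
    if (max < min)
    {
        throw new ArgumentException("max must be greater than or equal to min");
    }

    // TimeSpan supports multiplication by a double factor.
    return min + (max - min) * Random.Shared.NextDouble();
}

var minDue = TimeSpan.FromSeconds(30); // hypothetical lower bound
var maxDue = TimeSpan.FromSeconds(60); // hypothetical upper bound
var due = PickDueTime(minDue, maxDue);

Console.WriteLine($"Next exchange attempt in {due.TotalSeconds:F1}s");
```

Because each silo draws its own value, two silos started at the same moment are unlikely to fire their timers simultaneously.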

Configuration

While operational, it is fair to say that this feature needs a good amount of testing, therefore it is opt-in!

To make use of it, users only need to call the AddActiveRebalancing extension method when configuring the ISiloBuilder, or use the generic version, AddActiveRebalancing<TRule>, which accepts an implementation of IImbalanceToleranceRule.

The options can be configured (if the defaults do not suit the user's needs) like any other options in the framework, via hostBuilder.Configure<ActiveRebalancingOptions>(o => {...})
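Put together, opting in might look roughly like the fragment below. It assumes a build of Orleans that contains this PR, a `using` for whichever namespace ActiveRebalancingOptions ends up in, and a hypothetical `MyRule` type; option property names are deliberately left out because they are documented on the class itself.

```csharp
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(silo =>
    {
        silo.UseLocalhostClustering();

        // Opt in to the feature (extension method added by this PR).
        silo.AddActiveRebalancing();

        // Alternatively, supply a custom tolerance rule
        // (MyRule is a hypothetical IImbalanceToleranceRule implementation):
        // silo.AddActiveRebalancing<MyRule>();

        silo.Configure<ActiveRebalancingOptions>(options =>
        {
            // Override defaults here; see the comments on
            // ActiveRebalancingOptions for the available properties.
        });
    })
    .Build();

await host.RunAsync();
```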

Additional Information

The main point of the algorithm is to break remote actor communications and convert those into local calls, while adhering to the imbalance tolerance defined above.
The algorithm determines the heaviest communication edges based on 2 factors: the number of connections and the frequency of message exchange between the connections.
SystemTargets such as stream pulling agents and grain services, as well as stateless worker grains, are also supported. Since they represent immovable components, they are not treated as sources of migration; instead they are treated as potential targets towards which other (movable) grains can be migrated.
In addition to these immovable components, users are free to decorate their own (inherently movable) grains with a special attribute called ImmovableAttribute, which instructs the runtime not to move activations of such grain types. Other movable grains can still be moved towards them.
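A simplified reading of these points can be sketched as a "transfer score": for each movable local grain, the weight of its remote edges minus the weight of its local edges. The grain names, weights and tuple layout below are invented for illustration, and the PR's actual candidate selection is more involved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Observed communication edges originating from local grains (hypothetical data).
var edges = new List<(string Source, bool TargetIsLocal, long Weight)>
{
    ("order-1", false, 40), // order-1 talks mostly to the remote silo
    ("order-1", true, 5),
    ("cart-7", false, 10),
    ("cart-7", true, 30),   // cart-7 talks mostly to local grains
    ("svc", false, 90),     // immovable, e.g. a grain service
};

// Immovable components are never migration sources.
var immovable = new HashSet<string> { "svc" };

var candidates = edges
    .Where(e => !immovable.Contains(e.Source))
    .GroupBy(e => e.Source)
    // Moving a grain turns its remote edges local and its local edges remote.
    .Select(g => (Grain: g.Key, Score: g.Sum(e => e.TargetIsLocal ? -e.Weight : e.Weight)))
    .Where(c => c.Score > 0)
    .OrderByDescending(c => c.Score)
    .ToList();

foreach (var (grain, score) in candidates)
{
    Console.WriteLine($"{grain}: net gain {score}");
}
```

Only `order-1` survives as a candidate: `cart-7` would lose more local traffic than it gains, and `svc` is excluded as a source even though its remote edge is the heaviest. A real exchange would additionally check the imbalance tolerance before moving anything.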

Tests

The solution is covered by some overarching tests (which could be improved ofc)

Examples

Below we can see some examples of isolated graphs which represent a microscopic view of much larger graphs, and the overall working principles of the algorithm.


gfoidl

Some nits.

src/Orleans.Runtime/Placement/Rebalancing/ActiveRebalancerGateway.cs

src/Orleans.Runtime/Placement/Rebalancing/ActiveRebalancerGrain.cs

ledjon-behluli · 2024-02-29T23:21:55Z

@gfoidl please have a look again, thx!

gfoidl

One small nit -- otherwise LGTM.

Nice work 👍🏻 (and super description of the PR, thanks for that).

src/Orleans.Runtime/Placement/Rebalancing/ActiveRebalancerGrain.cs

gfoidl reviewed Feb 29, 2024


gfoidl reviewed Mar 1, 2024

src/Orleans.Runtime/Placement/Rebalancing/ActiveRebalancerGrain.cs

ReubenBond self-assigned this Mar 11, 2024

ReubenBond force-pushed the active-rebalancing branch from 3148fac to e50a28a Compare April 9, 2024 02:40

AlgorithmsAreCool reviewed Apr 10, 2024

src/Orleans.Runtime/Placement/Rebalancing/ActiveRebalancerGrain.cs

ReubenBond force-pushed the active-rebalancing branch 8 times, most recently from e60d33e to 8ec56ef Compare May 22, 2024 01:59

ReubenBond and others added 11 commits May 22, 2024 16:38

Non-reentrant timers

8b5fa40

Active Rebalancing

0b7ff59

Try to improve candidate vertex max-heap perf

895febb

WIP

46623f0

The breaking

a562b28

Fix & clean up

34fa779

wip

6d44a9e

Log 1st chance exceptions in fan-out test

9087dcc

Fixes

9c80b4c

Reschedule timer whether sending or receiving a request

2e0104b

minor clean up

557a66a

ReubenBond force-pushed the active-rebalancing branch from ca641bc to 557a66a Compare May 22, 2024 23:39

Ignore remote vertex instance if it appears in the local set.

5592a98
