Ensure Top State Evenness When Capacity Keys Not Defined #2760

GrantPSpencer · 2024-02-16T23:16:39Z

Issues

My PR addresses the following Helix issues and references them in the PR description:

When no resource or instance capacity keys are defined for a WAGED cluster, then WAGED does not try to ensure topstate evenness. In a cluster where the # participants = # replicas, each instance will have the same assignment.

If each resource has multiple partitions, then a degree of evenness will be achieved, but it could be improved.

If there is only 1 partition per resource and the # participants = # replicas, then one instance will be assigned leader for all resources as the tiebreak is the participant's logical ID.

Description

Here are some details about my PR, including screenshots of any UI changes:

Customers should define resource weights and instance capacities to fully utilize the benefits of WAGED. However, for those that do not we should still consider top state evenness in WAGED assignment calculations. This PR adds changes to the "TopStateMaxCapacityUsageInstanceConstraint" soft constraint. Currently it preferentially assigns top state replicas to nodes with less top states based off the replica's weight. If no weight is defined for the resource and the node, then the score will always be 0 and tiebreak will be determined by the node's logicalID.

This change will make the soft constraint calculate the score based off the number of top states if there are no capacity keys defined for both the resource and the instance.

This only includes a very minor change of changing one test method's access modifier from private to public. I checked the CI logs and testNG has been successfully running this method, but convention is to set it to public.

Tests

The following tests are written for this issue:

TestTopStateMaxCapacityUsageInstanceConstraint

testGetNormalizedScoreNoCapacityKeysDefined

Manually tested distribution before and after change:

No Capacity keys defined for either the resources or the participants

Test 1
3 Participants, 20 WAGED Resources, 1 Partition Each.

BEFORE CHANGE:
Count of Leader Assignments
{localhost_12919=0, localhost_12918=20, localhost_12920=0}

AFTER CHANGE:
Count of Leader Assignments
{localhost_12919=7, localhost_12918=7, localhost_12920=6}

Test 2
3 Participants, 20 WAGED Resources, 10 Partitios Each.

BEFORE CHANGE:
Count of Leader Assignments
{localhost_12919=60, localhost_12918=80, localhost_12920=60}

AFTER CHANGE:
Count of Leader Assignments
{localhost_12919=67, localhost_12918=67, localhost_12920=66}

The following is the result of the "mvn test" command on the appropriate module:

Ran PR CI against my personal fork, failed due to flaky test testEvacuationWithOfflineInstancesInCluster #2721
https://github.com/GrantPSpencer/helix/actions/runs/7935782011/job/21669654755?pr=6

Changes that Break Backward Compatibility (Optional)

My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

N/A

Commits

My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Code Quality

My diff has been formatted using helix-style.xml
(helix-style-intellij.xml if IntelliJ IDE is used)

junkaixue · 2024-02-22T18:09:13Z

...elix/controller/rebalancer/waged/constraints/TopStateMaxCapacityUsageInstanceConstraint.java

+    // No node or replica capacity defined, we default to just looking at # top state
+    if (node.getMaxCapacity().isEmpty() && replica.getCapacity().isEmpty()) {
+      int curTopPartitionCountForResource = node.getAssignedTopStatePartitionsCount();
+      int estimatedMaxTopStateCount = clusterContext.getEstimatedMaxTopStateCount();
+      return computeUtilizationScore(estimatedMaxTopStateCount, curTopPartitionCountForResource);
+    }


Will this be the right behavior? If Capacity is not defined, we should not process and let the computation failed because this is an error.

Adding this code is more like a hiding the problem.

Expected behavior should be :

Evenness of top state should be handled in different constraint.

If it miss the required field, we should fail the computation and let user handle it.

Currently users can onboard to WAGED without setting instance and partition capacities. WAGED validation fails if only 1 has been defined or if there are missing keys, but will successfully run if neither are defined. In this case, WAGED doesn't ensure any top state evenness as the assignment scores will always be 0 and tiebreak will be determined by the node's logicalID.

Should the expectation that users have to define at least a default instance and partition capacity for the cluster? Like so:

"mapFields": { "DEFAULT_INSTANCE_CAPACITY_MAP": { "PARTCOUNT": "2000", }, "DEFAULT_PARTITION_WEIGHT_MAP": { "PARTCOUNT": "386" } ... "listFields": { "INSTANCE_CAPACITY_KEYS": [ "PARTCOUNT" ]

Yeah. I think my second point get the answer. But still we should not make top state evenness into capacity constraint. Logically it is not clear.

Would be good to have a separate constraints to hold it.

GrantPSpencer added 2 commits February 16, 2024 11:42

TopStateMaxCapacity use count when no capacity defined

a1ab414

change test method to public

61dfbf8

GrantPSpencer changed the title ~~Top state evenness~~ Ensure Top State Evenness When Capacity Keys Not Defined #6 Feb 16, 2024

GrantPSpencer changed the title ~~Ensure Top State Evenness When Capacity Keys Not Defined #6~~ Ensure Top State Evenness When Capacity Keys Not Defined Feb 16, 2024

junkaixue reviewed Feb 22, 2024

View reviewed changes

GrantPSpencer closed this May 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure Top State Evenness When Capacity Keys Not Defined #2760

Ensure Top State Evenness When Capacity Keys Not Defined #2760

GrantPSpencer commented Feb 16, 2024

junkaixue Feb 22, 2024

GrantPSpencer Mar 13, 2024

junkaixue Mar 21, 2024

Ensure Top State Evenness When Capacity Keys Not Defined #2760

Ensure Top State Evenness When Capacity Keys Not Defined #2760

Conversation

GrantPSpencer commented Feb 16, 2024

Issues

Description

Tests

Manually tested distribution before and after change:

Changes that Break Backward Compatibility (Optional)

Commits

Code Quality

junkaixue Feb 22, 2024

Choose a reason for hiding this comment

GrantPSpencer Mar 13, 2024

Choose a reason for hiding this comment

junkaixue Mar 21, 2024

Choose a reason for hiding this comment