Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure Top State Evenness When Capacity Keys Not Defined #2760

Closed
wants to merge 2 commits into from

Conversation

GrantPSpencer
Copy link
Contributor

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

When no resource or instance capacity keys are defined for a WAGED cluster, then WAGED does not try to ensure topstate evenness. In a cluster where the # participants = # replicas, each instance will have the same assignment.

If each resource has multiple partitions, then a degree of evenness will be achieved, but it could be improved.

If there is only 1 partition per resource and the # participants = # replicas, then one instance will be assigned leader for all resources as the tiebreak is the participant's logical ID.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Customers should define resource weights and instance capacities to fully utilize the benefits of WAGED. However, for those that do not we should still consider top state evenness in WAGED assignment calculations. This PR adds changes to the "TopStateMaxCapacityUsageInstanceConstraint" soft constraint. Currently it preferentially assigns top state replicas to nodes with less top states based off the replica's weight. If no weight is defined for the resource and the node, then the score will always be 0 and tiebreak will be determined by the node's logicalID.

This change will make the soft constraint calculate the score based off the number of top states if there are no capacity keys defined for both the resource and the instance.

This only includes a very minor change of changing one test method's access modifier from private to public. I checked the CI logs and testNG has been successfully running this method, but convention is to set it to public.

Tests

  • The following tests are written for this issue:

TestTopStateMaxCapacityUsageInstanceConstraint

  • testGetNormalizedScoreNoCapacityKeysDefined

Manually tested distribution before and after change:

No Capacity keys defined for either the resources or the participants

Test 1
3 Participants, 20 WAGED Resources, 1 Partition Each.

BEFORE CHANGE:
Count of Leader Assignments
{localhost_12919=0, localhost_12918=20, localhost_12920=0}

AFTER CHANGE:
Count of Leader Assignments
{localhost_12919=7, localhost_12918=7, localhost_12920=6}

Test 2
3 Participants, 20 WAGED Resources, 10 Partitios Each.

BEFORE CHANGE:
Count of Leader Assignments
{localhost_12919=60, localhost_12918=80, localhost_12920=60}

AFTER CHANGE:
Count of Leader Assignments
{localhost_12919=67, localhost_12918=67, localhost_12920=66}
  • The following is the result of the "mvn test" command on the appropriate module:

Ran PR CI against my personal fork, failed due to flaky test testEvacuationWithOfflineInstancesInCluster #2721
https://github.com/GrantPSpencer/helix/actions/runs/7935782011/job/21669654755?pr=6

Changes that Break Backward Compatibility (Optional)

  • My PR contains changes that break backward compatibility or previous assumptions for certain methods or API. They include:

N/A

Commits

  • My commits all reference appropriate Apache Helix GitHub issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Code Quality

  • My diff has been formatted using helix-style.xml
    (helix-style-intellij.xml if IntelliJ IDE is used)

@GrantPSpencer GrantPSpencer changed the title Top state evenness Ensure Top State Evenness When Capacity Keys Not Defined #6 Feb 16, 2024
@GrantPSpencer GrantPSpencer changed the title Ensure Top State Evenness When Capacity Keys Not Defined #6 Ensure Top State Evenness When Capacity Keys Not Defined Feb 16, 2024
Comment on lines +45 to +50
// No node or replica capacity defined, we default to just looking at # top state
if (node.getMaxCapacity().isEmpty() && replica.getCapacity().isEmpty()) {
int curTopPartitionCountForResource = node.getAssignedTopStatePartitionsCount();
int estimatedMaxTopStateCount = clusterContext.getEstimatedMaxTopStateCount();
return computeUtilizationScore(estimatedMaxTopStateCount, curTopPartitionCountForResource);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be the right behavior? If Capacity is not defined, we should not process and let the computation failed because this is an error.

Adding this code is more like a hiding the problem.

Expected behavior should be :

  1. Evenness of top state should be handled in different constraint.
  2. If it miss the required field, we should fail the computation and let user handle it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently users can onboard to WAGED without setting instance and partition capacities. WAGED validation fails if only 1 has been defined or if there are missing keys, but will successfully run if neither are defined. In this case, WAGED doesn't ensure any top state evenness as the assignment scores will always be 0 and tiebreak will be determined by the node's logicalID.

Should the expectation that users have to define at least a default instance and partition capacity for the cluster? Like so:

  "mapFields": {
    "DEFAULT_INSTANCE_CAPACITY_MAP": {
      "PARTCOUNT": "2000",
    },
   "DEFAULT_PARTITION_WEIGHT_MAP": {
      "PARTCOUNT": "386"
    }

...

  "listFields": {
    "INSTANCE_CAPACITY_KEYS": [
      "PARTCOUNT"
    ]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I think my second point get the answer. But still we should not make top state evenness into capacity constraint. Logically it is not clear.

Would be good to have a separate constraints to hold it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants