[test] statefulset controller: fix requests tracker concurrency #124624

Merged
merged 2 commits into from May 10, 2024

Member

@atiratree atiratree commented Apr 29, 2024

What type of PR is this?

/kind bug
/kind flake

What this PR does / why we need it:

Fixes a flake when using the Parallel PodManagementPolicy (burst) in StatefulSets.

Which issue(s) this PR fixes:

Fixes #124526

Special notes for your reviewer:

tested with the stress tool

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 29, 2024
@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Apr 29, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: atiratree

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 29, 2024
@atiratree
Member Author

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 29, 2024
@ingvagabund
Contributor

Can you elaborate more on why the current implementation does not work as expected? I.e., what is the underlying cause of the test flake? What has changed, and how does it remove the flakiness?

@atiratree
Member Author

atiratree commented May 7, 2024

We set the expected failure here, after 2 requests:

om.SetCreateStatefulPodError(apierrors.NewInternalError(errors.New("API server failed")), 2)

Then we scale up the StatefulSet in parallel (Burst).

if err := scaleUpStatefulSetControl(set, ssc, om, invariants); !isOrHasInternalError(err) {

Then the pod tracker is called.

if om.createPodTracker.errorReady() {

Each tracker method is thread-safe on its own, but they are not thread-safe when used together: another goroutine can run between the error check and the increment. This is the reason behind the flake.

So we fix that by performing the increment and the error check under a single lock.


We also have the parallel-requests check, which is independent and is used in the TestParallelScale test.

func TestParallelScale(t *testing.T) {

Previously this was guarded by a lock, and maxParallel was always equal to the number of requests, so it did not test the parallelism correctly and had to be fixed as well.

desc string
replicas int32
desiredReplicas int32
expectedMinParallelRequests int
Contributor

Is expectedMinParallelRequests a new field? Why is the original 1 not enough?

Member Author

I have moved this refactoring to a new commit: 7b725f3.

This makes the test more accurate: originally we expected only 2 parallel requests, but if we create 1000 pods we should expect more; otherwise it would suggest a regression.

@@ -3017,8 +3015,8 @@ func parallelScale(t *testing.T, set *apps.StatefulSet, replicas, desiredReplica
 		t.Errorf("Failed to scale statefulset to %v replicas, got %v replicas", desiredReplicas, set.Status.Replicas)
 	}

-	if (diff < -1 || diff > 1) && om.createPodTracker.maxParallel <= 1 {
-		t.Errorf("want max parallel requests > 1, got %v", om.createPodTracker.maxParallel)
+	if om.createPodTracker.maxParallelRequests < expectedMinParallelRequests {
Contributor

Why is (diff < -1 || diff > 1) no longer tested?

Member Author

This was replaced by expectedMinParallelRequests: each test case above can specify the minimum number of parallel requests expected for its diff.
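
A rough sketch of the per-case threshold, assuming a table-driven layout. The field names come from the snippet above; the scaleCase type name, the checkParallelism helper, and all the numbers are made up for illustration and are not the values used in the PR.

```go
package main

import "fmt"

// scaleCase mirrors the fields quoted in the review. The type name itself
// is hypothetical.
type scaleCase struct {
	desc                        string
	replicas                    int32
	desiredReplicas             int32
	expectedMinParallelRequests int
}

// checkParallelism compares the observed maximum of parallel requests with
// the minimum that the test case demands.
func checkParallelism(tc scaleCase, observedMax int) error {
	if observedMax < tc.expectedMinParallelRequests {
		return fmt.Errorf("%s: want max parallel requests >= %d, got %d",
			tc.desc, tc.expectedMinParallelRequests, observedMax)
	}
	return nil
}

func main() {
	// Illustrative cases only: a larger diff can demand more parallelism.
	cases := []scaleCase{
		{desc: "small scale up", replicas: 1, desiredReplicas: 3, expectedMinParallelRequests: 2},
		{desc: "large scale up", replicas: 0, desiredReplicas: 1000, expectedMinParallelRequests: 10},
	}
	for _, tc := range cases {
		// 4 stands in for a measured value; it is not real test output.
		fmt.Println(checkParallelism(tc, 4))
	}
}
```

Putting the threshold in the test table lets a case with a large diff demand more parallelism than a small one, which a fixed "> 1" check cannot express.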

@ingvagabund
Contributor

Worth mentioning:

  • With trackParallelRequests locking, the parallel requests/routines are serialized, which makes the testing less realistic (some of the state space is not explored).
  • Removing the reset of both the delay and parallel fields is OK: the execution path does not generate the error, so the reset code is never executed and is safe to remove.

@atiratree
Member Author

> With trackParallelRequests locking, the parallel requests/routines are serialized, which makes the testing less realistic (some of the state space is not explored).

I have solved this by making the tracking opt-in, so it has to be enabled explicitly:

https://github.com/kubernetes/kubernetes/pull/124624/files#diff-49263a54d6a7753728e489881537462578ddc2b9b09404ee335c73e27e178ddcR2436-R2438

so it should not affect the other tests now, apart from the TestParallelScale


I have also added a check for the expectedMinParallelRequests values.

https://github.com/kubernetes/kubernetes/pull/124624/files#diff-49263a54d6a7753728e489881537462578ddc2b9b09404ee335c73e27e178ddcR2990-R2994

we can test for larger number of parallel requests than 2 in
statefulsets with hundreds of replicas
@ingvagabund ingvagabund self-assigned this May 10, 2024
@ingvagabund
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 10, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 9efa79f0eea8c3b1ddaaa63a3763993f5280d538

@ingvagabund ingvagabund changed the title fix requests tracker concurrency [test] statefulset controller: fix requests tracker concurrency May 10, 2024
@k8s-ci-robot k8s-ci-robot merged commit f75d8e9 into kubernetes:master May 10, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone May 10, 2024
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Flaky UT TestStatefulSetControl/CreatePodFailure/Burst/ScaleDownOnly/StatefulSetAutoDeletePVCEnabled
3 participants