[test] statefulset controller: fix requests tracker concurrency #124624

atiratree · 2024-04-29T20:14:29Z

What type of PR is this?

/kind bug
/kind flake

What this PR does / why we need it:

fixes a flake when using Parallel PodManagementPolicy (burst) in StatefulSets

Which issue(s) this PR fixes:

Fixes #124526

Special notes for your reviewer:

tested with the stress tool

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2024-04-29T20:15:10Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: atiratree

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/controller/statefulset/OWNERS~~ [atiratree]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

atiratree · 2024-04-29T20:15:51Z

/triage accepted
/priority important-longterm

ingvagabund · 2024-05-07T09:22:03Z

Can you elaborate more on why the current implementation does not work as expected? I.e., what is the underlying cause that makes the test flake? And what has changed, and how does it remove the flakiness?

atiratree · 2024-05-07T10:08:01Z

We set the expected failure here, after 2 requests:

kubernetes/pkg/controller/statefulset/stateful_set_control_test.go

Line 523 in 0590bb1

    
           om.SetCreateStatefulPodError(apierrors.NewInternalError(errors.New("API server failed")), 2)

Then we scale up the StatefulSet in parallel (Burst).

kubernetes/pkg/controller/statefulset/stateful_set_control_test.go

Line 525 in 0590bb1

    
           if err := scaleUpStatefulSetControl(set, ssc, om, invariants); !isOrHasInternalError(err) {

Then the pod tracker is called.

kubernetes/pkg/controller/statefulset/stateful_set_control_test.go

Line 2516 in 0590bb1

if om.createPodTracker.errorReady() {

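Each tracker method is thread-safe on its own, but the methods are not thread-safe when used together: another goroutine can run between the increment and the error check. This is the reason behind the flake.

So we fix that by performing the increment and the error check under a single lock.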

We also have the parallel-request check, which is independent and is used in the TestParallelScale test.

kubernetes/pkg/controller/statefulset/stateful_set_control_test.go

Line 2946 in 0590bb1

func TestParallelScale(t *testing.T) {

This was previously guarded by a lock, and maxParallel was always equal to the number of requests, so it did not test parallelism correctly and had to be fixed as well.

ingvagabund · 2024-05-07T10:44:44Z

pkg/controller/statefulset/stateful_set_control_test.go

+		desc                        string
+		replicas                    int32
+		desiredReplicas             int32
+		expectedMinParallelRequests int


expectedMinParallelRequests is a new field? Why is the original 1 not enough?

atiratree · 2024-05-07

I have moved this refactoring to a new commit: 7b725f3.

This makes the test more accurate: originally we expected only 2 parallel requests, but if we create 1000 pods we should expect more; otherwise it would suggest a regression.

ingvagabund · 2024-05-07T10:45:25Z

pkg/controller/statefulset/stateful_set_control_test.go

@@ -3017,8 +3015,8 @@ func parallelScale(t *testing.T, set *apps.StatefulSet, replicas, desiredReplica
 		t.Errorf("Failed to scale statefulset to %v replicas, got %v replicas", desiredReplicas, set.Status.Replicas)
 	}

-	if (diff < -1 || diff > 1) && om.createPodTracker.maxParallel <= 1 {
-		t.Errorf("want max parallel requests > 1, got %v", om.createPodTracker.maxParallel)
+	if om.createPodTracker.maxParallelRequests < expectedMinParallelRequests {


Why is (diff < -1 || diff > 1) no longer tested?

atiratree · 2024-05-07

This was replaced by expectedMinParallelRequests, since each test case above can specify the number of parallel requests it expects for its diff.

ingvagabund · 2024-05-07T11:29:28Z

Worth mentioning:

With trackParallelRequests locking, the parallel requests/goroutines are serialized, which makes the test less realistic (some of the state space is not explored).
Removing the reset of both the delay and parallel fields is OK: the execution path does not generate the error, so the reset code is never executed and is safe to remove.

atiratree · 2024-05-07T18:57:40Z

> With trackParallelRequests locking, the parallel requests/goroutines are serialized, which makes the test less realistic (some of the state space is not explored).

I have solved this by requiring the tracking to be explicitly enabled:

https://github.com/kubernetes/kubernetes/pull/124624/files#diff-49263a54d6a7753728e489881537462578ddc2b9b09404ee335c73e27e178ddcR2436-R2438

So it should no longer affect the other tests, apart from TestParallelScale.

I have also added a check for the expectedMinParallelRequests values.

https://github.com/kubernetes/kubernetes/pull/124624/files#diff-49263a54d6a7753728e489881537462578ddc2b9b09404ee335c73e27e178ddcR2990-R2994

We can test for a larger number of parallel requests than 2 in StatefulSets with hundreds of replicas.

ingvagabund · 2024-05-10T12:42:14Z

/lgtm

k8s-ci-robot · 2024-05-10T12:42:20Z

LGTM label has been added.

Git tree hash: 9efa79f0eea8c3b1ddaaa63a3763993f5280d538

k8s-ci-robot requested review from mortent and smarterclayton April 29, 2024 20:15

k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Apr 29, 2024

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 29, 2024

atiratree mentioned this pull request Apr 29, 2024

Flaky UT TestStatefulSetControl/CreatePodFailure/Burst/ScaleDownOnly/StatefulSetAutoDeletePVCEnabled #124526

Closed

ingvagabund reviewed May 7, 2024

atiratree force-pushed the fix-flake branch from 8108826 to 2523fb6 Compare May 7, 2024 18:34

fix requests tracker concurrency

df276c5

atiratree force-pushed the fix-flake branch from 2523fb6 to 7b725f3 Compare May 7, 2024 18:45

improve TestParallelScale test

8ca077a

we can test for larger number of parallel requests than 2 in statefulsets with hundreds of replicas

atiratree force-pushed the fix-flake branch from 7b725f3 to 8ca077a Compare May 7, 2024 19:02

ingvagabund self-assigned this May 10, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 10, 2024

ingvagabund changed the title ~~fix requests tracker concurrency~~ [test] statefulset controller: fix requests tracker concurrency May 10, 2024

k8s-ci-robot merged commit f75d8e9 into kubernetes:master May 10, 2024
14 checks passed

k8s-ci-robot added this to the v1.31 milestone May 10, 2024
