RFC: e2e: better polling support #112013

pohly · 2022-08-24T15:01:30Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

When running tests manually it is important to know when a test is stuck and why. The commits in this PR provide some replacements for ginkgo and gomega functions which provide better output while a long-running operation is going on.

Special notes for your reviewer:

I described the idea in onsi/gomega#574. We need to decide whether this can and/or should be solved in ginkgo/gomega or in Kubernetes.

The new functions get added to test/e2e/framework instead of, say, test/e2e/framework/ginkgowrapper because I see it as similar to framework.Logf and framework.Failf and because the intermediate output is done with framework.Logf. That would not be possible when in a sub-package because sub-packages must not depend on the core framework.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

ginkgo.By can be used to record how long a callback runs. This particular feature is not used much (or at all?) in Kubernetes and we don't have tooling for analyzing the measurements. What would be more important is text output. framework.By provides that: - reminders that a step is still running at regular intervals, configurable with --slow-step-threshold - the total duration for slow steps Example: STEP: delete pods and claims 08/24/22 11:55:19.676 STEP: deleting *v1.Pod dra-5514/tester-1 08/24/22 11:55:19.679 STEP: deleting *v1.ResourceClaim dra-5514/tester-1-my-inline-claim 08/24/22 11:55:19.686 STEP still running: delete pods and claims 08/24/22 11:55:29.677 STEP still running: delete pods and claims 08/24/22 11:55:39.676 STEP still running: delete pods and claims 08/24/22 11:55:49.677 <additional output when the objects finally get removed> STEP duration: delete pods and claims 33.253509791s

gomega.Eventually with a function that takes a gomega.Gomega instance can be used to poll for certain conditions. The advantage over custom code is that gomega assertions can be used. However, gomega as it stands today does not report why the polling continues. This is annoying when running a test interactively: one has to wait for the timeout before additional information is shown in the failure message. SucceedEventually is a replacement for gomega.Eventually which is tailored towards this use case and adds intermediate output. It's best combined with the By replacement. Example with some artificial failure: STEP: wait for plugin registration 08/24/22 14:17:23.488 ... STEP still running: wait for plugin registration 08/24/22 14:17:33.489 STEP still running: wait for plugin registration 08/24/22 14:17:43.489 STEP still running: wait for plugin registration 08/24/22 14:17:53.489 Aug 24 14:17:53.509: INFO: wait for plugin registration xxx Expected <bool>: false to equal <bool>: true STEP still running: wait for plugin registration 08/24/22 14:18:03.489 STEP still running: wait for plugin registration 08/24/22 14:18:13.489 STEP still running: wait for plugin registration 08/24/22 14:18:23.489 Aug 24 14:18:23.489: FAIL: Timed out after 60.000s. wait for plugin registration Expected <string>: xxx Expected <bool>: false to equal <bool>: true to be empty Full Stack Trace k8s.io/kubernetes/test/e2e/dra.(*Driver).SetUp.func7() /nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/deploy.go:195 +0xda k8s.io/kubernetes/test/e2e/framework.By({0x702cfda, 0x1c}, 0xc0029d3df8, {0x0, 0x0, 0x6fbc0cb?}) /nvme/gopath/src/k8s.io/kubernetes/test/e2e/framework/log.go:115 +0x5ea k8s.io/kubernetes/test/e2e/dra.(*Driver).SetUp(0xc0006af700) /nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/deploy.go:194 +0xa67 k8s.io/kubernetes/test/e2e/dra.NewDriver.func1() /nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/deploy.go:67 +0x26

k8s-ci-robot · 2022-08-24T15:01:40Z

@pohly: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2022-08-24T15:02:19Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~test/e2e/framework/OWNERS~~ [pohly]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pohly · 2022-08-24T15:02:31Z

Some tests are needed if this is to be merged.

/hold

pohly · 2022-08-24T20:38:02Z

/retest

pohly · 2022-08-25T07:01:45Z

test/e2e/framework/log.go

+//
+// To get timing information for a long running SucceedEventually, wrap it in
+// By.
+func SucceedEventually(callback func(g gomega.Gomega), extra []interface{}, intervals ...interface{}) {


I need to point out one potential pitfall (will also add to the comment): the callback must use g.Expect instead of ginkgo.Expect or our framework helper functions like framework.ExpectNoError.

That's because the failure must be reported via the failure handler encapsulated inside the Gomega instance, not the global failure handler.

The alternative is to support a func() error callback instead or in addition to func(g gomega.Gomega). If we want to support both, callback would have to be of type interface{} with runtime type assertion. Not sure whether generics would help.

pohly · 2022-08-25T07:03:59Z

/cc @aojea

It looks like @onsi is wary of supporting something like this in Gomega (onsi/gomega#574 (comment)) and I tend to agree, so I'd like to get some opinions about this PR and move it forward.

onsi · 2022-08-25T18:22:22Z

I don't have a lot of day-to-day experience with the e2e suite so y'all should take this input with a big grain of salt.

It seems like the main problem this is trying to solve is around the user experience when manually (i.e. on the console locally? in a specialized CI environment?) running a small subset of tests. Namely: can we get the user faster feedback about where the test is and what might be in the process of failing and why. The proposed solution is to emit information periodically whenever the spec is waiting/polling.

One potential downside to this is that it would make CI much more verbose - but it looks like the answer to that is to make the status update interval configurable. That's fair...

With that said, though, the I'm not sure the addition to have By emit periodically adds a lot of value. I would expect that a developer tailing the log could simply follow along in the cod and see where things are currently.

The additions to Eventually are more interesting. One challenge is that some error messages can be quite long and if every Eventually is emitting periodically the log can become overwhelming/hard to follow. The other challenge is that, as implemented, it seems most users will have to adopt SucceedEventually and use a func(g gomega.Gomega) callback in situations, prior to this, simpler and clearer Eventually(func() <return>).Should(...) assertions would have sufficed.

Personally, when I'm debugging specs that can get stuck locally I try to lower the timeouts involved to get faster feedback. I'd be happy to add support to Gomega to make this dead simple to use. For example GOMEGA_INTERVAL_SCALE=0.1 ginkgo could scale all timeouts down to a factor of 10%. This would allow committed code to be optimized for worst-case scenario CI environments, but make it possible to have a faster developer experience where (presumably) infrastructure is less shared and therefore (perhaps?) more reliable.

If you'd nonetheless like to pursue the approach in this PR we could explore ways to make Eventually a bit more generically extensible. For example I think it could make sense to add something like .Notify() to Eventually that takes a function that will be called while Eventually runs with status updates (namely: how many times things have run, the failure message, etc.). You could then provide a wrapper that calls Notify on the users behalf (or, if you'd like, override the default Gomega Eventually - though I'd need to think through how best to propose doing that).

pohly · 2022-08-25T20:13:40Z

I would expect that a developer tailing the log could simply follow along in the cod and see where things are currently.

That only work if code is instrumented consistently, i.e. starts with a By. It can happen that I am looking at a

STEP: doing something

but in reality that operation is already completed and the test is stuck on "doing something else".

pohly · 2022-08-25T20:14:54Z

Personally, when I'm debugging specs that can get stuck locally I try to lower the timeouts involved to get faster feedback. I'd be happy to add support to Gomega to make this dead simple to use.

That won't help in Kubernetes because most polling is not done with Gomega.

onsi · 2022-08-25T20:26:27Z

got it, makes sense.

i wonder if an additional helpful thing could be for ginkgo to support receiving a signal and telling you where the current test is currently waiting. could even imagine adding a flag where it will emit that information periodically if a test has been running for >N seconds

pohly · 2022-08-26T07:11:07Z

i wonder if an additional helpful thing could be for ginkgo to support receiving a signal and telling you where the current test is currently waiting.

As noted in onsi/ginkgo#1023 (comment), we need to keep printing BeforeEach/AfterEach in Kubernetes because we cannot assume that each node starts with a By or some other log output. Even with that, improving By as proposed here still makes sense for a long-running node because ginkgo cannot know where in the node the process currently is: it knows nothing about ginkgo.By, nor about gomega.Eventually.

A signal that triggers "current running BeforeEach/AfterEach/It XYZ" may be useful, but IMHO not as useful as a the extended By from this PR.

If we were to add that By enhancement to ginkgo, perhaps the extra output could be limited to runs where -progress is enabled? Then it won't affect most existing users, but work nicely for us in Kubernetes.

pohly · 2022-08-30T15:59:33Z

I discussed this some more with Onsi and the conclusion is:

By and ginkgo will get enhanced so that its more obvious where a test is stuck.
Timeout handling through a context will make it simpler to use gomega.Eventually.

Both means that this PR is not going to be needed.

/close

k8s-ci-robot · 2022-08-30T15:59:52Z

@pohly: Closed this PR.

In response to this:

I discussed this some more with Onsi and the conclusion is:

By and ginkgo will get enhanced so that its more obvious where a test is stuck.

Timeout handling through a context will make it simpler to use gomega.Eventually.

Both means that this PR is not going to be needed.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

pohly added 2 commits August 24, 2022 16:49

k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 24, 2022

pohly mentioned this pull request Aug 24, 2022

intermediate progress reports for asynchronous assertions (Eventually, Consistently, ...) onsi/gomega#574

Open

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 24, 2022

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 24, 2022

k8s-ci-robot requested review from fejta and oomichi August 24, 2022 15:02

pohly commented Aug 25, 2022

View reviewed changes

k8s-ci-robot requested a review from aojea August 25, 2022 07:04

k8s-ci-robot closed this Aug 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: e2e: better polling support #112013

RFC: e2e: better polling support #112013

pohly commented Aug 24, 2022

k8s-ci-robot commented Aug 24, 2022

k8s-ci-robot commented Aug 24, 2022

pohly commented Aug 24, 2022

pohly commented Aug 24, 2022

pohly Aug 25, 2022 •

edited

pohly commented Aug 25, 2022

onsi commented Aug 25, 2022

pohly commented Aug 25, 2022

pohly commented Aug 25, 2022

onsi commented Aug 25, 2022

pohly commented Aug 26, 2022

pohly commented Aug 30, 2022

k8s-ci-robot commented Aug 30, 2022

RFC: e2e: better polling support #112013

RFC: e2e: better polling support #112013

Conversation

pohly commented Aug 24, 2022

What type of PR is this?

What this PR does / why we need it:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot commented Aug 24, 2022

k8s-ci-robot commented Aug 24, 2022

pohly commented Aug 24, 2022

pohly commented Aug 24, 2022

pohly Aug 25, 2022 • edited

Choose a reason for hiding this comment

pohly commented Aug 25, 2022

onsi commented Aug 25, 2022

pohly commented Aug 25, 2022

pohly commented Aug 25, 2022

onsi commented Aug 25, 2022

pohly commented Aug 26, 2022

pohly commented Aug 30, 2022

k8s-ci-robot commented Aug 30, 2022

pohly Aug 25, 2022 •

edited