Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: e2e: better polling support #112013

Closed
wants to merge 2 commits into from
Closed

Conversation

pohly
Copy link
Contributor

@pohly pohly commented Aug 24, 2022

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

When running tests manually it is important to know when a test is stuck and why. The commits in this PR provide some replacements for ginkgo and gomega functions which provide better output while a long-running operation is going on.

Special notes for your reviewer:

I described the idea in onsi/gomega#574. We need to decide whether this can and/or should be solved in ginkgo/gomega or in Kubernetes.

The new functions get added to test/e2e/framework instead of, say, test/e2e/framework/ginkgowrapper because I see it as similar to framework.Logf and framework.Failf and because the intermediate output is done with framework.Logf. That would not be possible when in a sub-package because sub-packages must not depend on the core framework.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


ginkgo.By can be used to record how long a callback runs. This particular
feature is not used much (or at all?) in Kubernetes and we don't have tooling
for analyzing the measurements.

What would be more important is text output. framework.By provides that:
- reminders that a step is still running at regular intervals,
  configurable with --slow-step-threshold
- the total duration for slow steps

Example:

    STEP: delete pods and claims 08/24/22 11:55:19.676
    STEP: deleting *v1.Pod dra-5514/tester-1 08/24/22 11:55:19.679
    STEP: deleting *v1.ResourceClaim dra-5514/tester-1-my-inline-claim 08/24/22 11:55:19.686
    STEP still running: delete pods and claims 08/24/22 11:55:29.677
    STEP still running: delete pods and claims 08/24/22 11:55:39.676
    STEP still running: delete pods and claims 08/24/22 11:55:49.677
    <additional output when the objects finally get removed>
    STEP duration: delete pods and claims 33.253509791s
gomega.Eventually with a function that takes a gomega.Gomega instance can be
used to poll for certain conditions. The advantage over custom code is that
gomega assertions can be used. However, gomega as it stands today does not
report why the polling continues. This is annoying when running a test
interactively: one has to wait for the timeout before additional information is
shown in the failure message.

SucceedEventually is a replacement for gomega.Eventually which is tailored
towards this use case and adds intermediate output. It's best combined with the
By replacement.

Example with some artificial failure:

    STEP: wait for plugin registration 08/24/22 14:17:23.488
    ...
    STEP still running: wait for plugin registration 08/24/22 14:17:33.489
    STEP still running: wait for plugin registration 08/24/22 14:17:43.489
    STEP still running: wait for plugin registration 08/24/22 14:17:53.489
    Aug 24 14:17:53.509: INFO: wait for plugin registration
    xxx
    Expected
        <bool>: false
    to equal
        <bool>: true
    STEP still running: wait for plugin registration 08/24/22 14:18:03.489
    STEP still running: wait for plugin registration 08/24/22 14:18:13.489
    STEP still running: wait for plugin registration 08/24/22 14:18:23.489
    Aug 24 14:18:23.489: FAIL: Timed out after 60.000s.
    wait for plugin registration
    Expected
        <string>: xxx
        Expected
            <bool>: false
        to equal
            <bool>: true
    to be empty

    Full Stack Trace
    k8s.io/kubernetes/test/e2e/dra.(*Driver).SetUp.func7()
            /nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/deploy.go:195 +0xda
    k8s.io/kubernetes/test/e2e/framework.By({0x702cfda, 0x1c}, 0xc0029d3df8, {0x0, 0x0, 0x6fbc0cb?})
            /nvme/gopath/src/k8s.io/kubernetes/test/e2e/framework/log.go:115 +0x5ea
    k8s.io/kubernetes/test/e2e/dra.(*Driver).SetUp(0xc0006af700)
            /nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/deploy.go:194 +0xa67
    k8s.io/kubernetes/test/e2e/dra.NewDriver.func1()
            /nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/deploy.go:67 +0x26
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 24, 2022
@k8s-ci-robot
Copy link
Contributor

@pohly: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 24, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 24, 2022
@pohly
Copy link
Contributor Author

pohly commented Aug 24, 2022

Some tests are needed if this is to be merged.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 24, 2022
@k8s-ci-robot k8s-ci-robot added area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 24, 2022
@pohly
Copy link
Contributor Author

pohly commented Aug 24, 2022

/retest

//
// To get timing information for a long running SucceedEventually, wrap it in
// By.
func SucceedEventually(callback func(g gomega.Gomega), extra []interface{}, intervals ...interface{}) {
Copy link
Contributor Author

@pohly pohly Aug 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to point out one potential pitfall (will also add to the comment): the callback must use g.Expect instead of ginkgo.Expect or our framework helper functions like framework.ExpectNoError.

That's because the failure must be reported via the failure handler encapsulated inside the Gomega instance, not the global failure handler.

The alternative is to support a func() error callback instead or in addition to func(g gomega.Gomega). If we want to support both, callback would have to be of type interface{} with runtime type assertion. Not sure whether generics would help.

@pohly
Copy link
Contributor Author

pohly commented Aug 25, 2022

/cc @aojea

It looks like @onsi is wary of supporting something like this in Gomega (onsi/gomega#574 (comment)) and I tend to agree, so I'd like to get some opinions about this PR and move it forward.

@onsi
Copy link
Contributor

onsi commented Aug 25, 2022

I don't have a lot of day-to-day experience with the e2e suite so y'all should take this input with a big grain of salt.

It seems like the main problem this is trying to solve is around the user experience when manually (i.e. on the console locally? in a specialized CI environment?) running a small subset of tests. Namely: can we get the user faster feedback about where the test is and what might be in the process of failing and why. The proposed solution is to emit information periodically whenever the spec is waiting/polling.

One potential downside to this is that it would make CI much more verbose - but it looks like the answer to that is to make the status update interval configurable. That's fair...

With that said, though, the I'm not sure the addition to have By emit periodically adds a lot of value. I would expect that a developer tailing the log could simply follow along in the cod and see where things are currently.

The additions to Eventually are more interesting. One challenge is that some error messages can be quite long and if every Eventually is emitting periodically the log can become overwhelming/hard to follow. The other challenge is that, as implemented, it seems most users will have to adopt SucceedEventually and use a func(g gomega.Gomega) callback in situations, prior to this, simpler and clearer Eventually(func() <return>).Should(...) assertions would have sufficed.

Personally, when I'm debugging specs that can get stuck locally I try to lower the timeouts involved to get faster feedback. I'd be happy to add support to Gomega to make this dead simple to use. For example GOMEGA_INTERVAL_SCALE=0.1 ginkgo could scale all timeouts down to a factor of 10%. This would allow committed code to be optimized for worst-case scenario CI environments, but make it possible to have a faster developer experience where (presumably) infrastructure is less shared and therefore (perhaps?) more reliable.


If you'd nonetheless like to pursue the approach in this PR we could explore ways to make Eventually a bit more generically extensible. For example I think it could make sense to add something like .Notify() to Eventually that takes a function that will be called while Eventually runs with status updates (namely: how many times things have run, the failure message, etc.). You could then provide a wrapper that calls Notify on the users behalf (or, if you'd like, override the default Gomega Eventually - though I'd need to think through how best to propose doing that).

@pohly
Copy link
Contributor Author

pohly commented Aug 25, 2022

I would expect that a developer tailing the log could simply follow along in the cod and see where things are currently.

That only work if code is instrumented consistently, i.e. starts with a By. It can happen that I am looking at a

STEP: doing something

but in reality that operation is already completed and the test is stuck on "doing something else".

@pohly
Copy link
Contributor Author

pohly commented Aug 25, 2022

Personally, when I'm debugging specs that can get stuck locally I try to lower the timeouts involved to get faster feedback. I'd be happy to add support to Gomega to make this dead simple to use.

That won't help in Kubernetes because most polling is not done with Gomega.

@onsi
Copy link
Contributor

onsi commented Aug 25, 2022

got it, makes sense.

i wonder if an additional helpful thing could be for ginkgo to support receiving a signal and telling you where the current test is currently waiting. could even imagine adding a flag where it will emit that information periodically if a test has been running for >N seconds

@pohly
Copy link
Contributor Author

pohly commented Aug 26, 2022

i wonder if an additional helpful thing could be for ginkgo to support receiving a signal and telling you where the current test is currently waiting.

As noted in onsi/ginkgo#1023 (comment), we need to keep printing BeforeEach/AfterEach in Kubernetes because we cannot assume that each node starts with a By or some other log output. Even with that, improving By as proposed here still makes sense for a long-running node because ginkgo cannot know where in the node the process currently is: it knows nothing about ginkgo.By, nor about gomega.Eventually.

A signal that triggers "current running BeforeEach/AfterEach/It XYZ" may be useful, but IMHO not as useful as a the extended By from this PR.

If we were to add that By enhancement to ginkgo, perhaps the extra output could be limited to runs where -progress is enabled? Then it won't affect most existing users, but work nicely for us in Kubernetes.

@pohly
Copy link
Contributor Author

pohly commented Aug 30, 2022

I discussed this some more with Onsi and the conclusion is:

  • By and ginkgo will get enhanced so that its more obvious where a test is stuck.
  • Timeout handling through a context will make it simpler to use gomega.Eventually.

Both means that this PR is not going to be needed.

/close

@k8s-ci-robot
Copy link
Contributor

@pohly: Closed this PR.

In response to this:

I discussed this some more with Onsi and the conclusion is:

  • By and ginkgo will get enhanced so that its more obvious where a test is stuck.
  • Timeout handling through a context will make it simpler to use gomega.Eventually.

Both means that this PR is not going to be needed.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants