Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Job controller keeps logging panics #121392

Closed
mimowo opened this issue Oct 20, 2023 · 22 comments · Fixed by #121554
Closed

Job controller keeps logging panics #121392

mimowo opened this issue Oct 20, 2023 · 22 comments · Fixed by #121554
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@mimowo
Copy link
Contributor

mimowo commented Oct 20, 2023

What happened?

Job controller, and possibly other controllers keep logging panics from this line in FilterActivePods:

logger.V(4).Info("Ignoring inactive pod", "pod", klog.KObj(p), "phase", p.Status.Phase, "deletionTime", p.DeletionTimestamp)
.

This happens during e2e tests, and I think it happens on production as well.

Here is an example for the successful build:
https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/121302/pull-kubernetes-e2e-kind/1715287183352401920/artifacts/kind-control-plane/containers/kube-controller-manager-kind-control-plane_kube-system_kube-controller-manager-ec8b5cc2095bc6b1bdbfe61f132b3d493dea09ab0808935b59e10dcc5ffe1082.log
2023-10-20T09:10:28.016158573Z stderr F I1020 09:10:28.016037 1 controller_utils.go:955] "Ignoring inactive pod" pod="ttlafterfinished-3394/rand-non-local-ghcvx" phase="Failed" deletionTime="<panic: runtime error: invalid memory address or nil pointer dereference>"

What did you expect to happen?

No panics during e2e test from this line in FilterActivePods.

How can we reproduce it (as minimally and precisely as possible)?

Run e2e or integration tests for the job controller.

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

Reproducible on kind during e2e tests

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@mimowo mimowo added the kind/bug Categorizes issue or PR as related to a bug. label Oct 20, 2023
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 20, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mimowo
Copy link
Contributor Author

mimowo commented Oct 20, 2023

/sig apps
FYI @alculquicondor @pohly

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 20, 2023
@alculquicondor
Copy link
Member

Do you have any idea which variable is nil here?

@alculquicondor
Copy link
Member

oops, sorry, it's the deletionTimestamp

@alculquicondor
Copy link
Member

Is it actually a panic though? It seems that the logger is parsing a nil value as <panic: runtime error: invalid memory address or nil pointer dereference> and proceeding.

I think we could use ptr.Deref and provide a default value.

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Oct 20, 2023
@mimowo
Copy link
Contributor Author

mimowo commented Oct 20, 2023

Possibly this is just logger indeed. Still it is important imo because it makes looking for real panics in logs hard. This was my use case to understand another issue.

@pohly
Copy link
Contributor

pohly commented Oct 20, 2023

The logger sees an any (= interface{}) value which implements fmt.Stringer and calls it's String method - in this case, https://cs.opensource.google/go/go/+/refs/tags/go1.21.3:src/time/format.go;l=527. That then panics.

We've gone back and forth whether the logging call should try to call the String method and recover from a panic or try to do some advance checks whether the any value is perhaps nil. There could be String implementations for pointers which produce some nice output for nil, so not calling it would be a loss. Also, detecting nil values is surprisingly hard, if I remember correctly (the fmt.Stringer interface is non-nil, only the value hidden inside it is)), therefore currently the code simply calls it.

@kaisoz
Copy link
Contributor

kaisoz commented Oct 23, 2023

I can work of this 😊
/assign

@pohly
Copy link
Contributor

pohly commented Oct 23, 2023

@kaisoz: how do you want to solve this?

Besides some potential klog improvement, fixing https://pkg.go.dev/k8s.io/apimachinery/pkg/apis/meta/v1#Time to not crash for a nil pointer would be good. https://github.com/kubernetes/apimachinery/blob/v0.28.3/pkg/apis/meta/v1/generated.pb.go#L4762 already does this.

@kaisoz
Copy link
Contributor

kaisoz commented Oct 24, 2023

Hi @pohly !

Sorry I didn't answer yesterday. I like the idea of doing the check at type level and we already have a precedent.
Should we provide a default value as @alculquicondor suggests? Tbh I'm not sure what would be a a proper default value here

@pohly
Copy link
Contributor

pohly commented Oct 24, 2023

When logging an object's DeletionTimestamp, "is nil" is the right information to log when the pointer is nil. There is no default that can or should be used instead.

@kaisoz
Copy link
Contributor

kaisoz commented Oct 26, 2023

@mimowo @pohly I've raised a PR with a simple fix. I've run the integration tests locally and it fixes the problem. Let me know if that's enough 😊

Thanks!

@kaisoz
Copy link
Contributor

kaisoz commented Nov 20, 2023

@mimowo the changes in klog have been merged but not yet released (kubernetes/klog#393). @pohly what's the klog release cycle? It seems that there's no clear release schedule, right?

@pohly
Copy link
Contributor

pohly commented Nov 20, 2023

We usually do one release for each Kubernetes release, if there are changes.

@mimowo
Copy link
Contributor Author

mimowo commented Nov 20, 2023

We usually do one release for each Kubernetes release, if there are changes.

However, in this case the next release this could go into is 1.30, right? This would be unfortunate because we wouldn't be able to fix 1.29 with the klog changes. If this is the case, I think we may need to do the quick if in controller_utils?

@mimowo
Copy link
Contributor Author

mimowo commented Nov 20, 2023

cc @dims, is there a chance to do another release of klog, so that we can use it in a patch release of k/k 1.29?

@pohly
Copy link
Contributor

pohly commented Nov 20, 2023

For 1.29, https://github.com/kubernetes/kubernetes/pull/121554/files is a better fix than rushing a new klog release.

But is this critical enough to ask the release team for an exception?

@pohly
Copy link
Contributor

pohly commented Nov 20, 2023

It only affects developers who bump up the verbosity to >= 4.

@mimowo
Copy link
Contributor Author

mimowo commented Nov 20, 2023

For 1.29, https://github.com/kubernetes/kubernetes/pull/121554/files is a better fix than rushing a new klog release.

So we can keep 1.29 as is, or do the quick fix in controller_utils.

I would suggest to do something about it, because it makes our log artifacts not so useful to detect real panics - a simple grep does not work anymore. @alculquicondor do you think it makes sense to patch 1.29?

But is this critical enough to ask the release team for an exception?
It only affects developers who bump up the verbosity to >= 4.

Oh, yes, probably it is not critical enough, I keep forgetting about the >= 4 level.

@alculquicondor
Copy link
Member

It's definitely not critical. It's not even a real panic... it's just a log line that says "panic"

@kaisoz
Copy link
Contributor

kaisoz commented Nov 21, 2023

It's definitely not critical. It's not even a real panic... it's just a log line that says "panic"

@alculquicondor so better wait for the klog and then make the change? or add this change meanwhile?

@alculquicondor
Copy link
Member

I would wait, given that the cherry-pick is not justified

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/apps Categorizes an issue or PR as relevant to SIG Apps.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants