Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check if time is nil before logging #121554

Merged
merged 1 commit into from
Feb 13, 2024

Conversation

kaisoz
Copy link
Contributor

@kaisoz kaisoz commented Oct 26, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

The Job controller logs a panic when logging a nil pod.deletionTimestamp. This PR fixes this by logging it only if it's not nil.

Which issue(s) this PR fixes:

Fixes #121392

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 26, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 26, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @kaisoz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Oct 26, 2023
@k8s-ci-robot k8s-ci-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Oct 26, 2023
@kaisoz
Copy link
Contributor Author

kaisoz commented Oct 26, 2023

/assign @mimowo @pohly

@@ -180,3 +180,12 @@ func (t Time) MarshalQueryParameter() (string, error) {

return t.UTC().Format(time.RFC3339), nil
}

// String returns the string representation of the wrapped time.Time
func (t *Time) String() string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was previously reported in #105981 but closed because of #105981 (comment)

almost identical fixes were previously proposed in #108436 and #105982 and declined because of #105982 (comment) and #105981 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this can't be fixed, but people will keep trying...

@kaisoz: can you change this PR so that instead of adding a String method, a comment gets added why there is no String implementation ("historic reasons") and why it cannot be added now (see #105981 (comment))?

@liggitt
Copy link
Member

liggitt commented Oct 26, 2023

shouldn't we fix the job controller by not logging the timestamp if it is nil?

@pohly
Copy link
Contributor

pohly commented Oct 27, 2023

shouldn't we fix the job controller by not logging the timestamp if it is nil?

That then would have to be done in all places where a DeletionTimestamp gets logged, causing more work for developers. Let's instead make the panic handling in klog better so that if a fmt.Stringer fails, it checks the value for nil and prints nil (or whatever fmt.Sprintf would do).

@mimowo
Copy link
Contributor

mimowo commented Oct 27, 2023

shouldn't we fix the job controller by not logging the timestamp if it is nil?

That then would have to be done in all places where a DeletionTimestamp gets logged, causing more work for developers. Let's instead make the panic handling in klog better so that if a fmt.Stringer fails, it checks the value for nil and prints nil (or whatever fmt.Sprintf would do).

The fix in FilterActivePods is very simple, we could do it if we think it is important for 1.29 code freeze and the other approach may take longer (I'm not sure). Question to the klog approach: would the panic handling in klog introduce noticeable performance cost for the function?

@pohly
Copy link
Contributor

pohly commented Oct 27, 2023

if we think it is important for 1.29 code freeze

I don't think it is.

Question to the klog approach: would the panic handling in klog introduce noticeable performance cost for the function?

If it was done in advance before all fmt.Stringer calls, then maybe. But I am suggesting to only do extra work if that implementation panicked. That should be so rare (we only know of this one instance) that it shouldn't matter.

@mimowo
Copy link
Contributor

mimowo commented Oct 27, 2023

But I am suggesting to only do extra work if that implementation panicked. That should be so rare (we only know of this one instance) that it shouldn't matter.

Makes sense. Still, the function FilterActivePods is now called on every sync by the job controller, and it panics for every non-active, not deleted pod, so terminated on its own as Succeeded, or Failed. The number of completed non-active pods in the system might be up to 12500 (default for terminated-pod-gc-threshold).

So, I'm wondering if it's preferable to prevent passing nil in to avoid potential performance increase due to handling the panics.

@pohly
Copy link
Contributor

pohly commented Oct 27, 2023

This only occurs at -v4 or higher, which is "Debug level verbosity". Does it matter at that level?

@pohly
Copy link
Contributor

pohly commented Oct 27, 2023

As simple fix at the call site might be:

// Only log p.DeletionTimestamp if non-nil, otherwise it's String implementation panics.
var deletionTime interface{}
if p.DeletionTimestamp != nil {
   deletionTime = p.DeletionTimestamp
}
logger.V(4).Info("Ignoring inactive pod", "pod", klog.KObj(p), "phase", p.Status.Phase, "deletionTime", deletionTime)

@kaisoz
Copy link
Contributor Author

kaisoz commented Oct 27, 2023

As simple fix at the call site might be:

// Only log p.DeletionTimestamp if non-nil, otherwise it's String implementation panics.
var deletionTime interface{}
if p.DeletionTimestamp != nil {
   deletionTime = p.DeletionTimestamp
}
logger.V(4).Info("Ignoring inactive pod", "pod", klog.KObj(p), "phase", p.Status.Phase, "deletionTime", deletionTime)

I just read the thread @pohly . So should I go ahead with your code suggestion or the comment you propose in here?

IMO I'd go with the nil check since this panic in the logs might mislead the developer while debugging

@mimowo
Copy link
Contributor

mimowo commented Oct 27, 2023

This only occurs at -v4 or higher, which is "Debug level verbosity". Does it matter at that level?

Yeah, I missed that it is -v4. So, I'm happy with any of the fixes.

@alculquicondor
Copy link
Member

As simple fix at the call site might be:

// Only log p.DeletionTimestamp if non-nil, otherwise it's String implementation panics.
var deletionTime interface{}
if p.DeletionTimestamp != nil {
   deletionTime = p.DeletionTimestamp
}
logger.V(4).Info("Ignoring inactive pod", "pod", klog.KObj(p), "phase", p.Status.Phase, "deletionTime", deletionTime)

Let's go with this

@alculquicondor
Copy link
Member

@pohly, could we add a method to the klog package that wraps that?
Something like klog.SafeTimePtr(p.DeletionTimestamp)

@kaisoz
Copy link
Contributor Author

kaisoz commented Oct 27, 2023

@pohly, could we add a method to the klog package that wraps that? Something like klog.SafeTimePtr(p.DeletionTimestamp)

would that go to a different PR @alculquicondor or this one?

@pohly
Copy link
Contributor

pohly commented Oct 27, 2023

@pohly, could we add a method to the klog package that wraps that? Something like klog.SafeTimePtr(p.DeletionTimestamp)

I was wondering about that myself. But it needs to be a bit more generic (sic!):

func klog.SafePtr[T any](p *T) any {
    if p == nil {
        return nil
    }
    return p
}

@kaisoz: klog changes need to go into https://github.com/kubernetes/klog/pulls

@kaisoz
Copy link
Contributor Author

kaisoz commented Oct 27, 2023

@pohly, could we add a method to the klog package that wraps that? Something like klog.SafeTimePtr(p.DeletionTimestamp)

I was wondering about that myself. But it needs to be a bit more generic (sic!):

func klog.SafePtr[T any](p *T) any {
    if p == nil {
        return nil
    }
    return p
}

@kaisoz: klog changes need to go into kubernetes/klog/pulls

ahh obviously 😅 so then I'll implement the change @alculquicondor suggested 👍🏻

@pacoxu
Copy link
Member

pacoxu commented Jan 12, 2024

I bumped klog #122706 and you can use the new func to handle this logging issue now. @kaisoz do you have time to rebase with latest master branch and update the change?

@kaisoz
Copy link
Contributor Author

kaisoz commented Jan 12, 2024

I bumped klog #122706 and you can use the new func to handle this logging issue now. @kaisoz do you have time to rebase with latest master branch and update the change?

Sure! I'll take care of it today @pacoxu

@pohly
Copy link
Contributor

pohly commented Jan 12, 2024

The linter error is a false positive. The SafePtr helper must be added to https://github.com/kubernetes-sigs/logtools/blob/4feaefaec55f458327a13320adb5ac98024a41ce/logcheck/pkg/logcheck.go#L404-L429 and then we need to update logcheck.

@kaisoz: can you submit a PR against logcheck? I can accept it and do a release.

@kaisoz
Copy link
Contributor Author

kaisoz commented Jan 12, 2024

The linter error is a false positive. The SafePtr helper must be added to kubernetes-sigs/logtools@4feaefa/logcheck/pkg/logcheck.go#L404-L429 and then we need to update logcheck.

@kaisoz: can you submit a PR against logcheck? I can accept it and do a release.

Ready @pohly kubernetes-sigs/logtools#27

@alculquicondor
Copy link
Member

/retest

@pohly
Copy link
Contributor

pohly commented Jan 12, 2024

#122733 was merged, now the tests should pass.

/retest

@pacoxu
Copy link
Member

pacoxu commented Jan 12, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 12, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: ca36ee06facf2e3a2754f9aac6b8571d7663bee0

@kaisoz
Copy link
Contributor Author

kaisoz commented Jan 12, 2024

Tests now pass @pohly , I had to rebase master. Now it is ready to be approved 👍

Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@pohly
Copy link
Contributor

pohly commented Jan 15, 2024

/assign @mikedanese

@mimowo
Copy link
Contributor

mimowo commented Feb 13, 2024

/assign @mikedanese

@mikedanese any progress on this? Should we re-assign?

@pohly
Copy link
Contributor

pohly commented Feb 13, 2024

Let's see whether @andrewsykim has time. It's a small change, just needs someone to approve.

/assign @andrewsykim

@liggitt
Copy link
Member

liggitt commented Feb 13, 2024

/lgtm
/approve
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 13, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kaisoz, liggitt, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 13, 2024
@k8s-triage-robot
Copy link

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@k8s-ci-robot k8s-ci-robot merged commit bd0e1e2 into kubernetes:master Feb 13, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Feb 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Job controller keeps logging panics
10 participants