
🌱 Add back error logging for Reconcile implementation #1245

Merged
merged 1 commit on Nov 9, 2020

Conversation

enxebre
Member

@enxebre enxebre commented Nov 4, 2020

This change https://github.com/kubernetes-sigs/controller-runtime/pull/1054/files#diff-3f86588ef108e53ef346d3223ca106ccaa8efe72cb1055489665d7675f3da7b4R242 degraded the log level for errors returned by the Reconcile implementation.

Then this one completely removed the error logging: https://github.com/kubernetes-sigs/controller-runtime/pull/1096/files#diff-3f86588ef108e53ef346d3223ca106ccaa8efe72cb1055489665d7675f3da7b4L242

When developing new controllers, it feels counterintuitive that controller-runtime swallows the error returned by the Reconcile implementation.

This PR proposes to log the error and let developers add further logging in their implementation if they choose to.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 4, 2020
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Nov 4, 2020
@enxebre enxebre changed the title Add back error logging for Reconcile implementation 🌱 Add back error logging for Reconcile implementation Nov 4, 2020
@ironcladlou
Contributor

I don't understand the rationale behind https://github.com/kubernetes-sigs/controller-runtime/pull/1096/files#diff-3f86588ef108e53ef346d3223ca106ccaa8efe72cb1055489665d7675f3da7b4L242-L244. If the library is going to silently eat all errors, why does the Reconcile() interface method signature even return an error at all? Seems like an unintentional regression. Is that the case?

@alvaroaleman
Member

If the library is going to silently eat all errors, why does the Reconcile() interface method signature even return an error at all? Seems like an unintentional regression. Is that the case?

The main purpose of the returned error is to cause a retry after an exponential backoff:

if result, err := c.Do.Reconcile(ctx, req); err != nil {
	c.Queue.AddRateLimited(req)

@ironcladlou
Contributor

ironcladlou commented Nov 4, 2020

@alvaroaleman

The main purpose of the returned error is to cause a retry after an exponential backoff:

From a library consumer's point of view, I would suggest it's completely surprising that if I return an error in this context it will be used for flow control and not actually recorded anywhere.

For example, because I (reasonably?) assume the error is recorded by the library, I wrap and propagate the error without recording it myself.

Generally, using errors for flow control seems suspicious, especially when the semantics of the concrete error are actually unknown to the caller (i.e. controller-runtime). The library has no idea what an opaque error means. The user already has an explicit way to drive flow control with regards to retries through the Result struct. If the user wanted a retry, they would have said so on the Result. Error handling behavior seems orthogonal.

Some options:

  • Adding big warning signs to the docs for Reconcile to indicate errors imply a retry and are discarded without recording
  • Reinstate error logging from the calling side
  • Introduce configuration for how the library should handle errors in this context (e.g. discard, record, map to a retry result)
  • Change the Reconcile signature so it doesn't return an error at all

@ironcladlou
Contributor

The docs do already mention the retry semantics in the face of an error, but don't mention discarding the error:

type Reconciler interface {
	// Reconciler performs a full reconciliation for the object referred to by the Request.
	// The Controller will requeue the Request to be processed again if an error is non-nil or
	// Result.Requeue is true, otherwise upon completion it will remove the work from the queue.
	Reconcile(context.Context, Request) (Result, error)
}

@alvaroaleman
Member

Generally, using errors for flow control seems suspicious, especially when the semantics of the concrete error are actually unknown to the caller (i.e. controller-runtime). The library has no idea what an opaque error means. The user already has an explicit way to drive flow control with regards to retries through the Result struct. If the user wanted a retry, they would have said so on the Result. Error handling behavior seems orthogonal

That is how all Kubernetes controllers work, it would be the same if you built your controller directly with client-go and generated clients+informers. Controller-Runtime is just an abstraction that aims at making building them simpler, it doesn't aim at changing anything about the concepts.

The user already has an explicit way to drive flow control with regards to retries through the Result struct.

Yes, that works if you know you have to wait a given interval or re-attempt your reconciliation. It also doesn't increase the error counters:

ctrlmetrics.ReconcileErrors.WithLabelValues(c.Name).Inc()
ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "error").Inc()
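For readers unfamiliar with the metric lines quoted above: `WithLabelValues(...).Inc()` bumps a counter keyed by its label values. A toy stand-in for that labeling behavior (real code would use `prometheus.NewCounterVec`; this is not Prometheus itself):

```go
package main

import "fmt"

// counterVec is a toy stand-in for a Prometheus CounterVec: Inc bumps
// a counter identified by its label values. Illustrative only.
type counterVec struct{ counts map[string]float64 }

func newCounterVec() *counterVec {
	return &counterVec{counts: map[string]float64{}}
}

func (c *counterVec) Inc(labels ...string) {
	c.counts[fmt.Sprint(labels)]++
}

func (c *counterVec) Get(labels ...string) float64 {
	return c.counts[fmt.Sprint(labels)]
}

func main() {
	reconcileTotal := newCounterVec()
	// On an error return, the controller records the per-controller,
	// per-result counter, as in the lines quoted above.
	reconcileTotal.Inc("my-controller", "error")
	reconcileTotal.Inc("my-controller", "error")
	fmt.Println(reconcileTotal.Get("my-controller", "error")) // 2
}
```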

Some options:

Only the second suggestion is an option, everything else would mean we deviate from how controllers in Kubernetes work.

The docs do already mention the retry semantics in the face of an error, but don't mention discarding the error:

Using an error to trigger a delayed retry and record it in a metric is not what I would consider "discarding" it.

@ironcladlou
Contributor

Generally, using errors for flow control seems suspicious, especially when the semantics of the concrete error are actually unknown to the caller (i.e. controller-runtime). The library has no idea what an opaque error means. The user already has an explicit way to drive flow control with regards to retries through the Result struct. If the user wanted a retry, they would have said so on the Result. Error handling behavior seems orthogonal

That is how all Kubernetes controllers work, it would be the same if you built your controller directly with client-go and generated clients+informers. Controller-Runtime is just an abstraction that aims at making building them simpler, it doesn't aim at changing anything about the concepts.

Fair enough.

The user already has an explicit way to drive flow control with regards to retries through the Result struct.

Yes, that works if you know you have to wait a given interval or re-attempt your reconciliation. It also doesn't increase the error counters:

ctrlmetrics.ReconcileErrors.WithLabelValues(c.Name).Inc()
ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "error").Inc()

Some options:

Only the second suggestion is an option, everything else would mean we deviate from how controllers in Kubernetes work.

The docs do already mention the retry semantics in the face of an error, but don't mention discarding the error:

Using an error to trigger a delayed retry and record it in a metric is not what I would consider "discarding" it.

If the actual error and even its message is lost, I would say the most important details have in fact been discarded. If my controller is in a failure loop due to my code propagating some error, the fact I have an opaque error counter doesn't really seem useful. Knowing errors are happening but not what the errors are doesn't seem valuable.

All things considered, recording the actual error (or possibly better, giving the user the means to trap and handle the error semantically) seems significant. But it does seem like we agree that logging (at least, in some form) is desirable?

@ncdc
Contributor

ncdc commented Nov 5, 2020

👋 I originally advocated for removing the logging of the reconciler error because it does not include the source (file and line number) where the error was created. It is frustrating, at least for me, to see a log message saying something generic ("error doing xyz") where it is impossible to determine the provenance.

Having said that, I can definitely see how having something logged is better than nothing, so I don't have any issue restoring this.

I also did some work in Velero a while back to try to extract the file and line information from an error's stack (if you're using something like github.com/pkg/errors) and place that information as values in the log message. It's implementation-specific (Logrus), but it would be nice to investigate if something like this would be possible in the future in a way that worked with our logging interface. See https://github.com/vmware-tanzu/velero/blob/68a4b23722fa66c9fd4147898028d3e16c86a881/pkg/util/logging/error_location_hook.go for the code.

@ironcladlou
Contributor

👋 @ncdc Hey! Long time no see! 😄

Go has provided error wrapping since 1.13. Does that help address the provenance issue assuming users are encouraged to implement stdlib error wrapping? It doesn't get you a stack trace, but it's certainly a big step forward.

@ncdc
Contributor

ncdc commented Nov 5, 2020

@ironcladlou good to see you! Hope all is well. Sadly I don't think the error wrapping in 1.13 helps, given the lack of the file and line information. Unless I'm missing something?

@ironcladlou
Contributor

@ncdc

Sadly I don't think the error wrapping in 1.13 helps, given the lack of the file and line information. Unless I'm missing something?

Yeah, the stdlib wrapping does nothing to address file/line info like you'd want from a typical stack trace. However, it does maintain the chain of errors, which is closer to useful than prior to 1.13 (assuming well-behaved code actually wrapping the errors at each level of the stack).

The implementation-specific hacks sound useful to me, but I haven't really spent enough time thinking about it. I'm not sure it's in the scope of what's being asked in this PR.

As it stands, the burden falls to the library consumer to handle logging any error before propagating it out of Reconcile, which I consider a debatable design choice and maybe worth a note in the docs, but not a show-stopping problem. Maybe this is worth taking up in an issue instead?

@vincepri
Member

vincepri commented Nov 5, 2020

/milestone v0.7.x

@k8s-ci-robot k8s-ci-robot added this to the v0.7.x milestone Nov 5, 2020
@alvaroaleman
Member

I am happy to re-introduce this and it doesn't sound like there is any objection.
/approve
@vincepri any concerns with merging this?

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman, enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 6, 2020
@@ -264,6 +264,7 @@ func (c *Controller) reconcileHandler(ctx context.Context, obj interface{}) {
c.Queue.AddRateLimited(req)
ctrlmetrics.ReconcileErrors.WithLabelValues(c.Name).Inc()
ctrlmetrics.ReconcileTotal.WithLabelValues(c.Name, "error").Inc()
c.Log.Error(err, "Reconciler error")
Member

Suggested change
c.Log.Error(err, "Reconciler error")
log.Error(err, "Reconciler error")

So we get name and namespace information as well

@vincepri
Member

vincepri commented Nov 9, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2020
@k8s-ci-robot k8s-ci-robot merged commit 2b423ec into kubernetes-sigs:master Nov 9, 2020