Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automated status reporting via typed errors #8709

Closed
wants to merge 2 commits into from

Conversation

mwear
Copy link
Member

@mwear mwear commented Oct 19, 2023

Description:
This PR is an alternative to #8684. It's currently a rough draft to get feedback on whether or not this is a good direction to pursue.

This PR adds default status reporting by wrapping the Start, Stop, and Consume* methods of connectors, exporters, processors, and receivers. The wrappers intercept the errors returned from the underlying methods and report status based on the error type. To give users more control over automated status reporting, it introduces componenterror types that match the error status types, however users should rarely need to use them. By default, if an the returned error is nil, StatusOK will be reported. For Start and Stop if the error is not a componenterror, it will be assumed to be permanent, and StatusPermanentError will be reported. For Consume* if an error is not a component error, it will be assumed to be recoverable, and StatusRecoverableError will be reported. The componenterror types will only be needed to override the default assumptions.

This PR also adds support for automatically reporting StatusOK, on Start. Previously this was left to the user due to complications with Start methods that report status asynchronously (e.g. from a go routine). We can report status for components with synchronous Start methods, but we handle the async case, likely by allowing them to opt-out of automatic status reporting. The opt-out mechanism is not yet been implemented. One other change around Start is that a component can return an error of type componenterror.Recoverable that will signal a problem starting, but won't halt the collector from starting (which is what it currently does if any error is returned).

The list below outlines how automated status reporting will work for each of the wrapped methods.

  • Consume
    • If err == nil report StatusOK
    • If err matches a componenterror report the corresponding status for the error type
    • If err isn't a componenterror report StatusRecoverableError
  • Start
    • If err == nil report StatusOK
    • If err is componenterror.RecoverableError, continue starting other components and report StatusRecoverableError
    • If err is componenterror.Permament or componenterror.Fatal report StatusPermanentError or StatusFatalError respectively
    • If err is not a componenterror, report StatusPermanentError (this will preserve the current behavior and halt startup)
  • Stop
    • If err matches a componenterror report the corresponding status
    • If err is not a componenterror, report StatusPermanentError

From within one of these well defined methods, returning different componenterror types can control the automated status reporting, however, this should rarely be needed. From outside these methods, the TelemetrySettings.ReportComponentStatus API can be used directly (which will be the majority use case).

Edit:
I would prefer not to introduce the componenterror types as they complicate status reporting to some degree, by giving users two ways to report status. It would be ideal only use the API. If we wanted to assume that errors return from Consume* are recoverable, and permanent from Start and Shutdown then we can avoid introducing the componenterror types. Components can use the API to mark errors a permanent or fatal for Consume*, but they would not be able to make an error as recoverable from Start. While being able to have a recoverable error from start is a nice to have, it's not a capability that we have to day.

My concern about Start with components that have async start methods may not really play out in the real world. Here is an example of an async start from the zipkin receiver:

func (zr *zipkinReceiver) Start(_ context.Context, host component.Host) error {
        if host == nil {
                return errors.New("nil host")
        }


        var err error
        zr.server, err = zr.config.HTTPServerSettings.ToServer(host, zr.settings.TelemetrySettings, zr)
        if err != nil {
                return err
        }


        var listener net.Listener
        listener, err = zr.config.HTTPServerSettings.ToListener()
        if err != nil {
                return err
        }
        zr.shutdownWG.Add(1)
        go func() {
                defer zr.shutdownWG.Done()


                if errHTTP := zr.server.Serve(listener); !errors.Is(errHTTP, http.ErrServerClosed) && errHTTP != nil {
                        //host.ReportFatalError(errHTTP)
                       zr.settings.TelemetrySettings.ReportComponentStatus(component.NewFatalErrorEvent(errHTTP))
                }
        }()


        return nil
}

In this example returning nil will cause automatic status reporting to report StatusOK. If the server fails to start it will report StatusFatalError using the API. The order of these calls will be indeterminate, however, the end result will end up being the same. It could happen that StatusOK is reported first, followed by StatusFatalError, or it could happen that StatusFatalError is reported first, which would make reporting StatusOK effectively a noop (because you can not transition state from StatusFatalError to StatusOK). In both cases the end status of the component would be StatusFatalError. The same would be true if the component reported StatusPermanentError.

If, hypothetically, a component wanted to report StatusRecoverableError from within a go routine on Start there would be a problem, as the recoverable error could be cleared by reporting StatusOK. For this reason, I think we probably need an escape hatch of sorts, but I don't know that this problem actually exists in any of our code today.

Link to tracking Issue: #7682

Testing: < Describe what testing was performed and which tests were added.>

Documentation: < Describe the documentation added.>

@mwear mwear changed the title Automated status seporting via typed errors Automated status reporting via typed errors Oct 19, 2023
@mwear mwear force-pushed the automated-status-reporting branch 2 times, most recently from 1946829 to 6e68b99 Compare October 24, 2023 01:32
@mwear
Copy link
Member Author

mwear commented Oct 25, 2023

I'm closing this as it appears to be a bad idea based on our discussion at the collector SIG earlier. Thanks to all who were involved in the discussion, I found it very helpful. For those not at the SIG, the tl;dr is that we cannot assume that errors from Consume* relate to error statuses for components and Start is riddled with edge cases. While I was hoping to find more ways to automate status reporting, it seems like there is not any further reporting we can automate at this level. We may find that there is shareable behavior for classes of components, but for now each component will be responsible for reporting its status from the period after starting until stopping.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant