Automated status reporting via typed errors #8709

mwear · 2023-10-19T00:44:49Z

Description:
This PR is an alternative to #8684. It's currently a rough draft to get feedback on whether or not this is a good direction to pursue.

This PR adds default status reporting by wrapping the Start, Stop, and Consume* methods of connectors, exporters, processors, and receivers. The wrappers intercept the errors returned from the underlying methods and report status based on the error type. To give users more control over automated status reporting, it introduces componenterror types that match the error status types, however users should rarely need to use them. By default, if an the returned error is nil, StatusOK will be reported. For Start and Stop if the error is not a componenterror, it will be assumed to be permanent, and StatusPermanentError will be reported. For Consume* if an error is not a component error, it will be assumed to be recoverable, and StatusRecoverableError will be reported. The componenterror types will only be needed to override the default assumptions.

This PR also adds support for automatically reporting StatusOK, on Start. Previously this was left to the user due to complications with Start methods that report status asynchronously (e.g. from a go routine). We can report status for components with synchronous Start methods, but we handle the async case, likely by allowing them to opt-out of automatic status reporting. The opt-out mechanism is not yet been implemented. One other change around Start is that a component can return an error of type componenterror.Recoverable that will signal a problem starting, but won't halt the collector from starting (which is what it currently does if any error is returned).

The list below outlines how automated status reporting will work for each of the wrapped methods.

Consume
- If err == nil report StatusOK
- If err matches a componenterror report the corresponding status for the error type
- If err isn't a componenterror report StatusRecoverableError
Start
- If err == nil report StatusOK
- If err is componenterror.RecoverableError, continue starting other components and report StatusRecoverableError
- If err is componenterror.Permament or componenterror.Fatal report StatusPermanentError or StatusFatalError respectively
- If err is not a componenterror, report StatusPermanentError (this will preserve the current behavior and halt startup)
Stop
- If err matches a componenterror report the corresponding status
- If err is not a componenterror, report StatusPermanentError

From within one of these well defined methods, returning different componenterror types can control the automated status reporting, however, this should rarely be needed. From outside these methods, the TelemetrySettings.ReportComponentStatus API can be used directly (which will be the majority use case).

Edit:
I would prefer not to introduce the componenterror types as they complicate status reporting to some degree, by giving users two ways to report status. It would be ideal only use the API. If we wanted to assume that errors return from Consume* are recoverable, and permanent from Start and Shutdown then we can avoid introducing the componenterror types. Components can use the API to mark errors a permanent or fatal for Consume*, but they would not be able to make an error as recoverable from Start. While being able to have a recoverable error from start is a nice to have, it's not a capability that we have to day.

My concern about Start with components that have async start methods may not really play out in the real world. Here is an example of an async start from the zipkin receiver:

func (zr *zipkinReceiver) Start(_ context.Context, host component.Host) error {
        if host == nil {
                return errors.New("nil host")
        }


        var err error
        zr.server, err = zr.config.HTTPServerSettings.ToServer(host, zr.settings.TelemetrySettings, zr)
        if err != nil {
                return err
        }


        var listener net.Listener
        listener, err = zr.config.HTTPServerSettings.ToListener()
        if err != nil {
                return err
        }
        zr.shutdownWG.Add(1)
        go func() {
                defer zr.shutdownWG.Done()


                if errHTTP := zr.server.Serve(listener); !errors.Is(errHTTP, http.ErrServerClosed) && errHTTP != nil {
                        //host.ReportFatalError(errHTTP)
                       zr.settings.TelemetrySettings.ReportComponentStatus(component.NewFatalErrorEvent(errHTTP))
                }
        }()


        return nil
}

In this example returning nil will cause automatic status reporting to report StatusOK. If the server fails to start it will report StatusFatalError using the API. The order of these calls will be indeterminate, however, the end result will end up being the same. It could happen that StatusOK is reported first, followed by StatusFatalError, or it could happen that StatusFatalError is reported first, which would make reporting StatusOK effectively a noop (because you can not transition state from StatusFatalError to StatusOK). In both cases the end status of the component would be StatusFatalError. The same would be true if the component reported StatusPermanentError.

If, hypothetically, a component wanted to report StatusRecoverableError from within a go routine on Start there would be a problem, as the recoverable error could be cleared by reporting StatusOK. For this reason, I think we probably need an escape hatch of sorts, but I don't know that this problem actually exists in any of our code today.

Link to tracking Issue: #7682

Testing: < Describe what testing was performed and which tests were added.>

Documentation: < Describe the documentation added.>

mwear · 2023-10-25T18:21:44Z

I'm closing this as it appears to be a bad idea based on our discussion at the collector SIG earlier. Thanks to all who were involved in the discussion, I found it very helpful. For those not at the SIG, the tl;dr is that we cannot assume that errors from Consume* relate to error statuses for components and Start is riddled with edge cases. While I was hoping to find more ways to automate status reporting, it seems like there is not any further reporting we can automate at this level. We may find that there is shareable behavior for classes of components, but for now each component will be responsible for reporting its status from the period after starting until stopping.

Add componenterror types that match error statuses

c0a39e7

mwear changed the title ~~Automated status seporting via typed errors~~ Automated status reporting via typed errors Oct 19, 2023

mwear mentioned this pull request Oct 19, 2023

Automatic status reporting for exporterhelper and core exporters #8684

Closed

mwear force-pushed the automated-status-reporting branch 2 times, most recently from 1946829 to 6e68b99 Compare October 24, 2023 01:32

WIP: Wrap everything

068efc0

mwear force-pushed the automated-status-reporting branch from 6e68b99 to 068efc0 Compare October 24, 2023 01:35

mwear closed this Oct 25, 2023

mwear mentioned this pull request Apr 12, 2024

[component] Runtime status reporting #9957

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automated status reporting via typed errors #8709

Automated status reporting via typed errors #8709

mwear commented Oct 19, 2023 •

edited

mwear commented Oct 25, 2023 •

edited

Automated status reporting via typed errors #8709

Automated status reporting via typed errors #8709

Conversation

mwear commented Oct 19, 2023 • edited

mwear commented Oct 25, 2023 • edited

mwear commented Oct 19, 2023 •

edited

mwear commented Oct 25, 2023 •

edited