[component] Runtime status reporting #9957

mwear · 2024-04-12T17:11:57Z

Runtime Status Reporting

As part of the component 1.0 milestone, we should implement runtime status reporting for core components and come up with guidelines and best practices for incremental adoption by other components. This issue gives some background on component status reporting and outlines how it should work for different component types.

Component Status Events

Component status events fall into two categories: lifecycle events (blue in the diagram) and runtime events (green and red). The collector reports lifecycle events, e.g. StatusStarting and StatusStopping, automatically; components are responsible for reporting their own runtime status. During runtime it will be common for a component to transition between StatusOK and StatusRecoverableError. In some situations a component may detect an unrecoverable state and transition into StatusPermanentError. This is a final state that cannot be transitioned out of, and it indicates that a human will have to intervene to fix the problem.

The state transitions are governed by a finite state machine, and the intention is that components should not have to keep track of their internal state when reporting status. Components can report StatusOK when an operation succeeds and StatusRecoverableError when an operation fails (in a recoverable way). Status events are only emitted when a component's state changes, so repeated reports of the same status have no effect. Likewise, if a component has transitioned into a final state (e.g. StatusPermanentError), subsequent attempts to report status are no-ops.
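To make the deduplication and final-state rules concrete, here is a minimal sketch in Go. It is not the collector's implementation: the `Status` type, the `reporter` struct, and its `Report` method are hypothetical stand-ins, and only the runtime statuses named above (plus StatusStarting as an initial state) are modeled.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for the collector's status type.
type Status int

const (
	StatusStarting Status = iota
	StatusOK
	StatusRecoverableError
	StatusPermanentError
)

func (s Status) String() string {
	return [...]string{"Starting", "OK", "RecoverableError", "PermanentError"}[s]
}

// reporter is a hypothetical, simplified state machine. It remembers the
// current status so that callers do not have to.
type reporter struct {
	current Status
	emit    func(Status) // called only when the status actually changes
}

// Report records a status and returns true if an event was emitted.
func (r *reporter) Report(s Status) bool {
	if r.current == StatusPermanentError {
		return false // final state: later reports are no-ops
	}
	if s == r.current {
		return false // repeated report of the same status: no event
	}
	r.current = s
	r.emit(s)
	return true
}

func main() {
	r := &reporter{
		current: StatusStarting,
		emit:    func(s Status) { fmt.Println("event:", s) },
	}
	reports := []Status{
		StatusOK, StatusOK, // second OK is dropped
		StatusRecoverableError,
		StatusOK,
		StatusPermanentError,
		StatusOK, // dropped: PermanentError is final
	}
	for _, s := range reports {
		r.Report(s)
	}
}
```

With this shape, a component can call `Report(StatusOK)` after every successful operation without checking what it reported last.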

Consumers of Status Events

There is a PR for a new version of the health check extension that is based on component status reporting. It uses lifecycle events to determine whether the collector is ready and running, and it allows users to opt in to having recoverable or permanent errors factored into collector health. The OpAMP extension will make use of these events for component health at some point in the future. Any extension that implements the optional StatusWatcher interface can be a consumer of component status events.
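As an illustration of what a consumer might look like, the sketch below defines a simplified, hypothetical `StatusWatcher` interface and a `healthWatcher` that implements it. The collector's real interface uses its own instance ID and event types, so the signature here is an assumption made to keep the example self-contained. Treating StatusStarting as "not ready" is likewise an assumption, and the opt-in flags only mirror the behavior described for the health check extension.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for the collector's status type.
type Status int

const (
	StatusStarting Status = iota
	StatusOK
	StatusRecoverableError
	StatusPermanentError
)

// StatusWatcher is a simplified, hypothetical version of the optional
// interface an extension would implement to receive status events.
type StatusWatcher interface {
	ComponentStatusChanged(componentID string, status Status)
}

// healthWatcher keeps the latest status per component. Error statuses only
// affect health if the user opted in. A real watcher would also need to
// guard the map with a mutex; that is omitted here for brevity.
type healthWatcher struct {
	includeRecoverable bool
	includePermanent   bool
	latest             map[string]Status
}

// Compile-time check that healthWatcher satisfies the sketch interface.
var _ StatusWatcher = (*healthWatcher)(nil)

func newHealthWatcher(includeRecoverable, includePermanent bool) *healthWatcher {
	return &healthWatcher{
		includeRecoverable: includeRecoverable,
		includePermanent:   includePermanent,
		latest:             map[string]Status{},
	}
}

func (h *healthWatcher) ComponentStatusChanged(componentID string, status Status) {
	h.latest[componentID] = status
}

// Healthy reports false if any component is still starting, or is in an
// error status that the user opted in to.
func (h *healthWatcher) Healthy() bool {
	for _, s := range h.latest {
		switch {
		case s == StatusStarting:
			return false
		case s == StatusRecoverableError && h.includeRecoverable:
			return false
		case s == StatusPermanentError && h.includePermanent:
			return false
		}
	}
	return true
}

func main() {
	w := newHealthWatcher(true, false)
	w.ComponentStatusChanged("receiver/otlp", StatusOK)
	w.ComponentStatusChanged("exporter/otlp", StatusRecoverableError)
	fmt.Println("healthy:", w.Healthy())
}
```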

Adoption and Best Practices

Components should be able to adopt runtime status reporting incrementally. For the component 1.0 milestone, however, we should establish guidelines for status reporting and implement them for some of the core components: at a minimum, the OTLP exporter, the OTLP receiver, and the memory limiter processor. The guidelines should establish general rules for determining whether an error is permanent or recoverable for various component types. Below is a very rough, in-progress idea of how this could look. Many of these choices are likely to be controversial and are completely open to debate and discussion. By implementing status reporting for core components, we should be able to better establish and document best practices for future component adoption.

Receivers

Receivers should not report error statuses for bad data sent by clients. The errors should be explicitly related to the receiver itself. The following list identifies some scenarios and their statuses, but is likely very incomplete.

RecoverableError
- A failed scrape (for a scraping receiver)
- A transient error connecting to an external service
PermanentError
- Failure to bind to the configured port
- Failure to initialize a client to communicate with an external service

Processors

Processors are largely unique in the functionality they provide. Conventions for runtime status reporting will likely need to be considered on a case-by-case basis. We have an issue to link the memory limiter processor with the (new) health check extension, which will be a good proof-of-concept use case.

Exporters

Exporter permanent errors fall into the following categories: bad or missing credentials, an incorrectly configured or incompatible endpoint, and requests or headers that are too large. All of these indicate misconfiguration at some level. The following list attempts to identify which response codes correspond to which component statuses for HTTP and gRPC.

HTTP
- RecoverableError
  - 400 errors not listed as permanent (below)
- PermanentError
  - http.StatusUnauthorized (401)
  - http.StatusForbidden (403)
  - http.StatusNotFound (404)
  - http.StatusMethodNotAllowed (405)
  - http.StatusRequestEntityTooLarge (413)
  - http.StatusRequestURITooLong (414)
  - http.StatusRequestHeaderFieldsTooLarge (431)
gRPC
- RecoverableError
  - Codes not listed as permanent below
- PermanentError
  - codes.NotFound (5)
  - codes.PermissionDenied (7)
  - codes.Unauthenticated (16)
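The lists above translate directly into two lookup functions. This is a sketch, not the OTLP exporter's code: `statusForHTTP` and `statusForGRPC` are hypothetical names, successful responses are assumed to map to StatusOK, and every other code not listed as permanent is assumed to be recoverable. To stay dependency-free, the gRPC codes are local constants carrying the numeric values given in the list; a real exporter would use the `codes` package from `google.golang.org/grpc`.

```go
package main

import (
	"fmt"
	"net/http"
)

// Status is a hypothetical stand-in for the collector's status type.
type Status int

const (
	StatusOK Status = iota
	StatusRecoverableError
	StatusPermanentError
)

// statusForHTTP maps an HTTP response code to a component status.
func statusForHTTP(code int) Status {
	switch code {
	case http.StatusUnauthorized,
		http.StatusForbidden,
		http.StatusNotFound,
		http.StatusMethodNotAllowed,
		http.StatusRequestEntityTooLarge,
		http.StatusRequestURITooLong,
		http.StatusRequestHeaderFieldsTooLarge:
		return StatusPermanentError
	}
	if code >= 200 && code < 300 {
		return StatusOK // assumption: any 2xx counts as success
	}
	return StatusRecoverableError // assumption: everything else may be retried
}

// Local copies of the gRPC status code values used in this sketch.
const (
	grpcOK               = 0
	grpcNotFound         = 5
	grpcPermissionDenied = 7
	grpcUnavailable      = 14
	grpcUnauthenticated  = 16
)

// statusForGRPC maps a gRPC status code to a component status.
func statusForGRPC(code int) Status {
	switch code {
	case grpcOK:
		return StatusOK
	case grpcNotFound, grpcPermissionDenied, grpcUnauthenticated:
		return StatusPermanentError
	}
	return StatusRecoverableError
}

func main() {
	fmt.Println(statusForHTTP(http.StatusForbidden) == StatusPermanentError)
	fmt.Println(statusForHTTP(http.StatusTooManyRequests) == StatusRecoverableError)
	fmt.Println(statusForGRPC(grpcUnavailable) == StatusRecoverableError)
}
```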

Two PRs attempted to implement these guidelines for the OTLP exporters using different approaches (#8684 and #8788). There is likely a better option: implementing consistent handling for these codes via the exporter helper, together with the related work to annotate consumer errors.

mx-psi added this to the `go.opentelemetry.io/collector/component` 1.0 milestone Apr 15, 2024

evan-bradley mentioned this issue Apr 15, 2024

Support component health setting open-telemetry/opentelemetry-collector-contrib#32304

Open

mwear mentioned this issue May 3, 2024

Confused about the difference between PermanentError vs FatalError #9823

Open
