[component] Runtime status reporting #9957

Open
mwear opened this issue Apr 12, 2024 · 0 comments

Runtime Status Reporting

As part of the component 1.0 milestone we should implement runtime status reporting for core components and come up with guidelines and best practices for incremental adoption by other components. This issue gives some background information on component status reporting, and outlines how it should work for different component types.

Component Status Events

[Diagram: component status events and their state transitions]

Component status events fall into two categories: lifecycle events (blue in the diagram) and runtime events (green and red). Lifecycle events, e.g. StatusStarting and StatusStopping, are reported automatically by the collector; components are responsible for reporting their own runtime status. During runtime it will be common for a component to transition between StatusOK and StatusRecoverableError. In some situations, a component may detect an unrecoverable state and transition into StatusPermanentError. This is a final state that cannot be transitioned out of, and it indicates that a human will have to intervene to fix it.

The state transitions are governed by a finite state machine, and the intention is that components should not have to keep track of their internal state when reporting status. Components can report StatusOK when an operation succeeds and StatusRecoverableError when an operation fails (in a recoverable way). Status events are only emitted when a component's state changes, so repeated reports of the same status have no effect. Likewise, if a component has transitioned into a final state (e.g. StatusPermanentError), subsequent attempts to report status will no-op.
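To make the "report freely, emit only on change" behavior concrete, here is a minimal sketch of such a state machine. It is not the collector's implementation: the `Reporter` type, its `Report` method, and the `emit` callback are hypothetical names, and the sketch models only the runtime statuses described above plus a starting state.

```go
package main

import "sync"

// Status is a simplified, local stand-in for a component status value.
type Status int

const (
	StatusStarting Status = iota
	StatusOK
	StatusRecoverableError
	StatusPermanentError
)

// Reporter (hypothetical) remembers the last status so that callers
// do not have to track their own state before reporting.
type Reporter struct {
	mu      sync.Mutex
	current Status
	emit    func(Status) // called only when the status actually changes
}

// NewReporter returns a Reporter that starts in StatusStarting.
func NewReporter(emit func(Status)) *Reporter {
	return &Reporter{current: StatusStarting, emit: emit}
}

// Report records next and returns true if an event was emitted.
// Repeated reports of the same status, and any report made after
// reaching the final StatusPermanentError state, are no-ops.
func (r *Reporter) Report(next Status) bool {
	r.mu.Lock()
	defer r.mu.Unlock()

	if r.current == StatusPermanentError {
		return false // final state: cannot be transitioned out of
	}
	if next == r.current {
		return false // no change, so no event
	}
	r.current = next
	if r.emit != nil {
		r.emit(next)
	}
	return true
}
```

With this shape, a component can call `Report(StatusOK)` after every successful operation and `Report(StatusRecoverableError)` after every recoverable failure without checking what it reported last.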

Consumers of Status Events

There is a PR for a new version of the health check extension that is based on component status reporting. It uses lifecycle events to determine whether the collector is ready and running, and it allows users to opt in to having recoverable or permanent errors factored into collector health. The OpAMP extension will make use of these events for component health at some point in the future. Any extension that implements the optional StatusWatcher interface can be a consumer of component status events.
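As an illustration of what a consumer might do with these events, the sketch below aggregates the latest status per component and lets the user opt in to counting recoverable or permanent errors as unhealthy. All types here are local stand-ins: `HealthWatcher`, its constructor flags, and the simplified `ComponentStatusChanged(source string, s Status)` signature are assumptions for this example, not the collector's StatusWatcher API, and lifecycle-based readiness is left out.

```go
package main

import "sync"

// Status is a simplified, local stand-in for a component status value.
type Status int

const (
	StatusStarting Status = iota
	StatusOK
	StatusRecoverableError
	StatusPermanentError
)

// HealthWatcher (hypothetical) keeps the most recent status per component.
type HealthWatcher struct {
	mu                 sync.Mutex
	includeRecoverable bool // opt in: recoverable errors count as unhealthy
	includePermanent   bool // opt in: permanent errors count as unhealthy
	statuses           map[string]Status
}

// NewHealthWatcher builds a watcher with the given opt-in settings.
func NewHealthWatcher(includeRecoverable, includePermanent bool) *HealthWatcher {
	return &HealthWatcher{
		includeRecoverable: includeRecoverable,
		includePermanent:   includePermanent,
		statuses:           make(map[string]Status),
	}
}

// ComponentStatusChanged records the latest status for a component.
// The string source ID is a simplification made for this sketch.
func (w *HealthWatcher) ComponentStatusChanged(source string, s Status) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.statuses[source] = s
}

// Healthy reports whether any component is in an error state that the
// user has opted in to treating as unhealthy.
func (w *HealthWatcher) Healthy() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	for _, s := range w.statuses {
		if s == StatusRecoverableError && w.includeRecoverable {
			return false
		}
		if s == StatusPermanentError && w.includePermanent {
			return false
		}
	}
	return true
}
```

Keeping the opt-in flags on the consumer side means components report the same statuses regardless of how strictly a given deployment wants health to be evaluated.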

Adoption and Best Practices

Components should be able to adopt runtime status reporting incrementally. For the component 1.0 milestone, however, we should establish guidelines for status reporting and implement them for some of the core components: at a minimum, the OTLP exporter, the OTLP receiver, and the memory limiter processor. The guidelines should establish general rules for determining whether an error is permanent or recoverable for various component types. Below is a very rough, in-progress idea of how this could look. Many of these choices are likely to be controversial and are completely open to debate and discussion. By implementing status reporting for core components we should be able to better establish and document best practices for future component adoption.

Receivers

Receivers should not report error statuses for bad data sent by clients. The errors should be explicitly related to the receiver itself. The following list identifies some scenarios and their statuses, but is likely very incomplete.

  • RecoverableError
    • A failed scrape (for a scraping receiver)
    • A transient error connecting to an external service
  • PermanentError
    • Failure to bind to the configured port
    • Failure to initialize a client to communicate with an external service
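The receiver rules above could be sketched as a small classification function. The sentinel errors and the `ReceiverStatus` function are hypothetical names made up for this example; how a real receiver distinguishes these cases will vary by receiver.

```go
package main

import "errors"

// Status is a simplified, local stand-in for a component status value.
type Status int

const (
	StatusOK Status = iota
	StatusRecoverableError
	StatusPermanentError
)

// Hypothetical sentinel errors used only for this illustration.
var (
	ErrBadClientData = errors.New("bad data sent by client")
	ErrScrapeFailed  = errors.New("scrape failed")
	ErrBindFailed    = errors.New("failed to bind to configured port")
	ErrClientInit    = errors.New("failed to initialize client")
)

// ReceiverStatus maps an error to a status and says whether the
// receiver should report it at all.
func ReceiverStatus(err error) (status Status, report bool) {
	switch {
	case err == nil:
		return StatusOK, true
	case errors.Is(err, ErrBadClientData):
		// Bad client data is not a problem with the receiver itself,
		// so no error status is reported for it.
		return StatusOK, false
	case errors.Is(err, ErrBindFailed), errors.Is(err, ErrClientInit):
		return StatusPermanentError, true
	default:
		// Failed scrapes and transient connection errors land here.
		// Treating every other error as recoverable is an assumption.
		return StatusRecoverableError, true
	}
}
```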

Processors

Processors are largely unique in the functionality they provide, so conventions for runtime status reporting will likely need to be considered on a case-by-case basis. We have an issue to link the memory limiter processor with the (new) health check extension, which will be a good proof-of-concept use case.

Exporters

Exporter permanent errors fall into the following categories: bad or missing credentials, an incorrectly configured or incompatible endpoint, and requests or headers that are too large. All of these indicate misconfiguration at some level. The following list attempts to identify which response codes correspond to which component statuses for HTTP and gRPC.

  • HTTP
    • RecoverableError
      • 400 errors not listed as permanent (below)
    • PermanentError
      • http.StatusUnauthorized (401)
      • http.StatusForbidden (403)
      • http.StatusNotFound (404)
      • http.StatusMethodNotAllowed (405)
      • http.StatusRequestEntityTooLarge (413)
      • http.StatusRequestURITooLong (414)
      • http.StatusRequestHeaderFieldsTooLarge (431)
  • gRPC
    • RecoverableError
      • Codes not listed as permanent below
    • PermanentError
      • codes.NotFound (5)
      • codes.PermissionDenied (7)
      • codes.Unauthenticated (16)
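The mapping in this list can be written down as two small lookup functions. This is a sketch under stated assumptions: `StatusForHTTP` and `StatusForGRPC` are hypothetical names, the gRPC codes are declared as local constants mirroring the numeric values listed above so that the example needs only the standard library, and treating 2xx (or gRPC code 0) as OK and every other unlisted code as recoverable is a choice made for this example.

```go
package main

import "net/http"

// Status is a simplified, local stand-in for a component status value.
type Status int

const (
	StatusOK Status = iota
	StatusRecoverableError
	StatusPermanentError
)

// Local copies of the gRPC code values named in the list above.
const (
	grpcOK               = 0
	grpcNotFound         = 5
	grpcPermissionDenied = 7
	grpcUnauthenticated  = 16
)

// StatusForHTTP maps an HTTP response code to a component status.
func StatusForHTTP(code int) Status {
	switch code {
	case http.StatusUnauthorized,
		http.StatusForbidden,
		http.StatusNotFound,
		http.StatusMethodNotAllowed,
		http.StatusRequestEntityTooLarge,
		http.StatusRequestURITooLong,
		http.StatusRequestHeaderFieldsTooLarge:
		return StatusPermanentError
	}
	if code >= 200 && code < 300 {
		return StatusOK // assumption: any 2xx counts as success
	}
	return StatusRecoverableError // assumption: everything else is retryable
}

// StatusForGRPC maps a numeric gRPC status code to a component status.
func StatusForGRPC(code uint32) Status {
	switch code {
	case grpcOK:
		return StatusOK
	case grpcNotFound, grpcPermissionDenied, grpcUnauthenticated:
		return StatusPermanentError
	default:
		return StatusRecoverableError
	}
}
```

For example, `StatusForHTTP(http.StatusTooManyRequests)` yields `StatusRecoverableError`, while `StatusForHTTP(http.StatusUnauthorized)` yields `StatusPermanentError`.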

Two PRs attempted to implement these guidelines for the OTLP exporters using different approaches (#8684 and #8788). There is likely a better option: we should be able to implement consistent handling for these codes via the exporter helper and the work to annotate consumer errors.
