Observability should more clearly indicate to the user when an outage in a service is the root cause #183216

sorenlouv · 2024-05-12T13:13:59Z

When running an application consisting of multiple services, we should make it easier for the user to understand whether the root cause of a problem is a specific service that has gone down.

Scenario
When running the Otel-Demo the "checkout" service is killed on purpose. This causes the failure rate of the frontend service (and other services) to increase because they have a downstream dependency on the checkout service. This in turn causes alerts to be triggered.

Problem
Nowhere in the UI do we show that the checkout service has gone down. The checkout service itself is not emitting any alerts because its failure rate is not increasing (it is no longer receiving traffic, so its failure rate might even appear to decline).
Navigating to the frontend service shows errors and alerts, but these do not clearly indicate that the checkout service is the root cause.
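
To make the gap concrete, here is a minimal sketch of a failure-rate threshold check. The function names, the numbers, and the 5% threshold are illustrative assumptions, not Kibana's actual rule implementation. It shows why such a check can fire for the frontend while staying silent for a service that receives no traffic at all.

```python
from typing import Optional


def failure_rate(failed: int, total: int) -> Optional[float]:
    """Return failed/total, or None when the service saw no traffic."""
    if total == 0:
        # No requests means there is nothing to divide by: the rate is undefined.
        return None
    return failed / total


def threshold_alert_fires(failed: int, total: int, threshold: float = 0.05) -> bool:
    """Hypothetical rule: fire only when a defined failure rate exceeds the threshold."""
    rate = failure_rate(failed, total)
    return rate is not None and rate > threshold


# frontend still receives traffic, and many requests fail because checkout is gone
frontend_fires = threshold_alert_fires(failed=40, total=100)

# checkout is dead: zero requests, zero failures, so the rule has nothing to evaluate
checkout_fires = threshold_alert_fires(failed=0, total=0)

print(f"frontend alert: {frontend_fires}, checkout alert: {checkout_fires}")
```

Under these assumptions the symptom (frontend) alerts and the cause (checkout) does not, which matches the behaviour described above.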

Solution

AI Assistant insights
The UI should clearly show which service went down and is causing cascading failures.
Detection mechanisms:
- Throughput for the dead service is 0 for an extended period.
- Analysis of outgoing requests from upstream services indicates that every request to the checkout service is failing.
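
The two detection mechanisms above could be combined roughly as follows. This is a sketch under stated assumptions: the `DependencyStats` shape, the bucketed throughput input, the `min_zero_buckets` parameter, and the choice to require both signals at once are all hypothetical and not taken from the Kibana codebase.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class DependencyStats:
    """Outgoing request counts from one upstream service to one downstream service."""
    upstream: str
    downstream: str
    total: int
    failed: int


def zero_throughput_services(throughput: Dict[str, List[int]], min_zero_buckets: int) -> Set[str]:
    """Services whose latest buckets are all zero after having had traffic earlier."""
    result: Set[str] = set()
    for service, buckets in throughput.items():
        if len(buckets) <= min_zero_buckets:
            continue  # not enough history to compare against
        recent = buckets[-min_zero_buckets:]
        earlier = buckets[:-min_zero_buckets]
        if all(b == 0 for b in recent) and any(b > 0 for b in earlier):
            result.add(service)
    return result


def fully_failing_downstreams(stats: List[DependencyStats]) -> Set[str]:
    """Downstream services for which every observed outgoing request failed."""
    totals: Dict[str, Tuple[int, int]] = {}
    for s in stats:
        total, failed = totals.get(s.downstream, (0, 0))
        totals[s.downstream] = (total + s.total, failed + s.failed)
    return {svc for svc, (total, failed) in totals.items() if total > 0 and failed == total}


def likely_down(throughput: Dict[str, List[int]], stats: List[DependencyStats],
                min_zero_buckets: int = 3) -> List[str]:
    """Services flagged by both signals: no throughput and all inbound calls failing."""
    return sorted(zero_throughput_services(throughput, min_zero_buckets)
                  & fully_failing_downstreams(stats))


# Made-up numbers resembling the scenario: checkout stops receiving traffic.
throughput = {
    "frontend": [50, 48, 51, 49, 50, 47],
    "checkout": [20, 21, 19, 0, 0, 0],
}
stats = [
    DependencyStats("frontend", "checkout", total=30, failed=30),
    DependencyStats("frontend", "cart", total=40, failed=1),
]
suspects = likely_down(throughput, stats)
print(suspects)
```

Requiring both signals is a design choice for this sketch: zero throughput alone also matches a service that is simply idle, and failing outgoing requests alone also match a network problem between the services.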

Related:
#183215

elasticmachine · 2024-05-12T13:16:43Z

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

…alert insights (#183215)

Related: #183216

### Changes
- Exclude APM error docs from logs and retrieve APM errors separately
- Get a sample `trace.id` from logs and APM errors, and retrieve the downstream service name (if available)
- Minor prompt tweaks

### Scenario
When running the [Otel-Demo](https://github.com/elastic/opentelemetry-demo) the "checkout" service is killed on purpose. This causes the failure rate of the frontend service to increase because it has a downstream dependency on the checkout service. This in turn causes alerts to be triggered. When the user navigates to the alert details page and opens the insights, they should be presented with the "checkout" service as the root cause.

### Before
Before this change, the alert insights did not capture that changes to the `checkout` service were the root cause.

![image](https://github.com/elastic/kibana/assets/209966/9235aec0-fb69-42fc-a692-7bd132ed819a)

### After
![image](https://github.com/elastic/kibana/assets/209966/a29c826a-708c-4be2-8fea-5e014a4fc41b)
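
The idea of sampling `trace.id` values from logs and APM errors and then looking up the downstream service name can be sketched as below. This is not the code from #183215. The flat dict documents, the helper names, and the use of `span.destination.service.resource` as the field that identifies the downstream dependency are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List

Doc = Dict[str, str]


def sample_trace_ids(docs: Iterable[Doc], limit: int = 5) -> List[str]:
    """Collect up to `limit` distinct trace IDs, skipping docs without one."""
    seen: List[str] = []
    for doc in docs:
        trace_id = doc.get("trace.id")
        if trace_id and trace_id not in seen:
            seen.append(trace_id)
            if len(seen) == limit:
                break
    return seen


def downstream_service_names(trace_ids: List[str], spans: Iterable[Doc],
                             service_name: str) -> Counter:
    """Count downstream destinations called by `service_name` within the sampled traces."""
    wanted = set(trace_ids)
    counts: Counter = Counter()
    for span in spans:
        if span.get("trace.id") in wanted and span.get("service.name") == service_name:
            # Assumed field: only exit spans carry a destination; others are skipped.
            destination = span.get("span.destination.service.resource")
            if destination:
                counts[destination] += 1
    return counts


# Made-up documents for the alerting service "frontend".
logs = [{"trace.id": "t1", "message": "request failed"}, {"message": "no trace here"}]
apm_errors = [{"trace.id": "t2"}, {"trace.id": "t1"}]
spans = [
    {"trace.id": "t1", "service.name": "frontend", "span.destination.service.resource": "checkout:5050"},
    {"trace.id": "t2", "service.name": "frontend", "span.destination.service.resource": "checkout:5050"},
    {"trace.id": "t3", "service.name": "frontend", "span.destination.service.resource": "cart:7070"},
    {"trace.id": "t1", "service.name": "frontend"},
]

trace_ids = sample_trace_ids(logs + apm_errors)
downstream = downstream_service_names(trace_ids, spans, "frontend")
print(trace_ids, dict(downstream))
```

The downstream name is only available when the sampled traces contain a matching exit span, which is consistent with the "(if available)" caveat in the change list.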

emma-raffenne · 2024-05-14T13:13:45Z

cc @drewpost @roshan-elastic @smith
This is a scenario we would like to investigate in order to provide an AI Assistant-based solution. We don't know to what extent this would be an RCA workflow, or whether it would also fall within the scope of the ROO initiative. Comments and input are welcome.

roshan-elastic · 2024-05-15T09:03:36Z

Thanks @emma-raffenne @sorenlouv (cc @chrisdistasio)

I completely agree with this use case. Only a few weeks ago I sketched out something with a customer where they wanted to see the sequence in which services failed to understand which one was the cause vs. the symptom. I know this isn't exactly the same, but I see a relation here.

@drewpost - The way I see this working is that we need a better 'status' indicator which can highlight if a service is having a problem. I think having a status is part of ROO, but I think there is an opportunity for an RCA workflow to guide users to see the impact/dependencies here. Curious to hear your thoughts.
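
The "sequence of which services failed" idea could be sketched like this. The event shape (service name plus the timestamp of any problem signal, such as a failure-rate breach or throughput dropping to zero) and the service names are hypothetical. Being first in the sequence is only a hint about the cause, not proof.

```python
from datetime import datetime
from typing import Dict, List, Tuple


def failure_sequence(events: List[Tuple[str, datetime]]) -> List[Tuple[str, datetime]]:
    """Order services by the time of their first problem signal."""
    first_seen: Dict[str, datetime] = {}
    for service, timestamp in events:
        if service not in first_seen or timestamp < first_seen[service]:
            first_seen[service] = timestamp
    return sorted(first_seen.items(), key=lambda item: item[1])


# Made-up problem signals; events may arrive out of order.
events = [
    ("frontend", datetime(2024, 5, 12, 13, 2)),
    ("checkout", datetime(2024, 5, 12, 13, 0)),
    ("frontend", datetime(2024, 5, 12, 13, 5)),
    ("recommendation", datetime(2024, 5, 12, 13, 3)),
]

sequence = failure_sequence(events)
for service, timestamp in sequence:
    print(timestamp.isoformat(), service)
```

With this input the earliest signal belongs to checkout, so it would be listed first as the candidate cause, with frontend and recommendation following as likely symptoms.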

sorenlouv · 2024-05-15T09:20:46Z

> where they wanted to see the sequence in which services failed to understand which one was the cause vs. the symptom. I know this isn't exactly the same, but I see a relation here.

@roshan-elastic That sounds very similar. At the moment our UIs often show alerts/errors for the dependent services, but not for the service that actually died/crashed. So while we show symptoms, we don't show the cause. We also don't show anything OOTB, meaning users have to set up the right rules. This could also be a significant barrier.

Zooming out, I think we as an Observability org should define multiple signal-agnostic scenarios that focus on common user problems that we can help them troubleshoot and understand OOTB.

Suggestions for scenarios:

- Single service failure with cascading impact (the issue discussed here)
- Slow response due to resource starvation
- Version upgrade causing a sudden increase in logs/errors/failure rate
- Unexpected surge in user traffic

In addition to defining these scenarios, we should make it very easy for stakeholders to reproduce them. To reproduce the single service failure with cascading impact, I simply used the OpenTelemetry demo and killed one of the Docker containers (detailed setup notes here).

drewpost · 2024-05-15T09:59:06Z

Thanks everyone. There's some good stuff in here. I'm feeding this into the work we did at the offsite last week. Whilst I'm not sure this will ultimately live in the APM UI as we know it today, it absolutely will feed into RCA, particularly the signal-agnostic OOTB scenarios outlined.

roshan-elastic · 2024-05-15T10:19:13Z

Thanks @sorenlouv - this is a good suggestion. I think this is what you're saying, but these could perhaps be test cases available in test environments that we can validate our development against.

I'll think about this more when I have some headspace.

botelastic bot added the needs-team label May 12, 2024

sorenlouv added the Team:Obs AI Assistant label May 12, 2024

botelastic bot removed the needs-team label May 12, 2024

sorenlouv added the needs-team and Team:obs-ux-infra_services labels May 12, 2024

botelastic bot removed the needs-team label May 12, 2024

sorenlouv mentioned this issue May 12, 2024

Add downstream dependency service name to logs and errors to improve alert insights #183215

Merged

sorenlouv changed the title ~~Contextual insights for alerts should be able to tell the user when a dead service is the root cause~~ APM and AI Assistant should more clearly indicate to the user when an outage in a service is the root cause of alerts May 12, 2024

sorenlouv changed the title ~~APM and AI Assistant should more clearly indicate to the user when an outage in a service is the root cause of alerts~~ APM UI and AI Assistant should more clearly indicate to the user when an outage in a service is the root cause of alerts May 12, 2024

sorenlouv changed the title ~~APM UI and AI Assistant should more clearly indicate to the user when an outage in a service is the root cause of alerts~~ APM UI and AI Assistant should more clearly indicate to the user when an outage in a service is the root cause May 13, 2024

sorenlouv changed the title ~~APM UI and AI Assistant should more clearly indicate to the user when an outage in a service is the root cause~~ APM UI should more clearly indicate to the user when an outage in a service is the root cause May 13, 2024

sorenlouv changed the title ~~APM UI should more clearly indicate to the user when an outage in a service is the root cause~~ Observability should more clearly indicate to the user when an outage in a service is the root cause May 16, 2024

smith added the enhancement label and removed the Team:obs-ux-infra_services label May 18, 2024
