
Observability should more clearly indicate to the user when an outage in a service is the root cause #183216

Open
sorenlouv opened this issue May 12, 2024 · 6 comments
Labels
enhancement New value added to drive a business result Team:Obs AI Assistant

Comments

@sorenlouv
Member

sorenlouv commented May 12, 2024

When running an application consisting of multiple services, we should make it easier for the user to understand whether the root cause of a problem is a specific service that has gone down.

Scenario
When running the Otel-Demo the "checkout" service is killed on purpose. This causes the failure rate of the frontend service (and other services) to increase because they have a downstream dependency on the checkout service. This in turn causes alerts to be triggered.

Problem
Nowhere in the UI do we show that the checkout service has gone down. The checkout service itself is not emitting any alerts because its failure rate is not increasing (it is no longer receiving traffic, so the failure rate may even look like it is declining).
Navigating to the frontend service shows errors and alerts, but these do not clearly indicate that the checkout service is the root cause.

Solution

  • AI Assistant insights
  • The UI should show clearly which service went down and is causing cascading failures
  • Detection mechanisms (see the sketch after this list):
    • Throughput for the dead service is 0 for an extended period.
    • Analysis of outgoing requests from upstream services will indicate that every request to the checkout service is failing.
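
As a rough illustration of the first detection mechanism, here is a minimal sketch (not an existing Kibana API) of checking whether a service has stopped emitting transactions. It assumes APM transaction documents live in a `traces-apm*` data stream with `service.name`, `processor.event`, and `@timestamp` fields; the index pattern, window sizes, and thresholds are placeholders.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Count APM transactions for a service within a time range.
async function countTransactions(serviceName: string, gte: string, lt: string): Promise<number> {
  const response = await client.count({
    index: 'traces-apm*', // placeholder index pattern
    query: {
      bool: {
        filter: [
          { term: { 'service.name': serviceName } },
          { term: { 'processor.event': 'transaction' } },
          { range: { '@timestamp': { gte, lt } } },
        ],
      },
    },
  });
  return response.count;
}

// A service is a likely root-cause candidate if it had traffic in the previous
// window but produced zero transactions in the most recent window.
async function hasServiceGoneSilent(serviceName: string, windowMinutes = 15): Promise<boolean> {
  const recent = await countTransactions(serviceName, `now-${windowMinutes}m`, 'now');
  const previous = await countTransactions(
    serviceName,
    `now-${windowMinutes * 2}m`,
    `now-${windowMinutes}m`
  );
  return previous > 0 && recent === 0;
}

// Example: flag the "checkout" service from the Otel-Demo scenario.
hasServiceGoneSilent('checkout').then((isDown) => {
  if (isDown) {
    console.log('checkout emitted no transactions in the last window — possible outage');
  }
});
```

The second mechanism could be approached similarly, for example by aggregating `event.outcome` on exit spans from upstream services whose `span.destination.service.resource` points at the dead service, though the exact fields and thresholds would need validation.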

Related:
#183215

@botelastic botelastic bot added the needs-team Issues missing a team label label May 12, 2024
@botelastic botelastic bot removed the needs-team Issues missing a team label label May 12, 2024
@sorenlouv sorenlouv added needs-team Issues missing a team label Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team labels May 12, 2024
@elasticmachine
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

@botelastic botelastic bot removed the needs-team Issues missing a team label label May 12, 2024
@sorenlouv sorenlouv changed the title Contextual insights for alerts should be able to tell the user when a dead service is the root cause APM and AI Assistant should more clearly indicate to the user when an outage in a service is the root cause of alerts May 12, 2024
@sorenlouv sorenlouv changed the title APM and AI Assistant should more clearly indicate to the user when an outage in a service is the root cause of alerts APM UI and AI Assistant should more clearly indicate to the user when an outage in a service is the root cause of alerts May 12, 2024
sorenlouv added a commit that referenced this issue May 13, 2024
…alert insights (#183215)

Related: #183216

### Changes

- Exclude APM error docs from logs and retrieve APM errors separately
- Get a sample `trace.id` from logs and APM errors, and retrieve the
downstream service name, if available (see the sketch after this list)
- Minor prompt tweaks
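
To make the second bullet concrete, here is a minimal sketch (not the actual PR code) of how downstream dependencies could be looked up from a sample `trace.id`. It assumes APM span documents in a `traces-apm*` data stream with `trace.id` and `span.destination.service.resource` fields; the index pattern and field names are illustrative.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Given a trace.id sampled from a log or APM error document, return the
// downstream dependencies (exit span destinations) touched by that trace.
async function getDownstreamDependencies(traceId: string): Promise<string[]> {
  const response = await client.search({
    index: 'traces-apm*', // placeholder index pattern
    size: 0,
    query: {
      bool: {
        filter: [
          { term: { 'trace.id': traceId } },
          { exists: { field: 'span.destination.service.resource' } },
        ],
      },
    },
    aggs: {
      downstream: {
        terms: { field: 'span.destination.service.resource', size: 10 },
      },
    },
  });

  const agg = response.aggregations?.downstream as
    | { buckets: Array<{ key: string }> }
    | undefined;
  return (agg?.buckets ?? []).map((bucket) => bucket.key);
}
```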



### Scenario

When running the
[Otel-Demo](https://github.com/elastic/opentelemetry-demo) the
"checkout" service is killed on purpose. This causes the failure rate of
the frontend service to increase because it has a downstream dependency
on the checkout service. This in turn causes alerts to be triggered.

When the user navigates to the alert details page and opens the
insights, they should be presented with the "checkout" service as the
root cause.

### Before

Before this change, the alert insights did not capture that the outage
of the `checkout` service was the root cause.


![image](https://github.com/elastic/kibana/assets/209966/9235aec0-fb69-42fc-a692-7bd132ed819a)

### After


![image](https://github.com/elastic/kibana/assets/209966/a29c826a-708c-4be2-8fea-5e014a4fc41b)
@sorenlouv sorenlouv changed the title APM UI and AI Assistant should more clearly indicate to the user when an outage in a service is the root cause of alerts APM UI and AI Assistant should more clearly indicate to the user when an outage in a service is the root cause May 13, 2024
@sorenlouv sorenlouv changed the title APM UI and AI Assistant should more clearly indicate to the user when an outage in a service is the root cause APM UI should more clearly indicate to the user when an outage in a service is the root cause May 13, 2024
@emma-raffenne
Contributor

cc @drewpost @roshan-elastic @smith
This is a scenario we would like to investigate to provide an AI Assistant based solution. We don't know how much of this would be an RCA workflow or whether it would also be in scope for the ROO initiative. Comments and input welcome.

@roshan-elastic

Thanks @emma-raffenne @sorenlouv (cc @chrisdistasio)

I completely agree with this use case. Only a few weeks ago I sketched out something with a customer who wanted to see the sequence in which services failed, to understand which ones were the cause vs. the symptom. I know this isn't exactly the same, but I see a relation here.

@drewpost - The way I see this working is that we need a better 'status' indicator that can highlight when a service is having a problem. I think having a status is part of ROO, but there is also an opportunity for an RCA workflow to guide users to see the impact/dependencies here. Curious about your thoughts?

@sorenlouv
Member Author

sorenlouv commented May 15, 2024

where they wanted to see the sequence of which services failed to understand which ones was the cause vs symptom. I know this isn't exactly the same but I see a relation here.

@roshan-elastic That sounds very similar. At the moment our UIs often show alerts/errors for the dependent services, but not for the service that actually died/crashed. So while we show the symptoms, we don't show the cause. We also don't show anything OOTB, meaning users have to set up the right rules themselves, which could also be a significant barrier.

Zooming out, I think we as an Observability org should define multiple signal-agnostic scenarios that focus on common user problems that we can help them troubleshoot and understand OOTB.

Suggestions for scenarios:

  • Single service failure with cascading impact (the issue discussed here)
  • Slow response due to resource starvation
  • Version upgrade causing a sudden increase in logs/errors/failure rate
  • Unexpected surge in user traffic

In addition to defining these scenarios, we should make it very easy for stakeholders to reproduce them. To reproduce the single service failure with cascading impact, I simply used the OpenTelemetry demo and killed one of the Docker containers (detailed setup notes here).

@drewpost

Thanks everyone. There's some good stuff in here. I'm feeding this into the work we did at the offsite last week. Whilst I'm not sure this will ultimately live in the APM UI as we know it today, it absolutely will feed into RCA, particularly the signal-agnostic OOTB scenarios outlined above.

@roshan-elastic

Thanks @sorenlouv - this is a good suggestion. I think this is what you're saying, but these could perhaps be test cases, available in test environments, that we can validate our development against.

I'll think about this more when I have some headspace.

@sorenlouv sorenlouv changed the title APM UI should more clearly indicate to the user when an outage in a service is the root cause Observability should more clearly indicate to the user when an outage in a service is the root cause May 16, 2024
@smith smith added enhancement New value added to drive a business result and removed Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team labels May 18, 2024