Observability should more clearly indicate to the user when an outage in a service is the root cause #183216
Comments
Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)
…alert insights (#183215)

Related: #183216

### Changes

- Exclude APM error docs from logs and retrieve APM errors separately
- Get sample `trace.id` from logs and APM errors, and retrieve the downstream service name (if available)
- Minor prompt tweaks

### Scenario

When running the [Otel-Demo](https://github.com/elastic/opentelemetry-demo) the "checkout" service is killed on purpose. This causes the failure rate of the frontend service to increase because it has a downstream dependency on the checkout service. This in turn causes alerts to be triggered. When the user navigates to the alert details page and opens the insights, they should be presented with the "checkout" service as the root cause.

### Before

Before this change, the alert insights did not capture that changes to the `checkout` service were the root cause.

![image](https://github.com/elastic/kibana/assets/209966/9235aec0-fb69-42fc-a692-7bd132ed819a)

### After

![image](https://github.com/elastic/kibana/assets/209966/a29c826a-708c-4be2-8fea-5e014a4fc41b)
cc @drewpost @roshan-elastic @smith
Thanks @emma-raffenne @sorenlouv (cc @chrisdistasio). I completely agree with this use case. Only a few weeks ago I sketched out something with a customer where they wanted to see the sequence in which services failed, to understand which ones were the cause vs. the symptom. I know this isn't exactly the same but I see a relation here. @drewpost - The way I see this working is that we need a better 'status' indicator which can highlight if a service is having a problem. I think having a status is part of ROO but I think there is an opportunity for an RCA workflow to guide users to see the impact/dependencies here. Curious about your thoughts?
@roshan-elastic That sounds very similar. At the moment our UIs often show alerts/errors for the dependent services, but not for the service that actually died/crashed. So while we show symptoms, we don't show the cause. We also don't show anything OOTB, meaning users have to set up the right rules. This could also be a significant barrier. Zooming out, I think we as an Observability org should define multiple signal-agnostic scenarios that focus on common user problems that we can help them troubleshoot and understand OOTB. Suggestions for scenarios:
In addition to defining these scenarios, we should make it very easy for stakeholders to reproduce them. To reproduce the problem of a single service failure having cascading impact, I simply used the OpenTelemetry demo and killed one of the Docker containers (detailed setup notes here).
Thanks everyone. There's some good stuff in here. I'm feeding this into the work we did at the offsite last week. Whilst I'm not sure this will ultimately live in the APM UI as we know it today, it absolutely will feed into RCA, particularly the signal-agnostic OOTB scenarios outlined.
Thanks @sorenlouv - this is a good suggestion. I think this is what you're saying, but these could perhaps be test cases available in test environments that we can validate our development against. I'll think about this more when I have some headspace.
When running an application consisting of multiple services, we should make it easier for the user to understand whether the root cause of a problem is a specific service that has gone down.
Scenario
When running the Otel-Demo the "checkout" service is killed on purpose. This causes the failure rate of the frontend service (and other services) to increase because they have a downstream dependency on the checkout service. This in turn causes alerts to be triggered.
Problem
Nowhere in the UI do we show that the `checkout` service has gone down. The checkout service itself is not emitting any alerts because the failure rate for this service is not increasing (it is no longer receiving traffic, so it might look like the failure rate is declining).
Navigating to the `frontend` service shows errors and alerts, but these do not clearly indicate that the checkout service is the root cause.
Solution
Related: #183215
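
To illustrate the point made under "Problem" (not a proposed solution), here is a minimal sketch of why a failure-rate threshold rule can fire for `frontend` while staying silent for the service that actually died. Everything in it is an assumption for illustration: the `WindowStats` shape, the `failureRate` and `ruleFires` helpers, the 10% threshold, and the sample numbers are hypothetical and are not taken from Kibana's rule implementation.

```typescript
// Hypothetical shape: transaction counts for one service in one time window.
interface WindowStats {
  service: string;
  total: number; // transactions observed in the window
  failed: number; // transactions that ended in failure
}

// Failure rate as failed / total. With no traffic there is nothing to divide,
// so this sketch reports 0 (an assumption; a real rule might report "no data").
function failureRate(w: WindowStats): number {
  return w.total === 0 ? 0 : w.failed / w.total;
}

// A simple threshold rule: fire when the failure rate exceeds the threshold.
function ruleFires(w: WindowStats, threshold: number): boolean {
  return failureRate(w) > threshold;
}

// Made-up numbers for the window after "checkout" has been killed.
const frontend: WindowStats = { service: "frontend", total: 200, failed: 80 };
const checkout: WindowStats = { service: "checkout", total: 0, failed: 0 };

const threshold = 0.1; // hypothetical 10% failure-rate threshold

const results = [frontend, checkout].map((w) => ({
  service: w.service,
  fires: ruleFires(w, threshold),
}));

console.log(results);
```

In this sketch the rule fires for `frontend` (the symptom) but not for `checkout` (the cause), because a service that receives no traffic produces no failed transactions to count. A signal based on a drop in throughput would behave differently, but that is outside what this issue specifies.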