Describe the bug
We trace our entire monitoring stack, including Tempo, and we also generate a service graph. The graph shows that a significant portion of tempo-distributor to tempo-ingester calls are errors, but these are all "context canceled" calls and don't appear to be actual errors.
To Reproduce
Steps to reproduce the behavior:
Configure Tempo 2.4.1 to trace itself (we do this through Alloy, which also does tail sampling)
Configure service graph generation (again, we do this in Alloy)
See the red section on the tempo-ingester service graph node.
Expected behavior
I would not expect to see continuous errors reported for our Tempo installation.
Environment:
Infrastructure: kubernetes
Deployment tool: helm
Additional Context
Basically every trace shows the distributor issuing PushBytesV2 to 3 ingesters; once 2 ingesters respond, the third call is cancelled on the distributor. Either this is the intended behavior, in which case the cancelled call should not be marked as an error on the span, or it is an actual issue that needs to be fixed.
We do tail sampling of traces, primarily percentage-based, but we also forward all traces that contain errors. As a result, practically all traces from the distributor are sampled, because all of them contain errors.
Tempo does return success as soon as two of three writes to ingesters succeed, but it shouldn't be cancelling the third. It would be worth reviewing metrics to see why this is occurring.
Are all ingesters healthy? This can be viewed using the ingester ring page on distributors.
What is the latency of the push endpoint on your ingesters? Are some slower than others?
All ingesters are healthy. This happens in all three environments that we have (prod, staging, testing).
Latency is uniform and stable across all ingesters in all environments. The dashboard shows 2.5 ms at the median and 4.95 ms at the 99th percentile. They run in AWS EKS, and the cluster is healthy.
Some additional information, focusing on our testing instance, as it is the smallest and has the lowest load. All Tempo pods run on a single node dedicated to this Tempo instance, with more than enough CPU and memory (c6g.large: 2 cores, 4 GB). We run one distributor and three ingesters; sustained load on the distributor is 50 spans/second.