Querier: panics in QueryResultRequest marshaling #7895
Comments
The way `Marshal` works, it first gets the size of the required buffer by calling `sovFrontend`, then later writes the message backwards into a buffer of that size. If the values being marshaled grow between those two steps, the write runs past the start of the buffer.
Stats are modified with atomic ops while they can still be read by the marshaler. A straightforward fix would be to make a copy (using atomic `Load` ops) of the stats struct when attaching it to the `QueryResultRequest`.
I think we should be able to trust the stats we get from queries. So I'd say it's better to understand why the race is happening and why we're hitting it, and fix that, instead of patching around the panic. These panics look like they're happening in the querier component, not the query-frontend (looking at the last 4 frames in the stack trace). I can't see how the stats could be shared. Can it maybe be the actual HTTP response?
Cool. After thinking idly about this for a few days, I figured that was the ideal approach.
Ah, yeah, I see. Not query-frontend, but querier. Updated the issue tags.
By shared do you mean concurrent access? And what's your thought about the HTTP response?
To be clear, the bit I wrote about inconsistent application of atomics being involved in this is partly a hypothesis. I don't know if it is the root cause of these panics, but I do think it could explain them, with thread->CPU migrations and stuff going on. However, I've just spot-checked a handful of these panics and all of those are running on GKE arm64. x86, on the other hand, uses the stronger total-store-order memory model, which tends to mask this kind of race. If there's thread-concurrent reads/writes to the stats without consistent use of atomics, the marshaler can observe inconsistent values.
yes
Maybe somehow that's shared as well, and something appends to the response body after the response object has been returned. But this is just speculation; not sure how helpful it is.
Found a solid explanation for this:
Logs of recent panics also show step 3 happening right before the crash.
@dimitarvdimitrov Yes, but higher layers are the culprit. This passage is in the stack above (lines 363 to 378 in faf2b84).
There could be more higher up - haven't scrutinized it.
Describe the bug
We've observed a slow but consistent stream (O(10)/day) of panics in the query-frontend proto marshaling, running in various services that run the querier code, specifically when marshaling `frontendv2pb.(*QueryResultRequest)`. It seems that some values within the `QueryResultRequest` are tripping up the marshaling/size math generated by gogoproto.

To Reproduce
I don't have a repro. (Yet.)
Expected behavior
Don't panic!
Environment
Additional Context
I've seen two variants of the panic backtraces. Here are two I've noticed - basically panics happening at two different points within `MarshalToSizedBuffer`.

Both of these panics are running commit 4ffcb00, but this problem has been steadily happening since January 2024. I haven't looked earlier.
Sometimes the out-of-range index is -1, others it is -3, -4, -6 etc.