Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Querier: panics in QueryResultRequest marshaling #7895

Closed
seizethedave opened this issue Apr 12, 2024 · 9 comments · Fixed by #8140
Closed

Querier: panics in QueryResultRequest marshaling #7895

seizethedave opened this issue Apr 12, 2024 · 9 comments · Fixed by #8140
Assignees
Labels
bug Something isn't working component/querier

Comments

@seizethedave
Copy link
Contributor

seizethedave commented Apr 12, 2024

Describe the bug

We've observed a slow and consistent stream (O(10)/day) of panics in query frontend's proto marshaling running in various services that run the querier code. Specifically when marshaling frontendv2pb.(*QueryResultRequest).
It seems that some values within the QueryResultRequest are tripping up the marshaling/size math generated by gogoproto.

To Reproduce

I don't have a repro. (Yet.)

Expected behavior

Don't panic!

Environment

  • Infrastructure: Kubernetes

Additional Context

I've seen two variants of the panic backtraces. Here are two I've noticed - basically panics happening at two different points within MarshalToSizedBuffer.

Both of these panics are running commit 4ffcb00, but this problem has been steadily happening since January 2024. I haven't looked earlier.

Sometimes the out-of-range index is -1, others it is -3, -4, -6 etc.

2024-04-12 03:19:46.114	panic: runtime error: index out of range [-1]
2024-04-12 03:19:46.114	
2024-04-12 03:19:46.114	goroutine 419595405 [running]:
2024-04-12 03:19:46.114	github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb.(*QueryResultRequest).MarshalToSizedBuffer(0xc063a1b5c0, {0xc0600794a0, 0x1ca, 0x1ca?})
2024-04-12 03:19:46.114		/drone/src/vendor/github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb/frontend.pb.go:846 +0x24d
2024-04-12 03:19:46.114	github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb.(*QueryResultRequest).Marshal(0xc06407f0c0?)
2024-04-12 03:19:46.114		/drone/src/vendor/github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb/frontend.pb.go:802 +0x4d
2024-04-12 03:19:46.114	google.golang.org/protobuf/internal/impl.legacyMarshal({{}, {0x3ab2690, 0xc063a0d460}, {0x0, 0x0, 0x0}, 0x0})
2024-04-12 03:19:46.114		/drone/src/vendor/google.golang.org/protobuf/internal/impl/legacy_message.go:407 +0x9e
2024-04-12 03:19:46.114	google.golang.org/protobuf/proto.MarshalOptions.marshal({{}, 0x60?, 0x0, 0x0}, {0x0, 0x0, 0x0}, {0x3ab2690, 0xc063a0d460})
2024-04-12 03:19:46.114		/drone/src/vendor/google.golang.org/protobuf/proto/encode.go:166 +0x25d
2024-04-12 03:19:46.114	google.golang.org/protobuf/proto.Marshal({0x3a604a0, 0xc063a0d460})
2024-04-12 03:19:46.114		/drone/src/vendor/google.golang.org/protobuf/proto/encode.go:80 +0x50
2024-04-12 03:19:46.114	google.golang.org/grpc/encoding/proto.codec.Marshal({}, {0x31cddc0?, 0xc063a1b5c0?})
2024-04-12 03:19:46.114		/drone/src/vendor/google.golang.org/grpc/encoding/proto/proto.go:47 +0xb5
2024-04-12 03:19:46.114	google.golang.org/grpc.encode({0x7eceebd755c8?, 0x5426480?}, {0x31cddc0?, 0xc063a1b5c0?})
2024-04-12 03:19:46.114		/drone/src/vendor/google.golang.org/grpc/rpc_util.go:647 +0x3e
2024-04-12 03:19:46.114	google.golang.org/grpc.prepareMsg({0x31cddc0?, 0xc063a1b5c0?}, {0x7eceebd755c8?, 0x5426480?}, {0x0, 0x0}, {0x0, 0x0})
2024-04-12 03:19:46.114		/drone/src/vendor/google.golang.org/grpc/stream.go:1771 +0xc7
2024-04-12 03:19:46.114	google.golang.org/grpc.(*clientStream).SendMsg(0xc063a60d80, {0x31cddc0?, 0xc063a1b5c0})
2024-04-12 03:19:46.114		/drone/src/vendor/google.golang.org/grpc/stream.go:893 +0xf2
2024-04-12 03:19:46.114	google.golang.org/grpc.invoke({0x3a7f6e0?, 0xc063a629c0?}, {0x33875b4?, 0xc063a5ab00?}, {0x31cddc0, 0xc063a1b5c0}, {0x3185f60, 0x5426480}, 0x2000107?, {0xc05ec841a0, ...})
2024-04-12 03:19:46.114		/drone/src/vendor/google.golang.org/grpc/call.go:70 +0x9f
2024-04-12 03:19:46.114	github.com/grafana/dskit/middleware.UnaryClientInstrumentInterceptor.func1({0x3a7f6e0, 0xc063a629c0}, {0x33875b4, 0x2c}, {0x31cddc0, 0xc063a1b5c0}, {0x3185f60, 0x5426480}, 0xc06407f6c8?, 0x35c63f8, ...)
2024-04-12 03:19:46.114		/drone/src/vendor/github.com/grafana/dskit/middleware/grpc_instrumentation.go:56 +0x117
2024-04-12 03:19:46.114	google.golang.org/grpc.getChainUnaryInvoker.func1({0x3a7f6e0, 0xc063a629c0}, {0x33875b4, 0x2c}, {0x31cddc0, 0xc063a1b5c0}, {0x3185f60, 0x5426480}, 0x0?, {0xc05ec841a0, ...})
2024-04-12 03:19:46.114		/drone/src/vendor/google.golang.org/grpc/clientconn.go:459 +0x155
2024-04-12 03:19:46.114	github.com/grafana/dskit/middleware.ClientUserHeaderInterceptor({0x3a7f6e0?, 0xc063a62900?}, {0x33875b4, 0x2c}, {0x31cddc0, 0xc063a1b5c0}, {0x3185f60, 0x5426480}, 0x14b8229045c96b3b?, 0xc063a62930, ...)
2024-04-12 03:19:46.114		/drone/src/vendor/github.com/grafana/dskit/middleware/grpc_auth.go:22 +0xd7
2024-04-12 03:19:46.114	google.golang.org/grpc.getChainUnaryInvoker.func1({0x3a7f6e0, 0xc063a62900}, {0x33875b4, 0x2c}, {0x31cddc0, 0xc063a1b5c0}, {0x3185f60, 0x5426480}, 0x2?, {0xc05ec841a0, ...})
2024-04-12 03:19:46.114		/drone/src/vendor/google.golang.org/grpc/clientconn.go:459 +0x155
2024-04-12 03:19:46.114	github.com/opentracing-contrib/go-grpc.OpenTracingClientInterceptor.func1({0x3a7f750, 0xc063a0d370}, {0x33875b4, 0x2c}, {0x31cddc0, 0xc063a1b5c0}, {0x3185f60, 0x5426480}, 0x8ef43a?, 0xc063a62810, ...)
2024-04-12 03:19:46.114		/drone/src/vendor/github.com/opentracing-contrib/go-grpc/client.go:61 +0x47a
2024-04-12 03:19:46.114	google.golang.org/grpc.newClient.chainUnaryClientInterceptors.func1({0x3a7f750, 0xc063a0d370}, {0x33875b4, 0x2c}, {0x31cddc0, 0xc063a1b5c0}, {0x3185f60, 0x5426480}, 0xc017b0bfa8?, 0x35c63f8, ...)
2024-04-12 03:19:46.114		/drone/src/vendor/google.golang.org/grpc/clientconn.go:447 +0x117
2024-04-12 03:19:46.114	google.golang.org/grpc.(*ClientConn).Invoke(0xc01398e000, {0x3a7f750?, 0xc063a0d370?}, {0x33875b4?, 0x18?}, {0x31cddc0?, 0xc063a1b5c0?}, {0x3185f60?, 0x5426480?}, {0x0, ...})
2024-04-12 03:19:46.115		/drone/src/vendor/google.golang.org/grpc/call.go:35 +0x205
2024-04-12 03:19:46.115	github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb.(*frontendForQuerierClient).QueryResult(0xc063a1b5c0?, {0x3a7f750?, 0xc063a0d370?}, 0x40e86c?, {0x0?, 0x2f15360?, 0x0?})
2024-04-12 03:19:46.115		/drone/src/vendor/github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb/frontend.pb.go:674 +0x5d
2024-04-12 03:19:46.115	github.com/grafana/mimir/pkg/querier/worker.(*schedulerProcessor).runRequest(0xc0132ebc80, {0x3a7f6e0?, 0xc063c0d590?}, {0x3a574c0, 0xc0614db9f0}, 0x710f6231f1a48058, {0xc027dcc078, 0x13}, 0x0?, 0xc0614db7c0, ...)
2024-04-12 03:19:46.115		/drone/src/vendor/github.com/grafana/mimir/pkg/querier/worker/scheduler_processor.go:305 +0x9bb
2024-04-12 03:19:46.115	github.com/grafana/mimir/pkg/querier/worker.(*schedulerProcessor).querierLoop.func3()
2024-04-12 03:19:46.115		/drone/src/vendor/github.com/grafana/mimir/pkg/querier/worker/scheduler_processor.go:233 +0x549
2024-04-12 03:19:46.115	created by github.com/grafana/mimir/pkg/querier/worker.(*schedulerProcessor).querierLoop in goroutine 3219
2024-04-12 03:19:46.115		/drone/src/vendor/github.com/grafana/mimir/pkg/querier/worker/scheduler_processor.go:203 +0x1d5
2024-04-12 11:39:56.168	panic: runtime error: index out of range [-1]
2024-04-12 11:39:56.168	
2024-04-12 11:39:56.168	goroutine 870616614 [running]:
2024-04-12 11:39:56.168	github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb.encodeVarintFrontend(...)
2024-04-12 11:39:56.168		/drone/src/vendor/github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb/frontend.pb.go:1039
2024-04-12 11:39:56.168	github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb.(*QueryResultRequest).MarshalToSizedBuffer(0xc03018c048, {0xc024920b40, 0xb2, 0xb2?})
2024-04-12 11:39:56.168		/drone/src/vendor/github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb/frontend.pb.go:844 +0x265
2024-04-12 11:39:56.168	github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb.(*QueryResultRequest).Marshal(0xc02f3af0c0?)
2024-04-12 11:39:56.168		/drone/src/vendor/github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb/frontend.pb.go:802 +0x4d
2024-04-12 11:39:56.168	google.golang.org/protobuf/internal/impl.legacyMarshal({{}, {0x3ab2690, 0xc04fb2d0b0}, {0x0, 0x0, 0x0}, 0x0})
2024-04-12 11:39:56.168		/drone/src/vendor/google.golang.org/protobuf/internal/impl/legacy_message.go:407 +0x9e
2024-04-12 11:39:56.168	google.golang.org/protobuf/proto.MarshalOptions.marshal({{}, 0xb0?, 0x0, 0x0}, {0x0, 0x0, 0x0}, {0x3ab2690, 0xc04fb2d0b0})
2024-04-12 11:39:56.168		/drone/src/vendor/google.golang.org/protobuf/proto/encode.go:166 +0x25d
2024-04-12 11:39:56.168	google.golang.org/protobuf/proto.Marshal({0x3a604a0, 0xc04fb2d0b0})
2024-04-12 11:39:56.168		/drone/src/vendor/google.golang.org/protobuf/proto/encode.go:80 +0x50
2024-04-12 11:39:56.168	google.golang.org/grpc/encoding/proto.codec.Marshal({}, {0x31cddc0?, 0xc03018c048?})
2024-04-12 11:39:56.168		/drone/src/vendor/google.golang.org/grpc/encoding/proto/proto.go:47 +0xb5
2024-04-12 11:39:56.168	google.golang.org/grpc.encode({0x793666caa3b0?, 0x5426480?}, {0x31cddc0?, 0xc03018c048?})
2024-04-12 11:39:56.168		/drone/src/vendor/google.golang.org/grpc/rpc_util.go:647 +0x3e
2024-04-12 11:39:56.168	google.golang.org/grpc.prepareMsg({0x31cddc0?, 0xc03018c048?}, {0x793666caa3b0?, 0x5426480?}, {0x0, 0x0}, {0x0, 0x0})
2024-04-12 11:39:56.168		/drone/src/vendor/google.golang.org/grpc/stream.go:1771 +0xc7
2024-04-12 11:39:56.168	google.golang.org/grpc.(*clientStream).SendMsg(0xc0b81f9440, {0x31cddc0?, 0xc03018c048})
2024-04-12 11:39:56.168		/drone/src/vendor/google.golang.org/grpc/stream.go:893 +0xf2
2024-04-12 11:39:56.168	google.golang.org/grpc.invoke({0x3a7f6e0?, 0xc05c51ac60?}, {0x33875b4?, 0xc03a94f080?}, {0x31cddc0, 0xc03018c048}, {0x3185f60, 0x5426480}, 0x3000106?, {0xc011a27660, ...})
2024-04-12 11:39:56.168		/drone/src/vendor/google.golang.org/grpc/call.go:70 +0x9f
2024-04-12 11:39:56.168	github.com/grafana/dskit/middleware.UnaryClientInstrumentInterceptor.func1({0x3a7f6e0, 0xc05c51ac60}, {0x33875b4, 0x2c}, {0x31cddc0, 0xc03018c048}, {0x3185f60, 0x5426480}, 0xc02f3af6c8?, 0x35c63f8, ...)
2024-04-12 11:39:56.168		/drone/src/vendor/github.com/grafana/dskit/middleware/grpc_instrumentation.go:56 +0x117
2024-04-12 11:39:56.168	google.golang.org/grpc.getChainUnaryInvoker.func1({0x3a7f6e0, 0xc05c51ac60}, {0x33875b4, 0x2c}, {0x31cddc0, 0xc03018c048}, {0x3185f60, 0x5426480}, 0x0?, {0xc011a27660, ...})
2024-04-12 11:39:56.168		/drone/src/vendor/google.golang.org/grpc/clientconn.go:459 +0x155
2024-04-12 11:39:56.168	github.com/grafana/dskit/middleware.ClientUserHeaderInterceptor({0x3a7f6e0?, 0xc05c51aba0?}, {0x33875b4, 0x2c}, {0x31cddc0, 0xc03018c048}, {0x3185f60, 0x5426480}, 0x1cca7ffdcb6a7e9?, 0xc05c51abd0, ...)
2024-04-12 11:39:56.168		/drone/src/vendor/github.com/grafana/dskit/middleware/grpc_auth.go:22 +0xd7
2024-04-12 11:39:56.168	google.golang.org/grpc.getChainUnaryInvoker.func1({0x3a7f6e0, 0xc05c51aba0}, {0x33875b4, 0x2c}, {0x31cddc0, 0xc03018c048}, {0x3185f60, 0x5426480}, 0x3657eb6640?, {0xc011a27660, ...})
2024-04-12 11:39:56.168		/drone/src/vendor/google.golang.org/grpc/clientconn.go:459 +0x155
2024-04-12 11:39:56.169	github.com/opentracing-contrib/go-grpc.OpenTracingClientInterceptor.func1({0x3a7f750, 0xc04fb2cfc0}, {0x33875b4, 0x2c}, {0x31cddc0, 0xc03018c048}, {0x3185f60, 0x5426480}, 0x8ef43a?, 0xc05c51aab0, ...)
2024-04-12 11:39:56.169		/drone/src/vendor/github.com/opentracing-contrib/go-grpc/client.go:61 +0x47a
2024-04-12 11:39:56.169	google.golang.org/grpc.newClient.chainUnaryClientInterceptors.func1({0x3a7f750, 0xc04fb2cfc0}, {0x33875b4, 0x2c}, {0x31cddc0, 0xc03018c048}, {0x3185f60, 0x5426480}, 0xc018d742e8?, 0x35c63f8, ...)
2024-04-12 11:39:56.169		/drone/src/vendor/google.golang.org/grpc/clientconn.go:447 +0x117
2024-04-12 11:39:56.169	google.golang.org/grpc.(*ClientConn).Invoke(0xc01d188e00, {0x3a7f750?, 0xc04fb2cfc0?}, {0x33875b4?, 0x18?}, {0x31cddc0?, 0xc03018c048?}, {0x3185f60?, 0x5426480?}, {0x0, ...})
2024-04-12 11:39:56.169		/drone/src/vendor/google.golang.org/grpc/call.go:35 +0x205
2024-04-12 11:39:56.169	github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb.(*frontendForQuerierClient).QueryResult(0xc03018c048?, {0x3a7f750?, 0xc04fb2cfc0?}, 0x40e86c?, {0x0?, 0x2f15360?, 0x0?})
2024-04-12 11:39:56.169		/drone/src/vendor/github.com/grafana/mimir/pkg/frontend/v2/frontendv2pb/frontend.pb.go:674 +0x5d
2024-04-12 11:39:56.169	github.com/grafana/mimir/pkg/querier/worker.(*schedulerProcessor).runRequest(0xc0186d8600, {0x3a7f6e0?, 0xc035f8d950?}, {0x3a574c0, 0xc0b52d4960}, 0x799adbdf44b41fe9, {0xc03a27ec00, 0x13}, 0x18?, 0xc0b52d4730, ...)
2024-04-12 11:39:56.169		/drone/src/vendor/github.com/grafana/mimir/pkg/querier/worker/scheduler_processor.go:305 +0x9bb
2024-04-12 11:39:56.169	github.com/grafana/mimir/pkg/querier/worker.(*schedulerProcessor).querierLoop.func3()
2024-04-12 11:39:56.169		/drone/src/vendor/github.com/grafana/mimir/pkg/querier/worker/scheduler_processor.go:233 +0x549
2024-04-12 11:39:56.169	created by github.com/grafana/mimir/pkg/querier/worker.(*schedulerProcessor).querierLoop in goroutine 686369868
2024-04-12 11:39:56.169		/drone/src/vendor/github.com/grafana/mimir/pkg/querier/worker/scheduler_processor.go:203 +0x1d5
@seizethedave seizethedave changed the title Query frontend: intermittent panics in QueryResultRequest marshaling Query frontend: panics in QueryResultRequest marshaling Apr 12, 2024
@seizethedave
Copy link
Contributor Author

seizethedave commented Apr 13, 2024

The way Marshal first gets the size of the required buffer by calling sovFrontend, then later MarshalToSizedBuffer recomputes those value sizes again to figure out where in the byte slice to put the values, if any of the values grew in varint repr size between those two calls this bug would definitely happen. (Because the byte slice is loaded R->L, the values growing would cause it to overflow off the left end of the slice into negative indices.) Wondering if this Stats struct is concurrently updated somewhere while this marshaling is going on.

@seizethedave
Copy link
Contributor Author

Stats are modified using sync/atomic functions, but the gRPC/proto machinery marshals them by looking straight at the struct fields. (And not using atomic.LoadXyz()). All writes to the stats are done before the marshal happens, but it's a bug to write but not read the values using atomics. (The happens-before relationship isn't guaranteed when atomics are used inconsistently.) This is a plausible cause of these panics that we should fix.

A straightforward fix would be to make a copy (using atomic Load ops) of the stats struct when attaching it to the QueryResultRequest payload. This won't be exactly accurate as there are downstream middlewares in the request that continue to increment stats metrics. So a question is whether it's important that those stats updates are present in the stats we're about to send.

@seizethedave seizethedave added component/query-frontend bug Something isn't working labels Apr 18, 2024
@dimitarvdimitrov
Copy link
Contributor

So a question is whether it's important that those stats updates are present in the stats we're about to send.

I think we should be able to trust the stats we get from queries. So I'd say it's better to understand why the race is happening and why we're hitting it and fix it instead of patching the panic.

These panics look like they're happening in the querier component, not the query-frontend (looking at the last 4 frames in the stack trace) - code. I can't see how the stats could be shared. Can it maybe be the actual HTTP response?

@seizethedave
Copy link
Contributor Author

seizethedave commented Apr 22, 2024

I think we should be able to trust the stats we get from queries. So I'd say it's better to understand why the race is happening and why we're hitting it and fix it instead of patching the panic.

Cool. After thinking idly about this for a few days I figured that was the ideal.

These panics look like they're happening in the querier component, not the query-frontend (looking at the last 4 frames in the stack trace) - code.

Ah, yeah, I see. Not query-frontend, but querier. Updated the issue tags.

I can't see how the stats could be shared. Can it maybe be the actual HTTP response?

By shared do you mean concurrent access? And what's your thought about the HTTP response?

@seizethedave seizethedave changed the title Query frontend: panics in QueryResultRequest marshaling Querier: panics in QueryResultRequest marshaling Apr 22, 2024
@seizethedave
Copy link
Contributor Author

seizethedave commented Apr 22, 2024

To be clear, the bit I wrote about inconsistent application of atomics being involved in this is partly a hypothesis. I don't know if it is the root cause of these panics, but I do think it could explain them with thread->CPU migrations and stuff going on. However, I've just spot-checked a handful of these panics and all of those are running on GKE n2 machines, which are amd64 and use the same instructions for loading a uint64 value atomically and non-atomically. So, unlikely that it's the root cause of these panics.

arm64, on the other hand, uses the stronger LDAR instruction for atomic.LoadUint64 (godbolt) so this non-atomic load thing is worth fixing anyway.

If there's thread-concurrent reads/writes to Stats going on on top of that, I haven't found them.

@dimitarvdimitrov
Copy link
Contributor

I can't see how the stats could be shared. Can it maybe be the actual HTTP response?

By shared do you mean concurrent access?

yes

And what's your thought about the HTTP response

Maybe somehow that's shared as well and something appends to the response body after the response object has been returned. But this is just speculation; not sure how helpful it is

@seizethedave seizethedave self-assigned this May 15, 2024
@seizethedave
Copy link
Contributor Author

Found a solid explanation for this:

  1. scheduler processor invokes the querier handler.
  2. the querier handler (way down the call tree) forks off goroutines to do things. Those goroutines increment query stats. (like here)
  3. the handler context is canceled for whatever reason. (connection to frontend dies, timeout, ...)
  4. scheduler processor invokes QueryResult containing the error, response, and the same stats object those goroutines ^^ are still incrementing because they don't yet know the context has died. So the varint repr of the stats being marshaled are changing/growing during marshaling of the QueryResultRequest and the panic happens.

Logs of recent panics also show step 3 happening right before the crash.

@dimitarvdimitrov
Copy link
Contributor

  • the querier handler (way down the call tree) forks off goroutines to do things. Those goroutines increment query stats. (like here)

doesn't fetchSeriesFromStores return only after all such goroutines have returned?

@seizethedave
Copy link
Contributor Author

@dimitarvdimitrov Yes, but higher layers are the culprit. This passage is in the stack above fetchSeriesFromStores:

sets := make(chan storage.SeriesSet, len(queriers))
for _, querier := range queriers {
go func(querier storage.Querier) {
sets <- querier.Select(ctx, true, sp, matchers...)
}(querier)
}
var result []storage.SeriesSet
for range queriers {
select {
case set := <-sets:
result = append(result, set)
case <-ctx.Done():
return storage.ErrSeriesSet(ctx.Err())
}
}

There could be more higher up - haven't scrutinized it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working component/querier
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants