Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/ccl/testccl/sqlccl/sqlccl_test: TestReadCommittedStmtRetry failed #123661

Closed
cockroach-teamcity opened this issue May 6, 2024 · 6 comments · Fixed by #123957
Closed

pkg/ccl/testccl/sqlccl/sqlccl_test: TestReadCommittedStmtRetry failed #123661

cockroach-teamcity opened this issue May 6, 2024 · 6 comments · Fixed by #123957
Assignees
Labels
A-testing Testing tools and infrastructure branch-release-24.1.0-rc Used to mark GA and release blockers and technical advisories for 24.1.0-rc C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented May 6, 2024

pkg/ccl/testccl/sqlccl/sqlccl_test.TestReadCommittedStmtRetry failed on release-24.1.0-rc @ 34bbed697e77466f52012a80f0f6ae496c0d2889:

Fatal error:

panic: test timed out after 41m37s
running tests:
	TestReadCommittedStmtRetry (41m37s)

Stack:

goroutine 84012 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2366 +0x265
created by time.goFunc
	GOROOT/src/time/sleep.go:177 +0x45
Log preceding fatal error

github.com/cockroachdb/cockroach/pkg/kv/kvpb._Internal_Batch_Handler.func1({0xdbe8dd8, 0xc003e179b0}, {0x9e49540, 0xc006610000})
	github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-fastbuild/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:10632 +0xbf
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.ServerInterceptor.func12({0xdbe8dd8, 0xc003e179b0}, {0x9e49540, 0xc006610000}, 0xc002eea3a0, 0xc00194a7f8)
	github.com/cockroachdb/cockroach/pkg/util/tracing/grpcinterceptor/grpc_interceptor.go:97 +0x72c
google.golang.org/grpc.getChainUnaryHandler.func1({0xdbe8dd8, 0xc003e179b0}, {0x9e49540, 0xc006610000})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1163 +0x12d
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func3({0xdbe8dd8, 0xc003e179b0}, {0x9e49540, 0xc006610000}, 0xc002eea3a0, 0xc00723d6c0)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:169 +0xf1
google.golang.org/grpc.getChainUnaryHandler.func1({0xdbe8dd8, 0xc003e179b0}, {0x9e49540, 0xc006610000})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1163 +0x12d
github.com/cockroachdb/cockroach/pkg/rpc.kvAuth.unaryInterceptor({0xc002308000?, {{0x0?}, {0xdc34180?, 0xc001179500?}}}, {0xdbe8dd8, 0xc003e178c0}, {0x9e49540, 0xc006610000}, 0xc002eea3a0, 0xc00723d680)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/auth.go:95 +0x2b7
google.golang.org/grpc.getChainUnaryHandler.func1({0xdbe8dd8, 0xc003e178c0}, {0x9e49540, 0xc006610000})
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1163 +0x12d
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func1.1({0xdbe8dd8, 0xc003e178c0})
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:136 +0x6b
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc00138dea0, {0xdbe8dd8, 0xc003e178c0}, {0xc00141c808?, 0xc00723d640?}, 0xc0094f1700)
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:337 +0x186
github.com/cockroachdb/cockroach/pkg/rpc.NewServerEx.func1({0xdbe8dd8, 0xc003e178c0}, {0x9e49540, 0xc006610000}, 0xc002eea3a0, 0xc00723d640)
	github.com/cockroachdb/cockroach/pkg/rpc/pkg/rpc/context.go:134 +0x151
google.golang.org/grpc.NewServer.chainUnaryServerInterceptors.chainUnaryInterceptors.func1({0xdbe8dd8, 0xc003e178c0}, {0x9e49540, 0xc006610000}, 0xc002eea3a0, 0xc00194a7f8)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1154 +0xe8
github.com/cockroachdb/cockroach/pkg/kv/kvpb._Internal_Batch_Handler({0x9e18b20, 0xc004727008}, {0xdbe8dd8, 0xc003e178c0}, 0xc009a4e9a0, 0xc001b05aa0)
	github.com/cockroachdb/cockroach/pkg/kv/kvpb/bazel-out/k8-fastbuild/bin/pkg/kv/kvpb/kvpb_go_proto_/github.com/cockroachdb/cockroach/pkg/kv/kvpb/api.pb.go:10634 +0x1f4
google.golang.org/grpc.(*Server).processUnaryRPC(0xc003d55a40, {0xdc3f320, 0xc004fda1a0}, 0xc0077170e0, 0xc001875860, 0x11be5860, 0x0)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1336 +0x1814
google.golang.org/grpc.(*Server).handleStream(0xc003d55a40, {0xdc3f320, 0xc004fda1a0}, 0xc0077170e0, 0x0)
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:1704 +0x1013
google.golang.org/grpc.(*Server).serveStreams.func1.2()
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:965 +0xe5
created by google.golang.org/grpc.(*Server).serveStreams.func1 in goroutine 2094
	google.golang.org/grpc/external/org_golang_google_grpc/server.go:963 +0x4a8

goroutine 79143 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0xc007b640a8, 0x367)
	GOROOT/src/runtime/sema.go:569 +0x159
sync.(*Cond).Wait(0xc007b64098)
	GOROOT/src/sync/cond.go:70 +0x75
github.com/cockroachdb/pebble/record.(*flusherCond).Wait(...)
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:369
github.com/cockroachdb/pebble/record.(*LogWriter).flushLoop(0xc007b64000, {0xc003eb6cc8?, 0x94efce0?})
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:643 +0xb79
runtime/pprof.Do({0xdbe8c18, 0x132807c0}, {{0xc000de9e60?, 0x1a1cd00?, 0xc001e74d10?}}, 0xc005321fc0)
	GOROOT/src/runtime/pprof/runtime.go:51 +0x118
github.com/cockroachdb/pebble/record.NewLogWriter.func2()
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:562 +0x92
created by github.com/cockroachdb/pebble/record.NewLogWriter in goroutine 133
	github.com/cockroachdb/pebble/record/external/com_github_cockroachdb_pebble/record/log_writer.go:561 +0x8e5

goroutine 26607 [select]:

Parameters:

  • attempt=1
  • race=true
  • run=2
  • shard=3
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

Jira issue: CRDB-38457

@cockroach-teamcity cockroach-teamcity added branch-release-24.1.0-rc Used to mark GA and release blockers and technical advisories for 24.1.0-rc C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-queries SQL Queries Team labels May 6, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone May 6, 2024
@mgartner
Copy link
Collaborator

mgartner commented May 8, 2024

Not sure if this is relevant:

I240506 10:36:59.119411 2474 13@kv/kvserver/replicate_queue.go:767 ⋮
  [T1,Vsystem,n1,replicate,s1,r41/1:‹/Table/3{8-9}›]
  1  error processing replica: 0 of 1 live stores are able to take
  a new replica for the range (1 already has a voter, 0 already have
  a non-voter); likely not enough nodes in cluster

@mgartner
Copy link
Collaborator

mgartner commented May 8, 2024

Looks like it is waiting 41 minutes for the UPDATE to finish:

_, err = tx.Exec(`UPDATE kv SET v = v+10 WHERE k = 'a'`)

goroutine 16 [IO wait, 41 minutes]:
internal/poll.runtime_pollWait(0x7fcdfdd52c00, 0x72)
	GOROOT/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0xc003704520, 0x72, 0x0)
	GOROOT/src/internal/poll/fd_poll_runtime.go:84 +0xb1
internal/poll.(*pollDesc).waitRead(...)
	GOROOT/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc003704500, {0xc00373dc00, 0x700, 0x700})
	GOROOT/src/internal/poll/fd_unix.go:164 +0x466
net.(*netFD).Read(0xc003704500, {0xc00373dc00, 0x700, 0x700})
	GOROOT/src/net/fd_posix.go:55 +0x4b
net.(*conn).Read(0xc00604c680, {0xc00373dc00, 0x700, 0x700})
	GOROOT/src/net/net.go:179 +0xad
crypto/tls.(*atLeastReader).Read(0xc006c89290, {0xc00373dc00, 0x700, 0x700})
	GOROOT/src/crypto/tls/conn.go:806 +0x89
bytes.(*Buffer).ReadFrom(0xc00809beb0, {0xdb9fba0, 0xc006c89290})
	GOROOT/src/bytes/buffer.go:211 +0x113
crypto/tls.(*Conn).readFromUntil(0xc00809bc08, {0xdb9a300, 0xc00604c680}, 0x5)
	GOROOT/src/crypto/tls/conn.go:828 +0x1da
crypto/tls.(*Conn).readRecordOrCCS(0xc00809bc08, 0x0)
	GOROOT/src/crypto/tls/conn.go:626 +0x365
crypto/tls.(*Conn).readRecord(...)
	GOROOT/src/crypto/tls/conn.go:588
crypto/tls.(*Conn).Read(0xc00809bc08, {0xc0082d5000, 0x1000, 0x0?})
	GOROOT/src/crypto/tls/conn.go:1370 +0x2dd
bufio.(*Reader).Read(0xc002c4a120, {0xc005f95b28, 0x5, 0x200})
	GOROOT/src/bufio/bufio.go:241 +0x4bf
io.ReadAtLeast({0xdb95b60, 0xc002c4a120}, {0xc005f95b28, 0x5, 0x200}, 0x5)
	GOROOT/src/io/io.go:335 +0xd0
io.ReadFull(...)
	GOROOT/src/io/io.go:354
github.com/lib/pq.(*conn).recvMessage(0xc005f95b08, 0xc0092b4990)
	github.com/lib/pq/external/com_github_lib_pq/conn.go:1004 +0x1ec
github.com/lib/pq.(*conn).recv1Buf(0xc005f95b08, 0xc0092b4990)
	github.com/lib/pq/external/com_github_lib_pq/conn.go:1059 +0x35
github.com/lib/pq.(*conn).recv1(...)
	github.com/lib/pq/external/com_github_lib_pq/conn.go:1086
github.com/lib/pq.(*conn).simpleExec(0xc005f95b08, {0x9fa8fe3, 0x24})
	github.com/lib/pq/external/com_github_lib_pq/conn.go:664 +0x2b1
github.com/lib/pq.(*conn).Exec(0xc005f95b08, {0x9fa8fe3, 0x24}, {0x132807c0, 0x0, 0x0})
	github.com/lib/pq/external/com_github_lib_pq/conn.go:921 +0x331
github.com/lib/pq.(*conn).ExecContext(0xc005f95b08, {0xdbe8c18, 0x132807c0}, {0x9fa8fe3, 0x24}, {0x132807c0, 0x0, 0x0?})
	github.com/lib/pq/external/com_github_lib_pq/conn_go18.go:46 +0x21d
database/sql.ctxDriverExec({0xdbe8c18, 0x132807c0}, {0x7fcdab3a9270, 0xc005f95b08}, {0x0, 0x0}, {0x9fa8fe3, 0x24}, {0x132807c0, 0x0, ...})
	GOROOT/src/database/sql/ctxutil.go:31 +0xeb
database/sql.(*DB).execDC.func2()
	GOROOT/src/database/sql/sql.go:1703 +0x266
database/sql.withLock({0xdbb0548, 0xc004b8d050}, 0xc009303000)
	GOROOT/src/database/sql/sql.go:3530 +0xa7
database/sql.(*DB).execDC(0x9567d20?, {0xdbe8c18, 0x132807c0}, 0xc004b8d050, 0xc001207270, {0x9fa8fe3, 0x24}, {0x0, 0x0, 0x0})
	GOROOT/src/database/sql/sql.go:1698 +0x405
database/sql.(*Tx).ExecContext(0xc0062f9880, {0xdbe8c18, 0x132807c0}, {0x9fa8fe3, 0x24}, {0x0, 0x0, 0x0})
	GOROOT/src/database/sql/sql.go:2506 +0x129
database/sql.(*Tx).Exec(...)
	GOROOT/src/database/sql/sql.go:2515
pkg/ccl/testccl/sqlccl/sqlccl_test_test.TestReadCommittedStmtRetry(0xc0021daea0)
	pkg/ccl/testccl/sqlccl/sqlccl_test_test/pkg/ccl/testccl/sqlccl/read_committed_test.go:110 +0xdeb
testing.tRunner(0xc0021daea0, 0xa3326f8)
	GOROOT/src/testing/testing.go:1689 +0x21f
created by testing.(*T).Run in goroutine 1
	GOROOT/src/testing/testing.go:1742 +0x826

@mgartner
Copy link
Collaborator

mgartner commented May 8, 2024

@michae2 or @rafiss do you have any ideas?

@rytaft rytaft added this to Triage in SQL Foundations via automation May 8, 2024
@blathers-crl blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label May 8, 2024
@rytaft rytaft removed the T-sql-queries SQL Queries Team label May 8, 2024
@rafiss
Copy link
Collaborator

rafiss commented May 8, 2024

This doesn't appear to be an issue with retries. Rather, the UPDATE statement was blocked from ever completing in the first place. That seems more likely to be an issue outside of SQL.

It seems like some latches were held for longer than expected. Perhaps that could prevent the necessary locks from ever being acquired.

W240506 10:36:56.820356 133 kv/kvserver/spanlatch/manager.go:654 ⋮ [T1,Vsystem,n1,s1,r4/1:‹/System{/tsd-tse}›,raft] 121  Merge [/System/tsd/‹cr.node.admission.errored.elastic-cpu.normal-pri›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kvadmission.flow_controller.regular_requests_bypassed›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kv.rangefeed.mem_shared›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.endtxn.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.adminchangereplicas.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.batch_requests.cross_zone.bytes›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kv.prober.read.latency-count›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kv.prober.read.latency-sum›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kv.prober.read.latency-avg›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kv.prober.read.latency-max›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kv.prober.read.latency-p99.999›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kv.prober.read.latency-p99.99›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kv.prober.read.latency-p99.9›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kv.prober.read.latency-p99›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kv.prober.read.latency-p90›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kv.prober.read.latency-p75›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kv.prober.read.latency-p50›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.errored.sql-root-start›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.liveness.heartbeatsuccesses›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.l0_compacted_bytes.kv›/‹10s›/2024-05-06T10:00:00Z/‹1›],... 2626 skipped ..., Merge [/System/tsd/‹cr.store.range.snapshots.rebalancing.sent-bytes›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.admission.raft.paused_replicas›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.queue.consistency.pending›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.queue.replicate.process.success›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.rebalancing.state.imbalanced_overfull_options_exhausted›/‹10s›/2024-05-06T10:00:00Z/‹1›] has held latch for 3470354780 ns. Some possible causes are slow disk reads, slow raft replication, and expensive request processing.

W240506 10:38:11.416771 128 kv/kvserver/spanlatch/manager.go:654 ⋮ [T1,Vsystem,n1,s1,r4/1:‹/System{/tsd-tse}›,raft] 381  Merge [/System/tsd/‹cr.node.admission.requested.kv-stores›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.requested.kv-stores.bulk-normal-pri›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.admitted.kv.normal-pri›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.admitted.elastic-cpu.bulk-normal-pri›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.checkconsistency.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.admitted.kv-stores.bulk-normal-pri›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_queue_length.sql-kv-response.locking-normal-pri›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kvadmission.flow_controller.regular_requests_waiting›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.adminrelocaterange.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.resolveintentrange.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.export.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kv.prober.write.failures›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.requested.sql-sql-response.locking-normal-pri›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_queue_length.elastic-cpu›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.timeseries.write.bytes›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.revertrange.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_queue_length.kv-stores.normal-pri›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.raft.transport.reverse-sent›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.clearrange.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.linkexternalsstable.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›],... 2691 skipped ..., Merge [/System/tsd/‹cr.store.raft.rcvd.snap›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.raft.rcvd.dropped›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.raft.rcvd.queued_bytes›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.storage.shared-storage.write›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.kv.allocator.load_based_lease_transfers.cannot_find_better_candidate›/‹10s›/2024-05-06T10:00:00Z/‹1›] has held latch for 7595404358 ns. Some possible causes are slow disk reads, slow raft replication, and expensive request processing.

W240506 10:38:22.756780 132 kv/kvserver/spanlatch/manager.go:654 ⋮ [T1,Vsystem,n1,s1,r4/1:‹/System{/tsd-tse}›,raft] 447  Merge [/System/tsd/‹cr.node.admission.granter.io_tokens_exhausted_duration.kv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_queue_length.sql-root-start.normal-pri›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kvadmission.flow_controller.regular_tokens_deducted›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.deleterange.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.refreshrange.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.batch_responses.cross_zone.bytes›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_queue_length.kv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.requested.kv.locking-normal-pri›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kvadmission.flow_controller.elastic_requests_bypassed›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.timeseries.write.errors›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.granter.used_slots.sql-leaf-start›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.tenant.consumption.write_batches›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.subsume.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_durations.kv.high-pri-count›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_durations.kv.high-pri-sum›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_durations.kv.high-pri-avg›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_durations.kv.high-pri-max›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_durations.kv.high-pri-p99.999›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_durations.kv.high-pri-p99.99›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_durations.kv.high-pri-p99.9›/‹10s›/2024-05-06T10:00:00Z/‹1›],... 2691 skipped ..., Merge [/System/tsd/‹cr.store.lockcount›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.rebalancing.writebytespersecond›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.storage.batch-commit.commit-wait.duration›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.range.snapshots.cross-zone.sent-bytes›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.kv.allocator.load_based_replica_rebalancing.should_transfer›/‹10s›/2024-05-06T10:00:00Z/‹1›] has held latch for 9102625592 ns. Some possible causes are slow disk reads, slow raft replication, and expensive request processing.

W240506 10:38:29.199857 128 kv/kvserver/spanlatch/manager.go:654 ⋮ [T1,Vsystem,n1,s1,r4/1:‹/System{/tsd-tse}›,raft] 474  Merge [/System/tsd/‹cr.node.liveness.heartbeatsinflight›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.admitted.kv.locking-normal-pri›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_queue_length.sql-sql-response›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.scheduler_latency_listener.p99_nanos›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.tenant.consumption.request_units›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.requested.kv-stores.locking-normal-pri›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.adminsplit.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.admintransferlease.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.transferlease.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.rpc.method.barrier.recv›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_queue_length.kv.high-pri›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.requested.kv-stores.ttl-low-pri›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.kvadmission.flow_controller.regular_requests_admitted›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.gossip.connections.incoming›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_queue_length.kv-stores›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_durations.kv-count›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_durations.kv-sum›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_durations.kv-avg›/‹10s›/202
4-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_durations.kv-max›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.node.admission.wait_durations.kv-p99.999›/‹10s›/2024-05-06T10:00:00Z/‹1›],... 2691 skipped ..., Merge [/System/tsd/‹cr.store.raft.process.commandcommit.latency-p50›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.raft.heartbeats.pending›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.queue.replicate.purgatory›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.kv.rangefeed.processors_goroutine›/‹10s›/2024-05-06T10:00:00Z/‹1›], Merge [/System/tsd/‹cr.store.txnrecovery.successes.committed›/‹10s›/2024-05-06T10:00:00Z/‹1›] has held latch for 5347128994 ns. Some possible causes are slow disk reads, slow raft replication, and expensive request processing.

There are also 416 latency jump warnings:

W240506 11:17:02.931215 2080 2@rpc/clock_offset.go:291 ⋮ [T10,Vcluster-10,nsql1,rnode=1,raddr=‹127.0.0.1:35451›,class=default,rpc] 1122  latency jump (prev avg 57.16ms, current 141.18ms)

It seems like something might have been making this testcluster slow. I'm removing the blocker label, and moving to KV in case they want to investigate why.

@rafiss rafiss added T-kv KV Team and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels May 8, 2024
@blathers-crl blathers-crl bot added this to Incoming in KV May 8, 2024
@mgartner
Copy link
Collaborator

mgartner commented May 8, 2024

Maybe we should skip this test under race?

@nvanbenschoten
Copy link
Member

The UPDATE appears to be stuck on the hook that the test injects:

goroutine 4129 [semacquire, 41 minutes]:
sync.runtime_Semacquire(0xc002212e68?)
	GOROOT/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc002212e60)
	GOROOT/src/sync/waitgroup.go:116 +0xa5
pkg/ccl/testccl/sqlccl/sqlccl_test_test.TestReadCommittedStmtRetry.func1({0xc006610000?, 0xdbe8dd8?}, 0xc006610000)
	pkg/ccl/testccl/sqlccl/sqlccl_test_test/pkg/ccl/testccl/sqlccl/read_committed_test.go:69 +0x232
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).SendWithWriteBytes(0xc00324db08, {0xdbe8dd8, 0xc003e17d70}, 0xc006610000)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_send.go:170 +0x507
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).SendWithWriteBytes(0xc004032008, {0xdbe8dd8, 0xc003e17d40}, 0xc006610000)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_send.go:193 +0xd12
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).SendWithWriteBytes(0xc002277a70, {0xdbe8dd8, 0xc003e17d40}, 0xc006610000)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:203 +0xdf
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc004727008, {0xdbe8dd8, 0xc003e17cb0}, {0x6?}, 0xc006610000)
	github.com/cockroachdb/cockroach/pkg/server/node.go:1437 +0x931
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc004727008, {0xdbe8dd8, 0xc003e179b0}, 0xc006610000)
	github.com/cockroachdb/cockroach/pkg/server/node.go:1637 +0x4a5

The hook seems to be firing before the test expects it to, during the UPDATE to "a".

@nvanbenschoten nvanbenschoten self-assigned this May 10, 2024
@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-testing Testing tools and infrastructure labels May 10, 2024
craig bot pushed a commit that referenced this issue May 13, 2024
123957: sql: deflake TestReadCommittedStmtRetry r=nvanbenschoten a=nvanbenschoten

Fixes #123661.

This commit deflakes `TestReadCommittedStmtRetry` by not enabling the hook to block writes in the Read Committed transaction until after the first write. This avoids flakiness if the first write is ever retried due to slowness in the cluster, lease instability, or some other reason.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in 421a23b May 13, 2024
SQL Foundations automation moved this from Triage to Done May 13, 2024
blathers-crl bot pushed a commit that referenced this issue May 13, 2024
Fixes #123661.

This commit deflakes `TestReadCommittedStmtRetry` by not enabling the
hook to block writes in the Read Committed transaction until after the
first write. This avoids flakiness if the first write is ever retried
due to slowness in the cluster, lease instability, or some other reason.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-testing Testing tools and infrastructure branch-release-24.1.0-rc Used to mark GA and release blockers and technical advisories for 24.1.0-rc C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team
Projects
KV
Incoming
Development

Successfully merging a pull request may close this issue.

5 participants