roachtest: acceptance/gossip/locality-address failed #123978

Open
cockroach-teamcity opened this issue May 11, 2024 · 4 comments
Labels
branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-testeng TestEng Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented May 11, 2024

roachtest.acceptance/gossip/locality-address failed with artifacts on master @ 59b261a579cbe2c032a5dd3e182ff67aeee900b9:

(test_runner.go:1237).runTest: test timed out (10m0s)
test artifacts and logs in: /artifacts/acceptance/gossip/locality-address/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage


Jira issue: CRDB-38636

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels May 11, 2024
@cockroach-teamcity cockroach-teamcity added this to roachtest/unit test backlog in KV May 11, 2024
@nvanbenschoten
Member

goroutine 11113 [sync.Mutex.Lock, 1 minutes]:
sync.runtime_SemacquireMutex(0x747b95e?, 0x6?, 0x7f0b029297f8?)
	GOROOT/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc001c92150)
	GOROOT/src/sync/mutex.go:171 +0x15d
sync.(*Mutex).Lock(0x10?)
	GOROOT/src/sync/mutex.go:90 +0x32
github.com/cockroachdb/cockroach/pkg/roachprod/vm/gce.Init.NewDNSProvider.func1(0xc003466dc0)
	github.com/cockroachdb/cockroach/pkg/roachprod/vm/gce/dns.go:59 +0x49
github.com/cockroachdb/cockroach/pkg/roachprod/vm/gce.(*dnsProvider).CreateRecords(0xc0035f6048, {0x9078530, 0xc001e90b40}, {0xc0014e61c0, 0x2, 0xc0014e61c0?})
	github.com/cockroachdb/cockroach/pkg/roachprod/vm/gce/dns.go:116 +0x73e
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).RegisterServices.func1({0x7f0b00df4c50, 0xda89220})
	github.com/cockroachdb/cockroach/pkg/roachprod/install/services.go:324 +0x354
github.com/cockroachdb/cockroach/pkg/roachprod/vm.ForDNSProvider({0xc0049f5fc0, 0x3}, 0xc002dd8da8)
	github.com/cockroachdb/cockroach/pkg/roachprod/vm/dns.go:116 +0x122
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).RegisterServices(0xc0031bc360, {0x9078530, 0xc001e90b40}, {0xc0027146c0, 0x2, 0x7714e01?})
	github.com/cockroachdb/cockroach/pkg/roachprod/install/services.go:308 +0x348
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).maybeRegisterServices(0xc0031bc360, {0x9078530, 0xc001e90b40}, 0xc004f19bc0, {0x0, {0xc0020847c0, 0x1, 0x1}, 0x1, {0x7714e01, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:292 +0x2a5
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start(0xc0031bc360, {0x9078530, 0xc001e90b40}, 0xc004f19bc0, {0x0, {0xc0020847c0, 0x1, 0x1}, 0x1, {0x7714e01, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:408 +0x376
github.com/cockroachdb/cockroach/pkg/roachprod.Start({0x9078530, 0xc001e90b40}, 0xc004f19bc0, {0xc0054da360?, 0xc002180008?}, {0x0, {0xc0020847c0, 0x1, 0x1}, 0x1, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/roachprod/roachprod.go:751 +0xba
main.(*clusterImpl).StartE(_, {_, _}, _, {{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, ...}, ...)
	main/pkg/cmd/roachtest/cluster.go:2076 +0x46e
main.(*clusterImpl).Start(_, {_, _}, _, {{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, ...}, ...)
	main/pkg/cmd/roachtest/cluster.go:2236 +0xbe
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runCheckLocalityIPAddress({0x9078530, 0xc001e90b40}, {0x911f5a0, 0xc0018b9760}, {0x916bf30, 0xc002844488})
	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/gossip.go:516 +0x283
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerAcceptance.func1({0x9078530?, 0xc001e90b40?}, {0x911f5a0?, 0xc0018b9760?}, {0x916bf30?, 0xc002844488?})
	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/acceptance.go:152 +0x3a
main.(*testRunner).runTest.func2()
	main/pkg/cmd/roachtest/test_runner.go:1208 +0xf2
created by main.(*testRunner).runTest in goroutine 74
	main/pkg/cmd/roachtest/test_runner.go:1192 +0x927

Node startup failed. Based on the stacks in __stacks.log, this looks like some kind of infra failure. I'll move this to test-eng, who may know more.

@nvanbenschoten nvanbenschoten removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels May 13, 2024
@nvanbenschoten nvanbenschoten added this to Triage in Test Engineering via automation May 13, 2024
@blathers-crl blathers-crl bot added the T-testeng TestEng Team label May 13, 2024
@nvanbenschoten nvanbenschoten removed this from roachtest/unit test backlog in KV May 13, 2024

blathers-crl bot commented May 13, 2024

cc @cockroachdb/test-eng

@srosenberg
Member

@herkolategan This is the mutex in NewDNSProviderWithExec,

return NewDNSProviderWithExec(func(cmd *exec.Cmd) ([]byte, error) {
	// Limit to one gcloud command at a time. At this time we are unsure if it's
	// safe to make concurrent calls to the `gcloud` CLI to mutate DNS records
	// in the same zone. We don't mutate the same record in parallel, but we do
	// mutate different records in the same zone. See: #122180 for more details.
	gcloudMu.Lock()
	defer gcloudMu.Unlock()
	return cmd.CombinedOutput()
})

which contends with WipeForReuse (stack below) and causes the test to time out while it waits for the lock. Did we hear back from GCE support on whether the global lock is required?

goroutine 68 [semacquire, 4 minutes]:
sync.runtime_Semacquire(0x0?)
        GOROOT/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc002a1ad80?)
        GOROOT/src/sync/waitgroup.go:116 +0x48
golang.org/x/sync/errgroup.(*Group).Wait(0xc001efe440)
        golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:56 +0x25
github.com/cockroachdb/cockroach/pkg/roachprod/vm.FanOutDNS({0xc005848808, 0x4, 0x26?}, 0xc001cb88a0)
        github.com/cockroachdb/cockroach/pkg/roachprod/vm/dns.go:99 +0x33c
github.com/cockroachdb/cockroach/pkg/roachprod.DestroyDNS({0x9078568, 0xc001ac5770}, 0x0?, {0xc0024c0090?, 0x4?})
        github.com/cockroachdb/cockroach/pkg/roachprod/roachprod.go:2291 +0xaa
main.(*clusterImpl).DestroyDNS(...)
        main/pkg/cmd/roachtest/cluster.go:2955
main.(*clusterImpl).WipeForReuse(0xc0014e3688, {_, _}, _, {{0x0, 0x0}, 0x4, 0x4, 0x0, 0x0, ...})
        main/pkg/cmd/roachtest/cluster.go:2943 +0x46e
main.(*testRunner).runWorker(0xc002714360, {0x9078568?, 0xc001771140?}, {0xc001c939ac, _}, _, _, _, _, {0x1, ...}, ...)
        main/pkg/cmd/roachtest/test_runner.go:631 +0x5d8
main.(*testRunner).Run.func1({0x9078568, 0xc001771140})
        main/pkg/cmd/roachtest/test_runner.go:366 +0x252
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:485 +0x13a
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx in goroutine 1
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:476 +0x3fe

@herkolategan
Collaborator

@srosenberg Thanks for the extra info. Renato and I were looking at this earlier. I'll create a support ticket; I had planned to create one only if the mutex caused issues, and unfortunately it doesn't seem to scale.

@herkolategan herkolategan self-assigned this May 15, 2024
@herkolategan herkolategan moved this from Triage to Backlog in Test Engineering May 15, 2024