roachtest: schemachange/leasing-benchmark failed [azure; n2 failed to start due to connection refused error] #123947

Closed
cockroach-teamcity opened this issue May 10, 2024 · 8 comments · Fixed by #124613
Labels
branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-testeng TestEng Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented May 10, 2024

roachtest.schemachange/leasing-benchmark failed with artifacts on master @ 16d41751607b92234351c1ab27053c3875a4f2b7:

(test_runner.go:1237).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

/cc @cockroachdb/sql-foundations

This test on roachdash | Improve this report!

Jira issue: CRDB-38620

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels May 10, 2024
@rafiss
Collaborator

rafiss commented May 10, 2024

It appears that n2 never started up, due to connectivity issues in the cluster:

W240510 13:59:49.014602 15 gossip/client.go:121 ⋮ [T1,Vsystem,n2] 48  failed to start gossip client to ‹40.76.187.244:26257›: initial connection heartbeat failed: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.2.0.10:26257: connect: connection refused"› [code 2/Unknown]
E240510 13:59:49.014641 16 2@rpc/peer.go:598 ⋮ [T1,Vsystem,n2,rnode=?,raddr=‹40.76.187.244:26257›,class=system,rpc] 49  failed connection attempt‹ (last connected 0s ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.2.0.10:26257: connect: connection refused"› [code 2/Unknown]
E240510 13:59:50.010528 188 2@rpc/peer.go:598 ⋮ [T1,Vsystem,n2,rnode=?,raddr=‹40.76.187.244:26257›,class=system,rpc] 50  failed connection attempt‹ (last connected 996ms ago)›: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.2.0.10:26257: connect: connection refused"› [code 2/Unknown]
I240510 13:59:51.877241 273 kv/kvserver/liveness/liveness.go:648 ⋮ [T1,Vsystem,n2,liveness-hb] 51  unable to get liveness record from KV: unable to get liveness: aborted in DistSender: result is ambiguous: context deadline exceeded
I240510 13:59:52.875722 339 gossip/client.go:127 ⋮ [T1,Vsystem,n2] 52  started gossip client to n0 (‹40.76.187.244:26257›)
I240510 13:59:52.890874 143 1@server/server.go:1791 ⋮ [T1,Vsystem,n2] 53  node connected via gossip
I240510 13:59:52.891410 90 kv/kvserver/stores.go:283 ⋮ [T1,Vsystem,n2] 54  wrote 1 node addresses to persistent storage
I240510 13:59:52.891555 339 gossip/client.go:136 ⋮ [T1,Vsystem,n2] 55  closing client to n1 (‹40.76.187.244:26257›): recv msg error: grpc: ‹duplicate connection from node at 10.2.0.10:26257› [code 2/Unknown]
E240510 13:59:53.162512 315 2@rpc/peer.go:577 ⋮ [T1,Vsystem,n2,rnode=?,raddr=‹40.76.187.244:26257›,class=system,rpc] 56  disconnected (was healthy for 1.016s): grpc: ‹initial connection heartbeat failed: grpc: client requested node ID 2 doesn't match server node ID 3 [code 2/Unknown]› [code 2/Unknown]
I240510 13:59:54.878328 273 kv/kvserver/liveness/liveness.go:648 ⋮ [T1,Vsystem,n2,liveness-hb] 57  unable to get liveness record from KV: unable to get liveness: aborted in DistSender: result is ambiguous: context deadline exceeded

I'll move this to TestEng, in case this is something worth investigating in the new Azure infra. Otherwise, feel free to close this as a non-actionable flake.
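
For anyone triaging similar logs, here is a minimal Go sketch (not part of roachtest or CockroachDB) that tallies the two failure signatures visible in the excerpt above: `connection refused` dial errors and the node ID mismatch on the heartbeat. The names `summarize` and `startupSummary` are hypothetical, and the sample lines are message fragments copied from the excerpt with the log prefixes trimmed.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nodeIDMismatch matches the heartbeat error text seen in the excerpt above.
var nodeIDMismatch = regexp.MustCompile(
	`client requested node ID (\d+) doesn't match server node ID (\d+)`)

// startupSummary (hypothetical type) tallies the failure signatures in a node's log.
type startupSummary struct {
	ConnRefused int      // lines containing a "connection refused" dial error
	Mismatches  []string // "requested->actual" node ID pairs
}

// summarize scans log lines and counts the two signatures.
func summarize(lines []string) startupSummary {
	var s startupSummary
	for _, line := range lines {
		if strings.Contains(line, "connect: connection refused") {
			s.ConnRefused++
		}
		if m := nodeIDMismatch.FindStringSubmatch(line); m != nil {
			s.Mismatches = append(s.Mismatches, m[1]+"->"+m[2])
		}
	}
	return s
}

func main() {
	// Message fragments from the excerpt above, log prefixes trimmed.
	lines := []string{
		`failed to start gossip client to ‹40.76.187.244:26257›: initial connection heartbeat failed: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.2.0.10:26257: connect: connection refused"› [code 2/Unknown]`,
		`disconnected (was healthy for 1.016s): grpc: ‹initial connection heartbeat failed: grpc: client requested node ID 2 doesn't match server node ID 3 [code 2/Unknown]› [code 2/Unknown]`,
	}
	s := summarize(lines)
	fmt.Printf("connection refused: %d, node ID mismatches: %v\n", s.ConnRefused, s.Mismatches)
}
```

A non-empty `Mismatches` list alongside connection-refused errors may hint at an addressing problem rather than a slow start, but that is only a heuristic.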

@rafiss rafiss changed the title roachtest: schemachange/leasing-benchmark failed roachtest: schemachange/leasing-benchmark failed [azure; n2 failed to start due to connection refused error] May 10, 2024
@rafiss rafiss added T-testeng TestEng Team and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels May 10, 2024

blathers-crl bot commented May 10, 2024

cc @cockroachdb/test-eng

@blathers-crl blathers-crl bot added this to Triage in Test Engineering May 10, 2024
@cockroach-teamcity
Member Author

roachtest.schemachange/leasing-benchmark failed with artifacts on master @ 4c2e7761acd050aaee565443932b6b0eca55620b:

(test_runner.go:1237).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.schemachange/leasing-benchmark failed with artifacts on master @ 4cc0bfcc14771331fea57de01e1ea78b07393f3d:

(test_runner.go:1237).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.schemachange/leasing-benchmark failed with artifacts on master @ 6300c3c3367ad46ac48bf24915cf0d73cae446a0:

(test_runner.go:1243).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.schemachange/leasing-benchmark failed with artifacts on master @ d146ecff6f687e438706cf63591cafca60cc116d:

(test_runner.go:1253).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

Same failure on other branches


@cockroach-teamcity
Member Author

roachtest.schemachange/leasing-benchmark failed with artifacts on master @ c580e634736b2d2b6da544eecf16664d3caca740:

(test_runner.go:1255).runTest: test timed out (2h0m0s)
test artifacts and logs in: /artifacts/schemachange/leasing-benchmark/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

@DarrylWong
Contributor

DarrylWong commented May 23, 2024

It looks like this test fails every time it runs, but it is usually skipped because Azure doesn't have enough capacity. I'm seeing this quite often for westus2:

compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="SkuNotAvailable" Message="The requested VM size for resource 'Following SKUs have failed for Capacity Restrictions: Standard_D4ds_v5' is currently not available in location 'westus2'. Please try another size or deploy to a different location or different zone. See https://aka.ms/azureskunotavailable for details." Target="vmSize"

The actual issue, though, seems to be that roachprod doesn't support geo-distributed clusters on Azure yet. I tried adding support but ran into further issues with how we handle network peering that seemed non-trivial to fix. I think I'll put out a PR to:

  1. Disable this test on Azure.
  2. Switch the default location from westus2 to westus3.
  3. Make an issue to support geo zones for Azure.
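
As a rough illustration of how the capacity error above could be told apart from a real test failure, here is a small Go sketch. It only does string matching on the error text quoted above. The function name `capacityFailure` is hypothetical, and this is not how roachprod actually handles the error.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// locationRE pulls the region out of the SkuNotAvailable message quoted above.
var locationRE = regexp.MustCompile(`not available in location '([^']+)'`)

// capacityFailure (hypothetical helper) reports whether msg looks like an
// Azure SkuNotAvailable error and, if it can be parsed, the affected location.
func capacityFailure(msg string) (location string, ok bool) {
	if !strings.Contains(msg, `Code="SkuNotAvailable"`) {
		return "", false
	}
	if m := locationRE.FindStringSubmatch(msg); m != nil {
		return m[1], true
	}
	// The error code matched, but the location could not be extracted.
	return "", true
}

func main() {
	// Fragment of the error quoted above.
	msg := `Code="SkuNotAvailable" Message="The requested VM size for resource 'Following SKUs have failed for Capacity Restrictions: Standard_D4ds_v5' is currently not available in location 'westus2'." Target="vmSize"`
	if loc, ok := capacityFailure(msg); ok {
		fmt.Printf("capacity failure in %q, treat as infra flake\n", loc)
	}
}
```

A check like this would only classify the failure. Changing the default location or skipping the test on Azure would still need the roachprod and roachtest changes listed above.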

@craig craig bot closed this as completed in 526fc7f May 24, 2024
Test Engineering automation moved this from Triage to Done May 24, 2024
SQL Foundations automation moved this from Triage to Done May 24, 2024