Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ccl/multiregionccl: TestMrSystemDatabase failed #122790

Closed
cockroach-teamcity opened this issue Apr 22, 2024 · 4 comments · Fixed by #123905
Closed

ccl/multiregionccl: TestMrSystemDatabase failed #122790

cockroach-teamcity opened this issue Apr 22, 2024 · 4 comments · Fixed by #123905
Assignees
Labels
branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Apr 22, 2024

ccl/multiregionccl.TestMrSystemDatabase failed on master @ 347cdc76d4c5abb2e872f325e944337a46b5883f:

=== RUN   TestMrSystemDatabase
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestMrSystemDatabase2776683919
    test_log_scope.go:81: use -show-logs to present logs inline
    multiregion_system_table_test.go:381: -- test log scope end --
test logs left over in: outputs.zip/logTestMrSystemDatabase2776683919
--- FAIL: TestMrSystemDatabase (111.23s)
=== RUN   TestMrSystemDatabase/Sqlliveness_system_database
    multiregion_system_table_test.go:114: error scanning '&{0xc0195b0a20 <nil>}': pq: unsupported comparison: bytes to crdb_internal_region
    --- FAIL: TestMrSystemDatabase/Sqlliveness_system_database (0.19s)

Parameters:

  • attempt=1
  • race=true
  • run=3
  • shard=3
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-foundations

This test on roachdash | Improve this report!

Jira issue: CRDB-38068

@cockroach-teamcity cockroach-teamcity added branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Apr 22, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Apr 22, 2024
@fqazi fqazi self-assigned this Apr 22, 2024
@fqazi
Copy link
Collaborator

fqazi commented Apr 22, 2024

We previously, this was being hit because the transition to a multi-region database was not well tolerated in the system database. The latest variant is something else, since I don't see any messages from the counting logic

@fqazi fqazi removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Apr 23, 2024
@fqazi
Copy link
Collaborator

fqazi commented Apr 23, 2024

This is also limited to the system database case, which isn't fully supported as of yet, so dropping the release-blocker tag. Still trying to reproduce it since, its not clear why the SELECT is seeing this error.

@exalate-issue-sync exalate-issue-sync bot added the P-2 Issues/test failures with a fix SLA of 3 months label Apr 25, 2024
@cockroach-teamcity
Copy link
Member Author

ccl/multiregionccl.TestMrSystemDatabase failed on master @ 2f498d0abcf4130e5c872b73200086d6a3af9f02:

=== RUN   TestMrSystemDatabase
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestMrSystemDatabase3383108453
    test_log_scope.go:81: use -show-logs to present logs inline
    multiregion_system_table_test.go:381: -- test log scope end --
test logs left over in: outputs.zip/logTestMrSystemDatabase3383108453
--- FAIL: TestMrSystemDatabase (274.12s)
=== RUN   TestMrSystemDatabase/Sqlliveness_system_database
    multiregion_system_table_test.go:114: error scanning '&{0xc02757d9e0 <nil>}': pq: unsupported comparison: bytes to crdb_internal_region
    --- FAIL: TestMrSystemDatabase/Sqlliveness_system_database (0.12s)

Parameters:

  • attempt=1
  • race=true
  • run=2
  • shard=3
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@fqazi
Copy link
Collaborator

fqazi commented May 9, 2024

This is happening because we accidentally reverted: https://github.com/cockroachdb/cockroach/pull/117139/files. Let me get a PR to address it again

fqazi added a commit to fqazi/cockroach that referenced this issue May 9, 2024
Previously, when converting the system database to multiregion its
possible for table statistics to contain the existing type of
crdb_region as bytes. This could happen if automatic statistics
collection happened concurrently with the conversion to a multi-region
system database. The conversion had logic to clear table statistics, but
it was still possible for statistics collection to happen in between.
This could cause queries against RBR system tables to fail because,
since the statistics types information no longer matches with the table
descriptor after. We started seeing this for the system database inside
TestMrSystemDatabase, once conversion was added for the system tenant.
To address this, this patch first adds extra logic in the schema changer
to force a refresh of stats on system tables, which will force a refresh
of statistics after the schema change, in case a stats refresh occurs
before the job completes. We also modify the TestMrSystemDatabase to
intentionally generate stats before changing the system database under
the system tenant to avoid the risk of hitting this issue. With these
changes we expect the test to no longer flake and any real world
occurrence to be less transient.

Fixes: cockroachdb#122790

Release note: None
fqazi added a commit to fqazi/cockroach that referenced this issue May 9, 2024
Previously, when converting the system database to multiregion its
possible for table statistics to contain the existing type of
crdb_region as bytes. This could happen if automatic statistics
collection happened concurrently with the conversion to a multi-region
system database. The conversion had logic to clear table statistics, but
it was still possible for statistics collection to happen in between.
This could cause queries against RBR system tables to fail because,
since the table_statistics type information no longer matches with the table
descriptor after. We started seeing this for the system database inside
TestMrSystemDatabase, once conversion was added for the system tenant.
To address this, this patch first adds extra logic in the schema changer
to force a refresh of stats on system tables, which will force a refresh
of statistics after the schema change, in case a stats refresh occurs
before the job completes. We also modify the TestMrSystemDatabase to
intentionally generate stats before changing the system database under
the system tenant to avoid the risk of hitting this issue. With these
changes we expect the test to no longer flake and any real world
occurrence to be less transient.

Fixes: cockroachdb#122790

Release note: None
fqazi added a commit to fqazi/cockroach that referenced this issue May 9, 2024
Previously, when converting the system database to multiregion its
possible for table statistics to contain the existing type of
crdb_region as bytes. This could happen if automatic statistics
collection happened concurrently with the conversion to a multi-region
system database. The conversion had logic to clear table statistics, but
it was still possible for statistics collection to happen in between.
This could cause queries against RBR system tables to fail because,
since the table_statistics type information no longer matches with the table
descriptor after. We started seeing this for the system database inside
TestMrSystemDatabase, once conversion was added for the system tenant.
To address this, this patch first adds extra logic in the schema changer
to force a refresh of stats on system tables, which will force a refresh
of statistics after the schema change, in case a stats refresh occurs
before the job completes. We also modify the TestMrSystemDatabase to
intentionally generate stats before changing the system database under
the system tenant to avoid the risk of hitting this issue. With these
changes we expect the test to no longer flake and any real world
occurrence to be less transient.

Fixes: cockroachdb#122790

Release note: None
craig bot pushed a commit that referenced this issue May 15, 2024
123905: sql: refresh stats for multi-tenant system database conversions r=fqazi a=fqazi

Previously, when converting the system database to multiregion its possible for table statistics to contain the existing type of crdb_region as bytes. This could happen if automatic statistics collection happened concurrently with the conversion to a multi-region system database. The conversion had logic to clear table statistics, but it was still possible for statistics collection to happen in between. This could cause queries against RBR system tables to fail because, since the table_statistics type information no longer matches with the table descriptor after. We started seeing this for the system database inside TestMrSystemDatabase, once conversion was added for the system tenant. To address this, this patch first adds extra logic in the schema changer to force a refresh of stats on system tables, which will force a refresh of statistics after the schema change, in case a stats refresh occurs before the job completes. We also modify the TestMrSystemDatabase to intentionally generate stats before changing the system database under the system tenant to avoid the risk of hitting this issue. With these changes we expect the test to no longer flake and any real world occurrence to be less transient.

Fixes: #122790

Release note: None

Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
@craig craig bot closed this as completed in dd4ea19 May 15, 2024
SQL Foundations automation moved this from Triage to Done May 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. P-2 Issues/test failures with a fix SLA of 3 months T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
Development

Successfully merging a pull request may close this issue.

2 participants