
After upgrade from 5.4 to 7.5, my TiFlash data gone after the TiKV node disconnected #8777

Closed
uzuki27 opened this issue Feb 21, 2024 · 25 comments · Fixed by #8785


uzuki27 commented Feb 21, 2024

Bug Report

I’m facing an issue with TiFlash after upgrading my TiDB cluster from v5.4 to v7.5.
A few days after the upgrade, I realized that the data had become out of sync between TiKV and TiFlash. Queries via TiFlash returned almost empty data. When I check tiflash_replica, the PROGRESS of all my tables has fallen back to 0 (or some value other than 1), while AVAILABLE is still 1.

+----------+---------------+-----------+----------+
| TABLE_ID | REPLICA_COUNT | AVAILABLE | PROGRESS |
+----------+---------------+-----------+----------+
|   460166 |             2 |         1 |      0.5 |
|   396090 |             2 |         1 |        0 |
|   396886 |             2 |         1 |        0 |
|   397221 |             2 |         1 |        0 |
|   397295 |             2 |         1 |        0 |
+----------+---------------+-----------+----------+

After that, I decided to set all replicas to 0, scale in both of my TiFlash nodes, wait for Pending Offline to finish, prune them, and scale them out into the cluster again. After setting the replica count back to 2 for my tables, replication ran normally. Once replication of all tables had completed, my queries on TiFlash were in sync again. But after about 1 day, the issue occurred again. I did the scale-in/scale-out 2 more times, and the issue still hasn’t been resolved.

1. Minimal reproduce step (Required)

After several attempts, I realized they all shared the same scenario: about 1 day after all tables had completed replication, one of the TiKV nodes evicted all of its leader regions and then rebalanced again.
image

The evicted TiKV node's log raised some ERRORs like this:

[2024/02/21 08:37:27.278 +07:00] [ERROR] [peer.rs:5351] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 1020098876 store_id: 5"] [peer_id=1020785306] [region_id=1020098875] [type=MsgHibernateResponse] [thread_id=0x5]
[2024/02/21 08:37:27.280 +07:00] [ERROR] [peer.rs:5351] ["failed to send extra message"] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target="id: 1065360228 store_id: 5"] [peer_id=977883503] [region_id=977883502] [type=MsgHibernateRequest] [thread_id=0x5]

and it rejoined with this log:

[2024/02/21 08:37:27.369 +07:00] [INFO] [raft.rs:2345] ["[term 219] received MsgTimeoutNow from 1019902187 and starts an election to get leadership."] [from=1019902187] [term=219] [raft_id=1019902186] [region_id=1019902184] [thread_id=0x5]
[2024/02/21 08:37:27.369 +07:00] [INFO] [raft.rs:1532] ["starting a new election"] [term=219] [raft_id=1019902186] [region_id=1019902184] [thread_id=0x5]
[2024/02/21 08:37:27.369 +07:00] [INFO] [raft.rs:1151] ["became candidate at term 220"] [term=220] [raft_id=1019902186] [region_id=1019902184] [thread_id=0x5]

I don’t think an OOM kill or restart occurred here. After some minutes, TiKV had this log:

[2024/02/21 09:06:11.244 +07:00] [INFO] [scheduler.rs:753] ["get snapshot failed"] [err="Error(Request(message: \"EpochNotMatch current epoch of region 977165220 is conf_ver: 116913 version: 98886, but you sent conf_ver: 116913 version: 98885\" epoch_not_match { current_regions { id: 977165220 start_key: 74800000000000AAFF575F7280000004A0FF8C68B20000000000FA end_key: 74800000000000AAFF5800000000000000F8 region_epoch { conf_ver: 116913 version: 98886 } peers { id: 1022383514 store_id: 959355551 } peers { id: 1022395425 store_id: 4 } peers { id: 1029101388 store_id: 5 } } current_regions { id: 1257436660 start_key: 74800000000000AAFF575F7280000004A0FF893D2D0000000000FA end_key: 74800000000000AAFF575F7280000004A0FF8C68B20000000000FA region_epoch { conf_ver: 116913 version: 98886 } peers { id: 1257436661 store_id: 959355551 } peers { id: 1257436662 store_id: 4 } peers { id: 1257436663 store_id: 5 } } }))"] [cid=215606] [thread_id=0x5]

Following that, the 2 TiFlash nodes had some strange logs and the data on TiFlash became empty:

[2024/02/21 09:06:17.072 +07:00] [INFO] [TiDBSchemaSyncer.cpp:72] ["Start to sync schemas. current version is: 759513 and try to sync schema version to: 759516"] [source="keyspace=4294967295"] [thread_id=547]
[2024/02/21 09:06:17.072 +07:00] [INFO] [SchemaBuilder.cpp:1235] ["Storage instance does not exist, applyDropTable is ignored, table_id=462125"] [source="keyspace=4294967295"] [thread_id=547]
[2024/02/21 09:06:17.072 +07:00] [INFO] [SchemaBuilder.cpp:1235] ["Storage instance does not exist, applyDropTable is ignored, table_id=462125"] [source="keyspace=4294967295"] [thread_id=547]
[2024/02/21 09:06:17.072 +07:00] [INFO] [SchemaBuilder.cpp:1235] ["Storage instance does not exist, applyDropTable is ignored, table_id=462125"] [source="keyspace=4294967295"] [thread_id=547]
[2024/02/21 09:06:17.073 +07:00] [INFO] [TiDBSchemaSyncer.cpp:106] ["End sync schema, version has been updated to 759516"] [source="keyspace=4294967295"] [thread_id=547]
[2024/02/21 09:06:17.073 +07:00] [INFO] [SchemaSyncService.cpp:218] ["Schema GC begin, last_safepoint=447867365903564800 safepoint=447867787011424256"] [source="keyspace=4294967295"] [thread_id=547]
[2024/02/21 09:06:17.074 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_102178 table_name=t_102758 database_tombstone=447867472962912264 table_tombstone=0 safepoint=447867787011424256"] [thread_id=547]
[2024/02/21 09:06:17.074 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_102178 table_name=t_109756 database_tombstone=447867472962912264 table_tombstone=0 safepoint=447867787011424256"] [thread_id=547]
[2024/02/21 09:06:17.074 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_102178 table_name=t_242598 database_tombstone=447867472962912264 table_tombstone=0 safepoint=447867787011424256"] [thread_id=547]
[2024/02/21 09:06:17.074 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_102178 table_name=t_244574 database_tombstone=447867472962912264 table_tombstone=0 safepoint=447867787011424256"] [thread_id=547]
[2024/02/21 09:06:17.074 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_102178 table_name=t_244959 database_tombstone=447867472962912264 table_tombstone=0 safepoint=447867787011424256"] [thread_id=547]
[2024/02/21 09:06:17.074 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_102178 table_name=t_266566 database_tombstone=447867472962912264 table_tombstone=0 safepoint=447867787011424256"] [thread_id=547]
[2024/02/21 09:06:17.074 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_102178 table_name=t_266569 database_tombstone=447867472962912264 table_tombstone=0 safepoint=447867787011424256"] [thread_id=547]
...
[2024/02/21 09:06:17.519 +07:00] [DEBUG] [Segment.h:564] ["Abandon segment, segment=<segment_id=4298 epoch=2 range=[150120936,9223372036854775807)>"] [source="keyspace=4294967295 table_id=102758 segment_id=4298 epoch=2"] [thread_id=547]
[2024/02/21 09:06:17.523 +07:00] [INFO] [Segment.cpp:2193] ["Finish segment drop its next segment, segment=<segment_id=3743 epoch=2 range=[149416825,149771152) next_segment_id=4358 delta_rows=0 delta_bytes=0 delta_deletes=0 stable_file=dmf_443 stable_rows=354327 stable_bytes=447953897 dmf_rows=354327 dmf_bytes=447953897 dmf_packs=44>"] [source="keyspace=4294967295 table_id=102758 segment_id=3743 epoch=2"] [thread_id=547]
[2024/02/21 09:06:17.523 +07:00] [DEBUG] [Segment.h:564] ["Abandon segment, segment=<segment_id=4358 epoch=3 range=[149771152,9223372036854775807)>"] [source="keyspace=4294967295 table_id=102758 segment_id=4358 epoch=3"] [thread_id=547]
[2024/02/21 09:06:17.527 +07:00] [INFO] [Segment.cpp:2193] ["Finish segment drop its next segment, segment=<segment_id=4355 epoch=1 range=[149066160,149416825) next_segment_id=3743 delta_rows=0 delta_bytes=0 delta_deletes=0 stable_file=dmf_8594 stable_rows=350665 stable_bytes=446287464 dmf_rows=350665 dmf_bytes=446287464 dmf_packs=44>"] [source="keyspace=4294967295 table_id=102758 segment_id=4355 epoch=1"] [thread_id=547]
[2024/02/21 09:06:17.527 +07:00] [DEBUG] [Segment.h:564] ["Abandon segment, segment=<segment_id=3743 epoch=3 range=[149416825,9223372036854775807)>"] [source="keyspace=4294967295 table_id=102758 segment_id=3743 epoch=3"] [thread_id=547]
[2024/02/21 09:06:17.530 +07:00] [INFO] [Segment.cpp:2193] ["Finish segment drop its next segment, segment=<segment_id=4349 epoch=2 range=[148724627,149066160) next_segment_id=4355 delta_rows=0 delta_bytes=0 delta_deletes=0 stable_file=dmf_8593 stable_rows=341533 stable_bytes=442448793 dmf_rows=341533 dmf_bytes=442448793 dmf_packs=43>"] [source="keyspace=4294967295 table_id=102758 segment_id=4349 epoch=2"] [thread_id=547]
...

and then

[2024/02/21 09:06:29.384 +07:00] [ERROR] [TiDBSchemaSyncer.cpp:252] ["Can not find related database_id and logical_table_id from table_id_map, sync table schema fail. physical_table_id=315048"] [source="keyspace=4294967295"] [thread_id=538]
[2024/02/21 09:06:29.401 +07:00] [ERROR] [TiDBSchemaSyncer.cpp:252] ["Can not find related database_id and logical_table_id from table_id_map, sync table schema fail. physical_table_id=315054"] [source="keyspace=4294967295"] [thread_id=536]
[2024/02/21 09:06:30.412 +07:00] [ERROR] [TiDBSchemaSyncer.cpp:252] ["Can not find related database_id and logical_table_id from table_id_map, sync table schema fail. physical_table_id=266566"] [source="keyspace=4294967295"] [thread_id=540]
...

As in the image below, the data on the 2 TiFlash nodes (10.0.0.4-5) is gone but the region size still does not decrease. All TiFlash queries at this time return inaccurate results, while TiKV queries are normal.
image
If I set the replicas to 0 and then back to 2 again, the TiFlash nodes become Disconnected and then Down with some out-of-sync errors.
Retrying the scale-in/scale-out for a 3rd attempt still hits the same issue.

2. What did you expect to see? (Required)

TiFlash should not lose data and should return correct query results.

3. What did you see instead (Required)

TiFlash data is out of sync with TiKV, leading to wrong TiFlash query results.

4. What is your TiFlash version? (Required)

v7.5.0

uzuki27 added the type/bug label Feb 21, 2024
JaySon-Huang (Contributor) commented Feb 21, 2024

Hi @uzuki27, thanks for reporting the problem.

  1. Is "1 of TiKV node evicted all leader region and then rebalance again" around 08:26 an expected behavior in your cluster?
  2. Do you see any TiFlash logging like "Database db_102178 dropped during ..."? Can you provide the logging between "Start to sync schemas" and "End sync schema" when that database was dropped in TiFlash?

uzuki27 (Author) commented Feb 22, 2024

Hi @JaySon-Huang, thanks for the reply.
I ran one more test last night, and the issue occurred again just a few minutes ago.

  1. Of course it's unexpected behavior, because every time a re-election occurs my cluster's write performance slows down. And I'm not sure whether the TiKV re-election is the cause of TiFlash dropping data.
  2. I think the data drop in TiFlash begins with this:
[2024/02/22 10:31:20.108 +07:00] [INFO] [SchemaSyncService.cpp:218] ["Schema GC begin, last_safepoint=447891268901011456 safepoint=447891772217491456"] [source="keyspace=4294967295"] [thread_id=593]

and then

[2024/02/22 10:31:20.108 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_102178 table_name=t_102758 database_tombstone=447891396407853074 table_tombstone=0 safepoint=447891772217491456"] [thread_id=593]
[2024/02/22 10:31:20.108 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_102178 table_name=t_109756 database_tombstone=447891396407853074 table_tombstone=0 safepoint=447891772217491456"] [thread_id=593]
[2024/02/22 10:31:20.108 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_102178 table_name=t_242598 database_tombstone=447891396407853074 table_tombstone=0 safepoint=447891772217491456"] [thread_id=593]
[2024/02/22 10:31:20.108 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_102178 table_name=t_244574 database_tombstone=447891396407853074 table_tombstone=0 safepoint=447891772217491456"] [thread_id=593]
[2024/02/22 10:31:20.108 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_102178 table_name=t_244959 database_tombstone=447891396407853074 table_tombstone=0 safepoint=447891772217491456"] [thread_id=593]
[2024/02/22 10:31:20.108 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_102178 table_name=t_266566 database_tombstone=447891396407853074 table_tombstone=0 safepoint=447891772217491456"] [thread_id=593]
...
[2024/02/22 10:31:20.121 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_74225 table_name=t_98395 database_tombstone=447891396420960286 table_tombstone=0 safepoint=447891772217491456"] [thread_id=593]
[2024/02/22 10:31:20.121 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_74225 table_name=t_98397 database_tombstone=447891396420960286 table_tombstone=0 safepoint=447891772217491456"] [thread_id=593]
[2024/02/22 10:31:20.121 +07:00] [INFO] [SchemaSyncService.cpp:256] ["Detect stale table, database_name=db_74225 table_name=t_98399 database_tombstone=447891396420960286 table_tombstone=0 safepoint=447891772217491456"] [thread_id=593]
[2024/02/22 10:31:20.121 +07:00] [INFO] [SchemaSyncService.cpp:295] ["Physically dropping table, table_tombstone=0 safepoint=447891772217491456 <some_db_name>.<some_tbl_name>, database_id=102178 table_id=102758"] [source="keyspace=4294967295"] [thread_id=593]
[2024/02/22 10:31:20.121 +07:00] [INFO] [DeltaMergeStore.cpp:403] ["Clear DeltaMerge segments data"] [source="keyspace=4294967295 table_id=102758"] [thread_id=593]
[2024/02/22 10:31:20.122 +07:00] [INFO] [Segment.cpp:2193] ["Finish segment drop its next segment, segment=<segment_id=1664 epoch=2 range=[176060708,176276910) next_segment_id=986 delta_rows=0 delta_bytes=0 delta_deletes=0 stable_file=dmf_3140 stable_rows=216202 stable_bytes=290189788 dmf_rows=216202 dmf_bytes=290189788 dmf_packs=28>"] [source="keyspace=4294967295 table_id=102758 segment_id=1664 epoch=2"] [thread_id=593]
[2024/02/22 10:31:20.122 +07:00] [DEBUG] [Segment.h:564] ["Abandon segment, segment=<segment_id=986 epoch=3 range=[176276910,9223372036854775807)>"] [source="keyspace=4294967295 table_id=102758 segment_id=986 epoch=3"] [thread_id=593]
[2024/02/22 10:31:20.130 +07:00] [INFO] [Segment.cpp:2193] ["Finish segment drop its next segment, segment=<segment_id=1538 epoch=2 range=[175706332,176060708) next_segment_id=1664 delta_rows=0 delta_bytes=0 delta_deletes=0 stable_file=dmf_3139 stable_rows=315104 stable_bytes=431113557 dmf_rows=315104 dmf_bytes=431113557 dmf_packs=40>"] [source="keyspace=4294967295 table_id=102758 segment_id=1538 epoch=2"] [thread_id=593]
[2024/02/22 10:31:20.130 +07:00] [DEBUG] [Segment.h:564] ["Abandon segment, segment=<segment_id=1664 epoch=3 range=[176060708,9223372036854775807)>"] [source="keyspace=4294967295 table_id=102758 segment_id=1664 epoch=3"] [thread_id=593]
[2024/02/22 10:31:20.135 +07:00] [INFO] [Segment.cpp:2193] ["Finish segment drop its next segment, segment=<segment_id=1793 epoch=2 range=[175386040,175706332) next_segment_id=1538 delta_rows=0 delta_bytes=0 delta_deletes=0 stable_file=dmf_3138 stable_rows=320292 stable_bytes=433465766 dmf_rows=320292 dmf_bytes=433465766 dmf_packs=41>"] [source="keyspace=4294967295 table_id=102758 segment_id=1793 epoch=2"] [thread_id=593]
...
[2024/02/22 10:31:22.418 +07:00] [INFO] [Segment.cpp:2193] ["Finish segment drop its next segment, segment=<segment_id=3041 epoch=1 range=[2906537,3257240) next_segment_id=2537 delta_rows=0 delta_bytes=0 delta_deletes=0 stable_file=dmf_7048 stable_rows=350703 stable_bytes=445345327 dmf_rows=350703 dmf_bytes=445345327 dmf_packs=45>"] [source="keyspace=4294967295 table_id=102758 segment_id=3041 epoch=1"] [thread_id=593]
[2024/02/22 10:31:22.418 +07:00] [DEBUG] [Segment.h:564] ["Abandon segment, segment=<segment_id=2537 epoch=2 range=[3257240,9223372036854775807)>"] [source="keyspace=4294967295 table_id=102758 segment_id=2537 epoch=2"] [thread_id=593]
[2024/02/22 10:31:22.420 +07:00] [INFO] [ApplySnapshot.cpp:297] ["Begin apply snapshot, new_region=[region_id=976810142 applied_term=268 applied_index=948]"] [thread_id=566]
[2024/02/22 10:31:22.420 +07:00] [INFO] [DeltaMergeStore_Ingest.cpp:711] ["Table ingest files - begin, use_split_replace=false files=[] rows=0 bytes=0 bytes_on_disk=0 range=[7689365420,7689598497) clear=true"] [source="keyspace=4294967295 table_id=43607"] [thread_id=566]
[2024/02/22 10:31:22.420 +07:00] [INFO] [DeltaMergeStore_Ingest.cpp:801] ["Table ingest files - finished ingested files into segments, ingested_files=[] updated_segments=[<segment_id=1 epoch=65 range=[-9223372036854775808,10372977084)>] clear=true"] [source="keyspace=4294967295 table_id=43607"] [thread_id=566]
...
[2024/02/22 10:31:55.553 +07:00] [INFO] [PathPool.cpp:284] ["Begin to drop [dir=/data/tidb/tidb_real/tidb-data/tiflash-9000/data/t_442064] from main_path_infos"] [thread_id=593]
[2024/02/22 10:31:55.553 +07:00] [INFO] [DeltaMergeStore.cpp:423] ["Drop DeltaMerge done"] [source="keyspace=4294967295 table_id=442064"] [thread_id=593]
[2024/02/22 10:31:55.553 +07:00] [INFO] [RegionTable.cpp:137] ["remove table from RegionTable success, keyspace=4294967295 table_id=442064"] [thread_id=593]
[2024/02/22 10:31:55.553 +07:00] [INFO] [SchemaSyncService.cpp:306] ["Physically dropped table sharing_pro_da.pmt_users_history, database_id=70637 table_id=442064"] [source="keyspace=4294967295"] [thread_id=593]
[2024/02/22 10:31:55.553 +07:00] [INFO] [DeltaMergeStore.cpp:312] ["Release DeltaMerge Store start"] [source="keyspace=4294967295 table_id=442064"] [thread_id=593]
[2024/02/22 10:31:55.553 +07:00] [INFO] [DeltaMergeStore.cpp:316] ["Release DeltaMerge Store end"] [source="keyspace=4294967295 table_id=442064"] [thread_id=593]
[2024/02/22 10:31:55.553 +07:00] [INFO] [SchemaSyncService.cpp:295] ["Physically dropping table, table_tombstone=0 safepoint=447891772217491456 sharing_pro_da.vga_providers_legacy, database_id=70637 table_id=448049"] [source="keyspace=4294967295"] [thread_id=593]

and then the database drop:

[2024/02/22 10:31:56.368 +07:00] [INFO] [SchemaSyncService.cpp:350] ["Physically dropping database, database_tombstone=447891396407853074 db_102178"] [source="keyspace=4294967295"] [thread_id=593]
[2024/02/22 10:31:56.368 +07:00] [INFO] [SchemaSyncService.cpp:360] ["Physically dropped database db_102178, safepoint=447891772217491456"] [source="keyspace=4294967295"] [thread_id=593]
[2024/02/22 10:31:56.368 +07:00] [INFO] [SchemaSyncService.cpp:350] ["Physically dropping database, database_tombstone=447891396407853075 db_103491"] [source="keyspace=4294967295"] [thread_id=593]
[2024/02/22 10:31:56.368 +07:00] [INFO] [SchemaSyncService.cpp:360] ["Physically dropped database db_103491, safepoint=447891772217491456"] [source="keyspace=4294967295"] [thread_id=593]
[2024/02/22 10:31:56.368 +07:00] [INFO] [SchemaSyncService.cpp:350] ["Physically dropping database, database_tombstone=447891396407853076 db_152926"] [source="keyspace=4294967295"] [thread_id=593]
[2024/02/22 10:31:56.368 +07:00] [INFO] [SchemaSyncService.cpp:360] ["Physically dropped database db_152926, safepoint=447891772217491456"] [source="keyspace=4294967295"] [thread_id=593]
[2024/02/22 10:31:56.368 +07:00] [INFO] [SchemaSyncService.cpp:350] ["Physically dropping database, database_tombstone=447891396407853077 db_170962"] [source="keyspace=4294967295"] [thread_id=593]
[2024/02/22 10:31:56.368 +07:00] [INFO] [SchemaSyncService.cpp:360] ["Physically dropped database db_170962, safepoint=447891772217491456"] [source="keyspace=4294967295"] [thread_id=593] 
[2024/02/22 10:31:56.368 +07:00] [INFO] [SchemaSyncService.cpp:350] ["Physically dropping database, database_tombstone=447891396407853078 db_192148"] [source="keyspace=4294967295"] [thread_id=593]
[2024/02/22 10:31:56.368 +07:00] [INFO] [SchemaSyncService.cpp:360] ["Physically dropped database db_192148, safepoint=447891772217491456"] [source="keyspace=4294967295"] [thread_id=593]
...
[2024/02/22 10:31:56.371 +07:00] [INFO] [SchemaSyncService.cpp:350] ["Physically dropping database, database_tombstone=447891396420960286 db_74225"] [source="keyspace=4294967295"] [thread_id=593]
[2024/02/22 10:31:56.371 +07:00] [INFO] [SchemaSyncService.cpp:360] ["Physically dropped database db_74225, safepoint=447891772217491456"] [source="keyspace=4294967295"] [thread_id=593]

ending with:

[2024/02/22 10:31:56.371 +07:00] [INFO] [SchemaSyncService.cpp:383] ["Schema GC done, tables_removed=1859 databases_removed=49 safepoint=447891772217491456"] [source="keyspace=4294967295"] [thread_id=593]

After that come the ERRORs I posted above; my TiFlash data is gone although its regions are still there.

[2024/02/22 10:31:56.512 +07:00] [WARN] [TiDBSchemaSyncer.cpp:239] ["Can not find related database_id and logical_table_id from table_id_map, try to syncSchemas. physical_table_id=315054"] [source="keyspace=4294967295"] [thread_id=583] 
[2024/02/22 10:31:56.512 +07:00] [ERROR] [TiDBSchemaSyncer.cpp:252] ["Can not find related database_id and logical_table_id from table_id_map, sync table schema fail. physical_table_id=315054"] [source="keyspace=4294967295"] [thread_id=583]
[2024/02/22 10:31:56.512 +07:00] [INFO] [PartitionStreams.cpp:211] ["sync schema cost 0 ms, keyspace=4294967295 table_id=315054"] [thread_id=583]
[2024/02/22 10:31:56.527 +07:00] [INFO] [TiDBSchemaSyncer.cpp:226] ["Sync table schema begin, table_id=315048"] [source="keyspace=4294967295"] [thread_id=579]
[2024/02/22 10:31:56.527 +07:00] [WARN] [TiDBSchemaSyncer.cpp:239] ["Can not find related database_id and logical_table_id from table_id_map, try to syncSchemas. physical_table_id=315048"] [source="keyspace=4294967295"] [thread_id=579] 
[2024/02/22 10:31:56.528 +07:00] [ERROR] [TiDBSchemaSyncer.cpp:252] ["Can not find related database_id and logical_table_id from table_id_map, sync table schema fail. physical_table_id=315048"] [source="keyspace=4294967295"] [thread_id=579]
[2024/02/22 10:31:56.528 +07:00] [INFO] [PartitionStreams.cpp:211] ["sync schema cost 0 ms, keyspace=4294967295 table_id=315048"] [thread_id=579]
[2024/02/22 10:31:56.591 +07:00] [INFO] [TiDBSchemaSyncer.cpp:226] ["Sync table schema begin, table_id=418701"] [source="keyspace=4294967295"] [thread_id=585]

JaySon-Huang (Contributor) commented:

@uzuki27 The logging you posted here is physically dropping data for those given databases. Before that, those databases should have been marked as "tombstone" around 2024/02/22 09:48 +07:00 (from the logging database_tombstone=447891396407853076).
I'm wondering why those databases became "tombstone" at that time. There should also be some logging like "Database xxx dropped during ...". Can you post the logging around that time so I can check what happened?
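(Side note for readers: the database_tombstone value is a TiDB TSO, whose high bits carry the physical wall-clock time in milliseconds above an 18-bit logical counter; that is how a value like 447891396407853076 maps to roughly 2024/02/22 09:48 +07:00. A minimal decoding sketch in C++, independent of any TiDB/TiFlash source file:)

#include <cstdint>
#include <cstdio>

int main()
{
    // database_tombstone value quoted above; the low 18 bits of a TSO are a
    // logical counter, the high bits are physical milliseconds since the Unix epoch.
    const uint64_t tso = 447891396407853076ULL;
    const uint64_t physical_ms = tso >> 18;
    std::printf("unix seconds: %llu\n", static_cast<unsigned long long>(physical_ms / 1000));
    // For this value the result lands around 2024/02/22 09:48 +07:00, matching the estimate above.
    return 0;
}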

uzuki27 (Author) commented Feb 22, 2024

@JaySon-Huang, I attached the log around that time, and as you said, it marked some databases as tombstone and has the "Database xxx dropped during ..." logging. A few rows above that, we have

[2024/02/22 09:48:05.202 +07:00] [INFO] [SchemaBuilder.cpp:1316] ["Table <table_name> syncing during sync all schemas, database_id=425722 table_id=459949"] [source="keyspace=4294967295"] [thread_id=419]
[2024/02/22 09:48:05.202 +07:00] [DEBUG] [SchemaBuilder.cpp:1331] ["register table to table_id_map, database_id=425722 table_id=459949"] [source="keyspace=4294967295"] [thread_id=419]
[2024/02/22 09:48:05.202 +07:00] [INFO] [SchemaBuilder.cpp:1099] ["Create table <table_name> begin, database_id=425722, table_id=459949"] [source="keyspace=4294967295"] [thread_id=419]
[2024/02/22 09:48:05.202 +07:00] [DEBUG] [SchemaBuilder.cpp:1114] ["Trying to create table <table_name>, but it already exists and is not marked as tombstone, database_id=425722 table_id=459949"] [source="keyspace=4294967295"] [thread_id=419]
...

but the tables in this list include tables that I haven't configured to have a TiFlash replica.
tiflash.log.gz

uzuki27 (Author) commented Feb 23, 2024

@JaySon-Huang I just realized that the TiFlash schema re-sync begins with

[2024/02/22 09:48:04.900 +07:00] [INFO] [TiDBSchemaSyncer.cpp:72] ["Start to sync schemas. current version is: 762861 and try to sync schema version to: 762867"] [source="keyspace=4294967295"] [thread_id=588]
[2024/02/22 09:48:04.900 +07:00] [WARN] [SchemaGetter.cpp:251] ["The schema diff is empty, schema_version=762862 key=Diff:762862"] [thread_id=588]
[2024/02/22 09:48:04.900 +07:00] [INFO] [TiDBSchemaSyncer.cpp:132] ["Meets a schema diff with regenerate_schema_map flag"] [source="keyspace=4294967295"] [thread_id=588]
[2024/02/22 09:48:04.901 +07:00] [INFO] [SchemaBuilder.cpp:1272] ["Sync all schemas begin"] [source="keyspace=4294967295"] [thread_id=588]

and it is around the time that the disconnected (evicted) TiKV node began to be re-elected leader for regions again.

JaySon-Huang (Contributor) commented Feb 23, 2024

@uzuki27 Is there any special task like FLASHBACK CLUSTER or br restore that was executed around the time TiFlash ran into "Meets a schema diff with regenerate_schema_map flag"?

uzuki27 (Author) commented Feb 23, 2024

@JaySon-Huang I don't think there was any triggering task, because each time the TiFlash table drop occurred it was at a different time of day. I only have a daily br backup job at 2:30 AM (+07:00). I will continue investigating the logs around that time and will send them to you if I find anything suspicious.
But could you explain to me what "Meets a schema diff with regenerate_schema_map flag" means? And when does it happen in a normal state?

CalvinNeo (Member) commented Feb 23, 2024

Hi, @uzuki27. The "Meets a schema diff with regenerate_schema_map flag" message means the regenerate_schema_map field in a message that has been sent to TiFlash is set to true. The message usually contains information about schema diffs.

In most cases, regenerate_schema_map is false. Only in a few cases, such as br restore and FLASHBACK CLUSTER, will regenerate_schema_map be set to true, indicating that we have to perform some "rebuilding schema map" work. That's why we want to know whether such an operation had taken place.
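(To illustrate the mechanism CalvinNeo describes, here is a minimal C++ sketch with hypothetical names, not the actual TiFlash implementation: schema diffs are normally applied one by one, and a diff carrying regenerate_schema_map == true makes the syncer fall back to a full syncAllSchema pass.)

#include <vector>

struct SchemaDiff
{
    long long version = 0;
    bool regenerate_schema_map = false; // only set by operations like FLASHBACK CLUSTER / br restore
};

// Returns true when a full "sync all schemas" rebuild is required.
bool applyDiffs(const std::vector<SchemaDiff> & diffs)
{
    for (const auto & diff : diffs)
    {
        if (diff.regenerate_schema_map)
            return true; // caller falls back to syncAllSchema()
        // ... otherwise apply this single diff incrementally ...
    }
    return false;
}

int main()
{
    std::vector<SchemaDiff> diffs(2);
    diffs[1].regenerate_schema_map = true; // e.g. produced by FLASHBACK CLUSTER
    return applyDiffs(diffs) ? 0 : 1;
}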

uzuki27 (Author) commented Feb 23, 2024

@CalvinNeo I see. Then it's really strange, because I never intended to run any task like that.

JaySon-Huang (Contributor) commented Feb 23, 2024

@uzuki27 I still haven't figured out why TiFlash meets regenerate_schema_map == true in your env, but I have found the reason why your databases were dropped unexpectedly.

Normally, after users execute DDL statements, the TiDB server will generate some "SchemaDiff" records to indicate what happened, and TiFlash will follow the "SchemaDiff" to figure out which DDL operations need to be executed.
But when there is a special DDL like FLASHBACK CLUSTER (which generates regenerate_schema_map == true), or the network to the TiKV region leader is not stable (like the region leader drop you reported), TiFlash will list all existing databases and tables in the cluster and enter syncAllSchema to check them one by one and execute the proper DDL on them.

However, in the DDL framework refactor since v7.2 (#7437), there is a mistake in the implementation. In the following code, only newly created databases that do not already exist locally in TiFlash are put into created_db_set. In line 1383, if a database does not exist in created_db_set, then the database will be marked as tombstone unexpectedly. And after the gc_safepoint advances, the data of the TiFlash replica gets physically removed unexpectedly. :(

template <typename Getter, typename NameMapper>
void SchemaBuilder<Getter, NameMapper>::syncAllSchema()
{
    LOG_INFO(log, "Sync all schemas begin");

    /// Create all databases.
    std::vector<DBInfoPtr> all_schemas = getter.listDBs();

    std::unordered_set<String> created_db_set;

    //We can't use too large default_num_threads, otherwise, the lock grabbing time will be too much.
    size_t default_num_threads = std::max(4UL, std::thread::hardware_concurrency());
    auto sync_all_schema_thread_pool
        = ThreadPool(default_num_threads, default_num_threads / 2, default_num_threads * 2);
    auto sync_all_schema_wait_group = sync_all_schema_thread_pool.waitGroup();

    std::mutex created_db_set_mutex;
    for (const auto & db : all_schemas)
    {
        auto task = [this, db, &created_db_set, &created_db_set_mutex] {
            {
                std::shared_lock<std::shared_mutex> shared_lock(shared_mutex_for_databases);
                if (databases.find(db->id) == databases.end())
                {
                    shared_lock.unlock();
                    applyCreateSchema(db);
                    {
                        std::unique_lock<std::mutex> created_db_set_lock(created_db_set_mutex);
                        created_db_set.emplace(name_mapper.mapDatabaseName(*db));
                    }
                    LOG_INFO(
                        log,
                        "Database {} created during sync all schemas, database_id={}",
                        name_mapper.debugDatabaseName(*db),
                        db->id);
                }
            }

            std::vector<TableInfoPtr> tables = getter.listTables(db->id);
            for (auto & table : tables)
            {
                LOG_INFO(
                    log,
                    "Table {} syncing during sync all schemas, database_id={} table_id={}",
                    name_mapper.debugCanonicalName(*db, *table),
                    db->id,
                    table->id);

                /// Ignore view and sequence.
                if (table->is_view || table->is_sequence)
                {
                    LOG_INFO(
                        log,
                        "Table {} is a view or sequence, ignoring. database_id={} table_id={}",
                        name_mapper.debugCanonicalName(*db, *table),
                        db->id,
                        table->id);
                    continue;
                }

                table_id_map.emplaceTableID(table->id, db->id);
                LOG_DEBUG(log, "register table to table_id_map, database_id={} table_id={}", db->id, table->id);

                applyCreateStorageInstance(db, table);
                if (table->isLogicalPartitionTable())
                {
                    for (const auto & part_def : table->partition.definitions)
                    {
                        LOG_DEBUG(
                            log,
                            "register table to table_id_map for partition table, logical_table_id={} "
                            "physical_table_id={}",
                            table->id,
                            part_def.id);
                        table_id_map.emplacePartitionTableID(part_def.id, table->id);
                    }
                }
            }
        };
        sync_all_schema_wait_group->schedule(task);
    }
    sync_all_schema_wait_group->wait();

    // TODO:can be removed if we don't save the .sql
    /// Drop all unmapped tables.
    auto storage_map = context.getTMTContext().getStorages().getAllStorage();
    for (auto it = storage_map.begin(); it != storage_map.end(); it++)
    {
        const auto & table_info = it->second->getTableInfo();
        if (table_info.keyspace_id != keyspace_id)
            continue;
        if (!table_id_map.tableIDInTwoMaps(table_info.id))
        {
            applyDropPhysicalTable(it->second->getDatabaseName(), table_info.id);
            LOG_INFO(
                log,
                "Table {}.{} dropped during sync all schemas, table_id={}",
                it->second->getDatabaseName(),
                name_mapper.debugTableName(table_info),
                table_info.id);
        }
    }

    /// Drop all unmapped dbs.
    const auto & dbs = context.getDatabases();
    for (auto it = dbs.begin(); it != dbs.end(); it++)
    {
        auto db_keyspace_id = SchemaNameMapper::getMappedNameKeyspaceID(it->first);
        if (db_keyspace_id != keyspace_id)
        {
            continue;
        }
        if (created_db_set.count(it->first) == 0 && !isReservedDatabase(context, it->first))
        {
            applyDropSchema(it->first);
            LOG_INFO(log, "Database {} dropped during sync all schemas", it->first);
        }
    }

    LOG_INFO(log, "Sync all schemas end");
}

JaySon-Huang (Contributor) commented:

@uzuki27 We will fix this issue ASAP and sorry for any negative impact it has caused you.

uzuki27 (Author) commented Feb 23, 2024

Thank you for your enthusiastic support. I'm really looking forward to it and will gladly test it.

JaySon-Huang (Contributor) commented Feb 23, 2024

@uzuki27 Hi, I still want to figure out why TiFlash meets regenerate_schema_map == true in your env. Was point-in-time recovery (PITR) run on the cluster? Specifically, br restore point.

Because I found that an unstable network between TiFlash and TiKV does not make TiFlash run into syncAllSchema.

uzuki27 (Author) commented Feb 23, 2024

@JaySon-Huang I have just scanned all nodes of the working cluster and haven't found any job or task that executes br restore. I only have the br backup job and have never tried to restore with br restore yet. I will investigate more activity around the time the regenerate_schema_map flag logging occurred.

JaySon-Huang (Contributor) commented:

@uzuki27 Got it. Thanks.

uzuki27 (Author) commented Feb 26, 2024

@JaySon-Huang After reproducing the issue twice more, I can tell that the regenerate_schema_map flag logging happens around the time the TiKV rebalancing occurred. I'm tracing why my TiKV gets rebalanced when some large batch processes from the Data team write into it. They said these jobs were already running before my upgrade, and I didn't see any rebalancing on TiKV at those times.

ti-chi-bot bot pushed a commit that referenced this issue Feb 26, 2024
JaySon-Huang (Contributor) commented:

@uzuki27 It would be helpful if you could provide the DDL actions on TiDB when TiFlash receives the regenerate_schema_map flag.

For example, the following TiFlash logging shows that the regenerate_schema_map flag happens between schema versions 53 and 55.

[2024/02/26 15:26:50.820 +08:00] [INFO] [TiDBSchemaSyncer.cpp:72] ["Start to sync schemas. current version is: 53 and try to sync schema version to: 55"] [source="keyspace=4294967295"] [thread_id=289]
[2024/02/26 15:26:50.820 +08:00] [INFO] [SchemaBuilder.cpp:460] ["Ignore change type: ActionFlashbackCluster, diff_version=54"] [source="keyspace=4294967295"] [thread_id=289]
[2024/02/26 15:26:50.821 +08:00] [INFO] [TiDBSchemaSyncer.cpp:132] ["Meets a schema diff with regenerate_schema_map flag"] [source="keyspace=4294967295"] [thread_id=289]
[2024/02/26 15:26:50.821 +08:00] [INFO] [SchemaBuilder.cpp:1272] ["Sync all schemas begin"] [source="keyspace=4294967295"] [thread_id=289]
...
[2024/02/26 15:26:50.834 +08:00] [INFO] [SchemaBuilder.cpp:1386] ["Database db_2 dropped during sync all schemas"] [source="keyspace=4294967295"] [thread_id=289]
[2024/02/26 15:26:50.834 +08:00] [INFO] [SchemaBuilder.cpp:1390] ["Sync all schemas end"] [source="keyspace=4294967295"] [thread_id=289]
[2024/02/26 15:26:50.834 +08:00] [INFO] [TiDBSchemaSyncer.cpp:106] ["End sync schema, version has been updated to 55"] [source="keyspace=4294967295"] [thread_id=289]

Then you can filter which DDL was executed in the TiDB logging with the following keywords:
grep -e 'CRUCIAL OPERATION' -e 'ddl_worker' -e 'SchemaVersion' tidb.log

For example, the following logging shows what DDL was executed between currentSchemaVersion 53 and 54 in my reproduction env:

[2024/02/26 15:26:22.399 +08:00] [INFO] [ddl_worker.go:252] ["add DDL jobs"] [category=ddl] ["batch count"=1] [jobs="ID:106, Type:flashback cluster, State:queueing, SchemaState:none, SchemaID:0, TableID:0, RowCount:0, ArgLen:10, start time: 2024-02-26 15:26:22.382 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0; "] [table=true]
[2024/02/26 15:26:22.406 +08:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 12, tp general"] [category=ddl] [jobID=106] [conn=3133145092] [category=ddl] [job="ID:106, Type:flashback cluster, State:queueing, SchemaState:none, SchemaID:0, TableID:0, RowCount:0, ArgLen:0, start time: 2024-02-26 15:26:22.382 +0800 CST, Err:<nil>, ErrCount:0, Snapshot
Version:0"]
[2024/02/26 15:26:22.408 +08:00] [INFO] [ddl_worker.go:1196] ["schema version doesn't change"] [category=ddl]
[2024/02/26 15:26:22.412 +08:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 12, tp general"] [category=ddl] [jobID=106] [conn=3133145092] [category=ddl] [job="ID:106, Type:flashback cluster, State:running, SchemaState:delete only, SchemaID:0, TableID:0, RowCount:0, ArgLen:0, start time: 2024-02-26 15:26:22.382 +0800 CST, Err:<nil>, ErrCount:0, Sn
apshotVersion:0"]
[2024/02/26 15:26:22.456 +08:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=53] [neededSchemaVersion=54] ["start time"=392.837µs] [gotSchemaVersion=54] [phyTblIDs="[-1]"] [actionTypes="[62]"]
[2024/02/26 15:26:22.505 +08:00] [INFO] [ddl_worker.go:1219] ["wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [category=ddl] [ver=54] ["take time"=51.782226ms] [job="ID:106, Type:flashback cluster, State:running, SchemaState:write only, SchemaID:0, TableID:0, RowCount:0, ArgLen:10, start time: 2024-02-26 15:26:22
.382 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/26 15:26:22.512 +08:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 12, tp general"] [category=ddl] [jobID=106] [conn=3133145092] [category=ddl] [job="ID:106, Type:flashback cluster, State:running, SchemaState:write only, SchemaID:0, TableID:0, RowCount:0, ArgLen:0, start time: 2024-02-26 15:26:22.382 +0800 CST, Err:<nil>, ErrCount:0, Sna
pshotVersion:0"]
[2024/02/26 15:26:22.726 +08:00] [INFO] [ddl_worker.go:1196] ["schema version doesn't change"] [category=ddl]
[2024/02/26 15:26:22.731 +08:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 12, tp general"] [category=ddl] [jobID=106] [conn=3133145092] [category=ddl] [job="ID:106, Type:flashback cluster, State:running, SchemaState:write reorganization, SchemaID:0, TableID:0, RowCount:0, ArgLen:0, start time: 2024-02-26 15:26:22.382 +0800 CST, Err:<nil>, ErrCo
unt:0, SnapshotVersion:0"]
[2024/02/26 15:26:22.862 +08:00] [INFO] [domain.go:295] ["full load InfoSchema success"] [currentSchemaVersion=54] [neededSchemaVersion=55] ["start time"=12.553217ms]
[2024/02/26 15:26:22.900 +08:00] [INFO] [ddl_worker.go:1219] ["wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [category=ddl] [ver=55] ["take time"=52.30033ms] [job="ID:106, Type:flashback cluster, State:done, SchemaState:public, SchemaID:0, TableID:0, RowCount:0, ArgLen:10, start time: 2024-02-26 15:26:22.382 +08
00 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/26 15:26:22.938 +08:00] [INFO] [ddl_worker.go:609] ["finish DDL job"] [worker="worker 12, tp general"] [category=ddl] [jobID=106] [conn=3133145092] [job="ID:106, Type:flashback cluster, State:synced, SchemaState:public, SchemaID:0, TableID:0, RowCount:0, ArgLen:9, start time: 2024-02-26 15:26:22.382 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]

uzuki27 (Author) commented Feb 26, 2024

@JaySon-Huang Then, back to this TiFlash log:

[2024/02/22 09:48:04.900 +07:00] [INFO] [TiDBSchemaSyncer.cpp:72] ["Start to sync schemas. current version is: 762861 and try to sync schema version to: 762867"] [source="keyspace=4294967295"] [thread_id=588]
[2024/02/22 09:48:04.900 +07:00] [WARN] [SchemaGetter.cpp:251] ["The schema diff is empty, schema_version=762862 key=Diff:762862"] [thread_id=588]
[2024/02/22 09:48:04.900 +07:00] [INFO] [TiDBSchemaSyncer.cpp:132] ["Meets a schema diff with regenerate_schema_map flag"] [source="keyspace=4294967295"] [thread_id=588]
[2024/02/22 09:48:04.901 +07:00] [INFO] [SchemaBuilder.cpp:1272] ["Sync all schemas begin"] [source="keyspace=4294967295"] [thread_id=588]

The version is between 762861 and 762867, and this is the log with that grep on my 2 TiDB nodes between those versions.

node1:

[2024/02/22 09:46:43.386 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762859] [neededSchemaVersion=762860] ["start time"=771.538µs] [gotSchemaVersion=762859] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/22 09:47:05.861 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762859] [neededSchemaVersion=762861] ["start time"=1.465048ms] [gotSchemaVersion=762859] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/22 09:47:30.609 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762859] [neededSchemaVersion=762863] ["start time"=2.370861ms] [gotSchemaVersion=762859] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/22 09:47:40.591 +07:00] [INFO] [session.go:3916] ["CRUCIAL OPERATION"] [conn=1389764008] [schemaVersion=762859] [cur_db=um] [sql="truncate <db_name>.<table_name>"] [user=<user_name>]
[2024/02/22 09:47:42.397 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762859] [neededSchemaVersion=762865] ["start time"=5.439588ms] [gotSchemaVersion=762865] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/22 09:47:42.977 +07:00] [INFO] [ddl_worker.go:252] ["add DDL jobs"] [category=ddl] ["batch count"=1] [jobs="ID:465134, Type:truncate table, State:queueing, SchemaState:none, SchemaID:398143, TableID:464929, RowCount:0, ArgLen:3, start time: 2024-02-22 09:47:40.924 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0; "] [table=true]
[2024/02/22 09:47:44.079 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762865] [neededSchemaVersion=762866] ["start time"=2.128014ms] [gotSchemaVersion=762866] [phyTblIDs="[464929,465133]"] [actionTypes="[11,11]"]
[2024/02/22 09:47:56.606 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762866] [neededSchemaVersion=762867] ["start time"=2.728088ms] [gotSchemaVersion=762867] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/22 09:49:07.770 +07:00] [INFO] [session.go:3916] ["CRUCIAL OPERATION"] [conn=1389781290] [schemaVersion=762867] [cur_db=um] [sql="truncate <db_name>.<table_name>"] [user=<user_name>]
[2024/02/22 09:49:07.957 +07:00] [INFO] [ddl_worker.go:252] ["add DDL jobs"] [category=ddl] ["batch count"=1] [jobs="ID:465137, Type:truncate table, State:queueing, SchemaState:none, SchemaID:398143, TableID:465122, RowCount:0, ArgLen:3, start time: 2024-02-22 09:49:07.824 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0; "] [table=true]
[2024/02/22 09:49:08.319 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762867] [neededSchemaVersion=762868] ["start time"=2.220342ms] [gotSchemaVersion=762868] [phyTblIDs="[465122,465136]"] [actionTypes="[11,11]"]

and node2:

[2024/02/22 09:46:31.991 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762859] [neededSchemaVersion=762860] ["start time"=529.595µs] [gotSchemaVersion=762859] [phyTblIDs="[]"] [actionTypes="[]"]                                                                                                                                                                                            
[2024/02/22 09:46:51.778 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762859] [neededSchemaVersion=762860] ["start time"=614.656µs] [gotSchemaVersion=762859] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/22 09:47:01.126 +07:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=465132] [category=ddl] [job="ID:465132, Type:update tiflash replica status, State:queueing, SchemaState:none, SchemaID:398143, TableID:465128, RowCount:0, ArgLen:0, start time: 2024-02-22 09:44:08.725 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/22 09:47:10.986 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762859] [neededSchemaVersion=762862] ["start time"=3.562782ms] [gotSchemaVersion=762859] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/22 09:47:18.399 +07:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=465132] [category=ddl] [job="ID:465132, Type:update tiflash replica status, State:queueing, SchemaState:none, SchemaID:398143, TableID:465128, RowCount:0, ArgLen:0, start time: 2024-02-22 09:44:08.725 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/22 09:47:32.107 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762859] [neededSchemaVersion=762863] ["start time"=2.058379ms] [gotSchemaVersion=762859] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/22 09:47:34.772 +07:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=465132] [category=ddl] [job="ID:465132, Type:update tiflash replica status, State:queueing, SchemaState:none, SchemaID:398143, TableID:465128, RowCount:0, ArgLen:0, start time: 2024-02-22 09:44:08.725 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/22 09:47:42.380 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762859] [neededSchemaVersion=762865] ["start time"=4.875809ms] [gotSchemaVersion=762865] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/22 09:47:42.398 +07:00] [INFO] [ddl_worker.go:1219] ["wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [category=ddl] [ver=762865] ["take time"=1.454231124s] [job="ID:465132, Type:update tiflash replica status, State:done, SchemaState:public, SchemaID:398143, TableID:465128, RowCount:0, ArgLen:2, start time: 2024-02-22 09:44:08.725 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/22 09:47:42.409 +07:00] [INFO] [ddl_worker.go:609] ["finish DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=465132] [job="ID:465132, Type:update tiflash replica status, State:synced, SchemaState:public, SchemaID:398143, TableID:465128, RowCount:0, ArgLen:0, start time: 2024-02-22 09:44:08.725 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/22 09:47:42.981 +07:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=465134] [conn=1389764008] [category=ddl] [job="ID:465134, Type:truncate table, State:queueing, SchemaState:none, SchemaID:398143, TableID:464929, RowCount:0, ArgLen:0, start time: 2024-02-22 09:47:40.924 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/22 09:47:44.935 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762865] [neededSchemaVersion=762866] ["start time"=2.490681ms] [gotSchemaVersion=762866] [phyTblIDs="[464929,465133]"] [actionTypes="[11,11]"]
[2024/02/22 09:47:45.449 +07:00] [INFO] [ddl_worker.go:1219] ["wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [category=ddl] [ver=762866] ["take time"=1.68671685s] [job="ID:465134, Type:truncate table, State:done, SchemaState:public, SchemaID:398143, TableID:464929, RowCount:0, ArgLen:2, start time: 2024-02-22 09:47:40.924 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/22 09:47:45.606 +07:00] [INFO] [ddl_worker.go:609] ["finish DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=465134] [conn=1389764008] [job="ID:465134, Type:truncate table, State:synced, SchemaState:public, SchemaID:398143, TableID:464929, RowCount:0, ArgLen:2, start time: 2024-02-22 09:47:40.924 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/22 09:47:55.900 +07:00] [INFO] [ddl_worker.go:252] ["add DDL jobs"] [category=ddl] ["batch count"=1] [jobs="ID:465135, Type:update tiflash replica status, State:queueing, SchemaState:none, SchemaID:70637, TableID:465130, RowCount:0, ArgLen:2, start time: 2024-02-22 09:47:55.724 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0; "] [table=true]
[2024/02/22 09:47:56.092 +07:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=465135] [category=ddl] [job="ID:465135, Type:update tiflash replica status, State:queueing, SchemaState:none, SchemaID:70637, TableID:465130, RowCount:0, ArgLen:0, start time: 2024-02-22 09:47:55.724 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/22 09:47:56.587 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762866] [neededSchemaVersion=762867] ["start time"=3.23178ms] [gotSchemaVersion=762867] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/22 09:49:07.985 +07:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=465137] [conn=1389781290] [category=ddl] [job="ID:465137, Type:truncate table, State:queueing, SchemaState:none, SchemaID:398143, TableID:465122, RowCount:0, ArgLen:0, start time: 2024-02-22 09:49:07.824 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/22 09:49:08.298 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=762867] [neededSchemaVersion=762868] ["start time"=1.886606ms] [gotSchemaVersion=762868] [phyTblIDs="[465122,465136]"] [actionTypes="[11,11]"]
[2024/02/22 09:49:08.339 +07:00] [INFO] [ddl_worker.go:1219] ["wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [category=ddl] [ver=762868] ["take time"=52.764014ms] [job="ID:465137, Type:truncate table, State:done, SchemaState:public, SchemaID:398143, TableID:465122, RowCount:0, ArgLen:2, start time: 2024-02-22 09:49:07.824 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]

uzuki27 (Author) commented Feb 26, 2024

another attempt:
tiflash flag:

[2024/02/23 10:04:16.768 +07:00] [INFO] [TiDBSchemaSyncer.cpp:72] ["Start to sync schemas. current version is: 765781 and try to sync schema version to: 765785"] [source="keyspace=4294967295"] [thread_id=549]
[2024/02/23 10:04:16.768 +07:00] [WARN] [SchemaGetter.cpp:251] ["The schema diff is empty, schema_version=765782 key=Diff:765782"] [thread_id=549]
[2024/02/23 10:04:16.768 +07:00] [INFO] [TiDBSchemaSyncer.cpp:132] ["Meets a schema diff with regenerate_schema_map flag"] [source="keyspace=4294967295"] [thread_id=549]
[2024/02/23 10:04:16.768 +07:00] [INFO] [SchemaBuilder.cpp:1272] ["Sync all schemas begin"] [source="keyspace=4294967295"] [thread_id=549]

tidb node1:

[2024/02/23 10:03:45.046 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=765781] [neededSchemaVersion=765783] ["start time"=8.98702ms] [gotSchemaVersion=765781] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/23 10:03:58.359 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=765781] [neededSchemaVersion=765783] ["start time"=896.62µs] [gotSchemaVersion=765781] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/23 10:04:20.859 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=765781] [neededSchemaVersion=765785] ["start time"=2.018378ms] [gotSchemaVersion=765781] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/23 10:04:26.587 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=765781] [neededSchemaVersion=765786] ["start time"=2.459582ms] [gotSchemaVersion=765786] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/23 10:04:28.841 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=765786] [neededSchemaVersion=765787] ["start time"=512.509µs] [gotSchemaVersion=765787] [phyTblIDs="[]"] [actionTypes="[]"]

tidb node2:

[2024/02/23 10:02:47.822 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=765780] [neededSchemaVersion=765781] ["start time"=640.954µs] [gotSchemaVersion=765781] [phyTblIDs="[467637]"] [actionTypes="[4]"]
[2024/02/23 10:02:47.868 +07:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=467643] [conn=2993411446] [category=ddl] [job="ID:467643, Type:drop table, State:running, SchemaState:write only, SchemaID:70637, TableID:467638, RowCount:0, ArgLen:0, start time: 2024-02-23 10:02:23.974 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/23 10:02:47.879 +07:00] [INFO] [ddl_worker.go:1219] ["wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [category=ddl] [ver=765781] ["take time"=7.582233898s] [job="ID:467644, Type:drop table, State:running, SchemaState:write only, SchemaID:70637, TableID:467637, RowCount:0, ArgLen:2, start time: 2024-02-23 10:02:29.274 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/23 10:03:08.033 +07:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=467643] [conn=2993411446] [category=ddl] [job="ID:467643, Type:drop table, State:running, SchemaState:write only, SchemaID:70637, TableID:467638, RowCount:0, ArgLen:0, start time: 2024-02-23 10:02:23.974 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]                                   
[2024/02/23 10:03:18.969 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=765781] [neededSchemaVersion=765782] ["start time"=552.312µs] [gotSchemaVersion=765781] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/23 10:03:27.030 +07:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=467643] [conn=2993411446] [category=ddl] [job="ID:467643, Type:drop table, State:running, SchemaState:write only, SchemaID:70637, TableID:467638, RowCount:0, ArgLen:0, start time: 2024-02-23 10:02:23.974 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/23 10:03:45.060 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=765781] [neededSchemaVersion=765783] ["start time"=914.435µs] [gotSchemaVersion=765781] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/23 10:03:56.659 +07:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=467643] [conn=2993411446] [category=ddl] [job="ID:467643, Type:drop table, State:running, SchemaState:write only, SchemaID:70637, TableID:467638, RowCount:0, ArgLen:0, start time: 2024-02-23 10:02:23.974 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/23 10:04:02.106 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=765781] [neededSchemaVersion=765783] ["start time"=954.762µs] [gotSchemaVersion=765781] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/23 10:04:12.816 +07:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=467643] [conn=2993411446] [category=ddl] [job="ID:467643, Type:drop table, State:running, SchemaState:write only, SchemaID:70637, TableID:467638, RowCount:0, ArgLen:0, start time: 2024-02-23 10:02:23.974 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/23 10:04:24.110 +07:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=467644] [conn=1389048268] [category=ddl] [job="ID:467644, Type:drop table, State:running, SchemaState:write only, SchemaID:70637, TableID:467637, RowCount:0, ArgLen:0, start time: 2024-02-23 10:02:29.274 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/23 10:04:25.665 +07:00] [INFO] [session.go:3916] ["CRUCIAL OPERATION"] [conn=2993410792] [schemaVersion=765781] [cur_db=<db_name>] [sql="drop table <table_name>"] [user=<user_name>][2024/02/23 10:04:26.600 +07:00] [INFO] [domain.go:262] ["diff load InfoSchema success"] [currentSchemaVersion=765781] [neededSchemaVersion=765786] ["start time"=2.406777ms] [gotSchemaVersion=765786] [phyTblIDs="[]"] [actionTypes="[]"]
[2024/02/23 10:04:26.624 +07:00] [INFO] [ddl_worker.go:1219] ["wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [category=ddl] [ver=765786] ["take time"=2.563222442s] [job="ID:467643, Type:drop table, State:running, SchemaState:delete only, SchemaID:70637, TableID:467638, RowCount:0, ArgLen:0, start time: 2024-02-23 10:02:23.974 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/23 10:04:27.288 +07:00] [INFO] [ddl_worker.go:252] ["add DDL jobs"] [category=ddl] ["batch count"=1] [jobs="ID:467645, Type:drop table, State:queueing, SchemaState:public, SchemaID:403534, TableID:467633, RowCount:0, ArgLen:2, start time: 2024-02-23 10:04:25.624 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0; "] [table=true]

JaySon-Huang (Contributor) commented:

Is there any TiDB logging like "resolveLock rollback" or "handle ddl job failed"?

uzuki27 (Author) commented Feb 26, 2024

@JaySon-Huang, yes, around these logs we have a lot of:

[2024/02/22 09:47:34.769 +07:00] [INFO] [job_table.go:418] ["handle ddl job failed"] [category=ddl] [error="[kv:8022]Error: KV error safe to retry Error(Txn(Error(Mvcc(Error(TxnLockNotFound { start_ts: TimeStamp(447891384139513860), commit_ts: TimeStamp(447891388438675460), key: [109, 68, 66, 58, 51, 57, 56, 49, 52, 255, 51, 0, 0, 0, 0, 0, 0, 0, 248, 0, 0, 0, 0, 0, 0, 0, 104, 84, 97, 98, 108, 101, 58, 52, 54, 255, 53, 49, 50, 56, 0, 0, 0, 0, 251] }))))) {metaKey=true, key=DB:398143, field=Table:465128} [try again later]"] [job="ID:465132, Type:update tiflash replica status, State:done, SchemaState:public, SchemaID:398143, TableID:465128, RowCount:0, ArgLen:2, start time: 2024-02-22 09:44:08.725 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/22 09:47:34.772 +07:00] [INFO] [ddl_worker.go:985] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=465132] [category=ddl] [job="ID:465132, Type:update tiflash replica status, State:queueing, SchemaState:none, SchemaID:398143, TableID:465128, RowCount:0, ArgLen:0, start time: 2024-02-22 09:44:08.725 +0700 +07, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
[2024/02/22 09:47:37.442 +07:00] [INFO] [lock_resolver.go:1132] ["resolveLock rollback"] [lock="key: 7480000000000000175f72800000000174fea0, primary: 7480000000000000175f72800000000174fe99, txnStartTS: 447891384506515457, lockForUpdateTS:0, minCommitTs:447891384506515458, ttl: 3002, type: Put, UseAsyncCommit: false, txnSize: 12"]
[2024/02/22 09:47:37.443 +07:00] [INFO] [lock_resolver.go:1132] ["resolveLock rollback"] [lock="key: 7480000000000000175f72800000000174fea1, primary: 7480000000000000175f72800000000174fe99, txnStartTS: 447891384506515457, lockForUpdateTS:0, minCommitTs:447891384506515458, ttl: 3002, type: Put, UseAsyncCommit: false, txnSize: 12"]
[2024/02/22 09:47:39.659 +07:00] [INFO] [lock_resolver.go:1132] ["resolveLock rollback"] [lock="key: 7480000000000000175f72800000000174fea2, primary: 7480000000000000175f72800000000174fe99, txnStartTS: 447891384506515457, lockForUpdateTS:0, minCommitTs:447891384506515458, ttl: 3002, type: Put, UseAsyncCommit: false, txnSize: 12"]
[2024/02/22 09:47:39.660 +07:00] [INFO] [lock_resolver.go:1132] ["resolveLock rollback"] [lock="key: 7480000000000000175f72800000000174fea3, primary: 7480000000000000175f72800000000174fe99, txnStartTS: 447891384506515457, lockForUpdateTS:0, minCommitTs:447891384506515458, ttl: 3002, type: Put, UseAsyncCommit: false, txnSize: 12"]
[2024/02/22 09:47:39.661 +07:00] [INFO] [lock_resolver.go:1132] ["resolveLock rollback"] [lock="key: 7480000000000000175f72800000000174fea4, primary: 7480000000000000175f72800000000174fe99, txnStartTS: 447891384506515457, lockForUpdateTS:0, minCommitTs:447891384506515458, ttl: 3002, type: Put, UseAsyncCommit: false, txnSize: 12"]

JaySon-Huang (Contributor) commented:

@uzuki27 OK, I got it. It is related to another TiFlash bug: #8578.

When multiple DDL jobs run concurrently, or TiKV is not stable, TiDB can generate an "empty" schema diff. With the TiFlash bug #8578, TiFlash will try to access the regenerate_schema_map of a NULL pointer, which may make the TiFlash process panic or just read a random value. If it reads a random value as regenerate_schema_map == true, then TiFlash runs into syncAllSchema and triggers this bug.
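(A minimal C++ sketch of that failure mode, using hypothetical names and assuming the fetched diff is held in a std::optional; this is not the actual TiFlash code from #8578:)

#include <optional>

struct SchemaDiff
{
    bool regenerate_schema_map = false;
};

bool needSyncAllSchema(const std::optional<SchemaDiff> & diff)
{
    // Buggy pattern: when TiDB returns an "empty" diff, the optional is empty and
    // dereferencing it is undefined behavior -- it may crash or read a garbage bool.
    //   return diff->regenerate_schema_map;

    // Guarded pattern: treat a missing diff explicitly instead of reading through it.
    return diff.has_value() && diff->regenerate_schema_map;
}

int main()
{
    std::optional<SchemaDiff> empty_diff; // what an "empty" schema diff ends up as here
    return needSyncAllSchema(empty_diff) ? 1 : 0; // guarded version returns false
}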

JaySon-Huang (Contributor) commented:

Both of these bugs are fixed in the next patch version v7.5.1, which is planned to be released in early March. You can try to upgrade to the new version.

uzuki27 (Author) commented Feb 26, 2024

Thank you, I'm really looking forward to it.

seiya-annie commented:

/found community

ti-chi-bot added the found/community label Jun 4, 2024