Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Branch page items link to already released pages #402

Open
shippomx opened this issue Feb 9, 2023 · 19 comments
Open

Branch page items link to already released pages #402

shippomx opened this issue Feb 9, 2023 · 19 comments

Comments

@shippomx
Copy link

shippomx commented Feb 9, 2023

Feb  7 02:22:16 localhost dockerd[9618]: time="2023-02-07T02:22:16.932095276+08:00" level=info msg="createNetwork setupMulticastOn"
Feb  7 02:22:20 localhost dockerd[9618]: runtime: goroutine stack exceeds 1000000000-byte limit
Feb  7 02:22:20 localhost dockerd[9618]: fatal error: stack overflow
Feb  7 02:22:20 localhost dockerd[9618]: runtime stack:
Feb  7 02:22:20 localhost dockerd[9618]: runtime.throw(0x193e06d, 0xe)
Feb  7 02:22:20 localhost dockerd[9618]: /usr/local/go/src/runtime/panic.go:617 +0x72
Feb  7 02:22:20 localhost dockerd[9618]: runtime.newstack()
Feb  7 02:22:20 localhost dockerd[9618]: /usr/local/go/src/runtime/stack.go:1041 +0x6f0
Feb  7 02:22:20 localhost dockerd[9618]: runtime.morestack()
Feb  7 02:22:20 localhost dockerd[9618]: /usr/local/go/src/runtime/asm_amd64.s:429 +0x8f
Feb  7 02:22:20 localhost dockerd[9618]: goroutine 1 [running]:
Feb  7 02:22:20 localhost dockerd[9618]: github.com/docker/docker/vendor/github.com/boltdb/bolt.(*Cursor).searchPage.func1(0x7, 0x0)
Feb  7 02:22:20 localhost dockerd[9618]: /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/vendor/github.com/boltdb/bolt/cursor.go:299 +0xda fp=0xc024000350 sp=0xc024000348 pc=0xd5c97a
Feb  7 02:22:20 localhost dockerd[9618]: sort.Search(0xf, 0xc0240003c8, 0x0)
Feb  7 02:22:20 localhost dockerd[9618]: /usr/local/go/src/sort/search.go:66 +0x58 fp=0xc024000388 sp=0xc024000350 pc=0x4b84c8
Feb  7 02:22:20 localhost dockerd[9618]: github.com/docker/docker/vendor/github.com/boltdb/bolt.(*Cursor).searchPage(0xc043ffe338, 0xc000ff16b0, 0x24, 0x30, 0x7f41ac043000)
Feb  7 02:22:20 localhost dockerd[9618]: /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/vendor/github.com/boltdb/bolt/cursor.go:299 +0xd0 fp=0xc024000418 sp=0xc024000388 pc=0xd4fb10
Feb  7 02:22:20 localhost dockerd[9618]: github.com/docker/docker/vendor/github.com/boltdb/bolt.(*Cursor).search(0xc043ffe338, 0xc000ff16b0, 0x24, 0x30, 0x1b)
Feb  7 02:22:20 localhost dockerd[9618]: /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/vendor/github.com/boltdb/bolt/cursor.go:271 +0x18b fp=0xc0240004b8 sp=0xc024000418 pc=0xd4f72b
Feb  7 02:22:20 localhost dockerd[9618]: github.com/docker/docker/vendor/github.com/boltdb/bolt.(*Cursor).searchPage(0xc043ffe338, 0xc000ff16b0, 0x24, 0x30, 0x7f41ac044000)
Feb  7 02:22:20 localhost dockerd[9618]: /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/vendor/github.com/boltdb/bolt/cursor.go:314 +0x153 fp=0xc024000548 sp=0xc0240004b8 pc=0xd4fb93
Feb  7 02:22:20 localhost dockerd[9618]: github.com/docker/docker/vendor/github.com/boltdb/bolt.(*Cursor).search(0xc043ffe338, 0xc000ff16b0, 0x24, 0x30, 0x1c)
Feb  7 02:22:20 localhost dockerd[9618]: /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/vendor/github.com/boltdb/bolt/cursor.go:271 +0x18b fp=0xc0240005e8 sp=0xc024000548 pc=0xd4f72b
Feb  7 02:22:20 localhost dockerd[9618]: github.com/docker/docker/vendor/github.com/boltdb/bolt.(*Cursor).searchPage(0xc043ffe338, 0xc000ff16b0, 0x24, 0x30, 0x7f41ac03f000)
Feb  7 02:22:20 localhost dockerd[9618]: /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/vendor/github.com/boltdb/bolt/cursor.go:314 +0x153 fp=0xc024000678 sp=0xc0240005e8 pc=0xd4fb93
Feb  7 02:22:20 localhost dockerd[9618]: github.com/docker/docker/vendor/github.com/boltdb/bolt.(*Cursor).search(0xc043ffe338, 0xc000ff16b0, 0x24, 0x30, 0x17)
Feb  7 02:22:20 localhost dockerd[9618]: /root/rpmbuild/BUILD/docker-ce/.gopath/src/github.com/docker/docker/vendor/github.com/boltdb/bolt/cursor.go:271 +0x18b fp=0xc024000718 sp=0xc024000678 pc=0xd4f72b
Feb  7 02:22:20 localhost dockerd[9618]: github.com/docker/docker/vendor/github.com/boltdb/bolt.(*Cursor).searchPage(0xc043ffe338, 0xc000ff16b0, 0x24, 0x30, 0x7f41ac043000)
...

I use an old version of docker 17.03, and get this error mesage. And yet I add some logs to record this collapse.

 // etcd-io/bbolt/cursor.go:271
func (c *Cursor) search(key []byte, pgid pgid) {
	fmt.Printf("\nsearch: key = %s, pgid = %d\n", string(key), pgid)

 // etcd-io/bbolt/cursor.go:312
func (c *Cursor) searchPage(key []byte, p *page) {
	fmt.Printf("searchPage: key = %s, pgid = %d\n", string(key), p.id)

 // etcd-io/bbolt/cursor.go:323
                ret := bytes.Compare(inodes[i].key(), key)
		if ret == 0 {
			exact = true
		}
		fmt.Printf("\tinodes[%d].key() = %s, ret = %d\n", i, string(inodes[i].key()), ret)
		return ret != -1

and get this logs:

search: key = docker/network/v1.0/macvlan/network/, pgid = 34
searchPage: key = docker/network/v1.0/macvlan/network/, pgid = 34
    inodes[1].key() = docker/network/v1.0/endpoint/b33697a6c83bdf8949f8f6c91060a15114399af3a274090a05c2853558526ad1/97f1c83d06a0cb2ec9a2705829c89964ab611f8d3ff004e1a6eb7336a745de45/, ret = -1
    index = 1, inodes[index].pgid = 29

search: key = docker/network/v1.0/macvlan/network/, pgid = 29
searchPage: key = docker/network/v1.0/macvlan/network/, pgid = 29
    inodes[8].key() = docker/network/v1.0/sandbox/10c76cdf0b7a99e32375cc1d3c4eb38e0f9b564e0f98524c91beabdcff6a67cb/, ret = 1
    inodes[4].key() = docker/network/v1.0/endpoint/fb49c1610068fcbb52209ff50b144ea2a6b97bf8e70985e4971df61463508f3c/97f47d111ec6a95a53494fd2f3b3354e7e6c27d8bfb1a5471fa25f12859e0d2e/, ret = -1
    inodes[6].key() = docker/network/v1.0/network/3aefb08cc24b4dae33c35b7f16edfcce3667a10fb9dfa9e23f5ee3c6a5dde4ff/, ret = 1
    inodes[5].key() = docker/network/v1.0/endpoint/fb49c1610068fcbb52209ff50b144ea2a6b97bf8e70985e4971df61463508f3c/f43012f4ce2d86fc86897bacc4593d38c4bf607ffd85ac7fb03b9f7a6eae845a/, ret = -1
    index = 5, inodes[index].pgid = 23

search: key = docker/network/v1.0/macvlan/network/, pgid = 23
searchPage: key = docker/network/v1.0/macvlan/network/, pgid = 23
    inodes[8].key() = docker/network/v1.0/sandbox/0a36df8c90515a4a81d8f2279eb4e03371441534e52befb6f8dbad5e604d0d73/, ret = 1
    inodes[4].key() = docker/network/v1.0/endpoint/fb49c1610068fcbb52209ff50b144ea2a6b97bf8e70985e4971df61463508f3c/97f47d111ec6a95a53494fd2f3b3354e7e6c27d8bfb1a5471fa25f12859e0d2e/, ret = -1
    inodes[6].key() = docker/network/v1.0/network/3aefb08cc24b4dae33c35b7f16edfcce3667a10fb9dfa9e23f5ee3c6a5dde4ff/, ret = 1
    inodes[5].key() = docker/network/v1.0/endpoint/fb49c1610068fcbb52209ff50b144ea2a6b97bf8e70985e4971df61463508f3c/f43012f4ce2d86fc86897bacc4593d38c4bf607ffd85ac7fb03b9f7a6eae845a/, ret = -1
    index = 5, inodes[index].pgid = 28

search: key = docker/network/v1.0/macvlan/network/, pgid = 28
searchPage: key = docker/network/v1.0/macvlan/network/, pgid = 28
    inodes[1].key() = docker/network/v1.0/endpoint/b33697a6c83bdf8949f8f6c91060a15114399af3a274090a05c2853558526ad1/97f1c83d06a0cb2ec9a2705829c89964ab611f8d3ff004e1a6eb7336a745de45/, ret = -1
    index = 1, inodes[index].pgid = 27

search: key = docker/network/v1.0/macvlan/network/, pgid = 27
searchPage: key = docker/network/v1.0/macvlan/network/, pgid = 27
    inodes[7].key() = docker/network/v1.0/network/fb49c1610068fcbb52209ff50b144ea2a6b97bf8e70985e4971df61463508f3c/, ret = 1
    inodes[3].key() = docker/network/v1.0/endpoint/fb49c1610068fcbb52209ff50b144ea2a6b97bf8e70985e4971df61463508f3c/1158eed39c59a5dfff638074e48e26ce36f4dce086389af02f0ef87a852195b8/, ret = -1
    inodes[5].key() = docker/network/v1.0/endpoint/fb49c1610068fcbb52209ff50b144ea2a6b97bf8e70985e4971df61463508f3c/f43012f4ce2d86fc86897bacc4593d38c4bf607ffd85ac7fb03b9f7a6eae845a/, ret = -1
    inodes[6].key() = docker/network/v1.0/network/3aefb08cc24b4dae33c35b7f16edfcce3667a10fb9dfa9e23f5ee3c6a5dde4ff/, ret = 1
    index = 5, inodes[index].pgid = 23

search: key = docker/network/v1.0/macvlan/network/, pgid = 23
searchPage: key = docker/network/v1.0/macvlan/network/, pgid = 23
    inodes[8].key() = docker/network/v1.0/sandbox/0a36df8c90515a4a81d8f2279eb4e03371441534e52befb6f8dbad5e604d0d73/, ret = 1
    inodes[4].key() = docker/network/v1.0/endpoint/fb49c1610068fcbb52209ff50b144ea2a6b97bf8e70985e4971df61463508f3c/97f47d111ec6a95a53494fd2f3b3354e7e6c27d8bfb1a5471fa25f12859e0d2e/, ret = -1
    inodes[6].key() = docker/network/v1.0/network/3aefb08cc24b4dae33c35b7f16edfcce3667a10fb9dfa9e23f5ee3c6a5dde4ff/, ret = 1
    inodes[5].key() = docker/network/v1.0/endpoint/fb49c1610068fcbb52209ff50b144ea2a6b97bf8e70985e4971df61463508f3c/f43012f4ce2d86fc86897bacc4593d38c4bf607ffd85ac7fb03b9f7a6eae845a/, ret = -1
    index = 5, inodes[index].pgid = 28

I have dumped the db file as follows.
local-kv.db.tar.gz

I know something may result the db broken, but is there some methods to avoid this?

@shippomx
Copy link
Author

@benbjohnson Can you give me some advices?

@ahrtr ahrtr added the type/bug label Feb 10, 2023
@ahrtr
Copy link
Member

ahrtr commented Feb 13, 2023

The page 23 is freed, but it's still somehow referenced by page 29.

$ bolt pages ./local-kv.db 
ID       TYPE       ITEMS  OVRFLW
======== ========== ====== ======
0        meta       0            
1        meta       0            
2        leaf       3            
3        leaf       5            
4        leaf       3            
5        leaf       2            
6        leaf       2            
7        leaf       2            
8        leaf       2            
9        leaf       3            
10       leaf       5            
11       leaf       2            
12       leaf       2            
13       leaf       3            
14       leaf       4            
15       leaf       4            
16       leaf       2            
17       leaf       4            
18       leaf       7            
19       leaf       4            
20       leaf       3            
21       leaf       3            
22       leaf       2            
23       free                    
24       leaf       6            
25       leaf       3            
26       leaf       2            
27       free                    
28       free                    
29       branch     16           
30       branch     13 
......
$ bolt page ./local-kv.db  29
Page ID:    29
Page Type:  branch
Total Size: 4096 bytes
Overflow pages: 0
Item Count: 16

"docker/network/v1.0/endpoint/b33697a6c83bdf8949f8f6c91060a15114399af3a274090a05c2853558526ad1/97f1c83d06a0cb2ec9a2705829c89964ab611f8d3ff004e1a6eb7336a745de45/": <pgid=35>
"docker/network/v1.0/endpoint/b33697a6c83bdf8949f8f6c91060a15114399af3a274090a05c2853558526ad1/b02586531c17670ca4690990dcc50d87dfd05c5e7f131d5417a528e67497a783/": <pgid=22>
"docker/network/v1.0/endpoint/b33697a6c83bdf8949f8f6c91060a15114399af3a274090a05c2853558526ad1/b9a2e9ca2bc1c7b00d60f3301e6a56993d9b468626e7a1b9ef65c530b9ac5f3c/": <pgid=38>
"docker/network/v1.0/endpoint/fb49c1610068fcbb52209ff50b144ea2a6b97bf8e70985e4971df61463508f3c/1158eed39c59a5dfff638074e48e26ce36f4dce086389af02f0ef87a852195b8/": <pgid=11>
"docker/network/v1.0/endpoint/fb49c1610068fcbb52209ff50b144ea2a6b97bf8e70985e4971df61463508f3c/97f47d111ec6a95a53494fd2f3b3354e7e6c27d8bfb1a5471fa25f12859e0d2e/": <pgid=25>
"docker/network/v1.0/endpoint/fb49c1610068fcbb52209ff50b144ea2a6b97bf8e70985e4971df61463508f3c/f43012f4ce2d86fc86897bacc4593d38c4bf607ffd85ac7fb03b9f7a6eae845a/": <pgid=23>
"docker/network/v1.0/network/3aefb08cc24b4dae33c35b7f16edfcce3667a10fb9dfa9e23f5ee3c6a5dde4ff/": <pgid=5>
"docker/network/v1.0/network/fb49c1610068fcbb52209ff50b144ea2a6b97bf8e70985e4971df61463508f3c/": <pgid=2>
"docker/network/v1.0/sandbox/10c76cdf0b7a99e32375cc1d3c4eb38e0f9b564e0f98524c91beabdcff6a67cb/": <pgid=24>
"docker/network/v1.0/sandbox/183088884b55acbda71c93a883b4fab1d7a0af6533779742b16949fe289b6b16/": <pgid=18>
"docker/network/v1.0/sandbox/3e9292be50bb4fc02369770ed956c6d05f678dd3334a1a289857518c11fd4c74/": <pgid=19>
"docker/network/v1.0/sandbox/6066d5f48d5afe47f47b94f0f41bc69ad9eaaf2a9f050a834789d2e1f76429ec/": <pgid=3>
"docker/network/v1.0/sandbox/a9e41d9429735297326d4d14239b7131e93dcf0af293b5e07bb7af2e5582be09/": <pgid=14>
"docker/network/v1.0/sandbox/c69467c92606a45653f2cd4133b84da325f03d1b7fbba83743bfe1dd1b44df8c/": <pgid=4>
"docker/network/v1.0/sandbox/d8e74c9d9f58689e8abfecbc78197f617c1e95a4643a0005d14b5ce2fb035619/": <pgid=20>
"docker/network/v1.0/sandbox/e5dfbe123d17d8205ee1892b8e452d092501eb05c9338eed7180ef402894320b/": <pgid=32>

There must be something wrong when releasing a page (page 23 in this case). Will take a look later when I get free cycle.

@ahrtr
Copy link
Member

ahrtr commented Feb 16, 2023

The root cause is still under investigation, but there is a tool to fix the corrupted db file. Please read #408, and let me know whether it works for you. @shippomx

@ahrtr
Copy link
Member

ahrtr commented Feb 16, 2023

@shippomx do you have detailed steps or a program to reproduce this issue, and what's the bbolt version on which you reproduced this issue?

@ahrtr
Copy link
Member

ahrtr commented Feb 17, 2023

Usually the difference between the two Txn ID in meta pages should be 1; it doesn't make sense it's 7 (54498 - 54491). Do you run a customized bbolt?

$ bbolt page ./local-kv.db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=36>
Freelist:   <pgid=40>
HWM:        <pgid=57>
Txn ID:     54498
Checksum:   a7487e9a0514ba80

$ bbolt page ./local-kv.db 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=39>
Freelist:   <pgid=40>
HWM:        <pgid=57>
Txn ID:     54491
Checksum:   d19106352fdf1722

@shippomx
Copy link
Author

I just use moby (docker). Some unpredictable complex situation may cause the db file broken.

@ahrtr
Copy link
Member

ahrtr commented Feb 17, 2023

Can you answer my previous question? #402 (comment)

  1. What's the exact bbolt version on which you ran into this issue?
  2. Could you provide detailed step (and program) to reproduce this issue?

@shippomx
Copy link
Author

Can you answer my previous question? #402 (comment)

  1. What's the exact bbolt version on which you ran into this issue?
    https://github.com/moby/moby/blob/master/vendor.mod#L89
    go.etcd.io/bbolt v1.3.6
  1. Could you provide detailed step (and program) to reproduce this issue?
    Step1. build and run the dockerd
    Step2. stop dockerd, replace the /var/lib/docker/network/files/local-kv.db with the file which i attached in this issue.
    Step3. restart dockerd

But i have no idea with how this local-kv.db file broken.

@ptabor
Copy link
Contributor

ptabor commented Mar 2, 2023

Moby seems to use raw bbolt:
https://github.com/moby/moby/blob/e068c38618ff94e3be9e572c568be3bed396c263/vendor.mod#L89

It's interesting how this happened. I have no other idea than 7 rolled back transactions and 8th successful in scope of the same binary running. That might suggests rollback are not that safe... and maybe it's how the race on meta-page fixed recently (in 1.3.7) manifests.

@ahrtr
Copy link
Member

ahrtr commented Mar 16, 2023

I have no other idea than 7 rolled back transactions and 8th successful in scope of the same binary running

meta page will not be updated at all (accordingly txid will not increase) if the transaction fails. So it doesn't make sense. It can't explain #402 (comment) .

@ahrtr ahrtr pinned this issue Mar 20, 2023
@ahrtr ahrtr changed the title goroutine stack exceeds 1000000000-byte limit Branch page items link to already released pages Mar 20, 2023
@ahrtr
Copy link
Member

ahrtr commented Mar 20, 2023

@shippomx just to double confirm, till the point you ran into this issue, you did not modify the bolt's source code. Instead, you just added some debug info after seeing this issue. Could you confirm this?

I also assume that you did not modify the db file manually or programmatically. Please kindly let me know if it isn't the case.

@ningmingxiao
Copy link

ningmingxiao commented Apr 18, 2023

****> Moby seems to use raw bbolt: https://github.com/moby/moby/blob/e068c38618ff94e3be9e572c568be3bed396c263/vendor.mod#L89

It's interesting how this happened. I have no other idea than 7 rolled back transactions and 8th successful in scope of the same binary running. That might suggests rollback are not that safe... and maybe it's how the race on meta-page fixed recently (in 1.3.7) manifests.

I'm sure it is because of power failure,here is a new broken db
brokendb.tar.gz

bolt version : github.com/boltdb/bolt fff57c1 on docker-ce17.03

@ahrtr
Copy link
Member

ahrtr commented Apr 19, 2023

@ningmingxiao thx for providing the corrupted db file. The bbolt version is too old.

How was the broken db generated? Do you have a detailed steps to reproduce? Could you reproduce this issue with 1.3.7 or latest code on master branch?

BTW, the steps to fix the corrupted db file are:

$ ./bbolt  surgery clear-page-elements ./local-kv.db --from-index 0 --to-index 1 --output new1.db --pageId 14
$ ./bbolt  surgery clear-page-elements ./new1.db --from-index 4 --to-index 5 --output new2.db --pageId 14
$ ./bbolt surgery freelist abandon ./new2.db --output ./new3.db

$ ./bbolt check ./new3.db 
OK

@ahrtr
Copy link
Member

ahrtr commented Apr 20, 2023

@ningmingxiao any update? thx

I am also happy to jump into a zoom call to discuss this issue with you.

@ningmingxiao
Copy link

@ningmingxiao any update? thx

I am also happy to jump into a zoom call to discuss this issue with you.

I will update boltdb to new version.If it appears, I'll let you know
these step is cumbersome, I thinks we can use one command to fix broken db such as
bbolt fix local-kv.db --output local-kv_fix.db

$ ./bbolt  surgery clear-page-elements ./local-kv.db --from-index 0 --to-index 1 --output new1.db --pageId 14
$ ./bbolt  surgery clear-page-elements ./new1.db --from-index 4 --to-index 5 --output new2.db --pageId 14
$ ./bbolt surgery freelist abandon ./new2.db --output ./new3.db

@ahrtr
Copy link
Member

ahrtr commented Apr 23, 2023

I will update boltdb to new version.If it appears, I'll let you know

Thanks. Could you provide rough steps on how was the previous corrupted db generated, so that I can also try to reproduce it?

these step is cumbersome, I thinks we can use one command to fix broken db such as
bbolt fix local-kv.db --output local-kv_fix.db

$ ./bbolt  surgery clear-page-elements ./local-kv.db --from-index 0 --to-index 1 --output new1.db --pageId 14
$ ./bbolt  surgery clear-page-elements ./new1.db --from-index 4 --to-index 5 --output new2.db --pageId 14
$ ./bbolt surgery freelist abandon ./new2.db --output ./new3.db

The bbolt surgery ... are very low level commands, which are only supposed to be executed by advanced users. I am afraid there isn't easier way to fix corrupted db files other than the bbolt surgery ... for now, unless you have a backup.

@ahrtr
Copy link
Member

ahrtr commented May 10, 2023

Thanks. Could you provide rough steps on how was the previous corrupted db generated, so that I can also try to reproduce it?

@shippomx can you share more details on how was the corrupt db file generated? I couldn't reproduce this issue by intentionally killing the program (#490) or shutting down VM when the concurrent test is running.

@ahrtr
Copy link
Member

ahrtr commented May 10, 2023

@ningmingxiao

@ningmingxiao
Copy link

@ningmingxiao
It's difficult to reproduce, it is because of power failure. If it appers i will tell you.

@github-actions github-actions bot added the stale label Apr 17, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 9, 2024
@ahrtr ahrtr reopened this May 9, 2024
@ahrtr ahrtr removed the stale label May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

4 participants