querier may crash when cache directory is full #12947

Open
umezawatakeshi opened this issue May 13, 2024 · 0 comments
Labels
component/querier, type/bug (Something is not working as expected)

Comments

@umezawatakeshi

Describe the bug

The querier may crash with a panic after a "short write" error while syncing the index cache, when its cache directory (such as /data/tsdb-cache) is full.

2024-05-07 18:16:15.420	level=error ts=2024-05-07T09:16:15.394133257Z caller=index_set.go:285 table-name=index_19689 msg="sync failed, retrying it" err="short write"
2024-05-07 18:16:15.489	panic: runtime error: invalid memory address or nil pointer dereference
2024-05-07 18:16:15.489		panic: runtime error: invalid memory address or nil pointer dereference
2024-05-07 18:16:15.489	[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1db9dba]
2024-05-07 18:16:15.489	
2024-05-07 18:16:15.489	goroutine 7049936555 [running]:
2024-05-07 18:16:15.489	github.com/grafana/loki/pkg/storage/stores/indexshipper/storage.DownloadFileFromStorage.func1()
2024-05-07 18:16:15.489		/work/loki/pkg/storage/stores/indexshipper/storage/util.go:55 +0x1a
2024-05-07 18:16:15.489	panic({0x23895c0?, 0x4577fe0?})
2024-05-07 18:16:15.489		/usr/local/go/src/runtime/panic.go:770 +0x132
2024-05-07 18:16:15.489	bufio.(*Reader).Read(0xc19b80a540, {0xc0d0bff595, 0xa, 0xc0d0bff804?})
2024-05-07 18:16:15.489		/usr/local/go/src/bufio/bufio.go:241 +0x183
2024-05-07 18:16:15.489	io.ReadAtLeast({0x306cb60, 0xc19b80a540}, {0xc0d0bff595, 0xa, 0x200}, 0xa)
2024-05-07 18:16:15.489		/usr/local/go/src/io/io.go:335 +0x90
2024-05-07 18:16:15.489	io.ReadFull(...)
2024-05-07 18:16:15.489		/usr/local/go/src/io/io.go:354
2024-05-07 18:16:15.489	github.com/klauspost/pgzip.(*Reader).readHeader(0xc0d0bff508, 0x1)
2024-05-07 18:16:15.489		/work/loki/vendor/github.com/klauspost/pgzip/gunzip.go:257 +0x71
2024-05-07 18:16:15.489	github.com/klauspost/pgzip.(*Reader).Reset(0xc0d0bff508, {0x0, 0x0})
2024-05-07 18:16:15.489		/work/loki/vendor/github.com/klauspost/pgzip/gunzip.go:190 +0x2b3
2024-05-07 18:16:15.489	github.com/grafana/loki/pkg/storage/stores/indexshipper/storage.getGzipReader({0x0, 0x0})
2024-05-07 18:16:15.489		/work/loki/pkg/storage/stores/indexshipper/storage/util.go:26 +0x50
2024-05-07 18:16:15.489	github.com/grafana/loki/pkg/storage/stores/indexshipper/storage.DownloadFileFromStorage({0xc0df80e780, 0x4b}, 0x1, 0x1, {0x306c360, 0xc18da2e460}, 0xc0f81fddf0)
2024-05-07 18:16:15.489		/work/loki/pkg/storage/stores/indexshipper/storage/util.go:72 +0x7c5
2024-05-07 18:16:15.489	github.com/grafana/loki/pkg/storage/stores/indexshipper/downloads.(*indexSet).downloadFileFromStorage(0xc17f55c280, {0x308db30, 0xc18e2a0be0}, {0xc093a7b4b2, 0x31}, {0xc102aa2120?, 0x1c?})
2024-05-07 18:16:15.489		/work/loki/pkg/storage/stores/indexshipper/downloads/index_set.go:394 +0x246
2024-05-07 18:16:15.489	github.com/grafana/loki/pkg/storage/stores/indexshipper/downloads.(*indexSet).doConcurrentDownload.func1({0x308db30?, 0xc18e2a0be0?}, 0xc017364768?)
2024-05-07 18:16:15.489		/work/loki/pkg/storage/stores/indexshipper/downloads/index_set.go:412 +0x7c
2024-05-07 18:16:15.489	github.com/grafana/dskit/concurrency.ForEachJob.func1()
2024-05-07 18:16:15.489		/work/loki/vendor/github.com/grafana/dskit/concurrency/runner.go:105 +0x83
2024-05-07 18:16:15.489	golang.org/x/sync/errgroup.(*Group).Go.func1()
2024-05-07 18:16:15.489		/work/loki/vendor/golang.org/x/sync/errgroup/errgroup.go:75 +0x56
2024-05-07 18:16:15.489	created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 7049867623
2024-05-07 18:16:15.490		/work/loki/vendor/golang.org/x/sync/errgroup/errgroup.go:72 +0x96
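
The stack trace suggests what goes wrong: after the download fails with "short write" (the disk is full), getGzipReader in pkg/storage/stores/indexshipper/storage/util.go is handed a nil io.Reader ({0x0, 0x0} in the trace), and the gzip reader then dereferences it inside bufio while reading the header. The following is only a minimal, self-contained sketch of that pattern, not the actual Loki code; the helper name is made up, and the standard-library gzip reader is used in place of pgzip since it fails in the same way:

package main

import (
	"compress/gzip" // stands in for github.com/klauspost/pgzip; both read the header via bufio
	"fmt"
	"io"
)

// downloadCompressed is a hypothetical stand-in for the object-store download:
// on a full disk it can fail part-way through and return a nil reader together
// with an error such as "short write".
func downloadCompressed() (io.Reader, error) {
	return nil, fmt.Errorf("short write")
}

func main() {
	r, err := downloadCompressed()
	if err != nil {
		// If the error is only logged here (or checked too late), r stays nil
		// and the gzip reader below dereferences it.
		fmt.Println("download failed:", err)
	}

	// Panics with "invalid memory address or nil pointer dereference" inside
	// bufio.(*Reader).Read while reading the gzip header, matching the trace.
	gz, err := gzip.NewReader(r)
	if err != nil {
		fmt.Println("gzip error:", err)
		return
	}
	defer gz.Close()
	_, _ = io.Copy(io.Discard, gz)
}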

On the next startup, the querier tried to clean up the broken cache entries but ultimately failed to start. This repeated several times.

2024-05-07 18:16:21.240	level=info ts=2024-05-07T09:16:21.141690617Z caller=main.go:108 msg="Starting Loki" version="(version=2.9.5, branch=HEAD, revision=e759cae)"
2024-05-07 18:16:21.240	level=info ts=2024-05-07T09:16:21.142533159Z caller=server.go:322 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
2024-05-07 18:16:21.240	level=info ts=2024-05-07T09:16:21.143718079Z caller=memberlist_client.go:434 msg="Using memberlist cluster label and node name" cluster_label= node=querier-15-c077b4fd
2024-05-07 18:16:21.240	level=info ts=2024-05-07T09:16:21.146199999Z caller=memberlist_client.go:540 msg="memberlist fast-join starting" nodes_found=1 to_join=2
2024-05-07 18:16:21.240	level=info ts=2024-05-07T09:16:21.161659298Z caller=table_manager.go:271 index-store=boltdb-shipper-2020-10-24 msg="query readiness setup completed" duration=4.037µs distinct_users_len=0 distinct_users=
2024-05-07 18:16:21.240	level=info ts=2024-05-07T09:16:21.161720903Z caller=shipper.go:165 index-store=boltdb-shipper-2020-10-24 msg="starting index shipper in RO mode"
2024-05-07 18:16:21.240	level=info ts=2024-05-07T09:16:21.161731403Z caller=shipper_index_client.go:76 index-store=boltdb-shipper-2020-10-24 msg="starting boltdb shipper in RO mode"
2024-05-07 18:16:21.240	level=info ts=2024-05-07T09:16:21.162592269Z caller=table_manager.go:271 index-store=tsdb-2023-05-29 msg="query readiness setup completed" duration=701ns distinct_users_len=0 distinct_users=
2024-05-07 18:16:21.240	level=info ts=2024-05-07T09:16:21.162606746Z caller=shipper.go:165 index-store=tsdb-2023-05-29 msg="starting index shipper in RO mode"
(similar log lines continue)
2024-05-07 18:16:21.240	level=info ts=2024-05-07T09:16:21.173170252Z caller=table_manager.go:271 index-store=tsdb-2023-11-01 msg="query readiness setup completed" duration=531ns distinct_users_len=0 distinct_users=
2024-05-07 18:16:21.240	level=info ts=2024-05-07T09:16:21.173181603Z caller=shipper.go:165 index-store=tsdb-2023-11-01 msg="starting index shipper in RO mode"
2024-05-07 18:16:21.240	level=info ts=2024-05-07T09:16:21.174938601Z caller=table_manager.go:427 index-store=tsdb-2023-11-11 msg="loading local table index_19675"
2024-05-07 18:16:21.240	ts=2024-05-07T09:16:21.177419229Z caller=spanlogger.go:86 level=info msg="building table names cache"
2024-05-07 18:16:21.339	level=info ts=2024-05-07T09:16:21.263934864Z caller=memberlist_client.go:560 msg="memberlist fast-join finished" joined_nodes=28 elapsed_time=117.736768ms
2024-05-07 18:16:21.339	level=info ts=2024-05-07T09:16:21.26397028Z caller=memberlist_client.go:573 msg="joining memberlist cluster" join_members=loki-gossip-ring.logging.svc:7946
2024-05-07 18:16:21.439	level=info ts=2024-05-07T09:16:21.370329443Z caller=memberlist_client.go:592 msg="joining memberlist cluster succeeded" reached_nodes=28 elapsed_time=106.359273ms
2024-05-07 18:16:21.439	ts=2024-05-07T09:16:21.37882461Z caller=spanlogger.go:86 level=info msg="table names cache built" duration=201.395093ms
2024-05-07 18:16:21.439	ts=2024-05-07T09:16:21.37885665Z caller=spanlogger.go:86 level=info msg="building table cache"
2024-05-07 18:16:21.439	ts=2024-05-07T09:16:21.42304299Z caller=spanlogger.go:86 level=info msg="table cache built" duration=44.18123ms
2024-05-07 18:16:22.339	level=info ts=2024-05-07T09:16:22.260609019Z caller=table_manager.go:427 index-store=tsdb-2023-11-11 msg="loading local table index_19677"
2024-05-07 18:16:22.339	ts=2024-05-07T09:16:22.261837651Z caller=spanlogger.go:86 level=info msg="building table cache"
2024-05-07 18:16:22.339	ts=2024-05-07T09:16:22.333606586Z caller=spanlogger.go:86 level=info msg="table cache built" duration=71.761091ms
2024-05-07 18:16:23.140	level=info ts=2024-05-07T09:16:23.0405501Z caller=table_manager.go:271 index-store=tsdb-2023-11-11 msg="query readiness setup completed" duration=3.978µs distinct_users_len=0 distinct_users=
2024-05-07 18:16:23.140	level=info ts=2024-05-07T09:16:23.040594533Z caller=shipper.go:165 index-store=tsdb-2023-11-11 msg="starting index shipper in RO mode"
2024-05-07 18:16:23.140	level=info ts=2024-05-07T09:16:23.04104867Z caller=table_manager.go:427 index-store=tsdb-2023-11-21 msg="loading local table index_19684"
2024-05-07 18:16:23.140	ts=2024-05-07T09:16:23.042454973Z caller=spanlogger.go:86 level=info msg="building table names cache"
2024-05-07 18:16:23.339	ts=2024-05-07T09:16:23.265336369Z caller=spanlogger.go:86 level=info msg="table names cache built" duration=222.872959ms
2024-05-07 18:16:23.339	ts=2024-05-07T09:16:23.265372596Z caller=spanlogger.go:86 level=info msg="building table cache"
2024-05-07 18:16:23.440	ts=2024-05-07T09:16:23.391070365Z caller=spanlogger.go:86 level=info msg="table cache built" duration=125.689183ms
2024-05-07 18:16:24.142	level=info ts=2024-05-07T09:16:24.093429276Z caller=table_manager.go:427 index-store=tsdb-2023-11-21 msg="loading local table index_19689"
2024-05-07 18:16:24.142	ts=2024-05-07T09:16:24.094239437Z caller=spanlogger.go:86 level=info msg="building table cache"
2024-05-07 18:16:24.239	ts=2024-05-07T09:16:24.168556436Z caller=spanlogger.go:86 level=info msg="table cache built" duration=74.311658ms
2024-05-07 18:16:24.239	level=error ts=2024-05-07T09:16:24.169194136Z caller=index_set.go:132 table-name=index_19689 msg="failed to open existing index file /data/tsdb-cache/index_19689/1701128713-ingester-0-1697093678075716335.tsdb, removing the file and continuing without it to let the sync operation catch up" err="read TOC: read TOC: invalid checksum"
(similar log lines continue)
2024-05-07 18:16:24.839	level=error ts=2024-05-07T09:16:24.749320632Z caller=index_set.go:132 table-name=index_19689 msg="failed to open existing index file /data/tsdb-cache/index_19689/1701204313-ingester-0-1697093678075716335.tsdb, removing the file and continuing without it to let the sync operation catch up" err="read TOC: read TOC: invalid checksum"
2024-05-07 18:16:24.839	level=error ts=2024-05-07T09:16:24.811851182Z caller=index_set.go:285 table-name=index_19689 msg="sync failed, retrying it" err="short write"
2024-05-07 18:16:24.939	level=error ts=2024-05-07T09:16:24.866456999Z caller=index_set.go:285 table-name=index_19689 msg="sync failed, retrying it" err="short write"
2024-05-07 18:16:24.939	level=error ts=2024-05-07T09:16:24.866493217Z caller=index_set.go:103 table-name=index_19689 msg="failed to initialize table index_19689, cleaning it up" err="short write"
2024-05-07 18:16:25.038	short write
2024-05-07 18:16:25.038	error initialising module: store
2024-05-07 18:16:25.038	github.com/grafana/dskit/modules.(*Manager).initModule
2024-05-07 18:16:25.038		/work/loki/vendor/github.com/grafana/dskit/modules/modules.go:138
2024-05-07 18:16:25.038	github.com/grafana/dskit/modules.(*Manager).InitModuleServices
2024-05-07 18:16:25.038		/work/loki/vendor/github.com/grafana/dskit/modules/modules.go:108
2024-05-07 18:16:25.038	github.com/grafana/loki/pkg/loki.(*Loki).Run
2024-05-07 18:16:25.038		/work/loki/pkg/loki/loki.go:461
2024-05-07 18:16:25.038	main.main
2024-05-07 18:16:25.038		/work/loki/cmd/loki/main.go:110
2024-05-07 18:16:25.038	runtime.main
2024-05-07 18:16:25.038		/usr/local/go/src/runtime/proc.go:271
2024-05-07 18:16:25.038	runtime.goexit
2024-05-07 18:16:25.038		/usr/local/go/src/runtime/asm_amd64.s:1695
2024-05-07 18:16:25.038	level=error ts=2024-05-07T09:16:25.038582159Z caller=log.go:230 msg="error running loki" err="short write\nerror initialising module: store\ngithub.com/grafana/dskit/modules.(*Manager).initModule\n\t/work/loki/vendor/github.com/grafana/dskit/modules/modules.go:138\ngithub.com/grafana/dskit/modules.(*Manager).InitModuleServices\n\t/work/loki/vendor/github.com/grafana/dskit/modules/modules.go:108\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/work/loki/pkg/loki/loki.go:461\nmain.main\n\t/work/loki/cmd/loki/main.go:110\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"

In my environment, 15 queriers were running and the cache directories of all of them were (almost) full, but only one crashed; the others kept running after the "short write" errors.

To Reproduce

I could not reproduce this; it is probably very rare.

Steps to reproduce the behavior:

  1. Start Loki v2.9.5 (built from source)
  2. Fill the /data directory (a rough sketch for this step follows the list)
  3. Execute some queries
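
For step 2, a rough way to fill the cache volume until writes start failing with ENOSPC (purely illustrative; the path and chunk size are assumptions, not taken from the report):

package main

import (
	"fmt"
	"os"
)

func main() {
	// Assumed mount point of the querier cache volume; adjust for your setup.
	f, err := os.Create("/data/fill.tmp")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := make([]byte, 1<<20) // write in 1 MiB chunks
	for {
		if _, err := f.Write(buf); err != nil {
			// Expect "no space left on device" once the volume is full.
			fmt.Println("stopped writing:", err)
			return
		}
	}
}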

Expected behavior
Queriers should continue running without crashing (though they may become slow).

Environment:

  • Infrastructure: Kubernetes v1.27.10
  • Deployment tool: jsonnet

Screenshots, Promtail config, or terminal output
(none)

@JStickler added the component/querier and type/bug (Something is not working as expected) labels May 13, 2024