
[Compactor] panic: unexpected seriesToChunkEncoder lack of iterations #6775

Closed
piotrhryszko-img opened this issue Oct 5, 2023 · 12 comments · Fixed by #7334
@piotrhryszko-img commented Oct 5, 2023

Thanos, Prometheus and Golang version used:

thanos, version 0.31.0 (branch: HEAD, revision: 50c464132c265eef64254a9fd063b1e2419e09b7)
  build user:       root@63f5f37ee4e8
  build date:       20230323-10:13:38
  go version:       go1.19.7
  platform:         linux/amd64

Object Storage Provider: S3

What happened:
Thanos Compactor throws "panic: unexpected seriesToChunkEncoder lack of iterations" and exits.
What you expected to happen:
Compaction to complete without panicking.

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:


Logs

panic: unexpected seriesToChunkEncoder lack of iterations

goroutine 50 [running]:
github.com/prometheus/prometheus/storage.(*compactChunkIterator).Next(0xc000b56bd0)
	/go/pkg/mod/github.com/prometheus/prometheus@v0.42.0/storage/merge.go:753 +0x88c
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).populateBlock(0xc00091f260, {0xc00061a120, 0x2, 0x69?}, 0xc0003128f0, {0x2b54960, 0xc000562580}, {0x2b4dc80, 0xc000f63310})
	/go/pkg/mod/github.com/prometheus/prometheus@v0.42.0/tsdb/compact.go:771 +0x1488
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).write(0xc00091f260, {0xc000bc62c0, 0x37}, 0xc0003128f0, {0xc00061a120, 0x2, 0x2})
	/go/pkg/mod/github.com/prometheus/prometheus@v0.42.0/tsdb/compact.go:597 +0x64d
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).Compact(0xc00091f260, {0xc000bc62c0, 0x37}, {0xc0000b9fe0, 0x2, 0x4057e40?}, {0x0, 0x0, 0xc0008ea000?})
	/go/pkg/mod/github.com/prometheus/prometheus@v0.42.0/tsdb/compact.go:438 +0x225
github.com/thanos-io/thanos/pkg/compact.(*Group).compact.func3({0x2b5a648?, 0xc00113a960?})
	/app/pkg/compact/compact.go:1075 +0x4a
github.com/thanos-io/thanos/pkg/tracing.DoInSpanWithErr({0x2b5a648?, 0xc00113a960?}, {0x25ef5d1?, 0x2?}, 0xc000e91b48, {0x0?, 0xc000a6a240?, 0x0?})
	/app/pkg/tracing/tracing.go:82 +0xd0
github.com/thanos-io/thanos/pkg/compact.(*Group).compact(0xc001585680, {0x2b5a648, 0xc00113a960}, {0xc000bc62c0, 0x37}, {0x2b42f20, 0xc0005f9bc0}, {0x2b4db40, 0xc00091f260})
	/app/pkg/compact/compact.go:1074 +0xcab
github.com/thanos-io/thanos/pkg/compact.(*Group).Compact.func2({0x2b5a648?, 0xc00113a960?})
	/app/pkg/compact/compact.go:775 +0x65
github.com/thanos-io/thanos/pkg/tracing.DoInSpanWithErr({0x2b5a5a0?, 0xc000666000?}, {0x25fcb34?, 0x9?}, 0xc000e91e30, {0xc000cc2d80?, 0x43cba7?, 0xc000e91d80?})
	/app/pkg/tracing/tracing.go:82 +0xd0
github.com/thanos-io/thanos/pkg/compact.(*Group).Compact(0xc001585680, {0x2b5a5a0, 0xc000666000}, {0xc00089b8a0, 0x1b}, {0x2b42f20, 0xc0005f9bc0}, {0x2b4db40, 0xc00091f260})
	/app/pkg/compact/compact.go:774 +0x35c
github.com/thanos-io/thanos/pkg/compact.(*BucketCompactor).Compact.func2()
	/app/pkg/compact/compact.go:1250 +0x165
created by github.com/thanos-io/thanos/pkg/compact.(*BucketCompactor).Compact
	/app/pkg/compact/compact.go:1247 +0x935

Anything else we need to know:

        - args:
            - compact
            - --log.level=info
            - --log.format=logfmt
            - --http-address=0.0.0.0:10902
            - --objstore.config-file=/etc/config/object-store.yaml
            - --data-dir=/var/thanos/compact
            - --consistency-delay=30m
            - --retention.resolution-raw=30d
            - --retention.resolution-5m=180d
            - --retention.resolution-1h=1y
            - --compact.concurrency=1
            - --wait
            - --deduplication.replica-label=__replica__
@piotrhryszko-img (Author) commented:

I also tried with vertical compaction enabled in another environment and I'm still seeing the same panic:

        - args:
            - compact
            - --log.level=info
            - --log.format=logfmt
            - --http-address=0.0.0.0:10902
            - --objstore.config-file=/etc/config/object-store.yaml
            - --data-dir=/var/thanos/compact
            - --consistency-delay=30m
            - --retention.resolution-raw=30d
            - --retention.resolution-5m=180d
            - --retention.resolution-1h=1y
            - --compact.concurrency=1
            - --wait
            - --deduplication.replica-label=__replica__
            - --compact.enable-vertical-compaction
            - --delete-delay=0

@GiedriusS (Member) commented:

Does this still happen with the newest main version? Could you please try it? 0.31.0 is old :/

@piotrhryszko-img (Author) commented:

Hi @GiedriusS, upgrading to the latest version didn't resolve the issue:

thanos, version 0.32.4 (branch: HEAD, revision: fcd5683e3049924ae26a680e166ae6f27a344896)
  build user:       root@afb5016d2fc4
  build date:       20231002-07:45:12
  go version:       go1.20.8
  platform:         linux/amd64
  tags:             netgo

As per suggestions on Slack, a deduplication function was added, since in our case applications are scraped by multiple Prometheus instances. This stopped the errors from happening. However, it also seems to have caused issues with compaction, which has now been stuck on a single block for more than 3 days. The current configuration is below:

        - args:
            - compact
            - --log.level=debug
            - --log.format=logfmt
            - --http-address=0.0.0.0:10902
            - --objstore.config-file=/etc/config/object-store.yaml
            - --data-dir=/var/thanos/compact
            - --consistency-delay=30m
            - --retention.resolution-raw=30d
            - --retention.resolution-5m=180d
            - --retention.resolution-1h=1y
            - --compact.concurrency=1
            - --wait
            - --deduplication.replica-label=__replica__
            - --deduplication.func=penalty
            - --compact.enable-vertical-compaction
            - --delete-delay=168h

@yeya24 (Contributor) commented Oct 23, 2023

> However, it also seems to have caused issues with compaction, which has now been stuck on a single block for more than 3 days.

What's the reason for the block being stuck? Did you see any errors?

@vCra commented Nov 17, 2023

Hey - I've also seen a similar error on 0.32.4:

{"caller":"compact.go:708","level":"info","msg":"Found overlapping blocks during compaction","ts":"2023-11-17T22:56:51.255652657Z","ulid":"01HFFR0H1PS6EWAP1ARPPZ4ZG8"}
panic: unexpected seriesToChunkEncoder lack of iterations

goroutine 289 [running]:
github.com/prometheus/prometheus/storage.(*compactChunkIterator).Next(0xc000274b40)
	/go/pkg/mod/github.com/prometheus/prometheus@v0.46.1-0.20230818184859-4d8e380269da/storage/merge.go:753 +0x870
github.com/prometheus/prometheus/tsdb.DefaultBlockPopulator.PopulateBlock({}, {0x2d0f3a8, 0xc000789440}, 0xc0008c1500, {0x2cf1be0, 0xc0006ae0c0}, {0x2d00380, 0xc0000d9cc0}, 0xc000012448?, {0xc00143c040, ...}, ...)
	/go/pkg/mod/github.com/prometheus/prometheus@v0.46.1-0.20230818184859-4d8e380269da/tsdb/compact.go:781 +0x1472
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).write(0xc0006c3860, {0xc00106c0f0, 0x29}, 0xc000806bb0, {0x2cfa620, 0x431d070}, {0xc00143c040, 0x2, 0x2})
	/go/pkg/mod/github.com/prometheus/prometheus@v0.46.1-0.20230818184859-4d8e380269da/tsdb/compact.go:601 +0x6db
github.com/prometheus/prometheus/tsdb.(*LeveledCompactor).CompactWithBlockPopulator(0xc0006c3860, {0xc00106c0f0, 0x29}, {0xc00081a340, 0x2, 0x2d28040?}, {0x0, 0x0, 0xc0001ec380?}, {0x2cfa620, ...})
	/go/pkg/mod/github.com/prometheus/prometheus@v0.46.1-0.20230818184859-4d8e380269da/tsdb/compact.go:442 +0x6bb
github.com/thanos-io/thanos/pkg/compact.(*Group).compact.func3({0x2d0f3a8, 0xc001c22420})
	/app/pkg/compact/compact.go:1137 +0x125
github.com/thanos-io/thanos/pkg/tracing.DoInSpanWithErr({0x2d0f3a8?, 0xc001476270?}, {0x277957c?, 0x2?}, 0xc0010a5aa0, {0x0?, 0xc000ebc500?, 0x1?})
	/app/pkg/tracing/tracing.go:82 +0xd0
github.com/thanos-io/thanos/pkg/compact.(*Group).compact(0xc000bbc8c0, {0x2d0f3a8, 0xc001476270}, {0xc00106c0f0, 0x29}, {0x2cf4280, 0xc000789770}, {0x2d07640, 0xc0006c3860}, {0x2cfa920, ...}, ...)
	/app/pkg/compact/compact.go:1132 +0x10ad
github.com/thanos-io/thanos/pkg/compact.(*Group).Compact.func2({0x2d0f3a8?, 0xc001476270?})
	/app/pkg/compact/compact.go:830 +0xd7
github.com/thanos-io/thanos/pkg/tracing.DoInSpanWithErr({0x2d0f300?, 0xc0008186e0?}, {0x2787486?, 0x9?}, 0xc0010a5e10, {0xc0000c60d0?, 0x40e227?, 0x58?})
	/app/pkg/tracing/tracing.go:82 +0xd0
github.com/thanos-io/thanos/pkg/compact.(*Group).Compact(0xc000bbc8c0, {0x2d0f300, 0xc0008186e0}, {0xc0002662a0, 0xd}, {0x2cf4280, 0xc000789770}, {0x2d07640, 0xc0006c3860}, {0x2cfa920, ...}, ...)
	/app/pkg/compact/compact.go:829 +0x3cc
github.com/thanos-io/thanos/pkg/compact.(*BucketCompactor).Compact.func2()
	/app/pkg/compact/compact.go:1373 +0x18a
created by github.com/thanos-io/thanos/pkg/compact.(*BucketCompactor).Compact
	/app/pkg/compact/compact.go:1370 +0x90a

When searching for 01HFFR0H1PS6EWAP1ARPPZ4ZG8 in Bucket Web, nothing shows up. I also can't see a directory with that name within the object storage bucket.

@yeya24 (Contributor) commented Nov 18, 2023

Hi, thanks for all the bug reports. I wonder if it's possible for someone to share the problematic block, since I don't have a good way to reproduce this issue locally. Please let me know - you can reach out to me on Slack.

@bison commented Feb 28, 2024

Seeing this panic on v0.34.0 as well. I also don't see the ULID from the logs in the actual bucket, and running thanos tools bucket verify --log.level=debug --issues=overlapped_blocks against the bucket doesn't show anything.

Would be happy to provide data if I knew how to find the correct blocks.
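
For anyone doing the same triage, a hedged sketch of enumerating what's actually in the bucket - thanos tools bucket inspect prints each block's ULID, time range, resolution, and compaction level, so previously-compacted blocks (level > 1) stand out. The object-store config path here is assumed from the configs posted above; adjust it to your deployment:

    # List every block with its ULID, time range, resolution and
    # compaction level; blocks that were already compacted show a
    # compaction level greater than 1.
    thanos tools bucket inspect \
      --objstore.config-file=/etc/config/object-store.yaml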

@vCra commented Feb 28, 2024

Hey @bison, I think I narrowed this down to Thanos trying to do vertical compaction on already-compacted blocks - this could be the case if you've not previously had vertical compaction enabled.

If you want to try a hacky fix, you can try disabling compaction for all the blocks created before you enabled vertical compaction.

(That's presuming we have the same issue - it could be something different.)

In Compactor, look at the logs from just before it crashed - it should show the blocks it started to compact. You'll need to mark these as no-compact, and you might need to do this many times to cover all the blocks that had already been compacted; a sketch of the marking step is below.
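
A hedged sketch of that marking step - the block ID is a placeholder for a ULID taken from your own compactor logs (not from this issue), and the details text is just an example:

    # Upload a no-compact-mark.json for the given block so the compactor
    # skips it. <BLOCK_ULID> is a placeholder for a block ID seen in the
    # compactor logs just before the panic.
    thanos tools bucket mark \
      --objstore.config-file=/etc/config/object-store.yaml \
      --marker=no-compact-mark.json \
      --id=<BLOCK_ULID> \
      --details="block predates vertical compaction (issue 6775)"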

@yeya24 (Contributor) commented Feb 28, 2024

Hi @vCra, thanks for the investigation.

> I think I narrowed this down to Thanos trying to do vertical compaction on already-compacted blocks - this could be the case if you've not previously had vertical compaction enabled

That's interesting to know. How did you figure this out? Ideally it shouldn't matter to the compactor whether blocks were already compacted or not, so it shouldn't panic. Maybe we're missing something.

@bison commented Feb 29, 2024

@vCra wow, thanks - that's exactly what's happening. We just upgraded this stack, and vertical compaction got enabled where it wasn't before. Now, the first time the compactor encounters two previously compacted blocks at 5m resolution, it panics. If I mark the same blocks (and all other similar blocks) with no-compact, then compaction completes.

Edit: Actually I guess it's any previously compacted block. I originally thought it was only at that resolution for some reason.

@vCra commented Feb 29, 2024

> How did you figure this out?

I'm only guessing that this is the issue - the compactor kept crashing, and I noticed that we were managing to vertically compact all the new blocks without issue, but the old blocks were not getting vertically compacted - in Bucket Web it was quite clear.
The issue was that no downsampling was happening - the count of downsample-todo blocks kept slowly increasing.
Looking at the logs was how we solved it - we thought it could be one or two corrupted blocks, so I kept marking all these blocks as no-compact. We had a large backlog so it took a while, but I slowly started to see a pattern: it was only the old blocks that were having an issue.

Looking at Bucket Web, we still have the old blocks, just not vertically compacted - we don't care too much, as we won't use this data too frequently (10 is with vertical compaction).

[Screenshot: Bucket Web block view, 2024-02-29 23:51]

The discussion in https://cloud-native.slack.com/archives/CK5RSSC10/p1681966324787459 helped too

@GiedriusS (Member) commented:

I spotted this in prod. Looking into it 👁️

GiedriusS added a commit that referenced this issue Apr 30, 2024
For #6775, it would be useful
to know the exact block IDs to aid debugging.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
GiedriusS added a commit that referenced this issue May 1, 2024
For #6775, it would be useful
to know the exact block IDs to aid debugging.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
GiedriusS added a commit that referenced this issue May 3, 2024
Adding a minimal test case for issue #6775 - reproduces the panic in the
compactor.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
GiedriusS added a commit that referenced this issue May 3, 2024
Adding a minimal test case for issue #6775 - reproduces the panic in the
compactor.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
GiedriusS added a commit to vinted/thanos that referenced this issue May 3, 2024
Adding a minimal test case for issue thanos-io#6775 - reproduces the panic in the
compactor.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Nashluffy pushed a commit to Nashluffy/thanos that referenced this issue May 14, 2024
For thanos-io#6775, it would be useful
to know the exact block IDs to aid debugging.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Signed-off-by: mluffman <nashluffman@gmail.com>
Nashluffy pushed a commit to Nashluffy/thanos that referenced this issue May 14, 2024
Adding a minimal test case for issue thanos-io#6775 - reproduces the panic in the
compactor.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Signed-off-by: mluffman <nashluffman@gmail.com>
saswatamcode pushed a commit to saswatamcode/thanos that referenced this issue May 28, 2024
For thanos-io#6775, it would be useful
to know the exact block IDs to aid debugging.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
saswatamcode added a commit that referenced this issue May 28, 2024
* compact: recover from panics (#7318)

For #6775, it would be useful
to know the exact block IDs to aid debugging.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* Sidecar: wait for prometheus on startup (#7323)

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>

* Receive: fix serverAsClient.Series goroutines leak (#6948)

* fix serverAsClient goroutines leak

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* fix lint

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* update changelog

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* delete invalid comment

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* remove temp dev test

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* remove timer channel drain

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

---------

Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>

* Receive: fix stats (#7373)

If we account stats for remote write and local writes we will count them
twice since the remote write will be counted locally again by the remote
receiver instance.

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>

* *: Ensure objstore flag values are masked & disable debug/pprof/cmdline (#7382)

* *: Ensure objstore flag values are masked & disable debug/pprof/cmdline

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* small fix

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

---------

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* Query: dont pass query hints to avoid triggering pushdown (#7392)

If we have a new querier it will create query hints even without the
pushdown feature being present anymore. Old sidecars will then trigger
query pushdown which leads to broken max,min,max_over_time and
min_over_time.

Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>

* Cut patch release v0.35.1

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

---------

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Signed-off-by: Michael Hoffmann <mhoffm@posteo.de>
Signed-off-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Co-authored-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Co-authored-by: Michael Hoffmann <mhoffm@posteo.de>
Co-authored-by: Thibault Mange <22740367+thibaultmg@users.noreply.github.com>