Compactor fails to upload indexes larger than 1G to swift object storage #8102


Describe the bug

We noticed that the compactor is failing to compact two specific 12-hour blocks into a 24-hour block, with the error below:

ts=2024-05-10T07:29:47.807232748Z caller=bucket_compactor.go:276 level=error component=compactor user=TDI groupKey=0@17241709254077376921-merge--1714608000000-1714694400000 minTime="2024-05-02 00:00:00 +0000 UTC" maxTime="2024-05-03 00:00:00 +0000 UTC" msg="compaction job failed" duration=4m51.795055275s duration_ms=291795 err="upload of 01HXGP9H6ZGPE25A9T69307XDJ failed: upload index: upload file /data/compact/0@17241709254077376921-merge--1714608000000-1714694400000/01HXGP9H6ZGPE25A9T69307XDJ/index as 01HXGP9H6ZGPE25A9T69307XDJ/index: upload object close: Timeout when reading or writing data"

At the same time, objects with a segments/ prefix are being created in our Swift object storage:

1.0G 2024-05-10 07:29:35 application/octet-stream segments/544/4492f3031485847503948365a4750453235413954363933303758444a2f696e646578b9d1151d50c5e930e7ee38ad67c59e07c9f95db97409ec1de81ab47562d9b4dcda39a3ee5e6b4b0d3255bfef95601890afd80709/0000000000000001      
16M 2024-05-10 07:29:36 application/octet-stream segments/544/4492f3031485847503948365a4750453235413954363933303758444a2f696e646578b9d1151d50c5e930e7ee38ad67c59e07c9f95db97409ec1de81ab47562d9b4dcda39a3ee5e6b4b0d3255bfef95601890afd80709/0000000000000002
1.0G 2024-05-10 07:35:31 application/octet-stream segments/544/4492f3031485847504d3453424d4e4d314a57384333393148515850412f696e64657853d9eb59d0676a3b79809d2b356d6874a8919886feff2895e78b226559f0fe6ada39a3ee5e6b4b0d3255bfef95601890afd80709/0000000000000001
16M 2024-05-10 07:35:32 application/octet-stream segments/544/4492f3031485847504d3453424d4e4d314a57384333393148515850412f696e64657853d9eb59d0676a3b79809d2b356d6874a8919886feff2895e78b226559f0fe6ada39a3ee5e6b4b0d3255bfef95601890afd80709/0000000000000002
1.0G 2024-05-10 07:42:05 application/octet-stream segments/544/4492f3031485847513043414b595a334e31365245504e4d44594335392f696e646578494e8fd4d588a05d1595c9806686fdb4d0afe120b7d42cbfa5d5ed51f914781eda39a3ee5e6b4b0d3255bfef95601890afd80709/0000000000000001
16M 2024-05-10 07:42:05 application/octet-stream segments/544/4492f3031485847513043414b595a334e31365245504e4d44594335392f696e646578494e8fd4d588a05d1595c9806686fdb4d0afe120b7d42cbfa5d5ed51f914781eda39a3ee5e6b4b0d3255bfef95601890afd80709/0000000000000002

We tried increasing the request_timeout, but it didn't help; we just got a different error:

ts=2024-05-06T06:59:39.166816963Z caller=bucket_compactor.go:276 level=error component=compactor user=TDI groupKey=0@17241709254077376921-merge--1714608000000-1714694400000 minTime="2024-05-02 00:00:00 +0000 UTC" maxTime="2024-05-03 00:00:00 +0000 UTC" msg="compaction job failed" duration=5m46.25127905s duration_ms=346251 err="upload of 01HX6AXF9QJ4QGFQHFYXRXEHGC failed: upload index: upload file /data/compact/0@17241709254077376921-merge--1714608000000-1714694400000/01HX6AXF9QJ4QGFQHFYXRXEHGC/index as 01HX6AXF9QJ4QGFQHFYXRXEHGC/index: upload object close: HTTP Error: 408: 408 Request Timeout"
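
For reference, this is roughly the override we tried; the field name and its location are assumed from the Mimir/Thanos Swift backend configuration, so treat it as a sketch rather than a verified config:

```yaml
# Sketch of the timeout increase we tried (field name and location assumed
# from the Mimir/Thanos Swift backend configuration; value is illustrative).
blocks_storage:
  backend: swift
  swift:
    request_timeout: 15m   # raised well above the default; only changed the error to a 408
```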

It seems that the issue occurs when the size of the index is larger than 1 GiB. From what we have been able to find, the segments objects are created by the thanos-io objstore client, which Mimir uses to talk to Swift. According to its Swift documentation:
By default, OpenStack Swift has a limit for maximum file size of 5 GiB. Thanos index files are often larger than that. To resolve this issue, Thanos uses Static Large Objects (SLO) which are uploaded as segments. These are by default put into the segments directory of the same container. The default limit for using SLO is 1 GiB which is also the maximum size of the segment. If you don't want to use the same container for the segments (best practise is to use <container_name>_segments to avoid polluting listing of the container objects) you can use the large_file_segments_container_name option to override the default and put the segments to other container. In rare cases you can switch to Dynamic Large Objects (DLO) by setting the use_dynamic_large_objects to true, but use it with caution since it even more relies on eventual consistency.
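
If we read that correctly, the relevant knobs live in the Swift client configuration. Below is a sketch of the options named in the quote as they appear in a Thanos-style Swift config; the key names are taken from the quoted documentation, and whether Mimir exposes them at all is exactly what is unclear to us:

```yaml
# Thanos-style Swift options referenced in the quote above (key names taken
# from that documentation; unclear whether Mimir exposes them in its config).
swift:
  container_name: mimir-blocks                               # hypothetical container name
  large_file_segments_container_name: mimir-blocks_segments  # keep SLO segments out of the main container listing
  use_dynamic_large_objects: false                            # DLO relies even more on eventual consistency
```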

To overcome the problem we have introduced grouping and sharding in our compactors, which seems to reduce the size of the indexes. We had not been using sharding until now, as we have ~4.5M metrics and the recommendation is to have 1 group per 8M metrics.
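
For completeness, this is roughly the sharding we enabled; the parameter names are assumed from the Mimir split-and-merge compactor documentation, and the values are simply what we chose for our series count:

```yaml
# Sketch of the compactor sharding we enabled (parameter names assumed from
# the Mimir split-and-merge compactor docs; values are what we picked).
limits:
  compactor_split_and_merge_shards: 2   # shards the output so each block gets a smaller index
  compactor_split_groups: 2             # splits source blocks into groups during the split stage
```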

Please note that, using the OpenStack swiftclient, we can upload files larger than 1 GB in a few seconds, so the issue does not seem to be on the storage side.

To Reproduce

Integrate Mimir with Swift object storage (a minimal configuration sketch follows these steps).
Try to compact blocks whose resulting index will be greater than 1GB.
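
A minimal sketch of the Swift integration we use for blocks storage; field names are assumed from the Mimir Swift backend documentation and all values are placeholders:

```yaml
# Minimal sketch of a Mimir blocks storage configuration backed by Swift
# (field names assumed from the Mimir Swift backend docs; values are placeholders).
blocks_storage:
  backend: swift
  swift:
    auth_url: https://keystone.example.com/v3   # hypothetical Keystone endpoint
    username: mimir
    password: ${SWIFT_PASSWORD}
    project_name: observability
    container_name: mimir-blocks
```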

Expected behavior

Files with size greater than 1 GB should be successfully uploaded to Swift object storage.

Environment

  • Infrastructure: Kubernetes, Swift object storage
  • Deployment tool: Helm

Additional Context

Compactor Logs

ts=2024-05-10T07:30:50.274591314Z caller=bucket_compactor.go:301 level=info component=compactor user=TDI groupKey=0@17241709254077376921-merge--1714608000000-1714694400000 minTime="2024-05-02 00:00:00 +0000 UTC" maxTime="2024-05-03 00:00:00 +0000 UTC" msg="compaction available and planned; downloading blocks" blocks=2 plan="[01HWWPFGW91B8T486QQ34WH8YB (min time: 1714608000000, max time: 1714651200000) 01HWXY9JP93H6XVJ9VDT1TRFGH (min time: 1714651200000, max time: 1714694400000)]"
ts=2024-05-10T07:31:32.639757073Z caller=bucket_compactor.go:349 level=info component=compactor user=TDI groupKey=0@17241709254077376921-merge--1714608000000-1714694400000 minTime="2024-05-02 00:00:00 +0000 UTC" maxTime="2024-05-03 00:00:00 +0000 UTC" msg="downloaded and verified blocks; compacting blocks" blocks=2 plan="[/data/compact/0@17241709254077376921-merge--1714608000000-1714694400000/01HWWPFGW91B8T486QQ34WH8YB /data/compact/0@17241709254077376921-merge--1714608000000-1714694400000/01HWXY9JP93H6XVJ9VDT1TRFGH]" duration=42.365066403s duration_ms=42365
ts=2024-05-10T07:34:25.321168209Z caller=compact.go:510 level=info component=compactor msg="compact blocks" count=2 mint=1714608000000 maxt=1714694400000 ulid=01HXGPM4SBMNM1JW8C391HQXPA sources="[01HWWPFGW91B8T486QQ34WH8YB 01HWXY9JP93H6XVJ9VDT1TRFGH]" duration=2m52.681326367s shard=1_of_1
ts=2024-05-10T07:34:25.404673747Z caller=bucket_compactor.go:379 level=info component=compactor user=TDI groupKey=0@17241709254077376921-merge--1714608000000-1714694400000 minTime="2024-05-02 00:00:00 +0000 UTC" maxTime="2024-05-03 00:00:00 +0000 UTC" msg="compacted blocks" new=[01HXGPM4SBMNM1JW8C391HQXPA] blocks="[/data/compact/0@17241709254077376921-merge--1714608000000-1714694400000/01HWWPFGW91B8T486QQ34WH8YB /data/compact/0@17241709254077376921-merge--1714608000000-1714694400000/01HWXY9JP93H6XVJ9VDT1TRFGH]" duration=2m52.76479694s duration_ms=172764
ts=2024-05-10T07:35:43.863762387Z caller=bucket_compactor.go:276 level=error component=compactor user=TDI groupKey=0@17241709254077376921-merge--1714608000000-1714694400000 minTime="2024-05-02 00:00:00 +0000 UTC" maxTime="2024-05-03 00:00:00 +0000 UTC" msg="compaction job failed" duration=4m53.589274237s duration_ms=293589 err="upload of 01HXGPM4SBMNM1JW8C391HQXPA failed: upload index: upload file /data/compact/0@17241709254077376921-merge--1714608000000-1714694400000/01HXGPM4SBMNM1JW8C391HQXPA/index as 01HXGPM4SBMNM1JW8C391HQXPA/index: upload object close: Timeout when reading or writing data"

Source blocks

511M 2024-05-02 13:06:34 application/octet-stream TDI/01HWWPFGW91B8T486QQ34WH8YB/chunks/000001
511M 2024-05-02 13:06:39 application/octet-stream TDI/01HWWPFGW91B8T486QQ34WH8YB/chunks/000002
511M 2024-05-02 13:06:45 application/octet-stream TDI/01HWWPFGW91B8T486QQ34WH8YB/chunks/000003
193M 2024-05-02 13:06:47 application/octet-stream TDI/01HWWPFGW91B8T486QQ34WH8YB/chunks/000004
678M 2024-05-02 13:06:52 application/octet-stream TDI/01HWWPFGW91B8T486QQ34WH8YB/index
12K 2024-05-02 13:06:52         application/json TDI/01HWWPFGW91B8T486QQ34WH8YB/meta.json
511M 2024-05-03 00:45:20 application/octet-stream TDI/01HWXY9JP93H6XVJ9VDT1TRFGH/chunks/000001
511M 2024-05-03 00:45:26 application/octet-stream TDI/01HWXY9JP93H6XVJ9VDT1TRFGH/chunks/000002
511M 2024-05-03 00:45:31 application/octet-stream TDI/01HWXY9JP93H6XVJ9VDT1TRFGH/chunks/000003
196M 2024-05-03 00:45:34 application/octet-stream TDI/01HWXY9JP93H6XVJ9VDT1TRFGH/chunks/000004
699M 2024-05-03 00:45:39 application/octet-stream TDI/01HWXY9JP93H6XVJ9VDT1TRFGH/index
13K 2024-05-03 00:45:40         application/json TDI/01HWXY9JP93H6XVJ9VDT1TRFGH/meta.json

