Huge increase in max parts per partition after upgrade to CH 24.4.1.2088 #63717

Open
campi01 opened this issue May 13, 2024 · 4 comments
Labels: cripshnot (A user posted a screenshot or a photo of computer screen instead of a text), st-need-info (We need extra data to continue (waiting for response)), unexpected behaviour

Comments


campi01 commented May 13, 2024

As the title says, after upgrading from 24.3.3.102 to the latest CH version, 24.4.1.2088, we noticed a huge increase in CPU load on the CH cluster, and upon further investigation it seems to come from part merges. I've also noticed a huge increase in max parts per partition since the upgrade; please see the attached screenshot.

I also could not correlate it with any documented change in the release notes, but perhaps it's there and you could point me in the right direction for tuning something that changed between these two releases? We're using the ReplicatedGraphiteMergeTree engine, if that makes any difference.
[Screenshot: 2024-05-13_18-16]

Algunenano (Member) commented:

Please investigate and share information that can help pinpoint the problem. Sadly, the graph alone doesn't tell us anything.

  • What tables are affected? Is it all of them, or only some of them?
  • Has the ingestion pattern changed (system.query_log / system.asynchronous_query_log / system.part_log)?
  • Has the number of merges increased (system.part_log)? (A sketch query is included at the end of this comment.)

Other ideas:

  • What happens if you revert back to 24.3? Does it go back to low CPU / low merges?
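For reference, a minimal sketch of a system.part_log query that could answer the ingestion/merge questions above, counting new parts and merges per hour and per table (the 7-day window is an assumption; adjust it to cover the periods before and after the upgrade):

select
    toStartOfHour(event_time) as hour,
    concat(database, '.', table) as tableName,
    countIf(event_type = 'NewPart') as new_parts,   -- parts created by inserts
    countIf(event_type = 'MergeParts') as merges    -- merge operations completed
from system.part_log
where event_date >= today() - 7
group by hour, tableName
order by hour, tableName;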

Algunenano added the st-need-info label (We need extra data to continue (waiting for response)) on May 13, 2024
campi01 (Author) commented May 14, 2024

Yes indeed, parts per partition go back to normal when I revert to 24.3 (screenshot 1).
[Screenshot 1: 2024-05-14_07-21]

There was no change in the ingestion pattern, and we basically only have this one "data" table and some materialized views. I didn't check exactly which tables had the increased number of parts, but I did catch very long-running (more than 30 minutes) merges in "system.merges" on the aforementioned table, and it was during these long-running merges that the CPU load spiked up to ~600 in some cases. During the same period there was also an increased number of merges (screenshot 2).
[Screenshot 2: 2024-05-14_07-28]
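As a text-based alternative to the screenshots, a sketch along these lines could capture those long-running merges from system.merges while they are in flight (the 30-minute threshold matches what I observed; the column selection is just an assumption):

select
    database,
    table,
    round(elapsed) as elapsed_seconds,                         -- how long the merge has been running
    round(progress, 2) as progress,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) as size,   -- compressed size of the source parts
    formatReadableSize(memory_usage) as memory
from system.merges
where elapsed > 1800
order by elapsed desc;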

Let me know if you require any further information.

Algunenano (Member) commented May 14, 2024

Let me know if you require any further information.

I'm sorry, but you just shared a bunch of screenshots, some of them without a legend. With that information it's impossible to investigate anything. All I can see is that there were more merges than usual and more active parts than usual.

It's a long shot, but if everything else was slow too then it might be related to #63500. Then again, it can't be, because that change was already part of 24.3.

Otherwise check what I mentioned before:

  • How many tables were affected?
  • How many parts per minute were created before the upgrade, after the upgrade, and after the downgrade in the affected tables? (See the sketch query right after this list.)
  • What was the general status of the server in terms of CPU / memory / IO? Did it change?
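A minimal sketch of the parts-per-minute question against system.part_log (the date filter is an assumption; it would be run once per period being compared):

select
    toStartOfMinute(event_time) as minute,
    concat(database, '.', table) as tableName,
    count() as parts_created
from system.part_log
where event_type = 'NewPart'
  and event_date = today()
group by minute, tableName
order by minute, parts_created desc;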

Algunenano added the cripshnot label (A user posted a screenshot or a photo of computer screen instead of a text) on May 14, 2024
campi01 (Author) commented May 14, 2024

The screenshots are from the metrics we get from CH's exported Graphite metrics (https://clickhouse.com/docs/en/operations/monitoring), specifically "MaxPartCountForPartition", so I don't know exactly which tables had increased parts. However, looking into "system.part_log" and comparing 24.3 (2024-05-06) with 24.4 (2024-05-13), then 24.3 again after the downgrade (2024-05-14, only 4 hours so far today, compared against the same timeframe on the previous days), there doesn't seem to be much of a difference:

NewPart

select
    count(*),
    concat(database, '.', table) as tableName,
    event_date
from system.part_log
where (event_time between '2024-05-06 08:00:00' and '2024-05-06 16:00:00'
       or event_time between '2024-05-13 08:00:00' and '2024-05-13 16:00:00'
       or event_time between '2024-05-14 08:00:00' and '2024-05-14 16:00:00')
  and event_type = 'NewPart'
group by event_date, tableName
order by event_date, count();

count()	tableName	event_date
1945	gentime..inner_id.008139e2-315a-45b2-9bc1-6dd014d2c194	2024-05-06
2770	graphite.index	2024-05-06
10674	gentime..inner_id.faba10da-a606-4713-9e5d-e7abe9f6525a	2024-05-06
10809	graphite.data_lr	2024-05-06
1685	gentime..inner_id.008139e2-315a-45b2-9bc1-6dd014d2c194	2024-05-13
2328	graphite.index	2024-05-13
9363	gentime..inner_id.faba10da-a606-4713-9e5d-e7abe9f6525a	2024-05-13
9545	graphite.data_lr	2024-05-13
495	gentime..inner_id.008139e2-315a-45b2-9bc1-6dd014d2c194	2024-05-14
867	graphite.index	2024-05-14
3391	gentime..inner_id.faba10da-a606-4713-9e5d-e7abe9f6525a	2024-05-14
3446	graphite.data_lr	2024-05-14

MergeParts

select
    count(*),
    concat(database, '.', table) as tableName,
    event_date
from system.part_log
where (event_time between '2024-05-06 08:00:00' and '2024-05-06 16:00:00'
       or event_time between '2024-05-13 08:00:00' and '2024-05-13 16:00:00'
       or event_time between '2024-05-14 08:00:00' and '2024-05-14 16:00:00')
  and event_type = 'MergeParts'
group by event_date, tableName
order by event_date, count();

count()	tableName	event_date
4028	graphite.data_lr	2024-05-06
11896	graphite.index	2024-05-06
13184	gentime..inner_id.008139e2-315a-45b2-9bc1-6dd014d2c194	2024-05-06
41213	gentime..inner_id.faba10da-a606-4713-9e5d-e7abe9f6525a	2024-05-06
1970	graphite.data_lr	2024-05-13
10025	graphite.index	2024-05-13
11759	gentime..inner_id.008139e2-315a-45b2-9bc1-6dd014d2c194	2024-05-13
34717	gentime..inner_id.faba10da-a606-4713-9e5d-e7abe9f6525a	2024-05-13
1271	graphite.data_lr	2024-05-14
3482	gentime..inner_id.008139e2-315a-45b2-9bc1-6dd014d2c194	2024-05-14
4940	graphite.index	2024-05-14
12813	gentime..inner_id.faba10da-a606-4713-9e5d-e7abe9f6525a	2024-05-14
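Since the Graphite metric doesn't say which table is involved, a sketch like the following against system.parts could show which table and partition are driving MaxPartCountForPartition at any given moment (the limit and ordering are assumptions):

select
    database,
    table,
    partition,
    count() as active_parts
from system.parts
where active
group by database, table, partition
order by active_parts desc
limit 10;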

For the general status of the server, I unfortunately have to provide you with another screenshot; it shows the spikes in CPU/memory on the cluster that ran 24.4 for only a few hours yesterday. There was also a corresponding increase in I/O reads during this period.
[Screenshot: 2024-05-14_11-59]
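To avoid yet another screenshot, a rough sketch of pulling the same CPU/memory/IO picture as text from system.asynchronous_metric_log (the metric names are assumptions and may differ between versions):

select
    toStartOfFiveMinutes(event_time) as t,
    metric,
    max(value) as max_value
from system.asynchronous_metric_log
where event_date = '2024-05-13'                                   -- day the cluster was running 24.4
  and metric in ('LoadAverage1', 'MemoryResident', 'OSIOWaitTimeNormalized')
group by t, metric
order by t, metric;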

I'll also keep an eye out for #63730 when the new build is released; I've noticed some increased CPU load in general since upgrading to 24.3, so that might also resolve those issues. Thanks.
