Huge increase in max parts per partition after upgrade to CH 24.4.1.2088 #63717

Open
campi01 opened this issue May 13, 2024 · 4 comments
Labels: cripshnot (A user posted a screenshot or a photo of computer screen instead of a text), st-need-info (We need extra data to continue (waiting for response)), unexpected behaviour

Comments


campi01 commented May 13, 2024

As the title says, after upgrading from 24.3.3.102 to the latest CH version, 24.4.1.2088, we noticed a huge increase in CPU load on the CH cluster, and upon further investigation it seems to come from part merges. I've also noticed a huge increase in max parts per partition since the upgrade; please see the attached screenshot.

I also could not correlate it with any documented change in the release notes, but perhaps it's there and you could point me in the right direction for tuning something that changed between these two releases? We're using the ReplicatedGraphiteMergeTree engine, if that makes any difference.
[Screenshot: 2024-05-13_18-16]

Algunenano (Member) commented:

Please investigate and share information that can help pinpoint the problem. Sadly, the graph alone doesn't tell us anything.

  • What tables are affected? Is it all of them, or only some of them?
  • Has the ingestion pattern changed (system.query_log / system.asynchronous_query_log / system.part_log)?
  • Has the number of merges increased (system.part_log)? (A sketch query is included at the end of this comment.)

Other ideas:

  • What happens if you revert back to 24.3? Does it go back to low CPU / low merges?
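For reference, a minimal sketch of a system.part_log query that could answer the ingestion/merge questions above, counting new parts and merges per hour and per table (the 7-day window is an assumption; adjust it to cover the periods before and after the upgrade):

select
    toStartOfHour(event_time) as hour,
    concat(database, '.', table) as tableName,
    countIf(event_type = 'NewPart') as new_parts,   -- parts created by inserts
    countIf(event_type = 'MergeParts') as merges    -- merge operations completed
from system.part_log
where event_date >= today() - 7
group by hour, tableName
order by hour, tableName;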

Algunenano added the st-need-info label (We need extra data to continue (waiting for response)) on May 13, 2024
campi01 (Author) commented May 14, 2024

Yes indeed, parts per partition go back to normal when I revert to 24.3 (screenshot 1).
[Screenshot 1: 2024-05-14_07-21]

There was no change in the ingestion pattern, and we basically only have this one "data" table and some materialized views. I didn't check exactly which tables had the increased number of parts, but I did catch very long-running (more than 30 minutes) merges in "system.merges" on the aforementioned table, and it was during these long-running merges that the CPU load spiked up to ~600 in some cases. During the same period there was also an increased number of merges (screenshot 2).
[Screenshot 2: 2024-05-14_07-28]
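As a text-based alternative to the screenshots, a sketch along these lines could capture those long-running merges from system.merges while they are in flight (the 30-minute threshold matches what I observed; the column selection is just an assumption):

select
    database,
    table,
    round(elapsed) as elapsed_seconds,                         -- how long the merge has been running
    round(progress, 2) as progress,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) as size,   -- compressed size of the source parts
    formatReadableSize(memory_usage) as memory
from system.merges
where elapsed > 1800
order by elapsed desc;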

Let me know if you require any further information.

Algunenano (Member) commented May 14, 2024

Let me know if you require any further information.

I'm sorry, but you just shared a bunch of screenshots, some of them without a legend. With that information it's impossible to investigate anything. All I can see is that there were more merges than usual and more active parts than usual.

It's a long shot, but if everything else was slow too then it might be related to #63500. Then again, it can't be, because that change was already part of 24.3.

Otherwise check what I mentioned before:

  • How many tables were affected?
  • How many parts per minute were created before the upgrade, after the upgrade, and after the downgrade in the affected tables? (See the sketch query right after this list.)
  • What was the general status of the server in terms of CPU / memory / IO? Did it change?
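A minimal sketch of the parts-per-minute question against system.part_log (the date filter is an assumption; it would be run once per period being compared):

select
    toStartOfMinute(event_time) as minute,
    concat(database, '.', table) as tableName,
    count() as parts_created
from system.part_log
where event_type = 'NewPart'
  and event_date = today()
group by minute, tableName
order by minute, parts_created desc;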

Algunenano added the cripshnot label (A user posted a screenshot or a photo of computer screen instead of a text) on May 14, 2024
campi01 (Author) commented May 14, 2024

The screenshots are from the metrics we get from CH's exported Graphite metrics (https://clickhouse.com/docs/en/operations/monitoring), specifically "MaxPartCountForPartition", so I don't know exactly which tables had increased parts. However, looking into "system.part_log" and comparing 24.3 (2024-05-06) with 24.4 (2024-05-13), then 24.3 again after the downgrade (2024-05-14, only 4 hours so far today, compared against the same timeframe on the previous days), there doesn't seem to be much of a difference:

NewPart

select
    count(*),
    concat(database, '.', table) as tableName,
    event_date
from system.part_log
where (event_time between '2024-05-06 08:00:00' and '2024-05-06 16:00:00'
       or event_time between '2024-05-13 08:00:00' and '2024-05-13 16:00:00'
       or event_time between '2024-05-14 08:00:00' and '2024-05-14 16:00:00')
  and event_type = 'NewPart'
group by event_date, tableName
order by event_date, count();

count()	tableName	event_date
1945	gentime..inner_id.008139e2-315a-45b2-9bc1-6dd014d2c194	2024-05-06
2770	graphite.index	2024-05-06
10674	gentime..inner_id.faba10da-a606-4713-9e5d-e7abe9f6525a	2024-05-06
10809	graphite.data_lr	2024-05-06
1685	gentime..inner_id.008139e2-315a-45b2-9bc1-6dd014d2c194	2024-05-13
2328	graphite.index	2024-05-13
9363	gentime..inner_id.faba10da-a606-4713-9e5d-e7abe9f6525a	2024-05-13
9545	graphite.data_lr	2024-05-13
495	gentime..inner_id.008139e2-315a-45b2-9bc1-6dd014d2c194	2024-05-14
867	graphite.index	2024-05-14
3391	gentime..inner_id.faba10da-a606-4713-9e5d-e7abe9f6525a	2024-05-14
3446	graphite.data_lr	2024-05-14

MergeParts

select
    count(*),
    concat(database, '.', table) as tableName,
    event_date
from system.part_log
where (event_time between '2024-05-06 08:00:00' and '2024-05-06 16:00:00'
       or event_time between '2024-05-13 08:00:00' and '2024-05-13 16:00:00'
       or event_time between '2024-05-14 08:00:00' and '2024-05-14 16:00:00')
  and event_type = 'MergeParts'
group by event_date, tableName
order by event_date, count();

count()	tableName	event_date
4028	graphite.data_lr	2024-05-06
11896	graphite.index	2024-05-06
13184	gentime..inner_id.008139e2-315a-45b2-9bc1-6dd014d2c194	2024-05-06
41213	gentime..inner_id.faba10da-a606-4713-9e5d-e7abe9f6525a	2024-05-06
1970	graphite.data_lr	2024-05-13
10025	graphite.index	2024-05-13
11759	gentime..inner_id.008139e2-315a-45b2-9bc1-6dd014d2c194	2024-05-13
34717	gentime..inner_id.faba10da-a606-4713-9e5d-e7abe9f6525a	2024-05-13
1271	graphite.data_lr	2024-05-14
3482	gentime..inner_id.008139e2-315a-45b2-9bc1-6dd014d2c194	2024-05-14
4940	graphite.index	2024-05-14
12813	gentime..inner_id.faba10da-a606-4713-9e5d-e7abe9f6525a	2024-05-14
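Since the Graphite metric doesn't say which table is involved, a sketch like the following against system.parts could show which table and partition are driving MaxPartCountForPartition at any given moment (the limit and ordering are assumptions):

select
    database,
    table,
    partition,
    count() as active_parts
from system.parts
where active
group by database, table, partition
order by active_parts desc
limit 10;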

For the general status of the server, I unfortunately have to provide you with another screenshot; it shows the spikes in CPU/memory on the cluster that ran 24.4 for only a few hours yesterday. There was also a corresponding increase in I/O reads during this period.
[Screenshot: 2024-05-14_11-59]
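To avoid yet another screenshot, a rough sketch of pulling the same CPU/memory/IO picture as text from system.asynchronous_metric_log (the metric names are assumptions and may differ between versions):

select
    toStartOfFiveMinutes(event_time) as t,
    metric,
    max(value) as max_value
from system.asynchronous_metric_log
where event_date = '2024-05-13'                                   -- day the cluster was running 24.4
  and metric in ('LoadAverage1', 'MemoryResident', 'OSIOWaitTimeNormalized')
group by t, metric
order by t, metric;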

I'll also keep an eye out for #63730 when the new build is released; I've noticed some increased CPU load in general since upgrading to 24.3, so that might also resolve those issues. Thanks.
