The getCompactionTaskCapacity function, which the coordinator uses to check that the Druid cluster has enough task slots before it schedules additional compaction tasks, does not take the overlord dynamic config into account. The overlord dynamic config can prevent compaction tasks from running on specific categories of workers, so the compaction task capacity is overestimated.
For example, I have two worker categories: compaction-category, with a total of 600 task slots, and ingestion-category, with 2000 slots (a high number because of multiple ingestion task replicas).
Using the overlord dynamic config, compaction-category is configured to run the following task types:
kill
compact
single_phase_sub_task
partial_dimension_cardinality
partial_index_generate
partial_index_generic_merge
ingestion-category is configured to run:
index_kafka
Now, getCompactionTaskCapacity returns 2600 as the total capacity, which is inaccurate because only 600 slots are actually available for compaction tasks. This may not cause problems in a healthy cluster, but it becomes critical when compaction tasks fail. The coordinator creates more compaction tasks than the compaction slots can run, which causes contention on those slots and slows down all compaction tasks. This creates a feedback loop: the growing number of compaction tasks worsens the contention and ultimately overwhelms the overlord with more tasks than it can handle.
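To make the arithmetic concrete, here is a minimal Python sketch of the two ways of counting slots for the cluster described above. It is only an illustration, not Druid's actual code: the names WORKER_SLOTS, ALLOWED_TASK_TYPES, naive_capacity and category_aware_capacity are hypothetical, and treating a category with no entry as "not allowed" is a simplifying assumption.

```python
# Hypothetical model of the cluster described in this issue; not Druid code.
WORKER_SLOTS = {
    "compaction-category": 600,
    "ingestion-category": 2000,
}

# Task types each worker category may run, mirroring the overlord dynamic
# config described above.
ALLOWED_TASK_TYPES = {
    "compaction-category": {
        "kill",
        "compact",
        "single_phase_sub_task",
        "partial_dimension_cardinality",
        "partial_index_generate",
        "partial_index_generic_merge",
    },
    "ingestion-category": {"index_kafka"},
}


def naive_capacity(worker_slots):
    """Sum every task slot in the cluster, ignoring category restrictions."""
    return sum(worker_slots.values())


def category_aware_capacity(worker_slots, allowed_task_types, task_type="compact"):
    """Sum only the slots of categories that are allowed to run task_type.

    Assumption: a category missing from allowed_task_types runs nothing.
    """
    return sum(
        slots
        for category, slots in worker_slots.items()
        if task_type in allowed_task_types.get(category, set())
    )


print(naive_capacity(WORKER_SLOTS))                               # 2600
print(category_aware_capacity(WORKER_SLOTS, ALLOWED_TASK_TYPES))  # 600
```

The first number is the overestimate described in this issue; the second is the capacity that compaction tasks can actually use.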
Affected Version
Seen on Druid 25. Also present in master.
We worked around this by explicitly setting maxCompactionTaskSlots to the total available worker slots on the compaction tier. (Actually, we set it a little higher than that to improve middle manager utilisation.)
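A rough Python sketch of why this cap helps, assuming the coordinator limits compaction tasks to the smaller of compactionTaskSlotRatio × total task slots and maxCompactionTaskSlots, as the compaction configuration is documented. The function name compaction_slot_limit and the ratio of 1.0 are hypothetical, chosen only to make the effect of the cap visible.

```python
def compaction_slot_limit(total_capacity, ratio, max_slots):
    """Effective compaction slot limit: min(ratio * total capacity, max_slots)."""
    return min(int(total_capacity * ratio), max_slots)


TOTAL_CAPACITY = 2600      # all worker slots, as counted in this issue
COMPACTION_TIER = 600      # slots that can actually run compaction tasks
NO_CAP = 2**31 - 1         # stands in for an unset maxCompactionTaskSlots

# Hypothetical ratio of 1.0: without a cap, the limit follows the overestimate.
print(compaction_slot_limit(TOTAL_CAPACITY, 1.0, NO_CAP))           # 2600

# With maxCompactionTaskSlots set to the compaction tier's slot count,
# the limit can no longer exceed what the tier can run.
print(compaction_slot_limit(TOTAL_CAPACITY, 1.0, COMPACTION_TIER))  # 600
```

The cap needs to be updated by hand whenever the compaction tier is resized, since it does not track the worker categories.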