Slow processing of large multi instance collection #12946
Comments
ZDP-Triage:
@megglos, please link the recently introduced fix that surfaced this behavior.
It is #12692 :)
Investigation 🕵️
I had a deeper look at this issue, and I think I now understand why this happens.
TL;DR: It is indeed related to batch processing, specifically to the combination of multi-instances with the new batching approach in the engine.
Details
I played a bit with a unit test and created a large input collection.
Example process:
Scenario
ServiceTask with a multi-instance marker
In the case of a parallel multi-instance we write a ProcessInstanceBatchRecord. This is processed as part of the batch. The corresponding ActivateProcessInstanceBatchProcessor adds ACTIVATE commands to the batch until the batch is full. This is where the first issue appears.
Let's say we have the following batch now:
In the ProcessingStateMachine, where batch processing happens, we try to process the next command (after the first ProcessInstanceBatchRecord). This is the first ACTIVATE element command (for the multi-instance service task), and processing might succeed for the first elements. At some point, we fail with ExceedBatchSize, roll back, and retry only up to the position where processing failed last time, so that at least that batch gets written. This means the retry mechanism is always triggered.
The most problematic part comes next. After we have completed the previous batch, we read one command at a time, which in this case is the ACTIVATE element command. The element is likely a wait state (potentially a service task), which means there are no further follow-up commands (except the Job create). We do this for all commands that are written in between the ProcessInstanceBatchRecords we have seen above.
The problem, then, is that we are not batching persisted commands. We process them one by one, which slows the process instance execution down enormously. This also explains why the partition seems to be blocked and no other instance can make progress, since such big batches are written to the log.
Potential solutions
I think it would be interesting to do a POC for the second point and simply use our LogStreamBatchReader.
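To make the two effects from the investigation above easier to reason about, here is a small toy model in Python. It is only a sketch under stated assumptions and not the engine's code: `fill_batch_with_retry`, `count_processing_steps`, the `ExceededBatchSize` class and the byte sizes are all hypothetical names and numbers chosen for illustration, and it does not use the real LogStreamBatchReader API. The first function mimics "fail, roll back, retry up to the failing position"; the second compares reading persisted commands one at a time with reading several per processing step.

```python
from collections import deque


class ExceededBatchSize(Exception):
    """Toy stand-in for the batch-size failure described above (hypothetical)."""


def fill_batch_with_retry(command_sizes, max_batch_size):
    """Process commands into one batch; on overflow, roll back and retry a prefix.

    Returns a tuple (commands_in_batch, attempts).
    """
    limit = len(command_sizes)
    attempts = 0
    while True:
        attempts += 1
        used = 0
        processed = 0
        try:
            for size in command_sizes[:limit]:
                if used + size > max_batch_size:
                    raise ExceededBatchSize()
                used += size
                processed += 1
            return processed, attempts
        except ExceededBatchSize:
            # Roll back (discard `used`) and retry only up to the command
            # that did not fit into the batch.
            limit = processed


def count_processing_steps(num_persisted_commands, read_limit):
    """Count processing steps when at most `read_limit` commands are read per step."""
    if read_limit < 1:
        raise ValueError("read_limit must be at least 1")
    pending = deque(range(num_persisted_commands))
    steps = 0
    while pending:
        for _ in range(min(read_limit, len(pending))):
            pending.popleft()
        steps += 1
    return steps


if __name__ == "__main__":
    # Ten commands of 10 "bytes" each, batch limit of 35: the first attempt
    # fails on the fourth command, the second attempt writes three commands.
    print(fill_batch_with_retry([10] * 10, 35))
    # Persisted ACTIVATE commands read one by one versus 100 at a time.
    print(count_processing_steps(1000, 1), count_processing_steps(1000, 100))
```

In this model every overflowing batch costs two attempts, and reading one persisted command per step needs as many steps as there are commands. That is the cost a batched reader would be expected to reduce, assuming the per-step overhead dominates.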
Describe the bug
As discovered in the recent chaos day, it seems that the current batching of large multi-instances slows processing down by a lot. It worked in general, which is great, but it was quite slow.
A large input collection could consume the complete processing capacity of a partition.
In the current state, it is not clear where the issue lies, but it seems that batch processing is limited to just 2-4 commands per batch, which slows down the creation of instances.
We should further investigate whether this issue is related to batch processing itself or to the new instance batching.
To Reproduce
Use https://github.com/camunda-cloud/game-day-cookbook/tree/main/mains/large-multi-instance to reproduce this.
Expected behavior
I would expect instance creation to be faster and to make more use of batch processing.
For further details take a look at the chaos day summary.