Slow processing of large multi instance collection #12946

Open

Zelldon opened this issue Jun 2, 2023 · 8 comments
Labels
component/engine, component/stream-platform, component/zeebe (Related to the Zeebe component/team), kind/bug (Categorizes an issue or PR as a bug), severity/high (Marks a bug as having a noticeable impact on the user with no known workaround)

Comments

@Zelldon
Member

Zelldon commented Jun 2, 2023

Describe the bug

As discovered on the recent chaos day, the current batching of large multi-instance collections slows processing down considerably. It worked in general, which is great, but it was quite slow.

A large input collection can end up consuming the complete processing capacity of a partition.

In the current state, it is not clear where the issue lies, but it seems that the batch processing is limited to just 2-4 commands per batch, which slows down the creation of instances.

We should investigate further whether this issue is related to batch processing itself or to the new instance batching.

To Reproduce

Use https://github.com/camunda-cloud/game-day-cookbook/tree/main/mains/large-multi-instance in order to reproduce this

Expected behavior

I would expect instance creation to be faster and to make more use of batch processing.

For further details take a look at the chaos day summary.

@Zelldon added the kind/bug, component/engine and component/stream-platform labels on Jun 2, 2023
@megglos added the severity/high label on Jun 8, 2023
@megglos
Contributor

megglos commented Jun 8, 2023

ZDP-Triage:

  • it's not clear yet at what number of multi-instances the performance degradation becomes noticeable
  • degrades performance of the whole partition (took 10m for creating 20k instances)
  • cancellation might cause the same performance hit
  • there is no workaround, it eventually completes but requires a considerable amount of time
  • might be caused by a bug in batch processing => we should validate this first to decide on how to move on
  • this affects 8.2 as well, since the fix for multi-instance was provided with that version

@megglos to link the recent fix that surfaced this behavior

@remcowesterhoud
Contributor

link the recent fix that surfaced this behavior

It is #12692 :)

@korthout
Member

ZPA triage:

  • a lower batch size may improve the performance, similar to message TTL checking / due-date timer checking
  • we want to have a look at limiting the batch size with a configurable number
  • at least it works now, but the degraded performance is not great
  • as a workaround, modeling batching in the process (the previous workaround) is still a possibility to avoid the degraded performance
  • priority is later

@megglos
Contributor

megglos commented Jun 29, 2023

ZDP-Planning:

  • it results in a bad user experience, as internal commands cause a high load leading to backpressure of user requests even if user requests come in at a very low rate (this has already affected a customer)
  • the available workaround of lowering the batch size has only a minimal effect
  • might be caused by a bug in batch processing => we should validate this first to decide on how to move on => @Zelldon let's validate this next

@Zelldon
Member Author

Zelldon commented Jul 3, 2023

Investigation 🕵️

I had a deeper look at this issue. I think I now understand why this happens.

TL;DR: It is indeed related to batch processing, but also to its combination with multi-instance elements and the new batching approach in the engine.

Details

I played a bit with a unit test we have and created a large input collection: IntStream.range(1, 100_000).boxed().toList();

Example process:

Bpmn.createExecutableProcess(PROCESS_ID)
        .startEvent()
        .serviceTask(
            ELEMENT_ID,
            t -> t.zeebeJobType(jobType).multiInstance(INPUT_VARIABLE_BUILDER.andThen(builder)))
        .endEvent()
        .done();
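
For context, outside the unit test the same scenario can be driven against a running broker via the Zeebe Java client, roughly as in the sketch below. The process id "large-multi-instance" and the variable name "items" are assumptions made for this illustration; they are not taken from the test or the game-day cookbook.

// Illustrative reproduction sketch; process id and variable name are assumed.
import io.camunda.zeebe.client.ZeebeClient;
import java.util.List;
import java.util.Map;
import java.util.stream.IntStream;

public class LargeMultiInstanceRepro {
  public static void main(String[] args) {
    final List<Integer> items = IntStream.range(1, 100_000).boxed().toList();
    try (final ZeebeClient client = ZeebeClient.newClientBuilder().usePlaintext().build()) {
      client
          .newCreateInstanceCommand()
          .bpmnProcessId("large-multi-instance") // assumed process id
          .latestVersion()
          .variables(Map.of("items", items)) // assumed input collection variable
          .send()
          .join();
    }
  }
}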

Scenario

ServiceTask with a multi-instance marker

In the case of a parallel multi-instance we write a ProcessInstanceBatchRecord. This will be processed as part of the batch.

The corresponding ActivateProcessInstanceBatchProcessor will add ACTIVATE commands to the batch, until the batch is full. Here the first issue appears.

Let's say we have the following batch now: [..., PI Batch Activate (index: 100000), Activate Element, Activate Element, Activate Element, ..., PI Batch Activate (index: 89021)]. Due to the logic in the ActivateProcessInstanceBatchProcessor we fill the batch almost completely; at the end we add another ProcessInstanceBatchRecord with the remaining elements.
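
To make that concrete, here is a heavily simplified sketch of the fill-then-defer behaviour; all names and types are illustrative placeholders, not the actual ActivateProcessInstanceBatchProcessor code.

import java.util.ArrayList;
import java.util.List;

/** Illustrative only; the real processor differs in many details. */
final class BatchActivationSketch {

  record Command(String type, long index) {}

  /**
   * Appends ACTIVATE commands for the children of the multi-instance body until the
   * result batch is nearly full, then defers the rest by appending another
   * ProcessInstanceBatch-style command that carries the remaining index.
   */
  static List<Command> processBatchCommand(long remainingIndex, int maxBatchSize) {
    final List<Command> resultBatch = new ArrayList<>();
    long index = remainingIndex;
    while (index > 0 && resultBatch.size() < maxBatchSize - 1) {
      resultBatch.add(new Command("ACTIVATE_ELEMENT", index));
      index--;
    }
    if (index > 0) {
      // Batch is full: write a follow-up batch command for the remainder.
      resultBatch.add(new Command("PROCESS_INSTANCE_BATCH_ACTIVATE", index));
    }
    return resultBatch;
  }

  public static void main(String[] args) {
    final List<Command> batch = processBatchCommand(100_000, 100);
    System.out.println(batch.size() + " commands, last: " + batch.get(batch.size() - 1));
  }
}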

In the ProcessingStateMachine, where batch processing happens, we will try to process the next command (after the first ProcessInstanceBatchRecord). This is the first activate-element command (for the multi-instance service task), and it might succeed for the first elements. At some point we will fail with ExceedBatchSize, roll back, and retry only up to the position where it last failed, so that at least that smaller batch gets written.

This means we will always have that retry mechanism triggered.
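
Roughly sketched, that retry behaviour looks like the following; the names are placeholders and the real ProcessingStateMachine is considerably more involved.

import java.util.List;

/** Illustrative sketch of the rollback-and-retry behaviour; not the actual engine code. */
final class BatchRetrySketch {

  interface TypedCommand {}

  static final class ExceededBatchSizeException extends RuntimeException {}

  interface Processing {
    /** Processes one command and buffers its follow-up records; may overflow the batch. */
    void process(TypedCommand command) throws ExceededBatchSizeException;

    /** Discards everything buffered for the current attempt. */
    void rollback();

    /** Writes the buffered result batch to the log. */
    void writeBatch();
  }

  /** Processes as many commands as fit; on overflow, rolls back and retries up to the failing position. */
  static void processWithRetry(List<TypedCommand> commands, Processing processing) {
    int limit = commands.size();
    while (limit > 0) {
      int current = 0;
      try {
        while (current < limit) {
          processing.process(commands.get(current));
          current++;
        }
        processing.writeBatch();
        return;
      } catch (ExceededBatchSizeException e) {
        // Roll back and retry, but only up to the command that overflowed the batch,
        // so that at least the commands before it get written as one batch.
        processing.rollback();
        limit = current;
      }
    }
  }
}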

The most problematic part comes next.

After we have completed that previous batch, we read one command at a time, which in this case is the ACTIVATE element command. The activated element is likely a wait state (potentially a service task), which means there will be no more follow-up commands (except the job create). We do this for all the commands that were written in between these ProcessInstanceBatchRecords we have seen above.

This means the problem here is that we are not batching persisted commands but processing them one by one, which slows down the process instance execution enormously. This also explains why the partition seems to be blocked and no other instance can make progress, since such big batches are written to the log.

Potential solutions

  1. Batch processing could be more clever and detect earlier that the batch is already full enough, to avoid the retry.
  2. Read a batch of commands when they belong to the same process instance.
  3. Command batches could be more clever: instead of writing 1000+x commands, just write one that says to activate 1000 elements (a rough sketch of such a record follows below). The event can contain all keys, but this might only defer the problem, since we still need to write the next commands, like creating a job. But maybe this can be combined as well; it would be interesting to investigate.
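
Purely as an illustration of the third idea, a single "activate N children" record could have a shape like the following; no such record exists in Zeebe today and all field names are made up.

import java.util.List;

/** Hypothetical record value for option 3; illustrative only. */
record BulkActivateChildrenValue(
    long multiInstanceBodyKey,      // the multi-instance body whose children get activated
    String childElementId,          // BPMN element id of the inner activity
    int numberOfChildren,           // how many children this single command activates
    List<Long> activatedChildKeys   // child keys, carried on the resulting event
) {}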

@Zelldon
Member Author

Zelldon commented Jul 3, 2023

I think it would be interesting to do a POC for the second point and simply use our LogStreamBatchReader.
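
A rough sketch of what such a POC could do, i.e. grouping consecutive commands of the same process instance into one processing batch, is shown below; the reader interface is a placeholder here, not the actual LogStreamBatchReader API.

import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative sketch of option 2; grouping logic only, placeholder types. */
final class SameInstanceBatchingSketch {

  interface LoggedCommand {
    long processInstanceKey();
  }

  interface CommandReader {
    boolean hasNext();

    LoggedCommand peek();

    LoggedCommand next();
  }

  /** Reads the next run of commands that share one process instance key. */
  static Deque<LoggedCommand> readNextBatch(CommandReader reader) {
    final Deque<LoggedCommand> batch = new ArrayDeque<>();
    if (!reader.hasNext()) {
      return batch;
    }
    final long instanceKey = reader.peek().processInstanceKey();
    while (reader.hasNext() && reader.peek().processInstanceKey() == instanceKey) {
      batch.add(reader.next());
    }
    return batch;
  }
}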

@Zelldon added the planning/discuss label on Jul 11, 2023
@Zelldon removed their assignment on Jul 11, 2023
@megglos
Contributor

megglos commented Jul 14, 2023

ZDP-Planning:

  • effectively batch processing has no effect on the activation commands
  • moving to the backlog as upcoming, to improve performance for such scenarios
  • severity high as a single process instance impacts performance of a whole partition

@megglos
Contributor

megglos commented Aug 14, 2023

ZDP-Planning:

  • affecting mainly multi-instance use cases
  • the second potential solution would be purely within the ZDP domain
  • to be picked up eventually next time

@megglos removed the planning/discuss label on Aug 14, 2023
@romansmirnov added the component/zeebe label on Mar 5, 2024