Slow processing of large multi instance collection #12946

Open

Zelldon opened this issue Jun 2, 2023 · 8 comments
Labels
component/engine, component/stream-platform, component/zeebe (Related to the Zeebe component/team), kind/bug (Categorizes an issue or PR as a bug), severity/high (Marks a bug as having a noticeable impact on the user with no known workaround)

Comments

@Zelldon
Member

Zelldon commented Jun 2, 2023

Describe the bug

As discovered on the recent chaos day, the current batching of large multi-instance collections slows processing down considerably. It worked in general, which is great, but it was quite slow.

A large input collection can end up consuming the complete processing capacity of a partition.

In the current state, it is not clear where the issue lies, but it seems that the batch processing is limited to just 2-4 commands per batch, which slows down the creation of instances.

We should investigate further whether this issue is related to batch processing itself or to the new instance batching.

To Reproduce

Use https://github.com/camunda-cloud/game-day-cookbook/tree/main/mains/large-multi-instance in order to reproduce this

Expected behavior

I would expect instance creation to be faster and to make more use of batch processing.

For further details take a look at the chaos day summary.

@Zelldon added the kind/bug, component/engine and component/stream-platform labels on Jun 2, 2023
@megglos added the severity/high label on Jun 8, 2023
@megglos
Contributor

megglos commented Jun 8, 2023

ZDP-Triage:

  • it's not clear yet at what number of multi-instances the performance degradation becomes noticeable
  • degrades performance of the whole partition (took 10m for creating 20k instances)
  • cancellation might cause the same performance hit
  • there is no workaround, it eventually completes but requires a considerable amount of time
  • might be caused by a bug in batch processing => we should validate this first to decide on how to move on
  • this affects 8.2 as well, since the fix for multi-instance was provided with that version

@megglos to link the recent fix that surfaced this behavior

@remcowesterhoud
Contributor

link the recent fix that surfaced this behavior

It is #12692 :)

@korthout
Member

ZPA triage:

  • a lower batch size may improve the performance, similar to message TTL checking / due-date timer checking
  • we want to have a look at limiting the batch size with a configurable number
  • at least it works now, but the degraded performance is not great
  • as a workaround, modeling batching in the process (the previous workaround) is still a possibility to avoid the degraded performance
  • priority is later

@megglos
Contributor

megglos commented Jun 29, 2023

ZDP-Planning:

  • it results in a bad user experience, as internal commands cause a high load leading to backpressure of user requests even if user requests come in at a very low rate (this has already affected a customer)
  • the available workaround of lowering the batch size has only a minimal effect
  • might be caused by a bug in batch processing => we should validate this first to decide on how to move on => @Zelldon let's validate this next

@Zelldon
Member Author

Zelldon commented Jul 3, 2023

Investigation 🕵️

I had a deeper look at this issue. I think I now understand why this happens.

TL;DR: It is indeed related to batch processing, but also to its combination with multi-instance elements and the new batching approach in the engine.

Details

I played a bit with a unit test we have and created a large input collection: IntStream.range(1, 100_000).boxed().toList();

Example process:

Bpmn.createExecutableProcess(PROCESS_ID)
        .startEvent()
        .serviceTask(
            ELEMENT_ID,
            t -> t.zeebeJobType(jobType).multiInstance(INPUT_VARIABLE_BUILDER.andThen(builder)))
        .endEvent()
        .done();
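
For context, outside the unit test the same scenario can be driven against a running broker via the Zeebe Java client, roughly as in the sketch below. The process id "large-multi-instance" and the variable name "items" are assumptions made for this illustration; they are not taken from the test or the game-day cookbook.

// Illustrative reproduction sketch; process id and variable name are assumed.
import io.camunda.zeebe.client.ZeebeClient;
import java.util.List;
import java.util.Map;
import java.util.stream.IntStream;

public class LargeMultiInstanceRepro {
  public static void main(String[] args) {
    final List<Integer> items = IntStream.range(1, 100_000).boxed().toList();
    try (final ZeebeClient client = ZeebeClient.newClientBuilder().usePlaintext().build()) {
      client
          .newCreateInstanceCommand()
          .bpmnProcessId("large-multi-instance") // assumed process id
          .latestVersion()
          .variables(Map.of("items", items)) // assumed input collection variable
          .send()
          .join();
    }
  }
}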

Scenario

ServiceTask with a multi-instance marker

In the case of a parallel multi-instance we write a ProcessInstanceBatchRecord. This will be processed as part of the batch.

The corresponding ActivateProcessInstanceBatchProcessor will add ACTIVATE commands to the batch, until the batch is full. Here the first issue appears.

Let's say we have the following batch now: [..., PI Batch Activate (index: 100000), Activate Element, Activate Element, Activate Element, ..., PI Batch Activate (index: 89021)]. Due to the logic in the ActivateProcessInstanceBatchProcessor we fill the batch almost completely; at the end we add another ProcessInstanceBatchRecord with the remaining elements.
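
To make that concrete, here is a heavily simplified sketch of the fill-then-defer behaviour; all names and types are illustrative placeholders, not the actual ActivateProcessInstanceBatchProcessor code.

import java.util.ArrayList;
import java.util.List;

/** Illustrative only; the real processor differs in many details. */
final class BatchActivationSketch {

  record Command(String type, long index) {}

  /**
   * Appends ACTIVATE commands for the children of the multi-instance body until the
   * result batch is nearly full, then defers the rest by appending another
   * ProcessInstanceBatch-style command that carries the remaining index.
   */
  static List<Command> processBatchCommand(long remainingIndex, int maxBatchSize) {
    final List<Command> resultBatch = new ArrayList<>();
    long index = remainingIndex;
    while (index > 0 && resultBatch.size() < maxBatchSize - 1) {
      resultBatch.add(new Command("ACTIVATE_ELEMENT", index));
      index--;
    }
    if (index > 0) {
      // Batch is full: write a follow-up batch command for the remainder.
      resultBatch.add(new Command("PROCESS_INSTANCE_BATCH_ACTIVATE", index));
    }
    return resultBatch;
  }

  public static void main(String[] args) {
    final List<Command> batch = processBatchCommand(100_000, 100);
    System.out.println(batch.size() + " commands, last: " + batch.get(batch.size() - 1));
  }
}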

In the ProcessingStateMachine, where batch processing happens, we will try to process the next command (after the first ProcessInstanceBatchRecord). This is the first activate-element command (for the multi-instance service task), and it might succeed for the first elements. At some point we will fail with ExceedBatchSize, roll back, and retry only up to the position where it last failed, so that at least that smaller batch gets written.

This means we will always have that retry mechanism triggered.
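
Roughly sketched, that retry behaviour looks like the following; the names are placeholders and the real ProcessingStateMachine is considerably more involved.

import java.util.List;

/** Illustrative sketch of the rollback-and-retry behaviour; not the actual engine code. */
final class BatchRetrySketch {

  interface TypedCommand {}

  static final class ExceededBatchSizeException extends RuntimeException {}

  interface Processing {
    /** Processes one command and buffers its follow-up records; may overflow the batch. */
    void process(TypedCommand command) throws ExceededBatchSizeException;

    /** Discards everything buffered for the current attempt. */
    void rollback();

    /** Writes the buffered result batch to the log. */
    void writeBatch();
  }

  /** Processes as many commands as fit; on overflow, rolls back and retries up to the failing position. */
  static void processWithRetry(List<TypedCommand> commands, Processing processing) {
    int limit = commands.size();
    while (limit > 0) {
      int current = 0;
      try {
        while (current < limit) {
          processing.process(commands.get(current));
          current++;
        }
        processing.writeBatch();
        return;
      } catch (ExceededBatchSizeException e) {
        // Roll back and retry, but only up to the command that overflowed the batch,
        // so that at least the commands before it get written as one batch.
        processing.rollback();
        limit = current;
      }
    }
  }
}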

The most problematic part comes next.

After we have completed that previous batch, we read one command at a time, which in this case is the ACTIVATE element command. The activated element is likely a wait state (potentially a service task), which means there will be no more follow-up commands (except the job create). We do this for all the commands that were written in between these ProcessInstanceBatchRecords we have seen above.

This means the problem here is that we are not batching persisted commands but processing them one by one, which slows down the process instance execution enormously. This also explains why the partition seems to be blocked and no other instance can make progress, since such big batches are written to the log.

Potential solutions

  1. Batch processing could be more clever and detect earlier that the batch is already full enough, to avoid the retry.
  2. Read a batch of commands when they belong to the same process instance.
  3. Command batches could be more clever: instead of writing 1000+x commands, just write one that says to activate 1000 elements (a rough sketch of such a record follows below). The event can contain all keys, but this might only defer the problem, since we still need to write the next commands, like creating a job. But maybe this can be combined as well; it would be interesting to investigate.
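
Purely as an illustration of the third idea, a single "activate N children" record could have a shape like the following; no such record exists in Zeebe today and all field names are made up.

import java.util.List;

/** Hypothetical record value for option 3; illustrative only. */
record BulkActivateChildrenValue(
    long multiInstanceBodyKey,      // the multi-instance body whose children get activated
    String childElementId,          // BPMN element id of the inner activity
    int numberOfChildren,           // how many children this single command activates
    List<Long> activatedChildKeys   // child keys, carried on the resulting event
) {}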

@Zelldon
Member Author

Zelldon commented Jul 3, 2023

I think it would be interesting to do a POC for the second point and simply use our LogStreamBatchReader.
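
A rough sketch of what such a POC could do, i.e. grouping consecutive commands of the same process instance into one processing batch, is shown below; the reader interface is a placeholder here, not the actual LogStreamBatchReader API.

import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative sketch of option 2; grouping logic only, placeholder types. */
final class SameInstanceBatchingSketch {

  interface LoggedCommand {
    long processInstanceKey();
  }

  interface CommandReader {
    boolean hasNext();

    LoggedCommand peek();

    LoggedCommand next();
  }

  /** Reads the next run of commands that share one process instance key. */
  static Deque<LoggedCommand> readNextBatch(CommandReader reader) {
    final Deque<LoggedCommand> batch = new ArrayDeque<>();
    if (!reader.hasNext()) {
      return batch;
    }
    final long instanceKey = reader.peek().processInstanceKey();
    while (reader.hasNext() && reader.peek().processInstanceKey() == instanceKey) {
      batch.add(reader.next());
    }
    return batch;
  }
}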

@Zelldon added the planning/discuss label on Jul 11, 2023
@Zelldon removed their assignment on Jul 11, 2023
@megglos
Contributor

megglos commented Jul 14, 2023

ZDP-Planning:

  • effectively batch processing has no effect on the activation commands
  • moving to the backlog as upcoming, to improve performance for such scenarios
  • severity high as a single process instance impacts performance of a whole partition

@megglos
Contributor

megglos commented Aug 14, 2023

ZDP-Planning:

  • affecting mainly multi-instance use cases
  • the second potential solution would be purely within the ZDP domain
  • to be picked up eventually next time

@megglos removed the planning/discuss label on Aug 14, 2023
@romansmirnov added the component/zeebe label on Mar 5, 2024