
Multi-instance parallel subprocess times out and never completes for large input collections #8687

Open
lfourky opened this issue Jan 30, 2022 · 31 comments
Labels
  • area/reliability: Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected)
  • component/clients
  • component/zeebe: Related to the Zeebe component/team
  • kind/bug: Categorizes an issue or PR as a bug
  • scope/clients-go: Marks an issue or PR to appear in the Go client section of the changelog
  • severity/mid: Marks a bug as having a noticeable impact but with a known workaround

Comments


lfourky commented Jan 30, 2022

Describe the bug
When running a multi-instance parallel subprocess, I've experienced stalling: for example, with an input count of 5000 elements, Zeebe seems to process a few of them and then never processes the rest.

I think this can sometimes also stall future process instances for the same process ID.

Hint: when run as a sequential subprocess, it runs fine.

To Reproduce
https://github.com/lfourky/zeebe-multiinstance-issue
I've created a repository to demonstrate this issue. The BPMN is quite simple, and the example uses the Go client.
Assuming you've got zbctl installed, you can run:

   make deploy
   make run

to reproduce this locally.

Expected behavior
Expected the process to complete within a reasonable time.

Log/Stacktrace

Full Stacktrace

12:11:38.887 [] [grpc-default-worker-ELG-5-6] WARN  io.grpc.netty.NettyServerHandler - Stream Error
io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place
        at io.netty.handler.codec.http2.Http2Exception.streamError(Http2Exception.java:172) ~[netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$FlowState.cancel(DefaultHttp2RemoteFlowController.java:481) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$1.onStreamClosed(DefaultHttp2RemoteFlowController.java:105) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.DefaultHttp2Connection.notifyClosed(DefaultHttp2Connection.java:357) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.removeFromActiveStreams(DefaultHttp2Connection.java:1007) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.deactivate(DefaultHttp2Connection.java:963) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultStream.close(DefaultHttp2Connection.java:515) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultStream.close(DefaultHttp2Connection.java:521) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.Http2ConnectionHandler.closeStream(Http2ConnectionHandler.java:613) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onRstStreamRead(DefaultHttp2ConnectionDecoder.java:444) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onRstStreamRead(Http2InboundFrameLogger.java:80) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readRstStreamFrame(DefaultHttp2FrameReader.java:509) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:259) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:159) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:173) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:378) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795) [netty-transport-classes-epoll-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480) [netty-transport-classes-epoll-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) [netty-transport-classes-epoll-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [netty-common-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.73.Final.jar:4.1.73.Final]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.73.Final.jar:4.1.73.Final]
        at java.lang.Thread.run(Unknown Source) [?:?]



12:11:49.564 [io.camunda.zeebe.gateway.impl.broker.BrokerRequestManager] [Broker-0-zb-actors-0] WARN  io.camunda.zeebe.gateway - Failed to activate jobs for type second_data_processor from partition 1
java.util.concurrent.TimeoutException: Request ProtocolRequest{id=5022, subject=command-api-1, sender=0.0.0.0:26502, payload=byte[]{length=136, hash=-1458081928}} to 0.0.0.0:26501 timed out in PT15S
        at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$sendAndReceive$4(NettyMessagingService.java:230) ~[zeebe-atomix-cluster-1.4.0-SNAPSHOT.jar:1.4.0-SNAPSHOT]
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
        at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.73.Final.jar:4.1.73.Final]
        at java.lang.Thread.run(Unknown Source) ~[?:?]

Environment:

  • OS: Ubuntu 20 (WSL2)
  • Zeebe Version: latest develop branch 5932521
  • Configuration: standalone broker with default settings; it behaves the same way when deployed to Kubernetes (with a somewhat more customized deployment).
lfourky added the kind/bug label Jan 30, 2022
Zelldon (Member) commented Jan 30, 2022

Hey @lfourky thanks for reporting.

It is very likely that your process instances are blacklisted due to the big input collection. Could you verify that in your broker log? Do you see any error logs? This is related to #2890 and #5221.

npepinpe added the blocker/info label Jan 31, 2022
lfourky (Author) commented Jan 31, 2022

When starting the process with 5,000 input collection elements, I do not see the blacklist log. I keep seeing these timeout errors, which repeat over and over:

Stacktrace

18:18:56.805 [io.camunda.zeebe.gateway.impl.broker.BrokerRequestManager] [Broker-0-zb-actors-0] WARN  io.camunda.zeebe.gateway - Failed to activate jobs for type second_data_processor from partition 1
java.util.concurrent.TimeoutException: Request ProtocolRequest{id=5015, subject=command-api-1, sender=0.0.0.0:26502, payload=byte[]{length=136, hash=-1458081928}} to 0.0.0.0:26501 timed out in PT14.962S
	at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$sendAndReceive$4(NettyMessagingService.java:230) ~[zeebe-atomix-cluster-1.4.0-20220128.182414-45.jar:1.4.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.73.Final.jar:4.1.73.Final]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]
18:18:56.943 [io.camunda.zeebe.gateway.impl.broker.BrokerRequestManager] [Broker-0-zb-actors-1] WARN  io.camunda.zeebe.gateway - Failed to activate jobs for type second_data_processor from partition 1
java.util.concurrent.TimeoutException: Request ProtocolRequest{id=5016, subject=command-api-1, sender=0.0.0.0:26502, payload=byte[]{length=136, hash=-677107123}} to 0.0.0.0:26501 timed out in PT15S
	at io.atomix.cluster.messaging.impl.NettyMessagingService.lambda$sendAndReceive$4(NettyMessagingService.java:230) ~[zeebe-atomix-cluster-1.4.0-20220128.182414-45.jar:1.4.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.73.Final.jar:4.1.73.Final]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]
18:18:57.055 [] [grpc-nio-worker-ELG-5-3] WARN  io.grpc.netty.NettyServerHandler - Stream Error
io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place
	at io.netty.handler.codec.http2.Http2Exception.streamError(Http2Exception.java:172) ~[netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$FlowState.cancel(DefaultHttp2RemoteFlowController.java:481) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$1.onStreamClosed(DefaultHttp2RemoteFlowController.java:105) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection.notifyClosed(DefaultHttp2Connection.java:357) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.removeFromActiveStreams(DefaultHttp2Connection.java:1007) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.deactivate(DefaultHttp2Connection.java:963) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultStream.close(DefaultHttp2Connection.java:515) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultStream.close(DefaultHttp2Connection.java:521) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.Http2ConnectionHandler.closeStream(Http2ConnectionHandler.java:613) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onRstStreamRead(DefaultHttp2ConnectionDecoder.java:444) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onRstStreamRead(Http2InboundFrameLogger.java:80) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readRstStreamFrame(DefaultHttp2FrameReader.java:509) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:259) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:159) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:173) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:378) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at java.lang.Thread.run(Thread.java:833) [?:?]
18:18:57.261 [] [grpc-nio-worker-ELG-5-3] WARN  io.grpc.netty.NettyServerHandler - Stream Error


When starting the process with 20,000 input collection elements, I do see the blacklist log:

Stacktrace

23:08:15.440 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-0] ERROR io.camunda.zeebe.processor - Expected to write one or more follow up events for event 'LoggedEvent [type=0, version=16384, streamId=1, position=6615, key=2251799813687535, timestamp=1643666895256, sourceEventPosition=6610] RecordMetadata{recordType=COMMAND, intentValue=255, intent=ACTIVATE_ELEMENT, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=PROCESS_INSTANCE, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=1.4.0}' without errors, but exception was thrown.
java.lang.IllegalArgumentException: Expected to claim segment of size 6500888, but can't claim more than 4194304 bytes.
	at io.camunda.zeebe.dispatcher.Dispatcher.offer(Dispatcher.java:194) ~[zeebe-dispatcher-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.dispatcher.Dispatcher.claimFragmentBatch(Dispatcher.java:164) ~[zeebe-dispatcher-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.logstreams.impl.log.LogStreamBatchWriterImpl.claimBatchForEvents(LogStreamBatchWriterImpl.java:222) ~[zeebe-logstreams-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.logstreams.impl.log.LogStreamBatchWriterImpl.tryWrite(LogStreamBatchWriterImpl.java:199) ~[zeebe-logstreams-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.processing.streamprocessor.writers.TypedStreamWriterImpl.flush(TypedStreamWriterImpl.java:106) ~[zeebe-workflow-engine-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.processing.bpmn.behavior.TypedStreamWriterProxy.flush(TypedStreamWriterProxy.java:59) ~[zeebe-workflow-engine-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.lambda$writeEvent$7(ProcessingStateMachine.java:374) ~[zeebe-workflow-engine-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.retry.ActorRetryMechanism.run(ActorRetryMechanism.java:36) ~[zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.retry.AbortableRetryStrategy.run(AbortableRetryStrategy.java:44) ~[zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:79) [zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:44) [zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:97) [zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:80) [zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:189) [zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
23:08:15.461 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-0] ERROR io.camunda.zeebe.processor - Expected to process event 'TypedEventImpl{metadata=RecordMetadata{recordType=COMMAND, intentValue=255, intent=ACTIVATE_ELEMENT, requestStreamId=-2147483648, requestId=-1, protocolVersion=3, valueType=PROCESS_INSTANCE, rejectionType=NULL_VAL, rejectionReason=, brokerVersion=1.4.0}, value={"bpmnProcessId":"my_process","version":1,"processDefinitionKey":2251799813685251,"processInstanceKey":2251799813687523,"elementId":"Activity_0kpdzv0","flowScopeKey":2251799813687523,"bpmnElementType":"MULTI_INSTANCE_BODY","parentProcessInstanceKey":-1,"parentElementInstanceKey":-1}}' without errors, but exception occurred with message 'Expected to claim segment of size 6500888, but can't claim more than 4194304 bytes.'.
java.lang.IllegalArgumentException: Expected to claim segment of size 6500888, but can't claim more than 4194304 bytes.
	at io.camunda.zeebe.dispatcher.Dispatcher.offer(Dispatcher.java:194) ~[zeebe-dispatcher-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.dispatcher.Dispatcher.claimFragmentBatch(Dispatcher.java:164) ~[zeebe-dispatcher-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.logstreams.impl.log.LogStreamBatchWriterImpl.claimBatchForEvents(LogStreamBatchWriterImpl.java:222) ~[zeebe-logstreams-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.logstreams.impl.log.LogStreamBatchWriterImpl.tryWrite(LogStreamBatchWriterImpl.java:199) ~[zeebe-logstreams-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.processing.streamprocessor.writers.TypedStreamWriterImpl.flush(TypedStreamWriterImpl.java:106) ~[zeebe-workflow-engine-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.processing.bpmn.behavior.TypedStreamWriterProxy.flush(TypedStreamWriterProxy.java:59) ~[zeebe-workflow-engine-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.engine.processing.streamprocessor.ProcessingStateMachine.lambda$writeEvent$7(ProcessingStateMachine.java:374) ~[zeebe-workflow-engine-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.retry.ActorRetryMechanism.run(ActorRetryMechanism.java:36) ~[zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.retry.AbortableRetryStrategy.run(AbortableRetryStrategy.java:44) ~[zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorJob.invoke(ActorJob.java:79) [zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorJob.execute(ActorJob.java:44) [zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:97) [zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.doWork(ActorThread.java:80) [zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
	at io.camunda.zeebe.util.sched.ActorThread.run(ActorThread.java:189) [zeebe-util-1.4.0-20220131.134754-49.jar:1.4.0-SNAPSHOT]
23:08:15.462 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-0] WARN  io.camunda.zeebe.broker.logstreams - Blacklist process instance 2251799813687523, due to previous errors.

npepinpe (Member) commented Feb 1, 2022

Let's try to reproduce it and timebox the root-causing, as we normally would in our medic process.

oleschoenburg self-assigned this Feb 1, 2022
oleschoenburg (Member) commented:

@lfourky Thanks to https://github.com/lfourky/zeebe-multiinstance-issue I was able to reproduce the issue locally. It looks like FEEL is taking a long time to evaluate the input collection expression.

lfourky changed the title from "Multi-instance parallel subprocess times out and never completes when for large inputs" to "Multi-instance parallel subprocess times out and never completes for large inputs" Feb 1, 2022
lfourky changed the title from "Multi-instance parallel subprocess times out and never completes for large inputs" to "Multi-instance parallel subprocess times out and never completes for large input collections" Feb 1, 2022
oleschoenburg (Member) commented:

I've looked into this a bit more:
MultiInstanceBodyProcessor.onChildActivating is called for each child instance (so in this example 5000 times). Each time, we call ExpressionProcessor.evaluateArrayExpression, which accounts for the majority of the time spent.

I tried to improve the performance of evaluateArrayExpression (e.g. by giving size hints while building up the collections), but couldn't make enough of an impact to solve this issue.

To me this looks like a limitation of how MultiInstanceBodyProcessor is implemented: for each child instance that is created, it has to evaluate the entire input collection.
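
To illustrate why this scales badly, here is a schematic sketch (plain Go, not Zeebe code) of the cost of re-evaluating the full collection once per child instance:

package main

import "fmt"

// evaluateInputCollection stands in for ExpressionProcessor.evaluateArrayExpression:
// its cost is proportional to the size of the collection.
func evaluateInputCollection(n int) []int {
	coll := make([]int, n)
	for i := range coll {
		coll[i] = i
	}
	return coll
}

func main() {
	n := 5000
	evaluatedElements := 0
	for child := 0; child < n; child++ {
		coll := evaluateInputCollection(n) // re-evaluated for every child activation
		_ = coll[child]                    // yet only one element is actually needed per child
		evaluatedElements += n
	}
	// 5000 children * 5000 elements = 25,000,000 element evaluations in total.
	fmt.Println("total elements evaluated:", evaluatedElements)
}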

/cc @npepinpe

oleschoenburg removed their assignment Feb 1, 2022
lfourky (Author) commented Feb 1, 2022

Could this be a related issue? #8622

npepinpe (Member) commented Feb 2, 2022

That one is a bit different; as far as I remember, it had to do with jobs being activated but not making it to the worker, then timing out.

/cc @saig0 / @korthout - is there any low-hanging fruit to help improve the evaluation of the expression? If not, then I'd classify this as a current limitation of multi-instance (for now anyway). Is there any workaround we can suggest for users running into this?

npepinpe added the area/performance, scope/broker, and severity/mid labels and removed the blocker/info label Feb 2, 2022
korthout (Member) commented Feb 2, 2022

I briefly discussed this with @saig0. We wonder what the actual numbers are for the performance of evaluateArrayExpression, and whether this is really the reason the process can't make progress.

If it were merely slow, processing would simply take additional time to finish, but it shouldn't block anything. However, the stacktrace shows that the activate-job-batch request times out after 15 seconds. @oleschoenburg Did you already try increasing the request timeout or decreasing the job timeout?

Finally, if slowness is the reason, we have some ideas for improving the performance of this specific expression evaluation. But again, we should measure first before making changes.

One idea: currently the engine evaluates the expression and then transforms each element in the array to MessagePack. However, on activation of a child, only one specific element is needed from this list (the one at the loop counter's index). Perhaps, if that part is taking too much time, we could change how we retrieve this one item from the expression result.

oleschoenburg (Member) commented Feb 2, 2022

I've tried playing around with the timeouts in the Go worker:

- client.NewJobWorker().JobType("second_data_processor").Handler(secondParallelMultiInstance).Open()
+ client.NewJobWorker().JobType("second_data_processor").Handler(secondParallelMultiInstance).RequestTimeout(60 * time.Second).Timeout(5 * time.Second).PollInterval(1 * time.Second).Concurrency(1).Open()

but that did not improve the situation at all: no jobs of the multi-instance were activated (or at least they didn't reach the worker). I waited for ~10 minutes.
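
For reference, here is a minimal, self-contained version of that worker configuration. This is a sketch assuming the pre-v8 Go client import paths used by the reproduction repository; the handler body is illustrative only:

package main

import (
	"context"
	"log"
	"time"

	"github.com/camunda-cloud/zeebe/clients/go/pkg/entities"
	"github.com/camunda-cloud/zeebe/clients/go/pkg/worker"
	"github.com/camunda-cloud/zeebe/clients/go/pkg/zbc"
)

func main() {
	// Connect to a local broker without TLS, as in the reproduction setup.
	client, err := zbc.NewClient(&zbc.ClientConfig{
		GatewayAddress:         "localhost:26500",
		UsePlaintextConnection: true,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Same knobs as in the diff above: longer request timeout, shorter job
	// timeout, one-second polling, and a single job handled at a time.
	w := client.NewJobWorker().
		JobType("second_data_processor").
		Handler(func(jobClient worker.JobClient, job entities.Job) {
			log.Println("second job handler called, key:", job.GetKey())
			if _, err := jobClient.NewCompleteJobCommand().JobKey(job.GetKey()).Send(context.Background()); err != nil {
				log.Println("failed to complete job:", err)
			}
		}).
		RequestTimeout(60 * time.Second).
		Timeout(5 * time.Second).
		PollInterval(1 * time.Second).
		Concurrency(1).
		Open()
	defer w.Close()

	w.AwaitClose()
}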

So as a sanity check, I implemented the worker with the Java client:

import io.camunda.zeebe.client.ZeebeClient;
import io.camunda.zeebe.client.api.response.ActivatedJob;
import io.camunda.zeebe.client.api.worker.JobClient;

import java.time.Duration;

public class App {
  public static void main(String[] args) throws InterruptedException {
    try (final var client =
        ZeebeClient.newClientBuilder().gatewayAddress("localhost:26500").usePlaintext().build()) {
      try (final var worker =
          client
              .newWorker()
              .jobType("second_data_processor")
              .handler(
                  (JobClient jobClient, ActivatedJob job) -> {
                    System.out.println("Handling job: " + job.getKey());
                    client.newCompleteCommand(job.getKey()).send().join();
                  })
              .timeout(Duration.ofSeconds(5))
              .requestTimeout(Duration.ofSeconds(60))
              .maxJobsActive(1)
              .pollInterval(Duration.ofSeconds(1))
              .open()) {
        System.out.println("Job worker started");
        Thread.currentThread().join();
      }
    }
  }
}

I'm using the same settings for timeout, requestTimeout, concurrency and pollInterval. Now we are finally activating jobs!
It still takes a good while for the first job to appear (which might be due to #8622), but after the first one we make steady progress and finish all jobs.

So this leads me to believe that there is either a bug in the Go worker implementation, a bug in the Go client, or a subtle difference in the (default) configuration used by the Java and Go clients.

saig0 (Member) commented Feb 3, 2022

@oleschoenburg Can you share some numbers? It would be interesting to see how long it takes to complete the process instance, and also for different configurations. This could guide users in the right direction.

We could tweak the job worker a bit by increasing maxJobsActive (e.g. to 100 jobs; we expect a lot of them).

npepinpe (Member) commented Feb 3, 2022

We had some issues in the C# client with job activation, which we only discovered by using Wireshark: we could confirm the gateway was sending a response with the jobs, but somehow the job handlers on the C# side never observed them. This could be one lead here to figure out whether the bug is on the client or the gateway side.

oleschoenburg (Member) commented:

@saig0 Increasing maxJobsActive breaks the Java worker: no jobs are activated, at least from the worker's point of view. I've only tried 1 and 100 so far, so there is probably a middle ground that works.
Activating the first of the 5000 jobs happens after 2 minutes.

oleschoenburg (Member) commented:

When using maxJobsActive=100 with the Java worker, I'm seeing lots of gRPC debug logs indicating that the gateway wants to send data to the worker but can't because the stream is closed.

Worker logs:

Feb 03, 2022 9:35:52 AM io.grpc.internal.AbstractClientStream$TransportState inboundDataReceived
INFO: Received data on closed stream

Zeebe logs:

2022-02-03 09:37:13.181 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] INBOUND HEADERS: streamId=17 headers=GrpcHttp2RequestHeaders[:path: /gateway_protocol.Gateway/ActivateJobs, :authority: localhost:26500, :method: POST, :scheme: http, te: trailers, content-type: application/grpc, user-agent: zeebe-client-java/1.3.2 grpc-java-netty/1.43.0, grpc-accept-encoding: gzip, grpc-timeout: 69999686u] streamDependency=0 weight=16 exclusive=false padding=0 endStream=false
2022-02-03 09:37:13.182 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] INBOUND DATA: streamId=17 padding=0 endStream=true length=46 bytes=00000000290a157365636f6e645f646174615f70726f636573736f72120764656661756c74188827206430e0d403
2022-02-03 09:37:13.182 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND PING: ack=false bytes=1234
2022-02-03 09:37:13.184 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] INBOUND PING: ack=true bytes=1234
2022-02-03 09:37:13.246 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND HEADERS: streamId=17 headers=GrpcHttp2OutboundHeaders[:status: 200, content-type: application/grpc, grpc-encoding: identity, grpc-accept-encoding: gzip] padding=0 endStream=false
2022-02-03 09:37:13.256 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=65536 bytes=00004a44ee0aef801b08989382808080800412157365636f6e645f646174615f70726f636573736f72188680808080808004220a6d795f70726f636573732801...
2022-02-03 09:37:13.256 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=3a226e616d655f343934222c22696e646578223a3439342c22657870697265734174223a22323032322d30322d30335430393a34333a32322e30303037353437...
2022-02-03 09:37:13.257 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49152 bytes=34353332372b30313a3030227d2c7b226e616d65223a226e616d655f31303930222c22696e646578223a313039302c22657870697265734174223a2232303232...
2022-02-03 09:37:13.257 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=2d30335431303a30333a30332e3030353532353935322b30313a3030227d2c7b226e616d65223a226e616d655f31363736222c22696e646578223a313637362c...
2022-02-03 09:37:13.258 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49152 bytes=3235392c22657870697265734174223a22323032322d30322d30335431303a31323a34372e3030373031333136332b30313a3030227d2c7b226e616d65223a22...
2022-02-03 09:37:13.258 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=22657870697265734174223a22323032322d30322d30335431303a32323a33332e3030383337303533382b30313a3030227d2c7b226e616d65223a226e616d65...
2022-02-03 09:37:13.259 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49152 bytes=362e3030393734343637352b30313a3030227d2c7b226e616d65223a226e616d655f33343239222c22696e646578223a333432392c2265787069726573417422...
2022-02-03 09:37:13.259 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=30313a3030227d2c7b226e616d65223a226e616d655f34303135222c22696e646578223a343031352c22657870697265734174223a22323032322d30322d3033...
2022-02-03 09:37:13.260 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49152 bytes=70697265734174223a22323032322d30322d30335431303a35313a34362e3031323630353034362b30313a3030227d2c7b226e616d65223a226e616d655f3435...
2022-02-03 09:37:13.261 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=6c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e...
2022-02-03 09:37:13.261 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49152 bytes=70697265734174223a22323032322d30322d30335430393a34333a30382e3030303731353137392b30313a3030227d2c7b226e616d65223a226e616d655f3438...
2022-02-03 09:37:13.262 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=3331372b30313a3030227d2c7b226e616d65223a226e616d655f31303739222c22696e646578223a313037392c22657870697265734174223a22323032322d30...
2022-02-03 09:37:13.263 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49152 bytes=6578223a313636322c22657870697265734174223a22323032322d30322d30335431303a30323a35302e3030353439333936342b30313a3030227d2c7b226e61...
2022-02-03 09:37:13.263 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=22657870697265734174223a22323032322d30322d30335431303a31323a33362e3030363938353932352b30313a3030227d2c7b226e616d65223a226e616d65...
2022-02-03 09:37:13.264 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49152 bytes=226e616d65223a226e616d655f32383332222c22696e646578223a323833322c22657870697265734174223a22323032322d30322d30335431303a32323a3230...
2022-02-03 09:37:13.264 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=393732313833372b30313a3030227d2c7b226e616d65223a226e616d655f33343138222c22696e646578223a333431382c22657870697265734174223a223230...
2022-02-03 09:37:13.265 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49152 bytes=3032322d30322d30335431303a34313a34392e3031313231333330392b30313a3030227d2c7b226e616d65223a226e616d655f34303032222c22696e64657822...
2022-02-03 09:37:13.266 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=734174223a22323032322d30322d30335431303a35313a33352e3031323538313738392b30313a3030227d2c7b226e616d65223a226e616d655f34353838222c...
2022-02-03 09:37:13.267 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49152 bytes=6c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e...
2022-02-03 09:37:13.268 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=3639222c22696e646578223a3436392c22657870697265734174223a22323032322d30322d30335430393a34323a35372e3030303638353633362b30313a3030...
2022-02-03 09:37:13.269 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49152 bytes=4174223a22323032322d30322d30335430393a35323a35332e3030323537383133392b30313a3030227d2c7b226e616d65223a226e616d655f31303636222c22...
2022-02-03 09:37:13.269 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=3a313635312c22657870697265734174223a22323032322d30322d30335431303a30323a33392e3030353436383638322b30313a3030227d2c7b226e616d6522...
2022-02-03 09:37:13.271 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49152 bytes=616d65223a226e616d655f32323335222c22696e646578223a323233352c22657870697265734174223a22323032322d30322d30335431303a31323a32332e30...
2022-02-03 09:37:13.272 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=65223a226e616d655f32383231222c22696e646578223a323832312c22657870697265734174223a22323032322d30322d30335431303a32323a30392e303038...
2022-02-03 09:37:13.273 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49152 bytes=65734174223a22323032322d30322d30335431303a33313a35322e3030393639333230322b30313a3030227d2c7b226e616d65223a226e616d655f3334303522...
2022-02-03 09:37:13.273 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=322d30335431303a34313a33382e3031313138393335332b30313a3030227d2c7b226e616d65223a226e616d655f33393931222c22696e646578223a33393931...
2022-02-03 09:37:13.274 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49152 bytes=655f34353734222c22696e646578223a343537342c22657870697265734174223a22323032322d30322d30335431303a35313a32322e3031323535343036322b...
2022-02-03 09:37:13.275 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e756c6c2c6e75...
2022-02-03 09:37:13.275 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49152 bytes=223a22323032322d30322d30335430393a34323a34332e3030303634323638342b30313a3030227d2c7b226e616d65223a226e616d655f343536222c22696e64...
2022-02-03 09:37:13.276 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=48951 bytes=323032322d30322d30335430393a35323a34322e3030323534303938342b30313a3030227d2c7b226e616d65223a226e616d655f31303535222c22696e646578...
2022-02-03 09:37:13.277 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=47142 bytes=7d2c7b226e616d65223a226e616d655f31363338222c22696e646578223a313633382c22657870697265734174223a22323032322d30322d30335431303a3032...
2022-02-03 09:37:13.289 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] INBOUND PING: ack=false bytes=1234
2022-02-03 09:37:13.289 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND PING: ack=true bytes=1234
2022-02-03 09:37:13.290 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] INBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=1015808
2022-02-03 09:37:13.290 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=65423 bytes=65223a226e616d655f32323030222c22696e646578223a323230302c22657870697265734174223a22323032322d30322d30335431303a31313a34382e303036...
2022-02-03 09:37:13.290 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49064 bytes=65223a226e616d655f32393830222c22696e646578223a323938302c22657870697265734174223a22323032322d30322d30335431303a32343a34382e303038...
2022-02-03 09:37:13.290 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49039 bytes=2b30313a3030227d2c7b226e616d65223a226e616d655f33353635222c22696e646578223a333536352c22657870697265734174223a22323032322d30322d30...
2022-02-03 09:37:13.290 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49064 bytes=30335431303a34343a31372e3031313539303539342b30313a3030227d2c7b226e616d65223a226e616d655f34313530222c22696e646578223a343135302c22...
2022-02-03 09:37:13.290 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49039 bytes=70697265734174223a22323032322d30322d30335431303a35343a30322e3031323930363632332b30313a3030227d2c7b226e616d65223a226e616d655f3437...
2022-02-03 09:37:13.291 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49064 bytes=322d30335430393a33353a32352e3939393039323132362b30313a3030227d2c7b226e616d65223a226e616d655f3139222c22696e646578223a31392c226578...
2022-02-03 09:37:13.291 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49039 bytes=30313138303734352b30313a3030227d2c7b226e616d65223a226e616d655f363230222c22696e646578223a3632302c22657870697265734174223a22323032...
2022-02-03 09:37:13.291 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49064 bytes=30322d30335430393a35353a32312e3030333030303832322b30313a3030227d2c7b226e616d65223a226e616d655f31323134222c22696e646578223a313231...
2022-02-03 09:37:13.291 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=49039 bytes=382c22657870697265734174223a22323032322d30322d30335431303a30353a30362e30303538323638332b30313a3030227d2c7b226e616d65223a226e616d...
2022-02-03 09:37:13.291 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] OUTBOUND DATA: streamId=17 padding=0 endStream=false length=34690 bytes=226e616d655f32333833222c22696e646578223a323338332c22657870697265734174223a22323032322d30322d30335431303a31343a35312e303037323939...
2022-02-03 09:37:13.292 [] [grpc-default-worker-ELG-5-5] DEBUG
      io.grpc.netty.NettyServerHandler - [id: 0xb53c85b6, L:/127.0.0.1:26500 - R:/127.0.0.1:56856] INBOUND RST_STREAM: streamId=17 errorCode=8
2022-02-03 09:37:13.292 [] [grpc-default-worker-ELG-5-5] WARN 
      io.grpc.netty.NettyServerHandler - Stream Error
io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place
	at io.netty.handler.codec.http2.Http2Exception.streamError(Http2Exception.java:172) ~[netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$FlowState.cancel(DefaultHttp2RemoteFlowController.java:481) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController$1.onStreamClosed(DefaultHttp2RemoteFlowController.java:105) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection.notifyClosed(DefaultHttp2Connection.java:357) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.removeFromActiveStreams(DefaultHttp2Connection.java:1007) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection$ActiveStreams.deactivate(DefaultHttp2Connection.java:963) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultStream.close(DefaultHttp2Connection.java:515) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2Connection$DefaultStream.close(DefaultHttp2Connection.java:521) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.Http2ConnectionHandler.closeStream(Http2ConnectionHandler.java:613) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onRstStreamRead(DefaultHttp2ConnectionDecoder.java:444) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.Http2InboundFrameLogger$1.onRstStreamRead(Http2InboundFrameLogger.java:80) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readRstStreamFrame(DefaultHttp2FrameReader.java:509) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2FrameReader.processPayloadState(DefaultHttp2FrameReader.java:259) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2FrameReader.readFrame(DefaultHttp2FrameReader.java:159) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.Http2InboundFrameLogger.readFrame(Http2InboundFrameLogger.java:41) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder.decodeFrame(DefaultHttp2ConnectionDecoder.java:173) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.Http2ConnectionHandler$FrameDecoder.decode(Http2ConnectionHandler.java:378) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438) [netty-codec-http2-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795) [netty-transport-classes-epoll-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480) [netty-transport-classes-epoll-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) [netty-transport-classes-epoll-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at java.lang.Thread.run(Thread.java:833) [?:?]

npepinpe (Member) commented Feb 3, 2022

Yes, that's a known issue, but unless we know of a quick fix I'd rather try to convince our product team to allocate resources to redesign the job worker end-to-end (i.e. including gateway/broker) for a smoother experience. I guess my point here is, don't dig too deep into that unless it's immediately relevant to the issue at hand, if that makes sense 😄

npepinpe (Member) commented Feb 8, 2022

I'll leave it in the backlog for now. It seems to be mostly an issue with job activation, and I would like us to work on the job activation pipeline in general, as it has more than one such issue and no real quick fixes.

npepinpe (Member) commented Apr 14, 2022

Would this be fixed by #8879? What was the final consensus: was the issue with the expression or with job activation?

oleschoenburg (Member) commented:

I think it was mostly the job activation that was the problem here. Thanks for the reminder, I actually wanted to test this again 👍

oleschoenburg (Member) commented:

For an input collection of 1000 elements everything looks as expected, but for 2000 I'm still getting pretty much the same behavior.

Zeebe logs a bunch of StreamExceptions, and the worker only gets a handful of jobs.

2022-04-14 09:00:27.903 [] [grpc-default-worker-ELG-5-3] WARN 
      io.grpc.netty.NettyServerHandler - Stream Error
io.netty.handler.codec.http2.Http2Exception$StreamException: Stream closed before write could take place
2022/04/14 11:00:12 started process: 2251799813696325
2022/04/14 11:00:12 first job handler called
2022/04/14 11:00:39 second job handler called
2022/04/14 11:00:39 second job handler called
2022/04/14 11:00:39 second job handler called
2022/04/14 11:00:39 second job handler called
2022/04/14 11:00:39 second job handler called
2022/04/14 11:00:39 second job handler called
2022/04/14 11:00:39 second job handler called
2022/04/14 11:00:39 second job handler called
2022/04/14 11:00:40 second job handler called
2022/04/14 11:00:40 second job handler called
2022/04/14 11:00:40 second job handler called
2022/04/14 11:00:40 second job handler called

@npepinpe
Copy link
Member

Yes, but now the jobs should be returned to the broker and be available for activation again, so the hope is that overall perceived performance is better now. Is that the case?

@oleschoenburg
Copy link
Member

I don't think so. There's just nothing happening. I'd expect the same worker to get the jobs once they are re-activated, right?

@npepinpe
Copy link
Member

Yeah, that's the expectation 😞

@korthout
Copy link
Member

This issue came up in a support case. As a workaround, users can try to reduce the maxActiveJobs configuration setting (the default and the config option in the Java client) to a low value (e.g. 1, 10, or 20), as sketched below.
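
For illustration, a minimal sketch of that workaround with the Go client used in the reproduction repository. The MaxJobsActive option on the worker builder is the Go counterpart of the Java client's maxActiveJobs; the value 10 here is an arbitrary example, not a recommendation:

// Sketch only: cap the number of jobs this worker keeps activated at once.
// Tune the value to your payload sizes.
client.NewJobWorker().
	JobType("second_data_processor").
	Handler(secondParallelMultiInstance).
	MaxJobsActive(10).
	Open()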

@npepinpe
Copy link
Member

Can you refresh my memory here - the issue is that we spend too long busy-looping over the same multi-instance activity because the collection is too big? Correct?

@korthout
Copy link
Member

I'm not sure, I just tried the workaround and thought it makes sense to highlight it here. @oleschoenburg do you remember the cause?

@oleschoenburg
Copy link
Member

If I remember correctly, the cause was unclear to me. I had initially suspected that evaluating the input collection (which involves FEEL) was too slow but I later discarded that theory, probably based on what I saw when I experimented with smaller input collections.

@korthout
Copy link
Member

Marking priority as later because improving 'job activation' and 'multi-instance for large collections' are not the main concerns for the process automation team right now. However, we should work on this when we work on

Note that we're currently already improving multi-instance for large collections, with:

So we could consider increasing the priority to upcoming. @saig0 @remcowesterhoud What do you think?

Please comment if you think this should have a higher priority.

@korthout
Copy link
Member

korthout commented Aug 4, 2023

@lfourky I've had another look at this bug. Using your example, I'm able to reproduce this on Zeebe version 8.2.11.

I had to make some small adjustments:

  • go.mod:
-require github.com/camunda-cloud/zeebe/clients/go v1.3.2
+require github.com/camunda/zeebe/clients/go/v8 v8.2.11
  • main.go:
-       "github.com/camunda-cloud/zeebe/clients/go/pkg/entities"
-       "github.com/camunda-cloud/zeebe/clients/go/pkg/worker"
-       "github.com/camunda-cloud/zeebe/clients/go/pkg/zbc"
+       "github.com/camunda/zeebe/clients/go/v8/pkg/entities"
+       "github.com/camunda/zeebe/clients/go/v8/pkg/worker"
+       "github.com/camunda/zeebe/clients/go/v8/pkg/zbc"

Running it, I noticed that the problem is not the large input collection per se, but the large variables passed along with each activated job. The case works as follows:

  • a large variable called inputCollection is created at the process scope by the Data Loader's job worker
  • this variable is used as input collection for the parallel multi-instance, so far so good
  • this variable is sent along to each activated job in an activate jobs batch response
  • the gateway fails to hand over the jobs to the client

Solution 1:

So, that led me to the first idea: just don't send the inputCollection along with each job. You can achieve this by setting FetchVariables on the worker so that only the inputElement variable is fetched:

client.NewJobWorker().JobType("second_data_processor").Handler(secondParallelMultiInstance).FetchVariables("inputElement").Open()

This solves the problem by reducing the amount of data that needs to be transmitted to the worker for each batch of activated jobs.
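
For completeness, a minimal handler sketch under the same assumptions (the handler name and the inputElement variable come from the example repository; error handling is reduced to the bare minimum). After FetchVariables("inputElement"), only that variable is present on the activated job:

// assumes the usual imports: "context", "log", and the v8 packages entities and worker
func secondParallelMultiInstance(client worker.JobClient, job entities.Job) {
	vars, err := job.GetVariablesAsMap()
	if err != nil {
		// hand the job back to the broker so it can be retried
		client.NewFailJobCommand().JobKey(job.GetKey()).Retries(job.Retries - 1).Send(context.Background())
		return
	}
	// only inputElement was fetched, so this is the only variable available here
	log.Printf("processing element: %v", vars["inputElement"])
	client.NewCompleteJobCommand().JobKey(job.GetKey()).Send(context.Background())
}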

Solution 2:

But I wondered, how can it be that the gateway cannot send the jobs to the client when there are large variables sent along with the activated jobs?

I was able to complete the process by increasing the client's inbound max message size (MaxMsgSize) to 8 MB:

client, err := zbc.NewClient(&zbc.ClientConfig{
	GatewayAddress:         gatewayAddress,
	UsePlaintextConnection: true,
	DialOpts:               []grpc.DialOption{grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(8 * 1024 * 1024))},
})

So, it seems that the problem is that the broker can produce responses larger than the expected maximum of 4MB (default). I'm not yet sure whether this is a bug or expected behavior.

@npepinpe What do you think?


@lfourky You can find a working version of your example using solution 2 here.

@korthout
Copy link
Member

korthout commented Aug 4, 2023

So, we have two workarounds available:

  1. don't fetch so many variables
  2. increase the inbound max message size

🤔 However, I'm still unsure whether it is expected that the gateway can send the client an ActivateJobBatch response that is larger than the default 4MB.

@korthout korthout added scope/clients-go Marks an issue or PR to appear in the Go client section of the changelog component/clients area/reliability Marks an issue as related to improving the reliability of our software (i.e. it behaves as expected) and removed scope/broker Marks an issue or PR to appear in the broker section of the changelog component/engine area/performance Marks an issue as performance related labels Aug 7, 2023
@npepinpe
Copy link
Member

npepinpe commented Aug 7, 2023

So, it seems that the problem is that the broker can produce responses larger than the expected maximum of 4MB (default). I'm not yet sure whether this is a bug or expected behavior.

Since the response goes through different encodings, it's possible that it was very close to 4MB on the broker/Raft side but, once transformed to gRPC, larger than 4MB. I can't really think of another reason (other than a misconfiguration with different max message sizes, but I assume that's not it here).

I would expect that if we fail to forward the activated job, we would yield it back to the engine.

@korthout
Copy link
Member

korthout commented Aug 8, 2023

I would expect that if we fail to forward the activated job, we would yield it back to the engine.

That happens as expected; yielding this way is still implemented using FailJob. In the situation above, this simply leads to loads of traffic:

  • collecting and activating jobs with large variables
  • failing them when unable to forward
  • collecting again
  • etc
  • sometimes some jobs make it through (when the combined size of the jobs available at that moment and their variables happens to fit in a response that the client can receive)

I'm not yet sure whether this is a bug or expected behavior.

it's possible that it was very close to 4MB on the broker/Raft side, but once transformed to gRPC is larger than 4MB.

@npepinpe Do you think that's expected behavior? Or should we prevent this from happening? For example, we could further limit the maximum size of jobs collected in a job batch record. That would come at the cost of payload size (users might expect to activate a single job with a single large variable close to the configured max message size).

EDIT: If we believe this is expected behavior, then I'll close this issue.

@remcowesterhoud
Copy link
Contributor

For example, we could further limit the max size of jobs collected in a job batch record.

This sounds a lot like what I did recently for a different bug 😄

#13624

@romansmirnov romansmirnov added the component/zeebe Related to the Zeebe component/team label Mar 5, 2024