
High memory consumption during long running jobs #3790

Closed
PauliusPeciura opened this issue Oct 12, 2020 · 13 comments · May be fixed by #3791
Labels
for: backport-to-4.3.x · has: minimal-example · has: votes · in: core · related-to: performance · type: enhancement
Milestone
5.0.1

Comments

@PauliusPeciura

PauliusPeciura commented Oct 12, 2020

Bug description
We found that memory consumption is fairly high on one of the service nodes that uses Spring Batch. Even though both nodes did a similar amount of work, memory consumption was not even across them: 15 GB vs 1.5 GB (see the memory use screenshot).

We have some jobs that run for seconds while others might run for hours, so we set the polling interval (MessageChannelPartitionHandler#setPollInterval) to 1 second rather than the default of 10 seconds. In one long-running job scenario, we ended up creating 837 step executions.
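For context, the handler is configured roughly like this (a minimal sketch; the step name, grid size, and bean wiring are illustrative rather than our exact setup):

    import org.springframework.batch.core.explore.JobExplorer;
    import org.springframework.batch.integration.partition.MessageChannelPartitionHandler;
    import org.springframework.context.annotation.Bean;
    import org.springframework.integration.core.MessagingTemplate;

    @Bean
    public MessageChannelPartitionHandler partitionHandler(JobExplorer jobExplorer,
            MessagingTemplate messagingTemplate) {
        MessageChannelPartitionHandler handler = new MessageChannelPartitionHandler();
        handler.setStepName("workerStep");          // illustrative worker step name
        handler.setGridSize(837);                   // one partition per step execution
        handler.setMessagingOperations(messagingTemplate);
        handler.setJobExplorer(jobExplorer);        // enables DB polling for worker status
        handler.setPollInterval(1000L);             // 1 s instead of the 10 s default
        return handler;
    }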

What I found is that MessageChannelPartitionHandler#pollReplies retrieves a full StepExecution representation for each step, and each of those contains a JobExecution that in turn holds StepExecutions for every step. Because they are retrieved at different times and stages, none of these objects are shared, so we end up with a quadratic number of StepExecution objects, e.g. 837 * 837 = 700,569 StepExecutions (see screenshot below).

Environment
Initially reproduced on Spring Batch 4.1.4.

Expected behavior
My proposal would be to:

  1. Issue a SQL query to get the count of running StepExecutions instead of retrieving full objects. This way, fewer objects are loaded onto the heap.
  2. Once all steps are finished, query for all StepExecutions of that job in a single call and assign the same JobExecution instance to each step. (Both steps are sketched below.)
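To make the idea concrete, here is a rough sketch of both steps. The table and column names come from the standard Spring Batch schema; the class and its methods are hypothetical helpers, not an existing API:

    import java.util.Collection;

    import org.springframework.batch.core.StepExecution;
    import org.springframework.batch.core.explore.JobExplorer;
    import org.springframework.jdbc.core.JdbcTemplate;

    class WorkerCompletionCheck {

        // Proposal item 1: count unfinished workers with a single query instead
        // of materialising every StepExecution on each poll.
        private static final String COUNT_RUNNING_WORKERS =
                "SELECT COUNT(*) FROM BATCH_STEP_EXECUTION "
                + "WHERE JOB_EXECUTION_ID = ? AND STEP_NAME LIKE ? "
                + "AND STATUS NOT IN ('COMPLETED', 'FAILED', 'STOPPED', 'ABANDONED')";

        boolean allWorkersFinished(JdbcTemplate jdbcTemplate, long jobExecutionId,
                String workerStepNamePrefix) {
            Integer running = jdbcTemplate.queryForObject(COUNT_RUNNING_WORKERS,
                    Integer.class, jobExecutionId, workerStepNamePrefix + "%");
            return running != null && running == 0;
        }

        // Proposal item 2: once the count reaches zero, hydrate the step
        // executions a single time; they all hang off the one JobExecution
        // loaded here instead of each dragging in its own copy of the graph.
        Collection<StepExecution> fetchResults(JobExplorer jobExplorer, long jobExecutionId) {
            return jobExplorer.getJobExecution(jobExecutionId).getStepExecutions();
        }
    }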

Memory usage graph comparison between two service nodes doing a roughly equal amount of work:

memoryUse - redacted

My apologies for the messy screenshot, but it does show the number of StepExecution objects:

stepExecutions - redacted

@ssanghavi-appdirect

We are facing the same issue. When the number of steps in a job increases, it leads to an OOM error that kills the manager JVM.
Is there a plan to fix this?

@fmbenhassine
Contributor

@PauliusPeciura Thank you for reporting this issue and for opening a PR! I would like to reproduce the issue first in order to validate any fix. From your usage of MessageChannelPartitionHandler, I understand that this is related to a remote partitioning setup. However, you did not share your job/step configuration. Is a job with a single partitioned step configured with a high number of worker steps enough to reproduce the issue? Do you think the same problem would happen locally with a TaskExecutorPartitionHandler (this would be easier to test than a remote partitioning setup)? I would be grateful if you could share more details on your configuration or provide a minimal example.

@ssanghavi-appdirect Yes. If we can reproduce the issue in a reliable manner, we will plan a fix for one of the upcoming releases.

@fmbenhassine fmbenhassine added status: waiting-for-reporter Issues for which we are waiting for feedback from the reporter and removed status: waiting-for-triage Issues that we did not analyse yet labels Mar 22, 2021
@ssanghavi-appdirect

@benas I am able to reproduce this with TaskExecutorPartitionHandler as well. However, the fix provided by @PauliusPeciura is very specific to DB polling and won't fix what I reproduced with TaskExecutorPartitionHandler.
Basically, this issue can occur in any code path that holds references to the StepExecution objects returned by JobExplorer.getStepExecution. Similar code exists in RemoteStepExecutionAggregator.aggregate() and MessageChannelPartitionHandler.pollReplies.

Scenario to reproduce: create a job with more than 900 remote partitions and wait for it to complete. Observe that the manager JVM fails with an OOM error if -Xmx is set; otherwise, memory consumption keeps increasing.
The issue can be reproduced with both MessageChannelPartitionHandler and TaskExecutorPartitionHandler. Using MessageChannelPartitionHandler, we were able to reproduce it both with DB polling and with a request-reply channel.
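For reference, the local setup looks roughly like this (a sketch assuming the Spring Batch 4.x builder API; the bean wiring and the grid size are illustrative, not our exact code):

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.core.partition.support.SimplePartitioner;
    import org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler;
    import org.springframework.context.annotation.Bean;
    import org.springframework.core.task.SimpleAsyncTaskExecutor;

    @Bean
    public Step managerStep(StepBuilderFactory steps, Step workerStep) {
        TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
        handler.setStep(workerStep);
        handler.setTaskExecutor(new SimpleAsyncTaskExecutor());
        handler.setGridSize(900); // roughly the partition count at which OOM appears
        return steps.get("managerStep")
                .partitioner("workerStep", new SimplePartitioner())
                .partitionHandler(handler)
                .build();
    }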

What is the most convenient way to share code that reproduces the issue?

@ssanghavi-appdirect

Attaching a Spring Boot project that reproduces the issue with TaskExecutorPartitionHandler. It requires Maven and Java 11 to run.

Steps to execute the program

  1. Download the attached zip file and extract the contents.
  2. Navigate to the spring-batch-remoting directory created by step 1.
  3. Build with mvn clean install.
  4. Start the application with java -Xmx250m -jar target/spring-batch-remoting-0.0.1-SNAPSHOT.jar.

spring-batch-remoting.zip

@fmbenhassine fmbenhassine added the has: minimal-example Bug reports that provide a minimal complete reproducible example label Mar 31, 2021
@cazacmarin

Will this picture help, guys? Does it show that, even on the latest Spring Batch version, there really is a memory leak inside?

image

@fmbenhassine
Contributor

Thank you all for your feedback here! This is a valid performance issue. There is definitely no need to load the entire object graph of step executions when polling the status of workers.

Ideally, polling for running workers could be done with a single query, and once they are all done, we should grab shallow copies of step executions with the minimum required to do the aggregation.
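In other words, the polling loop could take roughly this shape (a sketch only, not the actual patch: countRunningWorkers stands for the single-query check, and pollInterval, timeout, jobExecutionId, and jobExplorer are assumed to be in scope):

    import java.util.Collection;
    import java.util.concurrent.Callable;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    import org.springframework.batch.core.StepExecution;
    import org.springframework.batch.poller.DirectPoller;
    import org.springframework.batch.poller.Poller;

    // Poll a cheap count instead of loading every StepExecution on each tick;
    // DirectPoller keeps polling while the callable returns null.
    Poller<Boolean> poller = new DirectPoller<>(pollInterval);
    Callable<Boolean> workersDone = () ->
            countRunningWorkers(jobExecutionId) == 0 ? Boolean.TRUE : null;
    Future<Boolean> future = poller.poll(workersDone);
    future.get(timeout, TimeUnit.MILLISECONDS); // times out if workers overrun
    // Only now load the step executions, once, for result aggregation:
    Collection<StepExecution> results =
            jobExplorer.getJobExecution(jobExecutionId).getStepExecutions();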

I will plan the fix for the upcoming 5.0.1 / 4.3.8.

@fmbenhassine fmbenhassine added for: backport-to-4.3.x Issues that will be back-ported to the 4.3.x line and removed status: waiting-for-reporter Issues for which we are waiting for feedback from the reporter labels Feb 22, 2023
@fmbenhassine fmbenhassine added this to the 5.0.1 milestone Feb 22, 2023
@fmbenhassine fmbenhassine added the has: votes Issues that have votes label Feb 22, 2023
fmbenhassine pushed a commit that referenced this issue Feb 22, 2023
@galovics

galovics commented Mar 29, 2023

@fmbenhassine I'm afraid the issue is still present. I've checked the commit you made, but since it still works with entities, the associations are still there.

Here's a snapshot from a heap dump I've taken:
images (three heap-dump screenshots)

And here's the relevant stacktrace where the objects are coming from:

Scheduler1_Worker-1
  at java.lang.Thread.sleep(J)V (Native Method)
  at org.springframework.batch.poller.DirectPoller$DirectPollingFuture.get(JLjava/util/concurrent/TimeUnit;)Ljava/lang/Object; (DirectPoller.java:109)
  at org.springframework.batch.poller.DirectPoller$DirectPollingFuture.get()Ljava/lang/Object; (DirectPoller.java:80)
  at org.springframework.batch.integration.partition.MessageChannelPartitionHandler.pollReplies(Lorg/springframework/batch/core/StepExecution;Ljava/util/Set;)Ljava/util/Collection; (MessageChannelPartitionHandler.java:288)
  at org.springframework.batch.integration.partition.MessageChannelPartitionHandler.handle(Lorg/springframework/batch/core/partition/StepExecutionSplitter;Lorg/springframework/batch/core/StepExecution;)Ljava/util/Collection; (MessageChannelPartitionHandler.java:251)
  at org.springframework.batch.core.partition.support.PartitionStep.doExecute(Lorg/springframework/batch/core/StepExecution;)V (PartitionStep.java:106)
  at org.springframework.batch.core.step.AbstractStep.execute(Lorg/springframework/batch/core/StepExecution;)V (AbstractStep.java:208)
  at org.springframework.batch.core.job.SimpleStepHandler.handleStep(Lorg/springframework/batch/core/Step;Lorg/springframework/batch/core/JobExecution;)Lorg/springframework/batch/core/StepExecution; (SimpleStepHandler.java:152)
  at org.springframework.batch.core.job.AbstractJob.handleStep(Lorg/springframework/batch/core/Step;Lorg/springframework/batch/core/JobExecution;)Lorg/springframework/batch/core/StepExecution; (AbstractJob.java:413)
  at org.springframework.batch.core.job.SimpleJob.doExecute(Lorg/springframework/batch/core/JobExecution;)V (SimpleJob.java:136)
  at org.springframework.batch.core.job.AbstractJob.execute(Lorg/springframework/batch/core/JobExecution;)V (AbstractJob.java:320)
  at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run()V (SimpleJobLauncher.java:149)
  at org.springframework.core.task.SyncTaskExecutor.execute(Ljava/lang/Runnable;)V (SyncTaskExecutor.java:50)
  at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(Lorg/springframework/batch/core/Job;Lorg/springframework/batch/core/JobParameters;)Lorg/springframework/batch/core/JobExecution; (SimpleJobLauncher.java:140)
  ...
  at org.springframework.scheduling.quartz.QuartzJobBean.execute(Lorg/quartz/JobExecutionContext;)V (QuartzJobBean.java:75)
  at org.quartz.core.JobRunShell.run()V (JobRunShell.java:202)
  at org.quartz.simpl.SimpleThreadPool$WorkerThread.run()V (SimpleThreadPool.java:573)

Note: this specific job can run for hours and processes a lot of data (millions of records). When the number of partitions exceeds 500 (not an exact threshold), the manager slowly accumulates more and more memory.
As a mitigation, I've reduced the number of partitions to around 36 and now it doesn't fail. It probably still consumes more and more memory, but the job finishes before it runs out.
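For anyone else needing the same stop-gap: the partition count is typically driven by the grid size, e.g. (a sketch assuming the Spring Batch 5 builder API; depending on the setup, the value may instead live on the partition handler via setGridSize):

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.partition.PartitionHandler;
    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.core.repository.JobRepository;
    import org.springframework.batch.core.step.builder.StepBuilder;
    import org.springframework.context.annotation.Bean;

    @Bean
    public Step managerStep(JobRepository jobRepository, Partitioner partitioner,
            PartitionHandler partitionHandler) {
        return new StepBuilder("managerStep", jobRepository)
                .partitioner("workerStep", partitioner)
                .gridSize(36) // reduced from 500+ so the manager survives the run
                .partitionHandler(partitionHandler)
                .build();
    }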

@fmbenhassine
Contributor

@galovics Thank you for reporting this.

I'm afraid the issue is still present. I've checked the commit you made, but since it still works with entities, the associations are still there.

We will always work with entities according to the domain model. What we can do is reduce the number of entities loaded in memory to the minimum required. Before 93800c6, the code was loading job executions in a loop for every partitioned step execution, which is obviously not necessary.

In your screenshot, I see several JobExecution objects with different IDs. Are you running several job instances in the same JVM and sharing the MessageChannelPartitionHandler between them?

To correctly address any performance issue, we need to analyse the performance for a single job execution first. So I am expecting to see a single job execution in memory with a partitioned step. Once we ensure that a single partitioned execution is optimized, we can discuss if the packaging/deployment pattern is suitable to run several job executions in the same JVM or not.

Please open a separate issue and provide a minimal example to be sure we are addressing your specific issue and we will dig deeper. Thank you upfront.

@galovics

galovics commented Apr 5, 2023

@fmbenhassine

In your screenshot, I see several JobExecution objects with different IDs. Are you running several job instances in the same JVM and sharing the MessageChannelPartitionHandler between them?

That's strange to me too. I re-read the Spring Batch docs on job instances to make sure we use the same terminology, and I can confirm there's a single job instance being run. In fact, it's the textbook example from the Spring Batch docs.
It's a remote-partitioned end-of-day job (close of business (COB), as we refer to it) that runs once each day.

I can even show you the code because the project is open-source.
Here's the whole manager configuration: https://github.com/apache/fineract/blob/dbfedf5cfdffbddfd400f51498c02a88c0551bd1/fineract-provider/src/main/java/org/apache/fineract/cob/loan/LoanCOBManagerConfiguration.java
Here's the worker configuration: https://github.com/apache/fineract/blob/dbfedf5cfdffbddfd400f51498c02a88c0551bd1/fineract-provider/src/main/java/org/apache/fineract/cob/loan/LoanCOBWorkerConfiguration.java

@fmbenhassine
Contributor

Thank you for your feedback.

I can confirm there's a single job instance being run

In that case, there should really be a single JobExecution object in memory. By design, Spring Batch does not allow concurrent job executions of the same job instance. Therefore, if a single job instance is launched within a JVM, there should be a single job execution for that instance running at a time (and consequently, a single JobExecution object in memory). That is the setup we need to analyse the performance issue.

As mentioned previously, as this issue has been closed and assigned to a release, please open a separate one with all these details and I will take a look. Thank you upfront.

@pstetsuk

pstetsuk commented Jan 8, 2024

We have the same problem. I modified PR #3791 so it can be merged into the main branch.

@hpoettker
Contributor

@galovics @pstetsuk
If you find the time, it would be interesting to hear whether #4599 improves the situation for you.

@pstetsuk

@hpoettker Our problem is that we have thousands of steps, and all of them are loaded into memory every time the step results are checked, which leads to an OutOfMemoryError. Your fix doesn't change this behavior and can't resolve the problem. The fix from @galovics doesn't load all the steps; it gets the count of incomplete steps from the database instead. It works much faster and consumes much less memory.

Development

Successfully merging a pull request may close this issue.

7 participants