Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

throttle-limit unclear, javadoc wrong #3947

Closed
farnetto opened this issue Jun 21, 2021 · 7 comments
Closed

throttle-limit unclear, javadoc wrong #3947

farnetto opened this issue Jun 21, 2021 · 7 comments

Comments

@farnetto
Copy link
Contributor

farnetto commented Jun 21, 2021

It is unclear what throttle-limit does, and the Javadoc comment in TaskExecutorRepeatTemplate#setThrottleLimit is wrong imho.

Yes, "the throttle limit is the largest number of concurrent tasks that can be executing at one time", but the following statement in the javadoc (from @dsyer possibly?) is misleading or wrong I think:

N.B. when used with a thread pooled {@link TaskExecutor} the thread pool
might prevent the throttle limit actually being reached (so make the core
pool size larger than the throttle limit if possible).

The semaphore waits in ResultHolderResultQueue prevents more than "throttle limit" tasks being active at once, so if the thread pool is larger, there will be idle threads.

From another angle, a ThreadPoolExecutor will normally never block when adding a task, so how could it prevent throttle limit from being reached? Tasks above and beyond the number of available threads will simply be queued.

Throttle limit is the number of tasks being processed by threads plus the number of tasks in the jobQueue when using a ThreadPoolTaskExecutor (and hence a ThreadPoolExecutor), so a better name for the property would be e.g. max-concurrent.

I find the example parallel-sample could be much improved. One could set the property throttle-limit, use a ThreadPoolTaskExecutor (with e.g. corePoolSize == maxPoolSize for a fixed-size task executor), and have more than five lines as input data! See 20070122.teststream.ImportTradeDataStep.txt.

On the same note, throttle-limit does not even get a mention in @mminella's latest book. The documentation and explanation for scaling Spring Batch in general could be much improved and expanded in my opinion.

@farnetto farnetto added the status: waiting-for-triage Issues that we did not analyse yet label Jun 21, 2021
@fmbenhassine
Copy link
Contributor

fmbenhassine commented Jun 24, 2021

It is unclear what throttle-limit does

The throttleLimit parameter limits the number of concurrent executions of the Tasklet in a TaskletStep when a task executor is specified. This is explained in the javadoc of AbstractTaskletStepBuilder#throttleLimit. Here is a quick example:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.batch.core.listener.JobExecutionListenerSupport;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.context.ApplicationContext;
import org.springframework.context.annotation.AnnotationConfigApplicationContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableBatchProcessing
public class GH3947 {

    @Bean
    ThreadPoolTaskExecutor taskExecutor() {
        ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
        taskExecutor.setCorePoolSize(10);
        taskExecutor.setMaxPoolSize(10);
        taskExecutor.setQueueCapacity(10);
        return taskExecutor;
    }

    @Bean
    public Job job(JobBuilderFactory jobs, StepBuilderFactory steps, ThreadPoolTaskExecutor taskExecutor) {
        Tasklet tasklet = (contribution, chunkContext) -> {
            System.out.println(Thread.currentThread().getName() + " running tasklet");
            return RepeatStatus.FINISHED;
        };

        return jobs.get("job")
                .start(steps.get("step")
                        .tasklet(tasklet)
                        .taskExecutor(taskExecutor)
                        .throttleLimit(5)
                        .build())
                .listener(new JobExecutionListenerSupport() {
                    @Override
                    public void afterJob(JobExecution jobExecution) {
                        taskExecutor.shutdown();
                    }
                })
                .build();
    }

    public static void main(String[] args) throws Exception {
        ApplicationContext context = new AnnotationConfigApplicationContext(GH3947.class);
        JobLauncher jobLauncher = context.getBean(JobLauncher.class);
        Job job = context.getBean(Job.class);
        jobLauncher.run(job, new JobParameters());
    }

}

This prints:

taskExecutor-2 running tasklet
taskExecutor-5 running tasklet
taskExecutor-3 running tasklet
taskExecutor-1 running tasklet
taskExecutor-4 running tasklet

In this example, the task executor pool size is set to 10, but the number of concurrent Tasklet executions is throttled at 5. Is that more clear to you now? Hopefully this example makes things a bit clearer.

the following statement in the javadoc (from @dsyer possibly?) is misleading or wrong I think:

N.B. when used with a thread pooled {@link TaskExecutor} the thread pool
might prevent the throttle limit actually being reached (so make the core
pool size larger than the throttle limit if possible).

I think this javadoc statement is about the case where the task executor is configured to limit the number of concurrent tasks to a value lower than the throttle limit, in which case it would prevent the throttle limit to be reached.

That said, the fact that this throttle limit parameter may interfere with the throttle limit of the task executor itself (which is the confusion you are facing I think), and in addition to other issues like #1516, this parameter will be deprecated in v5 in favor of the throttling capabilities of the provided task executor, see #2218.

With regard to the parallel sample improvement, we are actually planning to improve all samples in v5 as well, see #3663 .

@fmbenhassine fmbenhassine added status: waiting-for-reporter Issues for which we are waiting for feedback from the reporter and removed status: waiting-for-triage Issues that we did not analyse yet labels Jun 24, 2021
@farnetto
Copy link
Contributor Author

Your example doesn't terminate. The main thread ends but all the threads in the pool block waiting on a new task. You need to call setAllowCoreThreadTimeOut or setDaemon on the ThreadPoolTaskExecutor or shutdown on the ThreadPoolExecutor.

You cap the concurrency at 5 with the throttle limit but permanently have 10 threads in the pool. That's a waste of resources, doesn't make any sense. And why cause a queue for waiting tasks to be created with setQueueCapacity when it will never be used? That doesn't make any sense either.

@fmbenhassine
Copy link
Contributor

Your example doesn't terminate.

That's not the real issue here, I updated the sample to shutdown the task executor with a listener after the job. That said, the task executor could be kept running if the JVM was used for something else after the job is finished (like for example to run subsequent scheduled jobs).

You cap the concurrency at 5 with the throttle limit but permanently have 10 threads in the pool. That's a waste of resources, doesn't make any sense. And why cause a queue for waiting tasks to be created with setQueueCapacity when it will never be used? That doesn't make any sense either.

Of course it's a waste of resources in this particular case, that's why it is up to you to correctly configure the size of your thread pool and its queue capacity. The example is only for demonstration purpose to clarify things, and could not explain the purpose of throttle-limit parameter if its value was higher than the pool size.

But this is not always a waste of resources, for example in the case where the same thread pool is used in different parts of batch application. Many users prefer creating a single thread pool (instead of several ones due to the cost of creating threads/pools) and (re)use it in different parts of the batch app, like for example to launch jobs (by setting the pool in the JobLauncher) and also to run a multi-threaded step or a locally partitioned step. In this case, you can have a pool of 10 threads and set the throttle-limit to 5 in your step, in which case, only 5 threads from the pool will be used for the step, and the remaining threads could be used to launch concurrent jobs or something else. In this case, there is no waste of resources, and the throttle-limit makes perfect sense.

@farnetto
Copy link
Contributor Author

I was actually not asking for an explanation. I understand what the throttle limit does because I have looked at the code. My point is the throttle limit is unclearly documented and explained at best.

The throttle limit is not a cap on the number of concurrent executions. It is the exact number of (dummy) runnables that are submitted to the Executor as long as the step is running, i.e. until the RepeatStatus of the Tasklet is FINISHED, hence it is a minimum as well. This is regulated by the semaphore waits in ResultHolderResultQueue.

Permits would be a more appropriate name. Why not use a bounded queue as a thread handoff mechanism instead? What exactly is ResultHolderResultQueue.count counting - shouldn't it have a better name or at least documentation? It is not a queue because it HAS A queue (results). Why is the semaphore protecting the executor from flooding accessed in TaskExecutorRepeatTemplate indirectly through the dummy runnable? This code is poor and needs an overhaul. I'm a bit shocked it was written over 10 years ago and hasn't been improved since.

@farnetto
Copy link
Contributor Author

farnetto commented Jun 28, 2021

With your example setting a queue capacity it is easy to get the batch to hang with an inappropriate throttle limit - because throttle limit is not just a cap but also the fewest number of tasks sent to the executor. I've opened a bug report for this problem here: #3951.

@fmbenhassine
Copy link
Contributor

ok let's back up, one issue at a time because you are addressing several aspects at once (documentation, implementation, samples, etc).

I was actually not asking for an explanation.

Well, based on your statement: It is unclear what throttle-limit does, I thought it was not clear to you, so I tried to clarify things with an example.

The throttle limit is not a cap on the number of concurrent executions.

So here you say throttle limit is not a cap on the number of executions, but in your initial comment you said: Yes, "the throttle limit is the largest number of concurrent tasks that can be executing at one time", [...]. Sorry, but I'm not sure I'm following anymore.

There are at least two things reported here based on what you shared:

Documentation issue

You claim the Javadoc is misleading or wrong, but I don't think so. The javadocs being discussed here are:

In the case of an asynchronous taskExecutor(TaskExecutor) the number of concurrent tasklet executions can be throttled (beyond any throttling provided by a thread pool)

This is clear and correct IMO, as shown in the previous example. If it is not clear to you, please open a PR with what you think will clarify the docs.

N.B. when used with a thread pooled TaskExecutor the thread pool might prevent the throttle limit actually being reached (so make the core pool size larger than the throttle limit if possible)

This is also correct. If you set your thread pool size to 10 and the throttle limit to 2000, then your throttle limit won't be reached since the thread pool will prevent that from happening anyway (ie having 2000 concurrent executions with only 10 threads). That's why the javadoc suggests to make the pool size larger than the throttle limit if possible. Again, if you think this is not clear enough, please open a PR with what you think can make things clearer.

That said, I'm closing this issue in favor of a PR from your part if you still think this is needed.

Implementation issue

Why not use a bounded queue as a thread handoff mechanism instead? What exactly is ResultHolderResultQueue.count counting - shouldn't it have a better name or at least documentation? It is not a queue because it HAS A queue (results). Why is the semaphore protecting the executor from flooding accessed in TaskExecutorRepeatTemplate indirectly through the dummy runnable? This code is poor and needs an overhaul. I'm a bit shocked it was written over 10 years ago and hasn't been improved since.

We are open to constructive feedback. If you think the current implementation can be improved, you are welcome to share your ideas or open a PR. Please note that the throttle-limit will be deprecated (#2218) and the concurrency model will be reviewed (#3950), so I suggest you wait for that to happen before attempting to improve the current implementation.

With your example setting a queue capacity it is easy to get the batch to hang with an inappropriate throttle limit - because throttle limit is not just a cap but also the fewest number of tasks sent to the executor. I've opened a bug report for this problem here: #3951.

Thank you for opening that issue. We will take a look and get back to you.

@farnetto
Copy link
Contributor Author

farnetto commented Jun 29, 2021

Sorry, you're not going to sell me on this code. Just quoting the javadoc and claiming everything is correct is not going to convince anybody. Please read the code. You don't seem to be familiar with it. I've shown it is quite easy to produce a deadlock in multithreaded step among other things. We'll just have to hope for an improvement in v5 (after no improvement in v2-4). I have nothing left to say on this topic, except that scalability is one of the big reasons we were considering adopting Spring Batch, and Michael Minella's book, after claiming "the features Spring Batch offers to scale ... are some of the strongest of any framework," spends only two pages on multithreaded step and doesn't even mention the crucial throttleLimit property, or says not to use SimpleAsyncTaskExecutor after using it in the examples, but doesn't describe any alternatives or their pitfalls.
fyi @dsyer @mminella

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants