Deadlock when creating a job on SQL Server when multiple jobs start at once. [BATCH-2147] #1448
Comments
Steve Ash commented I also hit this, and I have the T1222 trace output (SQL Server's deadlock trace flag) showing exactly what is deadlocking (although I think it's already obvious from the above). I do not think that retrying should be the solution. If the problem is just getting sequence numbers, then a better solution is to acquire those sequence numbers in a separate transaction. Having "missing" sequence numbers (due to failure) is not a problem; they are synthetic numbers.
Steve Ash commented OK, well, I created a max value incrementer factory that decorates the default one, using a TransactionTemplate to give each call its own transaction. This got past my original deadlock and went right to another one. When creating a job, SimpleJobRepository first SELECTs for an existing job instance, and only INSERTs a new one if it doesn't find any. Unfortunately that first select (even at serializable) only takes out shared locks. So two concurrent processes both get shared (reader) locks, then both try to acquire exclusive (writer) locks, leading to a deadlock. Since you know that you're probably going to end up inserting after the select, it's best to issue that select with the FOR UPDATE clause, which causes the database to acquire update locks; those are incompatible with each other, so only one transaction gets through the select at a time and the lock-conversion deadlock scenario cannot occur.
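On SQL Server there is no FOR UPDATE clause on a plain SELECT; the closest equivalent is the UPDLOCK table hint. A minimal sketch of the idea, assuming the default Spring Batch table and column names:

```sql
-- Take an update (U) lock instead of a shared (S) lock on the probe SELECT.
-- U locks are incompatible with each other, so two concurrent creators
-- serialize here rather than both reading under shared locks and then
-- deadlocking on the shared-to-exclusive lock conversion.
SELECT JOB_INSTANCE_ID, JOB_NAME
FROM BATCH_JOB_INSTANCE WITH (UPDLOCK, ROWLOCK)
WHERE JOB_NAME = ? AND JOB_KEY = ?;
```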
Ayushya Devmurari commented Hey there, I am also facing some issues while running (starting) multiple jobs at the same time. The stack trace of my application is on this link. Is this the same issue under the Hibernate layer? If you need any more information, I am willing to provide it.
Andreas Selenwall commented Spring Team! Do you have an update on this issue? This EXACT issue occurs for us as well. It doesn't matter which JobParameters we use, we still get a deadlock in SqlServerMaxValueIncrementer, which keeps failing our jobs from time to time. It is not about unique jobs; it happens because we access the same table. It doesn't happen a lot, but it is still pretty annoying to deal with, since at the moment we have to rerun jobs manually when it occurs. Could you please investigate and give us a possible work-around?
Josh Chappelle commented I believe we are also experiencing this problem. There is a comment in SqlServerMaxValueIncrementer that says it is recommended to use IDENTITY columns for SQL Server. Is it possible to make Spring Batch work that way? It's easy enough to change the schema if there were only a way to configure Spring Batch to retrieve the generated keys. Here's the stack trace in case it helps:
kaushik mondal commented Spring Team! Do you have an update on this issue?
Andreas Fe commented Well, we faced the same issue and seem to have been able to solve it through configuration:

```java
@Configuration
@EnableBatchProcessing
public class BatchConfiguration {

    private static final String ISOLATION_REPEATABLE_READ = "ISOLATION_REPEATABLE_READ";

    @Autowired
    private DataSource dataSource;

    @Autowired
    private PlatformTransactionManager platformTransactionManager;

    @Bean
    public JobRepository jobRepository() throws Exception {
        JobRepositoryFactoryBean factory = new JobRepositoryFactoryBean();
        factory.setDataSource(dataSource);
        factory.setTransactionManager(platformTransactionManager);
        factory.setValidateTransactionState(true);
        factory.setIsolationLevelForCreate(ISOLATION_REPEATABLE_READ);
        factory.setIncrementerFactory(customIncrementerFactory());
        factory.afterPropertiesSet();
        return factory.getObject();
    }

    @Bean
    public SimpleJobLauncher jobLauncher(JobRepository jobRepository) {
        SimpleJobLauncher simpleJobLauncher = new SimpleJobLauncher();
        simpleJobLauncher.setJobRepository(jobRepository);
        return simpleJobLauncher;
    }

    private DataFieldMaxValueIncrementerFactory customIncrementerFactory() {
        return new CustomDataFieldMaxValueIncrementerFactory(dataSource);
    }

    private class CustomDataFieldMaxValueIncrementerFactory extends DefaultDataFieldMaxValueIncrementerFactory {

        CustomDataFieldMaxValueIncrementerFactory(DataSource dataSource) {
            super(dataSource);
        }

        @Override
        public DataFieldMaxValueIncrementer getIncrementer(String incrementerType, String incrementerName) {
            DataFieldMaxValueIncrementer incrementer = super.getIncrementer(incrementerType, incrementerName);
            if (incrementer instanceof SqlServerMaxValueIncrementer) {
                ((SqlServerMaxValueIncrementer) incrementer).setCacheSize(20);
            }
            return incrementer;
        }
    }
}
```

If there's an easier way to achieve this, feel free to point me in that direction :)

Update: Different jobs were now trying to insert into the DB at the same time, causing a deadlock. We now set the isolation level to "ISOLATION_READ_COMMITTED". To prevent executing the same job in parallel on a cluster, we have been using Hazelcast locks all along.
Jeff Smith commented I ran into this issue and fought it all last week, but found a workaround last Friday, and since I implemented it in production I haven't seen a single deadlock. The fix that worked for me is 2 things:

ALTER TABLE [JET].[BATCH_JOB_SEQ]
ALTER TABLE [JET].[BATCH_STEP_EXECUTION_SEQ]

I spent a lot of time looking at this code last week, and you should know that to get a next sequence number value it basically does these 3 things within a transaction:

All 3 of those steps are wrapped into a transaction that is added into the code via an aspect. That is why you don't see the transaction code directly in that class; it's injected into it from another class. During testing I was able to consistently reproduce this by rapidly kicking off many repeated instances of jobs across 2 different Spring Batch JVMs that shared the same repo tables. I had the DBA use monitoring tools, and she was able to detect that the deadlock was due to the delete operation. It was putting a table lock in place and causing the deadlock with the other JVMs accessing the same SEQ table. For some reason, adding the index meant that the delete operation was able to use the index to do its work instead of locking the whole table. But I did test both with and without setting the isolation level as I specified above; I also had to set that isolation level, otherwise the deadlock still happened. I'm not an expert in isolation levels, so I'm not totally sure why that works, but it does. I would recommend the Spring Batch developers add those index creation steps to the SQL that is used to create the repo tables for SQL Server. I hope this helps.
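The three steps Jeff describes correspond, roughly, to the statements SqlServerMaxValueIncrementer issues against the *_SEQ table. The following is a paraphrased sketch, not the library's verbatim SQL (exact statements vary by Spring version; BATCH_JOB_SEQ and its ID column are the default names):

```sql
-- Sketch of what happens inside the incrementer's transaction:
BEGIN TRANSACTION;

-- 1. insert a row so the table's IDENTITY column produces the next value
INSERT INTO BATCH_JOB_SEQ DEFAULT VALUES;

-- 2. read back the value that was just generated
SELECT @@IDENTITY AS NEXT_VALUE;

-- 3. delete the older rows to keep the table small; this is the statement
--    that escalated to a table lock and deadlocked across JVMs
DELETE FROM BATCH_JOB_SEQ WHERE ID < @@IDENTITY;

COMMIT;
```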
Josh Chappelle commented The only way I was able to finally put this to bed was by wrapping my JobRepository with retry functionality.
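A retry wrapper of that kind can be sketched in plain Java. This DeadlockRetry helper is illustrative only, not part of Spring Batch; a Spring project would more likely put spring-retry's RetryTemplate around the JobRepository. SQL Server reports deadlock victims as error 1205, which is what a real implementation should check for before retrying.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retry an action a bounded number of times with a
// simple linear backoff. Real code should only retry when the cause is a
// SQLException whose error code is 1205 (SQL Server deadlock victim),
// and rethrow everything else immediately.
public class DeadlockRetry {

    public static <T> T withRetry(Callable<T> action, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = (e instanceof RuntimeException)
                        ? (RuntimeException) e
                        : new RuntimeException(e);
                try {
                    Thread.sleep(10L * attempt); // back off before retrying
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw last;
                }
            }
        }
        throw last; // attempts exhausted; surface the last failure
    }
}
```

Keeping the retry outside the repository (rather than inside the transaction) matters: the failed transaction has already been rolled back, so the retry starts a fresh one.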
Leonardo Duarte commented So far I've not been able to recreate any issues (deadlocks or duplicate keys) using SQL Server IDENTITY columns instead of pseudo-sequences. No need to use retry, because deadlocks no longer occur. No other solution works in my scenario; I even tried using SQL Server 2012 sequences. Tested with multiple batch server instances running against a single database. The job used for testing uses partitioning (with a ThreadPoolTaskExecutor).

Make the following fields of type IDENTITY:

```
BATCH_JOB_INSTANCE.JOB_INSTANCE_ID
BATCH_JOB_EXECUTION.JOB_EXECUTION_ID
BATCH_STEP_EXECUTION.STEP_EXECUTION_ID
```

Change or copy JdbcJobInstanceDao (SqlServerJdbcJobInstanceDao) and change the CREATE_JOB_INSTANCE SQL constant and the createJobInstance method to:

```java
private static final String CREATE_JOB_INSTANCE = "INSERT into %PREFIX%JOB_INSTANCE(JOB_NAME, JOB_KEY, VERSION)"
        + " values (?, ?, ?)";

@Override
public JobInstance createJobInstance(String jobName, JobParameters jobParameters) {
    Assert.notNull(jobName, "Job name must not be null.");
    Assert.notNull(jobParameters, "JobParameters must not be null.");
    Assert.state(getJobInstance(jobName, jobParameters) == null, "JobInstance must not already exist");

    JobInstance jobInstance = new JobInstance(null, jobName);
    jobInstance.incrementVersion();

    KeyHolder generatedKeyHolder = new GeneratedKeyHolder();
    getJdbcTemplate().update(connection -> {
        PreparedStatement ps = connection.prepareStatement(getQuery(CREATE_JOB_INSTANCE), Statement.RETURN_GENERATED_KEYS);
        ps.setString(1, jobName);
        ps.setString(2, jobKeyGenerator.generateKey(jobParameters));
        ps.setInt(3, jobInstance.getVersion());
        return ps;
    }, generatedKeyHolder);

    jobInstance.setId(generatedKeyHolder.getKey().longValue());
    return jobInstance;
}
```

Change or copy JdbcJobExecutionDao (SqlServerJdbcJobExecutionDao) and change the SAVE_JOB_EXECUTION SQL constant and the saveJobExecution method:

```java
private static final String SAVE_JOB_EXECUTION = "INSERT into %PREFIX%JOB_EXECUTION(JOB_INSTANCE_ID, START_TIME, "
        + "END_TIME, STATUS, EXIT_CODE, EXIT_MESSAGE, VERSION, CREATE_TIME, LAST_UPDATED, JOB_CONFIGURATION_LOCATION) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)";

@Override
public void saveJobExecution(JobExecution jobExecution) {
    validateJobExecution(jobExecution);
    jobExecution.incrementVersion();

    KeyHolder generatedKeyHolder = new GeneratedKeyHolder();
    getJdbcTemplate().update(connection -> {
        PreparedStatement ps = connection.prepareStatement(getQuery(SAVE_JOB_EXECUTION), Statement.RETURN_GENERATED_KEYS);
        ps.setLong(1, jobExecution.getJobId());
        // setTimestamp (not setDate) so the time-of-day portion is preserved
        ps.setTimestamp(2, jobExecution.getStartTime() != null ? new Timestamp(jobExecution.getStartTime().getTime()) : null);
        ps.setTimestamp(3, jobExecution.getEndTime() != null ? new Timestamp(jobExecution.getEndTime().getTime()) : null);
        ps.setString(4, jobExecution.getStatus().toString());
        ps.setString(5, jobExecution.getExitStatus().getExitCode());
        ps.setString(6, jobExecution.getExitStatus().getExitDescription());
        ps.setInt(7, jobExecution.getVersion());
        ps.setTimestamp(8, jobExecution.getCreateTime() != null ? new Timestamp(jobExecution.getCreateTime().getTime()) : null);
        ps.setTimestamp(9, jobExecution.getLastUpdated() != null ? new Timestamp(jobExecution.getLastUpdated().getTime()) : null);
        ps.setString(10, jobExecution.getJobConfigurationName());
        return ps;
    }, generatedKeyHolder);

    jobExecution.setId(generatedKeyHolder.getKey().longValue());
    insertJobParameters(jobExecution.getId(), jobExecution.getJobParameters());
}
```

Change or copy JdbcStepExecutionDao (SqlServerJdbcStepExecutionDao) and change the SAVE_STEP_EXECUTION SQL constant and the saveStepExecution/saveStepExecutions methods:

```java
private static final String SAVE_STEP_EXECUTION = "INSERT into %PREFIX%STEP_EXECUTION(VERSION, STEP_NAME, JOB_EXECUTION_ID, START_TIME, "
        + "END_TIME, STATUS, COMMIT_COUNT, READ_COUNT, FILTER_COUNT, WRITE_COUNT, EXIT_CODE, EXIT_MESSAGE, READ_SKIP_COUNT, WRITE_SKIP_COUNT, PROCESS_SKIP_COUNT, ROLLBACK_COUNT, LAST_UPDATED) "
        + "values(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)";

@Override
public void saveStepExecution(StepExecution stepExecution) {
    stepExecution.incrementVersion();

    final KeyHolder generatedKeyHolder = new GeneratedKeyHolder();
    getJdbcTemplate().update(connection -> {
        PreparedStatement ps = connection.prepareStatement(getQuery(SAVE_STEP_EXECUTION), Statement.RETURN_GENERATED_KEYS);
        ps.setInt(1, stepExecution.getVersion());
        ps.setString(2, stepExecution.getStepName());
        ps.setLong(3, stepExecution.getJobExecutionId());
        // setTimestamp (not setDate) so the time-of-day portion is preserved
        ps.setTimestamp(4, stepExecution.getStartTime() != null ? new Timestamp(stepExecution.getStartTime().getTime()) : null);
        ps.setTimestamp(5, stepExecution.getEndTime() != null ? new Timestamp(stepExecution.getEndTime().getTime()) : null);
        ps.setString(6, stepExecution.getStatus().toString());
        ps.setInt(7, stepExecution.getCommitCount());
        ps.setInt(8, stepExecution.getReadCount());
        ps.setInt(9, stepExecution.getFilterCount());
        ps.setInt(10, stepExecution.getWriteCount());
        ps.setString(11, stepExecution.getExitStatus().getExitCode());
        ps.setString(12, truncateExitDescription(stepExecution.getExitStatus().getExitDescription()));
        ps.setInt(13, stepExecution.getReadSkipCount());
        ps.setInt(14, stepExecution.getWriteSkipCount());
        ps.setInt(15, stepExecution.getProcessSkipCount());
        ps.setInt(16, stepExecution.getRollbackCount());
        ps.setTimestamp(17, stepExecution.getLastUpdated() != null ? new Timestamp(stepExecution.getLastUpdated().getTime()) : null);
        return ps;
    }, generatedKeyHolder);

    stepExecution.setId(generatedKeyHolder.getKey().longValue());
}

@Override
public void saveStepExecutions(final Collection<StepExecution> stepExecutions) {
    Assert.notNull(stepExecutions, "Attempt to save a null collection of step executions");
    for (StepExecution stepExecution : stepExecutions) {
        saveStepExecution(stepExecution);
    }
}
```

Change or create a copy of JobRepositoryFactoryBean (SqlServerJobRepositoryFactoryBean) with the following changes:

```java
@Override
protected JobInstanceDao createJobInstanceDao() throws Exception {
    SqlServerJdbcJobInstanceDao dao = new SqlServerJdbcJobInstanceDao();
    dao.setJdbcTemplate(jdbcOperations);
    dao.setJobIncrementer(incrementerFactory.getIncrementer(databaseType, tablePrefix + "JOB_SEQ"));
    dao.setTablePrefix(tablePrefix);
    dao.afterPropertiesSet();
    return dao;
}

@Override
protected JobExecutionDao createJobExecutionDao() throws Exception {
    SqlServerJdbcJobExecutionDao dao = new SqlServerJdbcJobExecutionDao();
    dao.setJdbcTemplate(jdbcOperations);
    dao.setJobExecutionIncrementer(incrementerFactory.getIncrementer(databaseType, tablePrefix + "JOB_EXECUTION_SEQ"));
    dao.setTablePrefix(tablePrefix);
    dao.setClobTypeToUse(determineClobTypeToUse(this.databaseType));
    dao.setExitMessageLength(maxVarCharLength);
    dao.afterPropertiesSet();
    return dao;
}

@Override
protected StepExecutionDao createStepExecutionDao() throws Exception {
    SqlServerJdbcStepExecutionDao dao = new SqlServerJdbcStepExecutionDao();
    dao.setJdbcTemplate(jdbcOperations);
    dao.setStepExecutionIncrementer(incrementerFactory.getIncrementer(databaseType, tablePrefix + "STEP_EXECUTION_SEQ"));
    dao.setTablePrefix(tablePrefix);
    dao.setClobTypeToUse(determineClobTypeToUse(this.databaseType));
    dao.setExitMessageLength(maxVarCharLength);
    dao.afterPropertiesSet();
    return dao;
}
```

Create a batch configuration to use a new SqlServerBatchConfigurer that uses the new SqlServerJobRepositoryFactoryBean:

```java
@Configuration
public class BatchConfiguration {

    @Bean
    public SqlServerBatchConfigurer basicBatchConfigurer(BatchProperties properties, DataSource dataSource) {
        return new SqlServerBatchConfigurer(properties, dataSource);
    }

    class SqlServerBatchConfigurer extends BasicBatchConfigurer {

        private final DataSource dataSource;
        private final BatchProperties properties;

        SqlServerBatchConfigurer(final BatchProperties properties, final DataSource dataSource) {
            super(properties, dataSource);
            this.properties = properties;
            this.dataSource = dataSource;
        }

        @Override
        protected JobRepository createJobRepository() throws Exception {
            SqlServerJobRepositoryFactoryBean factory = new SqlServerJobRepositoryFactoryBean();
            // this is required to avoid deadlocks
            factory.setIsolationLevelForCreate("ISOLATION_REPEATABLE_READ");
            factory.setDataSource(this.dataSource);
            String tablePrefix = this.properties.getTablePrefix();
            if (StringUtils.hasText(tablePrefix)) {
                factory.setTablePrefix(tablePrefix);
            }
            factory.setTransactionManager(getTransactionManager());
            factory.afterPropertiesSet();
            return factory.getObject();
        }
    }
}
```
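The IDENTITY change described above cannot be applied with a simple ALTER on SQL Server, since an existing column cannot be made IDENTITY in place; existing tables would have to be rebuilt, or a fresh schema created. A sketch for one of the three tables, based on the stock Spring Batch SQL Server schema (the other two tables are changed the same way):

```sql
-- BATCH_JOB_INSTANCE with an IDENTITY primary key so the database
-- generates JOB_INSTANCE_ID, making the pseudo-sequence table unnecessary.
CREATE TABLE BATCH_JOB_INSTANCE (
    JOB_INSTANCE_ID BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    VERSION BIGINT,
    JOB_NAME VARCHAR(100) NOT NULL,
    JOB_KEY VARCHAR(32) NOT NULL,
    CONSTRAINT JOB_INST_UN UNIQUE (JOB_NAME, JOB_KEY)
);
```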
Tyler K Van Gorder commented This issue is related to https://jira.spring.io/browse/SPR-16886 and the increment strategy that is used for MS SQL Server.
This is key, and the issue on the SF side is still open at the time of writing. I'm adding the following SO question here for reference: Deadlock while running multiple instances of a spring batch job. The solution could be to update the DDL script for SQL Server to use real sequences (which have been supported since SQL Server 2012) instead of emulating sequences with tables. This should be validated with a Docker-based end-to-end test (similar to this one).
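Sketched as DDL, and assuming the default table prefix, the proposed change looks like this (NEXT VALUE FOR is the SQL Server 2012+ syntax for reading a sequence):

```sql
-- Replace the emulated sequence tables with real sequences:
DROP TABLE BATCH_JOB_SEQ;
DROP TABLE BATCH_JOB_EXECUTION_SEQ;
DROP TABLE BATCH_STEP_EXECUTION_SEQ;

CREATE SEQUENCE BATCH_JOB_SEQ START WITH 1 INCREMENT BY 1;
CREATE SEQUENCE BATCH_JOB_EXECUTION_SEQ START WITH 1 INCREMENT BY 1;
CREATE SEQUENCE BATCH_STEP_EXECUTION_SEQ START WITH 1 INCREMENT BY 1;

-- An incrementer then obtains the next value without any table locking:
SELECT NEXT VALUE FOR BATCH_JOB_SEQ;
```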
Emulating sequences with tables causes deadlocks when running multiple jobs at the same time. Sequences have been supported in SQL Server since version 2012. This commit replaces the usage of tables to emulate sequences with real sequences. It also adds a new incrementer that is based on sequences instead of tables. Resolves spring-projects#1448
Angus Mezick opened BATCH-2147 and commented
I have a group of Spring Integration chains running in a single context. Each chain waits for a file to appear in a directory and passes that file off to Spring Batch for processing. I just got a series of exceptions from my jobs when they tried to start. I had 5 chains going at once.
Possible fixes:
Failure 1:
Failure 2:
Failure 3:
Affects: 2.2.2
8 votes, 11 watchers