
Deadlock creating a job on SQL Server when multiple jobs start at once. [BATCH-2147] #1448

Closed
spring-projects-issues opened this issue Nov 22, 2013 · 12 comments


Angus Mezick opened BATCH-2147 and commented

I have a group of Spring Integration chains running in a single context. Each chain waits for a file to appear in a directory and passes that file off to Spring Batch for processing. I just got a series of exceptions from my jobs when they tried to start. I had 5 chains going at once.

Possible fixes:

  1. Use an identity column on databases that support them.
  2. Catch the deadlock and retry instead of failing spectacularly.

Failure 1:

Reason: Could not increment identity; nested exception is java.sql.SQLException: Transaction (Process ID 234) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
org.springframework.dao.DataAccessResourceFailureException: Could not increment identity; nested exception is java.sql.SQLException: Transaction (Process ID 234) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
	at org.springframework.jdbc.support.incrementer.SqlServerMaxValueIncrementer.getNextKey(SqlServerMaxValueIncrementer.java:108)
	at org.springframework.jdbc.support.incrementer.AbstractDataFieldMaxValueIncrementer.nextLongValue(AbstractDataFieldMaxValueIncrementer.java:125)
	at org.springframework.batch.core.repository.dao.JdbcJobInstanceDao.createJobInstance(JdbcJobInstanceDao.java:98)
	at org.springframework.batch.core.repository.support.SimpleJobRepository.createJobExecution(SimpleJobRepository.java:135)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
	at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:96)
	at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:260)

Failure 2:

Reason: Could not increment identity; nested exception is java.sql.SQLException: Transaction (Process ID 235) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
org.springframework.dao.DataAccessResourceFailureException: Could not increment identity; nested exception is java.sql.SQLException: Transaction (Process ID 235) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
	at org.springframework.jdbc.support.incrementer.SqlServerMaxValueIncrementer.getNextKey(SqlServerMaxValueIncrementer.java:108)
	at org.springframework.jdbc.support.incrementer.AbstractDataFieldMaxValueIncrementer.nextLongValue(AbstractDataFieldMaxValueIncrementer.java:125)
	at org.springframework.batch.core.repository.dao.JdbcJobInstanceDao.createJobInstance(JdbcJobInstanceDao.java:98)
	at org.springframework.batch.core.repository.support.SimpleJobRepository.createJobExecution(SimpleJobRepository.java:135)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)

Failure 3:

Identical to Failure 2 (same exception and stack trace, Process ID 235).

Affects: 2.2.2

8 votes, 11 watchers


Steve Ash commented

I also hit this, and I have the T1222 trace info showing exactly what is deadlocking (although I think it's already obvious from the above).

I do not think that retrying should be the solution. If the problem is just getting sequence numbers, then a better solution is acquiring those sequence numbers in a separate transaction. Having "missing" sequence numbers (due to failure) is not a problem; they are synthetic numbers.

2015-02-17 00:08:49.83 spid17s     deadlock-list
2015-02-17 00:08:49.83 spid17s      deadlock victim=process14fdc8
2015-02-17 00:08:49.83 spid17s       process-list
2015-02-17 00:08:49.83 spid17s        process id=process14fdc8 taskpriority=0 logused=0 waitresource=OBJECT: 36:1741334196:0  waittime=4426 ownerId=610332429 transactionname=implicit_transaction lasttranstarted=2015-02-17T00:08:45.407 XDES=0x1c6ef2e90 lockMode=IX schedulerid=1 kpid=4952 status=suspended spid=81 sbid=0 ecid=0 priority=0 trancount=2 lastbatchstarted=2015-02-17T00:08:45.410 lastbatchcompleted=2015-02-17T00:08:45.407 clientapp=Microsoft JDBC Driver for SQL Server hostname=cr6a-memphis hostpid=0 loginname=sa isolationlevel=serializable (4) xactid=610332429 currentdb=36 lockTimeout=4294967295 clientoption1=671088672 clientoption2=128058
2015-02-17 00:08:49.83 spid17s         executionStack
2015-02-17 00:08:49.83 spid17s          frame procname=adhoc line=1 sqlhandle=0x0200000046a07915e38ccff521f6d02cdbdedd5f2d87e56c
2015-02-17 00:08:49.83 spid17s     insert into FRD_BATCH_JOB_SEQ default values     
2015-02-17 00:08:49.83 spid17s         inputbuf
2015-02-17 00:08:49.83 spid17s     insert into FRD_BATCH_JOB_SEQ default values    
2015-02-17 00:08:49.83 spid17s        process id=process9018e088 taskpriority=0 logused=420 waitresource=KEY: 36:72057614842265600 (ffffffffffff) waittime=4426 ownerId=610332407 transactionname=implicit_transaction lasttranstarted=2015-02-17T00:08:45.380 XDES=0x18a79ce90 lockMode=RangeI-N schedulerid=4 kpid=6732 status=suspended spid=76 sbid=0 ecid=0 priority=0 trancount=2 lastbatchstarted=2015-02-17T00:08:45.407 lastbatchcompleted=2015-02-17T00:08:45.400 clientapp=Microsoft JDBC Driver for SQL Server hostname=cr6a-memphis hostpid=0 loginname=sa isolationlevel=serializable (4) xactid=610332407 currentdb=36 lockTimeout=4294967295 clientoption1=671088672 clientoption2=128058
2015-02-17 00:08:49.83 spid17s         executionStack
2015-02-17 00:08:49.83 spid17s          frame procname=adhoc line=1 stmtstart=112 sqlhandle=0x020000009322df04d94cd48a4870ffbcdf6c954fb9973e1c
2015-02-17 00:08:49.83 spid17s     INSERT into FRD_BATCH_JOB_INSTANCE(JOB_INSTANCE_ID, JOB_NAME, JOB_KEY, VERSION) values (@P0, @P1, @P2, @P3)     
2015-02-17 00:08:49.83 spid17s          frame procname=unknown line=1 sqlhandle=0x000000000000000000000000000000000000000000000000
2015-02-17 00:08:49.83 spid17s     unknown     
2015-02-17 00:08:49.83 spid17s         inputbuf
2015-02-17 00:08:49.83 spid17s     (@P0 bigint,@P1 varchar(8000),@P2 varchar(8000),@P3 int)INSERT into FRD_BATCH_JOB_INSTANCE(JOB_INSTANCE_ID, JOB_NAME, JOB_KEY, VERSION) values (@P0, @P1, @P2, @P3)                                    
2015-02-17 00:08:49.83 spid17s       resource-list
2015-02-17 00:08:49.83 spid17s        objectlock lockPartition=0 objid=1741334196 subresource=FULL dbid=36 objectname=UNIT_Steve.dbo.FRD_BATCH_JOB_SEQ id=lockeacad580 mode=X associatedObjectId=1741334196
2015-02-17 00:08:49.83 spid17s         owner-list
2015-02-17 00:08:49.83 spid17s          owner id=process9018e088 mode=X
2015-02-17 00:08:49.83 spid17s         waiter-list
2015-02-17 00:08:49.83 spid17s          waiter id=process14fdc8 mode=IX requestType=wait
2015-02-17 00:08:49.83 spid17s        keylock hobtid=72057614842265600 dbid=36 objectname=UNIT_Steve.dbo.FRD_BATCH_JOB_INSTANCE indexname=JOB_INST_UN id=lock216b92300 mode=RangeS-S associatedObjectId=72057614842265600
2015-02-17 00:08:49.83 spid17s         owner-list
2015-02-17 00:08:49.83 spid17s          owner id=process14fdc8 mode=RangeS-S
2015-02-17 00:08:49.83 spid17s         waiter-list
2015-02-17 00:08:49.83 spid17s          waiter id=process9018e088 mode=RangeI-N requestType=convert


Steve Ash commented

Ok, I created a max-value incrementer factory that decorates the default one, using a TransactionTemplate to give each call its own transaction. This got past my original deadlock and went straight into another. In SimpleJobRepository, creating a job first SELECTs for an existing job instance, and if it doesn't find one, INSERTs a new one. Unfortunately, that first select (even at serializable) only takes out shared locks. So two concurrent processes both get shared (reader) locks, then both try to acquire exclusive (writer) locks -- leading to a deadlock. Since you know you're probably going to end up inserting after the select, it's best to issue that select with the FOR UPDATE clause, which will cause the database to acquire update locks; those don't allow multiple readers, and thus don't suffer from this lock-conversion deadlock scenario.
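A decorator along those lines might look roughly like this (a sketch with hypothetical names, not Steve's actual code; it wraps the incrementer so each key is fetched and committed in its own REQUIRES_NEW transaction, releasing the SEQ-table locks immediately instead of holding them until the surrounding job transaction commits):

import org.springframework.jdbc.support.incrementer.DataFieldMaxValueIncrementer;
import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.TransactionDefinition;
import org.springframework.transaction.support.TransactionTemplate;

// Each call is suspended out of the surrounding transaction and committed on
// its own, so a failed job transaction only "wastes" a synthetic number.
public class SeparateTransactionIncrementer implements DataFieldMaxValueIncrementer {

	private final DataFieldMaxValueIncrementer delegate;
	private final TransactionTemplate txTemplate;

	public SeparateTransactionIncrementer(DataFieldMaxValueIncrementer delegate,
			PlatformTransactionManager txManager) {
		this.delegate = delegate;
		this.txTemplate = new TransactionTemplate(txManager);
		this.txTemplate.setPropagationBehavior(TransactionDefinition.PROPAGATION_REQUIRES_NEW);
	}

	@Override
	public int nextIntValue() {
		return txTemplate.execute(status -> delegate.nextIntValue());
	}

	@Override
	public long nextLongValue() {
		return txTemplate.execute(status -> delegate.nextLongValue());
	}

	@Override
	public String nextStringValue() {
		return txTemplate.execute(status -> delegate.nextStringValue());
	}
}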


Ayushya Devmurari commented

Hey there, I am also facing issues while running (starting) multiple jobs at the same time. The stack trace from my application is at this link:

http://stackoverflow.com/questions/29493484/primary-key-violation-in-spring-batchs-batch-job-instance-table-while-running-j?noredirect=1#comment47186282_29493484

Is this the same issue under the Hibernate layer? If you need any more information, I am willing to provide it.
Thanks & Regards
Ayushya


Andreas Selenwall commented

Spring Team! Do you have an update on this issue? This EXACT issue occurs for us as well. It doesn't matter which JobParameters we use; we still get a deadlock in SqlServerMaxValueIncrementer, which keeps failing our jobs from time to time. It is not about unique jobs; it happens because we access the same table.

It doesn't happen often, but it is still annoying to deal with, since at the moment we have to rerun jobs manually when it occurs.

Could you please investigate and give us a possible workaround?


Josh Chappelle commented

I believe we are also experiencing this problem. There is a comment in SqlServerMaxValueIncrementer that says it is recommended to use IDENTITY columns on SQL Server. Is it possible to make Spring Batch work that way? It would be easy enough to change the schema if only there were a way to configure Spring Batch to retrieve the generated keys.

Here's the stack trace in case it helps:

Caused by: org.springframework.dao.DuplicateKeyException: PreparedStatementCallback; SQL [INSERT into BATCH_JOB_EXECUTION(JOB_EXECUTION_ID, JOB_INSTANCE_ID, START_TIME, END_TIME, STATUS, EXIT_CODE, EXIT_MESSAGE, VERSION, CREATE_TIME, LAST_UPDATED, JOB_CONFIGURATION_LOCATION) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)]; Violation of PRIMARY KEY constraint 'PK__BATCH_JO__56435A7787F86FAE'. Cannot insert duplicate key in object 'dbo.BATCH_JOB_EXECUTION'. The duplicate key value is (0).; nested exception is java.sql.SQLException: Violation of PRIMARY KEY constraint 'PK__BATCH_JO__56435A7787F86FAE'. Cannot insert duplicate key in object 'dbo.BATCH_JOB_EXECUTION'. The duplicate key value is (0).
	at org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:239) ~[spring-jdbc-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73) ~[spring-jdbc-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:660) ~[spring-jdbc-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:909) ~[spring-jdbc-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:970) ~[spring-jdbc-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at org.springframework.jdbc.core.JdbcTemplate.update(JdbcTemplate.java:975) ~[spring-jdbc-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at org.springframework.batch.core.repository.dao.JdbcJobExecutionDao.saveJobExecution(JdbcJobExecutionDao.java:156) ~[spring-batch-core-3.0.3.RELEASE.jar:3.0.3.RELEASE]
	at org.springframework.batch.core.repository.support.SimpleJobRepository.createJobExecution(SimpleJobRepository.java:145) ~[spring-batch-core-3.0.3.RELEASE.jar:3.0.3.RELEASE]
	at sun.reflect.GeneratedMethodAccessor1359.invoke(Unknown Source) ~[na:na]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_40]
	at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_40]
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317) ~[spring-aop-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190) ~[spring-aop-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157) ~[spring-aop-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:99) ~[spring-tx-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:281) ~[spring-tx-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:96) ~[spring-tx-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179) ~[spring-aop-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at org.springframework.batch.core.repository.support.AbstractJobRepositoryFactoryBean$1.invoke(AbstractJobRepositoryFactoryBean.java:172) ~[spring-batch-core-3.0.3.RELEASE.jar:3.0.3.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179) ~[spring-aop-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:207) ~[spring-aop-4.1.4.RELEASE.jar:4.1.4.RELEASE]
	at com.sun.proxy.$Proxy64.createJobExecution(Unknown Source) ~[na:na]
	at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(SimpleJobLauncher.java:125) ~[spring-batch-core-3.0.3.RELEASE.jar:3.0.3.RELEASE]


kaushik mondal commented

Spring Team! Do you have an update on this issue?


Andreas Fe commented

Well, we faced the same issue and seem to have been able to solve it through configuration.
I still have to investigate whether setting "ISOLATION_REPEATABLE_READ" alone would have been enough, but while I was at it, I also enabled the DataFieldMaxValueIncrementer cache for SQL Server. Now 20 IDs are fetched at once and held in a cache, greatly reducing the probability of a deadlock.
If the server stops, up to 20 IDs might be "lost", and across the cluster servers the IDs will not be strictly increasing - but that's OK for us.

@Configuration
@EnableBatchProcessing
public class BatchConfiguration {

  private static final String ISOLATION_REPEATABLE_READ = "ISOLATION_REPEATABLE_READ";

  @Autowired
  private DataSource dataSource;
  @Autowired
  private PlatformTransactionManager platformTransactionManager;

  @Bean
  public JobRepository jobRepository() throws Exception {
    JobRepositoryFactoryBean factory = new JobRepositoryFactoryBean();
    factory.setDataSource(dataSource);
    factory.setTransactionManager(platformTransactionManager);
    factory.setValidateTransactionState(true);
    factory.setIsolationLevelForCreate(ISOLATION_REPEATABLE_READ);
    factory.setIncrementerFactory(customIncrementerFactory());
    factory.afterPropertiesSet();
    return factory.getObject();
  }

  @Bean
  public SimpleJobLauncher jobLauncher(JobRepository jobRepository) {
    SimpleJobLauncher simpleJobLauncher = new SimpleJobLauncher();
    simpleJobLauncher.setJobRepository(jobRepository);
    return simpleJobLauncher;
  }

  private DataFieldMaxValueIncrementerFactory customIncrementerFactory() {
    return new CustomDataFieldMaxValueIncrementerFactory(dataSource);
  }

  private class CustomDataFieldMaxValueIncrementerFactory extends DefaultDataFieldMaxValueIncrementerFactory {

    CustomDataFieldMaxValueIncrementerFactory(DataSource dataSource) {
      super(dataSource);
    }

    @Override
    public DataFieldMaxValueIncrementer getIncrementer(String incrementerType, String incrementerName) {
      DataFieldMaxValueIncrementer incrementer = super.getIncrementer(incrementerType, incrementerName);
      if (incrementer instanceof SqlServerMaxValueIncrementer) {
        ((SqlServerMaxValueIncrementer) incrementer).setCacheSize(20);
      }
      return incrementer;
    }
  }
}

If there's an easier way to achieve this, feel free to point me in the right direction :)

Update:
While the above resolved the "could not increment identity" errors, we then ran into other problems:

Encountered fatal error executing job org.springframework.dao.DeadlockLoserDataAccessException: PreparedStatementCallback; SQL [SELECT STEP_EXECUTION_ID, STEP_NAME, START_TIME, END_TIME, STATUS, COMMIT_COUNT, READ_COUNT, FILTER_COUNT, WRITE_COUNT, EXIT_CODE, EXIT_MESSAGE, READ_SKIP_COUNT, WRITE_SKIP_COUNT, PROCESS_SKIP_COUNT, ROLLBACK_COUNT, LAST_UPDATED, VERSION from BATCH_STEP_EXECUTION where JOB_EXECUTION_ID = ? order by STEP_EXECUTION_ID]; Transaction (Process ID 477) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction.;
nested exception is com.microsoft.sqlserver.jdbc.SQLServerException: Transaction (Process ID 477) was deadlocked on lock | communication buffer resources with another process and has been chosen as the deadlock victim. Rerun the transaction. at
org.springframework.jdbc.support.SQLErrorCodeSQLExceptionTranslator.doTranslate(SQLErrorCodeSQLExceptionTranslator.java:263) at 
org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:73) at 
org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:649) at 
org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:684) at 
org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:716) at 
org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:726) at 
org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:781) at 
org.springframework.batch.core.repository.dao.JdbcStepExecutionDao.addStepExecutions(JdbcStepExecutionDao.java:299) at 
org.springframework.batch.core.repository.support.SimpleJobRepository.getLastStepExecution(SimpleJobRepository.java:219) at 
sun.reflect.GeneratedMethodAccessor164.invoke(Unknown Source) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
java.lang.reflect.Method.invoke(Method.java:498)
...

Different jobs were now trying to insert into the DB at the same time, causing a deadlock. We now set the isolation level to "ISOLATION_READ_COMMITTED" (i.e. setIsolationLevelForCreate("ISOLATION_READ_COMMITTED") in the configuration above). To prevent executing the same job in parallel on a cluster, we have been using Hazelcast locks all along.


Jeff Smith commented

I ran into this issue and fought it all last week, but found a workaround last Friday, and since implementing it in production I haven't seen a single deadlock. The fix that worked for me consists of two things:

  1. Set the isolationLevelForCreate like this (code snippet - not complete class):

     JobRepositoryFactoryBean factory = new JobRepositoryFactoryBean();
     factory.setIsolationLevelForCreate("ISOLATION_REPEATABLE_READ");

  2. Have the DBA add indexes to each of the SEQ tables like this (JET is the schema I put the repository tables in):

     ALTER TABLE [JET].[BATCH_JOB_EXECUTION_SEQ]
     ADD CONSTRAINT [BATCH_JOB_EXECUTION_SEQ_PK] PRIMARY KEY CLUSTERED ([ID] ASC)
     go

     ALTER TABLE [JET].[BATCH_JOB_SEQ]
     ADD CONSTRAINT [BATCH_JOB_SEQ_PK] PRIMARY KEY CLUSTERED ([ID] ASC)
     go

     ALTER TABLE [JET].[BATCH_STEP_EXECUTION_SEQ]
     ADD CONSTRAINT [BATCH_STEP_EXECUTION_SEQ_PK] PRIMARY KEY CLUSTERED ([ID] ASC)
     go

I spent a lot of time looking at this code last week, and you should know that to get the next sequence number it basically does these three things within a transaction (sketched in code below):

  1. Inserts a new row into the SEQ table, whose IDENTITY column auto-increments to the next numeric value.
  2. Performs a select, retrieving that value into a Java long variable.
  3. Performs a delete of all rows less than the value just retrieved.

All three of those steps are wrapped in a transaction that is added to the code via an aspect. That is why you don't see the transaction code directly in that class; it's injected from another class.
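For illustration, the three statements described above look roughly like this (a sketch against the default BATCH_JOB_SEQ table with an IDENTITY column named ID; this is illustrative, not the actual Spring implementation):

import org.springframework.jdbc.core.JdbcTemplate;

class SeqTableSketch {

	private final JdbcTemplate jdbc;

	SeqTableSketch(JdbcTemplate jdbc) {
		this.jdbc = jdbc;
	}

	long nextSequenceValue() {
		// 1. insert a row; the IDENTITY column auto-increments
		jdbc.update("insert into BATCH_JOB_SEQ default values");
		// 2. read back the value just generated on this connection
		long id = jdbc.queryForObject("select @@identity", Long.class);
		// 3. delete older rows; without an index, this delete can take a
		//    table lock -- which is what was deadlocking across JVMs
		jdbc.update("delete from BATCH_JOB_SEQ where ID < ?", id);
		return id;
	}
}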

During testing I was able to consistently reproduce this by rapidly kicking off many repeated job instances from two different Spring Batch JVMs that shared the same repository tables. I had the DBA use monitoring tools, and she was able to determine that the deadlock was due to the delete operation: it was taking a table lock, which deadlocked against the other JVMs accessing the same SEQ table.

For some reason, adding the index meant the delete operation was able to use the index to do its work instead of locking the whole table.

I did test both with and without setting the isolation level as specified above; I had to set it as well, otherwise the deadlock still happened. I'm not an expert in isolation levels, so I'm not totally sure why that works, but it does.

I would recommend the Spring Batch developers add those index-creation steps to the SQL used to create the repository tables for SQL Server.

I hope this helps.


Josh Chappelle commented

The only way I was able to finally put this to bed was by wrapping my JobRepository with retry functionality.
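A minimal sketch of such a wrapper, assuming Spring Retry is on the classpath (the class name and policy choices here are hypothetical, not necessarily what Josh used):

import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.dao.DataAccessResourceFailureException;
import org.springframework.dao.DeadlockLoserDataAccessException;
import org.springframework.retry.backoff.ExponentialBackOffPolicy;
import org.springframework.retry.policy.SimpleRetryPolicy;
import org.springframework.retry.support.RetryTemplate;

// Retries only the deadlock-prone create path; the other JobRepository
// methods would simply delegate to the wrapped repository.
public class RetryingJobRepositoryWrapper {

	private final JobRepository delegate;
	private final RetryTemplate retryTemplate = new RetryTemplate();

	public RetryingJobRepositoryWrapper(JobRepository delegate) {
		this.delegate = delegate;
		Map<Class<? extends Throwable>, Boolean> retryable = new HashMap<>();
		retryable.put(DeadlockLoserDataAccessException.class, true);
		// "Could not increment identity" surfaced as this type earlier in the thread
		retryable.put(DataAccessResourceFailureException.class, true);
		retryTemplate.setRetryPolicy(new SimpleRetryPolicy(3, retryable));
		retryTemplate.setBackOffPolicy(new ExponentialBackOffPolicy());
	}

	// throws Exception for brevity in this sketch
	public JobExecution createJobExecution(String jobName, JobParameters params) throws Exception {
		return retryTemplate.execute(context -> delegate.createJobExecution(jobName, params));
	}
}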


Leonardo Duarte commented

So far I've not been able to recreate any issues (deadlocks or duplicate keys) using SQL Server IDENTITY columns instead of pseudo-sequences. There is no need for retry, because the deadlocks no longer occur. No other solution worked in my scenario; I even tried SQL Server 2012 sequences. This was tested with multiple batch server instances running against a single database. The job used for testing uses partitioning (with a ThreadPoolTaskExecutor).

Make the following fields of type IDENTITY:

BATCH_JOB_INSTANCE.JOB_INSTANCE_ID
BATCH_JOB_EXECUTION.JOB_EXECUTION_ID
BATCH_STEP_EXECUTION.STEP_EXECUTION_ID

Change or copy JdbcJobInstanceDao (SqlServerJdbcJobInstanceDao) and change the CREATE_JOB_INSTANCE SQL constant and createJobInstance method to:

private static final String CREATE_JOB_INSTANCE = "INSERT into %PREFIX%JOB_INSTANCE(JOB_NAME, JOB_KEY, VERSION)"
		+ " values (?, ?, ?)";
		
@Override
public JobInstance createJobInstance(String jobName, JobParameters jobParameters) {

	Assert.notNull(jobName, "Job name must not be null.");
	Assert.notNull(jobParameters, "JobParameters must not be null.");

	Assert.state(getJobInstance(jobName, jobParameters) == null, "JobInstance must not already exist");

	JobInstance jobInstance = new JobInstance(null, jobName);
	jobInstance.incrementVersion();

	KeyHolder generatedKeyHolder = new GeneratedKeyHolder();

	getJdbcTemplate().update(connection -> {
		final PreparedStatement ps = connection.prepareStatement(getQuery(CREATE_JOB_INSTANCE), Statement.RETURN_GENERATED_KEYS);
		ps.setString(1, jobName);
		ps.setString(2, jobKeyGenerator.generateKey(jobParameters));
		ps.setInt   (3, jobInstance.getVersion());
		return ps;
	}, generatedKeyHolder);

	jobInstance.setId(generatedKeyHolder.getKey().longValue());

	return jobInstance;
}

Change or copy JdbcJobExecutionDao (SqlServerJdbcJobExecutionDao) and change the SAVE_JOB_EXECUTION SQL constant and saveJobExecution method:

private static final String SAVE_JOB_EXECUTION = "INSERT into %PREFIX%JOB_EXECUTION(JOB_INSTANCE_ID, START_TIME, "
		+ "END_TIME, STATUS, EXIT_CODE, EXIT_MESSAGE, VERSION, CREATE_TIME, LAST_UPDATED, JOB_CONFIGURATION_LOCATION) values (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)";
	
@Override
public void saveJobExecution(JobExecution jobExecution) {

	validateJobExecution(jobExecution);

	jobExecution.incrementVersion();

	KeyHolder generatedKeyHolder = new GeneratedKeyHolder();

	getJdbcTemplate().update(connection -> {
		PreparedStatement ps = connection.prepareStatement(getQuery(SAVE_JOB_EXECUTION), Statement.RETURN_GENERATED_KEYS);
		ps.setLong  ( 1, jobExecution.getJobId());
		ps.setDate  ( 2, jobExecution.getStartTime() != null ? new java.sql.Date(jobExecution.getStartTime().getTime()) : null);
		ps.setDate  ( 3, jobExecution.getEndTime() != null ? new java.sql.Date(jobExecution.getEndTime().getTime()) : null);
		ps.setString( 4, jobExecution.getStatus().toString());
		ps.setString( 5, jobExecution.getExitStatus().getExitCode());
		ps.setString( 6, jobExecution.getExitStatus().getExitDescription());
		ps.setInt   ( 7, jobExecution.getVersion());
		ps.setDate  ( 8, jobExecution.getCreateTime() != null ? new java.sql.Date(jobExecution.getCreateTime().getTime()) : null);
		ps.setDate  ( 9, jobExecution.getLastUpdated() != null ? new java.sql.Date(jobExecution.getLastUpdated().getTime()) : null);
		ps.setString(10, jobExecution.getJobConfigurationName());
		return ps;
	}, generatedKeyHolder);

	jobExecution.setId(generatedKeyHolder.getKey().longValue());

	insertJobParameters(jobExecution.getId(), jobExecution.getJobParameters());
}

Change or copy JdbcStepExecutionDao (SqlServerJdbcStepExecutionDao) and change the SAVE_STEP_EXECUTION SQL constant and saveStepExecution/saveStepExecutions methods:

private static final String SAVE_STEP_EXECUTION = "INSERT into %PREFIX%STEP_EXECUTION(VERSION, STEP_NAME, JOB_EXECUTION_ID, START_TIME, "
		+ "END_TIME, STATUS, COMMIT_COUNT, READ_COUNT, FILTER_COUNT, WRITE_COUNT, EXIT_CODE, EXIT_MESSAGE, READ_SKIP_COUNT, WRITE_SKIP_COUNT, PROCESS_SKIP_COUNT, ROLLBACK_COUNT, LAST_UPDATED) "
		+ "values(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)";

@Override
public void saveStepExecution(StepExecution stepExecution) {

	stepExecution.incrementVersion();

	final KeyHolder generatedKeyHolder = new GeneratedKeyHolder();

	getJdbcTemplate().update(connection -> {
		PreparedStatement ps = connection.prepareStatement(getQuery(SAVE_STEP_EXECUTION), Statement.RETURN_GENERATED_KEYS);
		ps.setInt   ( 1, stepExecution.getVersion());
		ps.setString( 2, stepExecution.getStepName());
		ps.setLong  ( 3, stepExecution.getJobExecutionId());
		ps.setDate  ( 4, stepExecution.getStartTime() != null ? new Date(stepExecution.getStartTime().getTime()) : null);
		ps.setDate  ( 5, stepExecution.getEndTime() != null ? new Date(stepExecution.getEndTime().getTime()) : null);
		ps.setString( 6, stepExecution.getStatus().toString());
		ps.setInt   ( 7, stepExecution.getCommitCount());
		ps.setInt   ( 8, stepExecution.getReadCount());
		ps.setInt   ( 9, stepExecution.getFilterCount());
		ps.setInt   (10, stepExecution.getWriteCount());
		ps.setString(11, stepExecution.getExitStatus().getExitCode());
		ps.setString(12, truncateExitDescription(stepExecution.getExitStatus().getExitDescription()));
		ps.setInt   (13, stepExecution.getReadSkipCount());
		ps.setInt   (14, stepExecution.getWriteSkipCount());
		ps.setInt   (15, stepExecution.getProcessSkipCount());
		ps.setInt   (16, stepExecution.getRollbackCount());
		ps.setDate  (17, stepExecution.getLastUpdated() != null ? new Date(stepExecution.getLastUpdated().getTime()) : null);
		return ps;
	}, generatedKeyHolder);

	stepExecution.setId(generatedKeyHolder.getKey().longValue());
}

@Override
public void saveStepExecutions(final Collection<StepExecution> stepExecutions) {
	Assert.notNull(stepExecutions, "Attempt to save a null collection of step executions");
	for (StepExecution stepExecution : stepExecutions) {
		saveStepExecution(stepExecution);
	}
}

Change or create a copy of JobRepositoryFactoryBean (SqlServerJobRepositoryFactoryBean) with the following changes:

@Override
protected JobInstanceDao createJobInstanceDao() throws Exception {
	SqlServerJdbcJobInstanceDao dao = new SqlServerJdbcJobInstanceDao();
	dao.setJdbcTemplate(jdbcOperations);
	dao.setJobIncrementer(incrementerFactory.getIncrementer(databaseType, tablePrefix + "JOB_SEQ"));
	dao.setTablePrefix(tablePrefix);
	dao.afterPropertiesSet();
	return dao;
}

@Override
protected JobExecutionDao createJobExecutionDao() throws Exception {
	SqlServerJdbcJobExecutionDao dao = new SqlServerJdbcJobExecutionDao();
	dao.setJdbcTemplate(jdbcOperations);
	dao.setJobExecutionIncrementer(incrementerFactory.getIncrementer(databaseType, tablePrefix
			+ "JOB_EXECUTION_SEQ"));
	dao.setTablePrefix(tablePrefix);
	dao.setClobTypeToUse(determineClobTypeToUse(this.databaseType));
	dao.setExitMessageLength(maxVarCharLength);
	dao.afterPropertiesSet();
	return dao;
}

@Override
protected StepExecutionDao createStepExecutionDao() throws Exception {
	SqlServerJdbcStepExecutionDao dao = new SqlServerJdbcStepExecutionDao();
	dao.setJdbcTemplate(jdbcOperations);
	dao.setStepExecutionIncrementer(incrementerFactory.getIncrementer(databaseType, tablePrefix
			+ "STEP_EXECUTION_SEQ"));
	dao.setTablePrefix(tablePrefix);
	dao.setClobTypeToUse(determineClobTypeToUse(this.databaseType));
	dao.setExitMessageLength(maxVarCharLength);
	dao.afterPropertiesSet();
	return dao;
}

Create a batch configuration with a new SqlServerBatchConfigurer that uses the new SqlServerJobRepositoryFactoryBean:

@Configuration
public class BatchConfiguration {

    @Bean
    public SqlServerBatchConfigurer basicBatchConfigurer(BatchProperties properties, DataSource dataSource) {
        return new SqlServerBatchConfigurer(properties, dataSource);
    }

    class SqlServerBatchConfigurer extends BasicBatchConfigurer {

        private final DataSource dataSource;
        private final BatchProperties properties;

        SqlServerBatchConfigurer(final BatchProperties properties, final DataSource dataSource) {
            super(properties, dataSource);
            this.properties = properties;
            this.dataSource = dataSource;
        }

        @Override
        protected JobRepository createJobRepository() throws Exception {
            SqlServerJobRepositoryFactoryBean factory = new SqlServerJobRepositoryFactoryBean();

            // this is required to avoid deadlocks
            factory.setIsolationLevelForCreate("ISOLATION_REPEATABLE_READ");

            factory.setDataSource(this.dataSource);
            String tablePrefix = this.properties.getTablePrefix();
            if (StringUtils.hasText(tablePrefix)) {
                factory.setTablePrefix(tablePrefix);
            }
            factory.setTransactionManager(getTransactionManager());
            factory.afterPropertiesSet();
            return factory.getObject();
        }
    }
}


Tyler K Van Gorder commented

This issue is related to https://jira.spring.io/browse/SPR-16886 and the increment strategy that is used in MS SQL Server.


fmbenhassine commented Apr 30, 2021

This issue is related to https://jira.spring.io/browse/SPR-16886 and the increment strategy that is used in MS SQL Server.

This is key, and the issue on the Spring Framework side is still open at the time of writing.

I'm adding the following SO question here for reference: Deadlock while running multiple instances of a spring batch job.

The solution could be to update the DDL script for SQL Server to use real sequences (supported since SQL Server 2012) instead of emulating sequences with tables. This should be validated with a Docker-based end-to-end test (similar to this one).
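For illustration, a sequence-backed incrementer could build on spring-jdbc's AbstractSequenceMaxValueIncrementer (a sketch; the class name is hypothetical, and it assumes a real sequence such as one created with "CREATE SEQUENCE BATCH_JOB_SEQ START WITH 0"):

import javax.sql.DataSource;
import org.springframework.jdbc.support.incrementer.AbstractSequenceMaxValueIncrementer;

// Fetches values from a real SQL Server sequence, avoiding the
// insert/select/delete churn on a sequence-emulation table entirely.
public class SqlServerSequenceIncrementer extends AbstractSequenceMaxValueIncrementer {

	public SqlServerSequenceIncrementer(DataSource dataSource, String sequenceName) {
		super(dataSource, sequenceName);
	}

	@Override
	protected String getSequenceQuery() {
		return "select next value for " + getIncrementerName();
	}
}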

fmbenhassine added the related-to: ddl-scripts label and removed the status: waiting-for-triage label Apr 30, 2021
fmbenhassine added this to the 5.0.0 milestone May 5, 2021
fmbenhassine modified the milestones: 5.0.0, 5.0.0-M1 Sep 17, 2021
darkmastermindz pushed a commit to darkmastermindz/spring-batch that referenced this issue Mar 29, 2022
Emulating sequences with tables causes deadlocks when running
multiple jobs at the same time. Sequences have been supported
in SQL Server since version 2012.

This commit replaces the usage of tables to emulate sequences
with real sequences. It also adds a new incrementer that is
based on sequences instead of tables.

Resolves spring-projects#1448