Oracle clustered environment with cached sequences can lead to Spring Batch thinking new job already exists [BATCH-1586] #2000
Comments
Dave Syer commented I'm not aware of anywhere in Spring Batch where we rely on ordering of primary keys. Can you be more specific?
Jimmi Changa commented Our incrementer, via JobParametersBuilder, I think is pulling the primary key value from one of the job or instance tables, and that is what got us into trouble. It seems to pull the largest primary key, which isn't always the most recent job. If we're doing something wrong in our incrementer below, please let us know.
public class RunIdIncrementer implements JobParametersIncrementer {
}
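(For reference, Spring Batch's stock RunIdIncrementer follows roughly the pattern below. The body of the class pasted above was not preserved, so this is a minimal sketch of the usual run.id approach against roughly the 2.x-era API, not necessarily the commenter's exact code. Note that the incrementer itself only manipulates the JobParameters handed to it; the database only comes into play when the caller decides which parameters to hand in, which is where the rest of this thread goes.)

import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.JobParametersIncrementer;

public class RunIdIncrementer implements JobParametersIncrementer {

    private static final String RUN_ID_KEY = "run.id";

    public JobParameters getNext(JobParameters parameters) {
        // Start from the previous run's parameters (or an empty set) and bump run.id by one.
        JobParameters params = (parameters == null) ? new JobParameters() : parameters;
        long runId = params.getLong(RUN_ID_KEY, 0L) + 1;
        return new JobParametersBuilder(params).addLong(RUN_ID_KEY, runId).toJobParameters();
    }
}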
Jimmi Changa commented I pasted the wrong three sequences in the original description... Of course, I meant these three:
CREATE SEQUENCE BATCH_STEP_EXECUTION_SEQ MAXVALUE 9223372036854775807 NOCYCLE;
CREATE SEQUENCE BATCH_JOB_EXECUTION_SEQ MAXVALUE 9223372036854775807 NOCYCLE;
CREATE SEQUENCE BATCH_JOB_SEQ MAXVALUE 9223372036854775807 NOCYCLE;
And adding NOCACHE at the end is how we quickly dealt with the problem.
Dave Syer commented I don't see what the incrementer has to do with the database. If you need to take a global lock while the incrementer is being called that's up to you (the framework only provides that interface as a courtesy really anyway). It certainly has nothing to do with Oracle sequences, unless I'm missing something. Maybe you need to explain how you use the incrementer?
Javier Andrade commented Dave, I'm running into a similar issue with an Oracle DB, but I think the problem where Spring Batch relies on the ordering of the primary keys is in the SimpleJobExplorer class. When we call to get the last job instance:
List<JobInstance> lastInstances = jobExplorer.getJobInstances(jobName, 0, 1);
it appears to use the JdbcJobInstanceDao to do so, which uses this query:
private static final String FIND_LAST_JOBS_BY_NAME = "SELECT JOB_INSTANCE_ID, JOB_NAME from %PREFIX%JOB_INSTANCE where JOB_NAME = ? order by JOB_INSTANCE_ID desc";
where you can see that it is ordering by JOB_INSTANCE_ID. Because of these gaps in the sequence, this query will not always return the most recent job instance based on timestamp. I hope that helps to clarify this issue.
Dave Syer commented That's not really the same issue, and I'm not sure it's serious enough to merit a change. The JobExplorer is read only so no damage is going to be done if you get some instances back out of order from it - the main thing is that the order is repeatable. We could add an order by clause on the creation time, but that might not suit everyone because it would require an extra index. How much of a problem is it really?
Javier Andrade commented Sorry for mixing up issues. I'm trying to implement a generic batch job launcher to be used to schedule jobs in Quartz. This is based on a sample class you wrote previously for this same purpose. The problem is that the job launcher needs to get the last job instance so that it can get the next value out of the incrementer defined for that job. Since I'm not always getting the last job instance, I end up reusing the run.id value of an existing job execution and get execution errors. For now I modified the Oracle sequence objects to NOCACHE, but I'm thinking a solution based on timestamp would be better. Is there a better way to do what I need?
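(A minimal sketch of the launcher pattern being described, with hypothetical class and field names and assuming a reasonably recent pre-5.x Spring Batch API, is shown below. The getJobInstances(jobName, 0, 1) call is exactly where the out-of-order JOB_INSTANCE_IDs bite: if the instance it returns is not really the most recent one, the incrementer hands back a run.id that has already been used.)

import java.util.List;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobInstance;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.launch.JobLauncher;

public class IncrementingJobLauncher {

    private final JobLauncher jobLauncher;
    private final JobExplorer jobExplorer;

    public IncrementingJobLauncher(JobLauncher jobLauncher, JobExplorer jobExplorer) {
        this.jobLauncher = jobLauncher;
        this.jobExplorer = jobExplorer;
    }

    public JobExecution launch(Job job) throws Exception {
        // Look up what Spring Batch believes is the most recent instance of this job...
        List<JobInstance> lastInstances = jobExplorer.getJobInstances(job.getName(), 0, 1);
        JobParameters lastParameters = new JobParameters();
        if (!lastInstances.isEmpty()) {
            // ...and take the parameters of one of its executions as the starting point.
            List<JobExecution> executions = jobExplorer.getJobExecutions(lastInstances.get(0));
            if (!executions.isEmpty()) {
                lastParameters = executions.get(0).getJobParameters();
            }
        }
        // Assumes the job has an incrementer configured (e.g. a run.id incrementer).
        JobParameters nextParameters = job.getJobParametersIncrementer().getNext(lastParameters);
        return jobLauncher.run(job, nextParameters);
    }
}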
Dave Syer commented I see. I suppose the workaround with no cache is the best thing for now. If you want to change the JdbcJobInstanceDao to use a timestamp filter and check that it works for you, that would help us to decide when and how to implement it in the framework. The SimpleJobService from Spring Batch Admin has the same problem as your launcher, so we probably should find a better solution in the long term.
Dave Syer commented Actually, the SimpleJobService does not have the same problem because it relies on JobExecutionDao.getLastJobExecution() not JobInstanceDao.getJobInstances(), and the JDBC implementation uses timestamps to filter the executions. Maybe you can do the same?
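(Following that suggestion, a sketch of a timestamp-based lookup is shown below, again with hypothetical names and assuming the pre-5.x API where getCreateTime() returns a java.util.Date. Rather than trusting the ordering of JOB_INSTANCE_ID, it scans a small window of recent instances and picks the execution with the latest create time before handing its parameters to the incrementer.)

import java.util.Date;
import java.util.List;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobInstance;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.explore.JobExplorer;

public class LatestParametersFinder {

    private final JobExplorer jobExplorer;
    private final int window;

    public LatestParametersFinder(JobExplorer jobExplorer, int window) {
        this.jobExplorer = jobExplorer;
        this.window = window;
    }

    public JobParameters findLatestParameters(String jobName) {
        JobParameters latestParameters = new JobParameters();
        Date latestCreateTime = null;
        // Scan a window of recent instances; the window size is a trade-off between cost
        // and how far out of order the cached sequence values are expected to drift.
        for (JobInstance instance : jobExplorer.getJobInstances(jobName, 0, window)) {
            for (JobExecution execution : jobExplorer.getJobExecutions(instance)) {
                Date createTime = execution.getCreateTime();
                if (latestCreateTime == null || createTime.after(latestCreateTime)) {
                    latestCreateTime = createTime;
                    latestParameters = execution.getJobParameters();
                }
            }
        }
        return latestParameters;
    }
}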
Becca Gaspard commented I also ran into this issue and was able to work around it by adding ORDER to the sequences.
CREATE SEQUENCE BATCH_STEP_EXECUTION_SEQ START WITH 0 MINVALUE 0 MAXVALUE 9223372036854775807 NOCYCLE ORDER;
CREATE SEQUENCE BATCH_JOB_EXECUTION_SEQ START WITH 0 MINVALUE 0 MAXVALUE 9223372036854775807 NOCYCLE ORDER;
CREATE SEQUENCE BATCH_JOB_SEQ START WITH 0 MINVALUE 0 MAXVALUE 9223372036854775807 NOCYCLE ORDER;
From the Oracle documentation: If you don't specify ORDER, NOORDER is the default.
Roan Brasil Monteiro commented Hi, how can I help to fix issues?
Gary L Peskin commented FYI. We've found this to be a significant problem when restarting a STOPPED job. The query that finds the most recent execution for an instance in JdbcJobExecutionDao.getLastJobExecution obviously requires that the JOB_EXECUTION_ID be monotonically increasing. However, that was not the case with our Oracle RAC setup until we added the ORDER keyword to the sequences. Before that, a STOPPED job was sometimes restarted with a Job Execution Context that belonged to a job execution that was not the most recent one.
Jimmi Changa opened BATCH-1586 and commented
OBSERVED BEHAVIOR
The primary keys for the spring_batch tables are not always saved in sequence. For example, we would see rows created in the following order: 1, 2, 3, 4, 20, 21, 22, 40, 41, 23, 24, 42.
Having gaps between primary keys is common, because Oracle sequences do not roll back with a transaction. What is not common, especially in a non-clustered environment, is the sequence values appearing to decrease; in this example, the sequence went from 41 to 23. Even this is normally not an issue, provided the developer does not rely on the primary key for ordering or query on it for order. However, in this case the third-party Spring Batch library does query on the primary key for ordering, a design flaw that leads to Spring Batch thinking our job has already run ("JobInstance already exists and is not restartable").
ROOT CAUSE
In a clustered DB environment, there is no guarantee which cluster node a newly opened database connection will land on. If the sequences are set up per the SQL files included with Spring Batch, they use Oracle's default cache size of 20, as created in the resources file business-schema-oracle10g.sql:
CREATE SEQUENCE CUSTOMER_SEQ START WITH 5;
CREATE SEQUENCE BATCH_STAGING_SEQ START WITH 0 MINVALUE 0;
CREATE SEQUENCE TRADE_SEQ START WITH 0 MINVALUE 0;
This eventually leads to situations where the primary key is written out of sequence, especially when new DB connections are established. By itself that is not a problem, because most DBs only guarantee a unique primary key, not a primary key assigned in order. The real problem is that Spring Batch relies on the primary key being in order, which I've always been taught should not be done: developers shouldn't use the primary key for ordering.
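(The effect is straightforward to reproduce. Below is a minimal JDBC sketch, with a hypothetical connection URL and credentials, that draws values of one of the Spring Batch sequences from two separate connections. When those connections land on different RAC nodes, each node serves values from its own cached block, so the interleaved output can appear to jump backwards.)

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SequenceOrderDemo {

    public static void main(String[] args) throws SQLException {
        // Hypothetical connection details; requires the Oracle JDBC driver on the classpath.
        String url = "jdbc:oracle:thin:@//db-host:1521/service";
        try (Connection c1 = DriverManager.getConnection(url, "user", "pass");
             Connection c2 = DriverManager.getConnection(url, "user", "pass")) {
            // Interleave NEXTVAL calls from the two connections and print what each one sees.
            for (int i = 0; i < 5; i++) {
                System.out.println("c1: " + nextVal(c1) + "  c2: " + nextVal(c2));
            }
        }
    }

    private static long nextVal(Connection connection) throws SQLException {
        try (Statement statement = connection.createStatement();
             ResultSet rs = statement.executeQuery("SELECT BATCH_JOB_SEQ.NEXTVAL FROM DUAL")) {
            rs.next();
            return rs.getLong(1);
        }
    }
}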
PROPOSED SOLUTIONS
There are several potential solutions.
(1) Specify 'nocache' for each of the three sequences
(e.g., CREATE SEQUENCE CUSTOMER_SEQ START WITH 5 NOCACHE;)
(2) Specify a shared pooling for each of the three sequences
(3) Design the Spring Batch tables/logic not to rely on the primary key increasing in order. A new column could be added to the table, for example, for tracking ordering.
I lean towards solution (3) because it would solve the problem regardless of which DB is utilized, and it is good software design. This primary key sequencing issue may also be present in DBs other than Oracle, so solution (3) would be database agnostic. Solution (1) could be a cheap and easy fix until (3) is completed.
Affects: 2.0.4
2 votes, 6 watchers