Test Race Conditions During Liquibase Locking #2327
Conversation
@StevenMassaro, I noticed that your PR #2198 fixed the race condition that I was trying to resolve in #1901. #1901 contains a test case that runs a migration with multiple JVMs, and I want to contribute that to Liquibase. Do you have any thoughts?
(Force-pushed from 8ec98f6 to 045797e.)
// hubUpdater releases the lock temporarily. In this time span another JVM instance might have
// acquired the database lock and applied further changesets. Reset the changelog cache to
// prevent Liquibase from working with an outdated changelog.
changeLogService.reset();
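To make the snippet above concrete, here is a minimal, self-contained Java sketch of why a cached changelog goes stale when the lock is temporarily released. The classes are hypothetical stand-ins, not Liquibase's real `ChangeLogService`; a shared list stands in for the DATABASECHANGELOG table.

```java
import java.util.ArrayList;
import java.util.List;

public class StaleChangelogDemo {

    // Stands in for the DATABASECHANGELOG table shared by all JVMs.
    static final List<String> databaseChangeLog = new ArrayList<>();

    // Stands in for a per-JVM changelog history cache.
    static class ChangeLogService {
        private List<String> cache = new ArrayList<>(databaseChangeLog);

        // Re-read the cache from the shared table.
        void reset() {
            cache = new ArrayList<>(databaseChangeLog);
        }

        boolean alreadyApplied(String changeSet) {
            return cache.contains(changeSet);
        }
    }

    public static void main(String[] args) {
        // JVM A populates its cache while holding the lock...
        ChangeLogService jvmA = new ChangeLogService();

        // ...then releases the lock; meanwhile JVM B applies a changeset.
        databaseChangeLog.add("changeset-1");

        // Without reset(), JVM A's cache is stale: it believes changeset-1
        // was never applied and would run it a second time.
        System.out.println("stale view: applied=" + jvmA.alreadyApplied("changeset-1"));

        // The fix: re-read the history after re-acquiring the lock.
        jvmA.reset();
        System.out.println("fresh view: applied=" + jvmA.alreadyApplied("changeset-1"));
    }
}
```

Run standalone, the stale view reports the changeset as not applied and the fresh view reports it as applied, which mirrors the race the comment describes.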
That is also a necessary change to fix #1584
I just committed a change to fix the compilation error. I first want to see if the test you wrote passes in the CI build. Certainly more test coverage is not a bad thing, and some coverage in the area you've written here is also necessary, but I do wonder if the approach is somewhat heavy-handed. I want to discuss with my team and see if we have any existing tests which cover this area, and I'll get back to you.
I think the test seems good. It does spin up another process, which is heavyweight, but because we use singletons etc. within the JVM, just using another thread wouldn't really test what you're looking to do. We're in the process of redoing how the integration test framework works, so there may be a better way to abstract some of that out down the road, but we can work with the code from where it is.
We also do not have any existing test coverage for this pattern, so adding this test will help cover more use cases. Thanks @schrieveslaach!
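The multi-process approach the reviewers endorse above can be sketched with `ProcessBuilder`. This is only an illustration of the mechanics, not the PR's actual test: the real test would run a full Liquibase update in the child JVM, while this sketch substitutes the harmless `java -version`.

```java
import java.util.concurrent.TimeUnit;

public class SpawnSecondJvm {
    public static void main(String[] args) throws Exception {
        // Path of the JVM we are running in; reused for the child process.
        String javaBin = System.getProperty("java.home") + "/bin/java";

        // In the real test this command line would run a Liquibase update
        // against the same database; "java -version" is a stand-in here.
        Process second = new ProcessBuilder(javaBin, "-version")
                .inheritIO()   // forward the child's output to our console
                .start();

        boolean finished = second.waitFor(30, TimeUnit.SECONDS);
        if (!finished) {
            second.destroyForcibly();
        }
        System.out.println("child finished=" + finished);
    }
}
```

Launching a separate process is what makes the race reproducible: per-JVM singletons mean a second thread would share the same cached state and never hit the stale-changelog window.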
Looking at your change, I moved the reset call into HubUpdater as part of it re-acquiring the lock, since that seemed safer. BUT your new test was still passing for me even without the line in either place. Based on why you wanted to add it, @schrieveslaach, is the new location good? Or did #2198 end up addressing the problem in a way that makes us not even need that reset() call anymore?
Thanks for your feedback, and I'm happy to help the project. Here are my responses:
Unfortunately, the test I wrote does not trigger all race conditions every time, due to the non-determinism of parallel code. 🤷 On my local machine I can reproduce the issue with a microservice setup.
Unfortunately, no. Let me try to explain what I observe:
Therefore, I called changeLogService.reset(). @nvoxland, @StevenMassaro, does lockService.reset() serve a similar purpose?
@schrieveslaach lockService.reset() is a similar reset, but for the LockService.

What should generally happen is that we use the lock to ensure that your instance is the only one running. Once we have the lock ourselves, we read from the databasechangelog table to populate the ChangeLogHistoryService and then assume that nobody else will be messing with it besides us. The problem is that we have code that does a temporary unlock when it's prompting the user, which breaks that older assumption that the changelog history remains correct. We need to make sure that any time we unlock, we also no longer trust the changelog service. The extra commit I pushed to your fork moves the reset to be part of the logic that does the temporary unlock, in hopes of better handling that case.

We have a refactoring of the lock and changelog services coming up soon, and having a more direct tie between them may be smart based on these bugs. But until then, since you can reproduce the issue more readily on your system than the tests do: can you try the newest version of your fork to see if it's still working for you? Or did my moving the call break your fix?
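The "more direct tie" between the lock and changelog services mentioned above could look roughly like the sketch below. All names are illustrative, not Liquibase's actual API; the point is that unlocking (and re-locking) automatically marks the cached history as untrusted, so no caller can forget the reset.

```java
public class TiedLockService {
    private boolean locked = false;
    private boolean historyTrusted = false;

    void acquireLock() {
        // ...take the DATABASECHANGELOGLOCK row...
        locked = true;
        // Another JVM may have run while we were unlocked.
        historyTrusted = false;
    }

    void readHistory() {
        if (!historyTrusted) {
            // ...re-read the DATABASECHANGELOG table here...
            historyTrusted = true;
        }
    }

    void releaseLock() {
        // ...free the DATABASECHANGELOGLOCK row...
        locked = false;
        // A temporary unlock breaks the "nobody else touched it" assumption.
        historyTrusted = false;
    }

    boolean historyTrusted() { return historyTrusted; }

    public static void main(String[] args) {
        TiedLockService svc = new TiedLockService();
        svc.acquireLock();
        svc.readHistory();
        svc.releaseLock();   // e.g. hubUpdater's temporary unlock
        svc.acquireLock();   // on re-lock the cached history is untrusted again
        System.out.println("trusted after re-lock=" + svc.historyTrusted());
    }
}
```

Coupling the invalidation to the unlock, rather than scattering reset() calls at each call site, is one way to make the invariant "unlocked implies untrusted history" impossible to violate.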
@nvoxland, I tested the newest version and I could reproduce the race condition again. Your fix does not work in my scenario. |
This commit adds an integration test for LB-2131's fix. Additionally, the commit ensures that the changelog history cache will be reset when the lock service is reset. That ensures that multiple JVMs can try to apply the changelog without stepping on each other's toes.
(Force-pushed from 48673aa to af0fd02.)
@nvoxland, @StevenMassaro, I updated the PR with a slight change. Based on @nvoxland's proposal I tried to reset the changelog history when resetting the lock service. However, that does not seem to clean the history properly, because on my system the migration still crashes with multiple JVMs. I've got the impression that my initial fix (resetting the changelog right after hubUpdater releases the lock) is still required. What do you think?
Thanks for the update @schrieveslaach. I'm trying out some options for improving testing in general around the lock service, and then fixing a variety of bugs based on what those tests show failing. I'm going to fold this PR into that effort over the next couple of weeks.
@nvoxland, thanks for the update. Please let me know when you have something to test.
@nvoxland, is there any update on this topic? |
@nvoxland, ping. 😉 |
Sorry for the slow response. I kept feeling like some of the larger test refactoring was coming soon, but then we keep finding other things to work on first... Since it's taking long enough, I'll bring this in, since it's a good test to be adding; if/when we get to the larger integration test refactoring, it will be there for us to figure out how to incorporate into any new structures we have. Things to be aware of:

Things to worry about:
@nvoxland, thanks for the heads up. Please keep in mind that the following change was also necessary to resolve a race condition:

// Reset the lockService in case other JVM instances have done things to the lock table since we had last locked it
lockService.reset();

Without this fix I wasn't able to spin up multiple services that all try to apply the database migration.
True, @schrieveslaach. I had noticed that but forgot to include it in my notes. I've updated them with that added. Thanks!
This PR addresses race conditions when multiple Liquibase operations on the same database are locking/unlocking the DATABASECHANGELOGLOCK tracking table.
- The fix resets the ChangeLogHistoryServiceFactory in the StandardLockService class to ensure the status of changesets is updated if another JVM process updated the DATABASECHANGELOG table while this process waited for the DATABASECHANGELOGLOCK table to be unlocked.
- A new integration test validates the locking behavior when three updates execute against the same database.
- No additional testing required.
APPROVED
Thanks for your PR submission! We just finished reviewing and merging it into the 4.17.0 release on October 10, 2022. When you get a chance, could you please star the Liquibase project? The star button is in the upper right corner of the screen.
This commit adds an integration test for LB-2131's fix.
Description
#1901 tried to fix a race condition (see #1584). However, the fix was introduced by #2198, but without any automated test. Therefore, this PR, which supersedes #1901, restores the test case to provide more quality assurance.