
Test Race Conditions During Liquibase Locking #2327

Merged

Conversation

@schrieveslaach (Contributor) commented Jan 5, 2022

This commit adds an integration test for LB-2131's fix.

Description

#1901 tried to fix a race condition (see #1584). However, the fix was introduced by #2198 without any automated test. Therefore, this PR, which supersedes #1901, restores the test case to provide more quality assurance.

@schrieveslaach (Contributor Author)

@StevenMassaro, I noticed that your PR #2198 fixed the race condition that I was trying to resolve in #1901. #1901 contains a test case that runs a migration with multiple JVMs and I want to provide that to Liquibase. Do you have any thoughts?

@schrieveslaach force-pushed the test-race-condititon-during-init branch from 8ec98f6 to 045797e on January 5, 2022 12:38
// hubUpdater releases the lock temporarily. In this time span, another JVM instance might have
// acquired the database lock and applied further changesets. Reset the changelog service to
// prevent Liquibase from working with an outdated changelog.
changeLogService.reset();
@schrieveslaach (Contributor Author)

That is also a necessary change to fix #1584

@StevenMassaro (Contributor)

> @StevenMassaro, I noticed that your PR #2198 fixed the race condition that I was trying to resolve in #1901. #1901 contains a test case that runs a migration with multiple JVMs and I want to provide that to Liquibase. Do you have any thoughts?

I just committed a change to fix the compilation error. I first want to see if the test you wrote passes in the CI build. Certainly more test coverage is not a bad thing, and some coverage in the area you've written here is also necessary, but I do wonder if the approach feels somewhat heavy-handed. I want to discuss with my team and see if we have any existing tests which cover the area you've written here, and I'll get back to you.

@nvoxland (Contributor) commented Jan 5, 2022

I think the test seems good. It does spin up another process, which is heavyweight, but because we use singletons etc. within the JVM, just using another thread wouldn't really test what you're looking to do. We're in the process of re-doing how the integration test framework works, so there may be a better way to abstract some of that out down the road, but we can work with the code from where it is.
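
For a concrete picture of that approach, here is a minimal sketch of a JUnit test that drives a second JVM; the main class and helper names are hypothetical, not the ones used in this PR:

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

class ConcurrentUpdateTest {

    @Test
    void updatesFromTwoJvmsDoNotConflict() throws Exception {
        // Spawn a second JVM running the same update; an in-process thread
        // would share Liquibase's singletons and not exercise the race.
        Process other = new ProcessBuilder(
                System.getProperty("java.home") + "/bin/java",
                "-cp", System.getProperty("java.class.path"),
                "com.example.UpdateMain") // hypothetical main class that runs an update
                .inheritIO()
                .start();

        runUpdateInThisJvm(); // hypothetical helper performing the same update here

        assertEquals(0, other.waitFor(), "update in the second JVM failed");
    }

    private void runUpdateInThisJvm() {
        // hypothetical: configure Liquibase against the shared database and call update()
    }
}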

@StevenMassaro (Contributor)

We also do not have any existing test coverage for this pattern, so adding this test will help cover more use cases. Thanks @schrieveslaach !

@nvoxland (Contributor) commented Jan 5, 2022

Looking at the changeLogService.reset(); call you added, though: if the problem you were seeing was caused by the temporary unlock in HubUpdater, my worry is that handling it only in update() like you have doesn't address the similar problems it would cause in rollback etc., where we also call the hubUpdater.register() command.

I moved the reset call into HubUpdater as part of it re-acquiring the lock, since that seemed safer. BUT your new test was still passing for me even without the line in either place.

Based on why you wanted to add it, @schrieveslaach, is the new location good? Or did #2198 end up addressing the problem in a way that makes us not even need that reset() call anymore?

@nvoxland nvoxland added this to To Do in Conditioning++ via automation Jan 5, 2022
@nvoxland nvoxland moved this from To Do to In discussion in Conditioning++ Jan 5, 2022
@schrieveslaach (Contributor Author)

Thanks for your feedback, and I'm happy to help the project. Here are my responses:

> BUT your new test was still passing for me even without the line in either place.

Unfortunately, the test I wrote does not trigger all race conditions every time, due to the non-determinism of parallel code. 🤷 On my local machine, I can reproduce the issue with a microservice setup.

> Or did #2198 end up addressing the problem in a way that makes us not even need that reset() call anymore?

Unfortunately, no. Let me try to explain what I observe:

  1. I have multiple Dockerized JVMs that I start at the same time and all try to acquire the Liquibase lock on a fresh database.
  2. Each JVM calls checkLiquibaseTables before it tries to acquire the lock, so every JVM has a cached version of the applied changelogs in ChangeLogHistoryService.
  3. Because the database is empty, every JVM has an empty cache of applied changelogs.
  4. Now, one JVM applies the full changelog file and releases the lock.
  5. Another JVM acquires the lock and uses the cached list of applied changelogs (which is empty) to determine which entries have to be applied. In this case, it applies all entries of the changelog file and fails due to existing tables.

Therefore, I called changeLogService.reset(); in Liquibase, fixing the stale cache of ChangeLogHistoryService described in step 5.
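
To make step 5 concrete, here is a minimal sketch of where that reset sits. This is a simplified flow, not the actual Liquibase.update() source; the factory and service calls are Liquibase's public APIs:

import liquibase.changelog.ChangeLogHistoryService;
import liquibase.changelog.ChangeLogHistoryServiceFactory;
import liquibase.database.Database;
import liquibase.exception.LockException;
import liquibase.lockservice.LockService;
import liquibase.lockservice.LockServiceFactory;

class RaceConditionFixSketch {
    static void lockAndRefreshHistory(Database database) throws LockException {
        LockService lockService = LockServiceFactory.getInstance().getLockService(database);
        lockService.waitForLock(); // blocks until this JVM holds DATABASECHANGELOGLOCK

        // Another JVM may have applied changesets while we waited, so any
        // history cached earlier (e.g. by checkLiquibaseTables) is stale.
        ChangeLogHistoryService changeLogService =
                ChangeLogHistoryServiceFactory.getInstance().getChangeLogService(database);
        changeLogService.reset(); // discard the cached list of applied changesets
        // The history will be re-read from DATABASECHANGELOG on next use, so
        // only genuinely unapplied changesets get executed.
    }
}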

@nvoxland, @StevenMassaro, does lockService.reset(); do the same thing?

@nvoxland (Contributor)

@schrieveslaach lockService.reset() is a similar reset but for the LockService.

What should generally be happening is that we use the lock to ensure that your instance is the only one running. Once we have the lock ourselves, we read from the databasechangelog table to populate the ChangeLogHistoryService and then assume that nobody else will be messing with it besides us.

The problem is that we have code that does a temporary unlock when it's prompting the user, which then breaks that older assumption that the changelog history remains correct. We need to make sure that anytime we unlock, we also no longer trust the changelog service.
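
As an illustration of that invariant, a conceptual sketch (the method and promptUser are hypothetical, loosely modeled on the temporary-unlock flow discussed here; imports as in the sketch above):

void promptWithTemporaryUnlock(Database database) throws LockException {
    LockService lockService = LockServiceFactory.getInstance().getLockService(database);
    lockService.releaseLock(); // temporary unlock while prompting the user
    try {
        promptUser(); // hypothetical: other JVMs may update the database now
    } finally {
        lockService.waitForLock(); // re-acquire the lock
        // Do not trust history cached before the unlock; force a re-read
        // from DATABASECHANGELOG before using the changelog service again.
        ChangeLogHistoryServiceFactory.getInstance()
                .getChangeLogService(database).reset();
    }
}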

The extra commit I pushed to your fork moves the reset to be part of the logic that does the temporary unlock, in hopes of better handling the case. We have a refactoring of the lock+changelog services coming up soon, and having a more direct tie between them may be smart based on these bugs.

But until that, since you can reproduce it more readily on your system than the tests do, can you try the newest version of your fork to see if it's still working for you? Or if my function move broke your fix?

@schrieveslaach (Contributor Author)

@nvoxland, I tested the newest version and I could reproduce the race condition again. Your fix does not work in my scenario.

@kataggart kataggart moved this from In discussion to To Do in Conditioning++ Feb 1, 2022
@kataggart kataggart assigned nvoxland and unassigned StevenMassaro Feb 1, 2022
This commit adds an integration test for LB-2131's fix.

Additionally, the commit ensures that the changelog history cache will
be reset when the lock service has been reset. That ensures that
multiple JVMs can try to apply the changelog without stepping on each
other's toes.
@schrieveslaach (Contributor Author)

@nvoxland, @StevenMassaro, I updated the PR with a slight change. Based on @nvoxland's proposal, I tried to reset the changelog history when resetting the lock service. However, that does not seem to clear the history properly, because on my system the migration still crashes with multiple JVMs.

I've got the impression that my initial fix (resetting the changelog within Liquibase) is the only working solution.

What do you think?

@nvoxland (Contributor)

Thanks for the update @schrieveslaach. I'm trying out some options for improving testing in general around the lock service and then fixing a variety of bugs based on what those tests can show failing.

I'm going to wrap this PR into that effort over the next couple weeks.

@schrieveslaach (Contributor Author)

@nvoxland, thanks for the update. Please let me know if you have something to test.

@schrieveslaach (Contributor Author)

@nvoxland, is there any update on this topic?

@schrieveslaach (Contributor Author)

@nvoxland, ping. 😉

@nvoxland added the SafeToBuild and autocandidate labels Sep 6, 2022
@nvoxland (Contributor) commented Sep 6, 2022

Sorry for the slow response. I keep feeling like some of the larger test refactoring is coming soon, but then we keep finding other things to work on first...

Since it's taking long enough, I'll bring this in: it's a good test to be adding, and if/when we get to the larger integration test refactoring, it will be there for us to figure out how to incorporate into any new structures we have.

Things to be aware of:

  • The PR really just adds a test. The other changed code is just minor cleanup and the inclusion of a needed call to reset().

Things to worry about:

  • Nothing

@github-actions (bot) commented Sep 6, 2022

Unit Test Results

  4 644 files  ±0      4 644 suites  ±0     38m 37s ⏱️ +2m 12s
  4 641 tests  +15     4 419 ✔️ +12        222 💤 +3       0 ±0
  54 864 runs  +180    49 684 ✔️ +24      5 180 💤 +156   0 ±0

Results for commit 72579f9. Comparison against base commit 165c594.

♻️ This comment has been updated with latest results.

@nvoxland added and removed the SafeToBuild label Sep 10, 2022
@schrieveslaach (Contributor Author) commented Sep 13, 2022

@nvoxland, thanks for the heads up. Please keep in mind that the following change was also necessary to resolve a race condition:

// Reset the lockService in case other JVM instances have done things to the lock table since we had last locked it
lockService.reset();

Without these fixes, I wasn't able to spin up multiple services that all try to apply the database migration.

@nvoxland (Contributor)

True, @schrieveslaach. I had noticed that but forgot to include it in my notes; I've updated them to include it. Thanks!

@nvoxland added and removed the SafeToBuild label Sep 21, 2022
@XDelphiGrl (Contributor) left a comment


This PR addresses race conditions that occur when multiple Liquibase operations on the same database lock and unlock the DATABASECHANGELOGLOCK tracking table.

  • The fix resets the ChangeLogHistoryServiceFactory in the StandardLockService class to ensure the status of changesets is refreshed if another JVM process updated the DATABASECHANGELOG table while this process waited for the DATABASECHANGELOGLOCK table to be unlocked (see the sketch after this review).
  • A new integration test validates the locking use cases when three updates execute against the same database.
  • No additional testing required.

APPROVED
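
For reference, a hedged sketch of the shape of that fix (the subclass here is only for illustration; see StandardLockService in the Liquibase sources for the exact code):

import liquibase.changelog.ChangeLogHistoryServiceFactory;
import liquibase.lockservice.StandardLockService;

public class ResettingLockService extends StandardLockService {
    @Override
    public void reset() {
        super.reset(); // clears the lock service's own cached state
        // The key addition: discard cached changelog history together with
        // the lock state, so changeset status is re-read from
        // DATABASECHANGELOG rather than trusted from before the unlock.
        ChangeLogHistoryServiceFactory.getInstance().resetAll();
    }
}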

@nvoxland nvoxland merged commit f700115 into liquibase:master Sep 26, 2022
@nvoxland nvoxland added this to the 1NEXT milestone Sep 30, 2022
@tabbyf00

Thanks for your PR submission! We just finished reviewing and merging it into the 4.17.0 release on October 10, 2022. When you get a chance, could you please Star the Liquibase project? The star button is in the upper right corner of the screen.

Successfully merging this pull request may close the following issue:

Race condition and possible lock bypass if two nodes run on clean schema