Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sync.Mutex and os.Rename to prevent corrupted file when downloading the Postgres archive #105

Merged
merged 39 commits into from
May 21, 2023

Conversation

alecsammon
Copy link
Contributor

@alecsammon alecsammon commented Mar 7, 2023

Related to: #96

  1. Downloads the postgres archive to a temp location, then uses os.Rename to move into the correct position. As this is an atomic operation this should prevent corruption on Unix machines. (On windows devices then this will fail if the destination already exists, but we can ignore the error, as we know the source and destination should be the same)
  2. Extracts the archive to a temporary location, and then renames the files into the correct location. This should solve corruption issues when extracting the archive.
  3. Uses sync.Mutex to lock the resource before attempting to check for the cache, to prevent duplicate downloads within the same process.

@alecsammon alecsammon marked this pull request as draft March 7, 2023 10:42
@alecsammon alecsammon marked this pull request as ready for review March 7, 2023 12:15
@vanzin
Copy link

vanzin commented Mar 7, 2023

A mutex here won't work because the race is between different processes, not between different tests in the same process.

@alecsammon
Copy link
Contributor Author

alecsammon commented Mar 7, 2023

So in our particular case this appears to be happening when we spin up multiple instances from the same parent go process.

I feel that different parent processes will be much less likely to experience this issue, as the timing is likely to be different enough. In addition I would think that the host machine should manage file access better.

I've been playing with this test case and was able to replicate it the unable to extract postgres archive: xz: data is truncated or corrupt error. However I'm not sure if it's stable enough (or fast enough to add to this PR!). Let me know if you think it's worth me trying to get this to be stable enough to add to the PR.

func Test_ConcurrentStart(t *testing.T) {
	var wg sync.WaitGroup

	database := NewDatabase()
	cacheLocation, _ := database.cacheLocator()
	if err := os.RemoveAll(cacheLocation); err != nil {
		panic(err)
	}

	port := 5432
	for i := 1; i <= 5; i++ {
		port = port + 1
		wg.Add(1)

		go func(p int) {
			defer wg.Done()
			tempDir, err := os.MkdirTemp("", "embedded_postgres_test")
			assert.NoError(t, err)

			database := NewDatabase(
				DefaultConfig().
					Port(uint32(p)).
					DataPath(tempDir).
					RuntimePath(tempDir),
			)

			if err := database.Start(); err != nil {
				shutdownDBAndFail(t, err, database)
			}

			if err := database.Stop(); err != nil {
				shutdownDBAndFail(t, err, database)
			}

		}(port)
	}

	wg.Wait()
}

@vanzin
Copy link

vanzin commented Mar 7, 2023

I feel that different parent processes will be much less likely to experience this issue

We run into it, so it's still pretty likely.

@vanzin
Copy link

vanzin commented Mar 7, 2023

BTW the fix proposed in the issue (#96 (comment)) is the correct fix and works for both cases.

@alecsammon
Copy link
Contributor Author

Ah - sorry. I had addressed the issue we were facing, but hadn't understood yours properly. Sorry I had assumed it was the same thing - and should have read your issue more carefully.

I'll close this PR.

Is it worth me opening one using https://github.com/natefinch/atomic or will you do this?

@vanzin
Copy link

vanzin commented Mar 7, 2023

I wasn't planning on opening a PR (we've worked around it by serializing the tests for now). Not sure what are the project's rules around bringing in new dependencies...

@alecsammon alecsammon marked this pull request as draft March 8, 2023 07:47
@fergusstrange
Copy link
Owner

Hey all,

This is almost exactly what we're looking for to resolve the issue. However @vanzin and @alecsammon as you guessed we are hesitant to introduce any new dependencies.

Ideally we should be able to do a download to a temporary location and move to the intended location as a work around, ignoring failures when the file already exists. If you're up for it, you're welcome to have a go at introducing this technique.

I'll have some spare time in the coming weeks and should be able to have a go myself if not.

@alecsammon alecsammon changed the title Add sync.Mutex to prevent collisions Add sync.Mutex and os.Rename to prevent corrupted file when downloading the Postgres archive Mar 11, 2023
@alecsammon
Copy link
Contributor Author

Sorry for all the noise here - was really struggling to understand why this was failing on windows and not linux.
I had tried multiple atomic libraries and was still getting unexpected errors in the tests.

Turns out the issue is probably this

when using temporary files on windows then:

under certain circumstances it is ... impossible to rename it
https://stackoverflow.com/a/64943554/419324

Manually closing the temporary file looks to solve this problem, and allow the rename to happen.

Now I'm close to having something that works I can spend some time cleaning up the PR and hopefully have it ready for review soon.

decompression.go Outdated
@@ -21,9 +21,14 @@ func defaultTarReader(xzReader *xz.Reader) (func() (*tar.Header, error), func()
}

func decompressTarXz(tarReader func(*xz.Reader) (func() (*tar.Header, error), func() io.Reader), path, extractPath string) error {
tempExtractPath, err := os.MkdirTemp("", "embedded_postgres")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a couple of problems here.

  • you're not cleaning up this temp dir (minor problem but it's not nice to leave this stuff behind)
  • the temp dir is being created in the current working directory

The latter is a problem because rename doesn't work across filesystems, and this temp dir may be in a different fs than extractPath.

What you want is to create a temp file in the same directory as the final file. Or create the temp dir under that directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah - that make sense - thanks!

I've updated the PR to do both of those:

  1. Have the temp locations in the same location as the extract location.
  2. Ensure the temp locations are cleaned up when necessary.

@vanzin
Copy link

vanzin commented Mar 28, 2023

@fergusstrange anything else that's needed to move this forward?

remote_fetch.go Outdated
}
}()

if err := os.WriteFile(tmp.Name(), archiveBytes, file.FileHeader.Mode()); err != nil {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit weird, because tmp created above is already an open file. So you have two open descriptors to the same "name", and below when you close tmp I'm not sure what the end state is. You should probably call tmp.Close() before this line.

Overall I'm a bit confused about all the code you're adding... to me only the changes in this file are needed, so I haven't even looked at the rest.

@fergusstrange fergusstrange merged commit 54a6dcf into fergusstrange:master May 21, 2023
5 checks passed
@fergusstrange
Copy link
Owner

Hey @alecsammon thank you for all the effort here and apologies for the delay in getting this all merged in.

I'll likely do a little tidy up and try to up coverage again but will look to get a pre-release cut for this shortly so that you and others can begin testing.

@vanzin thanks for looking over the code as well 🙏

slimsag pushed a commit to sourcegraph/embedded-postgres that referenced this pull request May 25, 2023
…oading the Postgres archive (fergusstrange#105)

* Add sync.Mutex to prevent collisions

* * add atomic download, and use defer to ensure mutex unlock

* * move mutex to global

* * fix tests

* * update platform-test

* * update examples

* * remove atomic dependency

* * remove code duplication

* * reduce test parallel run count

* * fix race condition in tests

* * run tests

* * attempt to fix windows

* * attempt a different solution for windows

* * revert changes

* * try additional fix for windows

* * add another test

* * catch syscall.EEXIST

* * fix test

* * fix test

* * add extra debugging

* * attempt to fix windows

* * add additional error message

* * fix race in decompression

* * more fixes

* * use atomic

* * add extra debug

* * try catching the error

* * try different permissions

* * add more debugging

* * more debug

* * more debug

* * test dest

* * attempt to close temp file

* * simplify

* * remove atomic

* * clean up code

* * add more tests

* * clean up temporary files

* * prevent file being opened twice
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants