Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

safe delete cache directory files #8710

Merged
merged 1 commit into from Nov 15, 2022

Conversation

kkmuffme
Copy link
Contributor

since we safeFileGetContents, it makes sense to also acquire lock before deleting, to avoid reading partial data

since we safeFileGetContents, it makes sense to also acquire lock before deleting, to avoid reading partial data
@kkmuffme kkmuffme marked this pull request as ready for review November 15, 2022 19:04
@kkmuffme
Copy link
Contributor Author

Ready for review

@orklah orklah added the release:internal The PR will be included in 'Internal changes' section of the release notes label Nov 15, 2022
@orklah orklah merged commit 22fe7f4 into vimeo:master Nov 15, 2022
@orklah
Copy link
Collaborator

orklah commented Nov 15, 2022

Thanks!

@weirdan
Copy link
Collaborator

weirdan commented Nov 15, 2022

Are there any filesystems where unlink() is not atomic?

@kkmuffme
Copy link
Contributor Author

RHEL 8/9 it seems. We have 100s of psalm instances running at the same time (which partially use the same cache directories) and as we recently increased the number of instances, we got tons of igbinary unserialize end-of-data errors, which prompted me to submit this. (I previously submitted the psalm lock/safe read stuff, as we got those issue before)

There seems to be more race condition issues though with things that were recently changed, so I'll probably add another PR soon to address those.

@weirdan
Copy link
Collaborator

weirdan commented Nov 15, 2022

RHEL 8/9 it seems.

I find it unlikely, as POSIX specifically mandates that data should not be removed until there are no open file descriptors referencing the file:

The unlink() function shall remove a link to a file. If path names a symbolic link, unlink() shall remove the symbolic link named by path and shall not affect any file or directory named by the contents of the symbolic link. Otherwise, unlink() shall remove the link named by the pathname pointed to by path and shall decrement the link count of the file referenced by the link.

When the file's link count becomes 0 and no process has the file open, the space occupied by the file shall be freed and the file shall no longer be accessible. If one or more processes have the file open when the last link is removed, the link shall be removed before unlink() returns, but the removal of the file contents shall be postponed until all references to the file are closed.

POSIX.1-2017: unlink()

In my experience, partial reads were pretty much always caused by partial writes, so to avoid them, all one needed was to write data into a temporary file (on the target filesystem) and then rename them to the desired file name.

@kkmuffme
Copy link
Contributor Author

Possibly, the problem is that those race conditions are extremely hard to debug, since it's impossible to deliberately reproduce them. So it's kind of a best effort at fixing them:

As I found out today, it seems most of the race conditions had another source problem anyway (and this issue here was just a symptom of it), which I PRed here #8714

@kkmuffme kkmuffme deleted the safe-delete-cache-directory branch November 20, 2022 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release:internal The PR will be included in 'Internal changes' section of the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants