
layerStore.Load under layerStore.RLock is racy #1332

Closed
mtrmac opened this issue Sep 9, 2022 · 7 comments · Fixed by #1346

Comments

@mtrmac
Collaborator

mtrmac commented Sep 9, 2022

From #1322:

  • store.Mounted (locks layerStore for reading) → ReloadIfChanged (locks loadMut, irrelevant because there is no other concurrent ReloadIfChanged) → Load (locks nothing, updates things like r.byid)
  • vs. store.Layer (locks layerStore for reading) → Get (locks nothing, reads things like r.byid)

The read lock is shared, right? So there is no exclusion of Load modifying the data structure vs. any other reader of the store AFAICT.

I just can’t see how calling ReloadIfChanged with a shared reader lock is supposed to work in this case. My initial guess is that it is not supported at all to use the same store from multiple goroutines, and each goroutine should have its own store — but that’s clearly not correct, store.GetStore carefully returns a pre-existing matching store instance for concurrent callers.

@mtrmac
Collaborator Author

mtrmac commented Sep 13, 2022

@nalind @haircommander I’ve been told that performance with in-process concurrency is a major concern in this area.

Is there any way I could quantify that? Ideally a benchmark I can run on-demand? Or specific code paths that matter? (I assume that’s mostly CRI-O, which nowadays has its own in-memory caching for both images and sandboxes. Which parts of CRI-O?)

@haircommander
Contributor

I don't think we have a good reproducer for storage deadlocks, lock contention, or races as of now. We've only seen these situations in production so far.

@mtrmac
Collaborator Author

mtrmac commented Sep 14, 2022

A tentative RFC:

Currently, all getters within a single process lock a layerStore for reading, via layerStore.lockfile. But on each call they also serialize on layerStore.loadMut (inside layerStore.ReloadIfChanged).

I propose to, more or less, turn loadMut into an RWLock, which is locked for reading during any read access to layerStore, and locked for writing during any updates. That means:

  • ~No changes to correct writers to the store; they currently lock for exclusive use cross-process, and also intra-process.
  • Hopefully trivially small changes to concurrent readers without a concurrent external modification: They can all hold loadMut for reading concurrently. The difference is only the extra overhead to lock/unlock the intra-process lock. This should be the CRI-O case, where there usually are no concurrent external writers.
  • Some slowdown for concurrent readers if there is an external modification: currently, we can have one goroutine inside Load, and other goroutines past ReloadIfChanged concurrently reading the data. With the read-write loadMut, those goroutines would not run concurrently with Load — we would be slower, and more correct, in cases that should usually not happen for CRI-O.

(And similarly for the container and image stores.)

@mtrmac
Collaborator Author

mtrmac commented Sep 15, 2022

@haircommander To clarify, I’m not looking for a test of correctness; I’m looking for benchmarks to quantify performance, at least to the extent it is relevant for CRI-O.

@mtrmac
Collaborator Author

mtrmac commented Sep 15, 2022

Note to self: The above, and #1346, do nothing for #1136. But if we had all readers going through a few wrappers, we could look into handling incomplete layers there. Maybe lock the store inter-process for reading only, load, and if there are incomplete layers, abort, lock inter-process for writing, clean the layers up, and then do the actual data read.

The point would be to structure things so that we have confidence that Load can only be called at the beginning of a transaction, making it safe to abort and retry with a write lock (without needing to somehow upgrade a held read lock to write, which is not always possible.)

@haircommander
Contributor

@haircommander To clarify, I’m not looking for a test of correctness; I’m looking for benchmarks to quantify performance, at least to the extent it is relevant for CRI-O.

I guess a simple metric could be the CRI-O e2e test time. Theoretically, any additional lock contention would cause them to run slower; they currently run ~1 hour.

Past that, if you get me a library to vendor, we could try some manual tests. We don't currently have a benchmarking suite handy.

@mtrmac
Collaborator Author

mtrmac commented Nov 8, 2022

I just can’t see how calling ReloadIfChanged with a shared reader lock is supposed to work in this case.

The original intent seems to be #898 (comment); that seems to work for the container and image stores, where reloads are only driven by lockfile.Modified(), but it doesn’t work for the layer store, where reloads can happen for other reasons, notably including layerspathModified, which can happen at any time at all.

@mtrmac mtrmac linked a pull request Nov 14, 2022 that will close this issue