
Fix data race between ReloadIfChanged and read-only usage of layerStore #1322

Merged
merged 1 commit into containers:main on Sep 6, 2022

Conversation

tyler92 (Contributor) commented Sep 5, 2022

There was a race condition when a goroutine accessed layerStore public
methods under the RO lock while ReloadIfChanged was called at the same
time. In real life, this can occur when there are two concurrent
PlayKube requests in Podman.

Signed-off-by: Mikhail Khachayants tyler92@inbox.ru

tyler92 (Contributor, Author) commented Sep 5, 2022

Sanitizer output:

==================
WARNING: DATA RACE
Write at 0x00c0004fee80 by goroutine 127:
  github.com/containers/storage.(*layerStore).Load()
      podman/vendor/github.com/containers/storage/layers.go:390 +0x4c4
  github.com/containers/storage.(*layerStore).ReloadIfChanged()
      podman/vendor/github.com/containers/storage/layers.go:1951 +0xf0
  github.com/containers/storage.(*store).Mounted()
      podman/vendor/github.com/containers/storage/store.go:2811 +0x181
  github.com/containers/common/libimage.(*Image).driverData()
      podman/vendor/github.com/containers/common/libimage/image.go:874 +0x132
  github.com/containers/common/libimage.(*Image).Inspect()
      podman/vendor/github.com/containers/common/libimage/inspect.go:107 +0x58d
  github.com/containers/podman/v4/pkg/specgen/generate.getImageFromSpec()
      podman/pkg/specgen/generate/container.go:56 +0x42a
  github.com/containers/podman/v4/pkg/specgen/generate.CompleteSpec()
      podman/pkg/specgen/generate/container.go:67 +0x7a
  github.com/containers/podman/v4/pkg/specgen/generate.MakePod()
      podman/pkg/specgen/generate/pod_create.go:76 +0x86f
  github.com/containers/podman/v4/pkg/domain/infra/abi.(*ContainerEngine).playKubePod()
      podman/pkg/domain/infra/abi/play.go:514 +0x2884
  github.com/containers/podman/v4/pkg/domain/infra/abi.(*ContainerEngine).PlayKube()
      podman/pkg/domain/infra/abi/play.go:190 +0x2947
  github.com/containers/podman/v4/pkg/api/handlers/libpod.KubePlay()
      podman/pkg/api/handlers/libpod/kube.go:103 +0xa64
  github.com/containers/podman/v4/pkg/api/handlers/libpod.PlayKube()
      podman/pkg/api/handlers/libpod/play.go:8 +0x44
  github.com/containers/podman/v4/pkg/api/server.(*APIServer).apiWrapper()
      podman/pkg/api/server/handler_api.go:62 +0xdf0
  github.com/containers/podman/v4/pkg/api/server.(*APIServer).APIHandler.func1()
      podman/pkg/api/server/handler_api.go:24 +0x64
  net/http.HandlerFunc.ServeHTTP()
      /home/misha/go/go1.17.7/src/net/http/server.go:2047 +0x4d
  github.com/containers/podman/v4/pkg/api/server.referenceIDHandler.func1.1()
      podman/pkg/api/server/handler_rid.go:38 +0x42f
  net/http.HandlerFunc.ServeHTTP()
      /home/misha/go/go1.17.7/src/net/http/server.go:2047 +0x4d
  github.com/gorilla/handlers.loggingHandler.ServeHTTP()
      podman/vendor/github.com/gorilla/handlers/logging.go:47 +0x149
  github.com/gorilla/handlers.(*loggingHandler).ServeHTTP()
      <autogenerated>:1 +0xc5
  github.com/containers/podman/v4/pkg/api/server.panicHandler.func1.1()
      podman/pkg/api/server/handler_panic.go:29 +0xf0
  net/http.HandlerFunc.ServeHTTP()
      /home/misha/go/go1.17.7/src/net/http/server.go:2047 +0x4d
  github.com/gorilla/mux.(*Router).ServeHTTP()
      podman/vendor/github.com/gorilla/mux/mux.go:210 +0x366
  net/http.serverHandler.ServeHTTP()
      /home/misha/go/go1.17.7/src/net/http/server.go:2879 +0x89a
  net/http.(*conn).serve()
      /home/misha/go/go1.17.7/src/net/http/server.go:1930 +0x12e4
  net/http.(*Server).Serve·dwrap·87()
      /home/misha/go/go1.17.7/src/net/http/server.go:3034 +0x58

Previous read at 0x00c0004fee80 by goroutine 41:
  github.com/containers/storage.(*layerStore).lookup()
      podman/vendor/github.com/containers/storage/layers.go:587 +0x51
  github.com/containers/storage.(*layerStore).Get()
      podman/vendor/github.com/containers/storage/layers.go:1370 +0x44
  github.com/containers/storage.(*store).Layer()
      podman/vendor/github.com/containers/storage/store.go:3244 +0x30a
  github.com/containers/image/v5/storage.(*storageImageSource).getSize()
      podman/vendor/github.com/containers/image/v5/storage/storage_src.go:360 +0x5d2
  github.com/containers/image/v5/storage.newImage()
      podman/vendor/github.com/containers/image/v5/storage/storage_image.go:54 +0xed
  github.com/containers/image/v5/storage.storageReference.NewImage()
      podman/vendor/github.com/containers/image/v5/storage/storage_reference.go:264 +0xef
  github.com/containers/image/v5/storage.(*storageReference).NewImage()
      <autogenerated>:1 +0x4e
  github.com/containers/common/libimage.(*Image).inspectInfo()
      podman/vendor/github.com/containers/common/libimage/inspect.go:219 +0xe9
  github.com/containers/common/libimage.(*Image).matchesPlatform()
      podman/vendor/github.com/containers/common/libimage/platform.go:66 +0x9a
  github.com/containers/common/libimage.(*Runtime).lookupImageInLocalStorage()
      podman/vendor/github.com/containers/common/libimage/runtime.go:399 +0xe44
  github.com/containers/common/libimage.(*Runtime).LookupImage()
      podman/vendor/github.com/containers/common/libimage/runtime.go:300 +0x115b
  github.com/containers/podman/v4/pkg/specgen/generate.buildPauseImage()
      podman/pkg/specgen/generate/pause_image.go:50 +0x1f3
  github.com/containers/podman/v4/pkg/specgen/generate.PullOrBuildInfraImage()
      podman/pkg/specgen/generate/pause_image.go:35 +0x1ed
  github.com/containers/podman/v4/pkg/specgen/generate.MakePod()
      podman/pkg/specgen/generate/pod_create.go:31 +0x1ab
  github.com/containers/podman/v4/pkg/domain/infra/abi.(*ContainerEngine).playKubePod()
      podman/pkg/domain/infra/abi/play.go:514 +0x2884
  github.com/containers/podman/v4/pkg/domain/infra/abi.(*ContainerEngine).PlayKube()
      podman/pkg/domain/infra/abi/play.go:190 +0x2947
  github.com/containers/podman/v4/pkg/api/handlers/libpod.KubePlay()
      podman/pkg/api/handlers/libpod/kube.go:103 +0xa64
  github.com/containers/podman/v4/pkg/api/handlers/libpod.PlayKube()
      podman/pkg/api/handlers/libpod/play.go:8 +0x44
  github.com/containers/podman/v4/pkg/api/server.(*APIServer).apiWrapper()
      podman/pkg/api/server/handler_api.go:62 +0xdf0
  github.com/containers/podman/v4/pkg/api/server.(*APIServer).APIHandler.func1()
      podman/pkg/api/server/handler_api.go:24 +0x64
  net/http.HandlerFunc.ServeHTTP()
      /home/misha/go/go1.17.7/src/net/http/server.go:2047 +0x4d
  github.com/containers/podman/v4/pkg/api/server.referenceIDHandler.func1.1()
      podman/pkg/api/server/handler_rid.go:38 +0x42f
  net/http.HandlerFunc.ServeHTTP()
      /home/misha/go/go1.17.7/src/net/http/server.go:2047 +0x4d
  github.com/gorilla/handlers.loggingHandler.ServeHTTP()
      podman/vendor/github.com/gorilla/handlers/logging.go:47 +0x149
  github.com/gorilla/handlers.(*loggingHandler).ServeHTTP()
      <autogenerated>:1 +0xc5
  github.com/containers/podman/v4/pkg/api/server.panicHandler.func1.1()
      podman/pkg/api/server/handler_panic.go:29 +0xf0
  net/http.HandlerFunc.ServeHTTP()
      /home/misha/go/go1.17.7/src/net/http/server.go:2047 +0x4d
  github.com/gorilla/mux.(*Router).ServeHTTP()
      podman/vendor/github.com/gorilla/mux/mux.go:210 +0x366
  net/http.serverHandler.ServeHTTP()
      /home/misha/go/go1.17.7/src/net/http/server.go:2879 +0x89a
  net/http.(*conn).serve()
      /home/misha/go/go1.17.7/src/net/http/server.go:1930 +0x12e4
  net/http.(*Server).Serve·dwrap·87()
      /home/misha/go/go1.17.7/src/net/http/server.go:3034 +0x58

Goroutine 127 (running) created at:
  net/http.(*Server).Serve()
      /home/misha/go/go1.17.7/src/net/http/server.go:3034 +0x847
  github.com/containers/podman/v4/pkg/api/server.(*APIServer).Serve.func3()
      podman/pkg/api/server/server.go:215 +0x6f

Goroutine 41 (running) created at:
  net/http.(*Server).Serve()
      /home/misha/go/go1.17.7/src/net/http/server.go:3034 +0x847
  github.com/containers/podman/v4/pkg/api/server.(*APIServer).Serve.func3()
      podman/pkg/api/server/server.go:215 +0x6f
==================

tyler92 (Contributor, Author) commented Sep 5, 2022

The explanation in #898 (comment) sometimes fails, because the Modified function has three cases in which it can return true:

  1. r.lockfile.Modified() returns true
  2. r.mountsLockfile.Modified() returns true
  3. info.ModTime() != r.layerspathModified is true

So it is possible for ReloadIfChanged to execute concurrently with read-only access to layerStore's internals. I understand that my patch doesn't provide thread safety in all cases, but in my view a complete solution would require a mutex throughout layers.go, and that could lead to performance degradation (see #278)
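
(For context, a paraphrased sketch of the three checks listed above. Field and helper names follow the trace and the discussion; the control flow is simplified and is not the verbatim layers.go code.)

    // Paraphrased sketch of layerStore.Modified, not the verbatim code.
    func (r *layerStore) Modified() (bool, error) {
        // Case 1: the layers lock file was written by another process.
        if modified, err := r.lockfile.Modified(); modified || err != nil {
            return modified, err
        }
        // Case 2: the mounts lock file was written (when one is in use).
        if r.mountsLockfile != nil {
            if modified, err := r.mountsLockfile.Modified(); modified || err != nil {
                return modified, err
            }
        }
        // Case 3: the on-disk layers.json mtime differs from the one
        // recorded at the previous load.
        info, err := os.Stat(r.layerspath())
        if err != nil && !os.IsNotExist(err) {
            return false, err
        }
        if info != nil {
            return info.ModTime() != r.layerspathModified, nil
        }
        return false, nil
    }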

@@ -1924,7 +1931,6 @@ func (r *layerStore) Modified() (bool, error) {
 	}
 	if info != nil {
 		tmodified = info.ModTime() != r.layerspathModified
-		r.layerspathModified = info.ModTime()
tyler92 (Contributor, Author) commented:

Changing the state of an object in a function that is constant in meaning is a bit strange. From my point of view, such functions should return the same value when called repeatedly; that was not the case before this fix.
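
(To illustrate the repeatability point, a minimal runnable stand-in, not c/storage code: a query that records what it observed as a side effect gives different answers to back-to-back calls, which is how the pre-fix Modified behaved with respect to layerspathModified.)

    package main

    import "fmt"

    // watcher mimics the pre-fix behavior discussed above: the query
    // itself records the observed state, so it is not repeatable.
    type watcher struct{ seen, current int }

    func (w *watcher) Modified() bool {
        modified := w.current != w.seen
        w.seen = w.current // side effect inside a "const" query
        return modified
    }

    func main() {
        w := &watcher{current: 1}
        fmt.Println(w.Modified()) // true: a change is pending
        fmt.Println(w.Modified()) // false: the first call consumed it
    }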

vrothberg (Member) left a comment

giuseppe (Member) commented Sep 6, 2022

@saschagrunert the code was added by you with f637c5e.

Would this PR break CRI-O?

giuseppe (Member) left a comment

the change LGTM

saschagrunert (Member) commented

> Would this PR break CRI-O?

Maybe; the corresponding CRI-O test never made it into our suite: https://github.com/saschagrunert/cri-o/blob/e1d2a3cb4961c2af8bf8bb2d2fb09857c54c638f/test/image.bats#L305-L327

I can add it so that we can catch the regression when updating c/storage.

rhatdan (Member) commented Sep 6, 2022

LGTM

rhatdan merged commit abfc271 into containers:main on Sep 6, 2022
mtrmac (Collaborator) commented Sep 9, 2022

@tyler92 @saschagrunert @giuseppe @nalind

Actually, what does this fix, and how, at all?

Looking at the trace:

  • store.Mounted (locks layerStore for reading) → ReloadIfChanged (locks loadMut, irrelevant because there is no other concurrent ReloadIfChanged) → Load (locks nothing, updates things like r.byid)
  • vs. store.Layer (locks layerStore for reading) → Get (locks nothing, reads things like r.byid)

The read lock is shared, right? So there is no exclusion of Load modifying the data structure vs. any other reader of the store AFAICT. How does this PR change anything about that?
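
(A minimal self-contained demonstration of this point, using hypothetical names rather than c/storage code: a shared read lock gives no exclusion between its holders, so a goroutine that writes while holding only RLock races with concurrent readers. Running it with `go run -race` flags the map access, just as the sanitizer flagged Load writing r.byid while lookup read it.)

    package main

    import "sync"

    func main() {
        var mu sync.RWMutex
        byid := map[string]int{} // stands in for r.byid

        var wg sync.WaitGroup
        wg.Add(1)
        go func() { // stands in for ReloadIfChanged -> Load
            defer wg.Done()
            mu.RLock()
            byid["layer"] = 1 // a write under a *shared* lock: data race
            mu.RUnlock()
        }()

        mu.RLock() // stands in for store.Layer -> Get -> lookup
        _ = byid["layer"]
        mu.RUnlock()
        wg.Wait()
    }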

If it doesn’t change that, what does it change? AFAICT the write to layerspathModified is done under exactly the same locks both before and after this PR; so (per the other comments) it only makes that time-based check less accurate.

(I do agree that a Modified function should not be writing state… but that’s a style thing rather than a data race fix. Speaking of style, I’m tempted to say that Modified should be made private and/or deprecated, when the logic depends on the r.loadMut for correctness.)


I just can’t see how calling ReloadIfChanged with a shared reader lock is supposed to work in this case. My initial guess was that it is not supported at all to use the same store from multiple goroutines, and that each goroutine should have its own store — but that’s clearly not correct: store.GetStore carefully returns a pre-existing matching store instance for concurrent callers.

I’m almost certainly missing something essential — what is it?

tyler92 (Contributor, Author) commented Sep 9, 2022

This PR is very, very non-obvious. I realize that, and I think this module should be rewritten to avoid such things. Here is what happened:

Goroutine 1:

  1. Takes the read lock for the layer store.
  2. Calls ReloadIfChanged; it returns true because r.lockfile.Modified() returns true.
  3. Load is called, and the goroutine continues working with the layer store, assuming the read lock is still held.
  4. At some point this goroutine calls Get. This is the “Previous read at 0x00c0004fee80 by goroutine 41” part of the sanitizer report.

Goroutine 2:

  1. Takes the read lock for the layer store.
  2. Calls ReloadIfChanged; it returns true because the info.ModTime() != r.layerspathModified condition is true.
  3. Load is called and the layer store's private data is modified. This is the “Write at 0x00c0004fee80 by goroutine 127” part of the sanitizer report.

What does my PR do? In step 3 of Goroutine 1 we now update layerspathModified, so the condition in step 2 of Goroutine 2 will be false. Once again: this PR is just a patch that avoids the data race; without refactoring this module I don't know how a good solution is possible.
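
(A simplified sketch of that change, not the verbatim patch: the timestamp is now recorded inside Load itself, so once one goroutine has reloaded, a concurrent goroutine's Modified() check compares against the fresh mtime and no longer triggers a second, racing Load.)

    // Simplified sketch, not the verbatim patch.
    func (r *layerStore) Load() error {
        // ... re-read layers.json and rebuild r.byid, r.byname, etc. ...
        if info, err := os.Stat(r.layerspath()); err == nil {
            // Record the mtime here, regardless of which Modified()
            // condition caused this reload.
            r.layerspathModified = info.ModTime()
        }
        return nil
    }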

In my opinion, it's a bad idea to call Load under a read lock. Maybe something like an upgradable lock is required here.
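
(One hypothetical shape of that idea; none of this exists in c/storage. Go's sync.RWMutex cannot upgrade in place, so the shared lock has to be dropped, the exclusive lock taken, and the Modified check repeated, because another goroutine may have reloaded in the unlocked window.)

    package main

    import "sync"

    // reloader abstracts the two layerStore methods involved
    // (hypothetical interface, introduced only for this sketch).
    type reloader interface {
        Modified() (bool, error)
        Load() error
    }

    // reloadUpgrading assumes the caller holds mu.RLock on entry and
    // guarantees it holds mu.RLock again on exit.
    func reloadUpgrading(mu *sync.RWMutex, r reloader) error {
        mu.RUnlock() // give up the shared lock...
        mu.Lock()    // ...and take the exclusive one
        defer func() {
            mu.Unlock()
            mu.RLock() // restore the caller's shared lock
        }()
        // Re-check: someone else may already have reloaded while we
        // were not holding any lock.
        modified, err := r.Modified()
        if err != nil || !modified {
            return err
        }
        return r.Load() // safe: no concurrent readers now
    }

    func main() {} // placeholder so the sketch compiles standalone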

mtrmac (Collaborator) commented Sep 9, 2022

> What does my PR do? In step 3 of Goroutine 1 we now update layerspathModified, so the condition in step 2 of Goroutine 2 will be false.

Ah, so layerspathModified is now updated even if it was not the reason for the load; and that reduces the risk that future readers will reload again, without fundamentally eliminating the race. Thanks, that makes sense.

mtrmac (Collaborator) commented Sep 9, 2022

I have also filed #1332 for the Load calls under read-lock, so that it is not lost.
