New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent race when resolving a property map #11186
Conversation
Changelog[uncommitted] (2022-10-28)Bug Fixes
|
Can you give an example of a panic trace being fixed? 🙇 |
The PR fixes 3 data races. The first (encountered when resolving a property map), causes raw panics: The panicking thread looks like this:
Full Stacktrace
|
@@ -1204,6 +1202,10 @@ func (state *resourceState) resolve(ctx *Context, err error, inputs *resourceInp | |||
} | |||
} | |||
|
|||
// We need to wait until after we finish mutating outprops to resolve. Resolving | |||
// unlocks multithreaded access to the resolved value, making mutation a data race. | |||
state.rawOutputs.getState().resolve(outprops, true, false, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me at some basic level. resolve(outprops) shares outprops with other goroutines, but it has not finished populating in the original code. So doing it after outprops["id"]=
.. stanzas seems to be better. The outprops[""]
I'd love to understand what that does and what for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like a marker indicating either started to resolve (resource.NewObjectProperty(remaining)
), finished resolving (resource.NewObjectProperty(remaining)
where remaining = resource.PropertyMap{}
) or will not resolve (resource.MakeComputed(resource.NewStringProperty(""))
).
I'm not 100% sure though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grepping for the key ""
in sdk/go/pulumi
, its only uses are in resolve
function. Despite that, remove it results in test failures for TestRegisterResource
.
Running with a race detector yielded two additional data races, but resolvable by adding locks. IsSecret
dependencies
|
sdk/go/pulumi/types.go
Outdated
@@ -133,6 +133,8 @@ func (o *OutputState) dependencies() []Resource { | |||
if o == nil { | |||
return nil | |||
} | |||
o.cond.L.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious do you need this? Is o.deps assign-once? I see it assigned once by fullfillValue under the o.cond.L.Lock(). Wonder if the reader lock buys us something here. It can still observe nil. I guess one thing this buys is that after adding the lock the caller of dependencies() will either observe "before" or "after" state with respect to all the other pieces being populated in fullValue, and never observe it partially populated (like dependencies populated but value not).
sdk/go/pulumi/types.go
Outdated
@@ -520,7 +522,10 @@ func (o *OutputState) ApplyTWithContext(ctx context.Context, applier interface{} | |||
|
|||
// IsSecret returns a bool representing the secretness of the Output | |||
func IsSecret(o Output) bool { | |||
return o.getState().secret | |||
s := o.getState() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a .. misleading func. It doesn't await the future to complete, but secretness in the limit is only known after awaiting the result. So it can return false on a secret output at first, and then return true once it completes. Right? So it's on the caller to make sure the output is complete before calling this. I thin this remains true before and after the change. What does the lock buy?
TestConfigSecrets race: consider awaiting the output before calling IsSecret. https://github.com/pulumi/pulumi-yaml/blob/287d989a179e9d618a547ce334d04318988a7b17/pkg/pulumiyaml/run_test.go#L499 I think since the test code does not await the output it does not return what you want! Adding a lock passes the race detector but it's still a race I think no? |
Responding to #11186 (comment) and #11186 (comment):
For slices in particular ( However, the biggest win is that we (and our downstream, including customers) can run their tests with the |
The second race is brilliant and worrisome. ApplyTWithContext races with All ? Like ApplyTWithContext reads dependencies() before toOutputTWithContext called by All resolves the output and populates dependencies? This sounds like it could be a real problem that may or may not be solved by throwing a lock in there. I'm reading ApplyTWithContext. |
We need to distinguish between two types of data race:
Adding the lock makes it not a data race according to (1), but it remains a data race according to (2). |
Well precisely. My worry is that adding locks makes (2) undetectable by race detector while leaving bugs in our SDK. |
This is interesting Line 492 in 10dd3d0
|
And then Line 516 in 10dd3d0
We take |
Yeah I think I got to the bottom of that one.. It is "kind of" okay but perhaps not really? 👻 What's happening is this. y = x.ApplyT(f) needs to compute dependencies as a union of deps(x) and deps(f(x)). The way it does it in Go, it allocates an output state for y right away with Then, it awaits the x output and finds out real deps(x). Then, it computes f(x). It checks if that's an output (it can be). In that case it tries to compute Finally, it pushes y.fullfill() with what it thinks is |
Let me try out a test case for this. This would be something like:
|
Regarding your changes! I think we should merge the one that fixes concurrent map access, that makes total sense and fixes panic! I'd separate the race detector once into separate PRs and take it a little easier. I don't think they surface as panics. I'd like to take it a little slower there and maybe either accept those changes with TODO notes and new tickets about the underlying bugs, or perhaps if we fix the bugs then we might not need those changes. I think @AaronFriel recently worked on revamping the Go SDK so might have good context, we might also ask around and bring some Go expertise on that. |
BTW
This is a great callout but I'm not seeing this happening in this case. The slices are not edited they are replaced wholesale. I don't think this would lead to memory corruption. The race is around a pointer to a slice not the contents. |
We don't see the error, but we definitely could in a future version of go. The race isn't around a pointer to a slice (which would be safe, since ptrs are word sized and word writes are atomic in go), but around a slice itself (3 words). A slice is represented in memory as |
0f374b1
to
211bd3e
Compare
I've removed all but the panic causing fix from this PR. I'll open a subsequent PR to address the remaining issue separately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks sensible if just for not marking data as resolved while it's still under mutation. A more pure language wouldn't have even allowed the current code 🦀
bors r+ |
11186: Prevent race when resolving a property map r=iwahbe a=iwahbe I encountered a problem trying to flush non-determinism out of pulumi-yaml. Running our tests with high `--count=N` values resulted in panics from concurrent accesses to `Ouptput` values. These changes allow running `go test -count=500 -race ./...` without errors (caused by pulumi/pulumi). Co-authored-by: Ian Wahbe <ian@wahbe.com>
Build failed: |
Thanks for that slice comment @iwahbe ! TIL slices are structs. Thanks for pointing this out. That's definitely bad then. Your original locks make sense then to add to compensate for this possibility.
Sounds like we could see an error in this version of Go then also right? The race detector traces are gold - if we lose them by adding locks we just need to preserve them in the tracker that's fine. I'll follow up on monday a bit more to see if we can use them to find more bugs. |
bors retry |
11186: Prevent race when resolving a property map r=iwahbe a=iwahbe I encountered a problem trying to flush non-determinism out of pulumi-yaml. Running our tests with high `--count=N` values resulted in panics from concurrent accesses to `Ouptput` values. These changes allow running `go test -count=500 -race ./...` without errors (caused by pulumi/pulumi). Co-authored-by: Ian Wahbe <ian@wahbe.com>
There is nothing to say we couldn't, beyond the fact that we haven't, yet. A new optimization in the go compiler could change that, even if we were still targeting go@1.17. |
Build failed: |
bors retry |
Build succeeded: |
11189: Make IsSecret threadsafe r=iwahbe a=iwahbe This PR ensures that `pulumi.IsSecret` works as expected. `IsSecret` presented 3 problems. 1. It wasn't thread safe, which triggered go's `-race` flag. (brought up by #11186) 2. It returned the currently known status of secret, meaning that `[]bool{IsSecret(a), IsSecret(a)}` could evaluate to `[]bool{false, true}` if the underlying value resolved between calls. 3. `IsSecret` may return the wrong value for unknown or erred `Output`s. This PR fixes (1) and (2) by awaiting the underlying value during the `IsSecret` call. (3) is unfixable, and I explain why in a comment. Blocking on calls to `IsSecret` is necessary for correctness in the general case. We can add optimizations later if necessary. The two I have in mind are 1. Adding a concept of "known secret". `pulumi.Secret(output)` is always secret, so we don't need to block on `output` to resolve this. 2. Distinguishing between functions that return an `Output` (which might be secret) and functions that return raw values (`string`, `int`, ...) which never change the "secretes" of an `Output`. We don't need to block on functions that return raw values. I believe we should merge this PR without optimizations, and ensure that `IsSecret` is as correct as we can make it. We can add optimizations as they become necessary. I don't think we have customers who rely heavily on `IsSecret` evaluating instantly on not yet resolved outputs, since that is precisely the scenario where `IsSecret` may return the wrong answer. Co-authored-by: Ian Wahbe <ian@wahbe.com>
11248: Make OutputState pass the `-race` detector r=iwahbe a=iwahbe This is the last step necessary to allow pulumi-yaml to run a subset of its tests with `-race=true`. The race is is between `ApplyT` and `fulfillValue` on copying a slice. Both me (`@iwahbe)` and `@t0yv0` have independently confirmed that the race is not hiding a deeper issue. Further discussion of the issue is at #11186. A stack trace from the `-race` flag that this issue fixes: ``` WARNING: DATA RACE Write at 0x00c000422ec0 by goroutine 463: github.com/pulumi/pulumi/sdk/v3/go/pulumi.(*OutputState).fulfillValue() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/types.go:198 +0x590 github.com/pulumi/pulumi/sdk/v3/go/pulumi.(*OutputState).ApplyTWithContext.func1() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/types.go:516 +0x8a5 Previous read at 0x00c000422ec0 by goroutine 306: github.com/pulumi/pulumi/sdk/v3/go/pulumi.(*OutputState).dependencies() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/types.go:136 +0x406 github.com/pulumi/pulumi/sdk/v3/go/pulumi.(*OutputState).ApplyTWithContext() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/types.go:492 +0x14a github.com/pulumi/pulumi/sdk/v3/go/pulumi.(*OutputState).ApplyT() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/types.go:440 +0xb8 github.com/pulumi/pulumi/sdk/v3/go/pulumi.AnyOutput.ApplyT() <autogenerated>:1 +0x49 github.com/pulumi/pulumi-yaml/pkg/pulumiyaml.TestSplit.func1.1() /Users/ianwahbe/go/src/github.com/pulumi/pulumi-yaml/pkg/pulumiyaml/run_test.go:890 +0x1d3 github.com/pulumi/pulumi-yaml/pkg/pulumiyaml.testTemplateDiags.func2() /Users/ianwahbe/go/src/github.com/pulumi/pulumi-yaml/pkg/pulumiyaml/run_test.go:286 +0x516 github.com/pulumi/pulumi/sdk/v3/go/pulumi.RunWithContext() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/run.go:120 +0x258 github.com/pulumi/pulumi/sdk/v3/go/pulumi.runErrInner() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/run.go:96 +0x4b1 github.com/pulumi/pulumi/sdk/v3/go/pulumi.RunErr() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/run.go:63 +0x23a github.com/pulumi/pulumi-yaml/pkg/pulumiyaml.testTemplateDiags() /Users/ianwahbe/go/src/github.com/pulumi/pulumi-yaml/pkg/pulumiyaml/run_test.go:274 +0x2e github.com/pulumi/pulumi-yaml/pkg/pulumiyaml.testTemplate() /Users/ianwahbe/go/src/github.com/pulumi/pulumi-yaml/pkg/pulumiyaml/run_test.go:335 +0x44 github.com/pulumi/pulumi-yaml/pkg/pulumiyaml.TestSplit.func1() /Users/ianwahbe/go/src/github.com/pulumi/pulumi-yaml/pkg/pulumiyaml/run_test.go:886 +0x2f5 testing.tRunner() /usr/local/Cellar/go/1.19.2/libexec/src/testing/testing.go:1446 +0x216 testing.(*T).Run.func1() /usr/local/Cellar/go/1.19.2/libexec/src/testing/testing.go:1493 +0x47 ``` Co-authored-by: Ian Wahbe <ian@wahbe.com>
11248: Make OutputState pass the `-race` detector r=iwahbe a=iwahbe This is the last step necessary to allow pulumi-yaml to run a subset of its tests with `-race=true`. The race is is between `ApplyT` and `fulfillValue` on copying a slice. Both me (`@iwahbe)` and `@t0yv0` have independently confirmed that the race is not hiding a deeper issue. Further discussion of the issue is at #11186. A stack trace from the `-race` flag that this issue fixes: ``` WARNING: DATA RACE Write at 0x00c000422ec0 by goroutine 463: github.com/pulumi/pulumi/sdk/v3/go/pulumi.(*OutputState).fulfillValue() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/types.go:198 +0x590 github.com/pulumi/pulumi/sdk/v3/go/pulumi.(*OutputState).ApplyTWithContext.func1() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/types.go:516 +0x8a5 Previous read at 0x00c000422ec0 by goroutine 306: github.com/pulumi/pulumi/sdk/v3/go/pulumi.(*OutputState).dependencies() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/types.go:136 +0x406 github.com/pulumi/pulumi/sdk/v3/go/pulumi.(*OutputState).ApplyTWithContext() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/types.go:492 +0x14a github.com/pulumi/pulumi/sdk/v3/go/pulumi.(*OutputState).ApplyT() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/types.go:440 +0xb8 github.com/pulumi/pulumi/sdk/v3/go/pulumi.AnyOutput.ApplyT() <autogenerated>:1 +0x49 github.com/pulumi/pulumi-yaml/pkg/pulumiyaml.TestSplit.func1.1() /Users/ianwahbe/go/src/github.com/pulumi/pulumi-yaml/pkg/pulumiyaml/run_test.go:890 +0x1d3 github.com/pulumi/pulumi-yaml/pkg/pulumiyaml.testTemplateDiags.func2() /Users/ianwahbe/go/src/github.com/pulumi/pulumi-yaml/pkg/pulumiyaml/run_test.go:286 +0x516 github.com/pulumi/pulumi/sdk/v3/go/pulumi.RunWithContext() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/run.go:120 +0x258 github.com/pulumi/pulumi/sdk/v3/go/pulumi.runErrInner() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/run.go:96 +0x4b1 github.com/pulumi/pulumi/sdk/v3/go/pulumi.RunErr() /Users/ianwahbe/go/src/github.com/pulumi/pulumi/sdk/go/pulumi/run.go:63 +0x23a github.com/pulumi/pulumi-yaml/pkg/pulumiyaml.testTemplateDiags() /Users/ianwahbe/go/src/github.com/pulumi/pulumi-yaml/pkg/pulumiyaml/run_test.go:274 +0x2e github.com/pulumi/pulumi-yaml/pkg/pulumiyaml.testTemplate() /Users/ianwahbe/go/src/github.com/pulumi/pulumi-yaml/pkg/pulumiyaml/run_test.go:335 +0x44 github.com/pulumi/pulumi-yaml/pkg/pulumiyaml.TestSplit.func1() /Users/ianwahbe/go/src/github.com/pulumi/pulumi-yaml/pkg/pulumiyaml/run_test.go:886 +0x2f5 testing.tRunner() /usr/local/Cellar/go/1.19.2/libexec/src/testing/testing.go:1446 +0x216 testing.(*T).Run.func1() /usr/local/Cellar/go/1.19.2/libexec/src/testing/testing.go:1493 +0x47 ``` Co-authored-by: Ian Wahbe <ian@wahbe.com>
I encountered a problem trying to flush non-determinism out of pulumi-yaml. Running our tests with high
--count=N
values resulted in panics from concurrent accesses toOuptput
values. These changes allow runninggo test -count=500 -race ./...
without errors (caused by pulumi/pulumi).