New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix timeouts in HTTP service invocation when resiliency policies with timeouts are applied #7270
Conversation
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
cc0092b
to
8822361
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #7270 +/- ##
==========================================
- Coverage 64.58% 64.56% -0.03%
==========================================
Files 225 225
Lines 21109 21119 +10
==========================================
+ Hits 13633 13635 +2
- Misses 6308 6314 +6
- Partials 1168 1170 +2 ☔ View full report in Codecov by Sentry. |
if errors.As(err, &codeErr) { | ||
if len(codeErr.headers) > 0 { | ||
invokev1.InternalMetadataToHTTPHeader(r.Context(), codeErr.headers, w.Header().Add) | ||
if rResp == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are multiple instances of the function running in parallel? Otherwise, how does execution get to 211 if the first execution returns an error that prevents retries? (I'm sure it's just my lack of experience with Go and the policy runner APIs.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, multiple instances of the function can be running in parallel if one of them times out. When there's a timeout the policy runner cancels the context, which is a request (not an order) to stop processing; at the same time, it invokes the function again.
The reason why we don't have issues with concurrency is the lines just above:
if !success.CompareAndSwap(false, true) {
return rResp, backoff.Permanent(errors.New("already completed"))
}
This code uses an atomic compare-and-swap:
- if
success
is false (initial value), then its value is changed to true (atomically) and the function returns true - if
success
is already true, then the function returns false, so we return a permanent error
In all the lines below that, we return all errors as backoff.Permanent
, which makes the policy runner not retry in case of errors.
If the retry was due to a timeout, instead, it would hit the compare-and-swap and return right away. In this case we can't go past this if
block because we have already started sending data to the client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@ItalyPaleAle before we merge, can you please look into intg tests failing? I re-ran several times |
It looks like tests that fail are mostly actors-related, the same ones that are failing in master. I've re-run the unit tests, but they seemed unrelated too. Is there any test in particular I should be concerned with? |
… timeouts are applied (dapr#7270) * Fix race condition in policy runner when there's a timeout Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com> * Better way to fix the error Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com> --------- Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com> Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com>
…y policies with timeouts are applied Backport of dapr#7270 Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
…y policies with timeouts are applied (#7310) * [release-1.12] Fix timeouts in HTTP service invocation when resiliency policies with timeouts are applied Backport of #7270 Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com> * Added release notes Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com> --------- Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Fixes #7173
Fixes #7145
When resiliency policies with a timeout are applied, HTTP service invocation could fail with a truncated response and this error:
This was reproducible mostly on faster machines, indicating it was a timing issue.
The root cause of the issue was due to the fact that in HTTP service invocation, we pass to the
Invoke
method a context that is tied to the resiliency policy runner's context. This meant that when the policy function returns, the context passed to theInvoke
method is canceled, even if the request has not timed out. However, with streaming, the context has to remain valid outside of the policy runner function, to be able to respond to the caller.The fix involves sending the response within the policy function. This way, the context is still valid throughout that time, and the timeout applies to the entire response.
The fix was validated with the repro code in #7173, which at least on my M1 Pro Mac allowed reproducing the error reliably.
It is sadly not possible to write a test for this, because it was a timing bug that is hard to reproduce deterministically (and it mostly appears on fast machines, so not on GH Actions). In fact, our E2E tests already have test cases for service invocation with resiliency, and never detected this issue.