
Fix timeouts in HTTP service invocation when resiliency policies with timeouts are applied #7270

Merged
merged 4 commits into dapr:master on Dec 11, 2023

Conversation

ItalyPaleAle
Contributor

Fixes #7173
Fixes #7145

When resiliency policies with a timeout are applied, HTTP service invocation could fail with a truncated response and this error:

{"errorCode":"ERR_DIRECT_INVOKE","message":"failed to invoke, id: back-end, err: error receiving message: rpc error: code = Canceled desc = context canceled"}

This was reproducible mostly on faster machines, indicating it was a timing issue.

The root cause was that, in HTTP service invocation, we pass the Invoke method a context that is tied to the resiliency policy runner's context. When the policy function returns, that context is canceled even if the request has not timed out. With streaming, however, the context has to remain valid after the policy runner function returns, so that we can respond to the caller.

The fix involves sending the response within the policy function. This way, the context is still valid throughout that time, and the timeout applies to the entire response.
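As a rough, minimal sketch of the pattern (this is not the actual Dapr code; runPolicyWithTimeout, invoke, and the handler below are hypothetical stand-ins), the response is copied to the client inside the function executed by the policy runner, while the runner's context is still alive:

package main

import (
	"context"
	"errors"
	"io"
	"net/http"
	"strings"
	"time"
)

// runPolicyWithTimeout mimics a resiliency policy runner: fn runs with a
// context that is canceled as soon as fn returns or the timeout elapses.
func runPolicyWithTimeout(parent context.Context, timeout time.Duration, fn func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel() // after fn returns, this context is no longer usable
	return fn(ctx)
}

// invoke simulates service invocation returning a response body tied to ctx.
func invoke(ctx context.Context) (io.ReadCloser, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return io.NopCloser(strings.NewReader("hello from back-end")), nil
}

func handler(w http.ResponseWriter, r *http.Request) {
	// Previously the body was returned out of the policy function and copied
	// to the client afterwards, when the runner's context could already be
	// canceled. Copying it inside the function keeps the context valid and
	// makes the timeout cover the entire response.
	err := runPolicyWithTimeout(r.Context(), 5*time.Second, func(ctx context.Context) error {
		body, err := invoke(ctx)
		if err != nil {
			return err
		}
		defer body.Close()
		_, err = io.Copy(w, body) // streamed while ctx is still valid
		return err
	})
	if err != nil && !errors.Is(err, context.Canceled) {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

func main() {
	http.HandleFunc("/invoke", handler)
	_ = http.ListenAndServe(":8080", nil)
}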

The fix was validated with the repro code in #7173, which at least on my M1 Pro Mac allowed reproducing the error reliably.

It is sadly not possible to write a test for this, because it was a timing bug that is hard to reproduce deterministically (and it mostly appears on fast machines, so not on GH Actions). In fact, our E2E tests already have test cases for service invocation with resiliency, and never detected this issue.

Additionally, this PR fixes a race condition in pkg/resiliency/policy.go, which could have caused a goroutine leak (due to writing on a channel with no reader) if an operation completed immediately after the timeout.
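The actual race in pkg/resiliency/policy.go is not reproduced here, but as a generic sketch of that failure mode and of one common way to avoid it, a result channel with capacity 1 lets an operation that finishes just after the timeout still deliver its result and exit, instead of blocking forever on a send nobody reads:

package main

import (
	"fmt"
	"time"
)

func runWithTimeout(timeout time.Duration, op func() error) error {
	// Capacity 1: the goroutine can always deliver its result, even if the
	// caller already returned because of the timeout, so it never leaks.
	// With an unbuffered channel the late send would block forever.
	done := make(chan error, 1)

	go func() {
		done <- op()
	}()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("operation timed out after %s", timeout)
	}
}

func main() {
	err := runWithTimeout(100*time.Millisecond, func() error {
		time.Sleep(150 * time.Millisecond) // completes just after the timeout
		return nil
	})
	fmt.Println(err)
}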

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
@ItalyPaleAle ItalyPaleAle requested review from a team as code owners December 5, 2023 19:10

codecov bot commented Dec 5, 2023

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (ed34172) 64.58% compared to head (6f36749) 64.56%.

Files                              Patch %   Lines
pkg/http/api_directmessaging.go    82.35%    3 Missing and 3 partials ⚠️
pkg/resiliency/policy.go           66.66%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7270      +/-   ##
==========================================
- Coverage   64.58%   64.56%   -0.03%     
==========================================
  Files         225      225              
  Lines       21109    21119      +10     
==========================================
+ Hits        13633    13635       +2     
- Misses       6308     6314       +6     
- Partials     1168     1170       +2     


if errors.As(err, &codeErr) {
if len(codeErr.headers) > 0 {
invokev1.InternalMetadataToHTTPHeader(r.Context(), codeErr.headers, w.Header().Add)
if rResp == nil {
Contributor
Are multiple instances of the function running in parallel? Otherwise, how does execution get to 211 if the first execution returns an error that prevents retries? (I'm sure it's just my lack of experience with Go and the policy runner APIs.)

Contributor Author

Yes, multiple instances of the function can be running in parallel if one of them times out. When there's a timeout the policy runner cancels the context, which is a request (not an order) to stop processing; at the same time, it invokes the function again.

The reason why we don't have issues with concurrency is the lines just above:

if !success.CompareAndSwap(false, true) {
	return rResp, backoff.Permanent(errors.New("already completed"))
}

This code uses an atomic compare-and-swap:

  • if success is false (initial value), then its value is changed to true (atomically) and the function returns true
  • if success is already true, then the function returns false, so we return a permanent error

In all the lines below that, we return all errors as backoff.Permanent, which makes the policy runner not retry in case of errors.

If instead the retry was triggered by a timeout, it hits the compare-and-swap and returns right away: we can't go past this if block, because we have already started sending data to the client.
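As a small, standalone illustration of this guard (hypothetical code, not the snippet from the PR), only the attempt that wins the compare-and-swap gets to produce the response; a second attempt started by the runner after a timeout backs out immediately:

package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var success atomic.Bool
	var wg sync.WaitGroup

	attempt := func() error {
		// Atomically flips success from false to true; only the first caller
		// sees true and is allowed to start sending data to the client.
		if !success.CompareAndSwap(false, true) {
			// Another attempt already owns the response; in the policy runner
			// this would be wrapped in backoff.Permanent to stop retries.
			return errors.New("already completed")
		}
		return nil
	}

	// Simulate two attempts running in parallel after a timeout-triggered retry.
	for i := 1; i <= 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			fmt.Printf("attempt %d: %v\n", id, attempt())
		}(i)
	}
	wg.Wait()
}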

@ItalyPaleAle ItalyPaleAle added the autoupdate DaprBot will keep the Pull Request up to date with master branch label Dec 6, 2023
@daixiang0 daixiang0 (Member) left a comment

LGTM

yaron2 (Member) commented Dec 6, 2023

@ItalyPaleAle before we merge, can you please look into intg tests failing? I re-ran several times

ItalyPaleAle (Contributor Author)

> @ItalyPaleAle before we merge, can you please look into intg tests failing? I re-ran several times

It looks like the failing tests are mostly actors-related, the same ones that are failing in master. I've re-run the unit tests, and those failures seemed unrelated too.

Is there any test in particular I should be concerned with?

@yaron2 yaron2 merged commit 53ceb4a into dapr:master Dec 11, 2023
19 of 22 checks passed
JoshVanL pushed a commit to JoshVanL/dapr that referenced this pull request Dec 14, 2023
… timeouts are applied (dapr#7270)

* Fix race condition in policy runner when there's a timeout

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* Better way to fix the error

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

---------

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com>
ItalyPaleAle added a commit to ItalyPaleAle/dapr that referenced this pull request Dec 16, 2023
…y policies with timeouts are applied

Backport of dapr#7270

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
yaron2 pushed a commit that referenced this pull request Dec 16, 2023
…y policies with timeouts are applied (#7310)

* [release-1.12] Fix timeouts in HTTP service invocation when resiliency policies with timeouts are applied

Backport of #7270

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* Added release notes

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

---------

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
@JoshVanL JoshVanL added this to the v1.13 milestone Feb 12, 2024