
Fix timeouts in HTTP service invocation when resiliency policies with timeouts are applied #7270

Merged
merged 4 commits into dapr:master on Dec 11, 2023

Conversation

ItalyPaleAle
Contributor

Fixes #7173
Fixes #7145

When resiliency policies with a timeout are applied, HTTP service invocation could fail with a truncated response and this error:

{"errorCode":"ERR_DIRECT_INVOKE","message":"failed to invoke, id: back-end, err: error receiving message: rpc error: code = Canceled desc = context canceled"}

This was reproducible mostly on faster machines, indicating it was a timing issue.

The root cause was that, in HTTP service invocation, we pass the Invoke method a context that is tied to the resiliency policy runner's context. When the policy function returns, that context is canceled even if the request has not timed out. With streaming, however, the context has to remain valid after the policy runner function returns, so that we can respond to the caller.

The fix involves sending the response within the policy function. This way, the context is still valid throughout that time, and the timeout applies to the entire response.
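As a rough, minimal sketch of the pattern (this is not the actual Dapr code; runPolicyWithTimeout, invoke, and the handler below are hypothetical stand-ins), the response is copied to the client inside the function executed by the policy runner, while the runner's context is still alive:

package main

import (
	"context"
	"errors"
	"io"
	"net/http"
	"strings"
	"time"
)

// runPolicyWithTimeout mimics a resiliency policy runner: fn runs with a
// context that is canceled as soon as fn returns or the timeout elapses.
func runPolicyWithTimeout(parent context.Context, timeout time.Duration, fn func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel() // after fn returns, this context is no longer usable
	return fn(ctx)
}

// invoke simulates service invocation returning a response body tied to ctx.
func invoke(ctx context.Context) (io.ReadCloser, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return io.NopCloser(strings.NewReader("hello from back-end")), nil
}

func handler(w http.ResponseWriter, r *http.Request) {
	// Previously the body was returned out of the policy function and copied
	// to the client afterwards, when the runner's context could already be
	// canceled. Copying it inside the function keeps the context valid and
	// makes the timeout cover the entire response.
	err := runPolicyWithTimeout(r.Context(), 5*time.Second, func(ctx context.Context) error {
		body, err := invoke(ctx)
		if err != nil {
			return err
		}
		defer body.Close()
		_, err = io.Copy(w, body) // streamed while ctx is still valid
		return err
	})
	if err != nil && !errors.Is(err, context.Canceled) {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

func main() {
	http.HandleFunc("/invoke", handler)
	_ = http.ListenAndServe(":8080", nil)
}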

The fix was validated with the repro code in #7173, which at least on my M1 Pro Mac allowed reproducing the error reliably.

It is sadly not possible to write a test for this, because it was a timing bug that is hard to reproduce deterministically (and it mostly appears on fast machines, so not on GH Actions). In fact, our E2E tests already have test cases for service invocation with resiliency, and never detected this issue.

Additionally, this PR fixes a race condition in pkg/resiliency/policy.go, which could have caused a goroutine leak (due to writing on a channel with no reader) if an operation completed immediately after the timeout.
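The actual race in pkg/resiliency/policy.go is not reproduced here, but as a generic sketch of that failure mode and of one common way to avoid it, a result channel with capacity 1 lets an operation that finishes just after the timeout still deliver its result and exit, instead of blocking forever on a send nobody reads:

package main

import (
	"fmt"
	"time"
)

func runWithTimeout(timeout time.Duration, op func() error) error {
	// Capacity 1: the goroutine can always deliver its result, even if the
	// caller already returned because of the timeout, so it never leaks.
	// With an unbuffered channel the late send would block forever.
	done := make(chan error, 1)

	go func() {
		done <- op()
	}()

	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("operation timed out after %s", timeout)
	}
}

func main() {
	err := runWithTimeout(100*time.Millisecond, func() error {
		time.Sleep(150 * time.Millisecond) // completes just after the timeout
		return nil
	})
	fmt.Println(err)
}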

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
@ItalyPaleAle ItalyPaleAle requested review from a team as code owners December 5, 2023 19:10

codecov bot commented Dec 5, 2023

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (ed34172) 64.58% compared to head (6f36749) 64.56%.

Files                              Patch %   Lines
pkg/http/api_directmessaging.go    82.35%    3 Missing and 3 partials ⚠️
pkg/resiliency/policy.go           66.66%    4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7270      +/-   ##
==========================================
- Coverage   64.58%   64.56%   -0.03%     
==========================================
  Files         225      225              
  Lines       21109    21119      +10     
==========================================
+ Hits        13633    13635       +2     
- Misses       6308     6314       +6     
- Partials     1168     1170       +2     


if errors.As(err, &codeErr) {
if len(codeErr.headers) > 0 {
invokev1.InternalMetadataToHTTPHeader(r.Context(), codeErr.headers, w.Header().Add)
if rResp == nil {
Contributor
Are multiple instances of the function running in parallel? Otherwise, how does execution get to 211 if the first execution returns an error that prevents retries? (I'm sure it's just my lack of experience with Go and the policy runner APIs.)

Contributor Author

Yes, multiple instances of the function can be running in parallel if one of them times out. When there's a timeout the policy runner cancels the context, which is a request (not an order) to stop processing; at the same time, it invokes the function again.

The reason why we don't have issues with concurrency is the lines just above:

if !success.CompareAndSwap(false, true) {
	return rResp, backoff.Permanent(errors.New("already completed"))
}

This code uses an atomic compare-and-swap:

  • if success is false (initial value), then its value is changed to true (atomically) and the function returns true
  • if success is already true, then the function returns false, so we return a permanent error

In all the lines below that, we return all errors as backoff.Permanent, which makes the policy runner not retry in case of errors.

If instead the retry was triggered by a timeout, it hits the compare-and-swap and returns right away: we can't go past this if block, because we have already started sending data to the client.
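As a small, standalone illustration of this guard (hypothetical code, not the snippet from the PR), only the attempt that wins the compare-and-swap gets to produce the response; a second attempt started by the runner after a timeout backs out immediately:

package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var success atomic.Bool
	var wg sync.WaitGroup

	attempt := func() error {
		// Atomically flips success from false to true; only the first caller
		// sees true and is allowed to start sending data to the client.
		if !success.CompareAndSwap(false, true) {
			// Another attempt already owns the response; in the policy runner
			// this would be wrapped in backoff.Permanent to stop retries.
			return errors.New("already completed")
		}
		return nil
	}

	// Simulate two attempts running in parallel after a timeout-triggered retry.
	for i := 1; i <= 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			fmt.Printf("attempt %d: %v\n", id, attempt())
		}(i)
	}
	wg.Wait()
}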

@ItalyPaleAle ItalyPaleAle added the autoupdate DaprBot will keep the Pull Request up to date with master branch label Dec 6, 2023
@daixiang0 daixiang0 (Member) left a comment

LGTM

yaron2 (Member) commented Dec 6, 2023

@ItalyPaleAle before we merge, can you please look into intg tests failing? I re-ran several times

ItalyPaleAle (Contributor Author)

> @ItalyPaleAle before we merge, can you please look into intg tests failing? I re-ran several times

It looks like the failing tests are mostly actors-related, the same ones that are failing in master. I've re-run the unit tests, and those failures seemed unrelated too.

Is there any test in particular I should be concerned with?

@yaron2 yaron2 merged commit 53ceb4a into dapr:master Dec 11, 2023
19 of 22 checks passed
JoshVanL pushed a commit to JoshVanL/dapr that referenced this pull request Dec 14, 2023
… timeouts are applied (dapr#7270)

* Fix race condition in policy runner when there's a timeout

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* Better way to fix the error

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

---------

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com>
ItalyPaleAle added a commit to ItalyPaleAle/dapr that referenced this pull request Dec 16, 2023
…y policies with timeouts are applied

Backport of dapr#7270

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
yaron2 pushed a commit that referenced this pull request Dec 16, 2023
…y policies with timeouts are applied (#7310)

* [release-1.12] Fix timeouts in HTTP service invocation when resiliency policies with timeouts are applied

Backport of #7270

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

* Added release notes

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>

---------

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
@JoshVanL JoshVanL added this to the v1.13 milestone Feb 12, 2024