Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: synchronously verify server preface in newClientTransport #5731

Merged
merged 7 commits into from Oct 20, 2022

Conversation

dfawley
Copy link
Member

@dfawley dfawley commented Oct 18, 2022

Fixes #5385

  • Add grpcsync.OnceFunc, which is a pattern I also needed in the ORCA PR (orca: create ORCA producer for LB policies to use to receive OOB load reports #5669).
  • Fix some transport tests that were not writing a server preface.
  • We had one e2e test that was validating a deadline was set during the handshake. This was moved to be a transport test, and can verify that newClientTransport honors context cancelation in addition to just the deadline. This might make the goal of making ClientConn.Close() block until all resources are released easier -- previously, if a connection was in flight it would take 20+s for it to stop, which would/should delay that return. With these changes it will stop very quickly.

RELEASE NOTES: none

@dfawley dfawley added the Type: Internal Cleanup Refactors, etc label Oct 18, 2022
@dfawley dfawley added this to the 1.51 Release milestone Oct 18, 2022
@dfawley dfawley requested a review from easwars October 18, 2022 20:45

// OnceFunc returns a function wrapping f which ensures f is only executed
// once even if the returned function is executed multiple times.
func OnceFunc(f func()) func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this is super trivial, but given that it is exported and can be used from non-test code, adding a test would be good.

internal/transport/http2_client.go Show resolved Hide resolved
return
}
t.onPrefaceReceipt()
close(errCh) // received settings frame
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is closing this channel here any different to pushing a nil error? I believe not. Should we instead push a nil error here instead of closing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷

They're equivalent. I'm okay either way but this seems fine and I see no compelling arguments in favor of pushing a nil explicitly. Is there a better primitive for this purpose than a channel, perhaps?

WDYT about this (see diffs) refactor into another function?

internal/transport/http2_client.go Show resolved Hide resolved
internal/transport/http2_client.go Show resolved Hide resolved
Comment on lines 226 to 245
// Monitor context; close connection if expired or canceled before returning.
ctxMonitorDone := grpcsync.NewEvent()
newClientCtx, newClientDone := context.WithCancel(connectCtx)
defer func() {
newClientDone()
// Wait for the goroutine to exit. If we do not wait before returning,
// the caller could cancel the connectCtx after we return, but we might
// see this and close the connection.
<-ctxMonitorDone.Done()
}()
go func(conn net.Conn) {
defer ctxMonitorDone.Fire() // Signal this goroutine has exited.
<-newClientCtx.Done()
if connectCtx.Err() == nil {
// Only newClientCtx was canceled; success.
return
}
// connectCtx expired. Hard close the connection.
conn.Close()
}(conn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block of code was not easy to read and parse and understand. All your comments make absolute sense after I understood what the code is doing. But it did not help initially to understand what is happening in here and why.

In fact my other comment (and concern) about the deadline not being set on the underlying conn is cleared now that I understand what is happening in here. But I think if I come back to this piece of code after 6 months, I will again spend a lot of time trying to understand what is happening.

Could you please be a little more verbose in the comments here. Thanks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done?

internal/transport/http2_client.go Show resolved Hide resolved
clientconn.go Show resolved Hide resolved
@easwars easwars assigned dfawley and unassigned easwars Oct 19, 2022
Copy link
Member Author

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. Responses inline.

internal/transport/http2_client.go Show resolved Hide resolved
clientconn.go Show resolved Hide resolved
Comment on lines 226 to 245
// Monitor context; close connection if expired or canceled before returning.
ctxMonitorDone := grpcsync.NewEvent()
newClientCtx, newClientDone := context.WithCancel(connectCtx)
defer func() {
newClientDone()
// Wait for the goroutine to exit. If we do not wait before returning,
// the caller could cancel the connectCtx after we return, but we might
// see this and close the connection.
<-ctxMonitorDone.Done()
}()
go func(conn net.Conn) {
defer ctxMonitorDone.Fire() // Signal this goroutine has exited.
<-newClientCtx.Done()
if connectCtx.Err() == nil {
// Only newClientCtx was canceled; success.
return
}
// connectCtx expired. Hard close the connection.
conn.Close()
}(conn)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done?

internal/transport/http2_client.go Show resolved Hide resolved
internal/transport/http2_client.go Show resolved Hide resolved
internal/transport/http2_client.go Show resolved Hide resolved
return
}
t.onPrefaceReceipt()
close(errCh) // received settings frame
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷

They're equivalent. I'm okay either way but this seems fine and I see no compelling arguments in favor of pushing a nil explicitly. Is there a better primitive for this purpose than a channel, perhaps?

WDYT about this (see diffs) refactor into another function?

@dfawley dfawley assigned easwars and unassigned dfawley Oct 19, 2022
block.Fire() // Unblock them.
wg.Wait() // Wait for them to complete.
if v != 1 {
t.Fatalf("v = %v; want 1", v)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

t.Fatalf("OnceFunc() called %v times, want 1", v)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 1278 to 1279
_, err := sfr.ReadFrame()
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: combine the two lines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

func (t *http2Client) reader(errCh chan<- error) {
defer close(t.readerDone)

t.readServerPreface(errCh)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we instead make readServerPreface() return an error and handle it here (push it on to the channel and return)? Currently, readServerPreface() might have run into an error and pushed something on to the channel. But since we don't have a return value from it, we continue in this function instead of returning early.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly liked it this way because it could defer the channel closure. But we probably should be returning early from this function even though it's okay not to (it will exit when a connection error occurs). Maybe it's not that useful to be a separate function, but I'll leave it. Reworked.

internal/transport/transport_test.go Show resolved Hide resolved
copts := ConnectOptions{ChannelzParentID: channelz.NewIdentifierForTesting(channelz.RefSubChannel, time.Now().Unix(), nil)}
_, err = NewClientTransport(connectCtx, context.Background(), resolver.Address{Addr: lis.Addr().String()}, copts, func(GoAwayReason) {}, func() {})
t.Logf("NewClientTransport() = _, %v", err)
if time.Now().Sub(timeBefore) > 2*time.Second {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this 2s and 1.5s come from. The connectCtx is canceled in 100ms.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed to say "1.9". I tried different values here to balance the test running quickly with not having false failures/successes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand why we want to be so generous with the time it takes to return from NewClientTransport. We cancel the context in 100ms. Do we really have to give it 2s?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tighter you make this check, the more likely you are to have a spurious failure caused by it taking too long to exit, which could be due to a hiccup on github actions / etc. Ultimately, if we require that it is less than the deadline (3s) then we should be able to distinguish the reason it exited. And actually, I'll extend the deadline and loosen this slightly since we're not concerned about how long it takes to work if it fails.

And, since there is no secondary cancelation mechanism in the second case, I just removed the time check entirely; the only failure is if no error is returned or if it never returns and times out because it's not honoring the context.

_, err = NewClientTransport(connectCtx, context.Background(), resolver.Address{Addr: lis.Addr().String()}, copts, func(GoAwayReason) {}, func() {})
t.Logf("NewClientTransport() = _, %v", err)
if time.Now().Sub(timeBefore) > 2*time.Second {
t.Fatalf("NewClientTransport returned > 1.5s after context deadline")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments for this block.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this, too. Couldn't find this comment in the diffs view for some reason.

@easwars easwars assigned dfawley and unassigned easwars Oct 19, 2022
@dfawley dfawley assigned easwars and unassigned dfawley Oct 19, 2022
copts := ConnectOptions{ChannelzParentID: channelz.NewIdentifierForTesting(channelz.RefSubChannel, time.Now().Unix(), nil)}
_, err = NewClientTransport(connectCtx, context.Background(), resolver.Address{Addr: lis.Addr().String()}, copts, func(GoAwayReason) {}, func() {})
t.Logf("NewClientTransport() = _, %v", err)
if time.Now().Sub(timeBefore) > 2*time.Second {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand why we want to be so generous with the time it takes to return from NewClientTransport. We cancel the context in 100ms. Do we really have to give it 2s?


// Test context cancelation.
timeBefore := time.Now()
connectCtx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please use defaultTestTimeout which is defined for this package instead of a hard-coded value. We usually end up with different values in different tests over time. Thanks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. If it ever changes to a value less than 3s then this test becomes broken, though, so it is more sensitive to the value than most other tests.

@easwars easwars assigned dfawley and unassigned easwars Oct 19, 2022
@dfawley dfawley merged commit 9127159 into grpc:master Oct 20, 2022
1 check passed
@dfawley dfawley deleted the syncpreface branch October 25, 2022 21:53
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

transport: perform blocking read of server preface while creating a new client transport
2 participants