Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resolver: improve signaling for missing account lookups #4151

Merged
merged 6 commits into from May 14, 2023

Conversation

wallyqs
Copy link
Member

@wallyqs wallyqs commented May 11, 2023

When using the account resolver and a JWT is not found, the client could often get an i/o timeout error due to not receiving a timely response before the account resolver fetch request times out. In this PR, instead of waiting for the fetch request to timeout, a resolver without JWTs will notify as well that it could not find a matching JWT, waiting for a response from all servers but with a delay to let resolver that do have the JWT to respond first.

Also included in this PR is some cleanup to the logs emitted by the resolver.

  • Tests added
  • Branch rebased on top of current main (git pull --rebase origin main)
  • Changes squashed to a single commit (described here)
  • Build is green in Travis CI
  • You have certified that the contribution is your original work and that you license the work to the project under the Apache 2 license

@wallyqs wallyqs marked this pull request as ready for review May 11, 2023 22:14
@wallyqs wallyqs requested a review from a team as a code owner May 11, 2023 22:14
dr.DirJWTStore.changed = func(pubKey string) {
if v, ok := s.accounts.Load(pubKey); ok {
if theJwt, err := dr.LoadAcc(pubKey); err != nil {
s.Errorf("update got error on load: %v", err)
s.Errorf("RESOLVER - Update got error on load: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have cases where we all caps things similar? Maybe DirResolver?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used all caps so that it would look similar to SYSTEM logs, but I will update to DirResolver instead which would be clearer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to DirResolver to label these instead, logs now look like this: https://gist.github.com/wallyqs/9c968143014f2e5fb61b7d99ec0f5326

server/accounts.go Outdated Show resolved Hide resolved
if theJWT, err := dr.DirJWTStore.LoadAcc(accName); err != nil {
if errors.Is(err, fs.ErrNotExist) {
s.Debugf("DirResolver - Could not find account %q", accName)
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So wondering if this should be a concern. I know why you are putting it into a separate go routine because of the sleep, but this could cause of an explosion of Go routines yes?

Like if we did a tight loop of clients trying to connect with a bad account we would be spinning these in every server for every lookup request no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe what we do for this version of the server, is wait for all servers if we receive a nil response. This way we can send a nil response immediately but on the receiving side we see the nil so wait for more responses. We know how many other servers are in the system internally so we can know when we are done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, I will explore this idea and update. I was concerned about blocking but would now prefer to be cautious with spinning to many go routines here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe do a quick debug.PrintStack() there too to see if we are rooted in a route or GW read..

If we are we can not block for sure, but this approach might work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that approach will work much better and can get faster responses for the happy path when servers online, I updated with this new heuristic.

Signed-off-by: Waldemar Quevedo <wally@nats.io>
Also update logs based on comments

Signed-off-by: Waldemar Quevedo <wally@nats.io>
Signed-off-by: Waldemar Quevedo <wally@nats.io>
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments, we are close.

server/accounts.go Outdated Show resolved Hide resolved
server/accounts.go Show resolved Hide resolved
server/accounts.go Outdated Show resolved Hide resolved
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments, we are close.

Signed-off-by: Waldemar Quevedo <wally@nats.io>
// Resolver will wait for available routes + gateways to reply
// before serving an error in case there weren't any found.
expectedServers := len(s.routes)
if s.gateway != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is always non nil IIRC.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, instead the test could be:

if s.gateway.enabled {
..

There is currently no API to get the number of remotes, you could add one, but otherwise you should lock:

    s.gateway.RLock()
    expectedServers += len(s.gateway.remotes)
    s.gateway.RUnlock()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use s.sys.servers

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to use s.sys.servers to use the detected active servers instead.

@@ -4223,24 +4233,39 @@ func (s *Server) fetch(res AccountResolver, name string, timeout time.Duration)
respC := make(chan []byte, 1)
accountLookupRequest := fmt.Sprintf(accLookupReqSubj, name)
s.mu.Lock()
// Resolver will wait for available routes + gateways to reply
// before serving an error in case there weren't any found.
expectedServers := len(s.routes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The server maps all other servers in a different data struct when it received remote statsz updates etc.
Could use that since if this is an expanded supercluster with leafnodes sharing system accounts this will not account for that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To keep in mind when merging to dev, one then should use s.numRoutes() because it accounts for pooled routes, per-account, etc...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using s.sys.servers instead for the check now.

if s.sys == nil || s.sys.replies == nil {
s.mu.Unlock()
return _EMPTY_, fmt.Errorf("eventing shut down")
}
replySubj := s.newRespInbox()
replies := s.sys.replies

// Store our handler.
replies[replySubj] = func(sub *subscription, _ *client, _ *Account, subject, _ string, msg []byte) {
clone := make([]byte, len(msg))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check for the nil responses here?

var clone []byte
if len(msg) > 0 {
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this check to do the copy when not empty.

Signed-off-by: Waldemar Quevedo <wally@nats.io>
server/accounts.go Outdated Show resolved Hide resolved
clone := make([]byte, len(msg))
copy(clone, msg)
var clone []byte
var isEmpty bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just set this directly..

isEmpty := len(msg)

Signed-off-by: Waldemar Quevedo <wally@nats.io>
@derekcollison derekcollison self-requested a review May 14, 2023 17:27
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wallyqs wallyqs merged commit 3c4ed54 into main May 14, 2023
2 checks passed
@wallyqs wallyqs deleted the jwt-resolver-bad-creds branch May 14, 2023 18:10
ReubenMathew pushed a commit to ReubenMathew/nats-server that referenced this pull request May 30, 2023
When using the nats account resolver and a JWT is not found, the client could
often get an i/o timeout error due to not receiving a timely response
before the account resolver fetch request times out. Now instead
of waiting for the fetch request to timeout, a resolver without JWTs
will notify as well that it could not find a matching JWT, waiting for a
response from all active servers.

Also included in this PR is some cleanup to the logs emitted by the
resolver.

Signed-off-by: Waldemar Quevedo <wally@nats.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants