New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
resolver: improve signaling for missing account lookups #4151
Conversation
server/accounts.go
Outdated
dr.DirJWTStore.changed = func(pubKey string) { | ||
if v, ok := s.accounts.Load(pubKey); ok { | ||
if theJwt, err := dr.LoadAcc(pubKey); err != nil { | ||
s.Errorf("update got error on load: %v", err) | ||
s.Errorf("RESOLVER - Update got error on load: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have cases where we all caps things similar? Maybe DirResolver?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used all caps so that it would look similar to SYSTEM
logs, but I will update to DirResolver
instead which would be clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to DirResolver to label these instead, logs now look like this: https://gist.github.com/wallyqs/9c968143014f2e5fb61b7d99ec0f5326
6dcdd7e
to
f58d3d0
Compare
server/accounts.go
Outdated
if theJWT, err := dr.DirJWTStore.LoadAcc(accName); err != nil { | ||
if errors.Is(err, fs.ErrNotExist) { | ||
s.Debugf("DirResolver - Could not find account %q", accName) | ||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So wondering if this should be a concern. I know why you are putting it into a separate go routine because of the sleep, but this could cause of an explosion of Go routines yes?
Like if we did a tight loop of clients trying to connect with a bad account we would be spinning these in every server for every lookup request no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe what we do for this version of the server, is wait for all servers if we receive a nil response. This way we can send a nil response immediately but on the receiving side we see the nil so wait for more responses. We know how many other servers are in the system internally so we can know when we are done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, I will explore this idea and update. I was concerned about blocking but would now prefer to be cautious with spinning to many go routines here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe do a quick debug.PrintStack() there too to see if we are rooted in a route or GW read..
If we are we can not block for sure, but this approach might work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that approach will work much better and can get faster responses for the happy path when servers online, I updated with this new heuristic.
Signed-off-by: Waldemar Quevedo <wally@nats.io>
Also update logs based on comments Signed-off-by: Waldemar Quevedo <wally@nats.io>
Signed-off-by: Waldemar Quevedo <wally@nats.io>
3c0ee93
to
94b8d0b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, we are close.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, we are close.
Signed-off-by: Waldemar Quevedo <wally@nats.io>
0c3144a
to
13db582
Compare
server/accounts.go
Outdated
// Resolver will wait for available routes + gateways to reply | ||
// before serving an error in case there weren't any found. | ||
expectedServers := len(s.routes) | ||
if s.gateway != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is always non nil IIRC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, instead the test could be:
if s.gateway.enabled {
..
There is currently no API to get the number of remotes, you could add one, but otherwise you should lock:
s.gateway.RLock()
expectedServers += len(s.gateway.remotes)
s.gateway.RUnlock()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use s.sys.servers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to use s.sys.servers
to use the detected active servers instead.
server/accounts.go
Outdated
@@ -4223,24 +4233,39 @@ func (s *Server) fetch(res AccountResolver, name string, timeout time.Duration) | |||
respC := make(chan []byte, 1) | |||
accountLookupRequest := fmt.Sprintf(accLookupReqSubj, name) | |||
s.mu.Lock() | |||
// Resolver will wait for available routes + gateways to reply | |||
// before serving an error in case there weren't any found. | |||
expectedServers := len(s.routes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The server maps all other servers in a different data struct when it received remote statsz updates etc.
Could use that since if this is an expanded supercluster with leafnodes sharing system accounts this will not account for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To keep in mind when merging to dev
, one then should use s.numRoutes() because it accounts for pooled routes, per-account, etc...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using s.sys.servers instead for the check now.
server/accounts.go
Outdated
if s.sys == nil || s.sys.replies == nil { | ||
s.mu.Unlock() | ||
return _EMPTY_, fmt.Errorf("eventing shut down") | ||
} | ||
replySubj := s.newRespInbox() | ||
replies := s.sys.replies | ||
|
||
// Store our handler. | ||
replies[replySubj] = func(sub *subscription, _ *client, _ *Account, subject, _ string, msg []byte) { | ||
clone := make([]byte, len(msg)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check for the nil responses here?
var clone []byte
if len(msg) > 0 {
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this check to do the copy when not empty.
Signed-off-by: Waldemar Quevedo <wally@nats.io>
server/accounts.go
Outdated
clone := make([]byte, len(msg)) | ||
copy(clone, msg) | ||
var clone []byte | ||
var isEmpty bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just set this directly..
isEmpty := len(msg)
Signed-off-by: Waldemar Quevedo <wally@nats.io>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
When using the nats account resolver and a JWT is not found, the client could often get an i/o timeout error due to not receiving a timely response before the account resolver fetch request times out. Now instead of waiting for the fetch request to timeout, a resolver without JWTs will notify as well that it could not find a matching JWT, waiting for a response from all active servers. Also included in this PR is some cleanup to the logs emitted by the resolver. Signed-off-by: Waldemar Quevedo <wally@nats.io>
When using the account resolver and a JWT is not found, the client could often get an i/o timeout error due to not receiving a timely response before the account resolver fetch request times out. In this PR, instead of waiting for the fetch request to timeout, a resolver without JWTs will notify as well that it could not find a matching JWT, waiting for a response from all servers
but with a delay to let resolver that do have the JWT to respond first.Also included in this PR is some cleanup to the logs emitted by the resolver.
git pull --rebase origin main
)