Sotw v3 server causing deadlock in LinearCache #530

Open

rueian opened this issue Dec 21, 2021 · 8 comments

rueian commented Dec 21, 2021

The LinearCache clears the watches registered under the names of the changed resources in notifyAll calls (L147):

func (cache *LinearCache) notifyAll(modified map[string]struct{}) {
    // de-duplicate watches that need to be responded
    notifyList := make(map[chan Response][]string)
    for name := range modified {
        for watch := range cache.watches[name] {
            notifyList[watch] = append(notifyList[watch], name)
        }
        delete(cache.watches, name)
    }
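
For context, the watch registry iterated here is keyed by resource name first and by response channel second. A rough sketch of that shape, inferred from the snippet above rather than copied from the source:

package sketch

import "sync"

// Response stands in for the go-control-plane cache.Response interface.
type Response interface{}

// Inferred shape of the LinearCache watch bookkeeping: one set of channels per
// resource name. delete(cache.watches, name) therefore removes only the
// registrations under that single name; the same channel can stay registered
// under any other name it was watching.
type LinearCache struct {
    mu      sync.Mutex
    watches map[string]map[chan Response]struct{}
}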

However, removing only the watches registered under the modified names is not enough: every watch pointing at the same chan Response needs to be removed, because the sotw v3 server creates each chan Response with a buffer of only 1 (L369):

// cancel existing watches to (re-)request a newer version
switch {
case req.TypeUrl == resource.EndpointType:
    if values.endpointNonce == "" || values.endpointNonce == nonce {
        if values.endpointCancel != nil {
            values.endpointCancel()
        }
        values.endpoints = make(chan cache.Response, 1)
        values.endpointCancel = s.cache.CreateWatch(req, streamState, values.endpoints)
    }

Consider the following sequence:

  1. The sotw server receives a DiscoveryRequest with 2 resource names and calls cache.CreateWatch on the LinearCache.
  2. The LinearCache registers the chan Response provided by the sotw server under 2 watch entries, one per requested resource.
  3. The LinearCache's UpdateResource is called with the first resource name, and a response is pushed into the channel's single buffer slot.
  4. The LinearCache's UpdateResource is called with the second resource name; the channel is already full, so the notify blocks while the cache mutex is still held.
  5. The sotw server receives another DiscoveryRequest and calls back into the LinearCache, but the cache is still locked, so the server and the cache deadlock (see the sketch after this list).
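
A minimal, self-contained sketch of that failure mode. It mimics only the relevant pattern, a notify performed while holding a mutex into a channel with a single buffer slot whose only reader also needs that mutex; it is not the go-control-plane code. Running it ends with the Go runtime's "all goroutines are asleep - deadlock!" error:

package main

import (
    "fmt"
    "sync"
    "time"
)

type response string

type miniCache struct {
    mu    sync.Mutex
    watch chan response // capacity 1, like the sotw server's chan Response
}

// update mimics UpdateResource: it notifies the watch while holding the lock.
func (c *miniCache) update(name string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.watch <- response(name) // blocks once the single buffer slot is full
}

func main() {
    c := &miniCache{watch: make(chan response, 1)}

    // "Management plane" goroutine: two watched resources change back to back.
    go func() {
        c.update("resource-a") // fills the buffer slot
        c.update("resource-b") // blocks on the send while still holding c.mu
    }()

    time.Sleep(100 * time.Millisecond)

    // "Sotw server" goroutine: a new DiscoveryRequest arrives, so it tries to
    // touch the cache again, which needs the same lock. It is also the only
    // reader of c.watch, so neither side can ever make progress.
    c.mu.Lock()
    fmt.Println(<-c.watch)
    c.mu.Unlock()
}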

The LinearCache could maintain an additional chan Response -> resource names map so notifyAll can cheaply remove every watch that shares a channel. But I think the root cause is that the sotw server uses a single goroutine to handle both directions of the bidirectional gRPC stream.

If it handled them in separate goroutines, there would be no such deadlock, and there might even be no need to recreate the chan Response on every DiscoveryRequest or to unregister watches on every notifyAll call in the LinearCache.
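
A sketch of the first workaround mentioned above, a reverse index from channel to resource names so the cache can drop every registration that shares a channel in one pass. This is only an illustration of the idea, not a tested patch; the type and field names are hypothetical:

package sketch

import "sync"

// Response stands in for the go-control-plane cache.Response interface.
type Response interface{}

type LinearCache struct {
    mu         sync.Mutex
    watches    map[string]map[chan Response]struct{} // resource name -> channels
    watchNames map[chan Response][]string            // channel -> resource names
}

// removeWatch drops every registration of ch, not only the ones under the
// currently modified names, so a later notify can never hit a channel whose
// single buffer slot is already occupied by a stale registration.
func (cache *LinearCache) removeWatch(ch chan Response) {
    for _, name := range cache.watchNames[ch] {
        delete(cache.watches[name], ch)
        if len(cache.watches[name]) == 0 {
            delete(cache.watches, name)
        }
    }
    delete(cache.watchNames, ch)
}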


rueian commented Dec 22, 2021

I have opened #531 to address this.


This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label Jan 21, 2022

rueian commented Jan 21, 2022

no stalebot

github-actions bot removed the stale label Jan 21, 2022

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label Feb 20, 2022

rueian commented Feb 20, 2022

no stalebot

github-actions bot removed the stale label Feb 20, 2022

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label Mar 23, 2022

rueian commented Mar 23, 2022

no stalebot

github-actions bot removed the stale label Mar 23, 2022
valerian-roche (Contributor) commented

I believe this issue has been solved in recent reworks changing the channel model of the server and cache. Can you confirm if this issue is still present in recent versions?
