-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix deadlock during NRI plugin registration #79
Conversation
What are you looking for to make this non-WIP? |
I think this should be fine. |
During NRI external plugin registration: * `acceptPluginConnections()` is called * adaptation lock from `nri/pkg/adaptation` is acquired * `syncFn` is invoked * `syncFn` acquires NRI lock in `pkg/nri/nri.go` During container lifecycle events such as `ContainerStart` * NRI lock is acquired in pkg/nri.go * adaptation lock is acquired in `StateChange()` in `nri/pkg/adaptation` As a result, the locking order during NRI plugin registration is: * adaptation lock -> NRI lock While the locking order during container starts is: * NRI lock -> adaptation lock Due the fact that the locking order is inverted and not consistent, it it possible to encounter a deadlock. To fix the issue, during NRI plugin registration, first acquire the NRI lock (done via `syncFn` call) and only after acquire the adaptation lock. This ensures that NRI plugin registration the locking order is adaption lock -> NRI lock, which is consistent with the locking order during container lifecycle events. Fixes containerd/containerd#10085 Signed-off-by: David Porter <porterdavid@google.com>
I removed WIP, wanted to test it again a few times against the repro in containerd/containerd#10085 (comment) After a few runs of those steps, I can confirm that without this patch the deadlock always repros and with this patch I am no longer able to repro the deadlock. |
During NRI external plugin registration:
nri/pkg/adaptation
is acquiredsyncFn
is invokedsyncFn
acquires NRI lock inpkg/nri/nri.go
During container lifecycle events such as
ContainerStart
StateChange()
innri/pkg/adaptation
As a result, the locking order during NRI plugin registration is:
While the locking order during container starts is:
Due the fact that the locking order is inverted and not consistent, it it possible to encounter a deadlock.
To fix the issue, during NRI plugin registration, first acquire the NRI lock (done via
syncFn
call) and only after acquire the adaptation lock. This ensures that NRI plugin registration the locking order is adaption lock -> NRI lock, which is consistent with the locking order during container lifecycle events.Fixes containerd/containerd#10085