-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Three fixes for deadlocks #24672
Three fixes for deadlocks #24672
Conversation
433249c
to
62f0948
Compare
/test Job 'Cilium-PR-K8s-1.25-kernel-4.19' failed: Click to show.Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment |
62f0948
to
054b7f6
Compare
/test fixed a legitimate test failure in priv unit tests; retriggering Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed: Click to show.Test Name
Failure Output
If it is a flake and a GitHub issue doesn't already exist to track it, comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
nit: It seems atomic.Pointer
is used after all
/test Rebased on master to get #24389 since helm charts were out of sync. We'll have to rebase again once #24803 is in, since that caused other CI breakage we saw. Also following up with Casey on the |
This fixes a deadlock with the IPCache and the endpoint mutex where writeHeaderfile calls into the proxy to fetch the endpoint's DNSRules while holding on to the endpoint mutex. That pattern causes a deadlock, because the proxy code itself then calls into IPCache while the endpoint mutex is held. This violates the assumed lock ordering, as we require that the IPCache mutex is always acquired _before_ any endpoint mutex. We fix this deadlock by hoisting call to update the endpoint's DNSRules out of writeHeaderfile and into the syncEndpointHeaderFile function, which is triggered whenever there is a change in DNSRules. Fixes: cilium#23836 Ref: cilium#20915 Co-authored-by: David Bimmler <david.bimmler@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
GetRules held the DNSProxy lock while accessing IPCache. This established a partial lock order in which DNSProxy > IPCache. (Read DNSProxy > IPCache as "DNSProxy lock held while attempting to acquire IPCache lock." We use '>' as a convenient notation as it intuitively implies transitivity.) Since we also have IPCache > Endpoint (cf InjectLabels), this established DNSProxy > Endpoint by transitivity. That, however, is incompatible with instances where we hold the Endpoint lock while acquiring the DNSProxy lock (as that establishes Endpoint > DNSProxy). Having both EP > DNSProxy and DNSProxy > EP is not allowed and leads to deadlocks. To break the cycle, hoist the IPCache lookups out of the DNSProxy critical section. Without DNSProxy > IPCache, Endpoint > DNSProxy is fine, as there's no more DNSPRoxy > IPCache > Endpoint and wait cycle. Of course, any other instance where X > IPCache and Endpoint > X can still result in a deadlock. Somewhat insidiously, these wait chains can be arbitrarily long, and hence arbitrarily difficult to find. Suggested-by: Sebastian Wicki <sebastian@isovalent.com> Suggested-by: Chris Tarazi <chris@isovalent.com> Suggested-by: Joe Stringer <joe@cilium.io> Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
We have a partial lock ordering of IPCache > Endpoint as established in InjectLabels, where the IPCache lock is held and all endpoint locks are acquired one by one. On the other hand, GetNamedPorts was called in a context with the Endpoint lock held (cf *Endpoint.GetNamedPortLocked), and attempts to acquire the IPCache lock. This would establish EP > IPCache, which potentially deadlocks with the above. To fix it, remove the requirement for a write lock on IPCache for GetNamedPorts. Instead, we always pre-compute the namedPorts map in Upsert and Delete and simply return it in GetNamedPorts. To ensure atomicity, we use atomics to load and store read-only pointer to the namedPorts map and the needNamedPorts flag. Co-authored-by: Sebastian Wicki <sebastian@isovalent.com> Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
bba2da0
to
5b155ce
Compare
Previous CI was green except ci-datapath and EKS. ci-datapath was broken in cilium-cli, c.f. cilium/cilium-cli@0c03922. EKS was broken due to a known issue on master which just got fixed. Rebased again to pull in the EKS changes. Restarting the tests again. |
/test |
It looks like the ci-datapath issue is being addressed and otherwise the tests are passing & reviews are in. Ready to merge? |
/ci-datapath |
The ci-datapath fix was just merged. Let me try to re-run it (I think it should work without a rebase), to see if we can get all green for increased confidence. But if it's still broken, yeah, let's merge it. |
This is ready to merge. |
This PR collects three commits with deadlock fixes:
writeHeaderFile
. See endpoint, proxy: Fix deadlock in writeHeaderfile #24582 for detailsGetRules
GetNamedPorts
.IPCache.InjectLabels
holds the ipcache mutex while calling intoendpointManager.UpdatePolicyMaps
, which holds each EP lock in turn. On the other hand,GetNamedPorts
holds an EP lock and calls into IPCache.Please review commit by commit, and carefully - all changes are subtle.
Could use some help in word-smithing the release note.
[Marking this as backport/author,
atomic.Bool
and friends aren't available in Go 1.18 for v1.12]Fixes: #23836
Ref: #20915
cc @gandro @joestringer @christarazi