-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[v1.13 backport] Three fixes for deadlocks #24814
[v1.13 backport] Three fixes for deadlocks #24814
Conversation
[ upstream commit 3067e16 ] This fixes a deadlock with the IPCache and the endpoint mutex where writeHeaderfile calls into the proxy to fetch the endpoint's DNSRules while holding on to the endpoint mutex. That pattern causes a deadlock, because the proxy code itself then calls into IPCache while the endpoint mutex is held. This violates the assumed lock ordering, as we require that the IPCache mutex is always acquired _before_ any endpoint mutex. We fix this deadlock by hoisting call to update the endpoint's DNSRules out of writeHeaderfile and into the syncEndpointHeaderFile function, which is triggered whenever there is a change in DNSRules. Fixes: cilium#23836 Ref: cilium#20915 Co-authored-by: David Bimmler <david.bimmler@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
[ upstream commit d91219a ] GetRules held the DNSProxy lock while accessing IPCache. This established a partial lock order in which DNSProxy > IPCache. (Read DNSProxy > IPCache as "DNSProxy lock held while attempting to acquire IPCache lock." We use '>' as a convenient notation as it intuitively implies transitivity.) Since we also have IPCache > Endpoint (cf InjectLabels), this established DNSProxy > Endpoint by transitivity. That, however, is incompatible with instances where we hold the Endpoint lock while acquiring the DNSProxy lock (as that establishes Endpoint > DNSProxy). Having both EP > DNSProxy and DNSProxy > EP is not allowed and leads to deadlocks. To break the cycle, hoist the IPCache lookups out of the DNSProxy critical section. Without DNSProxy > IPCache, Endpoint > DNSProxy is fine, as there's no more DNSPRoxy > IPCache > Endpoint and wait cycle. Of course, any other instance where X > IPCache and Endpoint > X can still result in a deadlock. Somewhat insidiously, these wait chains can be arbitrarily long, and hence arbitrarily difficult to find. Suggested-by: Sebastian Wicki <sebastian@isovalent.com> Suggested-by: Chris Tarazi <chris@isovalent.com> Suggested-by: Joe Stringer <joe@cilium.io> Signed-off-by: David Bimmler <david.bimmler@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 93cd67f ] We have a partial lock ordering of IPCache > Endpoint as established in InjectLabels, where the IPCache lock is held and all endpoint locks are acquired one by one. On the other hand, GetNamedPorts was called in a context with the Endpoint lock held (cf *Endpoint.GetNamedPortLocked), and attempts to acquire the IPCache lock. This would establish EP > IPCache, which potentially deadlocks with the above. To fix it, remove the requirement for a write lock on IPCache for GetNamedPorts. Instead, we always pre-compute the namedPorts map in Upsert and Delete and simply return it in GetNamedPorts. To ensure atomicity, we use atomics to load and store read-only pointer to the namedPorts map and the needNamedPorts flag. Co-authored-by: Sebastian Wicki <sebastian@isovalent.com> Signed-off-by: David Bimmler <david.bimmler@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
b8e3f84
to
117cc6e
Compare
/test-backport-1.13 Job 'Cilium-PR-K8s-1.20-kernel-4.9' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.20-kernel-4.9/1781/ If it is a flake and a GitHub issue doesn't already exist to track it, comment Job 'Cilium-PR-K8s-1.20-kernel-4.9' failed: Click to show.Test Name
Failure Output
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.20-kernel-4.9/1789/ If it is a flake and a GitHub issue doesn't already exist to track it, comment |
/test-1.20-4.9 Created: #24839 |
test-1.20-4.9 keeps failing with the same error in different tests. Not clear to me if this is due to an unsupported kernel version in the pipeline or if something else is off. |
/test-1.20-4.9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
how do i trigger |
yeah maybe. @joestringer could you check the branch protection roles for v1.13 and see if |
I've removed the requirement for |
Once this PR is merged, you can update the PR labels via: