identities: fix incorrectly overwriting the labels of kube-apiserver IPs #25241
Conversation
The label injection controller is responsible for iterating through the ipcache.IdentityMetadata map, potentially allocating new identities in case of label changes. The new identity can be enriched with additional labels besides the ones present in the metadata map. This happens, in particular, when the identity to be allocated corresponds to the IP of the kube-apiserver and that IP does not belong to any of the cluster nodes (e.g., in cloud-provider environments). In this case, the additional labels represent the CIDR itself and all broader CIDRs which include it.

In such a scenario, there is currently a race condition which can cause the additional CIDR labels to be disregarded, leading to incorrect enforcement of ToCIDR policies matching the IPs of the kube-apiserver (hence, the corresponding traffic is incorrectly dropped). When this issue occurs, the policy selector for a broader CIDR does not include the identities corresponding to the kube-apiserver IP addresses, and affected agents report a warning message along the lines of (all intermediate CIDR labels are omitted for brevity):

    level=warning msg="UpdateIdentities: Updating an existing identity" identity=16777218 labels="[cidr:0.0.0.0/0 ... cidr:10.2.0.6/32 reserved:kube-apiserver reserved:world]" labels(new)="[reserved:kube-apiserver]" subsys=policy

Conversely, the warning message is benign when the contents of "labels" and "labels(new)" are swapped (i.e., "labels(new)" includes all CIDR labels in addition to the reserved kube-apiserver one).

In more detail, the race condition is the following:

* The prefix corresponding to the IP address of the kube-apiserver is processed by ipc.InjectLabels; in terms of metadata, it is associated with the reserved:kube-apiserver label only [1].
* A new local identity gets allocated for the given prefix, after enriching the labels with the ones representing CIDR information (ipc.injectLabelsForCIDR) [2]. The identity creation causes an event to be written to the event channel [3] (with the new ID and the full list of labels), asynchronously triggering sc.UpdateIdentities [4].
* Once all identities have been allocated, ipc.InjectLabels calls ipc.UpdatePolicyMaps with the set of updated identities and the labels retrieved from the metadata [5,6]. Under the hood, this causes sc.UpdateIdentities to be executed. If this happens after the asynchronous execution triggered by the update above, the current labels (i.e., reserved:kube-apiserver only) overwrite the full list previously configured, causing the issue. Conversely, if the order is reversed, all labels are eventually present.

The issue is fixed by propagating the enriched labels associated with the newly allocated identity to ipc.UpdatePolicyMaps, rather than only the ones present in the metadata map (a simplified model of the race is sketched after the references below).
[1]: https://github.com/cilium/cilium/blob/097fba3182f0f67fe0c6acab9cf5f1a500af36d2/pkg/ipcache/metadata.go#L198
[2]: https://github.com/cilium/cilium/blob/097fba3182f0f67fe0c6acab9cf5f1a500af36d2/pkg/ipcache/metadata.go#L431-L432
[3]: https://github.com/cilium/cilium/blob/097fba3182f0f67fe0c6acab9cf5f1a500af36d2/pkg/identity/cache/local.go#L108-L112
[4]: https://github.com/cilium/cilium/blob/097fba3182f0f67fe0c6acab9cf5f1a500af36d2/pkg/identity/cache/cache.go#L187
[5]: https://github.com/cilium/cilium/blob/097fba3182f0f67fe0c6acab9cf5f1a500af36d2/pkg/ipcache/metadata.go#L232
[6]: https://github.com/cilium/cilium/blob/097fba3182f0f67fe0c6acab9cf5f1a500af36d2/pkg/ipcache/metadata.go#L266

Fixes: cilium#24401

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
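To make the ordering dependency concrete, below is a minimal, self-contained Go sketch of the race. It is not Cilium code: the selectorCache type, the event channel, and the hard-coded prefix, identity, and label values are hypothetical stand-ins for ipc.InjectLabels, the identity allocator, and sc.UpdateIdentities, chosen only to model the "last UpdateIdentities call wins" behavior described above.

```go
// A toy model of the label-overwrite race; NOT actual Cilium code.
package main

import (
	"fmt"
	"sync"
)

// selectorCache mimics the policy selector cache: the last UpdateIdentities
// call for a given numeric identity determines its label set.
type selectorCache struct {
	mu     sync.Mutex
	labels map[int][]string
}

func (sc *selectorCache) UpdateIdentities(id int, labels []string) {
	sc.mu.Lock()
	defer sc.mu.Unlock()
	sc.labels[id] = labels
}

func main() {
	sc := &selectorCache{labels: map[int][]string{}}

	// Metadata map: the kube-apiserver prefix is associated with the
	// reserved label only, as in ipcache.IdentityMetadata.
	metadata := map[string][]string{
		"10.2.0.6/32": {"reserved:kube-apiserver"},
	}

	// Identity allocation enriches the metadata labels with the CIDR labels
	// (intermediate CIDRs omitted) and emits an asynchronous event.
	enriched := append([]string{"cidr:0.0.0.0/0", "cidr:10.2.0.6/32"},
		metadata["10.2.0.6/32"]...)
	const id = 16777218

	events := make(chan []string, 1)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // asynchronous consumer triggered by the allocation event
		defer wg.Done()
		sc.UpdateIdentities(id, <-events)
	}()
	events <- enriched
	wg.Wait() // force the "bad" ordering: the enriched update lands first

	// Buggy path: the policy-map update re-reads the labels from the metadata
	// map, overwriting the enriched set with reserved:kube-apiserver only.
	sc.UpdateIdentities(id, metadata["10.2.0.6/32"])
	fmt.Println("buggy ordering:", sc.labels[id]) // CIDR labels are lost

	// Fixed path: propagate the enriched labels of the newly allocated
	// identity instead, so the result no longer depends on the ordering.
	sc.UpdateIdentities(id, enriched)
	fmt.Println("fixed:         ", sc.labels[id]) // CIDR labels preserved
}
```

Running the sketch prints the reserved-only label set for the buggy ordering and the full CIDR-enriched set for the fixed path, mirroring the "labels" and "labels(new)" contents of the warning quoted above.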
/test-backport-1.13

Job 'Cilium-PR-K8s-1.24-kernel-4.19' failed:
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.24-kernel-4.19/31/
If it is a flake and a GitHub issue doesn't already exist to track it, comment, then please upload the Jenkins artifacts to that issue.

Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed:
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-kernel-4.19/1247/
If it is a flake and a GitHub issue doesn't already exist to track it, comment, then please upload the Jenkins artifacts to that issue.
LGTM, given that this is a
Sure, I've updated the release note to better highlight the user impact.
LGTM thanks! As tophat I'll be happy to merge this after CI is triaged and proven not to be breaking due to the change (ideally achieved by re-running those tests to demonstrate they don't reliably fail).
/mlh new-flake Cilium-PR-K8s-1.24-kernel-4.19 👍 created #25255
/test-1.24-4.19
test-1.16-4.19: hit #22749
/test-1.16-4.19
I'm guessing
Thanks @michi-covalent for triaging the tests! They all passed now.
is the button green? no. but this is an exceptional case since
@giorio94 please use the usual backporting processes when preparing backport commits like this. In particular this ensures:
For your reference, the backporting guide is here: https://docs.cilium.io/en/latest/contributing/release/backports/#backporting-guide-for-the-backporter
Ah, I may have spoken a bit early if this fix was only targeted at the 1.13 branch... in that case this is expected. The vast majority of fixes land on main first, but in this case it looks like there is no corresponding commit on the main branch. Sorry about the noise.
Yeah, this change targeted only v1.13, since the same logic in the main branch had already been refactored, fixing this bug at the same time. This commit modifies only the very specific bits required to fix the issue, to avoid backporting the full refactoring to v1.13.
This PR fixes the propagation of the identity labels associated with the kube-apiserver IPs (when they are external to the cluster, as in cloud environments) to the selector cache, ensuring that the labels representing the CIDR itself and the enclosing CIDRs are not overwritten. Otherwise, traffic towards the API server could be incorrectly dropped when it should be allowed by ToCIDR network policy entries. This bug affects only v1.13, since the logic in the main branch has been refactored and, among other things, already includes the changes proposed here. It additionally fixes the occurrence of the warning log reported in #24401, since the kube-apiserver identity labels are now consistent across different invocations of selectorcache.UpdateIdentities.

Fixes: #24401
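As a side note on the warning itself, the following toy sketch (again not the actual selectorcache implementation; sameLabels and the hard-coded label sets are illustrative assumptions) shows why the log line from #24401 disappears once every UpdateIdentities invocation carries the same enriched label set for the kube-apiserver identity.

```go
// A toy illustration of the "Updating an existing identity" warning; NOT Cilium code.
package main

import (
	"fmt"
	"sort"
)

// sameLabels reports whether two label sets contain the same labels, ignoring order.
func sameLabels(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	a, b = append([]string(nil), a...), append([]string(nil), b...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	cached := []string{"cidr:0.0.0.0/0", "cidr:10.2.0.6/32", "reserved:kube-apiserver", "reserved:world"}

	// Before the fix: a later update may carry only the metadata labels, so the
	// sets differ and a warning like the one quoted earlier would be logged.
	stale := []string{"reserved:kube-apiserver"}
	if !sameLabels(cached, stale) {
		fmt.Println("warning: updating an existing identity with different labels")
	}

	// After the fix: every update carries the enriched labels, the sets match,
	// and no warning is emitted.
	enriched := append([]string(nil), cached...)
	if sameLabels(cached, enriched) {
		fmt.Println("labels consistent across updates; no warning")
	}
}
```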