Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

identities: fix incorrectly overwriting the labels of kube-apiserver IPs #25241

Merged

Conversation

giorio94
Copy link
Member

@giorio94 giorio94 commented May 3, 2023

This PR fixes the propagation of the identity labels associated with the kube-apiserver IPs (if they are external to the cluster, as in cloud environments) to the selector cache, to ensure that the ones representing the CIDR itself and the enclosing ones are not overwritten. That could cause traffic towards the API server to be incorrectly dropped when it should be allowed by ToCIDR network policy entries. This bug affects only v1.13, since the logic in the main branch has been refactored, and (among others) it already includes the changes proposed here.

It additionally fixes the occurrence of the warning log reported in #24401, since the kube-apiserver identity labels are now consistent across different invocations of selectorcache.UpdateIdentities.

Fixes: #24401

Fix bug that caused ToCIDR netpols matching kube-apiserver IPs (when external to the cluster) to not reliably allow connectivity.

The label injection controller is responsible for iterating through the
ipcache.IdentityMetadata map, potentially allocating new identities in
case of label changes. The new identity can be enriched with additional
labels besides the ones present in the metadata map. This happens, in
particular, when the identity to be allocated corresponds to the IP of
kube-apiserver, and it does not belong to any of the cluster nodes
(e.g., in cloud-provider environments). In this case, the additional
labels represent the CIDR itself and all broader CIDRs which include it.

In such scenario, there is currently a possible race condition, which
possibly causes the additional CIDR labels to be disregarded, leading to
an incorrect enforcement of ToCIDR policies matching the IPs of the
kube-apiserver (hence, the corresponding traffic is incorrectly dropped).
When this issue occurs, the policy selector for a broader CIDR does not
include the identities corresponding to the kube-apiserver IP addresses,
and affected agents report a warning message along the lines of (all
intermediate CIDR labels are omitted for brevity):

level=warning msg="UpdateIdentities: Updating an existing identity" identity=16777218
labels="[cidr:0.0.0.0/0 ... cidr:10.2.0.6/32 reserved:kube-apiserver reserved:world]"
labels(new)="[reserved:kube-apiserver]" subsys=policy

Differently, the warning message is benign in case the contents of
 "labels" and "labels(new)" are swapped (i.e., "labels(new)" includes
all CIDR labels in addition to the kube-apiserver reserved one).

More in detail, the race condition is the following:
* The prefix corresponding to the IP address of the kube-apiserver is
  processed by ipc.InjectLabels; in terms of metadata, it is associated
  with the reserved:kube-apiserver label only [1].
* A new local identity gets allocated for the given prefix, after
  enriching the labels with the ones representing CIDR information
  (ipc.injectLabelsForCIDR) [2]. The identity creation causes an event
  to be written to the event channel [3] (with the new ID and the full
  list of labels), asynchronously triggering sc.UpdateIdentities [4].
* Once all identities have been allocated, ipc.InjectLabels calls
  ipc.UpdatePolicyMaps with the set of updated identities, and the
  labels retrieved from the metadata [5,6]. Under the hood, this causes
  sc.UpdateIdentities to be executed. If this happens after the
  asynchronous execution triggered by the update above, the current
  labels (i.e., reserved:kube-apiserver only) overwrite the full list
  previously configured, causing the issue. Conversely, if the order
  is reversed, all labels are eventually present).

The issue is fixed propagating the enriched labels associated with the
newly generated identity to ipc.UpdatePolicyMaps, rather than only the
ones present as part of the metadata.

[1]: https://github.com/cilium/cilium/blob/097fba3182f0f67fe0c6acab9cf5f1a500af36d2/pkg/ipcache/metadata.go#L198
[2]: https://github.com/cilium/cilium/blob/097fba3182f0f67fe0c6acab9cf5f1a500af36d2/pkg/ipcache/metadata.go#L431-L432
[3]: https://github.com/cilium/cilium/blob/097fba3182f0f67fe0c6acab9cf5f1a500af36d2/pkg/identity/cache/local.go#L108-L112
[4]: https://github.com/cilium/cilium/blob/097fba3182f0f67fe0c6acab9cf5f1a500af36d2/pkg/identity/cache/cache.go#L187
[5]: https://github.com/cilium/cilium/blob/097fba3182f0f67fe0c6acab9cf5f1a500af36d2/pkg/ipcache/metadata.go#L232
[6]: https://github.com/cilium/cilium/blob/097fba3182f0f67fe0c6acab9cf5f1a500af36d2/pkg/ipcache/metadata.go#L266

Fixes: cilium#24401
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. labels May 3, 2023
@giorio94 giorio94 requested a review from a team as a code owner May 3, 2023 07:32
@maintainer-s-little-helper maintainer-s-little-helper bot added backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. labels May 3, 2023
@giorio94
Copy link
Member Author

giorio94 commented May 3, 2023

/test-backport-1.13

Job 'Cilium-PR-K8s-1.24-kernel-4.19' failed:

Click to show.

Test Name

K8sDatapathVerifier Runs the kernel verifier against Cilium's BPF datapath

Failure Output

FAIL: terminating containers are not deleted after timeout

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.24-kernel-4.19/31/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.24-kernel-4.19 so I can create one.

Then please upload the Jenkins artifacts to that issue.

Job 'Cilium-PR-K8s-1.16-kernel-4.19' failed:

Click to show.

Test Name

K8sAgentPolicyTest Multi-node policy test with L7 policy using connectivity-check to check datapath

Failure Output

FAIL: connectivity-check pods are not ready after timeout

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.16-kernel-4.19/1247/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.16-kernel-4.19 so I can create one.

Then please upload the Jenkins artifacts to that issue.

@christarazi
Copy link
Member

christarazi commented May 3, 2023

LGTM, given that this is a release-note/bug, can we describe the user impact of the bug in the release note?

@giorio94
Copy link
Member Author

giorio94 commented May 3, 2023

LGTM, given that this is a release-note/bug, can we describe the user impact of the bug in the release note?

Sure, I've updated the release note to better highlight the user impact.

Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks! As tophat I'll be happy to merge this after CI is triaged and proven not to be breaking due to the change (ideally achieved by re-running those tests to demonstrate they don't reliably fail).

@joestringer joestringer added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 3, 2023
@michi-covalent michi-covalent added the release-blocker/1.13 This issue will prevent the release of the next version of Cilium. label May 4, 2023
@michi-covalent
Copy link
Contributor

michi-covalent commented May 4, 2023

/mlh new-flake Cilium-PR-K8s-1.24-kernel-4.19

👍 created #25255

@michi-covalent
Copy link
Contributor

/test-1.24-4.19

@michi-covalent
Copy link
Contributor

test-1.16-4.19: hit #22749

@michi-covalent
Copy link
Contributor

/test-1.16-4.19

@michi-covalent
Copy link
Contributor

i'm guessing ci-datapath-1.13 is no longer required since it got renamed to ci-e2e-1.13 in #25164. we can merge this once test-1.16-4.19 and test-1.24-4.19 pass.

@giorio94
Copy link
Member Author

giorio94 commented May 4, 2023

Thanks @michi-covalent for triaging the tests! They all passed now

@michi-covalent
Copy link
Contributor

is the button green? no. but this is an exceptional case since ci-datapath-1.13 has been renamed to ci-e2e-1.13 and ci-e2e-1.13 passed. time to 🚀

@michi-covalent michi-covalent merged commit 440e87d into cilium:v1.13 May 4, 2023
62 checks passed
@giorio94 giorio94 deleted the mio/kube-apiserver-identity-labels-fix branch May 4, 2023 14:23
@joestringer
Copy link
Member

@giorio94 please use the usual backporting processes when preparing backport commits like this. In particular this ensures:

  • Commit messages refer back to the original commit, so it is easy to trace and confirm that the backport was performed correctly
  • PR descriptions include the relevant references that allow the release notes tool to generate accurate release notes.

For your reference, the backporting guide is here: https://docs.cilium.io/en/latest/contributing/release/backports/#backporting-guide-for-the-backporter

@joestringer
Copy link
Member

Ah, I may have spoke a bit early if this fix was only targeted at the 1.13 branch.. in that case this is expected. The vast majority of fixes land on main first, but in this case it looks like there is no corresponding commit on the main branch. Sorry about the noise.

@giorio94
Copy link
Member Author

Yeah, this change targeted only v1.13, since the same logic in the main branch had already been refactored, fixing at the same time this bug. This commit modifies only the very specific bits required to fix the issue, to avoid back porting the full refactoring to v1.13.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-blocker/1.13 This issue will prevent the release of the next version of Cilium. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cilium 1.13 complains about "UpdateIdentities: Updating an existing identity"
4 participants