bpf: lxc: support Pod->Service->Pod hairpinning with endpoint routes #27798

Merged (1 commit) on Sep 4, 2023

Conversation

@ti-mo (Contributor) commented Aug 29, 2023

This patch adds a check to determine whether a Pod is sending traffic to itself through a Service. With endpoint routes enabled, this traffic is subject to policy, and we want to avoid requiring the user to explicitly allow traffic to/from the Pod itself.

Fixes: #27709
Strictly depends on #27602 (for backports)
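
Not the actual bpf_lxc.c change, but a minimal, standalone sketch of the check this patch introduces. All struct and function names here are hypothetical; the only assumption taken from this PR's context is that the CT entry for a hairpinned Service connection carries a `loopback` flag (see the revert commit message quoted further below):

```c
/* Hypothetical sketch: skip the pod's own ingress policy for replies that
 * belong to a loopback (Pod -> Service -> same Pod) connection. Names are
 * illustrative, not Cilium's real datapath symbols. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct ct_entry {
    bool loopback;          /* backend selected for the Service == client pod */
    uint16_t rev_nat_index; /* which Service translation to undo on replies */
};

struct flow {
    uint32_t saddr, daddr;
    bool is_reply;          /* CT lookup matched an existing entry in reverse */
};

/* With endpoint routes, replies re-enter the pod through its ingress policy
 * program. For hairpinned traffic we do not want users to need an explicit
 * "allow from self" rule, so loopback replies bypass the policy check. */
static bool skip_ingress_policy(const struct flow *f, const struct ct_entry *ct)
{
    return f->is_reply && ct->loopback;
}

int main(void)
{
    struct ct_entry hairpin = { .loopback = true, .rev_nat_index = 7 };
    struct flow reply = { .saddr = 0x0a00000au, .daddr = 0x0a00000au, .is_reply = true };

    printf("hairpin reply bypasses ingress policy: %s\n",
           skip_ingress_policy(&reply, &hairpin) ? "yes" : "no");
    return 0;
}
```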

bpf: lxc: support Pod->Service->Pod hairpinning with endpoint routes

@ti-mo ti-mo requested a review from a team as a code owner August 29, 2023 12:42
@ti-mo ti-mo requested a review from aspsk August 29, 2023 12:42
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 29, 2023
@ti-mo ti-mo marked this pull request as draft August 29, 2023 12:42
@ti-mo ti-mo added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Aug 29, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 29, 2023
@ti-mo ti-mo added the sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label Aug 29, 2023
@ti-mo (Contributor, Author) commented Aug 29, 2023

/test

@qmonnet (Member) left a comment

Please find some minor comments below (even though I understand this is still a draft)

12 review comments on bpf/bpf_lxc.c (all outdated, resolved)
@julianwiedmann julianwiedmann added sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. area/loadbalancing Impacts load-balancing and Kubernetes service implementations labels Aug 30, 2023
@ti-mo ti-mo force-pushed the tb/svc-hairpinning branch 2 times, most recently from e07938e to b465e93, August 30, 2023 08:39
1 review comment on bpf/bpf_lxc.c (outdated, resolved)
@ti-mo (Contributor, Author) commented Aug 30, 2023

/test

@ti-mo ti-mo added needs-backport/1.12 needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Aug 30, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.7 Aug 30, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.2 Aug 30, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.12.14 Aug 30, 2023
@qmonnet qmonnet added the backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. label Sep 7, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.7 Sep 7, 2023
@julianwiedmann julianwiedmann added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Sep 8, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.7 Sep 8, 2023
@qmonnet qmonnet added backport-done/1.12 The backport for Cilium 1.12.x for this PR is done. and removed backport-pending/1.12 labels Sep 8, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.12 to Backport done to v1.12 in 1.12.14 Sep 8, 2023
@michi-covalent michi-covalent added this to Needs backport from main in 1.14.3 Sep 9, 2023
@michi-covalent michi-covalent removed this from Needs backport from main in 1.14.2 Sep 9, 2023
@ti-mo ti-mo deleted the tb/svc-hairpinning branch September 11, 2023 12:14
@gandro (Member) commented Sep 12, 2023

@qmonnet @ti-mo This also needs a custom backport to v1.14 due to conflicts, unfortunately.

@gandro gandro added the backport/author The backport will be carried out by the author of the PR. label Sep 12, 2023
@gandro gandro mentioned this pull request Sep 12, 2023
@qmonnet (Member) commented Sep 12, 2023

OK, thank you, I will look into it.

@julianwiedmann julianwiedmann added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Sep 13, 2023
@jrajahalme jrajahalme moved this from Needs backport from main to Backport done to v1.14 in 1.14.3 Oct 18, 2023
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request May 17, 2024
The usual flow for handling service traffic to a local backend is as
follows:
* requests are load-balanced in from-container. This entails selecting
a backend (and caching the selection in a CT_SERVICE entry), DNATing the
packet, creating a CT_EGRESS entry for the resulting `client -> backend`
flow, applying egress network policy, and local delivery to the backend
pod. As part of the local delivery, we also create a CT_INGRESS entry and
apply ingress network policy.
* replies bypass the backend's egress network policy (because the CT
lookup returns CT_REPLY), and pass to the client via local delivery. In
the client's ingress path they bypass ingress network policy (the packets
match as reply against the CT_EGRESS entry), and we apply RevDNAT based on
the `rev_nat_index` in the CT_EGRESS entry.

For a loopback connection (where the client pod is selected as backend for
the connection) this looks slightly more complicated:
* As we can't establish a `client -> client` connection, the requests are
also SNATed with IPV4_LOOPBACK. Network policy in forward direction is
explicitly skipped (as the matched CT entries have the `.loopback` flag
set).
* In reply direction, we can't deliver to IPV4_LOOPBACK (as that's not a
valid IP for an endpoint lookup). So a reply already gets fully RevNATed
by from-container, using the CT_INGRESS entry's `rev_nat_index`. But this
means that when passing into the client pod (either via to-container, or
via the ingress policy tail-call), the packet doesn't match as reply to the
CT_EGRESS entry - and so we don't benefit from automatic network policy
bypass. We ended up with two workarounds for this aspect:
(1) when to-container is installed, it contains custom logic to match the
    packet as a loopback reply, and skip ingress policy
    (see cilium#27798).
(2) otherwise we skip the ingress policy tailcall, and forward the packet
    straight into the client pod.

The downside of these workarounds is that we bypass the *whole* ingress
program, not just the network policy part. So the CT_EGRESS entry doesn't
get updated (lifetime, statistics, observed packet flags, ...), and we
have the hidden risk that when we add more logic to the ingress program,
it doesn't get executed for loopback replies.

This patch aims to eliminate the need for such workarounds. At its core,
it detects loopback replies in from-container and overrides the packet's
destination IP. Instead of attempting an endpoint lookup for IPV4_LOOPBACK,
we can now look up the actual client endpoint - and deliver to the ingress
policy program, *without* needing to early-RevNAT the packet. Instead the
replies follow the usual packet flow, match the CT_EGRESS entry in the
ingress program, naturally bypass ingress network policy, and are *then*
RevNATed based on the CT_EGRESS entry's `rev_nat_index`.

Consequently we follow the standard datapath, without needing to skip over
policy programs. The CT_EGRESS entry is updated for every reply.

Thus we can also remove the manual policy bypass for loopback replies,
when using per-EP routing. It's no longer needed and in fact the
replies will no longer match the lookup logic, as they haven't
been RevNATed yet. This effectively reverts
e2829a0 ("bpf: lxc: support Pod->Service->Pod hairpinning with endpoint routes").

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
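
For illustration, a simplified, standalone sketch of the reply-path rework described in the commit message above. The value of IPV4_LOOPBACK and every other name below are assumptions for demonstration, not the real Cilium datapath definitions:

```c
/* Illustrative sketch: in from-container, rewrite a loopback reply's
 * destination from IPV4_LOOPBACK back to the real client endpoint, so the
 * reply then traverses the client's ingress program like any other reply:
 * it matches the CT_EGRESS entry there (bypassing ingress policy naturally)
 * and is only RevNATed at that point via rev_nat_index, instead of being
 * RevNATed early and having policy skipped by hand. */
#include <stdint.h>
#include <stdio.h>

#define IPV4_LOOPBACK 0x010000feu /* placeholder for the reserved SNAT address */

struct ct_entry {
    int loopback;           /* connection was hairpinned through a Service */
    uint32_t client_addr;   /* the client (== backend) pod's real address */
    uint16_t rev_nat_index; /* Service translation to undo on the reply path */
};

struct packet {
    uint32_t saddr, daddr;
    int is_reply;
};

static void redirect_loopback_reply(struct packet *p, const struct ct_entry *ct)
{
    if (p->is_reply && ct->loopback && p->daddr == IPV4_LOOPBACK)
        p->daddr = ct->client_addr; /* endpoint lookup now finds the client pod */
}

int main(void)
{
    struct ct_entry ct = { .loopback = 1, .client_addr = 0x0a00000au, .rev_nat_index = 3 };
    struct packet reply = { .saddr = 0x0a00000au, .daddr = IPV4_LOOPBACK, .is_reply = 1 };

    redirect_loopback_reply(&reply, &ct);
    printf("loopback reply now delivered to 0x%08x\n", (unsigned int)reply.daddr);
    return 0;
}
```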
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request May 22, 2024
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request May 23, 2024
Labels
* area/loadbalancing: Impacts load-balancing and Kubernetes service implementations
* backport/author: The backport will be carried out by the author of the PR.
* backport-done/1.12: The backport for Cilium 1.12.x for this PR is done.
* backport-done/1.13: The backport for Cilium 1.13.x for this PR is done.
* backport-done/1.14: The backport for Cilium 1.14.x for this PR is done.
* kind/bug: This is a bug in the Cilium logic.
* ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
* release-note/bug: This PR fixes an issue in a previous release of Cilium.
* sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
* sig/policy: Impacts whether traffic is allowed or denied based on user-defined policies.
Projects (no open projects)
* 1.12.14: Backport done to v1.12
* 1.13.7: Backport done to v1.13
* 1.14.3: Backport done to v1.14
Development

Successfully merging this pull request may close these issues:

IPv4 Service hairpin return-traffic denied by ingress policy
9 participants