Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

loader: refactor replaceDatapath to loadDatapath #32518

Merged
merged 2 commits into from May 22, 2024

Conversation

ti-mo
Copy link
Contributor

@ti-mo ti-mo commented May 13, 2024

loader: refactor replaceDatapath to loadDatapath

This unblocks cilium/cilium#29333. See cilium/cilium#32468 for more context.

This commit refactors replaceDatapath() to loadDatapath() by factoring device attachment
out of the function into the caller. The main reasons are flexibility and transparency.
replaceDatapath() was called from many places and needed to do a lot. This change is the
first step to handing individual callers an object representing actual bpf object handles,
so they can correctly manage its lifecycle. In the future, ebpf.LoadAndAssign will be used
for better readability.

Some callers attach the same program to multiple interfaces, some attach multiple programs
(ingress/egress) to the same interface, and some use a mixture of both. This has caused
loops to creep into replaceDatapath, giving it many arguments and many overall
responsibilities, making it hard to form intuition around.

Major changes made in this commit:
- lifted attach{SKB,XDP}Program out of the function, into all callers, making them
  call attach* methods explicitly
- removed `replaceDatapathOptions`
- reduced the window a potential 'rollback' can happen in (see code comments) due to the
  risks involved, and it never being correct to begin with.
- removed a few points where context cancellations are obeyed, to be continued in a
  subsequent commit

Fixes: #32468
loader: remove ctx from uncancellable functions

Loading bpf objects used to be done by iproute2, where propagating ctx to the
exec.Cmd invocations made sense, since realistically any shellout can hang for
arbitrary reasons.

Now the loader is fully hosted in the agent process, this no longer makes sense.
Once we're blocked in a bpf() syscall, e.g. for loading a program, the verifier
can be interrupted by sending a signal to the calling thread. Since the Go runtime
routinely sends these signals under normal operation, ebpf-go will retry a few
times when bpf() returns EINTR. The API currently doesn't expose a way to cancel
program loading/verification, and there's no clear benefit to doing so in the first
place.

Verification is relatively lightweight compared to datapath compilation, so
interrupting it during teardown is of questionable benefit. The agent doesn't expect
it to be interruptible, it's bound to leave endpoints in an undefined state.

This commit introduces the assumption that, once endpoint loading/attachment is
kicked off (after compilation), it cannot be cancelled. This is reflected in the
interface exposed to the rest of the system, by removing the ctx parameter on many
methods. Only compilation can be interrupted, since it can take a long time on some
systems, especially lower-spec.

Fixes: #32468

@ti-mo ti-mo added the release-note/misc This PR makes changes that have no direct user impact. label May 13, 2024
@ti-mo
Copy link
Contributor Author

ti-mo commented May 13, 2024

/test

@ti-mo ti-mo force-pushed the tb/replacedatapath-remove-attach branch from c636496 to 38d44d4 Compare May 14, 2024 08:12
@ti-mo
Copy link
Contributor Author

ti-mo commented May 14, 2024

/test

@ti-mo ti-mo changed the title loader: move device attachment out of replaceDatapath, rename to loadDatapath loader: refactor replaceDatapath to loadDatapath May 14, 2024
@ti-mo ti-mo marked this pull request as ready for review May 14, 2024 09:56
@ti-mo ti-mo requested review from a team as code owners May 14, 2024 09:56
@ti-mo ti-mo requested review from tommyp1ckles and lmb May 14, 2024 09:56
Copy link
Contributor

@lmb lmb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separating loading from attachment is great, but does this split at the right point? As you rightfully point out, inserting into the tail call maps ends up working like attachment. Can we also move map insertion out of this function?

pkg/datapath/loader/netlink.go Show resolved Hide resolved
pkg/datapath/loader/netlink.go Show resolved Hide resolved
pkg/datapath/loader/netlink.go Outdated Show resolved Hide resolved
@ti-mo ti-mo force-pushed the tb/replacedatapath-remove-attach branch from 38d44d4 to a6673b1 Compare May 15, 2024 13:50
@ti-mo
Copy link
Contributor Author

ti-mo commented May 15, 2024

/test

@ti-mo ti-mo marked this pull request as draft May 16, 2024 18:36
@ti-mo
Copy link
Contributor Author

ti-mo commented May 16, 2024

Drafted due to rebasing onto #32059.

@ti-mo ti-mo force-pushed the tb/replacedatapath-remove-attach branch from a6673b1 to cf8ba69 Compare May 16, 2024 18:36
@ti-mo
Copy link
Contributor Author

ti-mo commented May 16, 2024

/test

@ti-mo ti-mo force-pushed the tb/replacedatapath-remove-attach branch from cf8ba69 to ae7ea76 Compare May 17, 2024 14:09
@ti-mo
Copy link
Contributor Author

ti-mo commented May 17, 2024

/test

@ti-mo ti-mo marked this pull request as ready for review May 22, 2024 07:36
@ti-mo ti-mo requested a review from lmb May 22, 2024 09:18
pkg/datapath/loader/base.go Outdated Show resolved Hide resolved
pkg/datapath/loader/loader.go Outdated Show resolved Hide resolved
pkg/datapath/loader/tc.go Outdated Show resolved Hide resolved
pkg/datapath/loader/xdp.go Outdated Show resolved Hide resolved
ti-mo added 2 commits May 22, 2024 13:49
This unblocks cilium#29333. See cilium#32468 for more context.

This commit refactors replaceDatapath() to loadDatapath() by factoring device attachment
out of the function into the caller. The main reasons are flexibility and transparency.
replaceDatapath() was called from many places and needed to do a lot. This change is the
first step to handing individual callers an object representing actual bpf object handles,
so they can correctly manage its lifecycle. In the future, ebpf.LoadAndAssign will be used
for better readability.

Some callers attach the same program to multiple interfaces, some attach multiple programs
(ingress/egress) to the same interface, and some use a mixture of both. This has caused
loops to creep into replaceDatapath, giving it many arguments and many overall
responsibilities, making it hard to form intuition around.

Major changes made in this commit:
- lifted attach{SKB,XDP}Program out of the function, into all callers, making them
  call attach* methods explicitly
- removed `replaceDatapathOptions`
- reduced the window during which a 'revert' can happen (see code comments) due to
  the risks involved and its questionable effectiveness
- removed a few points where context cancellations are obeyed, to be continued in a
  subsequent commit

Fixes: cilium#32468

Signed-off-by: Timo Beckers <timo@isovalent.com>
Loading bpf objects used to be done by iproute2, where propagating ctx to the
exec.Cmd invocations made sense, since realistically any shellout can hang for
arbitrary reasons.

Now the loader is fully hosted in the agent process, this no longer makes sense.
Once we're blocked in a bpf() syscall, e.g. for loading a program, the verifier
can be interrupted by sending a signal to the calling thread. Since the Go runtime
routinely sends these signals under normal operation, ebpf-go will retry a few
times when bpf() returns EINTR. The API currently doesn't expose a way to cancel
program loading/verification, and there's no clear benefit to doing so in the first
place.

Verification is relatively lightweight compared to datapath compilation, so
interrupting it during teardown is of questionable benefit. The agent doesn't expect
it to be interruptible, it's bound to leave endpoints in an undefined state.

This commit introduces the assumption that, once endpoint loading/attachment is
kicked off (after compilation), it cannot be cancelled. This is reflected in the
interface exposed to the rest of the system, by removing the ctx parameter on many
methods. Only compilation can be interrupted, since it can take a long time on some
systems, especially lower-spec.

Signed-off-by: Timo Beckers <timo@isovalent.com>
@ti-mo ti-mo force-pushed the tb/replacedatapath-remove-attach branch from ae7ea76 to 1188eda Compare May 22, 2024 12:02
@ti-mo ti-mo enabled auto-merge May 22, 2024 12:03
@ti-mo
Copy link
Contributor Author

ti-mo commented May 22, 2024

/test

@ti-mo ti-mo added this pull request to the merge queue May 22, 2024
Merged via the queue into cilium:main with commit d395867 May 22, 2024
63 of 64 checks passed
@ti-mo ti-mo deleted the tb/replacedatapath-remove-attach branch May 22, 2024 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move program attachment logic out of replaceDatapath()
2 participants