RFC: User-specified epoll flags #6084

Open
Noah-Kennedy opened this issue Oct 17, 2023 · 9 comments
Labels
A-tokio Area: The main tokio crate C-feature-request Category: A feature request. M-net Module: tokio/net T-performance Topic: performance and benchmarks

Comments

@Noah-Kennedy
Contributor

The problem

When building shared-nothing systems at scale, load balancing new connections is a significant challenge.

To summarize that blog post, SO_REUSEPORT can often introduce new sources of tail latency because it splits new connections into per-socket queues regardless of whether the workers behind those queues are currently busy. A worker parked on epoll_wait is generally an excellent candidate for a new connection compared to a worker that is currently handling existing connections, since a worker that is already looking for work clearly has the capacity to accept it. Note that even eBPF-based SO_REUSEPORT load balancing isn't ideal here, as an eBPF program can't really be that smart. Therefore, it tends to be better for latency to load balance with epoll. Unfortunately, this can't be done with the vanilla set of options: normally, if multiple epoll instances watch the same socket, all of them get the notification, leading to a thundering herd.

Fortunately, the EPOLLEXCLUSIVE flag resolves this issue by ensuring that only one waiting epoll instance with the flag set for the particular interest will get the notification. EPOLLEXCLUSIVE is, as a result, extraordinarily useful for at-scale shared-nothing systems. It isn't always the best approach depending on how sensitive a system is to even load balancing vs TTFB, but it's an important element of any shared-nothing toolbox.
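For readers less familiar with the flag, here is a minimal sketch of what an exclusive registration looks like at the syscall level. It is not taken from the POC patch: `WorkerPoller`, `watch_exclusive`, and `wait` are hypothetical names, and the FFI declarations and constants are hand-copied from Linux's `<sys/epoll.h>` only to keep the example free of external crates (real code would more likely use the `libc` crate). Linux 4.5 or newer is assumed.

```rust
use std::io;
use std::net::TcpListener;
use std::os::fd::{AsRawFd, RawFd};

// Constants as defined in <sys/epoll.h> on Linux.
const EPOLL_CLOEXEC: i32 = 0o2000000;
const EPOLL_CTL_ADD: i32 = 1;
const EPOLLIN: u32 = 0x001;
const EPOLLEXCLUSIVE: u32 = 1 << 28;

// `struct epoll_event` is packed on x86 and x86_64, naturally aligned elsewhere.
#[repr(C)]
#[cfg_attr(any(target_arch = "x86", target_arch = "x86_64"), repr(packed))]
#[derive(Clone, Copy)]
struct EpollEvent {
    events: u32,
    data: u64,
}

extern "C" {
    fn epoll_create1(flags: i32) -> i32;
    fn epoll_ctl(epfd: i32, op: i32, fd: i32, event: *mut EpollEvent) -> i32;
    fn epoll_wait(epfd: i32, events: *mut EpollEvent, maxevents: i32, timeout: i32) -> i32;
    fn close(fd: i32) -> i32;
}

/// Hypothetical per-worker poller: one epoll instance per worker thread.
struct WorkerPoller {
    epfd: RawFd,
}

impl WorkerPoller {
    fn new() -> io::Result<Self> {
        let epfd = unsafe { epoll_create1(EPOLL_CLOEXEC) };
        if epfd < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(Self { epfd })
    }

    /// Register the shared listener level-triggered with EPOLLEXCLUSIVE,
    /// so that a new connection does not wake every waiting worker.
    fn watch_exclusive(&self, listener: &TcpListener) -> io::Result<()> {
        let fd = listener.as_raw_fd();
        let mut ev = EpollEvent {
            events: EPOLLIN | EPOLLEXCLUSIVE,
            data: fd as u64,
        };
        let rc = unsafe { epoll_ctl(self.epfd, EPOLL_CTL_ADD, fd, &mut ev) };
        if rc < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }

    /// Wait for at most one event; returns the number of ready events (0 or 1).
    fn wait(&self, timeout_ms: i32) -> io::Result<usize> {
        let mut ev = EpollEvent { events: 0, data: 0 };
        let n = unsafe { epoll_wait(self.epfd, &mut ev, 1, timeout_ms) };
        if n < 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(n as usize)
    }
}

impl Drop for WorkerPoller {
    fn drop(&mut self) {
        // Best-effort cleanup of the epoll descriptor.
        unsafe {
            close(self.epfd);
        }
    }
}
```

Each worker thread would own one `WorkerPoller` and register the same shared listener. Per epoll_ctl(2), at least one of the exclusive instances receives each event, so the flag reduces spurious wakeups rather than strictly guaranteeing a single one.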

At Cloudflare, we have services which use tokio in both shared-nothing and work-stealing configurations and make extensive use of EPOLLEXCLUSIVE and other atypical epoll flags. Based on our experience serving diverse types of traffic at scale, we think that allowing users to leverage custom epoll flags would make tokio a significantly more powerful toolkit for users working on shared-nothing systems.

The solution

I have a POC patch, which I can push later, that adds a new from_std variant to several types (currently just the TCP and AF_UNIX stream listeners). The new variant lets the caller specify the exact set of epoll flags to use when registering the socket with our epoll descriptor. If we made this fallible, it wouldn't block the use of io_uring or similar in the future, as we could just document that this only works if you are using epoll. We could potentially do that only with AsyncFd, or with the listener types as I implemented in the POC, or both.
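To make the "fallible" point concrete, here is a rough sketch of the kind of check such a constructor could perform. Everything in it is hypothetical and for illustration only: `Backend`, `EpollFlags`, and `check_flags_supported` are not Tokio or Mio APIs and are not taken from the POC.

```rust
use std::io;

/// Hypothetical: the OS mechanism the runtime's I/O driver is built on.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Backend {
    Epoll,
    Kqueue,
    IoUring,
}

/// Hypothetical: a raw epoll event mask supplied by the caller.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct EpollFlags(u32);

impl EpollFlags {
    const IN: EpollFlags = EpollFlags(0x001);
    const EXCLUSIVE: EpollFlags = EpollFlags(1 << 28);

    /// Combine two flag sets.
    const fn union(self, other: EpollFlags) -> EpollFlags {
        EpollFlags(self.0 | other.0)
    }
}

/// Hypothetical stand-in for the check a flag-taking `from_std` variant
/// could do: pass the mask through on epoll, fail cleanly elsewhere.
fn check_flags_supported(backend: Backend, flags: EpollFlags) -> io::Result<u32> {
    match backend {
        Backend::Epoll => Ok(flags.0),
        other => Err(io::Error::new(
            io::ErrorKind::Unsupported,
            format!("custom epoll flags are not supported on the {other:?} backend"),
        )),
    }
}
```

Returning an error instead of silently ignoring the flags is what would leave room for non-epoll drivers later: callers that depend on the flags find out at construction time.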

We could also try and add in EPOLLEXCLUSIVE as a new IO interest that users can specify, but this has all of the issues of the POC approach I took, while being more complicated for us to implement and less flexible for users. For that reason, I'd recommend something along the lines of option number one.

If this RFC is accepted, I can take responsibility for the implementation of this.

Because Mio exposes the raw fd of the epoll instance, it can be bypassed entirely for the purposes of implementing this functionality in Tokio. As a result, Mio support is not a prerequisite for Tokio having this functionality.

@Noah-Kennedy Noah-Kennedy added A-tokio Area: The main tokio crate M-net Module: tokio/net C-feature-request Category: A feature request. T-performance Topic: performance and benchmarks labels Oct 17, 2023
@Noah-Kennedy Noah-Kennedy self-assigned this Oct 17, 2023
@Darksonn
Contributor

This seems reasonable enough. I think the main question here is how the new from_std api should look. It would make sense to think about how we can choose an api that is extensible in the future, e.g. for passing flags to io_uring, kqueue, or windows afd.

@Nerdy5k

Nerdy5k commented Oct 18, 2023

Pushing against the POC Patch approach as this does not align with my future goals purpose of this library.

@Noah-Kennedy
Contributor Author

> Pushing against the POC Patch approach as this does not align with my future goals purpose of this library.

@Nerdy5k what are your goals, and how does this impact your ability to use this library?

@Nerdy5k

Nerdy5k commented Oct 19, 2023

> > Pushing against the POC Patch approach as this does not align with my future goals purpose of this library.
>
> @Nerdy5k what are your goals, and how does this impact your ability to use this library?

I want to keep the metal io approach as much as possible without delegating to separate api workers.

@Noah-Kennedy
Contributor Author

> > > Pushing against the POC Patch approach as this does not align with my future goals purpose of this library.
>
> > @Nerdy5k what are your goals, and how does this impact your ability to use this library?
>
> I want to keep the metal io approach as much as possible without delegating to separate api workers.

This doesn't force you to change how you use tokio or mio. It just opens up new options for others who are currently using shared-nothing.

@Noah-Kennedy
Contributor Author

@Nerdy5k could you elaborate on what you mean here?

We aren't talking about changing the innards of tokio in any way that modifies existing behavior, merely adding a new way to construct registered sockets. This doesn't impact the current IO approach; it just allows a new way to interface with it.

I'm not sure what you mean by "separate API workers". I suspect this to be the result of confusion?

Noah-Kennedy added a commit that referenced this issue Oct 19, 2023
…ied epoll flags

WIP fix for #6084.

This currently only adds support for TcpListener.
@Noah-Kennedy
Contributor Author

I put up the POC here: #6089

@carllerche
Member

The blog post uses level-triggered notification, which allows the code to perform 1 accept() per epoll_wait. Tokio uses edge-triggered, which means users must call accept() until EWOULDBLOCK, which somewhat defeats the load balancing aspect.

Can you address this?

@Noah-Kennedy
Contributor Author

Sure!

You bring up a good point here: while there are valid reasons to use EPOLLET | EPOLLEXCLUSIVE, you generally want level-triggered notification for accept, not just because of load balancing, but also because of short-term starvation issues under a burst of new connections. There are situations where you want this flag combination, but they are a minority of cases.

This slipped my mind earlier in the convo, but it was one of the reasons I crafted the patch this way, with users controlling their own flags, including interests. Thanks for reminding me of this; I need to add some notes to the documentation regarding this case.
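The two accept patterns discussed above can be sketched with nothing but std. This is an illustration under assumptions, not code from Tokio or the POC: `accept_one` and `accept_until_would_block` are hypothetical helper names, and the listener is assumed to have been put into non-blocking mode by the caller.

```rust
use std::io;
use std::net::{TcpListener, TcpStream};

/// Level-triggered style: take at most one connection per readiness
/// notification and go back to polling, so other workers can pick up
/// whatever is still queued.
fn accept_one(listener: &TcpListener) -> io::Result<Option<TcpStream>> {
    match listener.accept() {
        Ok((stream, _peer)) => Ok(Some(stream)),
        // EWOULDBLOCK/EAGAIN: the accept queue is currently empty.
        Err(e) if e.kind() == io::ErrorKind::WouldBlock => Ok(None),
        Err(e) => Err(e),
    }
}

/// Edge-triggered style: the notification is not repeated, so the woken
/// worker has to drain the queue until accept() reports EWOULDBLOCK.
fn accept_until_would_block(listener: &TcpListener) -> io::Result<Vec<TcpStream>> {
    let mut accepted = Vec::new();
    while let Some(stream) = accept_one(listener)? {
        accepted.push(stream);
    }
    Ok(accepted)
}
```

Under a burst of connections, the drain loop hands the whole burst to whichever worker was woken, which is the load balancing concern raised above; the single-accept variant leaves the remainder for the next wakeup.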


4 participants