Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support delayed deny in the RBAC network filter #33771

Open
yangminzhu opened this issue Apr 24, 2024 · 9 comments
Open

Support delayed deny in the RBAC network filter #33771

yangminzhu opened this issue Apr 24, 2024 · 9 comments
Labels
area/rbac enhancement Feature requests. Not bugs or questions.

Comments

@yangminzhu
Copy link
Contributor

Title: Support delayed deny in the RBAC network filter

Description:
When the RBAC policy evaluation result is DENY. The RBAC network filter will close the TCP connection immedidately. This doesn't handle very well for some clients that just retry with a new connection at a very high rate, and that could overload the Envoy proxy to high CPU usage.

We propose extending the RBAC network filter to delay a small amount of time (for example, 500ms, this will be configurable) before closing the TCP connection. For those clients this will limit the rate it retries with new connection.

This is especially useful for the RBAC network filter, because unlike the RBAC HTTP filter, it can only close the TCP connection and doesn't have a good way to propagate back the permission denial error (that is not retryable) to the client.

This is the same feature as the connection_limit that also implements the delayed rejecting functionality.

The following is a proposed API change to the RBAC network filter:

message RBAC {

  // The delay_deny to use for rejecting the connection after some specified time duration
  // instead of immediately rejecting the connection. That way, a malicious user is not able to
  // retry as fast as possible which provides a better DoS protection for Envoy. If this is not present,
  // the connection will be closed immediately.
  google.protobuf.Duration delay_deny = 8;
}

[optional Relevant Links:]

Any extra documentation required to understand the issue.

@yangminzhu yangminzhu added enhancement Feature requests. Not bugs or questions. triage Issue requires triage labels Apr 24, 2024
@yangminzhu
Copy link
Contributor Author

cc @lambdai

@lambdai
Copy link
Contributor

lambdai commented Apr 24, 2024

Also want to add that L4 RBAC is likely work with a TLS connection, which is cpu extensive (1ms cpu time)
The delayed deny as a start point of back pressure propogation potentially highly reduce the CPU at both envoy and envoy's downstream

@kyessenov
Copy link
Contributor

Wouldn't it be better to apply pressure earlier, e.g. by not reading bytes and starting TLS handshakes when there's a flood of connections? A delayed deny would mean Envoy has to maintain the memory structures for the connection when we'd want to shed them quickly.

@yangminzhu
Copy link
Contributor Author

Wouldn't it be better to apply pressure earlier, e.g. by not reading bytes and starting TLS handshakes when there's a flood of connections? A delayed deny would mean Envoy has to maintain the memory structures for the connection when we'd want to shed them quickly.

@kyessenov I think we will need both. The dealyed deny in RBAC is specific to connections to be closed due to permission error, and will be more effective to reduce the CPU usage on Envoy in some situations, for example, some gRPC clients retry in a busy for-loop when it is closed by RBAC, this creates signigicant number of new connections (e.g. 400 per second per client) on Envoy for more CPU usage (we are not really worried about the memory as it doesn't look to be an issue in either case).

A delayed deny will naturaully reduce how fast the client is to retry, then siginificantly reduce the CPU usage since there is much less new connection being created/closed at the same time.

@kyessenov
Copy link
Contributor

SG, although this principle of delayed close should probably be applied uniformly: TLS handshake failures, protocol errors, etc all fails in the same error domain. It might be better to handle it at listener or the HCM level.

CC @yanavlasov

@wufanqqfsc
Copy link

"When the RBAC policy evaluation result is DENY. The RBAC network filter will close the TCP connection immediately. This doesn't handle very well for some clients that just retry with a new connection at a very high rate, and that could overload the Envoy proxy to high CPU usage."

1.what's the real reason here for envoy proxy to high CPU usage ? Just for handler some clients DENY and close the tcp connection ? As you had said some clients doesn't handler well and just retry with new connection. So can we think this is client issue ?

2.From Envoy Pov, may be some protection policy or connection limit should be assigned to the same client not just only add connection close delay.

@lambdai
Copy link
Contributor

lambdai commented Apr 25, 2024

@wufanqqfsc The TLS handshake is CPU intensive. Both client and server. Imagine you have a service behind envoy with huge fan-in.

It is a client issue, but you don't nessessarily have the full control of the clients.

@wufanqqfsc
Copy link

@wufanqqfsc The TLS handshake is CPU intensive. Both client and server. Imagine you have a service behind envoy with huge fan-in.

It is a client issue, but you don't nessessarily have the full control of the clients.

Yes, so i mean may be some connection limit policy can be assigned to the same client to avoid same client retry with the connection & closed by envoy. Such as we can limit the connection frequency for the same client if the RBAC is failed during some time slots. Anyway, add delayed deny may help but also cost memory to keep the connection context.

@wbpcode wbpcode added area/rbac and removed triage Issue requires triage labels Apr 26, 2024
yangminzhu added a commit to yangminzhu/envoy that referenced this issue Apr 30, 2024
If specified, the RBAC network filter will delay the specified duration
before actually closing the connection.

This implements envoyproxy#33771.

Signed-off-by: Yangmin Zhu <ymzhu@uber.com>
yangminzhu added a commit to yangminzhu/envoy that referenced this issue Apr 30, 2024
If specified, the RBAC network filter will delay the specified duration
before actually closing the connection.

This implements envoyproxy#33771.

Signed-off-by: Yangmin Zhu <ymzhu@uber.com>
@yangminzhu
Copy link
Contributor Author

Yes, so i mean may be some connection limit policy can be assigned to the same client to avoid same client retry with the connection & closed by envoy. Such as we can limit the connection frequency for the same client if the RBAC is failed during some time slots. Anyway, add delayed deny may help but also cost memory to keep the connection context.

yeah, we can definitily enable different protections at multiple layer and places depending on the actual situation, it's not a one for all solution. The memory cost should be very small and the benefit (as compared to not having the delayed deny) is well worth it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/rbac enhancement Feature requests. Not bugs or questions.
Projects
None yet
Development

No branches or pull requests

5 participants