Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Envoy circuit breakers hit with HTTP/2 #839

Open
mbrancato opened this issue May 10, 2022 · 18 comments
Open

Envoy circuit breakers hit with HTTP/2 #839

mbrancato opened this issue May 10, 2022 · 18 comments

Comments

@mbrancato
Copy link

mbrancato commented May 10, 2022

I use a custom client to send requests to Knative services with a Kourier gateway. It has a slow-start algorithm that gradually scales up concurrent requests until a configured limit is reached, or until error rates are too high. Under normal conditions, with HTTP/1.1, this can easily scale to 10k concurrent requests with just a few Kourier pods. Using HTTP/1.1 is more fragile as it could hit socket limits. However, if I switch that same config to HTTP/2, once the concurrency hits approximately 1000 (maybe slightly lower), Kourier / envoy responds with:

HTTP/2.0 503 Service Unavailable
Content-Length: 81
Content-Type: text/plain
Date: Tue, 10 May 2022 01:58:16 GMT
Server: envoy
X-Envoy-Overloaded: true

upstream connect error or disconnect/reset before headers. reset reason: overflow

If I reduce the concurrency, but increase the number of clients sending requests, once the total concurrent requests goes above 1000, the same result happens. This might depend on the number of upstream service pods, I was testing with just 2.

Is there a way to tune this setting in the Kourier config? Ideally, if requests per socket connection need to be limited, it would allow spawning more connections when the limit is reached.

edit: to clarify, the upstream Service is not changing here, just how the client connects. The upstream service port is defined as a named port (h2c).

@mbrancato
Copy link
Author

As a follow-up, this is mainly overflowing the upstream_rq_pending_overflow value.

foo/myapp-00006.upstream_rq_pending_overflow: 17345

There are also some upstream_cx_overflow and upstream_rq_pending_overflow values > 0 on HTTP/1.1 services, but they are fairly small.

The dynamic clusters are getting default circuit breaker values.

% curl -s localhost:9000/clusters | grep myapp
foo/myapp-00006::observability_name::foo/myapp-00006
foo/myapp-00006::default_priority::max_connections::1024
foo/myapp-00006::default_priority::max_pending_requests::1024
foo/myapp-00006::default_priority::max_requests::1024
foo/myapp-00006::default_priority::max_retries::3
foo/myapp-00006::high_priority::max_connections::1024
foo/myapp-00006::high_priority::max_pending_requests::1024
foo/myapp-00006::high_priority::max_requests::1024
foo/myapp-00006::high_priority::max_retries::3
foo/myapp-00006::added_via_api::true
...

And looking where new clusters appear to be created via the API, there is no circuit breaker config:
https://github.com/knative-sandbox/net-kourier/blob/969ae748c33768049c72b10ae9fb0cd0235fe6ff/pkg/envoy/api/cluster.go#L39-L52

As my testing has shown, the requests from multiple clients still triggers this overflow. Otherwise, I could create some transport pool to spread the load. I do think each individual Kourier instance should be able to handle > 1024 concurrent requests to a service.

@github-actions
Copy link

github-actions bot commented Aug 9, 2022

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 9, 2022
@github-actions github-actions bot closed this as completed Sep 8, 2022
@mbrancato
Copy link
Author

/remove-lifecycle stale

@knative-prow knative-prow bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 23, 2022
@mbrancato
Copy link
Author

/reopen

@knative-prow knative-prow bot reopened this Sep 23, 2022
@knative-prow
Copy link

knative-prow bot commented Sep 23, 2022

@mbrancato: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mbrancato
Copy link
Author

I'm currently working around this with the following hack, but since it is not configurable it requires a custom build on my part:

		// HTTP/2 expanded max defaults
		cluster.CircuitBreakers = &envoyclusterv3.CircuitBreakers{
			Thresholds: []*envoyclusterv3.CircuitBreakers_Thresholds{
				{
					Priority:           envoycorev3.RoutingPriority_DEFAULT,
					MaxConnections:     &wrappers.UInt32Value{Value: 1024},
					MaxPendingRequests: &wrappers.UInt32Value{Value: 1024*1024},
					MaxRequests:        &wrappers.UInt32Value{Value: 1024*1024},
					MaxRetries:         nil,
					RetryBudget:        nil,
					TrackRemaining:     false,
					MaxConnectionPools: nil,
				},
				{
					Priority:           envoycorev3.RoutingPriority_HIGH,
					MaxConnections:     &wrappers.UInt32Value{Value: 1024},
					MaxPendingRequests: &wrappers.UInt32Value{Value: 1024*1024},
					MaxRequests:        &wrappers.UInt32Value{Value: 1024*1024},
					MaxRetries:         nil,
					RetryBudget:        nil,
					TrackRemaining:     false,
					MaxConnectionPools: nil,
				},
			},
		}

@github-actions
Copy link

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 23, 2022
@knative-prow-robot
Copy link

This issue or pull request is stale because it has been open for 90 days with no activity.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close

/lifecycle stale

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 23, 2023
@github-actions
Copy link

github-actions bot commented Jun 9, 2023

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 9, 2023
@mbrancato
Copy link
Author

/remove-lifecycle stale

@knative-prow knative-prow bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 9, 2023
@github-actions
Copy link

github-actions bot commented Sep 8, 2023

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 8, 2023
@github-actions github-actions bot closed this as completed Oct 8, 2023
@mbrancato
Copy link
Author

/reopen

@knative-prow knative-prow bot reopened this Oct 8, 2023
@knative-prow
Copy link

knative-prow bot commented Oct 8, 2023

@mbrancato: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@github-actions github-actions bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 9, 2023
Copy link

github-actions bot commented Jan 7, 2024

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 7, 2024
@mbrancato
Copy link
Author

/reopen

@mbrancato
Copy link
Author

/remove-lifecycle stale

@knative-prow knative-prow bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 7, 2024
Copy link

github-actions bot commented Apr 7, 2024

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 7, 2024
@mbrancato
Copy link
Author

/remove-lifecycle stale

@knative-prow knative-prow bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants