HTTP2: Performance issues when many concurrent requests #35184

Open

JamesNK opened this issue Apr 20, 2020 · 24 comments

@JamesNK
Member

JamesNK commented Apr 20, 2020

HttpClient has performance issues when many concurrent calls are made on one connection. HttpClient is slower than Grpc.Core (a gRPC client that uses the native chttp2 C library for HTTP/2).

Test app: https://github.com/JamesNK/Http2Perf

I have replicated a gRPC call using HttpClient directly, to avoid the complication of involving Grpc.Net.Client.
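For readers who want to see the shape of such a call, here is a minimal sketch of a gRPC-style HTTP/2 POST issued with a raw HttpClient. The address, method path, and payload below are placeholders, not the benchmark's; the full client (including the Grpc.Core and Grpc.Net.Client variants) is in the repository linked above.

```csharp
using System;
using System.Buffers.Binary;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class RawGrpcCallSketch
{
    static async Task Main()
    {
        // .NET Core 3.x needs this switch for HTTP/2 without TLS; later versions configure this differently.
        AppContext.SetSwitch("System.Net.Http.SocketsHttpHandler.Http2UnencryptedSupport", true);

        var client = new HttpClient(new SocketsHttpHandler());

        // gRPC length-prefixed message: 1-byte compression flag + 4-byte big-endian length + payload.
        byte[] messageBytes = Array.Empty<byte>();   // the serialized protobuf request would go here
        var body = new byte[5 + messageBytes.Length];
        BinaryPrimitives.WriteUInt32BigEndian(body.AsSpan(1), (uint)messageBytes.Length);
        messageBytes.CopyTo(body, 5);

        var request = new HttpRequestMessage(HttpMethod.Post, "http://localhost:5000/greet.Greeter/SayHello")
        {
            Version = HttpVersion.Version20,         // gRPC requires HTTP/2
            Content = new ByteArrayContent(body)
        };
        request.Content.Headers.ContentType = new MediaTypeHeaderValue("application/grpc");

        HttpResponseMessage response = await client.SendAsync(request);
        byte[] responsePayload = await response.Content.ReadAsByteArrayAsync();
        Console.WriteLine($"{response.StatusCode}: {responsePayload.Length} bytes");
    }
}
```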

.NET version:

.NET SDK (reflecting any global.json):
 Version:   5.0.100-preview.4.20216.32
 Commit:    c492fe1552

HttpClient results:

  • 100 concurrent callers on one HTTP/2 connection with HttpClient (dotnet run -c Release -p GrpcSampleClient r 100 false): 19k RPS, 1-6ms latency
  • 100 concurrent callers on 100 HTTP/2 connections with HttpClient (dotnet run -c Release -p GrpcSampleClient r 100 true): 31k RPS, 1-8ms latency

Grpc.Core results:

  • 100 concurrent callers on one HTTP/2 connection with Grpc.Core (dotnet run -c Release -p GrpcSampleClient c 100 false): 59k RPS, 1-2ms latency
  • 100 concurrent callers on 100 HTTP/2 connections with Grpc.Core (dotnet run -c Release -p GrpcSampleClient c 100 true): 45k RPS, 1-2ms latency

Interestingly, a connection per caller increases performance with HttpClient but decreases it with Grpc.Core.

With one caller, HttpClient is faster than Grpc.Core (about 3k RPS vs 3.5k RPS), but as you can see, in a 100 concurrent caller scenario Grpc.Core is three times faster.

@ghost

ghost commented Apr 20, 2020

Tagging subscribers to this area: @dotnet/ncl
Notify danmosemsft if you want to be subscribed.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 20, 2020
@scalablecory scalablecory added tenet-performance Performance related issue and removed untriaged New issue has not been triaged by the area owner labels Apr 20, 2020
@scalablecory scalablecory added this to the 5.0 milestone Apr 20, 2020
@scalablecory
Contributor

Are you benchmarking against Kestrel?

I suspect one large contributor to this is how we process frames sequentially: it does not max out multi-threaded CPUs very well. I've been thinking about a better buffering strategy to enable parallelism, but haven't had time to prototype it yet.

I also think our HPACK decoder can be optimized. It takes up the bulk of the CPU usage for the basic protocol parsing.

@JamesNK
Member Author

JamesNK commented Apr 20, 2020

Yes, it is against Kestrel. Out of curiosity, I've just added a Grpc.Core server to see what changes.

The source code is here if you're curious about any of the implementation: https://github.com/JamesNK/Http2Perf

HttpClient, 100 callers, 1 connection:
Kestrel server: 18k RPS, 5ms latency
CCore server: 17k RPS, 5ms latency

CCore client, 100 callers, 1 connection:
Kestrel server: 66k RPS, 2ms latency
CCore server: 42k RPS, 2ms latency

On the server side, Kestrel is outperforming the CCore server. It doesn't make much difference when HttpClient is the client, which indicates that HttpClient is the throughput bottleneck. When the CCore client is used, you can see a 50% RPS increase when switching from the CCore server to Kestrel.

@karelz karelz added this to TODO (High Priority) in HTTP/2 Apr 20, 2020
@scalablecory
Contributor

scalablecory commented Apr 21, 2020

To expand a little on what I said previously:

Http2Connection.SendHeadersAsync:

  • Fully serialized, so we have no chance for parallelism when writing headers. We can look at finer-grained locking, but keep in mind that this will prevent HPACK dynamic table encoding.
  • The way we write the headers frame isn't great: it demands buffering and then copying. We can look at #25344 (Consider adding Scatter/Gather IO on Stream) and #33417 (Finalize Socket API upgrade) to make this zero-copy (see the gather-write sketch below), or we can update our HPACK encoder to chunk frames out for us.
    • SendStreamDataAsync very similarly copies content buffers and could be optimized the same way.
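For illustration, here is a conceptual sketch of the gather-write idea behind those two issues; it is not runtime code, and the helper name and parameters are made up. Instead of copying the frame header and the HPACK-encoded block into one rented buffer before a single write, the two segments are handed to the socket in one call.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Sockets;
using System.Threading.Tasks;

static class GatherWriteSketch
{
    // Hypothetical helper: send a HEADERS frame without combining the pieces into one buffer first.
    public static async Task SendHeadersFrameAsync(Socket socket, byte[] frameHeader, byte[] encodedHeaderBlock)
    {
        // Today the two pieces are copied into one rented buffer and written with a single call.
        // With scatter/gather IO, both segments go out in one write and the copy disappears.
        var segments = new List<ArraySegment<byte>>
        {
            new ArraySegment<byte>(frameHeader),
            new ArraySegment<byte>(encodedHeaderBlock)
        };
        await socket.SendAsync(segments, SocketFlags.None);
    }
}
```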

ProcessIncomingFramesAsync:

  • Also fully serialized, but supporting HPACK dynamic table makes this one far more difficult to process in parallel, especially when you have some servers that just pack every single header into the dynamic table. If we were to set our maximum dynamic table size to 0 again, it would open some options.
  • I think the easiest win here will come from optimizing HPackDecoder, which currently does a lot of byte-at-a-time processing, copying, and string allocations. We can also shortcut static table indexes into known headers / known values (a rough illustration follows below).
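As a rough illustration of that static-table shortcut (this is not dotnet/runtime's HPackDecoder; the table is truncated and the helper names are placeholders), a one-byte Indexed Header Field whose index lands in the RFC 7541 static table can be resolved to a pre-built name/value pair with no parsing or string allocation:

```csharp
using System;

static class HpackStaticTableSketch
{
    // First few entries of the RFC 7541 Appendix A static table (indexes are 1-based; the rest are elided).
    static readonly (string Name, string Value)[] StaticTable =
    {
        ("", ""),            // index 0 is unused
        (":authority", ""),
        (":method", "GET"),
        (":method", "POST"),
        (":path", "/"),
        (":path", "/index.html"),
    };

    // Fast path: a one-byte Indexed Header Field (leading bit 1) that points into the static table.
    static bool TryDecodeIndexedStaticField(ReadOnlySpan<byte> input, out (string Name, string Value) header)
    {
        header = default;
        if (input.IsEmpty || (input[0] & 0x80) == 0)
            return false;                          // not an Indexed Header Field representation

        int index = input[0] & 0x7F;
        if (index == 0 || index == 0x7F || index >= StaticTable.Length)
            return false;                          // multi-byte integer or dynamic-table hit: use the slow path

        header = StaticTable[index];               // no parsing, no string allocation
        return true;
    }
}
```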

Our windowing algorithm is also very latency-sensitive, but will primarily exhibit issues only when downloading more data than fits into the window. I don't think this will affect perf of small requests.

@stephentoub stephentoub self-assigned this Apr 28, 2020
@scalablecory scalablecory moved this from TODO (High Priority) to TODO (Low Priority) in HTTP/2 May 8, 2020
@karelz
Member

karelz commented May 8, 2020

@JamesNK how close does Stephen's PR get us to reasonable perf? What is the priority of pushing perf further? Are there any customers / benchmarks demonstrating we have significant deficiencies?

@stankovski

stankovski commented May 9, 2020

@karelz we have an Azure service that is affected by this perf degradation. I'll be happy to chat offline to describe our scenario as well as try to validate the fix.

@JamesNK
Member Author

JamesNK commented May 9, 2020

@JamesNK how close does Stephen's PR get us to reasonable perf? What is the priority of pushing perf further?

I haven't had a chance to measure it yet. I am improving the gRPC benchmarks to capture more client data, and adding a golang gRPC client to compare against. That will give us two points of reference. I aim to provide new numbers next week.

When I've seen Stephen's improvements flow through to nightly builds, I'll ask the customers who have raised client perf issues (one is @stankovski) to retest. Right now they are considering going with Grpc.Core solely because of the perf difference.

Priority-wise, I'd like the .NET gRPC client to be competitive with other gRPC client libraries. Client perf is important in microservice scenarios because it is common to have one caller rather than one thousand, and client perf can be the RPS bottleneck. We don't have to be the fastest client, but I don't want to be the slowest. Having a second point of reference will give us more information on where we stand.

@karelz
Member

karelz commented May 9, 2020

Great, let's see where we stand with the latest fix and then we can decide how much more we need to invest in .NET 5.

@JamesNK
Member Author

JamesNK commented May 10, 2020

I have some results from running on my computer with the latest nightly SDK. There is a large improvement when using HttpClientHandler directly, but something about how Grpc.Net.Client uses HttpClientHandler causes a significant performance drop. Note that the benchmark now references a nightly package of Grpc.Net.Client. The nightly package uses HttpMessageInvoker rather than HttpClient.

Grpc.Core:

Successfully processed 208492; RPS 72894; Errors 0; Last elapsed 1.3694ms
Successfully processed 280400; RPS 71904; Errors 0; Last elapsed 1.2076ms
Successfully processed 352859; RPS 72397; Errors 0; Last elapsed 1.4541ms
Successfully processed 426157; RPS 73262; Errors 0; Last elapsed 1.4229ms
Successfully processed 498743; RPS 72546; Errors 0; Last elapsed 1.411ms

HttpClientHandler:

Successfully processed 300869; RPS 64092; Errors 0; Last elapsed 1.868ms
Successfully processed 365679; RPS 64733; Errors 0; Last elapsed 1.6089ms
Successfully processed 430764; RPS 65057; Errors 0; Last elapsed 1.6459ms
Successfully processed 494023; RPS 63253; Errors 0; Last elapsed 1.2146ms
Successfully processed 558517; RPS 64431; Errors 0; Last elapsed 1.188ms

Grpc.Net.Client:

Successfully processed 182354; RPS 27377; Errors 0; Last elapsed 3.9545ms
Successfully processed 209746; RPS 27386; Errors 0; Last elapsed 3.4128ms
Successfully processed 236951; RPS 27205; Errors 0; Last elapsed 3.5759ms
Successfully processed 264463; RPS 27511; Errors 0; Last elapsed 3.7227ms
Successfully processed 291919; RPS 27453; Errors 0; Last elapsed 2.8536ms

My guess is Grpc.Net.Client reads from a stream, while the raw HttpClientHandler scenario gets its response data as a byte[]. I'll experiment.
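To make the comparison concrete, here is a small sketch of the two read patterns; the helper and the simplified frame handling are illustrative, not the benchmark's code.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

static class ResponseReadSketch
{
    public static async Task<byte[]> ReadSingleMessageAsync(HttpResponseMessage response, bool buffered)
    {
        if (buffered)
        {
            // Buffered: the handler drains the whole body into one byte[] for us.
            byte[] whole = await response.Content.ReadAsByteArrayAsync();
            return whole[5..];                                   // strip the 5-byte gRPC frame header
        }

        // Streamed: read the frame header and message from the response stream ourselves.
        using Stream body = await response.Content.ReadAsStreamAsync();
        var header = new byte[5];
        await ReadExactlyAsync(body, header);
        int length = BinaryPrimitives.ReadInt32BigEndian(header.AsSpan(1));
        var message = new byte[length];
        await ReadExactlyAsync(body, message);
        return message;
    }

    static async Task ReadExactlyAsync(Stream stream, byte[] buffer)
    {
        // Stream.ReadAsync may return fewer bytes than requested, so loop until the buffer is full.
        int read = 0;
        while (read < buffer.Length)
        {
            int n = await stream.ReadAsync(buffer.AsMemory(read));
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }
    }
}
```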

@JamesNK
Member Author

JamesNK commented May 10, 2020

I have narrowed the difference in performance to how Grpc.Net.Client sends the request message. The "raw" benchmark uses a ByteArrayContent, while Grpc.Net.Client writes to a stream in a custom HttpContent.

Request content:
https://github.com/JamesNK/Http2Perf/blob/master/GrpcSampleClient/PushUnaryContent.cs
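For context, this is roughly the shape of that custom content. It is a simplified approximation: the linked PushUnaryContent.cs is the real code, which serializes the protobuf message rather than taking a byte[].

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public class PushUnaryContentSketch : HttpContent
{
    private readonly byte[] _payload;

    public PushUnaryContentSketch(byte[] payload)
    {
        _payload = payload;
        Headers.ContentType = new MediaTypeHeaderValue("application/grpc");
    }

    protected override async Task SerializeToStreamAsync(Stream stream, TransportContext context)
    {
        // gRPC length-prefixed message: 1-byte compression flag + 4-byte big-endian length + payload.
        var header = new byte[5];
        BinaryPrimitives.WriteUInt32BigEndian(header.AsSpan(1), (uint)_payload.Length);
        await stream.WriteAsync(header).ConfigureAwait(false);
        await stream.WriteAsync(_payload).ConfigureAwait(false);

        // The benchmark's content ends with a FlushAsync; whether that trailing flush is still
        // needed is exactly what the discussion below digs into.
        await stream.FlushAsync().ConfigureAwait(false);
    }

    protected override bool TryComputeLength(out long length)
    {
        length = 5 + _payload.Length;
        return true;
    }
}
```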

Command line to test: dotnet run -c Release -p GrpcSampleClient r-stream-request 100 false

Successfully processed 133419; RPS 28205; Errors 0; Last elapsed 3.3902ms
Successfully processed 161529; RPS 28099; Errors 0; Last elapsed 3.3645ms
Successfully processed 189449; RPS 27905; Errors 0; Last elapsed 4.4922ms
Successfully processed 215009; RPS 25524; Errors 0; Last elapsed 3.7083ms
Successfully processed 240717; RPS 25704; Errors 0; Last elapsed 3.8551ms

I think there are issues with custom HttpContent and HTTP/2.

@stephentoub

@stephentoub
Member

There is large improvement in using HttpClientHandler directly, but something about how Grpc.Net.Client uses HttpClientHandler causes a significant performance drop.

Did Grpc.Net.Client get worse compared to the previous build you were using, or did it just not improve the way direct HttpClientHandler usage did?

@JamesNK
Member Author

JamesNK commented May 10, 2020

The performance drop I referred to was the switch from HttpClientHandler+ByteArrayContent to Grpc.Net.Client.

Your PR did improve Grpc.Net.Client perf:

Before:

Successfully processed 114039; RPS 15253; Errors 0; Last elapsed 6.8305ms
Successfully processed 129317; RPS 15266; Errors 0; Last elapsed 6.6393ms
Successfully processed 144655; RPS 15333; Errors 0; Last elapsed 6.1096ms
Successfully processed 160087; RPS 15427; Errors 0; Last elapsed 6.4803ms
Successfully processed 175038; RPS 14946; Errors 0; Last elapsed 6.1241ms

After:

Successfully processed 117038; RPS 25741; Errors 0; Last elapsed 3.9156ms
Successfully processed 142410; RPS 25351; Errors 0; Last elapsed 3.8135ms
Successfully processed 168229; RPS 25818; Errors 0; Last elapsed 3.9624ms
Successfully processed 193149; RPS 24885; Errors 0; Last elapsed 3.6607ms
Successfully processed 218170; RPS 24992; Errors 0; Last elapsed 3.3362ms

@stephentoub
Member

Your PR did improve Grpc.Net.Client perf

Phew.

The performance drop I referred to was the switch from HttpClientHandler+ReadAsByteArray to Grpc.Net.Client

What happens if you comment out the await stream.FlushAsync(CancellationToken.None).ConfigureAwait(false); line in the custom content?

@JamesNK
Member Author

JamesNK commented May 10, 2020

Results on latest nightly.

Custom content with FlushAsync

Successfully processed 151054; RPS 26851; Errors 0; Last elapsed 4.0214ms
Successfully processed 177422; RPS 26362; Errors 0; Last elapsed 3.5943ms
Successfully processed 204407; RPS 26985; Errors 0; Last elapsed 3.5367ms
Successfully processed 231167; RPS 26751; Errors 0; Last elapsed 3.7856ms
Successfully processed 258003; RPS 26824; Errors 0; Last elapsed 3.2997ms

Custom content without FlushAsync

Successfully processed 120251; RPS 46342; Errors 0; Last elapsed 1.6783ms
Successfully processed 163379; RPS 43122; Errors 0; Last elapsed 2.4453ms
Successfully processed 207134; RPS 43690; Errors 0; Last elapsed 2.3255ms
Successfully processed 248658; RPS 41504; Errors 0; Last elapsed 2.0557ms
Successfully processed 290641; RPS 41929; Errors 0; Last elapsed 1.0522ms

ByteArrayContent

Successfully processed 139706; RPS 51302; Errors 0; Last elapsed 1.2861ms
Successfully processed 192712; RPS 52973; Errors 0; Last elapsed 1.3942ms
Successfully processed 243812; RPS 51098; Errors 0; Last elapsed 1.6656ms
Successfully processed 297643; RPS 53804; Errors 0; Last elapsed 1.9658ms
Successfully processed 351130; RPS 53457; Errors 0; Last elapsed 0.976ms

I recall having to add the FlushAsync here during 3.0 so that request data was sent. Is it no longer necessary in 5.0, or did that behavior change before 3.0 went GA?

@JamesNK
Member Author

JamesNK commented May 10, 2020

grpc/grpc-dotnet#368 😄

@stephentoub
Member

stephentoub commented May 10, 2020

grpc/grpc-dotnet#368 😄

A flush is needed after an individual message on a duplex request, as otherwise the message may just sit in the buffer forever until something else triggers a flush. But there's always a flush issued during or soon after sending an EndStream HTTP/2 frame, so there's no explicit flush required at the end of content's SerializeToStreamAsync. Essentially, we always flush at the end of a request content; if you need flushes before then, you need to request them explicitly.
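Put as code, the rule looks roughly like this (PushStreamingContentSketch and its fields are illustrative names, not the gRPC client's actual types):

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class PushStreamingContentSketch : HttpContent
{
    private readonly IReadOnlyList<byte[]> _messages;
    private readonly bool _isDuplexStreaming;

    public PushStreamingContentSketch(IReadOnlyList<byte[]> messages, bool isDuplexStreaming)
    {
        _messages = messages;
        _isDuplexStreaming = isDuplexStreaming;
    }

    protected override async Task SerializeToStreamAsync(Stream stream, TransportContext context)
    {
        foreach (byte[] message in _messages)
        {
            var header = new byte[5];
            BinaryPrimitives.WriteUInt32BigEndian(header.AsSpan(1), (uint)message.Length);
            await stream.WriteAsync(header).ConfigureAwait(false);
            await stream.WriteAsync(message).ConfigureAwait(false);

            // Flush per message only when the peer is expected to react before the request ends;
            // otherwise the message could sit in the handler's send buffer until something else flushes.
            if (_isDuplexStreaming)
                await stream.FlushAsync().ConfigureAwait(false);
        }
        // No trailing flush: the handler flushes when it sends the END_STREAM frame for this request.
    }

    protected override bool TryComputeLength(out long length)
    {
        length = -1;   // unknown up front for a streamed body
        return false;
    }
}
```

This matches the measurements above: dropping the trailing flush in the unary content took the benchmark from roughly 27k to over 40k RPS.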

@stephentoub
Member

stephentoub commented May 10, 2020

Can we influence the APIs like CodedOutputStream? There are non-trivial inefficiencies stemming from having to use the APIs as designed. My guess is that's the root of the difference with ByteArrayContent, in particular that it ends up driving a need for two writes instead of one.

@JamesNK
Member Author

JamesNK commented May 10, 2020

Can we influence the APIs like CodedOutputStream?

Yes! In fact I'm working with the Protobuf team to add IBufferWriter<byte> support.
We've just merged a PR that adds support for reading from ReadOnlySequence<byte> - protocolbuffers/protobuf#7351.

When both of these PRs are merged and code generation is updated to use them, on the server we'll be able to read Protobuf directly from ASP.NET Core's request pipe and write directly to the response pipe.

HttpClient has streams instead of pipes, so things won't be quite as efficient. However, the extra abstraction of reading from ReadOnlySequence<byte> and writing to IBufferWriter<byte> will allow us to use a pooled array as the source or destination when working with Protobuf. We can then pass that array to responseStream.WriteAsync.
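A rough sketch of that pattern, assuming the IBufferWriter<byte>/ReadOnlySequence<byte> overloads being added to Google.Protobuf; ArrayBufferWriter is used here for brevity, whereas a real implementation would use a pooled writer.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;
using Google.Protobuf;

static class ProtobufBufferSketch
{
    public static async Task WriteMessageAsync(Stream requestStream, IMessage message)
    {
        // Serialize into an IBufferWriter<byte> instead of allocating with message.ToByteArray(),
        // then hand the written memory straight to the transport stream.
        var buffer = new ArrayBufferWriter<byte>();
        message.WriteTo(buffer);                                 // the new IBufferWriter<byte> overload
        await requestStream.WriteAsync(buffer.WrittenMemory).ConfigureAwait(false);
    }

    public static T ParseMessage<T>(MessageParser<T> parser, ReadOnlySequence<byte> payload)
        where T : IMessage<T>
    {
        // Parse directly from a (possibly multi-segment) sequence, e.g. a PipeReader's buffer.
        return parser.ParseFrom(payload);                        // the new ReadOnlySequence<byte> overload
    }
}
```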

@davidfowl
Member

Happy to see that change go in finally 😬

@JamesNK
Member Author

JamesNK commented May 11, 2020

Grpc.Net.Client PR - grpc/grpc-dotnet#901

Grpc.Net.Client with PR:

Successfully processed 228755; RPS 52340; Errors 0; Last elapsed 2.1993ms
Successfully processed 282848; RPS 53993; Errors 0; Last elapsed 2.1865ms
Successfully processed 336652; RPS 53802; Errors 0; Last elapsed 1.5827ms
Successfully processed 390310; RPS 53628; Errors 0; Last elapsed 0.9105ms
Successfully processed 443199; RPS 52889; Errors 0; Last elapsed 2.2438ms

Raw HttpClientHandler:

Successfully processed 158831; RPS 61493; Errors 0; Last elapsed 1.193ms
Successfully processed 215912; RPS 57081; Errors 0; Last elapsed 1.0608ms
Successfully processed 278651; RPS 62739; Errors 0; Last elapsed 0.9503ms
Successfully processed 339517; RPS 60865; Errors 0; Last elapsed 1.551ms
Successfully processed 399893; RPS 60324; Errors 0; Last elapsed 1.9509ms

I believe the gap is now down to the extra features that Grpc.Net.Client adds (call status, cancellation, tracing, strongly typed API).

It is much smaller, but there remains a gap between Grpc.Net.Client (53k RPS) and Grpc.Core (73k RPS). Will wait and see how the Go client performs.

@stephentoub
Member

Thanks, @JamesNK. And just to confirm, these numbers are with server GC enabled for the client app?

@JamesNK
Member Author

JamesNK commented May 11, 2020

Yes, the client app is using server GC.

@vlpereva

vlpereva commented Jun 5, 2020

Hey @JamesNK!
I've forked your benchmark to make a couple of adaptations for our use case, and the results may be of interest.
First, I enabled HTTPS for all connections, since it is a requirement for us; second, I added SignalR to the list of protocols. I also ran the test with .NET 3.1.300 and .NET 5.0.100-preview.7.20304.1.
[image: benchmark results]

Each RPS number is an average of 3 minutes of runtime with 0 errors. There is also a warm-up before the actual test is run.

@karelz karelz modified the milestones: 5.0.0, Future Jul 30, 2020
@karelz
Member

karelz commented Jul 30, 2020

A bunch of improvements in this space have happened (some just recently) -- we think this is "it" for 5.0; moving to Future for potential further improvements.

@geoffkizer geoffkizer removed their assignment Mar 18, 2022