[Improvement] Set Netty as the default server type #1651
Open · 3 tasks done

rickyma (Contributor) opened this issue Apr 16, 2024 · 0 comments

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

From #1650, we can see that gRPC mode does not perform well under high pressure. I think it is time to make Netty the default server type, for the following reasons.

Netty mode brings roughly a 20% performance improvement over gRPC mode

Refer to https://github.com/apache/incubator-uniffle/blob/master/docs/benchmark_netty_case_report.md.

gRPC mode causes higher CPU load

gRPC mode drives the machine's load much higher than Netty mode does:
[Screenshot: machine load comparison between gRPC mode and Netty mode]

gRPC mode can double memory usage

In gRPC mode, both off-heap and on-heap memory may be heavily occupied at the same time. In extreme cases, this can double memory usage:
[Screenshot: off-heap and on-heap memory usage in gRPC mode]

This is because gRPC allocates off-heap memory by default, and that memory is allocated and used by the gRPC framework. When a request enters the ShuffleServerGrpcService method, the payload is copied into on-heap memory before the business code uses it. This conversion is completely unnecessary: we could use either off-heap memory or on-heap memory alone, without copying between them.
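The cost of that off-heap-to-on-heap conversion can be sketched with plain JDK buffers (a simplified illustration, not the actual gRPC internals; `copyToHeap` is a hypothetical helper standing in for the framework's behavior):

```java
import java.nio.ByteBuffer;

public class OffHeapCopyDemo {
    // Simulates gRPC handing a direct (off-heap) buffer to business code that
    // needs an on-heap byte[]: the conversion copies the whole payload, so the
    // same bytes briefly live in both off-heap and on-heap memory.
    static byte[] copyToHeap(int size) {
        ByteBuffer direct = ByteBuffer.allocateDirect(size); // off-heap allocation
        direct.put(new byte[size]);
        direct.flip();
        byte[] onHeap = new byte[direct.remaining()]; // second, on-heap allocation
        direct.get(onHeap); // full copy of the payload
        return onHeap;
    }

    public static void main(String[] args) {
        System.out.println(copyToHeap(1024).length); // payload duplicated once
    }
}
```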

gRPC mode makes usedMemory inaccurate, leading to OOM

gRPC allocates off-heap memory by default, and this framework-level memory is not counted in usedMemory.
By default, gRPC uses a PooledByteBufAllocator with a chunkSize of 2MB to allocate off-heap memory for requests, so each pre-allocation may also require allocating a 2MB off-heap chunk. Under high pressure and high concurrency, when the shuffle server receives many SendShuffleDataRequest and RequireBufferRequest at the same time, we can easily hit exceptions like the following:

[17:23:23:348] [client-data-transfer-79] ERROR org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient.requirePreAllocation:257 - Exception happened when requiring pre-allocated buffer from 127.0.0.1:19970
io.grpc.StatusRuntimeException: UNAVAILABLE: GOAWAY shut down transport. HTTP/2 error code: INTERNAL_ERROR, debug data: failed to allocate 2097152 byte(s) of direct memory (used: 161059176727, max: 161061273600)
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268) ~[rss-client-spark3-shaded-0.9.0-SNAPSHOT.jar:?]
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249) ~[rss-client-spark3-shaded-0.9.0-SNAPSHOT.jar:?]
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:167) ~[rss-client-spark3-shaded-0.9.0-SNAPSHOT.jar:?]
	at org.apache.uniffle.proto.ShuffleServerGrpc$ShuffleServerBlockingStub.requireBuffer(ShuffleServerGrpc.java:842) ~[rss-client-spark3-shaded-0.9.0-SNAPSHOT.jar:?]
	at org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient.requirePreAllocation(ShuffleServerGrpcClient.java:255) ~[rss-client-spark3-shaded-0.9.0-SNAPSHOT.jar:?]
	at org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient.lambda$sendShuffleData$0(ShuffleServerGrpcClient.java:476) ~[rss-client-spark3-shaded-0.9.0-SNAPSHOT.jar:?]
	at org.apache.uniffle.common.util.RetryUtils.retryWithCondition(RetryUtils.java:81) ~[rss-client-spark3-shaded-0.9.0-SNAPSHOT.jar:?]
	at org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient.sendShuffleData(ShuffleServerGrpcClient.java:473) ~[rss-client-spark3-shaded-0.9.0-SNAPSHOT.jar:?]
	at org.apache.uniffle.client.impl.ShuffleWriteClientImpl.lambda$sendShuffleDataAsync$1(ShuffleWriteClientImpl.java:189) ~[rss-client-spark3-shaded-0.9.0-SNAPSHOT.jar:?]
	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) ~[?:1.8.0_352]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_352]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_352]
	at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_352]
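The 2097152-byte figure in the error corresponds to one pooled chunk. In Netty's PooledByteBufAllocator a chunk is pageSize << maxOrder; the maxOrder value below is an assumption chosen to reproduce gRPC's 2MB chunks, not a value read from the gRPC source:

```java
public class GrpcChunkSizeDemo {
    public static void main(String[] args) {
        int pageSize = 8192; // Netty's default page size
        int maxOrder = 8;    // assumed: the order that yields 2 MB chunks
        // PooledByteBufAllocator carves direct memory in whole chunks of
        // pageSize << maxOrder, so even a small request can trigger a 2 MB
        // off-heap allocation that usedMemory never accounts for.
        int chunkSize = pageSize << maxOrder;
        System.out.println(chunkSize); // 2097152, the byte count in the error above
    }
}
```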

gRPC does not support sending a ByteString backed by off-heap memory

Refer to grpc/grpc-java#9704.

More flexible

Using Netty directly is more flexible, since we are not constrained by gRPC's wrapping layer.

Netty iterates faster

Netty releases upgrades faster, and the gRPC community cannot guarantee timely upgrades of its bundled Netty version.

For these reasons, I think it's time for us to make Netty the default server type.
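For reference, switching an individual deployment today looks roughly like the following server configuration fragment (the key name `rss.rpc.server.type` and the value `GRPC_NETTY` are assumptions from memory of the Uniffle configuration; verify them against the current docs before use):

```
# conf/server.conf — key and value names assumed, check the Uniffle docs
rss.rpc.server.type GRPC_NETTY
```

The proposal here is simply to flip the built-in default of that option so new deployments get Netty without any extra configuration.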

How should we improve?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!