
Intermittent stall of S3 PUT request for about 17 minutes #11203

Open
gudladona opened this issue May 13, 2024 · 4 comments

Comments

@gudladona

Hello,

We have an interesting problem that happens intermittently in our environment and causes S3 PUT operations (over HTTP PUT) to stall for 17-19 minutes. Let me try to describe it in detail.

First off, environment details: we are running OSS Spark and Hadoop on EKS with Karpenter.

JDK version : 11.0.19
Spark Version: 3.4.1
Hadoop Version: 3.3.4
EKS Version: 1.26
Hudi Version: 0.14.x
OS: Verified on both Bottlerocket & AL2

Issue Details:

Occasionally, we notice that a Spark stage and a few tasks get stalled for about 17 minutes; the delay is consistent whenever it happens. We have traced this to a stalled socket write during a close() inside the AWS SDK, which uses the Apache HTTP Client. When a bad TLS connection is encountered and the underlying socket should be terminated eagerly so the request can be retried, that does not happen. Instead, the socket is left open until the OS finally tears the connection down. This appears to be due to the socket linger (SO_LINGER) option, which the JDK leaves at its default of -1 (disabled). Setting linger to 0 causes bad connections to be closed immediately, but neither the AWS SDK nor the Apache HTTP Client sets this option to alter the JDK's default linger behavior.
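For reference, a minimal sketch of the JDK API in question (illustrative only, not the SDK's or the HTTP client's actual code; the endpoint is hypothetical):

```java
// java.net.Socket#setSoLinger controls what close() does with unacknowledged data.
// With the default (linger disabled, getSoLinger() == -1), a close() issued while
// another thread is stuck in write() can stall until the kernel exhausts its TCP
// retransmits. With linger enabled and timeout 0, close() resets the connection
// immediately instead.
import java.net.Socket;

public class LingerDemo {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("example.com", 443); // hypothetical endpoint

        System.out.println("default linger = " + socket.getSoLinger()); // -1: disabled

        // Force an immediate RST on close(), discarding unacknowledged data, so a
        // dead connection does not hold the writer for the OS retransmit timeout.
        socket.setSoLinger(true, 0);
        socket.close();
    }
}
```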

Attached are DEBUG-level logs for the AWS SDK, Hadoop S3A, and the Apache HTTP Client, showing the slightly different errors that appear when the issue is encountered.

After further investigation we found this JDK bug: https://bugs.openjdk.org/browse/JDK-8241239. It describes and reproduces exactly the issue we are having.

We tried forking the AWS SDK, adding a LINGER option defaulting to 0 here and applying it to the SSL socket options here. That did not fix the issue, which could be due to how this JDK version treats the socket options.
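For illustration, this is roughly how SO_LINGER can be configured when building an Apache HttpClient 4.x client directly (a sketch only, assuming HttpClient 4.4+; it is not the exact change in our SDK fork):

```java
import org.apache.http.config.SocketConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class LingerHttpClient {
    public static void main(String[] args) {
        // soLinger = 0: close() sends an RST immediately instead of waiting
        // for the kernel's retransmit timeouts to expire.
        SocketConfig socketConfig = SocketConfig.custom()
                .setSoLinger(0)
                .build();

        CloseableHttpClient client = HttpClients.custom()
                .setDefaultSocketConfig(socketConfig)
                .build();

        // Whether the option actually reaches the layered TLS socket is exactly
        // the open question noted above.
        System.out.println("client built: " + client);
    }
}
```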

Expected Behavior

The socket file descriptor should close non-gracefully/"prematurely", forcing the write to terminate immediately.

Current Behavior

close() blocks until the OS forces the socket closed at the transport layer, and only then does the socket write fail.

Reproduction Steps

As described in the JDK bug report (a rough client-side sketch follows the list):

  1. Establish a connection between two hosts/VMs; have the client side perform sizable writes (enough to fill up the socket buffers etc.) while the server just reads and discards.
  2. Introduce a null route on either side (or otherwise prevent transmission of TCP ACKs from the server to the client) to force the client to attempt retransmits.
  3. Wait until the client is stuck in a write() (check stack dumps), then call close() on the client-side socket.
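
A rough client-side sketch of the steps above (hypothetical host and port; any discard-style server on the peer will do):

```java
import java.io.OutputStream;
import java.net.Socket;

public class StuckWriteRepro {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("peer-host", 9000); // assumption: peer just reads and discards
        OutputStream out = socket.getOutputStream();
        byte[] chunk = new byte[64 * 1024];

        // A second thread closes the socket once the writer is stuck; with the
        // default SO_LINGER (-1) this close can stall for many minutes.
        Thread closer = new Thread(() -> {
            try {
                Thread.sleep(60_000); // wait until write() is blocked (confirm via thread dump)
                long start = System.nanoTime();
                socket.close();
                System.out.println("close took " + (System.nanoTime() - start) / 1_000_000_000 + "s");
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        closer.start();

        // Keep writing until the send buffer fills and write() blocks, waiting on
        // retransmits that are never acknowledged (after the null route is added).
        while (true) {
            out.write(chunk);
        }
    }
}
```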

Environment Description

  • Hudi version : 0.14.1

  • Spark version : 3.4.1

  • Hive version : NA

  • Hadoop version : 3.3.4

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : yes

Additional context

spark-task-ui
debug.log

@ad1happy2go
Contributor

@gudladona Looks like S3 throttling is happening. Did you check whether you have a lot of small file groups in your data?

@hgudladona

We are mostly certain this is not due to S3 throttling but to a bad socket state and how JDK 11 handles it. If you look at the debug log, you will notice that the socket write fails and a retry succeeds. We are tuning some network settings on the container to fail fast in this situation and let the retry handle the failure. Will keep you posted.

@hamadjaved

I ran into something very similar - it typically happened when the size of the file being written to a partition approached ~100 MB or so. I'd be curious whether there are network settings to tweak to make this fail fast.

@gudladona
Author

Assessment and workaround provided here: aws/aws-sdk-java#3110
