
Call to KDS 'put_records' fails intermittently with 'Connection reset by Peer' within lambda extension #1106

Open
dgarcia-collegeboard opened this issue Mar 20, 2024 · 7 comments
Labels
bug This issue is a bug. p2 This is a standard priority issue

Comments

@dgarcia-collegeboard

Describe the bug

Hey!

I've derived an example from this repository: HERE

But instead of pushing to Firehose, it pushes to KDS. See the minimal example.

I've added some extra logic to my version of the above code, where I'm providing custom credentials to the KDS client that's instantiated, but otherwise my implementation is largely the same. Is there a common reason for the "Connection reset by peer" error? It seems like the extension doesn't spin up the logs processor unless I invoke my lambda again, but this could just be because the async processing means any logs made in the Processor's call method aren't emitted until they're resolved. I've seen some calls to Kinesis succeed, but others fail unexpectedly with this error:

DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, Custom { kind: Other, error: Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" } }), connection: Unknown } }

The above error is logged during a match on the result of the future that is pinned inside of a Box in the example, expanded from this value HERE
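
For reference, a minimal sketch of the call site where this error surfaces (not the author's exact code; the function shape and surrounding extension plumbing are assumed). The match on the awaited put_records future is where the SdkError::DispatchFailure above shows up:

```rust
// Hedged sketch, not the author's implementation.
use aws_sdk_kinesis::error::SdkError;
use aws_sdk_kinesis::types::PutRecordsRequestEntry;
use aws_sdk_kinesis::Client;

async fn push_to_kds(
    client: &Client,
    stream_name: &str,
    entries: Vec<PutRecordsRequestEntry>,
) {
    // The future produced by `send()` is what the example pins inside a Box;
    // matching on its output is where "Connection reset by peer" appears.
    match client
        .put_records()
        .stream_name(stream_name)
        .set_records(Some(entries))
        .send()
        .await
    {
        Ok(output) => {
            println!("pushed batch, failed records: {:?}", output.failed_record_count());
        }
        // Connection-level failures (e.g. ECONNRESET on a stale pooled
        // connection) arrive as SdkError::DispatchFailure.
        Err(SdkError::DispatchFailure(e)) => eprintln!("dispatch failure: {e:?}"),
        Err(other) => eprintln!("put_records failed: {other:?}"),
    }
}
```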

Please note that the error is intermittent: sometimes the call to KDS succeeds, and other times it fails.

I created an issue here in the lambda extension repository, but one of the maintainers mentioned this could be an issue with the SDK. I suspect the extension's lifecycle may be interfering with the connections the SDK uses for requests to KDS.

Any guidance would be much appreciated!

Expected Behavior

Lambda extension pushes logs to KDS with no issues

Current Behavior

Lambda extension fails to push logs to KDS on an intermittent / irregular basis

Reproduction Steps

https://github.com/dgarcia-collegeboard/aws-rust-lambda-extension-kinesis-example/blob/main/src/main.rs

The code above pushes to a KDS stream selected by an environment variable.
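
For illustration, a minimal sketch of that shape, assuming the stream name comes from an environment variable (the variable name KINESIS_STREAM_NAME and the single test record are illustrative, not taken from the linked repro):

```rust
// Hedged sketch of the repro shape; not the linked code.
use aws_config::BehaviorVersion;
use aws_sdk_kinesis::primitives::Blob;
use aws_sdk_kinesis::types::PutRecordsRequestEntry;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumed env var name, for illustration only.
    let stream_name = std::env::var("KINESIS_STREAM_NAME")?;

    let config = aws_config::load_defaults(BehaviorVersion::latest()).await;
    let client = aws_sdk_kinesis::Client::new(&config);

    let entry = PutRecordsRequestEntry::builder()
        .data(Blob::new(b"example log line".to_vec()))
        .partition_key("example-partition-key")
        .build()?;

    client
        .put_records()
        .stream_name(&stream_name)
        .records(entry)
        .send()
        .await?;
    Ok(())
}
```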

Possible Solution

No response

Additional Information/Context

Relevant issue link:
awslabs/aws-lambda-rust-runtime#837

Version

1.15.0

Environment details (OS name and version, etc.)

AWS NodeJS runtime for lambda

Logs

DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Connect, Custom { kind: Other, error: Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" } }), connection: Unknown } }

@dgarcia-collegeboard added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Mar 20, 2024
@rcoh
Contributor

rcoh commented Mar 20, 2024

Is this happening when running the Lambda locally, or on a real Lambda function? There are apparently some issues with the local simulator.

@dgarcia-collegeboard
Author

Is this happening when running the Lambda locally, or on a real Lambda function? There are apparently some issues with the local simulator.

Hey, thanks for the reply. This is happening with the extension deployed to AWS and attached to a Lambda running Node.js.

@dgarcia-collegeboard
Author

To update: I'm getting another error that seems potentially related:

DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Canceled, hyper::Error(IncompleteMessage))

After some digging, I found this issue, which seems to indicate that the problem may be related to connection pooling. I've looked for anyone who has encountered this with KDS / lambda extensions, regardless of language, but some of these errors seem specific to the use of hyper. I can't narrow down whether this is some setting on the Extension struct I need to modify, or whether I need to add more logic when creating the future that sends to KDS.

Are there any similar issues others have experienced that could lead to a resolution? I'm unsure what client-side logic would resolve this issue.
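
As a hedged, unconfirmed mitigation (not a fix established in this thread): connection-level DispatchFailure errors are typically classified as transient, so one client-side option is to shorten per-attempt timeouts and lean on the SDK's standard retry policy so a reset connection is retried on a fresh attempt. The numbers below are illustrative:

```rust
// Sketch of a possible mitigation, assuming retries on transient
// dispatch failures are acceptable for this workload.
use std::time::Duration;

use aws_config::retry::RetryConfig;
use aws_config::timeout::TimeoutConfig;
use aws_config::BehaviorVersion;

async fn kinesis_client_with_retries() -> aws_sdk_kinesis::Client {
    let config = aws_config::defaults(BehaviorVersion::latest())
        // Allow up to 3 attempts per operation.
        .retry_config(RetryConfig::standard().with_max_attempts(3))
        .timeout_config(
            TimeoutConfig::builder()
                // Fail a hung connect quickly so a retry can kick in.
                .connect_timeout(Duration::from_secs(2))
                // Bound each individual attempt, not just the whole operation.
                .operation_attempt_timeout(Duration::from_secs(5))
                .build(),
        )
        .load()
        .await;
    aws_sdk_kinesis::Client::new(&config)
}
```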

@dgarcia-collegeboard
Author

@rcoh hey there, just an update: I've added a bit more diagnostic information over on the other issue. There may be some nuance here between the lifecycle of the extension and the AWS SDK client connections that are used / re-used for SDK calls. Do you see potential for conflicts there?

Regarding the error from the comment above:

DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io, source: hyper::Error(Canceled, hyper::Error(IncompleteMessage))

This was caused by experimenting with the timeout_ms configuration on the extension's log buffer. I added a lot more technical detail over on the issue I linked.

@satyamagarwal249

I had a similar issue in the past when pushing data to AWS S3. My findings:

Observed with Wireshark: both errors you mention, OS error 104 (connection reset) and IncompleteMessage, are really the same thing, caused by the server closing (resetting) the connection with an RST rather than a graceful FIN. hyper just throws a different error depending on the race of when and how it learns about the closed connection while it is trying to flush data: OS error 104 is thrown when the OS informs hyper of the closure while it is writing to the closed socket, while IncompleteMessage is thrown when hyper discovers it while reading from the closed socket.

So the above is normal behavior; the main task is to identify why the server is closing the connection. In my case I solved it as follows:

a) Connection pool: the default idle-connection timeout in hyper is 90s, while for S3 the server-side idle timeout is actually around 20s. So when hyper picks an idle connection for the next request, it may already have been closed by the server, leading to those errors.

b) The server may also close connections when various policies are violated. I found that I was exceeding both the requests-per-second rate and the request-bytes-per-second rate. As soon as those rates were exceeded, the server started resetting connections at random, leading to these errors intermittently. I solved it by tuning the amount of data in flight as well as the concurrent connection count, e.g. a max idle pool size of 500, because even with data in flight limited, in a high-bandwidth, low-latency scenario with small requests I could still exceed the request-rate limit.

For me, this reduced the IncompleteMessage error count from hundreds of thousands to almost zero.
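
For the Rust SDK, the equivalent knobs live on hyper's client builder rather than in aws-sdk-kinesis itself. A hedged sketch of the two pool settings described above (the values are examples, and wiring a custom connector into the SDK via aws-smithy-runtime's hyper support is not shown here):

```rust
// Illustration of the hyper 0.14 connection-pool knobs only; handing this
// client to the AWS SDK requires its HTTP-client plumbing, which varies by
// SDK version and is omitted.
use std::time::Duration;

fn pooled_hyper_client(
) -> hyper::Client<hyper_rustls::HttpsConnector<hyper::client::HttpConnector>> {
    let https = hyper_rustls::HttpsConnectorBuilder::new()
        .with_native_roots()
        .https_only()
        .enable_http1()
        .build();

    hyper::Client::builder()
        // Drop idle connections before the server's own idle timeout so a
        // stale connection is never picked for the next request.
        .pool_idle_timeout(Duration::from_secs(15))
        // Cap how many idle connections are kept per host.
        .pool_max_idle_per_host(50)
        .build(https)
}
```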

@dgarcia-collegeboard
Author

dgarcia-collegeboard commented Mar 27, 2024

I solved it by tuning the amount of data in flight as well as the concurrent connection count, e.g. a max idle pool size of 500, because even with data in flight limited, in a high-bandwidth, low-latency scenario with small requests I could still exceed the request-rate limit.

For me, this reduced the IncompleteMessage error count from hundreds of thousands to almost zero.

@satyamagarwal249 thanks for providing some more datapoints!

Do you have an example of how this is done? I get no results for "idle" or "pool" in the Kinesis SDK docs:
https://docs.rs/aws-sdk-kinesis/latest/aws_sdk_kinesis/?search=idle

@dgarcia-collegeboard
Author

dgarcia-collegeboard commented Mar 28, 2024

This ticket seems like a good analogous breakdown of the problem being experienced here:

npgsql/npgsql#3559 (comment)

@rcoh I can't quite narrow in on a fix for this problem. The other ticket in the lambda extension repository has been closed, since the issue seems to be related to the SDK library / the underlying hyper configuration for the client. That ticket is here, and it contains a lot of information that I'd rather not copy-paste into this one.

This error pattern also seems to occur sometimes in the Node.js AWS SDK, and from my research the fix there is to set a Keep-Alive header when sending requests. I don't know whether that would fully resolve the issue, given the breakdown in the npgsql repository ticket.

Do you see a potential root cause / fix with respect to the linked tickets or the input above?

@jmklix added p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Mar 29, 2024