
aws_s3 source: async receive_messages does not return after poll_secs #20017

Open
AvihaiSam opened this issue Mar 6, 2024 · 21 comments · May be fixed by #20120
Labels
provider: aws Anything `aws` service provider related source: aws_s3 Anything `aws_s3` source related type: bug A code related bug.

Comments

@AvihaiSam

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

When using the aws_s3 source, the SQS client's receive_messages fn sometimes hangs and never returns.
This causes Vector to stop polling SQS, since the loop is still waiting for that future to complete before issuing the next request.

Configuration

sources:
  s3:
    auth:
      assume_role: ${S3_RO_ROLE}
      region: ${S3_REGION}
    region: ${S3_REGION}
    sqs:
      queue_url: ${SQS_QUEUE_URL}
    type: aws_s3

Version

0.27.0 -> 0.37.0

Debug Output

No response

Example Data

No response

Additional Context

I have made a fork with a PR that sets a timeout on the receive_messages fn. It's probably not the best way to handle it, but it's working.
It's possible that updating the SDK might help; I wasn't able to update it, though.
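For reference, the approach in that fork boils down to bounding the SDK call with a client-side deadline. The sketch below is only an illustration of the idea, not the actual PR code: the function name, error handling, and durations are made up, and it assumes a recent aws-sdk-sqs and tokio.

use std::time::Duration;
use tokio::time::timeout;

// Illustration only: bound the SQS long-poll call so a silently stalled
// connection cannot hang the polling loop forever.
async fn receive_with_deadline(
    client: &aws_sdk_sqs::Client,
    queue_url: &str,
) -> Option<aws_sdk_sqs::operation::receive_message::ReceiveMessageOutput> {
    let request = client
        .receive_message()
        .queue_url(queue_url)
        .wait_time_seconds(15) // server-side long poll, i.e. poll_secs
        .send();

    // Give the client a little slack beyond WaitTimeSeconds, then give up.
    match timeout(Duration::from_secs(20), request).await {
        Ok(Ok(output)) => Some(output),
        Ok(Err(err)) => {
            eprintln!("receive_message failed: {err}");
            None
        }
        Err(_elapsed) => {
            eprintln!("receive_message exceeded the client-side deadline");
            None
        }
    }
}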

References

https://discord.com/channels/742820443487993987/1019686524742271038/threads/1214559445963505734

@AvihaiSam added the `type: bug` label Mar 6, 2024
@jszwedko
Member

jszwedko commented Mar 7, 2024

Thanks @AvihaiSam. This seems like it may point to an issue in the AWS SDK. The receive_messages call should wait up to poll_secs (default of 15) to return if there are no messages available. It may be worth trying to reproduce with just the Rust AWS SDK and, if you can reproduce it, filing a bug upstream: https://github.com/awslabs/aws-sdk-rust

@andjhop

andjhop commented Mar 8, 2024

Hi @jszwedko. My understanding is that poll_secs only configures the WaitTimeSeconds parameter of the SQS ReceiveMessage request, which is honoured on the API side. We don't appear to have a way to configure a client-side request timeout, so if connectivity is lost at the network level and the client never sees a RST or FIN, it simply stalls. We've seen this a number of times while making updates to our CNI.

@AvihaiSam
Author

AvihaiSam commented Mar 8, 2024

I agree with @andjhop.
@jszwedko I have opened a ticket on the SDK repo, and they suggested setting an operation_attempt_timeout in the client's timeout_config. What do you think of that? It sounds like they implemented it for all types of clients, so it could be a generic config for all AWS sources/sinks in Vector, optionally set to null if no timeout is wanted.

@jszwedko
Member

jszwedko commented Mar 8, 2024

> I agree with @andjhop. @jszwedko I have opened a ticket on the SDK repo, and they suggested setting an operation_attempt_timeout in the client's timeout_config. What do you think of that? It sounds like they implemented it for all types of clients, so it could be a generic config for all AWS sources/sinks in Vector, optionally set to null if no timeout is wanted.

Good find! I think adding that attempt timeout makes sense, so the client times out if it receives no response. I agree it could be a generic setting supported by all AWS components. I think it could be configured around here:

vector/src/aws/mod.rs (lines 198 to 207 at 1b87cce)

let mut config_builder = SdkConfig::builder()
    .http_client(connector)
    .sleep_impl(Arc::new(TokioSleep::new()))
    .identity_cache(auth.credentials_cache().await?)
    .credentials_provider(
        auth.credentials_provider(region.clone(), proxy, tls_options)
            .await?,
    )
    .region(region.clone())
    .retry_config(retry_config.clone());
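If that attempt timeout were wired in at this point, it could look roughly like the sketch below. This is only an illustration: attempt_timeout_secs stands in for a hypothetical user-facing option that does not exist in Vector today, and the builder calls assume the TimeoutConfig type from aws_config.

use std::time::Duration;

use aws_config::timeout::TimeoutConfig;

// Illustration only: build a TimeoutConfig that caps each attempt, so a
// dropped connection surfaces as a retryable timeout instead of hanging.
fn attempt_timeout_config(attempt_timeout_secs: u64) -> TimeoutConfig {
    TimeoutConfig::builder()
        .operation_attempt_timeout(Duration::from_secs(attempt_timeout_secs))
        .build()
}

// Attaching it would add a single call to the chain above, e.g.:
//     .retry_config(retry_config.clone())
//     .timeout_config(attempt_timeout_config(20));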

@jszwedko added the `provider: aws` label Mar 8, 2024
@jszwedko
Member

jszwedko commented Mar 8, 2024

Actually, I see we already have a config option for some components that could maybe be used here: https://vector.dev/docs/reference/configuration/sinks/aws_s3/#request.timeout_secs. We maybe just need to expose that on AWS sources too.

@AvihaiSam
Author

@jszwedko I will try that, but the docs state that the default value is 60s, and if that were true I wouldn't have hit this issue in the first place... maybe it doesn't apply to all aws-sdk requests...

@AvihaiSam
Author

AvihaiSam commented Mar 9, 2024

well... I get

Configuration error. error=unknown field `request`

when I try to add that:

sources:
  s3:
    ...
    request:
      timeout_secs: 15

so I guess it's not really working...

@andjhop

andjhop commented Mar 11, 2024

I don't think @jszwedko was suggesting that this should already work, rather that this configuration could perhaps be made available for AWS sources as it already is for sinks. The problem I see is that the request configuration already maps closely to Tower middleware which, if I understand correctly, isn't relevant to the aws_s3 source.

@jszwedko
Member

Right, yeah, sources don't have the same request options that sinks do which, as @andjhop points out, are closely tied to Tower. I think we could just add request.timeout_secs for sources that make requests though (like aws_s3).

@AvihaiSam
Author

oooh.... I thought I was looking at the source aws_s3...
well, I guess this might work...

@andjhop

andjhop commented Mar 12, 2024

That sounds good, although it seems reasonable that a user might want finer-grained control over the settings the SDK client already allows for; here we have the ability to configure the connect timeout, read timeout, and the operation and operation attempt timeouts. Also worth considering is #15518, which already exposed some of these settings for IMDS requests, introducing imds.connect_timeout_seconds and imds.read_timeout_seconds.

@jszwedko
Member

Agreed, some of those settings (the operation and operation attempt timeouts) are AWS-specific, so it could make sense to expose them under AWS-specific options like request.aws.operation_timeout. Connect and read timeouts are general enough that we could consider supporting them on any component that makes HTTP requests, assuming hyper, the HTTP crate we use, supports them. We could start by supporting them on just the AWS components, though.

@andjhop

andjhop commented Mar 13, 2024

OK, so request.{connect,read}_timeout_seconds and request.aws.operation_{attempt_,}timeout_seconds then? And should the optional IMDS settings default to these when not configured? I'm happy to attempt the necessary changes.

@jszwedko
Member

> OK, so request.{connect,read}_timeout_seconds and request.aws.operation_{attempt_,}timeout_seconds then? And should the optional IMDS settings default to these when not configured? I'm happy to attempt the necessary changes.

That all sounds right to me! Agreed with having the IMDS fallback to those.
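To make the shape of that proposal concrete, here is a rough sketch of how the options and the IMDS fallback could fit together. None of these types exist in Vector; the names simply mirror the options discussed above.

use std::time::Duration;

/// Hypothetical request-level timeout options, mirroring the proposal above.
#[derive(Default)]
struct RequestTimeouts {
    connect_timeout_seconds: Option<u64>,
    read_timeout_seconds: Option<u64>,
    aws: AwsTimeouts,
}

/// Hypothetical AWS-specific timeouts nested under `request.aws`.
#[derive(Default)]
struct AwsTimeouts {
    operation_timeout_seconds: Option<u64>,
    operation_attempt_timeout_seconds: Option<u64>,
}

/// Hypothetical IMDS-specific overrides (cf. the options added in #15518).
#[derive(Default)]
struct ImdsTimeouts {
    connect_timeout_seconds: Option<u64>,
    read_timeout_seconds: Option<u64>,
}

impl ImdsTimeouts {
    /// Fall back to the request-level value when no IMDS-specific one is set.
    fn connect_timeout(&self, request: &RequestTimeouts) -> Option<Duration> {
        self.connect_timeout_seconds
            .or(request.connect_timeout_seconds)
            .map(Duration::from_secs)
    }

    fn read_timeout(&self, request: &RequestTimeouts) -> Option<Duration> {
        self.read_timeout_seconds
            .or(request.read_timeout_seconds)
            .map(Duration::from_secs)
    }
}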

@jdisanti

If you only need the timeout for SQS long polling, then I recommend only setting it for that specific operation in the operation's config override. It's pretty difficult to come up with an operation timeout value that works across different AWS services, since some operations are expected to take longer than others (for example, downloading a large file from S3).
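One way to get that kind of scoping, sketched under the assumption that giving the SQS client its own service config is acceptable (a per-operation config override via the SDK's customize() hook, as suggested above, would be even narrower):

use std::time::Duration;

use aws_config::timeout::TimeoutConfig;

// Illustration only: build an SQS client whose attempt timeout applies to
// SQS calls alone, leaving the shared SdkConfig (and e.g. long-running S3
// GetObject downloads) untouched. The 20s value is a placeholder.
fn sqs_client_with_timeout(sdk_config: &aws_types::SdkConfig) -> aws_sdk_sqs::Client {
    let timeouts = TimeoutConfig::builder()
        // Slightly longer than the SQS long-poll WaitTimeSeconds (poll_secs).
        .operation_attempt_timeout(Duration::from_secs(20))
        .build();

    let sqs_config = aws_sdk_sqs::config::Builder::from(sdk_config)
        .timeout_config(timeouts)
        .build();

    aws_sdk_sqs::Client::from_conf(sqs_config)
}

This keeps the shared SdkConfig untouched, so operations that are expected to take a long time elsewhere are unaffected.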

@andjhop

andjhop commented Mar 18, 2024

Thank you, that makes sense. I've made a pull request for an optional timeout configuration on the S3 source for S3 and SQS operations, the exact naming and structure to be decided.

@AvihaiSam
Author

Hey guys, I've compiled the code from @andjhop's PR and run it.
There's still an issue in my pipeline. I didn't have enough time to investigate it, and I'm not sure it's related to the aws_s3 source, but I think it is.

Right after the pod starts there are several (5 in this specific case) logs like

2024-03-20T17:44:25.714865Z INFO source{component_kind="source" component_id=s3 component_type=aws_s3}:invoke{service=s3 operation=GetObject}:try_op:try_attempt:lazy_load_identity: aws_smithy_runtime::client::identity::cache::lazy: identity cache miss occurred; added new identity (took 110.795221ms) new_expiration=2024-03-20T18:44:25Z valid_for=3599.285135873s partition=IdentityCachePartition(7)

and right after that, nothing more.
In the debug logs I can see that after several more entries there is no further reference to the aws_s3 source at all.

I will try to investigate it further...

@jszwedko
Member

> Hey guys, I've compiled the code from @andjhop's PR and run it. There's still an issue in my pipeline. I didn't have enough time to investigate it, and I'm not sure it's related to the aws_s3 source, but I think it is.
>
> Right after the pod starts there are several (5 in this specific case) logs like
>
> 2024-03-20T17:44:25.714865Z INFO source{component_kind="source" component_id=s3 component_type=aws_s3}:invoke{service=s3 operation=GetObject}:try_op:try_attempt:lazy_load_identity: aws_smithy_runtime::client::identity::cache::lazy: identity cache miss occurred; added new identity (took 110.795221ms) new_expiration=2024-03-20T18:44:25Z valid_for=3599.285135873s partition=IdentityCachePartition(7)
>
> and right after that, nothing more. In the debug logs I can see that after several more entries there is no further reference to the aws_s3 source at all.
>
> I will try to investigate it further...

Thanks for taking a look at that. Just to confirm, you don't see that issue on master?

@AvihaiSam
Author

@jszwedko - I actually see no difference between master and the PR. Though I believe the PR is legit and should be working...

@jszwedko
Member

> @jszwedko - I actually see no difference between master and the PR. Though I believe the PR is legit and should be working...

Gotcha, I think I misread your question before. I see that you expect to see the behavior on master, without this PR, but expected this PR to fix it. I would try running with debug logs enabled and see if anything jumps out.

@andjhop

andjhop commented Apr 3, 2024

Is that related? I didn't expect this fix to affect or resolve anything related to the identity cache.
