
aws_s3 source: async receive_messages does not return after poll_secs #20017

Open
AvihaiSam opened this issue Mar 6, 2024 · 21 comments · May be fixed by #20120
Labels
provider: aws Anything `aws` service provider related source: aws_s3 Anything `aws_s3` source related type: bug A code related bug.

Comments

@AvihaiSam

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

When using the aws_s3 source, the SQS client's receive_messages fn sometimes hangs and never returns.
This causes Vector to stop polling SQS, since the loop is still waiting for that future to complete before issuing the next request.

Configuration

sources:
  s3:
    auth:
      assume_role: ${S3_RO_ROLE}
      region: ${S3_REGION}
    region: ${S3_REGION}
    sqs:
      queue_url: ${SQS_QUEUE_URL}
    type: aws_s3

Version

0.27.0 -> 0.37.0

Debug Output

No response

Example Data

No response

Additional Context

I have made a fork with a PR that sets a timeout on the receive_messages fn. It's probably not the best way to handle it, but it's working.
It's possible that updating the SDK might help; I wasn't able to update it, though.
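For reference, the approach in that fork boils down to bounding the SDK call with a client-side deadline. The sketch below is only an illustration of the idea, not the actual PR code: the function name, error handling, and durations are made up, and it assumes a recent aws-sdk-sqs and tokio.

use std::time::Duration;
use tokio::time::timeout;

// Illustration only: bound the SQS long-poll call so a silently stalled
// connection cannot hang the polling loop forever.
async fn receive_with_deadline(
    client: &aws_sdk_sqs::Client,
    queue_url: &str,
) -> Option<aws_sdk_sqs::operation::receive_message::ReceiveMessageOutput> {
    let request = client
        .receive_message()
        .queue_url(queue_url)
        .wait_time_seconds(15) // server-side long poll, i.e. poll_secs
        .send();

    // Give the client a little slack beyond WaitTimeSeconds, then give up.
    match timeout(Duration::from_secs(20), request).await {
        Ok(Ok(output)) => Some(output),
        Ok(Err(err)) => {
            eprintln!("receive_message failed: {err}");
            None
        }
        Err(_elapsed) => {
            eprintln!("receive_message exceeded the client-side deadline");
            None
        }
    }
}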

References

https://discord.com/channels/742820443487993987/1019686524742271038/threads/1214559445963505734

@AvihaiSam added the `type: bug` label Mar 6, 2024
@jszwedko
Member

jszwedko commented Mar 7, 2024

Thanks @AvihaiSam. This seems like it may point to an issue in the AWS SDK. The receive_messages call should wait up to poll_secs (default of 15) to return if there are no messages available. It may be worth trying to reproduce with just the Rust AWS SDK and, if you can reproduce it, filing a bug upstream: https://github.com/awslabs/aws-sdk-rust

@andjhop

andjhop commented Mar 8, 2024

Hi @jszwedko. My understanding is that poll_secs only configures the WaitTimeSeconds parameter of the SQS ReceiveMessage request, which is honoured on the API side. We don't appear to have a way to configure a client-side request timeout, so if connectivity is lost at the network level and the client never sees a RST or FIN, it simply stalls. We've seen this a number of times while making updates to our CNI.

@AvihaiSam
Author

AvihaiSam commented Mar 8, 2024

I agree with @andjhop.
@jszwedko I have opened a ticket on the SDK repo, and they suggested setting an operation_attempt_timeout in the client's timeout_config. What do you think of that? It sounds like they implemented it for all types of clients, so it could be a generic config for all AWS sources/sinks in Vector, optionally set to null if no timeout is wanted.

@jszwedko
Member

jszwedko commented Mar 8, 2024

> I agree with @andjhop. @jszwedko I have opened a ticket on the SDK repo, and they suggested setting an operation_attempt_timeout in the client's timeout_config. What do you think of that? It sounds like they implemented it for all types of clients, so it could be a generic config for all AWS sources/sinks in Vector, optionally set to null if no timeout is wanted.

Good find! I think adding that attempt timeout makes sense, so the client times out if it receives no response. I agree it could be a generic setting supported by all AWS components. I think it could be configured around here:

vector/src/aws/mod.rs (lines 198 to 207 at 1b87cce)

let mut config_builder = SdkConfig::builder()
    .http_client(connector)
    .sleep_impl(Arc::new(TokioSleep::new()))
    .identity_cache(auth.credentials_cache().await?)
    .credentials_provider(
        auth.credentials_provider(region.clone(), proxy, tls_options)
            .await?,
    )
    .region(region.clone())
    .retry_config(retry_config.clone());
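If that attempt timeout were wired in at this point, it could look roughly like the sketch below. This is only an illustration: attempt_timeout_secs stands in for a hypothetical user-facing option that does not exist in Vector today, and the builder calls assume the TimeoutConfig type from aws_config.

use std::time::Duration;

use aws_config::timeout::TimeoutConfig;

// Illustration only: build a TimeoutConfig that caps each attempt, so a
// dropped connection surfaces as a retryable timeout instead of hanging.
fn attempt_timeout_config(attempt_timeout_secs: u64) -> TimeoutConfig {
    TimeoutConfig::builder()
        .operation_attempt_timeout(Duration::from_secs(attempt_timeout_secs))
        .build()
}

// Attaching it would add a single call to the chain above, e.g.:
//     .retry_config(retry_config.clone())
//     .timeout_config(attempt_timeout_config(20));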

@jszwedko added the `provider: aws` label Mar 8, 2024
@jszwedko
Member

jszwedko commented Mar 8, 2024

Actually, I see we already have a config option for some components that could maybe be used here: https://vector.dev/docs/reference/configuration/sinks/aws_s3/#request.timeout_secs. We maybe just need to expose that on AWS sources too.

@AvihaiSam
Author

@jszwedko I will try that, but the docs state that the default value is 60s, and if that were true I wouldn't have hit this issue in the first place... maybe it doesn't apply to all aws-sdk requests...

@AvihaiSam
Author

AvihaiSam commented Mar 9, 2024

well... I get

Configuration error. error=unknown field `request`

when I try to add that:

sources:
  s3:
    ...
    request:
      timeout_secs: 15

so I guess it's not really working...

@andjhop

andjhop commented Mar 11, 2024

I don't think @jszwedko was suggesting that this should already work, rather that this configuration could perhaps be made available for AWS sources as it already is for sinks. The problem I see is that the request configuration already maps closely to Tower middleware which, if I understand correctly, isn't relevant to the aws_s3 source.

@jszwedko
Member

Right, yeah, sources don't have the same request options that sinks do which, as @andjhop points out, are closely tied to Tower. I think we could just add request.timeout_secs for sources that make requests though (like aws_s3).

@AvihaiSam
Author

oooh.... I thought I was looking at the source aws_s3...
well, I guess this might work...

@andjhop

andjhop commented Mar 12, 2024

That sounds good, although it seems reasonable that a user might want finer-grained control over the settings the SDK client already allows for; here we have the ability to configure the connect timeout, read timeout, and the operation and operation attempt timeouts. Also worth considering is #15518, which already exposed some of these settings for IMDS requests, introducing imds.connect_timeout_seconds and imds.read_timeout_seconds.

@jszwedko
Member

Agreed, some of those settings (the operation and operation attempt timeouts) are AWS-specific, so it could make sense to expose them under AWS-specific options like request.aws.operation_timeout. Connect and read timeouts are general enough that we could consider supporting them on any component that makes HTTP requests, assuming hyper, the HTTP crate we use, supports them. We could start by supporting them on just the AWS components, though.

@andjhop

andjhop commented Mar 13, 2024

OK, so request.{connect,read}_timeout_seconds and request.aws.operation_{attempt_,}timeout_seconds then? And should the optional IMDS settings default to these when not configured? I'm happy to attempt the necessary changes.

@jszwedko
Member

> OK, so request.{connect,read}_timeout_seconds and request.aws.operation_{attempt_,}timeout_seconds then? And should the optional IMDS settings default to these when not configured? I'm happy to attempt the necessary changes.

That all sounds right to me! Agreed with having the IMDS fallback to those.
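To make the shape of that proposal concrete, here is a rough sketch of how the options and the IMDS fallback could fit together. None of these types exist in Vector; the names simply mirror the options discussed above.

use std::time::Duration;

/// Hypothetical request-level timeout options, mirroring the proposal above.
#[derive(Default)]
struct RequestTimeouts {
    connect_timeout_seconds: Option<u64>,
    read_timeout_seconds: Option<u64>,
    aws: AwsTimeouts,
}

/// Hypothetical AWS-specific timeouts nested under `request.aws`.
#[derive(Default)]
struct AwsTimeouts {
    operation_timeout_seconds: Option<u64>,
    operation_attempt_timeout_seconds: Option<u64>,
}

/// Hypothetical IMDS-specific overrides (cf. the options added in #15518).
#[derive(Default)]
struct ImdsTimeouts {
    connect_timeout_seconds: Option<u64>,
    read_timeout_seconds: Option<u64>,
}

impl ImdsTimeouts {
    /// Fall back to the request-level value when no IMDS-specific one is set.
    fn connect_timeout(&self, request: &RequestTimeouts) -> Option<Duration> {
        self.connect_timeout_seconds
            .or(request.connect_timeout_seconds)
            .map(Duration::from_secs)
    }

    fn read_timeout(&self, request: &RequestTimeouts) -> Option<Duration> {
        self.read_timeout_seconds
            .or(request.read_timeout_seconds)
            .map(Duration::from_secs)
    }
}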

@jdisanti

If you only need the timeout for SQS long polling, then I recommend only setting it for that specific operation in the operation's config override. It's pretty difficult to come up with an operation timeout value that works across different AWS services, since some operations are expected to take longer than others (for example, downloading a large file from S3).
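One way to get that kind of scoping, sketched under the assumption that giving the SQS client its own service config is acceptable (a per-operation config override via the SDK's customize() hook, as suggested above, would be even narrower):

use std::time::Duration;

use aws_config::timeout::TimeoutConfig;

// Illustration only: build an SQS client whose attempt timeout applies to
// SQS calls alone, leaving the shared SdkConfig (and e.g. long-running S3
// GetObject downloads) untouched. The 20s value is a placeholder.
fn sqs_client_with_timeout(sdk_config: &aws_types::SdkConfig) -> aws_sdk_sqs::Client {
    let timeouts = TimeoutConfig::builder()
        // Slightly longer than the SQS long-poll WaitTimeSeconds (poll_secs).
        .operation_attempt_timeout(Duration::from_secs(20))
        .build();

    let sqs_config = aws_sdk_sqs::config::Builder::from(sdk_config)
        .timeout_config(timeouts)
        .build();

    aws_sdk_sqs::Client::from_conf(sqs_config)
}

This keeps the shared SdkConfig untouched, so operations that are expected to take a long time elsewhere are unaffected.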

@andjhop

andjhop commented Mar 18, 2024

Thank you, that makes sense. I've made a pull request for an optional timeout configuration on the S3 source for S3 and SQS operations, the exact naming and structure to be decided.

@AvihaiSam
Author

Hey guys, I've compiled the code from @andjhop's PR and run it.
There's still an issue in my pipeline. I didn't have enough time to investigate it, and I'm not sure it's related to the aws_s3 source, but I think it is.

Right after the pod starts there are several (5 in this specific case) logs like

2024-03-20T17:44:25.714865Z INFO source{component_kind="source" component_id=s3 component_type=aws_s3}:invoke{service=s3 operation=GetObject}:try_op:try_attempt:lazy_load_identity: aws_smithy_runtime::client::identity::cache::lazy: identity cache miss occurred; added new identity (took 110.795221ms) new_expiration=2024-03-20T18:44:25Z valid_for=3599.285135873s partition=IdentityCachePartition(7)

and right after that, nothing more.
In the debug logs I can see that after several more entries there is no further reference to the aws_s3 source at all.

I will try to investigate it further...

@jszwedko
Member

> Hey guys, I've compiled the code from @andjhop's PR and run it. There's still an issue in my pipeline. I didn't have enough time to investigate it, and I'm not sure it's related to the aws_s3 source, but I think it is.
>
> Right after the pod starts there are several (5 in this specific case) logs like
>
> 2024-03-20T17:44:25.714865Z INFO source{component_kind="source" component_id=s3 component_type=aws_s3}:invoke{service=s3 operation=GetObject}:try_op:try_attempt:lazy_load_identity: aws_smithy_runtime::client::identity::cache::lazy: identity cache miss occurred; added new identity (took 110.795221ms) new_expiration=2024-03-20T18:44:25Z valid_for=3599.285135873s partition=IdentityCachePartition(7)
>
> and right after that, nothing more. In the debug logs I can see that after several more entries there is no further reference to the aws_s3 source at all.
>
> I will try to investigate it further...

Thanks for taking a look at that. Just to confirm, you don't see that issue on master?

@AvihaiSam
Author

@jszwedko - I actually see no difference between master and the PR. Though I believe the PR is legit and should be working...

@jszwedko
Member

> @jszwedko - I actually see no difference between master and the PR. Though I believe the PR is legit and should be working...

Gotcha, I think I misread your question before. I see that you expect to see the behavior on master, without this PR, but expected this PR to fix it. I would try running with debug logs enabled and see if anything jumps out.

@andjhop

andjhop commented Apr 3, 2024

Is that related? I didn't expect this fix to affect or resolve anything related to the identity cache.
