GCP cloud storage output error with "Failed to delete temporary file used for merging: context canceled" #2522
Hey @anthonycorbacho 👋 thank you for raising this issue! I'm not sure what's the best way to address this, but I'd consider skipping the deletion if the write fails. Also, it might be worth either reducing the level of that "Failed to delete temporary file used for merging" log or introducing a flag which users can enable to silence this error. Additionally, I think there's some opportunity for performance gains by passing more than one object per compose call.

PS: PRs are welcome!
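For illustration, the flag proposed above might surface in the output config roughly like this (a sketch only; `silence_merge_delete_errors` is a hypothetical field name that does not exist in Benthos today):

```yaml
output:
  gcp_cloud_storage:
    bucket: abucket
    path: benthos/dt=${!@dt}/file-${!timestamp_unix_nano()}.csv
    collision_mode: append
    timeout: 60s
    # Hypothetical flag sketched from the suggestion above; NOT a real
    # Benthos field. It would downgrade the "Failed to delete temporary
    # file used for merging" log so it no longer surfaces as an error.
    silence_merge_delete_errors: true
```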
Hi @mihaitodor thank you for the answer. I tested different settings and am still having issues. I cloned the repo and added more debugging to see the folder where the merge failed. The funny thing is that the file that was supposed to have a bunch of entries inside only has 1. I tried with batching and having multiple files instead, and the same issue happens:

```yaml
gcp_cloud_storage:
  bucket: "abucket"
  path: benthos/dt=${!@dt}/id=${!@id}/file-${!timestamp_unix_nano()}.csv
  content_type: binary/octet-stream
  collision_mode: append
  timeout: 60s
  max_in_flight: 64
  batching:
    count: 100
```

Maybe my pipeline is causing the issue, because each event I get generally has a bunch of IDs in the message, and I am trying to

I tried to put every message into a single file, but I get a rate limit error.
I'm not sure what's going on there, but if you do

PS: When configuring
@mihaitodor I followed your recommendation with the batching, and the good news is that I don't get the error anymore, so I think this helped. But this brings a new issue. Now, with the following gcp output setting:

```yaml
batching:
  period: 1s
  processors:
    - archive:
        format: lines
```

each folder (dt=YYYY/MM/DD/id=XXXX) contains a mix of IDs, not the one I am targeting with the path. So if the id in the path is
@anthonycorbacho That's great to hear! Regarding the path, it sounds like you don't want to mix IDs across multiple files, so you'd just have a file for each ID value, if I'm understanding you correctly. If that's what you need, then you can break up the batch into smaller batches by using the `group_by_value` processor.
It's more like I don't want to mix IDs across multiple paths; one folder = one ID, more likely.
IDs are sensor IDs, so multiple readings will happen within a couple of seconds.
@mihaitodor I tried your last recommended way:

```yaml
gcp_cloud_storage:
  bucket: "abucket"
  path: benthos/dt=${!@dt}/file-${!@id}.csv
  content_type: application/octet-stream
  collision_mode: append
  timeout: 60s
  max_in_flight: 64
  batching:
    period: 10s
    processors:
      - group_by_value:
          value: ${! meta("id") }
      - archive:
          format: lines
```

I have a mix of

and
I have a feeling that

That is surprising. Benthos should keep retrying messages until the output reports success. It would be great to know if that is somehow a reproducible issue, since it should be fixed.
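For anyone hitting the same symptom, one defensive pattern while the root cause is investigated could be the `fallback` output broker, which hands messages to the next output in the list when the first one fails (a sketch; the bucket name and local spool path are assumptions):

```yaml
output:
  fallback:
    - gcp_cloud_storage:
        bucket: abucket
        path: benthos/dt=${!@dt}/file-${!@id}.csv
        collision_mode: append
        timeout: 60s
    # If the GCS write ultimately fails, spool the batch to local disk
    # instead of risking message loss.
    - file:
        path: ./gcs_failed/${!timestamp_unix_nano()}.csv
        codec: lines
```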
@mihaitodor I have tried with

I think it's easily reproducible: if you have a lot of messages (~9k every sec) you can easily hit this issue. I don't have a complex use case; I just transform a map of JSON into multiple messages, change the formatting into CSV, and then try to save it into GCP cloud storage.
Understood. Unfortunately, I don't have the capacity to look into this in more detail right now, and I have no GCP access to run tests. I think the hints I left in my previous comment should point in the right direction if anyone is interested in submitting a PR.
@mihaitodor thank you for your help! I really appreciate the time you spent on my issue. Maybe I should try to use AWS S3 storage instead of GCS to see if it fixes my issue. I will try to play with it more and see if I find a way to fix it.
Context
I am reading a bunch of events coming from an MQTT broker, and I am trying to process them and save the output into blob storage (GCP cloud storage).

The processing is relatively simple: I take an MQTT message (a slice of events) and output it in a CSV-like format. The output logic is to create folders partitioned by date (dt=YYYY/MM/DD) and id (id=ID).

When running Benthos with this config, I quite often see this error. I am not 100% sure whether this is a false positive, since I have quite a lot of events coming in per second (~1000 events), and the error doesn't tell me much about the path timing out, so I cannot really check whether the temp file has been merged or not.

I tried changing the timeout to 20s, 30s, and 60s, but it always seems to error.
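For context, a minimal sketch of this kind of pipeline (assuming a recent Benthos v4; the broker URL, topic, field names, and the exact dt/id mappings are my assumptions, not the original config):

```yaml
input:
  mqtt:
    urls: [ tcp://localhost:1883 ]   # assumed broker address
    topics: [ sensors/readings ]     # assumed topic

pipeline:
  processors:
    # Set the partition keys as metadata; assumes each message carries
    # sensor_id, ts, and value fields.
    - mapping: |
        meta dt = now().ts_format("2006/01/02")
        meta id = this.sensor_id
        # CSV-like line: sensor id, timestamp, value
        root = [this.sensor_id.string(), this.ts.string(), this.value.string()].join(",")

output:
  gcp_cloud_storage:
    bucket: abucket
    path: benthos/dt=${!@dt}/id=${!@id}/file-${!timestamp_unix_nano()}.csv
    collision_mode: append
    timeout: 60s
    batching:
      period: 10s
      processors:
        # One sub-batch (and therefore one object) per sensor ID.
        - group_by_value:
            value: ${! meta("id") }
        - archive:
            format: lines
```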