GH-40557: [C++] Use PutObject request for S3 in OutputStream when only uploading small data #41564

Open · wants to merge 11 commits into base: main

Conversation


@OliLay OliLay commented May 7, 2024

Rationale for this change

See #40557. The previous implementation always issued multipart uploads, which cost 3x RTT to S3, instead of just 1x RTT with a PutObject request.

What changes are included in this PR?

Implement logic in the S3 OutputStream to use a PutObject request if the data is below a certain threshold (5 MB) when the output stream is closed. If more data is written, a multipart upload is triggered instead. Note: previously, opening the output stream was already expensive because the CreateMultipartUpload request was issued at that point. With this change, opening the output stream becomes cheap, as we wait until data is actually written before deciding which upload method to use. This required some additional state-keeping in the output stream class.
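
A minimal sketch of the lazy decision described above (hypothetical helper; the surrounding stream machinery is elided, and only kMultiPartUploadThresholdSize is a name from the actual diff in cpp/src/arrow/filesystem/s3fs.cc):

#include <cstdint>

constexpr int64_t kMultiPartUploadThresholdSize = 5 * 1024 * 1024;  // 5 MB

enum class UploadKind { kSinglePutObject, kMultipart };

// Nothing is sent to S3 when the stream is opened. Once the buffered size
// crosses the threshold, the stream switches to a multipart upload (issuing
// CreateMultipartUpload); if the stream is closed before that, a single
// PutObject request uploads everything in one round trip.
UploadKind ChooseUploadKind(int64_t buffered_bytes) {
  return buffered_bytes > kMultiPartUploadThresholdSize ? UploadKind::kMultipart
                                                        : UploadKind::kSinglePutObject;
}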

Are these changes tested?

No new tests were added; existing tests already cover very small and very large writes, which exercise both upload paths. Everything should therefore be covered by the existing tests.

Are there any user-facing changes?

  • Previously, we would fail when opening the output stream if the bucket didn't exist. We inferred that by sending the CreateMultipartUpload request, which we no longer send upon opening the stream. We now fail at closing, or at writing (once >5 MB have accumulated). Replicating the old behavior is not possible without sending another request, which would defeat the purpose of this performance optimization. I hope this is fine.


github-actions bot commented May 7, 2024

⚠️ GitHub issue #40557 has been automatically assigned in GitHub to PR creator.


OliLay commented May 13, 2024

From the log output, it seems the failing CI jobs are unrelated to this change. Correct me if I am wrong, though. Should I rebase (in case the flaky tests have already been fixed on main)?


orf commented May 13, 2024

I think it’s worth rebasing to see?


OliLay commented May 14, 2024

Rebased to current main, now waiting for the CI approval again :)

@mapleFU mapleFU requested review from pitrou and felipecrv May 14, 2024 12:16

pitrou commented May 14, 2024

> Previously, we would fail when opening the output stream if the bucket didn't exist. We inferred that by sending the CreateMultipartUpload request, which we no longer send upon opening the stream. We now fail at closing, or at writing (once >5 MB have accumulated).

Hmm, I'm not sure that is ok. Usually, when opening a file for writing, you expect the initial open to fail if the path cannot be written to. I have no idea how much code relies on that, but that's a common expectation due to how filesystems usually work (e.g. when accessing local storage).


orf commented May 14, 2024

> Previously, we would fail when opening the output stream if the bucket didn't exist. We inferred that by sending the CreateMultipartUpload request, which we no longer send upon opening the stream. We now fail at closing, or at writing (once >5 MB have accumulated).
>
> Hmm, I'm not sure that is ok. Usually, when opening a file for writing, you expect the initial open to fail if the path cannot be written to. I have no idea how much code relies on that, but that's a common expectation due to how filesystems usually work (e.g. when accessing local storage).

This isn't guaranteed with the current implementation, though? Putting a part, or completing a multipart upload, can fail in various ways; an obvious one is a checksum failure.


pitrou commented May 14, 2024

My point is that if the path cannot be written to, the error happens when opening the file, not later on.


OliLay commented May 14, 2024

> My point is that if the path cannot be written to, the error happens when opening the file, not later on.

That is true. I guess the question is whether arrow's OutputStream API makes an explicit guarantee that Open should throw if the target does not exist. My guess would be that you shouldn't build code on this assumption if it isn't explicitly stated in arrow's API/docs (which it is not), but of course real-world usage deviates from that (Hyrum's Law).
But checking whether the bucket exists would add at least another 1x RTT to S3, and the goal of this PR is to reduce the number of blocking calls to S3 to reduce overall latency. If we add another check here, we end up with a total of 2x RTT to S3 for small uploads, which is better than the initial 3x RTT without this change, but still not optimal performance-wise (and we would probably have 4x RTT for multipart uploads).
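
For reference, the request counts discussed in this thread add up as follows (the up-front existence check is hypothetical at this point; HeadBucket is one candidate mentioned later):

  • current main (always multipart): CreateMultipartUpload + UploadPart + CompleteMultipartUpload = 3x RTT
  • this PR, small data: PutObject = 1x RTT
  • this PR plus an up-front existence check: check + PutObject = 2x RTT
  • multipart plus an up-front existence check: 4x RTT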


pitrou commented May 14, 2024

> That is true. I guess the question is whether arrow's OutputStream API makes an explicit guarantee that Open should throw if the target does not exist. My guess would be that you shouldn't build code on this assumption if it isn't explicitly stated in arrow's API/docs (which it is not), but of course real-world usage deviates from that (Hyrum's Law).

The API docs generally do not go into that level of detail. However, it is a general assumption that a filesystem "looks like" a local filesystem API-wise.

It is also much more convenient to get an error early, than after you have already "written" multiple megabytes of data to the file.

A compromise would be to add a dedicated option in S3Options, but of course the optimization would only benefit those users that enable the option.


OliLay commented May 14, 2024

> A compromise would be to add a dedicated option in S3Options, but of course the optimization would only benefit those users that enable the option.

We can do that. I would propose that if the optimization is disabled, we directly use multipart uploads (basically replicating the old behavior). I don't think it makes sense to explicitly issue a HeadBucket request, because that would lead to a minimum of 4 requests with multipart uploads (although we would then only have 2 requests for small writes without the optimization, compared to current main).
What do you think?


pitrou commented May 14, 2024

> We can do that. I would propose that if the optimization is disabled, we directly use multipart uploads (basically replicating the old behavior).

That sounds reasonable to me.


orf commented May 14, 2024

Just to note, issuing HeadBucket doesn't guarantee that a write will succeed; there isn't really a good way to check without actually writing. A HeadObject on the key, failing on any 403, is probably OK though? However, there are valid cases where you'd want to write to a key that your principal is not able to read from. HeadBucket also requires full s3:ListBucket permissions, so policies that restrict listing to specific prefixes would need to be updated.

I think an optimization flag is appropriate, as the behaviour is technically changing. Does it make sense to make the flag somewhat generic, rather than specific to this case?

There are a few other optimizations that might also fall into the "more performant, slightly different semantics" category. If I were to contribute a PR to improve one of the linked areas, would we want to add a new specific flag for that case, or bundle it under a single "optimized" flag?

The upside would be that it becomes more configurable, whereas the downside is that the testing and support matrix explodes. Perhaps it's better to just have a single optimized=True flag, rather than receiving bug reports when specifically optimize_put_object=True, optimize_delete_dir=False, optimize_move=True, optimize_delete=False, optimize_ensure_parents_exist=True, optimize_foobar=True are set?

Edit: I guess this is only relevant for the higher-level Python bindings; we'd still want internal flags for individual features.


OliLay commented May 15, 2024

I added a sanitize_bucket_on_open_ flag to the S3Options, adjusted the logic, and also instantiated tests with this flag enabled.
I guess the Python bindings can be tackled in a separate PR, right?

// So we instead default to application/octet-stream which is less misleading
if (!req.ContentTypeHasBeenSet()) {
req.SetContentType("application/octet-stream");
if (metadata == nullptr ||
Member:

Suggested change
-  if (metadata == nullptr ||
+  if (!metadata ||

Comment on lines 1602 to 1610
if (metadata == nullptr ||
    !metadata->Contains(ObjectMetadataSetter<ObjectRequest>::CONTENT_TYPE_KEY)) {
  // If we do not set anything then the SDK will default to application/xml
  // which confuses some tools (https://github.com/apache/arrow/issues/11934)
  // So we instead default to application/octet-stream which is less misleading
  request->SetContentType("application/octet-stream");
} else {
  RETURN_NOT_OK(SetObjectMetadata(metadata, request));
}
Member:

How about swapping these clauses for easier reading?

if (metadata && metadata->Contains(ObjectMetadataSetter<ObjectRequest>::CONTENT_TYPE_KEY)) {
  RETURN_NOT_OK(SetObjectMetadata(metadata, request));
} else {
  request->SetContentType("application/octet-stream");
}

/// for latency-sensitive applications, at the cost of the OutputStream possibly
/// throwing an exception at a later stage (i.e. at writing or closing) if e.g. the
/// bucket does not exist.
bool sanitize_bucket_on_open = true;
Member:

Perhaps we should make this more general, to open up other potential optimizations?

  /// Whether to allow file-open methods to return before the actual open
  ///
  /// Enabling this may reduce the latency of `OpenInputStream`, `OpenOutputStream`,
  /// and similar methods, by reducing the number of roundtrips necessary. It may also
  /// allow usage of more efficient S3 APIs for small files.
  /// The downside is that failure conditions such as attempting to open a file in a
  /// non-existing bucket will only be reported when actual I/O is done (at worst,
  /// when attempting to close the file).
  bool allow_delayed_open = false;

Author:

Yes, I like this much more 👍
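
For illustration, opting in would then look roughly like this (a sketch assuming the allow_delayed_open field proposed above lands on S3Options as written):

#include <memory>

#include "arrow/filesystem/s3fs.h"
#include "arrow/result.h"

arrow::Result<std::shared_ptr<arrow::fs::S3FileSystem>> MakeDelayedOpenFs() {
  arrow::fs::S3Options options = arrow::fs::S3Options::Defaults();
  // Opt in: fewer round trips when opening files, but errors such as a
  // missing bucket only surface at write or close time.
  options.allow_delayed_open = true;
  return arrow::fs::S3FileSystem::Make(options);
}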

@@ -1197,6 +1199,19 @@ TEST_F(TestS3FS, OpenOutputStreamSyncWrites) {
TestOpenOutputStream();
}

TEST_F(TestS3FS, OpenOutputStreamNoBucketSanitizationSyncWrites) {
Member:

Instead of adding more test methods for every combinatorial expansion, perhaps we should use a for loop over the various tested parameter values?

Author:

I added a struct which abstracts the parameter combinations. 7748824
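
A rough sketch of what such a struct could look like (hypothetical body; S3OptionsTestParameters, GetCartesianProduct, and apply_to_s3_options are the names that appear later in this thread):

#include <vector>

struct S3OptionsTestParameters {
  bool background_writes = false;
  bool allow_delayed_open = false;

  static std::vector<S3OptionsTestParameters> GetCartesianProduct() {
    std::vector<S3OptionsTestParameters> combinations;
    for (const bool background : {false, true}) {
      for (const bool delayed : {false, true}) {
        combinations.push_back({background, delayed});
      }
    }
    return combinations;
  }
};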

@@ -1293,12 +1295,14 @@ std::shared_ptr<const KeyValueMetadata> GetObjectMetadata(const ObjectResult& re

template <typename ObjectRequest>
struct ObjectMetadataSetter {
static constexpr std::string_view CONTENT_TYPE_KEY = "Content-Type";
Member:

Nit: naming conventions for constants

Suggested change
-  static constexpr std::string_view CONTENT_TYPE_KEY = "Content-Type";
+  static constexpr std::string_view kContentTypeKey = "Content-Type";

}

return Status::OK();
}

Status CleanupAfterFlush() {
Member:

Should this be named CleanupAfterClose?

@@ -1734,7 +1812,7 @@ class ObjectOutputStream final : public io::OutputStream {
return Status::OK();
}

// Upload current buffer
// Upload current buffer if we are above threshold for multi-part upload
Member:

The comment is misleading: this always uploads the current buffer, right?

Member:

I mean, CommitCurrentPart starts by calling CreateMultipartUpload.

Author:

You're right, I just wanted to make the point that we won't call this when we haven't accumulated enough data. I've clarified the comment. 7748824

if (current_part_ == nullptr) {
  // In case the stream is closed directly after it has been opened without writing
  // anything, we'll have to create an empty buffer.
  buf = Buffer::FromVector<uint8_t>({});
Member:

Nit, but std::make_shared<Buffer>("", 0) looks simpler to me.

Comment on lines 1885 to 1887
template <typename RequestType, typename OutcomeType>
static Result<OutcomeType> TriggerUploadRequest(
    const RequestType& request, const std::shared_ptr<S3ClientHolder>& holder);
Member:

I don't think this declaration is actually necessary? i.e. TriggerUploadRequest doesn't need to be a template method, it can be a regular overloaded method.
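
In other words, something like this (a sketch of the reviewer's suggestion, reusing the request/outcome types already visible in this PR):

static Result<Aws::S3::Model::PutObjectOutcome> TriggerUploadRequest(
    const Aws::S3::Model::PutObjectRequest& request,
    const std::shared_ptr<S3ClientHolder>& holder);

static Result<Aws::S3::Model::UploadPartOutcome> TriggerUploadRequest(
    const Aws::S3::Model::UploadPartRequest& request,
    const std::shared_ptr<S3ClientHolder>& holder);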

Comment on lines 1880 to 1883
template <typename RequestType, typename OutcomeType>
using UploadResultCallbackFunction =
    std::function<Status(const RequestType& request, std::shared_ptr<UploadState>,
                         int32_t, OutcomeType outcome)>;
Member:

Let's make this signature more informative

Suggested change
-  template <typename RequestType, typename OutcomeType>
-  using UploadResultCallbackFunction =
-      std::function<Status(const RequestType& request, std::shared_ptr<UploadState>,
-                           int32_t, OutcomeType outcome)>;
+  template <typename RequestType, typename OutcomeType>
+  using UploadResultCallbackFunction =
+      std::function<Status(const RequestType& request, std::shared_ptr<UploadState>,
+                           int32_t part_number, OutcomeType outcome)>;

Comment on lines 1990 to 1992
return Upload<Aws::S3::Model::PutObjectRequest, Aws::S3::Model::PutObjectOutcome>(
    std::move(req), sync_result_callback, async_result_callback, data, nbytes,
    owned_buffer);
Member:

Nits: move more arguments

Suggested change
-  return Upload<Aws::S3::Model::PutObjectRequest, Aws::S3::Model::PutObjectOutcome>(
-      std::move(req), sync_result_callback, async_result_callback, data, nbytes,
-      owned_buffer);
+  return Upload<Aws::S3::Model::PutObjectRequest, Aws::S3::Model::PutObjectOutcome>(
+      std::move(req), std::move(sync_result_callback), std::move(async_result_callback),
+      data, nbytes, std::move(owned_buffer));

Comment on lines 2038 to 2040
return Upload<Aws::S3::Model::UploadPartRequest, Aws::S3::Model::UploadPartOutcome>(
    std::move(req), sync_result_callback, async_result_callback, data, nbytes,
    owned_buffer);
Member:

Same here: more arguments can be moved.

@OliLay OliLay requested review from kou, pitrou and mapleFU May 17, 2024 07:39

mapleFU commented May 24, 2024

Sorry for the delayed review! I'll merge after other committers approve.

@pitrou pitrou left a comment

Thanks for this, @OliLay! I have some questions and suggestions below.

@@ -1293,12 +1295,14 @@ std::shared_ptr<const KeyValueMetadata> GetObjectMetadata(const ObjectResult& re

template <typename ObjectRequest>
struct ObjectMetadataSetter {
static constexpr std::string_view kContentTypeKey = "Content-Type";
Member:

"Content-Type" is still used as a literal above. Should we move this at the top-level and use it everywhere?
Or, conversely, just undo this, since the "Content-Type" literal is unlikely to change value...

Author:

I reverted the change that moved it to a constant; the literal is now inlined everywhere again.

// If we do not set anything then the SDK will default to application/xml
// which confuses some tools (https://github.com/apache/arrow/issues/11934)
// So we instead default to application/octet-stream which is less misleading
request->SetContentType("application/octet-stream");
Member:

So metadata is ignored if it doesn't contain a "Content-Type" key? Or am I missing something here?

Author:

Good catch, this was indeed an issue; I fixed it and also made the code a bit clearer.

return Status::OK();
}

// OutputStream interface

bool ShouldBeMultipartUpload() const { return pos_ > kMultiPartUploadThresholdSize; }
Member:

Why not instead

Suggested change
-  bool ShouldBeMultipartUpload() const { return pos_ > kMultiPartUploadThresholdSize; }
+  bool ShouldBeMultipartUpload() const {
+    return pos_ > kMultiPartUploadThresholdSize || !allow_delayed_open_;
+  }

bool ShouldBeMultipartUpload() const { return pos_ > kMultiPartUploadThresholdSize; }

bool IsMultipartUpload() const {
  return ShouldBeMultipartUpload() || is_multipart_created_;
Member:

Why add is_multipart_created_ here? Is there any situation where is_multipart_created_ would be true but ShouldBeMultipartUpload() would be false?

Author:

There is probably no reason; I think it evolved this way during the PR review, before we had the feature flag. I've streamlined the handling to use the upload id instead of this additional boolean.

static void HandleUploadOutcome(const std::shared_ptr<UploadState>& state,
                                int part_number, const S3Model::UploadPartRequest& req,
                                const Result<S3Model::UploadPartOutcome>& result) {
static Status UploadError(const Aws::S3::Model::PutObjectRequest& request,
Member:

Perhaps call this UploadUsingSingleRequestError?

if (--state->parts_in_progress == 0) {
  state->pending_parts_completed.MarkFinished(state->status);
if (--state->uploads_in_progress == 0) {
  state->pending_uploads_completed.MarkFinished(state->status);
Member:

Same here.


Aws::String upload_id_;
bool closed_ = true;
bool is_multipart_created_ = false;
Member:

For the record, is_multipart_created_ == true iff !upload_id_.empty(), right? Perhaps this can be consolidated and we can rename upload_id_ to something more explicit, such as multipart_upload_id.
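
For example (a sketch of the suggested consolidation, with the hypothetical rename applied):

Aws::String multipart_upload_id_;  // empty until CreateMultipartUpload succeeds

bool IsMultipartCreated() const { return !multipart_upload_id_.empty(); }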

Comment on lines 1198 to 1201
void apply_to_s3_options(S3Options& options) const {
  options.background_writes = background_writes;
  options.allow_delayed_open = allow_delayed_open;
}
Member:

Coding style nit: 1) use CamelCase for function names, 2) avoid mutable refs; therefore:

Suggested change
-  void apply_to_s3_options(S3Options& options) const {
-    options.background_writes = background_writes;
-    options.allow_delayed_open = allow_delayed_open;
-  }
+  void ApplyToS3Options(S3Options* options) const {
+    options->background_writes = background_writes;
+    options->allow_delayed_open = allow_delayed_open;
+  }

TEST_F(TestS3FS, OpenOutputStream) {
  for (const auto& combination : S3OptionsTestParameters::GetCartesianProduct()) {
    combination.apply_to_s3_options(options_);
    MakeFileSystem();
Member:

I don't think this is deleting the files currently on the filesystem, which means the tests might succeed even if the write path doesn't work for some options.

Author:

Good catch, I implemented cleanup (i.e. emptying the test bucket and restoring the test files again) after each test run.

TestOpenOutputStreamDestructor();
TEST_F(TestS3FS, OpenOutputStream) {
  for (const auto& combination : S3OptionsTestParameters::GetCartesianProduct()) {
    combination.apply_to_s3_options(options_);
Member:

Ideally we would also leave a trace in the test log to make diagnosing failures easier (see the ARROW_SCOPED_TRACE macro somewhere).
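
For example (a sketch; ARROW_SCOPED_TRACE is Arrow's test helper that wraps gtest's SCOPED_TRACE and builds the message from its arguments):

for (const auto& combination : S3OptionsTestParameters::GetCartesianProduct()) {
  ARROW_SCOPED_TRACE("background_writes = ", combination.background_writes,
                     ", allow_delayed_open = ", combination.allow_delayed_open);
  combination.apply_to_s3_options(options_);
  MakeFileSystem();
  // ... test body ...
}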

@OliLay OliLay requested a review from pitrou June 4, 2024 11:50

OliLay commented Jun 5, 2024

I also merged main into this branch due to a conflict. Should be free of conflicts now.
