
Avoid throttling: S3FileSystem::CreateDir should check the existence first rather than creating the parent path all the time #41493

Open
HaochengLIU opened this issue May 2, 2024 · 0 comments


Describe the enhancement requested

Hi, I have a use case in which thousands of jobs write Hive-partitioned Parquet files daily to the same bucket via the S3FS filesystem. Note that each job may generate anywhere from a handful to a few thousand Parquet files, depending on the volume coming from its data source.
After abstraction, these jobs follow path patterns like s3fs://my-S3-bucket/<vendor-name>/<fruit-type>/<color>/<origination>/<creation_date>/<data-center-location>/date=YYYY-MM-DD/.... The gist is that a lot of keys are being created at the same time, hence jobs frequently hit the AWS error `SLOW_DOWN during Put Object operation: The object exceeded the rate limit for object mutation operations (create, update, and delete). Please reduce your request rate.` throughout the day.

After investigation I realized that too many objects are being created one by one in the S3FileSystem::CreateDir(..) function. My local experiments show that if the implementation checks for the existence of the path first and calls impl_->CreateEmptyDir(...) only when necessary, the issue is addressed in my production environment.

(I understand different cloud vendors impose different I/O limits on a single bucket; completely fixing the issue is another story for my daily work.)

I'm proposing a code change like the one below. Hi @pitrou, I see you are the main author of s3fs.cc; could you please share your insights when you have time?
Also, even with a vanilla build, quite a few S3FS tests fail on my Mac... could you guide me on how to make them run successfully?

Many thanks.

```diff
diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index 640888e1c..782d5f75d 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -2871,7 +2871,10 @@ Status S3FileSystem::CreateDir(const std::string& s, bool recursive) {
     for (const auto& part : path.key_parts) {
       parent_key += part;
       parent_key += kSep;
-      RETURN_NOT_OK(impl_->CreateEmptyDir(path.bucket, parent_key));
+      ARROW_ASSIGN_OR_RAISE(FileInfo parent_key_info, this->GetFileInfo(parent_key));
+      if (parent_key_info.type() == FileType::NotFound) {
+        RETURN_NOT_OK(impl_->CreateEmptyDir(path.bucket, parent_key));
+      }
     }
     return Status::OK();
   } else {
```
TestS3FS.CreateDir even fails with a clean build :sigh:

```
➜  build ninja && ./debug/arrow-s3fs-test --gtest_filter="TestS3FS.CreateDir"
ninja: no work to do.
Running main() from /Users/haochengliu/Documents/projects/Arrow/build/_deps/googletest-src/googletest/src/gtest_main.cc
Note: Google Test filter = TestS3FS.CreateDir
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TestS3FS
[ RUN      ] TestS3FS.CreateDir
/Users/haochengliu/Documents/projects/Arrow/arrow/cpp/src/arrow/filesystem/s3fs_test.cc:934: Failure
Failed
Expected 'fs_->CreateDir("bucket/somefile")' to fail with IOError, but got OK
[  FAILED  ] TestS3FS.CreateDir (219 ms)
[----------] 1 test from TestS3FS (219 ms total)
```

Component(s)

C++
