HADOOP-18679. Add API for bulk/paged delete of files #6726
Conversation
A more minimal design that is easier to use and implement. The caller creates a BulkOperation, queries its page size, and then submits batches of paths to delete, each no larger than that size. The outcome of each call contains a list of failures. The S3A implementation shows how straightforward it is. Even with a single-entry page size, it is still more efficient to use this, as it doesn't try to recreate a parent dir or perform any probes to see if it is a directory: it maps straight to a DELETE call. Change-Id: Ibe8737e7933fe03d39070e7138a710b50c3d60c2
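The shape of that API can be sketched as follows. This is an illustrative, self-contained sketch only: the `pageSize()`/`bulkDelete()` method names follow the design described above, while `InMemoryStore` is a hypothetical stand-in for a real object-store client, not the actual S3A code.

```java
import java.util.*;

// Sketch of the bulk delete API shape: callers query the page size, then
// submit batches no larger than it; the result lists only the failures.
// InMemoryStore is a hypothetical stand-in for a real store client.
public class BulkDeleteSketch {

    interface BulkDelete {
        int pageSize();
        List<Map.Entry<String, String>> bulkDelete(Collection<String> paths);
    }

    static class InMemoryStore implements BulkDelete {
        final Set<String> objects = new HashSet<>();
        private final int pageSize;

        InMemoryStore(int pageSize, Collection<String> initial) {
            this.pageSize = pageSize;
            objects.addAll(initial);
        }

        @Override public int pageSize() { return pageSize; }

        @Override
        public List<Map.Entry<String, String>> bulkDelete(Collection<String> paths) {
            if (paths.size() > pageSize) {
                throw new IllegalArgumentException("batch larger than page size " + pageSize);
            }
            List<Map.Entry<String, String>> failures = new ArrayList<>();
            for (String p : paths) {
                // like an object-store DELETE: removing a nonexistent object
                // is not an error, and there are no directory probes
                objects.remove(p);
            }
            return failures;
        }
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore(2, Arrays.asList("a", "b", "c"));
        List<Map.Entry<String, String>> failures =
            store.bulkDelete(Arrays.asList("a", "b"));
        if (!failures.isEmpty()) throw new AssertionError("expected no failures");
        if (store.objects.size() != 1) throw new AssertionError("expected one object left");
        System.out.println("remaining=" + store.objects);
    }
}
```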
Add methods in FileUtil to take an FS, cast it to a BulkDeleteSource, then perform the pageSize/bulkDelete operations. This is to make reflection-based access straightforward: no new interfaces or types to work with, just two new methods with type-erased lists. Change-Id: I2d7b1bf8198422de635253fc087ca551a8bc6609
Change-Id: Ib098c07cc1f7747ed1a3131b252656c96c520a75
Using this PR to start with the initial design, implementation and services offered by having lower-level interaction with S3 pushed down into an S3AStore class, with an interface/impl split. The bulk delete callbacks now talk to the store, *not* S3AFS, with some minor changes in behaviour (IllegalArgumentException is raised if root paths / are to be deleted). Mock tests are failing; I expected that: they are always brittle. What next? Get this in and then move lower-level FS ops over, a method calling s3client at a time, or in groups, as appropriate. The metrics of success are:
* all callback classes created in S3A FS can work through the store
* no s3client direct invocation in S3AFS
Change-Id: Ib5bc58991533fd5d13e78426e88b884b6ae5205c
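The interface/impl split described above can be sketched in a self-contained way. Here `ObjectStore` and `RecordingStore` are hypothetical names, not the real S3AStore API; the sketch shows the two behaviours mentioned: callbacks go through a narrow store interface rather than the filesystem, and root paths are rejected with IllegalArgumentException.

```java
import java.util.*;

// Hypothetical sketch of the store split: callers depend only on a narrow
// store interface, and deleting a root path raises IllegalArgumentException.
public class StoreSplitSketch {

    interface ObjectStore {
        void deleteObjects(Collection<String> keys);
    }

    static class RecordingStore implements ObjectStore {
        final List<String> deleted = new ArrayList<>();

        @Override
        public void deleteObjects(Collection<String> keys) {
            for (String key : keys) {
                if (key.isEmpty() || key.equals("/")) {
                    // root paths must never be bulk deleted
                    throw new IllegalArgumentException("root path cannot be deleted: " + key);
                }
            }
            deleted.addAll(keys);
        }
    }

    public static void main(String[] args) {
        RecordingStore store = new RecordingStore();
        store.deleteObjects(Arrays.asList("dir/a", "dir/b"));
        if (store.deleted.size() != 2) throw new AssertionError("expected 2 deletions");
        boolean rejected = false;
        try {
            store.deleteObjects(Collections.singletonList("/"));
        } catch (IllegalArgumentException e) {
            rejected = true;
        }
        if (!rejected) throw new AssertionError("root delete should be rejected");
        System.out.println("deleted=" + store.deleted);
    }
}
```

A stubbed `RecordingStore` like this is also what makes unit tests simpler than mocking the whole filesystem, which is the motivation given for the split.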
Changing results of method calls, using Tuples.pair() to return Map.Entry() instances as immutable tuples. Change-Id: Ibdd5a5b11fe0a57b293b9cb623272e272c8bab69
and some minor prod changes.
Commits 4bbb3b7 to a61f139
Commented. I've also done PR #6738, which tunes the API to work with Iceberg, having just written a PoC of the Iceberg binding. Can you cherry-pick my PR onto your branch and then address the review comments? After that, please do not rebase your PR; that makes it easier for me to keep my own branch in sync with your changes. Thanks. The PoC of the Iceberg integration is based on their S3FileIO one. The Iceberg API passes in a collection of paths, which may span multiple filesystems. To handle this,
We are going to need a default FS impl which just invokes delete(path, false) and maps any IOE to a failure. Change-Id: If56bca7cb8529ccbfbb1dfa29cedc8287ec980d4
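A minimal sketch of that default fallback, assuming a page size of 1. `SimpleFs` is a hypothetical stand-in for the Hadoop `FileSystem.delete(path, recursive)` call; the point is that an IOException becomes a (path, error) failure entry instead of propagating.

```java
import java.io.IOException;
import java.util.*;

// Sketch of the default single-path fallback: page size 1, each path goes
// through delete(path, false), and any IOException is mapped to a failure
// entry. SimpleFs is a hypothetical stand-in for the real FileSystem.
public class DefaultBulkDeleteSketch {

    interface SimpleFs {
        boolean delete(String path, boolean recursive) throws IOException;
    }

    static List<Map.Entry<String, String>> bulkDelete(SimpleFs fs, Collection<String> paths) {
        if (paths.size() > 1) {          // default page size is 1
            throw new IllegalArgumentException("too many paths for page size 1");
        }
        List<Map.Entry<String, String>> failures = new ArrayList<>();
        for (String path : paths) {
            try {
                fs.delete(path, false);
            } catch (IOException e) {
                // map the IOE to a failure entry rather than rethrowing
                failures.add(new AbstractMap.SimpleImmutableEntry<>(path, e.toString()));
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        SimpleFs failing = (p, r) -> { throw new IOException("no permission: " + p); };
        List<Map.Entry<String, String>> failures =
            bulkDelete(failing, Collections.singletonList("/tmp/x"));
        if (failures.size() != 1) throw new AssertionError("expected one failure entry");
        System.out.println(failures.get(0).getKey() + " -> " + failures.get(0).getValue());
    }
}
```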
 */
public class DefaultBulkDeleteOperation implements BulkDelete {

    private final int pageSize;
this is always 1, isn't it? so much can be simplified here
- no need for the field
- no need to pass it in the constructor
- pageSize() to return 1
    validateBulkDeletePaths(paths, pageSize, basePath);
    List<Map.Entry<Path, String>> result = new ArrayList<>();
    // this for loop doesn't make sense as pageSize must be 1.
    for (Path path : paths) {
there's only ever going to be 1 entry here
        try {
            fs.delete(path, false);
            // What to do if this returns false?
            // I think we should add the path to the result list with value "Not Deleted".
Good question. I'd say yes, or actually do a getFileStatus() call and see what is there:
- file doesn't exist (not an error)
- path is a directory: add to result
The key is that file-not-found isn't escalated.
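The suggested handling can be sketched as follows. `Probe` is a hypothetical stand-in for the outcome of a `getFileStatus()` call, not a real Hadoop type; the sketch just encodes the rule above: a missing file is not an error, while a directory becomes a failure entry.

```java
import java.util.*;

// Sketch of the suggested handling when delete() returns false: probe the
// path; a missing file is success, a directory is reported as a failure.
public class DeleteFalseHandlingSketch {

    // hypothetical result of a getFileStatus() probe
    enum Probe { MISSING, FILE, DIRECTORY }

    /** Returns an optional (path, error) entry; empty means "not an error". */
    static Optional<Map.Entry<String, String>> onDeleteReturnedFalse(String path, Probe probe) {
        switch (probe) {
            case MISSING:
                return Optional.empty();   // file not found: not escalated
            case DIRECTORY:
                return Optional.of(
                    new AbstractMap.SimpleImmutableEntry<>(path, "path is a directory"));
            default:
                return Optional.of(
                    new AbstractMap.SimpleImmutableEntry<>(path, "not deleted"));
        }
    }

    public static void main(String[] args) {
        if (onDeleteReturnedFalse("/a", Probe.MISSING).isPresent())
            throw new AssertionError("missing path should not be an error");
        if (!onDeleteReturnedFalse("/d", Probe.DIRECTORY).isPresent())
            throw new AssertionError("directory should be reported");
        System.out.println("ok");
    }
}
```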
    bindReadOnlyRolePolicy(assumedRoleConfig, readOnlyDir);
    roleFS = (S3AFileSystem) destDir.getFileSystem(assumedRoleConfig);

    int range = 10;
on a store where bulk delete page size == 1, use that as the range
💔 -1 overall
This message was automatically generated.
Add bulk delete path capability true for all FS
I'm happy with this; I just want to make sure we are happy with the fallback logic of delete() returning false. It's such a vague result anyway.
We should get reviews from people like the Iceberg devs.
@@ -85,6 +88,9 @@ public ITestS3AContractBulkDelete(boolean enableMultiObjectDelete) {
  protected Configuration createConfiguration() {
      Configuration conf = super.createConfiguration();
      S3ATestUtils.disableFilesystemCaching(conf);
      conf = propagateBucketOptions(conf, getTestBucketName(conf));
      skipIfNotEnabled(conf, Constants.ENABLE_MULTI_DELETE,
This should actually be guarded with `if (enableMultiObjectDelete)` so that at least the single-entry delete ops are tested against GCS. This matters because GCS returns 404 on a DELETE of a nonexistent path; we need to make sure that doesn't trigger a failure.
Nice catch. Tested with a GCS bucket as well.
For the Iceberg support to work, all filesystems MUST implement the API, or we have to modify that PoC to handle the case where they don't. I'd rather the spec say that any FS which supports delete() MUST support this, since it comes for free.
Iceberg PoC PR: apache/iceberg#10233
💔 -1 overall
This message was automatically generated.
Commented.
- I'd prefer that BulkDeleteSource be in FileSystem, with the base implementation returning a DefaultBulkDeleteOperation instance.
- And so we should have one of the contract tests doing it directly through the API on the getFileSystem() instance.
- Other than that, though, looking really good.
Commented. Now, one more thing, based on my changes in #6686: it adds many more methods to this class, so I'm giving them the name of the class/interface + "_" + method name. Can you do the same here? Some style checker will complain, but it will help us to separate the methods in the new class. Other than that, all good.
@@ -4980,4 +4982,17 @@ public MultipartUploaderBuilder createMultipartUploader(Path basePath)
    methodNotSupported();
    return null;
  }

  /**
   * Create a default bulk delete operation to be used for any FileSystem.
This doesn't hold for the subclasses. Better to say:
Create a bulk delete operation.
The default implementation returns an instance of {@link DefaultBulkDeleteOperation}.
I don't understand what to do here.
@mukund-thakur I tried to clarify what I mean: we separate the class/interface from the method/operation with a "_" character. It makes a lot more sense when you look at my PR, which brings a lot more methods into the same class. Why all in the same class? Less reflection code.
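The benefit of the flat `ClassOrInterface_method` naming for reflection clients can be illustrated with a self-contained sketch. `WrappedIOStub` here is a hypothetical stand-in for the real `org.apache.hadoop.io.wrappedio.WrappedIO`: a caller can locate the operation by its flat name without any shared interface on the classpath.

```java
import java.lang.reflect.Method;

// Why "bulkDelete_pageSize" style names help reflection-based callers:
// the flat name encodes both the interface and the operation, so lookup
// needs only the method name, not shared types. WrappedIOStub is a
// stand-in for the real WrappedIO class.
public class ReflectionNamingSketch {

    public static class WrappedIOStub {
        public static int bulkDelete_pageSize(Object fs, String path) {
            return 1;   // the default implementation's page size
        }
    }

    public static void main(String[] args) throws Exception {
        Method m = WrappedIOStub.class.getMethod(
            "bulkDelete_pageSize", Object.class, String.class);
        int pageSize = (int) m.invoke(null, null, "/tmp");
        if (pageSize != 1) throw new AssertionError("expected page size 1");
        System.out.println("bulkDelete_pageSize -> " + pageSize);
    }
}
```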
 * @throws IllegalArgumentException path not valid.
 * @throws IOException problems resolving paths
 */
public static int bulkDeletePageSize(FileSystem fs, Path path) throws IOException {
rename bulkDelete_pageSize
 * @throws IOException IO problems including networking, authentication and more.
 * @throws IllegalArgumentException if a path argument is invalid.
 */
public static List<Map.Entry<Path, String>> bulkDelete(FileSystem fs,
rename bulkDelete_delete
💔 -1 overall
This message was automatically generated.
Mukund, if you can do those naming changes then I'm +1.
+1. I was about to merge, then I realised that Yetus wasn't ready. Here is my draft commit message:
Applications can create a BulkDelete implementation from a BulkDeleteSource. This is optimized for object stores with bulk delete APIs; even with a page size of 1, the S3A implementation is more efficient than delete(path). The interface BulkDeleteSource is implemented by all FileSystem implementations. To aid use through reflection APIs, the class org.apache.hadoop.io.wrappedio.WrappedIO offers reflection-friendly methods. Contributed by Mukund Thakur and Steve Loughran
Merged to trunk. Mukund, can you do the backport to branch-3.4.1, while we think about what to do for 3.3.9? Speaking of which, we should think about that...
FYI, #6686 adds dynamic loading for the wrapped methods, and calls them through the tests.
While backporting this to branch-3.4 I see this failure. Will check if this is happening on trunk as well.
Yeah, just seen those too. The mocked S3 client isn't getting the delete-one-object calls via the store object. I've got a small PR up to help debug it, which is not a fix.
The problem is that the store is null in the innermost S3AFS, which triggers an NPE in deleteObject() before the AWS client has its delete operation called, so the list of deleted paths is not updated. The easiest immediate fix is to mock deleteObject(); longer term we should actually be stubbing the entire store, as that's what the interface/impl split is meant to assist.
Fix this with Mockito magic somehow: https://github.com/apache/hadoop/pull/6843/files
I tried to do it yesterday too. I think my solution would have pulled createStore() out into a method and overridden it, or added an S3AInternals.setStore() call.
Update: #5081 does successfully show the staging UT failure (good!), but also some failures in hadoop-common due to the new interface... we need to modify the tests to indicate that it is safe not to override the base implementation.
Applications can create a BulkDelete instance from a BulkDeleteSource; the BulkDelete interface provides the pageSize(): the maximum number of entries which can be deleted, and a bulkDelete(Collection paths) method which can take a collection up to pageSize() long.

This is optimized for object stores with bulk delete APIs; the S3A connector will offer the page size of fs.s3a.bulk.delete.page.size unless bulk delete has been disabled.

Even with a page size of 1, the S3A implementation is more efficient than delete(path) as there are no safety checks for the path being a directory or probes for the need to recreate directories.

The interface BulkDeleteSource is implemented by all FileSystem implementations, with a page size of 1 and mapped to delete(pathToDelete, false). This means that callers do not need to have special case handling for object stores versus classic filesystems.

To aid use through reflection APIs, the class org.apache.hadoop.io.wrappedio.WrappedIO has been created with "reflection friendly" methods.

Contributed by Mukund Thakur and Steve Loughran

Conflicts:
	hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
	hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
Adding tests on top of #6494