Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement parallel operations #2067

Merged

Conversation

ddelgrosso1
Copy link
Contributor

@ddelgrosso1 ddelgrosso1 commented Sep 15, 2022

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #1868 🦕
Fixes #2051

@product-auto-label product-auto-label bot added size: l Pull request size is large. api: storage Issues related to the googleapis/nodejs-storage API. labels Sep 15, 2022
@ddelgrosso1
Copy link
Contributor Author

Opening as a draft PR while I add in JSDoc comments so that I can start gathering feedback / initial thoughts.

@ddelgrosso1 ddelgrosso1 changed the base branch from main to transfer-manager September 16, 2022 14:46
@ddelgrosso1
Copy link
Contributor Author

ddelgrosso1 commented Sep 16, 2022

Repointing this at a new branch named transfer-manager. My plan is to release this as a "preview" from that branch. Since this is a new feature, keeping it in sync with main should be relatively headache free until such a point that we deem the feature stable.

Copy link

@danielduhh danielduhh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple passthrough comments/questions :)

src/transfer-manager.ts Show resolved Hide resolved
src/transfer-manager.ts Outdated Show resolved Hide resolved
src/transfer-manager.ts Show resolved Hide resolved
@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. and removed size: l Pull request size is large. labels Sep 30, 2022
@ddelgrosso1 ddelgrosso1 changed the base branch from transfer-manager to main September 30, 2022 17:50
@ddelgrosso1 ddelgrosso1 changed the base branch from main to transfer-manager September 30, 2022 17:51
src/transfer-manager.ts Show resolved Hide resolved
src/transfer-manager.ts Outdated Show resolved Hide resolved
src/transfer-manager.ts Show resolved Hide resolved
src/transfer-manager.ts Outdated Show resolved Hide resolved
test/transfer-manager.ts Show resolved Hide resolved
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: xl Pull request size is extra large. labels Oct 24, 2022
@ddelgrosso1 ddelgrosso1 added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Oct 24, 2022
@ddelgrosso1 ddelgrosso1 marked this pull request as ready for review October 24, 2022 15:26
@ddelgrosso1 ddelgrosso1 requested review from a team as code owners October 24, 2022 15:27
@ddelgrosso1 ddelgrosso1 changed the base branch from transfer-manager to main October 24, 2022 15:27
@ddelgrosso1 ddelgrosso1 changed the base branch from main to transfer-manager October 24, 2022 15:27
@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. and removed size: l Pull request size is large. labels Nov 29, 2022
@ddelgrosso1 ddelgrosso1 removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Nov 29, 2022
Copy link
Contributor

@shaffeeullah shaffeeullah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments, but generally looks good. Thanks, Denis!!

src/transfer-manager.ts Show resolved Hide resolved
src/transfer-manager.ts Outdated Show resolved Hide resolved
* ```
* @experimental
*/
async downloadManyFiles(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do other lanauges have this as a list of objects? or is a list of strings also an option?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python does it as a list of names. If desired we could move to that as well. Thoughts?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two use cases are

  1. download "directory"
  2. download list of files

Not sure which is more common (question for @domZippilli) but it would be great to support both (or show users how to do both)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've modified the signature to now accept an array of files, an array of strings, or a single string. If a single string is provided it will be treated as a GCS prefix to call getFiles. Utilizing this prefix will allow for an entire folder structure to be downloaded.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably the right idea. It sounds like an idiomatic variation on what I wrote before I saw @ddelgrosso1's reply.

A directory is probably the most common set of files people upload. But of course you'd want to think about this as a list of files to be extensible. To me, the logical progression is something like "I want to upload a directory" -> "I want to form a glob of file names" -> "I want to upload a list of files". So solving the last problem lets you frame the others as variations on it. For example, make a convenience method that accepts a directory name, and just constructs the list of files to pipe into the actual implementation.

assert.strictEqual(count, paths.length);
});

it('sets ifGenerationMatch to 0 if skipIfExists is set', async () => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a test for skipIfExists set to true, the file exists, and it errors? We should verify that it doesnt stop the execution of the code and just moves onto the next file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to best handle this as during a unit test since we aren't calling the actual service.

src/transfer-manager.ts Outdated Show resolved Hide resolved
* ```
* @experimental
*/
async downloadManyFiles(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two use cases are

  1. download "directory"
  2. download list of files

Not sure which is more common (question for @domZippilli) but it would be great to support both (or show users how to do both)

@snippet-bot
Copy link

snippet-bot bot commented Dec 1, 2022

Here is the summary of changes.

You are about to add 3 region tags.

This comment is generated by snippet-bot.
If you find problems with this result, please file an issue at:
https://github.com/googleapis/repo-automation-bots/issues.
To update this comment, add snippet-bot:force-run label or use the checkbox below:

  • Refresh this comment

Copy link
Member

@danielbankhead danielbankhead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, a few thoughts/comments. Gonna play with it some more tomorrow.

* limitations under the License.
*/

// eslint-disable-next-line node/no-unsupported-features/node-builtins
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment still needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this file requires worker_theads and not all features were available in the initial release of v12. If I remove this the CI pipeline starts to complain if I remember correctly.

* @experimental
*/
async uploadManyFiles(
filePaths: string[],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Optional] We have an opportunity to get fancy here to really emphasize throughput; what if we allowed iterators and async iterators?

Use cases:

  • walking large directory you want to walk asynchronously (reduces memory usage)
  • situations where one may parse a list of files in chunks and want to chain operations together (such as an HTTP server receiving incoming data)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably wait until customers ask for this functionality though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a good idea, maybe after the initial release we can circle up. I would definitely be interested in your thoughts on optimizations here.

Comment on lines +293 to +296
async downloadFileInChunks(
fileOrName: File | string,
options: DownloadFileInChunksOptions = {}
): Promise<void | DownloadResponse> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts on data validation with this method? Since file.download({start, end}) skips validation, we may want to add support or mention it in the description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question... I will definitely add a callout in the comments that this will skip validation. Do we currently have a means of validating a file after it has been downloaded? I know you did a bunch of work in this area but I thought it only worked with streamed input?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's technically possible to CRC32C chunks and then "sum" them into the final checksum, such that you could do this kind of chunked operation and still checksum without an additional byte scan. If you need MD5 or can't do the previous algorithm, I think validation would require a full read of the completed file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we currently have a means of validating a file after it has been downloaded?

Yep (although we may be able to do this as we're receiving each chunk):

static async fromFile(file: PathLike) {

I know you did a bunch of work in this area but I thought it only worked with streamed input?

It's possible to use CRC32C with/out streams (HashStreamValidator is the class that wraps CRC32C for streams):

class CRC32C implements CRC32CValidator {

class HashStreamValidator extends Transform {
readonly crc32cEnabled: boolean;
readonly md5Enabled: boolean;
#crc32cHash?: CRC32CValidator = undefined;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we comfortable with only allowing CRC32C. In my mind something is better than nothing... I can take a look at implementing it if we feel strongly this should be included.

Comment on lines +318 to +332
const fileToWrite = await fsp.open(filePath, 'w+');
while (start < size) {
const chunkStart = start;
let chunkEnd = start + chunkSize - 1;
chunkEnd = chunkEnd > size ? size : chunkEnd;
promises.push(
limit(() =>
file.download({start: chunkStart, end: chunkEnd}).then(resp => {
return fileToWrite.write(resp[0], 0, resp[0].length, chunkStart);
})
)
);

start += chunkSize;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's stopping the chunks from being written out of order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I did this correctly the position in the file of where to write this chunk is handled by the chunkStart parameter. Basically this call should write the incoming chunk with no offset in the chunk buffer to the file offset specified by chunkStart. Was your understanding different?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a slightly different understanding here (I could be wrong), but would it be possible for subsequent chunks to download before proceeding chunks and write to the file first?

Say we have a 65MB file and a 64MB chunk size (making 2 chunks to download). Since the 2nd chunk is 1MB and would download first, wouldn't it write to the file before the first chunk?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't matter if they are written in order. This implementation is leaning on a filesystem supporting sparse files. So, imagine you wrote these back-to-front; the first chunk will seek to its chunkStart near the end of the file, and the OS will logically pad that first byte with 0s up to the seek point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^that's the missing piece for me, thanks Dom!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Dom outlined this was the behavior I was intending. It shouldn't matter which chunk downloads first they should get written to their correct place in the resultant file. Ultimately if all succeed, the file should be correctly "reassembled".

@ddelgrosso1
Copy link
Contributor Author

Since this is getting large and most feedback has been addressed going to go ahead and merge this into the staging branch. We can re-review before the final merge to main.

@ddelgrosso1 ddelgrosso1 merged commit 3c11ea2 into googleapis:transfer-manager Dec 5, 2022
ddelgrosso1 added a commit to ddelgrosso1/nodejs-storage that referenced this pull request Dec 5, 2022
* feat: implement parallel operations

* add more parallel operations

* add header to test file

* update import of fs/promises

* fix pathing on windows, fix mocking of fs promises

* add jsdoc headers to class and uploadMulti

* add jsdoc comments to remaining functions

* update comment wording

* add experimental jsdoc tags

* feat: add directory generator to performance test framework

* clarify variable names and comments

* capitalization

* wip: transfer manager performance tests

* feat: merged in application performance tests (googleapis#2100)

* fix: fixed many bugs (googleapis#2102)

* fix: cleaning up bugs

* fix: fixed many bugs

* fix: more work on transfer manager perf metrics (googleapis#2103)

* fix: more work on transfer manager perf metrics

* fix unit tests for tm

* fix: performance test refactoring, comments (googleapis#2104)

* refactor: refactor constants (googleapis#2105)

* refactor: refactor constants

* bug fixes

* bug fixes

* bug fixes

* linter fixes, download to disk for performance test

* rename transfer manager functions

* remove callbacks from transfer manager

* add more experimental tags, update comments

* change signature of downloadManyFiles to accept array of strings or a folder name

* linter fix

* add transfer manager samples and samples tests

Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com>
ddelgrosso1 added a commit to ddelgrosso1/nodejs-storage that referenced this pull request Dec 5, 2022
* feat: implement parallel operations

* add more parallel operations

* add header to test file

* update import of fs/promises

* fix pathing on windows, fix mocking of fs promises

* add jsdoc headers to class and uploadMulti

* add jsdoc comments to remaining functions

* update comment wording

* add experimental jsdoc tags

* feat: add directory generator to performance test framework

* clarify variable names and comments

* capitalization

* wip: transfer manager performance tests

* feat: merged in application performance tests (googleapis#2100)

* fix: fixed many bugs (googleapis#2102)

* fix: cleaning up bugs

* fix: fixed many bugs

* fix: more work on transfer manager perf metrics (googleapis#2103)

* fix: more work on transfer manager perf metrics

* fix unit tests for tm

* fix: performance test refactoring, comments (googleapis#2104)

* refactor: refactor constants (googleapis#2105)

* refactor: refactor constants

* bug fixes

* bug fixes

* bug fixes

* linter fixes, download to disk for performance test

* rename transfer manager functions

* remove callbacks from transfer manager

* add more experimental tags, update comments

* change signature of downloadManyFiles to accept array of strings or a folder name

* linter fix

* add transfer manager samples and samples tests

Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com>
ddelgrosso1 added a commit that referenced this pull request Dec 5, 2022
* feat: implement parallel operations

* add more parallel operations

* add header to test file

* update import of fs/promises

* fix pathing on windows, fix mocking of fs promises

* add jsdoc headers to class and uploadMulti

* add jsdoc comments to remaining functions

* update comment wording

* add experimental jsdoc tags

* feat: add directory generator to performance test framework

* clarify variable names and comments

* capitalization

* wip: transfer manager performance tests

* feat: merged in application performance tests (#2100)

* fix: fixed many bugs (#2102)

* fix: cleaning up bugs

* fix: fixed many bugs

* fix: more work on transfer manager perf metrics (#2103)

* fix: more work on transfer manager perf metrics

* fix unit tests for tm

* fix: performance test refactoring, comments (#2104)

* refactor: refactor constants (#2105)

* refactor: refactor constants

* bug fixes

* bug fixes

* bug fixes

* linter fixes, download to disk for performance test

* rename transfer manager functions

* remove callbacks from transfer manager

* add more experimental tags, update comments

* change signature of downloadManyFiles to accept array of strings or a folder name

* linter fix

* add transfer manager samples and samples tests

Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com>
ddelgrosso1 added a commit that referenced this pull request Dec 7, 2022
* feat: implement parallel operations (#2067)

* feat: implement parallel operations

* add more parallel operations

* add header to test file

* update import of fs/promises

* fix pathing on windows, fix mocking of fs promises

* add jsdoc headers to class and uploadMulti

* add jsdoc comments to remaining functions

* update comment wording

* add experimental jsdoc tags

* feat: add directory generator to performance test framework

* clarify variable names and comments

* capitalization

* wip: transfer manager performance tests

* feat: merged in application performance tests (#2100)

* fix: fixed many bugs (#2102)

* fix: cleaning up bugs

* fix: fixed many bugs

* fix: more work on transfer manager perf metrics (#2103)

* fix: more work on transfer manager perf metrics

* fix unit tests for tm

* fix: performance test refactoring, comments (#2104)

* refactor: refactor constants (#2105)

* refactor: refactor constants

* bug fixes

* bug fixes

* bug fixes

* linter fixes, download to disk for performance test

* rename transfer manager functions

* remove callbacks from transfer manager

* add more experimental tags, update comments

* change signature of downloadManyFiles to accept array of strings or a folder name

* linter fix

* add transfer manager samples and samples tests

Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com>

* 🦉 Updates from OwlBot post-processor

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

* feat: add crc32c validation option to chunked download, cleanup naming (#2110)

* feat: add crc32c validation option to chunked download, cleanup naming in perf tests

* close file in finally block

Co-authored-by: Sameena Shaffeeullah <shaffeeullah@google.com>
Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: storage Issues related to the googleapis/nodejs-storage API. size: xl Pull request size is extra large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parallel Operations
7 participants