
lang/funcs: File hashing functions stream data from disk #28681

Merged
merged 1 commit into main on May 12, 2021

Conversation

apparentlymart
Contributor

Previously our file hashing functions were backed by the same "read file into memory" function we use for situations like "file" and "templatefile", meaning that they'd read the entire file into memory first and then calculate the hash from that buffer.

All of the hash implementations we use here can calculate hashes from a sequence of smaller buffer writes though, so there's no actual need for us to create a file-sized temporary buffer here.

This, then, is a small refactoring of our underlying function into two parts, where one is responsible for deciding the actual filename to load and opening it, and the other is responsible for buffering the file into memory. Our hashing functions can then use only the first function and skip the second.

This then allows us to use io.Copy to stream from the file into the hashing function in smaller chunks, possibly of a size chosen by the hash function if it happens to implement io.ReaderFrom.

The new implementation is functionally equivalent to the old but should use less temporary memory if the user passes a large file to one of the hashing functions.

This might help with #28678, but we've not root-caused that yet. Either way, this seems like a reasonable optimization to implement.

@alisdair
Contributor

Neat fix. I didn't realize that this is how io.Copy works, that's fascinating.

@apparentlymart apparentlymart merged commit 70fed23 into main May 12, 2021
@apparentlymart
Contributor Author

Backported to v0.15 as 7843c1a

@davidcallen

@apparentlymart - Thanks for this fix :) Solves the issue I raised in #23890.

@github-actions
Contributor

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 20, 2021