
lang/funcs: File hashing functions stream data from disk #28681

Merged
merged 1 commit into main on May 12, 2021

Conversation

apparentlymart
Contributor

Previously our file hashing functions were backed by the same "read file into memory" function we use for situations like "file" and "templatefile", meaning that they'd read the entire file into memory first and then calculate the hash from that buffer.

All of the hash implementations we use here can calculate hashes from a sequence of smaller buffer writes though, so there's no actual need for us to create a file-sized temporary buffer here.

This, then, is a small refactoring of our underlying function into two parts, where one is responsible for deciding the actual filename to load and opening it, and the other is responsible for buffering the file into memory. Our hashing functions can then use only the first function and skip the second.

This then allows us to use io.Copy to stream from the file into the hashing function in smaller chunks, possibly of a size chosen by the hash function if it happens to implement io.ReaderFrom.

The new implementation is functionally equivalent to the old but should use less temporary memory if the user passes a large file to one of the hashing functions.

This might help with #28678, but we've not root-caused that yet. Either way, this seems like a reasonable optimization to implement.

@alisdair
Contributor

Neat fix. I didn't realize that this is how io.Copy works, that's fascinating.

@apparentlymart apparentlymart merged commit 70fed23 into main May 12, 2021
@apparentlymart
Contributor Author

Backported to v0.15 as 7843c1a

@davidcallen

@apparentlymart - Thanks for this fix :) Solves the issue I raised in #23890.

@github-actions
Contributor

I'm going to lock this pull request because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 20, 2021