Use size and mtime to determine if a file has changed rather than hash #460

Open · wants to merge 2 commits into master
Conversation

@FraserThompson (Contributor) commented Aug 28, 2023

Changes:

Fixes #459 and fixes #59

  • Instead of MD5-hashing the local file and comparing it to the etag from S3 to determine whether a file has changed and needs re-uploading, just compare the size and the modification time (see the sketch after this list).
  • Skipped files are logged to the console, so it's easier to tell that something is happening during deploys of large sites.
  • The number of uploaded files is counted and reported at the end, so it's easier to tell whether any changes were uploaded.
  • Minor refactoring of the upload loop for consistency and readability.

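As a rough illustration, here is a minimal sketch of what a loop like this could look like, assuming the remote listing comes from S3's ListObjectsV2 (which exposes Size and LastModified per key). All names and shapes here are illustrative, not the plugin's actual code:

```typescript
import { statSync } from "fs";

// Illustrative shapes; the real plugin's types will differ.
interface LocalFile {
  key: string;  // S3 object key
  path: string; // path on disk
}
interface RemoteObject {
  Size: number;        // bytes, as reported by ListObjectsV2
  LastModified: Date;  // timestamp, as reported by ListObjectsV2
}

async function syncFiles(
  localFiles: LocalFile[],
  remoteObjects: Map<string, RemoteObject>,
  upload: (file: LocalFile) => Promise<void>,
): Promise<void> {
  let uploaded = 0;
  for (const file of localFiles) {
    const remote = remoteObjects.get(file.key);
    const stats = statSync(file.path);
    // Treat the file as unchanged if the sizes match and the local copy
    // is not newer than the remote one (the same test `aws s3 sync` uses).
    if (
      remote &&
      stats.size === remote.Size &&
      stats.mtime.getTime() <= remote.LastModified.getTime()
    ) {
      console.log(`Skipping (unchanged): ${file.key}`);
      continue;
    }
    await upload(file);
    uploaded++;
  }
  console.log(`Uploaded ${uploaded} file(s).`);
}
```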
Explanation:

Comparing etags to the MD5 hash of each file is, imo, an inferior method for these reasons:

  1. MD5-hashing files (especially big ones) takes a while, so if you have a site with a lot of large files, or you're on a particularly weak computer, your deploys will be slow.
  2. Etags for files uploaded to S3 via multipart upload (i.e. any file over 16MB by default) are NOT simply the MD5 hash of the file, which means files over 16MB will always be re-uploaded (#59: Large files will always be re-uploaded). It's possible to calculate what the etag will be for a multipart upload, but it's a bit fiddly; see the sketch after this list.

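For reference, that "fiddly" calculation looks roughly like the following: S3's multipart etag is the MD5 of the concatenated per-part MD5 digests, suffixed with the part count, so you also have to know the exact part size the uploader used. This is a sketch under that assumption, not the plugin's code:

```typescript
import { createHash } from "crypto";
import { closeSync, openSync, readSync } from "fs";

// Computes the etag S3 would assign to a multipart upload of `path`,
// assuming it was split into `partSize`-byte parts. If the part-size
// guess is wrong, the result won't match the etag S3 reports.
function multipartEtag(path: string, partSize = 16 * 1024 * 1024): string {
  const fd = openSync(path, "r");
  const partDigests: Buffer[] = [];
  const buf = Buffer.alloc(partSize);
  try {
    let bytesRead: number;
    while ((bytesRead = readSync(fd, buf, 0, partSize, null)) > 0) {
      partDigests.push(
        createHash("md5").update(buf.subarray(0, bytesRead)).digest(),
      );
    }
  } finally {
    closeSync(fd);
  }
  // Hex MD5 of the concatenated part digests, plus "-<part count>".
  const combined = createHash("md5")
    .update(Buffer.concat(partDigests))
    .digest("hex");
  return `${combined}-${partDigests.length}`;
}
```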
Instead, comparing the size and the modification time is robust enough that I think it's fine as the default method. It's the method the AWS CLI's own `sync` command uses, so if it's good enough there I reckon it's good enough here.

Successfully merging this pull request may close these issues:

  • Hashing each file means large sites sync very slowly (#459)
  • Large files will always be re-uploaded (#59)