Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement opaque ETag in Compact Index to avoid falling back to old index in servers with different etag implementations #7122

Merged
merged 1 commit into from
Nov 27, 2023

Conversation

martinemde
Copy link
Member

@martinemde martinemde commented Oct 29, 2023

What was the end-user or developer problem that led to this PR?

Building on @simi's branch for opaque etags (related to #6632) and merging in my work with DigestIO in the CompactIndex client at #7039 (closed).

The problem that originally surfaced here was when other compact index sources did not calculate their etags the same way as rubygems.org and S3. This would result in the compact index never doing incremental updates, which is much slower. As a solution we decided to use the Digest, and then the Repr-Digest header and allow etags to be opaque.

What is your fix for the problem, implemented in this PR?

Change the compact index client to verify compact index file downloads using the Repr-Digest or Digest header. Uses sha-256 right now but opens up the possibility to support any algorithm going forward. This is a change from assuming that the etag is an MD5 sum of the full file body and using it to verify the content.

The compact index client now stores etags in a separate file, allowing them to be completely opaque, as intended. If an etag file does not exist, we fallback to generating the md5 of the file on disk and saving that to the etag file. After the first pass, the etag will either be validated or replaced, resuming normal operation of the compact index client.

The Repr-Digest header to conform to Structured Fields in HTTP Headers RFC 8941, but we also read Digest which comes from Fastly/S3 for all compact index files and is (incorrectly if you're picky) wrapped in double quotes. Once the changes are deployed to fastly, this client will read from Repr-Digest preferentially.

Make sure the following tasks are checked

@martinemde martinemde force-pushed the martinemde/digestio-opaque-etag branch 3 times, most recently from cd0b0a2 to a823684 Compare October 29, 2023 22:58
bundler/lib/bundler/compact_index_client/cache.rb Outdated Show resolved Hide resolved
@@ -193,6 +193,21 @@ def digest(name)
Digest(name)
end

def checksum_for_file(path, digest)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this just take in an array of digest symbols and then return a hash?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only use it in one place only for the MD5 of the version info files for the compact index. The server side of the compact index is hard coded to MD5 as well. The most correct thing is probably to move this completely into the compact index client and only generalize if we need to later.

@@ -76,6 +84,16 @@ def info_path(name)
end
end

def info_etag_path(name)
name = name.to_s
if /[^a-z0-9_-]/.match?(name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so... question. since this is new, might it be easier to only do the #{name}-#{digest} file name, and then only need to maintain a single directory of etags?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this logic? @simi?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to keep it the same as other data we store

if /[^a-z0-9_-]/.match?(name)
. Not every gem name is unique when transformed into file name, this is to bandaid this AFAIK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So then is @segiddins suggesting that we use MD5 of every name instead of having both? Could we just do the MD5 of the name without the prefix?

bundler/lib/bundler/compact_index_client.rb Outdated Show resolved Hide resolved
bundler/lib/bundler/compact_index_client/updater.rb Outdated Show resolved Hide resolved

headers["If-None-Match"] = etag_for(local_temp_path)
headers["If-None-Match"] = local_etag_temp_path.read
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we strip! here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strip! and chomp! both return nil if there's nothing to strip or chomp, which makes this pretty ugly:

etag = local_etag_path.read
etag.chomp!
headers["If-None-Match"] = etag

this is one line but the to_proc might negate the benefit of using chomp! over chomp:

headers["If-None-Match"] = local_etag_path.read.tap(&:chomp!)

local_temp_path = local_temp_path.sub(/$/, ".tmp")

headers["Want-Content-Digest"] = WANT_DIGEST
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing Want-Content-Digest listed at all on MDN, and it says Want-Digest is deprecated https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Want-Digest

Copy link
Member Author

@martinemde martinemde Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only a draft 🤷‍♂️ though, to be fair the Content-Digest header seems to come from the same draft.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could totally skip this header. It's just future proofing.

# Unwrap surrounding colons (byte sequence) or quotes (string).
# The wrapping characters must be matched or we return nil.
# Converts hex to base64 based on field type assumptions.
def parse_structured_value(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this definitely needs some tests

bundler/lib/bundler/compact_index_client/updater.rb Outdated Show resolved Hide resolved
@martinemde martinemde force-pushed the martinemde/digestio-opaque-etag branch from 94e0179 to 14a0b4e Compare November 2, 2023 05:09
@martinemde
Copy link
Member Author

martinemde commented Nov 2, 2023

Hey @segiddins & @simi, I refactored this pretty significantly. I mostly replaced the giant method with two smaller methods. I also cut down on the pending temporary files, clearly separated what was called "retrying" logic, and removed the nested call to update which made the file management weird.

The behavior changes a little, but I think for the best: In the case where a range request is ignored or not available, it no longer attempts the full request twice (it was requesting a range 0- which is the same as a full request, then it would do the same thing again, whereas the intention, I think, was to try a partial request, then try the full request).

@martinemde martinemde force-pushed the martinemde/digestio-opaque-etag branch 2 times, most recently from 613b8ef to 06b95be Compare November 2, 2023 05:31
@simi
Copy link
Member

simi commented Nov 2, 2023

@martinemde this is on my list, I'll do some local testing first.

@martinemde
Copy link
Member Author

I'm still working on fixing and adding tests. I think the mocked ones are a good way to test the more subtle behavior.

Copy link
Member Author

@martinemde martinemde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicks of my own code that I want to remember to fix.

@martinemde martinemde force-pushed the martinemde/digestio-opaque-etag branch 2 times, most recently from c17960b to 7ca3f7b Compare November 4, 2023 19:50
@martinemde martinemde mentioned this pull request Nov 8, 2023
4 tasks
@martinemde martinemde changed the title Opaque ETag using DigestIO to calculate checksums Opaque ETag in Compact Index Nov 13, 2023
@martinemde martinemde marked this pull request as ready for review November 26, 2023 22:58
@martinemde martinemde force-pushed the martinemde/digestio-opaque-etag branch from 800ecd4 to 5298b97 Compare November 26, 2023 22:58
@martinemde martinemde force-pushed the martinemde/digestio-opaque-etag branch from 5298b97 to 9fdf6fe Compare November 26, 2023 23:00
Copy link
Member

@segiddins segiddins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! took me a while to read through the changes, but I think this is both good future-proofing, as well as fully backwards compatible. well done!

bundler/lib/bundler/compact_index_client/cache_file.rb Outdated Show resolved Hide resolved
This changes the CompactIndexClient to store etags received from the
compact index in separate files rather than relying on the MD5 checksum
of the file as the etag.

Smoothes the upgrade from md5 etags to opaque by generating them when no
etag file exists. This should reduce the initial impact of changing the
caching behavior by reducing cache misses when the MD5 etag is the same.

Eventually, the MD5 behavior should be retired and the etag should be
considered completely opaque with no assumption that MD5 would match.
@martinemde martinemde merged commit 0339d53 into master Nov 27, 2023
66 checks passed
@martinemde martinemde deleted the martinemde/digestio-opaque-etag branch November 27, 2023 01:27
@@ -5,7 +5,13 @@

module Bundler
class CompactIndexClient
# NOTE: MD5 is here not because we expect a server to respond with it, but
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be TODO instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I mentioned this temporary md5 thing in a bunch of places so maybe we'll remember to get it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean I do rarely scan codebase for NOTE, but I do often for TODO. I'll try to keep this in my mind as well for now as a candidate to update at some point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I don't usually leave TODO in PRs, but that's a good point if you actually search for it. We could update.

@@ -76,8 +81,19 @@ def info_path(name)
end
end

def info_etag_path(name)
name = name.to_s
info_etag_root.join("#{name}-#{SharedHelpers.digest(:MD5).hexdigest(name).downcase}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ love this doesn't follow need for 2 directories (one with special characters) in the end and unifies it.

@simi
Copy link
Member

simi commented Nov 27, 2023

Thanks a lot for taking over @martinemde. Well done 💪.

@martinemde
Copy link
Member Author

@simi and @segiddins thanks for reviewing and supporting this, and especially thanks @simi for jumping on a call to go over this code. Now let's just hope it rolls out smoothly.

@deivid-rodriguez
Copy link
Member

This is awesome, thanks for all the work here to not just quickly patch this and move on, but to fix it the right way!

@deivid-rodriguez deivid-rodriguez changed the title Opaque ETag in Compact Index Implement opaque ETag in Compact Index to avoid falling back to old index in servers with different etag implementations Nov 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants