-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Read repositories with sha256 object format #11707
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks sensible (quick review on a mobile)
I think that everywhere |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm not wrong, I think L19 could be removed.
@@ -244,19 +326,31 @@ private static bool IsValidCharacters(string s) | |||
private readonly ulong _i1; | |||
private readonly ulong _i2; | |||
private readonly uint _i3; | |||
private readonly ulong _i4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ObjectId is critical for performance. Even more so when sha256 is used I assume, to be enabled for big repos.
sha1 is stored as 16+16+8 (which was maybe 150ms faster than 516 opening 100K commits in Linux if I remember correctly).
sha256 probably would benefit from 416 rather than 16+16+8+16+8. (Is it easier for the compiler to optimize 4*16 too?)
We have optimized individual if statements to optimize loading, this change could me measurable (as well as changes in RevisionReader).
It seems like a repo can contain both sha1/sha256 hashes. Maybe a sha256 can be handled as 40 bit.But upcoming Git has changes in the upcoming release:
https://www.phoronix.com/news/Git-2.45-rc0-Released
So maybe not a quick merge...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we use subtypes here for storage. i.e. make ObjectId
abstract and have different implementations Sha1Id
, Sha256Id
etc.
If we ever need a third implementation, one can imagine it making the approach taken here even more challenging to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will look into subtypes and performance benchmarks for different implementations.
It seems like a repo can contain both sha1/sha256 hashes. Maybe a sha256 can be handled as 40 bit.But upcoming Git has changes in the upcoming release: https://www.phoronix.com/news/Git-2.45-rc0-Released
So maybe not a quick merge...
If I understand https://git-scm.com/docs/hash-function-transition#_object_names_on_the_command_line correctly, then the hash written to the console (sha1/sha256) will always be the same during a mode of operation. It's not like some commits are sha1 and some are sha256 in one git log
output:
("dark launch") Treat object names input by the user as SHA-1 and convert any object names written to output to SHA-1, but store objects using SHA-256. This allows users to test the code with no visible behavior change except for performance. This allows running even tests that assume the SHA-1 hash function, to sanity-check the behavior of the new mode.
("early transition") Allow both SHA-1 and SHA-256 object names in input. Any object names written to output use SHA-1. This allows users to continue to make use of SHA-1 to communicate with peers (e.g. by email) that have not migrated yet and prepares for mode 3.
("late transition") Allow both SHA-1 and SHA-256 object names in input. Any object names written to output use SHA-256. In this mode, users are using a more secure object naming method by default. The disruption is minimal as long as most of their peers are in mode 2 or mode 3.
("post-transition") Treat object names input by the user as SHA-256 and write output using SHA-256. This is safer than mode 3 because there is less risk that input is incorrectly interpreted using the wrong hash function.
If I understand it correctly, IGitModule
would be the right place to call git rev-parse --show-object-format=output
once for a repository in order to get the length of all hashes of the repository?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the GitModule
ctor is the right place.
@@ -52,10 +52,10 @@ public sealed partial class GitModule : IGitModule | |||
// 366dfba1abf6cb98d2934455713f3d190df2ba34 refs/tags/2.51 | |||
// | |||
// Lines may also use \t as a column delimiter, such as output of "ls-remote --heads origin". | |||
[GeneratedRegex(@"^(?<objectid>[0-9a-f]{40})[ \t](?<refname>.+)$", RegexOptions.Multiline | RegexOptions.ExplicitCapture)] | |||
[GeneratedRegex(@"^(?<objectid>([0-9a-f]{40}|[0-9a-f]{64}))[ \t](?<refname>.+)$", RegexOptions.Multiline | RegexOptions.ExplicitCapture)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this would avoid backtracking and give better performance:
[GeneratedRegex(@"^(?<objectid>([0-9a-f]{40}|[0-9a-f]{64}))[ \t](?<refname>.+)$", RegexOptions.Multiline | RegexOptions.ExplicitCapture)] | |
[GeneratedRegex(@"^(?<objectid>[0-9a-f]{40}([0-9a-f]{24})?)[ \t](?<refname>.+)$", RegexOptions.Multiline | RegexOptions.ExplicitCapture)] |
/// </summary> | ||
/// <param name="length">The length of the buffer.</param> | ||
/// <returns>True if the buffer can be a valid hash; otherwise false. It must still be checked if the buffer contains a valid hash.</returns> | ||
public static bool IsValidLength(int length) => length == Sha1CharCount || length == Sha256CharCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps IsValidCharacterLength
, to be more specific about what is being counted.
@@ -66,7 +80,7 @@ public static ObjectId Random() | |||
[MustUseReturnValue] | |||
public static ObjectId Parse(string s) | |||
{ | |||
if (s?.Length is not Sha1CharCount || !TryParse(s.AsSpan(), out ObjectId id)) | |||
if ((s?.Length is not Sha1CharCount && s?.Length is not Sha256CharCount) || !TryParse(s.AsSpan(), out ObjectId id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merging the patterns might produce better code.
if ((s?.Length is not Sha1CharCount && s?.Length is not Sha256CharCount) || !TryParse(s.AsSpan(), out ObjectId id)) | |
if (s?.Length is not (Sha1CharCount or Sha256CharCount) || !TryParse(s.AsSpan(), out ObjectId id)) |
I'm continuing with this pull request. One more question while I separate ObjectId into two separate classes // In GitModule (ObjectIdHash is the property for the repo hash format in GitModule):
public ObjectId Parse(string s) => ObjectId.Parse(s, ObjectIdHash);
public bool TryParse(string? s, [NotNullWhen(returnValue: true)] out ObjectId? objectId) => ObjectId.TryParse(s, out objectId); I think the second approach would be cleaner (and faster because not every parse has to check for 2 formats), but it would also touch a lot of files. |
Fixes #5483
Proposed changes
Test methodology
Test environment(s)
Merge strategy
I agree that the maintainer squash merge this PR (if the commit message is clear).
✒️ I contribute this code under The Developer Certificate of Origin.