Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Read repositories with sha256 object format #11707

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Chri-s
Copy link

@Chri-s Chri-s commented Apr 24, 2024

Fixes #5483

Proposed changes

  • Work with repositories which use sha256 as object format

Test methodology

  • T.B.D. / manual for now

Test environment(s)

  • GIT version 2.43.0.windows.1
  • Windows 11 23H2

Merge strategy

I agree that the maintainer squash merge this PR (if the commit message is clear).


✒️ I contribute this code under The Developer Certificate of Origin.

@Chri-s Chri-s mentioned this pull request Apr 24, 2024
Copy link
Member

@RussKie RussKie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sensible (quick review on a mobile)

@pmiossec
Copy link
Member

I think that everywhere ObjectId.Sha1CharCount is used, the code should be adapted to use the repo hash char count.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not wrong, I think L19 could be removed.

@@ -244,19 +326,31 @@ private static bool IsValidCharacters(string s)
private readonly ulong _i1;
private readonly ulong _i2;
private readonly uint _i3;
private readonly ulong _i4;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ObjectId is critical for performance. Even more so when sha256 is used I assume, to be enabled for big repos.
sha1 is stored as 16+16+8 (which was maybe 150ms faster than 516 opening 100K commits in Linux if I remember correctly).
sha256 probably would benefit from 4
16 rather than 16+16+8+16+8. (Is it easier for the compiler to optimize 4*16 too?)
We have optimized individual if statements to optimize loading, this change could me measurable (as well as changes in RevisionReader).

It seems like a repo can contain both sha1/sha256 hashes. Maybe a sha256 can be handled as 40 bit.But upcoming Git has changes in the upcoming release:
https://www.phoronix.com/news/Git-2.45-rc0-Released

So maybe not a quick merge...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we use subtypes here for storage. i.e. make ObjectId abstract and have different implementations Sha1Id, Sha256Id etc.

If we ever need a third implementation, one can imagine it making the approach taken here even more challenging to follow.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will look into subtypes and performance benchmarks for different implementations.

It seems like a repo can contain both sha1/sha256 hashes. Maybe a sha256 can be handled as 40 bit.But upcoming Git has changes in the upcoming release: https://www.phoronix.com/news/Git-2.45-rc0-Released

So maybe not a quick merge...

If I understand https://git-scm.com/docs/hash-function-transition#_object_names_on_the_command_line correctly, then the hash written to the console (sha1/sha256) will always be the same during a mode of operation. It's not like some commits are sha1 and some are sha256 in one git log output:

("dark launch") Treat object names input by the user as SHA-1 and convert any object names written to output to SHA-1, but store objects using SHA-256. This allows users to test the code with no visible behavior change except for performance. This allows running even tests that assume the SHA-1 hash function, to sanity-check the behavior of the new mode.

("early transition") Allow both SHA-1 and SHA-256 object names in input. Any object names written to output use SHA-1. This allows users to continue to make use of SHA-1 to communicate with peers (e.g. by email) that have not migrated yet and prepares for mode 3.

("late transition") Allow both SHA-1 and SHA-256 object names in input. Any object names written to output use SHA-256. In this mode, users are using a more secure object naming method by default. The disruption is minimal as long as most of their peers are in mode 2 or mode 3.

("post-transition") Treat object names input by the user as SHA-256 and write output using SHA-256. This is safer than mode 3 because there is less risk that input is incorrectly interpreted using the wrong hash function.

If I understand it correctly, IGitModule would be the right place to call git rev-parse --show-object-format=output once for a repository in order to get the length of all hashes of the repository?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the GitModule ctor is the right place.

@@ -52,10 +52,10 @@ public sealed partial class GitModule : IGitModule
// 366dfba1abf6cb98d2934455713f3d190df2ba34 refs/tags/2.51
//
// Lines may also use \t as a column delimiter, such as output of "ls-remote --heads origin".
[GeneratedRegex(@"^(?<objectid>[0-9a-f]{40})[ \t](?<refname>.+)$", RegexOptions.Multiline | RegexOptions.ExplicitCapture)]
[GeneratedRegex(@"^(?<objectid>([0-9a-f]{40}|[0-9a-f]{64}))[ \t](?<refname>.+)$", RegexOptions.Multiline | RegexOptions.ExplicitCapture)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this would avoid backtracking and give better performance:

Suggested change
[GeneratedRegex(@"^(?<objectid>([0-9a-f]{40}|[0-9a-f]{64}))[ \t](?<refname>.+)$", RegexOptions.Multiline | RegexOptions.ExplicitCapture)]
[GeneratedRegex(@"^(?<objectid>[0-9a-f]{40}([0-9a-f]{24})?)[ \t](?<refname>.+)$", RegexOptions.Multiline | RegexOptions.ExplicitCapture)]

/// </summary>
/// <param name="length">The length of the buffer.</param>
/// <returns>True if the buffer can be a valid hash; otherwise false. It must still be checked if the buffer contains a valid hash.</returns>
public static bool IsValidLength(int length) => length == Sha1CharCount || length == Sha256CharCount;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps IsValidCharacterLength, to be more specific about what is being counted.

@@ -66,7 +80,7 @@ public static ObjectId Random()
[MustUseReturnValue]
public static ObjectId Parse(string s)
{
if (s?.Length is not Sha1CharCount || !TryParse(s.AsSpan(), out ObjectId id))
if ((s?.Length is not Sha1CharCount && s?.Length is not Sha256CharCount) || !TryParse(s.AsSpan(), out ObjectId id))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging the patterns might produce better code.

Suggested change
if ((s?.Length is not Sha1CharCount && s?.Length is not Sha256CharCount) || !TryParse(s.AsSpan(), out ObjectId id))
if (s?.Length is not (Sha1CharCount or Sha256CharCount) || !TryParse(s.AsSpan(), out ObjectId id))

@Chri-s
Copy link
Author

Chri-s commented Jun 2, 2024

I'm continuing with this pull request. One more question while I separate ObjectId into two separate classes Sha1ObjectId and Sha256ObjectId with an abstract base ObjectId: Should the static (Try)Parse functions in ObjectId accept both sha1 and sha256 strings and return the parsed Sha1ObjectId (or Sha256ObjectId)?
Or should all parse functions add a new argument specifying the required hash type? The required hash type would come from GitModule. Also, the Parse functions could be added to GitModule, which would call the Parse functions of ObjectId with the hash type already specified:

// In GitModule (ObjectIdHash is the property for the repo hash format in GitModule):
public ObjectId Parse(string s) => ObjectId.Parse(s, ObjectIdHash);

public bool TryParse(string? s, [NotNullWhen(returnValue: true)] out ObjectId? objectId) => ObjectId.TryParse(s, out objectId);

I think the second approach would be cleaner (and faster because not every parse has to check for 2 formats), but it would also touch a lot of files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SHA 256
6 participants