[WIP] Read repositories with sha256 object format #11707

Chri-s · 2024-04-24T23:10:31Z

Fixes #5483

Proposed changes

Work with repositories which use sha256 as object format

Test methodology

T.B.D. / manual for now

Test environment(s)

GIT version 2.43.0.windows.1
Windows 11 23H2

Merge strategy

I agree that the maintainer squash merge this PR (if the commit message is clear).

✒️ I contribute this code under The Developer Certificate of Origin.

RussKie

Looks sensible (quick review on a mobile)

pmiossec · 2024-04-25T09:19:09Z

I think that everywhere ObjectId.Sha1CharCount is used, the code should be adapted to use the repo hash char count.

pmiossec · 2024-04-25T09:24:13Z

Plugins/GitUIPluginInterfaces/ObjectId.cs

If I'm not wrong, I think L19 could be removed.

gerhardol · 2024-04-25T20:55:59Z

Plugins/GitUIPluginInterfaces/ObjectId.cs

@@ -244,19 +326,31 @@ private static bool IsValidCharacters(string s)
        private readonly ulong _i1;
        private readonly ulong _i2;
        private readonly uint _i3;
+        private readonly ulong _i4;


ObjectId is critical for performance. Even more so when sha256 is used I assume, to be enabled for big repos.
sha1 is stored as 16+16+8 (which was maybe 150ms faster than 516 opening 100K commits in Linux if I remember correctly).
sha256 probably would benefit from 416 rather than 16+16+8+16+8. (Is it easier for the compiler to optimize 4*16 too?)
We have optimized individual if statements to optimize loading, this change could me measurable (as well as changes in RevisionReader).

It seems like a repo can contain both sha1/sha256 hashes. Maybe a sha256 can be handled as 40 bit.But upcoming Git has changes in the upcoming release:
https://www.phoronix.com/news/Git-2.45-rc0-Released

So maybe not a quick merge...

I wonder if we use subtypes here for storage. i.e. make ObjectId abstract and have different implementations Sha1Id, Sha256Id etc.

If we ever need a third implementation, one can imagine it making the approach taken here even more challenging to follow.

I will look into subtypes and performance benchmarks for different implementations.

It seems like a repo can contain both sha1/sha256 hashes. Maybe a sha256 can be handled as 40 bit.But upcoming Git has changes in the upcoming release: https://www.phoronix.com/news/Git-2.45-rc0-Released

So maybe not a quick merge...

If I understand https://git-scm.com/docs/hash-function-transition#_object_names_on_the_command_line correctly, then the hash written to the console (sha1/sha256) will always be the same during a mode of operation. It's not like some commits are sha1 and some are sha256 in one git log output:

("dark launch") Treat object names input by the user as SHA-1 and convert any object names written to output to SHA-1, but store objects using SHA-256. This allows users to test the code with no visible behavior change except for performance. This allows running even tests that assume the SHA-1 hash function, to sanity-check the behavior of the new mode.

("early transition") Allow both SHA-1 and SHA-256 object names in input. Any object names written to output use SHA-1. This allows users to continue to make use of SHA-1 to communicate with peers (e.g. by email) that have not migrated yet and prepares for mode 3.

("late transition") Allow both SHA-1 and SHA-256 object names in input. Any object names written to output use SHA-256. In this mode, users are using a more secure object naming method by default. The disruption is minimal as long as most of their peers are in mode 2 or mode 3.

("post-transition") Treat object names input by the user as SHA-256 and write output using SHA-256. This is safer than mode 3 because there is less risk that input is incorrectly interpreted using the wrong hash function.

If I understand it correctly, IGitModule would be the right place to call git rev-parse --show-object-format=output once for a repository in order to get the length of all hashes of the repository?

Yes, the GitModule ctor is the right place.

drewnoakes · 2024-04-26T00:34:08Z

GitCommands/Git/GitModule.cs

@@ -52,10 +52,10 @@ public sealed partial class GitModule : IGitModule
        // 366dfba1abf6cb98d2934455713f3d190df2ba34 refs/tags/2.51
        //
        // Lines may also use \t as a column delimiter, such as output of "ls-remote --heads origin".
-        [GeneratedRegex(@"^(?<objectid>[0-9a-f]{40})[ \t](?<refname>.+)$", RegexOptions.Multiline | RegexOptions.ExplicitCapture)]
+        [GeneratedRegex(@"^(?<objectid>([0-9a-f]{40}|[0-9a-f]{64}))[ \t](?<refname>.+)$", RegexOptions.Multiline | RegexOptions.ExplicitCapture)]


I wonder if this would avoid backtracking and give better performance:

Suggested change

[GeneratedRegex(@"^(?<objectid>([0-9a-f]{40}|[0-9a-f]{64}))[ \t](?<refname>.+)$", RegexOptions.Multiline | RegexOptions.ExplicitCapture)]

[GeneratedRegex(@"^(?<objectid>[0-9a-f]{40}([0-9a-f]{24})?)[ \t](?<refname>.+)$", RegexOptions.Multiline | RegexOptions.ExplicitCapture)]

drewnoakes · 2024-04-26T00:35:44Z

Plugins/GitUIPluginInterfaces/ObjectId.cs

+        /// </summary>
+        /// <param name="length">The length of the buffer.</param>
+        /// <returns>True if the buffer can be a valid hash; otherwise false. It must still be checked if the buffer contains a valid hash.</returns>
+        public static bool IsValidLength(int length) => length == Sha1CharCount || length == Sha256CharCount;


Perhaps IsValidCharacterLength, to be more specific about what is being counted.

drewnoakes · 2024-04-26T00:37:51Z

Plugins/GitUIPluginInterfaces/ObjectId.cs

@@ -66,7 +80,7 @@ public static ObjectId Random()
        [MustUseReturnValue]
        public static ObjectId Parse(string s)
        {
-            if (s?.Length is not Sha1CharCount || !TryParse(s.AsSpan(), out ObjectId id))
+            if ((s?.Length is not Sha1CharCount && s?.Length is not Sha256CharCount) || !TryParse(s.AsSpan(), out ObjectId id))


Merging the patterns might produce better code.

Suggested change

if ((s?.Length is not Sha1CharCount && s?.Length is not Sha256CharCount) || !TryParse(s.AsSpan(), out ObjectId id))

if (s?.Length is not (Sha1CharCount or Sha256CharCount) || !TryParse(s.AsSpan(), out ObjectId id))

Chri-s · 2024-06-02T17:58:51Z

I'm continuing with this pull request. One more question while I separate ObjectId into two separate classes Sha1ObjectId and Sha256ObjectId with an abstract base ObjectId: Should the static (Try)Parse functions in ObjectId accept both sha1 and sha256 strings and return the parsed Sha1ObjectId (or Sha256ObjectId)?
Or should all parse functions add a new argument specifying the required hash type? The required hash type would come from GitModule. Also, the Parse functions could be added to GitModule, which would call the Parse functions of ObjectId with the hash type already specified:

// In GitModule (ObjectIdHash is the property for the repo hash format in GitModule):
public ObjectId Parse(string s) => ObjectId.Parse(s, ObjectIdHash);

public bool TryParse(string? s, [NotNullWhen(returnValue: true)] out ObjectId? objectId) => ObjectId.TryParse(s, out objectId);

I think the second approach would be cleaner (and faster because not every parse has to check for 2 formats), but it would also touch a lot of files.

Start reading repositories with sha256 object format

76af359

Chri-s mentioned this pull request Apr 24, 2024

SHA 256 #5483

Open

RussKie reviewed Apr 24, 2024

View reviewed changes

pmiossec reviewed Apr 25, 2024

View reviewed changes

Plugins/GitUIPluginInterfaces/ObjectId.cs

Copy link

Member

pmiossec Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not wrong, I think L19 could be removed.

gerhardol reviewed Apr 25, 2024

View reviewed changes

drewnoakes reviewed Apr 26, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Read repositories with sha256 object format #11707

[WIP] Read repositories with sha256 object format #11707

Chri-s commented Apr 24, 2024

RussKie left a comment

pmiossec commented Apr 25, 2024

pmiossec Apr 25, 2024

gerhardol Apr 25, 2024

drewnoakes Apr 26, 2024

Chri-s Apr 26, 2024

mstv Apr 26, 2024

drewnoakes Apr 26, 2024

drewnoakes Apr 26, 2024

drewnoakes Apr 26, 2024

Chri-s commented Jun 2, 2024

	[GeneratedRegex(@"^(?<objectid>([0-9a-f]{40}\|[0-9a-f]{64}))[ \t](?<refname>.+)$", RegexOptions.Multiline \| RegexOptions.ExplicitCapture)]
	[GeneratedRegex(@"^(?<objectid>[0-9a-f]{40}([0-9a-f]{24})?)[ \t](?<refname>.+)$", RegexOptions.Multiline \| RegexOptions.ExplicitCapture)]

	if ((s?.Length is not Sha1CharCount && s?.Length is not Sha256CharCount) \|\| !TryParse(s.AsSpan(), out ObjectId id))
	if (s?.Length is not (Sha1CharCount or Sha256CharCount) \|\| !TryParse(s.AsSpan(), out ObjectId id))

[WIP] Read repositories with sha256 object format #11707

Are you sure you want to change the base?

[WIP] Read repositories with sha256 object format #11707

Conversation

Chri-s commented Apr 24, 2024

Proposed changes

Test methodology

Test environment(s)

Merge strategy

RussKie left a comment

Choose a reason for hiding this comment

pmiossec commented Apr 25, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Chri-s commented Jun 2, 2024