Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for parsing author lines with no angle brackets #927

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

progval
Copy link
Contributor

@progval progval commented Jan 24, 2022

This commit adds support for parsing author lines with no angle brackets,
as I found an example of this on GitHub.
It uses a regexp to avoid hand-writing some backtracking logic.

This also adds tests for other types of brokenness that I found in existing
git repos, and that were already supported. I tried to use the original example
as test data when possible.

This commit adds support for parsing author lines with no angle brackets.
It uses a regexp to avoid hand-writing some backtracking logic.

This also adds tests for other types of brokenness that I found in existing
git repos, and that were already supported. I tried to use the original example
as test data when possible.
@codecov-commenter
Copy link

codecov-commenter commented Jan 25, 2022

Codecov Report

Merging #927 (3b8a81d) into master (6b53d0b) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #927      +/-   ##
==========================================
- Coverage   84.11%   84.11%   -0.01%     
==========================================
  Files          92       92              
  Lines       22679    22752      +73     
  Branches     3315     3317       +2     
==========================================
+ Hits        19076    19137      +61     
+ Misses       3129     3119      -10     
- Partials      474      496      +22     
Impacted Files Coverage Δ
dulwich/objects.py 89.43% <100.00%> (+0.22%) ⬆️
dulwich/tests/test_objects.py 99.71% <100.00%> (+0.03%) ⬆️
dulwich/tests/test_hooks.py 94.04% <0.00%> (-3.58%) ⬇️
dulwich/patch.py 92.39% <0.00%> (-1.64%) ⬇️
dulwich/tests/test_patch.py 95.93% <0.00%> (-1.17%) ⬇️
dulwich/tests/test_refs.py 98.91% <0.00%> (-0.55%) ⬇️
dulwich/tests/test_index.py 96.14% <0.00%> (-0.49%) ⬇️
dulwich/config.py 81.86% <0.00%> (-0.29%) ⬇️
dulwich/repo.py 88.35% <0.00%> (-0.14%) ⬇️
dulwich/contrib/test_swift.py 29.05% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6b53d0b...3b8a81d. Read the comment docs.

@jelmer
Copy link
Owner

jelmer commented Jan 29, 2022

Thanks for the PR. It looks like some of these will break roundtripping, i.e. parsing some bytes into a Commit object and then getting the exact same set of bytes back out when calling Commit.as_raw_bytes() - which is a property I'd like to maintain.

Have you actually seen all of these in the wild? We could probably offer some sort of alternative for cases like this, if we don't want to add attributes for each one.

@progval
Copy link
Contributor Author

progval commented Jan 29, 2022

It looks like some of these will break roundtripping

which is a property I'd like to maintain

There were already some breakages on commits with weird timezones. For example, missing leading zeros in timezones (I don't know how this happened, I assume home-made Git implementations) are fixed by Dulwich:

>>> b = b'tree aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa \nparent bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb\nauthor John Doe <jdoe@example.com> 1614159930 +100\ncommitter John Doe <jdoe@example.com> 1614159930 +100\n\nfoo'
>>> c = dulwich.objects.Commit.from_string(b)
>>> c.author = c.author  # reset c._needs_serialization
>>> c.as_raw_string()
b'tree aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa \nparent bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb\nauthor John Doe <jdoe@example.com> 1614159930 +0100\ncommitter John Doe <jdoe@example.com> 1614159930 +0100\n\nfoo'

The solution we decided to use at @SoftwareHeritage was to keep the original timezone bytes somewhere. I could do it for Dulwich (I actually have a commit ready to do this, based on the one in this PR)

Have you actually seen all of these in the wild?

Yes. I have seen about 14k commits with missing leading zeros as above. 232 with the timezone +0575 (which gets rewritten as +0615), and 726 with various types of brokenness (the 0000 above, huge timezones with a leading zero, such as +091800 which gets rewritten as +91800, etc.)

We could probably offer some sort of alternative for cases like this, if we don't want to add attributes for each one.

Dulwich could store the raw bytes of the original timezone. This requires a new attribute, but it allows the *_timezone_neg_utc attribute to be removed.

We could probably offer some sort of alternative for cases like this, if we don't want to add attributes for each one.

@jelmer jelmer self-requested a review July 2, 2022 10:52
@jelmer
Copy link
Owner

jelmer commented Jan 26, 2023

FWIW Rather than changing the default parser, I'm planning to lazily load these event lines (user identity + timestamp + timezone). That way, you can get access to the raw data and parse that with a less strict parser if you like. The added minor benefit could be that we get to avoid parsing them when we don't need them.

My PoC is in #1135; that approach works, but adds two extra object instances per commit which is not ideal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants