Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot use git lfs filter extensions to clean data which will not be smudge-d back #5745

Open
JanSharp opened this issue May 13, 2024 · 3 comments

Comments

@JanSharp
Copy link

Describe the issue
I've been trying to set up a project for Unity, more specifically VRChat world creation. Without going into detail, Udon - part of VRChat's SDK - is horrendous for source control, so I've taken it upon myself to "fix" it by removing noise-and-merge-conflict-causing data from git. To do so I naturally wanted to use git's clean/smuge feature. The affected files are large as well (scenes and prefabs), so I need to run them through git lfs.

One of the things I would like to remove/hide from git is references to automatically generated files where said files are .gitignore-d.

For example there may be a line

serializedProgramAsset: {fileID: 11400000, guid: 9eb6bf22b7b45af1d8ef5e8652d24b03, type: 2}

in a file, which I've filtered to look like

serializedProgramAsset: {fileID: 0}

inside of git.

When reading https://github.com/git-lfs/git-lfs/blob/main/docs/extensions.md it mentions that extensions can be used to

Scan files on clean to make sure they don't contain sensitive information

And this is basically no different.

However further down in the smudge description it mentions:

[...] After each extension is invoked, LFS will compare the SHA-256 signature of the bytes output by the extension with the oid stored in the pointer file as the original input to that same extension. Those signatures must match, otherwise the extension did not undo its changes correctly. [...]

These 2 contradict each other, as it is impossible for the smudge filter to output the exact same data that was initially cleaned. That data has been removed entirely. (Unless the intention is that said sensitive data would be smudged back in by reading some external data, which is not exactly stable, safe, nor what git fitlers are meant to be doing, so I'd guess that is not the intent here either.)

When testing my extension it does appear as though the smudge description is accurate, as it is no longer possible to checkout affected files.

Since the feature is marked as experimental I thought I'd give you this as a form of feedback. I remember reading before that the guids in the pointer files are required for the clean/smudge filter extensions to work, and while I do not know nor understand why, if that is the case then there is nothing that can be changed here, outside of updating the documentation. But I could be wrong.

Side note about documentation: it is unclear if low or high priority extensions run first. The common issue of priority being ascending and descending in different pieces of software.

(And for the record, what I'm likely going to try and do is make a separate git filter which pipes the files through mine first and then through git lfs, making lfs unaware of the filter happening before it.)

System environment
ManjaroLinux 24.0.0
kernel 6.6.30-2-MANJARO (64-bit)

Output of git lfs env
Pretty sure you don't need this, but 🤷 (The compress extension is doing everything. I'm aware of the option to use multiple extensions, but I did not want to think about that at the moment.)

git-lfs/3.5.1 (GitHub; linux amd64; go 1.22.1)
git version 2.45.0

Endpoint=https://github.com/<foo>/AscensionAcademy.git/info/lfs (auth=none)
  SSH=git@github.com:<foo>/AscensionAcademy.git
LocalWorkingDir=/mnt/fast/dev/unity/AscensionAcademy
LocalGitDir=/mnt/fast/dev/unity/AscensionAcademy/.git
LocalGitStorageDir=/mnt/fast/dev/unity/AscensionAcademy/.git
LocalMediaDir=/mnt/fast/dev/unity/AscensionAcademy/.git/lfs/objects
LocalReferenceDirs=
TempDir=/mnt/fast/dev/unity/AscensionAcademy/.git/lfs/tmp
ConcurrentTransfers=8
TusTransfers=false
BasicTransfersOnly=false
SkipDownloadErrors=false
FetchRecentAlways=false
FetchRecentRefsDays=7
FetchRecentCommitsDays=0
FetchRecentRefsIncludeRemotes=true
PruneOffsetDays=3
PruneVerifyRemoteAlways=false
PruneVerifyUnreachableAlways=false
PruneRemoteName=origin
LfsStorageDir=/mnt/fast/dev/unity/AscensionAcademy/.git/lfs
AccessDownload=none
AccessUpload=none
DownloadTransfers=basic,lfs-standalone-file,ssh
UploadTransfers=basic,lfs-standalone-file,ssh
Extension[0]=compress
GIT_EXEC_PATH=/usr/lib/git-core
git config filter.lfs.process = "git-lfs filter-process"
git config filter.lfs.smudge = "git-lfs smudge -- %f"
git config filter.lfs.clean = "git-lfs clean -- %f"
@JanSharp
Copy link
Author

Quick addendum: I think I see why it needs the final oid guid in the pointer file - of course to find the file in lfs/objects, but also to know if the file has been modified. The intermediate guids I don't see a technical requirement for, except for the validation that currently happens when smudging.

With that in mind however, I do believe that it is not a functional requirement for the guids to be validated during a smudge process. It could be a flag defined in the config secion for a filter extension indicating whether it should do this validation or not, which has the downside that as soon as a filter has been run which does not want validation, any filters smudging the file afterwards could not do validation either even if they have the flag set requesting validation to be done for them.

@chrisd8088
Copy link
Contributor

Hey, thanks for writing up these notes!

it is unclear if low or high priority extensions run first. The common issue of priority being ascending and descending in different pieces of software.

I believe they are run in ascending priority order during a clean operation and run in descending priority order during a smudge operation. The documentation does seem to accord with that understanding. For the clean operation, it states:

If multiple extensions are installed, they are invoked in the order defined by their priority.

and for the smudge operation:

Each of the extensions indicated in the pointer file must be invoked in reverse order to undo the changes they made to the contents of the file.


I take your point that if an extension removes data destructively (such as a filter for sensitive content), it typically won't be able to reconstruct it when run "in reverse". I suppose that implies that any such filter must run only with the priority 1 so it runs first in the clean operation. Would that work for your experiment?

As for making the intermediate SHA-256 OIDs optional, I suspect that's not something we would add to our backlog at the moment, but if someone wants to contribute a patch with some tests which demonstrate how it works (and doesn't cause any incompatibilities with existing tools), we're always open to contributions!

@JanSharp
Copy link
Author

I believe they are run in ascending priority order during a clean operation and run in descending priority order during a smudge operation. The documentation does seem to accord with that understanding. For the clean operation, it states:

If multiple extensions are installed, they are invoked in the order defined by their priority.

My bad for just mentioning it as a side note without explaining, that is exactly where I was confused. My thought was that adding "highest to lowest" or "lowest to highest" to the docs would make it much harder to misunderstand. Of course this varies from person to person, but to me I'd interpret the current wording as "highest priority first", which would be descending... or ascending depending on defintion 😅 . There's so much software using "priority" and "order" but how the implementations actually sort seems unpredictable. I hope that explains my confusion better. (But it's also not a huge deal to be clear.)

I take your point that if an extension removes data destructively (such as a filter for sensitive content), it typically won't be able to reconstruct it when run "in reverse". I suppose that implies that any such filter must run only with the priority 1 so it runs first in the clean operation. Would that work for your experiment?

Yea you got it. Though since I'm only using 1 extension here, its priority wouldn't make a difference. The ultimate issue in my case is that the final sha256 checksum of the file does get validated (the one calculated and saved before my extension ran, and then checked after the smudge for my extension ran). Making it run first would allow other extensions after it to be non destructive and therefore be able to get validated, but any destructive extensions are currently not possible so that detail technically does not matter (currently).

As for making the intermediate SHA-256 OIDs optional, I suspect that's not something we would add to our backlog at the moment, but if someone wants to contribute a patch with some tests which demonstrate how it works (and doesn't cause any incompatibilities with existing tools), we're always open to contributions!

Makes sense and no worries, I just wanted to give a bit of feedback. The workaround I mentioned - using a different git filter - appears to work quite nicely, so I'm sorry I won't be making such a PR (I've also not written a single line of go yet). But I'm glad that you would be open to such a contribution :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants