Using git lfs pull --include ... in a shallow clone with sparse checkout causes git to download all blobs one by one. #5689

Open
mattrq opened this issue Mar 27, 2024 · 6 comments

Comments

@mattrq

mattrq commented Mar 27, 2024

Describe the bug
Git LFS runs git ls-tree with -l when the --include option is used. In a filtered (partial) clone, this causes git to fetch the missing blobs one by one.
Without -l, this does not happen.

To Reproduce

git clone --filter=tree:0 --depth=1 --no-checkout --no-tags git@github.com:git-lfs/git-lfs.git
cd git-lfs
git sparse-checkout set commands --cone
git checkout main
# Git ls-tree without -l
time git '-c' 'filter.lfs.smudge=' '-c' 'filter.lfs.clean=' '-c' 'filter.lfs.process=' '-c' 'filter.lfs.required=false' 'ls-tree' '-r' '-z' '--full-tree' HEAD
# Git ls-tree with -l
time git '-c' 'filter.lfs.smudge=' '-c' 'filter.lfs.clean=' '-c' 'filter.lfs.process=' '-c' 'filter.lfs.required=false' 'ls-tree' '-l' '-r' '-z' '--full-tree' HEAD
GIT_TRACE=1 git lfs pull -I command 2>&1 | grep ls-tree

Observations:
The ls-tree run with -l is far slower.
git lfs pull calls ls-tree with -l.

Expected behavior
Git LFS should make a first pass with git ls-tree without the -l option when the output is only used to filter path names.

System environment
macOS

Output of git lfs env
git lfs env
git-lfs/3.4.1 (GitHub; darwin arm64; go 1.21.5)
git version 2.44.0

Endpoint=https://github.com/git-lfs/git-lfs.git/info/lfs (auth=none)
SSH=git@github.com:git-lfs/git-lfs.git
LocalWorkingDir=/Users/mrosenquist/git-lfs
LocalGitDir=/Users/mrosenquist/git-lfs/.git
LocalGitStorageDir=/Users/mrosenquist/git-lfs/.git
LocalMediaDir=/Users/mrosenquist/git-lfs/.git/lfs/objects
LocalReferenceDirs=
TempDir=/Users/mrosenquist/git-lfs/.git/lfs/tmp
ConcurrentTransfers=8
TusTransfers=false
BasicTransfersOnly=false
SkipDownloadErrors=false
FetchRecentAlways=false
FetchRecentRefsDays=7
FetchRecentCommitsDays=0
FetchRecentRefsIncludeRemotes=true
PruneOffsetDays=3
PruneVerifyRemoteAlways=false
PruneRemoteName=origin
LfsStorageDir=/Users/mrosenquist/git-lfs/.git/lfs
AccessDownload=none
AccessUpload=none
DownloadTransfers=basic,lfs-standalone-file,ssh
UploadTransfers=basic,lfs-standalone-file,ssh
GIT_EXEC_PATH=/opt/homebrew/opt/git/libexec/git-core
GIT_TRACE2_PARENT_NAME=run_dashed
GIT_TRACE2_PARENT_SID=20240327T232438.979651Z-Hbf43c273-P00017815
git config filter.lfs.process = "git-lfs filter-process"
git config filter.lfs.smudge = "git-lfs smudge -- %f"
git config filter.lfs.clean = "git-lfs clean -- %f"

Additional context
Git LFS does not directly support sparse checkouts, so the workaround is to filter manually to match the sparse-checkout rules. The use of -l on ls-tree defeats that manual filtering.

Related issue

@bk2204
Member

bk2204 commented Mar 28, 2024

Hey,

I don't think we can do that. We need the size of the blob to determine whether it can be a pointer file or not, since large blobs can't be pointers, and it's not efficient to scan very large blobs if we don't need to. If we didn't have the size, we'd still have to download every blob, which would perform even worse on partial clone and just awfully in general on large working trees without partial clone, since we'd be processing many more files than we needed to.
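To illustrate why the size matters, here is a minimal sketch (not the actual Git LFS code) of using the size column of `git ls-tree -l` to rule out blobs without reading them. It assumes the 1024-byte upper bound on pointer files from the Git LFS pointer specification; the helper name `small_blobs` is made up for this example.

```shell
# Hypothetical helper: read `git ls-tree -r -l` output on stdin and print the
# paths of blobs small enough to possibly be LFS pointer files.
# Assumption: pointer files are under 1024 bytes.
small_blobs() {
  awk -F '\t' '{
    # Before the tab: <mode> <type> <oid> <size>; the size is space-padded.
    split($1, meta, " ")
    if (meta[2] == "blob" && meta[4] + 0 < 1024) print $2
  }'
}

# Usage inside a repository:
#   git ls-tree -r -l --full-tree HEAD | small_blobs
```

Only the blobs that pass this check would need to be read and parsed as pointers.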

I agree this performs terribly in this case. I also tried with --filter=blob:none instead of --filter=tree:0, and it wasn't appreciably better. However, the fact that Git downloads objects one by one in this case is a Git bug, and not a Git LFS bug. Git has special logic in several cases to batch these requests, and I think this is one of those cases where it needs to do that.

Since you have a good reproduction case, it would be good to use git bugreport, provide the reproduction steps, and then send it to the mailing list. Or, if you want to, you can try your hand at a patch. An example of this kind of batching might be c0c578b33ca.

@mattrq
Author

mattrq commented Mar 28, 2024

I agree that the files not being fetched as a batch is a core Git issue.

The issue gets far worse as the repo scales.

Currently, Git LFS requires that the blobs are populated, even for files that it does not track, because of the way it calls git ls-tree. This does seem like a Git LFS issue to me.

If I've read the code correctly (I'm not a Go programmer), the size field could be used slightly later in the call stack.
An early step is filtering by the path. Would it be possible to find the relevant files first and then call ls-tree -l to get the format needed for just those files?
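A rough sketch of that two-pass idea in shell, purely as an illustration and not how Git LFS implements it. The helper names `under_dir` and `list_with_sizes` are hypothetical, the prefix match is only a stand-in for the real --include matching, and paths are assumed to contain no whitespace.

```shell
# Hypothetical helper: keep only paths below the given directory.
# Reads one path per line on stdin; a simple stand-in for --include matching.
under_dir() {
  awk -v dir="$1" 'index($0, dir "/") == 1'
}

# Hypothetical two-pass listing: names first (no -l), then sizes for matches only.
# Must be run inside a repository. Assumes paths without whitespace.
list_with_sizes() {
  paths=$(git ls-tree -r --name-only --full-tree HEAD | under_dir "$1")
  # Nothing matched: skip the second pass instead of listing the whole tree.
  [ -n "$paths" ] || return 0
  printf '%s\n' "$paths" | xargs git ls-tree -l --full-tree HEAD --
}

# Usage inside a repository:
#   list_with_sizes commands
```

The second pass still asks for blob sizes, but only for the paths that survived the filter.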


Side note
I tried using the "magic" patterns, not sure if this has been looked into, thought it may be of interest.
The following will list all files that have the attribute filter set to lfs:
git ls-files --full-name --with-tree=HEAD ":(top,attr:filter=lfs)"
If the above is useful, then adding -t and ignoring lines starting with S would make it sparse compatible 😉
E.g. git ls-files -t --full-name --with-tree=HEAD ":(top,attr:filter=lfs)" | grep -v ^S
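The same filter written as a small helper that also strips the status tag, as a sketch only. The name `in_sparse_checkout` is hypothetical; it relies on `git ls-files -t` prefixing each path with a one-letter tag and a space, with S marking skip-worktree entries.

```shell
# Hypothetical helper: read `git ls-files -t` output on stdin, drop entries
# tagged S (skip-worktree, i.e. outside the sparse checkout), and strip the
# two-character "<tag> " prefix from the remaining lines.
in_sparse_checkout() {
  awk '!/^S / { print substr($0, 3) }'
}

# Usage inside a repository:
#   git ls-files -t --full-name --with-tree=HEAD ":(top,attr:filter=lfs)" | in_sparse_checkout
```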

@bk2204
Member

bk2204 commented Apr 3, 2024

Okay, I have #5699, which should apply that suggestion, but doesn't yet wire up the sparse functionality. I think that can come in a future revision with the rest of the sparse checkout functionality.

@mattrq
Author

mattrq commented Apr 7, 2024

The PR looks great 👍. Glad the filters were helpful; they were new to me.

Good catch on making the formatting the same.

@mattrq
Author

mattrq commented Apr 18, 2024

@bk2204 great to see #5699 merged. Any idea if this will be released soonish or independently of v3.50?

@bk2204
Member

bk2204 commented Apr 18, 2024

It will be released in v3.6, since it's a new feature, but we don't have concrete plans to do that at the moment. It will likely be a couple months.
