Finding LFS pointers scales poorly in very large repositories #5570

Open
klylesatepic opened this issue Nov 13, 2023 · 3 comments
Labels: enhancement, to-investigate (Further investigation is needed)

Comments

@klylesatepic
klylesatepic commented Nov 13, 2023

Describe the issue
Pushing a large amount of content with LFS enabled takes days to weeks for LFS to discover all content that needs to be pushed. For our largest repository (>5.5 million commits), we've never actually gotten it to finish (but it takes at least 7 days).

In order to push that content successfully, we developed a workaround. We gather all the LFS oids with this series of commands:

{
    # Optional (speeds up incremental pushes): list hashes from
    # the remote with a leading ^ so we can skip that content
    git ls-remote --refs origin | cut -c -40 | sed -e 's/^/^/'
    # List hashes for all refs locally
    git show-ref | cut -c -40
} |
    git log --stdin -G '^oid sha256:' --patch --text --diff-merges=first-parent --no-renames --no-textconv |
    grep '^+\+oid sha256:' |
    sed -e 's/.*://' |
    sort -u

This takes ~5-6 hours for the largest repository. We then push each LFS oid. Once all LFS content is pushed, we disable LFS for the repo we're pushing from and push the regular git content.
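
The "push each LFS oid" step can be sketched roughly as follows, assuming the oids from the pipeline above were saved one per line to a file. `git lfs push --object-id <remote> <oid>...` is an existing git-lfs invocation; the function name, the batch size, and the `LFS_PUSH_CMD` override (there only so the batching can be tried without a real remote) are hypothetical and not part of the workaround as described.

```shell
# Sketch only: push LFS objects by oid in batches.
# Arguments: remote name, file with one oid per line, optional batch size.
push_lfs_oids() {
    remote=$1
    oid_file=$2
    batch_size=${3:-100}   # assumed default, not a measured optimum

    # LFS_PUSH_CMD is a hypothetical override for dry experiments;
    # it is left unquoted on purpose so the default splits into words.
    # xargs appends up to $batch_size oids after the remote name.
    xargs -n "$batch_size" ${LFS_PUSH_CMD:-git lfs push --object-id} "$remote" < "$oid_file"
}

# Example (not executed here):
#   push_lfs_oids origin oids.txt 500
```

Batching keeps each command line well below argument-length limits while avoiding one process per oid.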

Obviously this is not a ready-to-use solution for the general case -- at a minimum, it can have false positives from things that look like LFS oids but aren't in LFS pointers. However, it demonstrates the potential for improvement.
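
One cheap way to trim some of those false positives is to keep only values that are well-formed SHA-256 hex strings. This is a minimal sketch; the function name is made up for illustration, and it only rejects malformed values. It cannot tell whether a well-formed match actually came from an LFS pointer file.

```shell
# Sketch only: read candidate oids on stdin, keep lines that are
# exactly 64 lowercase hex characters, and de-duplicate them.
filter_sha256_oids() {
    grep -E '^[0-9a-f]{64}$' | sort -u
}

# Example (not executed here), appended to the pipeline above
# in place of the final `sort -u`:
#   ... | sed -e 's/.*://' | filter_sha256_oids
```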

I also created a Bash script (attached at the end) that creates a repository demonstrating the issue and then runs both methods for finding all LFS oids. Note that it uses only a single LFS oid to save space. This is not how our repository works, but multiple LFS oids don't seem to be necessary to demonstrate the issue. Multiple different non-LFS files do seem to be required, though; there appears to be some optimization that avoids processing the same blob multiple times. The default size of 10,000 commits takes about 4 minutes to run on an i9 9900 with NVMe storage, and demonstrates the issue well enough that it should allow for useful profiling. However, the larger the repository, the worse the difference gets.

System environment

Output of git lfs env
Git Bash on Windows:

git-lfs/3.4.0 (GitHub; windows amd64; go 1.20.6; git d06d6e9e)
git version 2.42.0.windows.2

Endpoint (upstream)=https://c//<omitted>/upstream/.git/info/lfs (auth=none)
  SSH=c:/<omitted>/upstream
LocalWorkingDir=C:\<omitted>\test
LocalGitDir=C:\<omitted>\test\.git
LocalGitStorageDir=C:\<omitted>\test\.git
LocalMediaDir=C:\<omitted>\test\.git\lfs\objects
LocalReferenceDirs=
TempDir=C:\<omitted>\test\.git\lfs\tmp
ConcurrentTransfers=8
TusTransfers=false
BasicTransfersOnly=false
SkipDownloadErrors=false
FetchRecentAlways=false
FetchRecentRefsDays=7
FetchRecentCommitsDays=0
FetchRecentRefsIncludeRemotes=true
PruneOffsetDays=3
PruneVerifyRemoteAlways=false
PruneRemoteName=origin
LfsStorageDir=C:\<omitted>\test\.git\lfs
AccessDownload=none
AccessUpload=none
DownloadTransfers=basic,lfs-standalone-file,ssh
UploadTransfers=basic,lfs-standalone-file,ssh
GIT_EXEC_PATH=C:/Program Files/Git/mingw64/libexec/git-core
git config filter.lfs.process = "git-lfs filter-process"
git config filter.lfs.smudge = "git-lfs smudge -- %f"
git config filter.lfs.clean = "git-lfs clean -- %f"

Linux server where we first saw the issue:

git-lfs/3.0.2 (GitHub; linux amd64; go 1.18.1)
git version 2.42.0

Endpoint (<omitted>)=https://<omitted>.git/info/lfs (auth=none)
  SSH=git@<omitted>.git
LocalWorkingDir=/<omitted>
LocalGitDir=/<omitted>/.git
LocalGitStorageDir=/<omitted>/.git
LocalMediaDir=/<omitted>/.git/lfs/objects
LocalReferenceDirs=
TempDir=/<omitted>/.git/lfs/tmp
ConcurrentTransfers=8
TusTransfers=false
BasicTransfersOnly=false
SkipDownloadErrors=false
FetchRecentAlways=false
FetchRecentRefsDays=7
FetchRecentCommitsDays=0
FetchRecentRefsIncludeRemotes=true
PruneOffsetDays=3
PruneVerifyRemoteAlways=false
PruneRemoteName=origin
LfsStorageDir=/<omitted>/.git/lfs
AccessDownload=none
AccessUpload=none
DownloadTransfers=basic,lfs-standalone-file,ssh
UploadTransfers=basic,lfs-standalone-file,ssh
GIT_EXEC_PATH=/usr/lib/git-core
git config filter.lfs.process = "git-lfs filter-process"
git config filter.lfs.smudge = "git-lfs smudge -- %f"
git config filter.lfs.clean = "git-lfs clean -- %f"

Additional context
Script to show the issue: gist or create-repo-and-test.zip (had to zip -- .sh files are not allowed to be attached)

@chrisd8088
Contributor

Hey, thanks for the report and the detailed reproduction script! I think this gives us a lot of scope for analysis. We probably want to investigate a few things to start with:

  • How does the LFS performance compare with plain Git (without the LFS objects in the test repo, for instance)?
  • Does the performance suffer equally on all platforms? (Likely so, but worth checking.)
  • Is the time mostly spent in the Go code, or in Git calls made from the Go code?

I'll mark this as an issue for enhancement, and try to experiment a bit with your reproduction case to see if we can narrow down the likely causes a bit.
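
A rough way to get a first number for the LFS-versus-plain-Git question is to time both scans on the reproduction repository. This is a sketch under assumptions: the helper name and labels are hypothetical, timing is whole seconds only, and it assumes that `git lfs push --dry-run` exercises the same pointer scan as a real push and that `git push --no-verify` skips the LFS pre-push hook.

```shell
# Sketch only: run a command, discard its output, and report
# the wall-clock time in whole seconds plus the exit status.
time_cmd() {
    label=$1
    shift
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    status=$?
    end=$(date +%s)
    printf '%s\t%ss\texit=%s\n' "$label" "$((end - start))" "$status"
}

# Example (not executed here):
#   time_cmd lfs-scan  git lfs push --dry-run --all origin
#   time_cmd plain-git git push --dry-run --no-verify --all origin
```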

chrisd8088 added the enhancement and to-investigate (Further investigation is needed) labels on Nov 16, 2023

@klylesatepic
Author

  1. I'm not 100% sure what you're asking here, so let me know if I missed the point. Once the LFS content is pushed (separately, with our workaround), we're able to push the entire git repository in well under a day.
  2. Roughly equally between Linux and Windows from what we've seen.
  3. I have not investigated to that depth yet. I can tell you that top attributes the CPU time to the git-lfs process itself, if that's helpful. I will get more details, but please note that it may take me some time -- we're still in the middle of prepping for our production rollout (of git+LFS).

@klylesatepic
Author

I haven't been able to figure out how to get debug symbols that perf can use to show the expensive functions involved, but I can confirm that the time is mostly within the Go code, not within git calls made from it.

The Go code shows up as ~91.74%, and the git calls show up as ~1.63%, according to perf report.

Obviously there is some rounding error there, but even if every git call line sat just under the threshold for showing up as 0.01% higher, git calls still couldn't account for more than ~12% of the total time.
