Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PackWriter.buildIndex is too slow #813

Open
Tachone opened this issue Jul 19, 2023 · 4 comments
Open

PackWriter.buildIndex is too slow #813

Tachone opened this issue Jul 19, 2023 · 4 comments

Comments

@Tachone
Copy link

Tachone commented Jul 19, 2023

When I executed UploadPack() and got an 8G packfile, dotgit.(*PackWriter).buildIndex took 30 hours to generate the corresponding .idx , most of the cpu time was spent in packfile.(*Parser).resolveDeltas

See pprof.

cpu:
image

heap:
image

But use git index-pack --strict --threads=16 directly to generate .idx in 2 hours. Is there room for optimization in go-git for this scenario?

The optimization method I can think of is after generating .pack through UploadPack, directly exec command
git index-pack --strict --threads=16 to generate .idx, then call ObjectStorage.Reindex() to reload index,
EncodedObject(t plumbing.ObjectType, h plumbing.Hash) to trigger reload packfile.

Will there be any problems with this, and any other optimization suggestions?

@pjbgf
Copy link
Member

pjbgf commented Jul 20, 2023

@Tachone the git CLI is indeed more efficient than go-git and also has the ability to do some of that work in parallel, which unfortunately go-git does not.

There is definitely room for improvement. The PR #799 add some changes to reduce memory churn, which in turn would decrease GC costs. Would you be able to test the impact on your case with those changes?

On your profile graph the cost of using sha1cd is high - which is expected for a large repo. That is a more secure implementation of SHA1, as it has collision detection. On the other hand, it is more costly than the default sha1. If performance is more important to you, and you trust the source of your repository, you could switch to the default Go sha1 implementation with:

hash.RegisterHash(crypto.SHA1, sha1.New)

Note that hash above is go-git/v5/plumbing/hash.

Another potential optimisation here would be to bring parallelism to buildIndex, even if that only applies for seekable storage.

@pjbgf
Copy link
Member

pjbgf commented Jul 20, 2023

@Tachone if you are keen on using git CLI in parallel with go-git, as you suggested, I would probably not do that simultaneously. So for example, I would open the repository with go-git only after any git CLI operations took place.

@Tachone
Copy link
Author

Tachone commented Jul 26, 2023

@pjbgf thanks. #799

Introducing the LazyObjectWriter interface which enables the write
operation to take places directly against the filesystem-based storage.

In my case,call WritePackfileToObjectStorage() directly to save packfile, p.storage is nilstorage.SetEncodedObject will not be executed. So LazyObjectWriter this part of optimization may not be obvious to me.

But i still think it‘s helpful, due to the decreased GC pressure and the garbage collection costs.

I will test the impact on my case. Is #799 ready for production and release later?

@pjbgf
Copy link
Member

pjbgf commented Sep 3, 2023

@Tachone yes, hoping to get more folks to test and review the PR. From my point of view it is ready to go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants