Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize the linter #3076

Closed
rviscomi opened this issue Aug 19, 2022 · 15 comments
Closed

Optimize the linter #3076

rviscomi opened this issue Aug 19, 2022 · 15 comments
Labels
development Building the Almanac tech stack enhancement New feature or request
Projects
Milestone

Comments

@rviscomi
Copy link
Member

It takes several minutes to run the linter. Is there any way to speed that up?

It seems like a lot of time is spent checking out the entire almanac repo. Would it be possible/faster to only check out the files that changed?

@rviscomi rviscomi added enhancement New feature or request development Building the Almanac tech stack labels Aug 19, 2022
@rviscomi rviscomi added this to the Backlog milestone Aug 19, 2022
@rviscomi rviscomi added this to TODO in 2022 via automation Aug 19, 2022
@tunetheweb
Copy link
Member

The linter does this:

  1. Download the Docker image.
  2. Checkout the repo
  3. Lint only the changed files

3 is actually the quickest! It's the first two that are problematic.

On the first the Linter as one huge docker image. This has positives and negatives:

  • On the positive we have access to lots of linters, written in lots of languages, without having to worry about installing them.
  • On the negative it's massive and takes a while to download because of that.

The Mega-Linter is a fork that produces several Docker images so you can pick the smallest one that has all the linter you need and not have the DotNet linter that we don't care about. But I'd rather stick with the original GitHub-supported one. The other alternative is to install the individual linters. That would work fine for JS and Python ones since we have those set up (sqlfluff is python btw and we do have that in out requirements.txt), but means we have to figure out changed files ourselves (another thing the Super-Linter handles for us).

On the second issue, our Repo is just too damn big. I suspect this is due to PDF changes on released. PDFs, as binary files, aren't really suited to Git. We have a lot of them and every release where we update the PDFs adds 300 MB to the repo as we check in new versions 😔 I don't think there is a way to have only the latest versions in git - we don't really need the history of these files in git. Nor do I think there is a way to purge out the old PDFs from git history (git never forgets!). To be honest, in hindsight, storing these PDFs in this git repo was a bad idea for this reason. Maybe should build them at release time and upload to our CDN? But we haven't figured out the security model for that yet. And even then can't purge out the old ones. Should look at soon though as only going to get worse and worse - especially as we add more PDFs.

Would it be possible/faster to only check out the files that changed?

I don't think that's possible with git? Though looks like we could potentially use the new-ish filter command to ignore the PDFs?

@tunetheweb
Copy link
Member

tunetheweb commented Aug 19, 2022

Not supported with GitHub Actions checkout yet: actions/checkout#680 😔

Not sure what the difference is between that and just using git checkout to be honest. But let's look at this, when the above issue is resolved.

@max-ostapenko
Copy link
Contributor

  1. @tunetheweb We can cleanup the git history using git-filter-repo:
    git filter-repo --path-glob 'src/static/pdfs' --invert-paths --force

Looks like it's goes from 3500Mb down to 700Mb then - see fork.

I see the predeploy workflow generates all the PDF anyway, so no need to store PDFs in repo.
Or am I missing anything?

P.S. It would be really great to get this done for dev container building speed - #3404.

  1. And a small quick fix for SuperLinter - Super linter slim #3409.

@max-ostapenko
Copy link
Contributor

GitHub actually shows repo sizes in settings.
And it shows half size from what I see on the disk.
The screenshot:
Screenshot 2023-05-03 at 19 52 37

@tunetheweb
Copy link
Member

I would love to get rid of the PDFs from this repo. They are massive, and only increase with every release where we generate the PDFs (and we generate ALL the PDFs - even if we only need one).

Unfortunately at the moment we generate the PDFs in a GitHub Action, add them to the repo, and then release from our PCs (where we upload them to our CDN).

This is good because it means we don't need the PDF software on our machines.
This is bad as we're using GitHub as a transfer mechanism.

With your suggested change above, we lose that ability as the PDFs are permanently gone from the repo.

I would be lovely if we could somehow mark a folder as only requiring the latest entry and not a history, but IIRC git doesn't allow you to do that?

So options are:

  1. Upload to CDN from the GitHub Action. We've avoided doing that as would require keys or the like in the repo and we have a LOT of contributors permissioned in this repo and like to freely give out that access, but might be nervous too if we start storing secrets in the repo.
  2. Use some other storage place between GitHub Action running and release (git lfs? Don't know much about that).
  3. Do a clear down once, and then renable it? Do that periodically?
  4. Some other idea?

@max-ostapenko
Copy link
Contributor

max-ostapenko commented May 3, 2023

Regarding the options mentioned:

  1. So it's actually implementing a deployment workflow in actions, instead of your laptops. Seems like we would require custom repository roles to limit repo collaborators from accessing secrets.
  2. I'm not familiar with LFS, but used to rely on ci artifacts in similar cases. May share a PR this week.
  3. Would like to avoid this scenario.
  4. Other idea - install PDF software on dev container and use it for development AND PDF generation.

Seems both 2 and 4 actually make sense: make generation and preview easily available and store action-generated files to a dedicated storage.

@max-ostapenko
Copy link
Contributor

max-ostapenko commented May 8, 2023

Some conclusions for above discussion.

  1. Install filter-repo tool:
    brew install git-filter-repo
  2. Get a clean clone of a repo:
git clone https://github.com/HTTPArchive/almanac.httparchive.org.git
cd almanac.httparchive.org.git
  1. Run the cleaning:
    git filter-repo --path-glob 'src/static/pdfs' --invert-paths --force

  2. Add the git remote that was removed during cleanup
    git remote add origin https://github.com/HTTPArchive/almanac.httparchive.org.git

  3. Push the cleaned repo:

git push origin --force --all
git push origin --force --tags
  1. Contact GitHub to remove cache and ask all collaborators to rebase (NOT merge) ALL the branches.

Assuming changes above, my estimate for the repo checkout step is <20s (down from ~3m).

@tunetheweb
Copy link
Member

Oh that's a bit annoying that we have to do step 5 :-(

I presume this will keep all the git history apart from the PDFs?

@max-ostapenko
Copy link
Contributor

You can see a preview of these steps in this fork: https://github.com/max-ostapenko/almanac.httparchive.org

@tunetheweb
Copy link
Member

@max-ostapenko I was going through the steps in #3076 (comment) but after doing the git filter-repo it seems to have removed the remote repo:

(.venv) 15:54:14 ~/sources/tunetheweb/almanac.httparchive.org main $ git push origin --force --all
fatal: 'origin' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
(.venv) 15:55:46 ~/sources/tunetheweb/almanac.httparchive.org main $ git remote -v
(.venv) 15:55:55 ~/sources/tunetheweb/almanac.httparchive.org main $

Is that expected and do I just need to re-add it?

@max-ostapenko
Copy link
Contributor

I don't remember this happening. Otherwise I think I would write it down.
I can test it with a fresh clone, but it will take some time..

@max-ostapenko
Copy link
Contributor

@tunetheweb yes, I missed it. Please add it manually and proceed.

@tunetheweb
Copy link
Member

OK that's done now. It also seems to have pushed the other branches.

Do we need to do the "Contact GitHub to remove cache" step? Or will the cache be flushed soon enough?

@max-ostapenko
Copy link
Contributor

It's what their documentation suggested ;)
I just cloned it in 40s, so probably no need.

@tunetheweb
Copy link
Member

Supper quick now. Completes in about a minute.

The testing of the website still takes time as we lint all the generated HTML and run Lighthouse, but still under 10mins.

Thanks @max-ostapenko - great improvements!

2022 automation moved this from TODO to Done May 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Building the Almanac tech stack enhancement New feature or request
Projects
2022
Done
Development

No branches or pull requests

3 participants