Provide tooling to aggregate files in snapshots directory #62

Open
jgehrcke opened this issue Jun 15, 2022 · 1 comment

jgehrcke commented Jun 15, 2022

Over time, the number of individual files in the ../ghrs-data/snapshots/ directory grows by O(1000) per year. This is not a problem for git, but it creates inconveniences. For example, the snapshots directory can no longer be browsed meaningfully via GitHub:

[Screenshot of the GitHub directory listing for ghrs-data/snapshots at github-repo-stats]

Note that only the oldest files are shown here; the listing is truncated before the newer files.

Another inconvenience is timing: there may be a noticeable difference between writing (upon checkout) and reading (upon parsing) a single file and doing the same for 1000 files.

I think in the long run the Action should automatically aggregate data into fewer individual files (each with more content, obviously), so that there are maybe O(10) files per year overall.

One question is whether the files should be nicely readable CSV files or whether it makes sense to use a different serialization format.

An intermediate, pragmatic step for me is to build tooling that allows doing this aggregation out-of-band, i.e. not as part of an Action run. The changes can then be committed manually to the data branch.
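
For illustration, here is a minimal sketch of what such out-of-band tooling could look like. It is not the project's implementation. It assumes that snapshot files are CSV files named `<YYYY-MM-DD>_<HHMMSS>_<kind>.csv` and that all files of the same kind share one header row; both are assumptions, as are the function name `aggregate_snapshots` and the added `snapshot_file` column. It merges all snapshots of one kind and month into a single CSV file and leaves the original files untouched.

```python
import csv
from collections import defaultdict
from pathlib import Path


def aggregate_snapshots(snapshot_dir, out_dir):
    """Merge per-snapshot CSV files into one CSV file per (month, kind).

    Hypothetical helper: the file name layout and the extra
    `snapshot_file` column are assumptions made for this sketch.
    """
    snapshot_dir, out_dir = Path(snapshot_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group input files by (YYYY-MM, kind), based on the assumed name
    # layout <YYYY-MM-DD>_<HHMMSS>_<kind>.csv. Other files are skipped.
    groups = defaultdict(list)
    for path in sorted(snapshot_dir.glob("*.csv")):
        parts = path.stem.split("_", 2)
        if len(parts) != 3:
            continue
        date, _time, kind = parts
        groups[(date[:7], kind)].append(path)

    written = []
    for (month, kind), paths in sorted(groups.items()):
        out_path = out_dir / f"{month}_{kind}.csv"
        with out_path.open("w", newline="") as out_file:
            writer = None
            for path in paths:
                with path.open(newline="") as in_file:
                    reader = csv.DictReader(in_file)
                    for row in reader:
                        if writer is None:
                            # Take the header from the first non-empty file.
                            fieldnames = ["snapshot_file"] + list(reader.fieldnames)
                            writer = csv.DictWriter(out_file, fieldnames=fieldnames)
                            writer.writeheader()
                        # Record which snapshot each row came from.
                        row["snapshot_file"] = path.name
                        writer.writerow(row)
        written.append(out_path)
    return written
```

With one output file per kind and month, a year of data would end up in O(10) files per kind. Keeping the source file name in each row means the snapshot time is not lost when the individual files are eventually removed.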


mepa1363 commented Mar 22, 2023

@jgehrcke thanks for creating this project and keeping it going. I was hoping to get an aggregated view for paths and referrers, like the one that exists for views and clones. You could leave the individual files in the snapshots directory as they are and store the aggregated data separately. Is that something you have on your radar? Is there anything I can do to help?
