explore stargazer limit challenge (40k+) #76

Open
jgehrcke opened this issue Sep 20, 2023 · 4 comments

@jgehrcke
Owner

Just saw https://github.com/Significant-Gravitas/Auto-GPT/ starting to use github-repo-stats. They have ~150k stargazers, but we can extract only 40000:

230920-13:20:22.067 INFO:MainThread: 39600 gazers fetched
230920-13:20:22.465 INFO:MainThread: 39800 gazers fetched
230920-13:20:22.728 INFO:MainThread: 40000 gazers fetched
230920-13:20:22.790 INFO:MainThread: GH request limit after fetch operation: 4279
230920-13:20:22.790 INFO:MainThread: http requests made (approximately): 400
230920-13:20:22.790 INFO:MainThread: stargazer count: 40000
230920-13:20:22.924 INFO:MainThread: stargazer df

This seems to be a known limitation of the API, which delivers at most 400 pages (here 400 requests of 100 stargazers each, i.e. 40000 in total):
https://stackoverflow.com/questions/68910259/fetch-all-stargazers-over-time-of-a-repository

Strongly related, potentially offering a solution: https://observablehq.com/@observablehq/github-stargazer-history

@Swiftyos I hope you get this notification; we can look into extracting the 'correct' number of stargazers in your special case. I have deliberately left the "many stargazers challenge" unaddressed so far, and there are obvious ideas for improvement so that the bulk of the stargazer timeseries does not need to be re-fetched every single time the action runs. Also see

# TODO: for ~10k stars repositories, this operation is too costly for doing
.
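A minimal sketch of one such improvement idea, not what github-repo-stats currently does: keep the already-fetched stargazers in a cache and only re-fetch the tail pages. The function names `first_page_to_refetch` and `merge_stargazers` are hypothetical. The record shape (`starred_at`, `user.login`) assumes the stargazers endpoint is queried with the `application/vnd.github.star+json` media type. Note that this only helps while the repository is still below the 40000-stargazer pagination cap.

```python
def first_page_to_refetch(cached_count: int, per_page: int = 100,
                          overlap_pages: int = 1) -> int:
    """Return the 1-based page number at which to resume fetching.

    Pages are assumed to be ordered oldest-first. We step back by
    `overlap_pages` because un-star events shift later entries forward.
    """
    full_pages = cached_count // per_page
    return max(1, full_pages + 1 - overlap_pages)


def merge_stargazers(cached: dict, fetched: list) -> dict:
    """Merge newly fetched records into the cache, keyed by user login.

    `cached` maps login -> starred_at timestamp string.
    `fetched` is a list of records shaped like the star+json response items.
    Duplicates from the overlap region simply overwrite themselves.
    """
    merged = dict(cached)
    for record in fetched:
        merged[record["user"]["login"]] = record["starred_at"]
    return merged
```

With 39950 cached stargazers and 100 per page, fetching would resume at page 399 instead of page 1.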

@Swiftyos I saw you picked a 90 minute interval for running the action -- that is rather frequent, with no obvious benefit! Once per day should really be good enough. Are there specific concerns you are trying to address with the 90 minute interval?

@jgehrcke
Owner Author

The Link header of the response to /stargazers shows the "last" page, and beyond that there's no hope, I think:

< link: <https://api.github.com/repositories/614765452/stargazers?page=2>; rel="next", <https://api.github.com/repositories/614765452/stargazers?page=1334>; rel="last"
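To make the header above concrete, here is a small sketch that extracts the `rel="last"` page number from a `Link` header and multiplies it by the page size. The helper name `last_page_from_link_header` is made up for illustration. The request above did not set `per_page`, so the sketch assumes GitHub's default page size of 30.

```python
import re
from typing import Optional


def last_page_from_link_header(link_header: str) -> Optional[int]:
    """Return the page number of the rel="last" entry, or None if absent."""
    for part in link_header.split(","):
        match = re.search(r'<[^>]*[?&]page=(\d+)[^>]*>\s*;\s*rel="last"', part)
        if match:
            return int(match.group(1))
    return None


header = (
    '<https://api.github.com/repositories/614765452/stargazers?page=2>; rel="next", '
    '<https://api.github.com/repositories/614765452/stargazers?page=1334>; rel="last"'
)
last_page = last_page_from_link_header(header)
print(last_page)       # 1334
print(last_page * 30)  # 40020, assuming the default of 30 items per page
```

1334 pages of 30 items is about 40000 stargazers, which matches the 400 pages of 100 items seen in the log output earlier in this thread.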

One could think that this is a great reason to run github-repo-stats before reaching 40000 stargazers, so that one can build the full timeseries via incremental updates, always getting newer data while the old stargazers move out of the visible time window. However, it seems the visible window is fixed to the first 40000 stargazers; newer ones are always in the blind spot. This is a super quick analysis; maybe I have missed something here.

@ntindle

ntindle commented Sep 20, 2023

I’ve pinged Swifty to take a look here. The proactivity is much appreciated.

@Swiftyos

Oh hey @jgehrcke, thank you.
I had picked 90 mins for today, hoping it would follow on from where it left off. I've changed it back to daily at 23h now.

Is there a way then we can use the summary stat (total stargazers) and build a history from now?

@jgehrcke
Owner Author

jgehrcke commented Sep 20, 2023

Thanks for the kind words @ntindle @Swiftyos!

> Is there a way then we can use the summary stat (total stargazers) and build a history from now?

Yes! Example:

curl -L -v -H "Accept: application/vnd.github+json" -H "Authorization: Bearer $GITHUB_API_TOKEN" https://api.github.com/repos/Significant-Gravitas/Auto-GPT
...
  "stargazers_count": 148984,
  "watchers_count": 148984,
...
  "forks_count": 32362,
...

I will look into adding this to github-repo-stats. This is actually kind of good news -- another good reason for running github-repo-stats periodically :).
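A rough sketch of how such periodic snapshots could be recorded, not the actual github-repo-stats implementation: fetch the repository object, read `stargazers_count`, and append a timestamped row to a CSV file. The function names, the CSV column names, and the file layout are assumptions made for illustration.

```python
import csv
import json
import urllib.request
from datetime import datetime, timezone
from pathlib import Path


def fetch_repo_json(owner_repo: str, token: str) -> dict:
    """GET /repos/{owner}/{repo}, mirroring the curl call above."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner_repo}",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def append_snapshot(repo_json: dict, csv_path: str, now=None):
    """Append one (time_utc, stargazers_count) row; write a header if the file is new.

    `now` is expected to be a timezone-aware UTC datetime (defaults to the current time).
    """
    ts = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    count = int(repo_json["stargazers_count"])
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["time_utc", "stargazers_count"])
        writer.writerow([ts, count])
    return ts, count
```

Run once per day, e.g. `append_snapshot(fetch_repo_json("Significant-Gravitas/Auto-GPT", token), "stargazers_snapshots.csv")`, this builds a coarse timeseries from now on that is not affected by the pagination cap.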

Btw, another quote I found on the limitation being baked into the GitHub API(s):

> There is a limit (i.e., 400) for pagination in Github APIs. In the past, when pulling information from Github projects, nobody reached this limit because the number of records that are being pulled (e.g., stars in your question, or issue events in this post) did not reach the 40,000 (i.e., 40 [sic: 400] times 100) limit.
> Nowadays, some projects (like twbs/bootstrap or rails/rails) are grown too much and the current pagination cannot pull the full information, and as of now, I don't see any mechanism that solves this issue.

(https://stackoverflow.com/a/40871568)
