[FEATURES] add additional information for a package #187

Open
6 tasks
manekinekko opened this issue Apr 24, 2018 · 13 comments

Comments

@manekinekko

Hi guys. I've got requests for some new additions to the npm-search index. Let's discuss them here:

  • open issues on GitHub: How many open issues are there.
  • number of GitHub stars: How many people starred this project.
  • number of GitHub forks: How many people forked this project.
  • open PRs on GitHub: How many open PRs are there.
  • last commit on GitHub: When was the last commit on master.
  • add any other useful information (here are all the properties returned by GitHub's API)

Does this make sense?

@Haroenv
Collaborator

Haroenv commented Apr 24, 2018

It would make sense to also query GitHub for those for which we have the data. On the Yarn website we do this on the frontend once the detail page has been requested. For requesting GitHub data we'll need to have some API key rotation like npms does: source.

It would also make the replication slower, but that would be fine IMO; it's just the API usage limits that I haven't been able to overcome so far.
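
For illustration, a minimal sketch of what such key rotation could look like. This is not how npms implements it; the `KeyRotator` class, its fields, and the default of 5,000 requests per key per window (GitHub's authenticated REST limit per hour) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class KeyRotator:
    """Hypothetical helper: hand out API tokens until each one's quota is used up."""

    tokens: List[str]
    limit_per_window: int = 5000  # assumed per-key quota for one rate-limit window
    used: Dict[str, int] = field(default_factory=dict)

    def acquire(self) -> Optional[str]:
        """Return a token with quota left, or None when every token is exhausted."""
        for token in self.tokens:
            if self.used.get(token, 0) < self.limit_per_window:
                self.used[token] = self.used.get(token, 0) + 1
                return token
        return None

    def reset(self) -> None:
        """Call when the rate-limit window rolls over."""
        self.used.clear()


# Two placeholder tokens with a tiny quota, to show the hand-over between keys.
rotator = KeyRotator(tokens=["token-a", "token-b"], limit_per_window=2)
handed_out = [rotator.acquire() for _ in range(5)]
```

Once `acquire()` returns `None`, the caller would have to wait for the window to reset before continuing.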

@MartinKolarik
Collaborator

MartinKolarik commented Apr 24, 2018

One thing that I would also love to add is jsDelivr hits in the last month. Right now the search results are sorted by npm downloads, which is great for Node.js/backend packages, but doesn't work that well for browser/frontend packages, especially those that recommend a CDN as the primary installation option.

It would be nice if we could either combine those numbers somehow, or simply have an option for sorting by CDN hits rather than npm downloads. Implementation should be as easy as using either this or this endpoint of our API.

@Haroenv
Collaborator

Haroenv commented Apr 24, 2018

Is it possible to get the monthly downloads in batches, like this but for a specified set of 100/200 packages at a time, so we can look them all up?

@MartinKolarik
Collaborator

There's ?page, so you can just do ?page=1, ?page=2, ... until you get no results.
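
A minimal sketch of that pagination loop. The endpoint and its response shape are not shown in this thread, so `fetch_page` is a hypothetical callable that returns the list of results for one page, and the sample data is made up.

```python
from typing import Callable, Dict, Iterator, List


def iter_all_pages(fetch_page: Callable[[int], List[Dict]]) -> Iterator[Dict]:
    """Yield every result by requesting page 1, 2, ... until a page comes back empty."""
    page = 1
    while True:
        results = fetch_page(page)
        if not results:  # an empty page means we have walked past the last one
            return
        yield from results
        page += 1


# Stand-in for the HTTP call so the sketch runs without network access.
_FAKE_PAGES = {
    1: [{"name": "jquery", "hits": 300}, {"name": "lodash", "hits": 200}],
    2: [{"name": "vue", "hits": 100}],
}


def fake_fetch_page(page: int) -> List[Dict]:
    return _FAKE_PAGES.get(page, [])


all_stats = list(iter_all_pages(fake_fetch_page))
```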

@Haroenv
Collaborator

Haroenv commented Apr 24, 2018

I won't have time to add this for now, but feel free to contribute or contact me if I can help. It could indeed be possible to augment the index like that, although the flow we currently have is:

  1. get all packages in batches
  2. loop over a batch of packages to get more info or get more info from the whole batch

So we'd need to control which packages would be in the batch (maybe something that can be added to your API first?).

@vvo
Contributor

vvo commented Apr 24, 2018

Those are valid concerns; adding the jsDelivr downloads seems "easy". As for the GitHub API requests, that's a bit more tricky because of what @Haroenv said. Then you also have the freshness issue: we currently rebuild the full index every week. If we provide things like the number of open issues, you might want the data to be a little fresher. That might require a bit more work to optimise the data pipeline (today it takes one day to completely rebuild the index; I am sure we can bring that down, but we never investigated it much).

Still, if you already have ideas on how to do it well, please do contribute, make it faster, anything :)

@MartinKolarik
Collaborator

Since the data is rather compact (just package name and one number), I think we could get everything at once and have it stored in memory during the indexing process. Our API currently computes stats for all packages every time (which is why it is that slow), even though it only gives you 100 results at once, so it would make more sense if we removed that limit and you'd be able to get all numbers in one request (we're talking about 2.5 MB of data per 100k packages and currently we have ~20k packages).
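
A minimal sketch of the in-memory approach, assuming the stats arrive as a list of name/number pairs. The field names `name`, `hits`, and `jsDelivrHits` are hypothetical, as is the choice to default missing packages to 0. At the quoted 2.5 MB per 100k packages, the current ~20k packages would come to roughly 0.5 MB.

```python
from typing import Dict, Iterable, List


def build_hits_index(stats: Iterable[Dict]) -> Dict[str, int]:
    """Load all (package name, hit count) pairs into a dict once, before indexing."""
    return {entry["name"]: entry["hits"] for entry in stats}


def augment_batch(batch: List[Dict], hits_by_name: Dict[str, int]) -> List[Dict]:
    """Attach the hit count to each package of a batch; unknown packages get 0."""
    return [
        {**pkg, "jsDelivrHits": hits_by_name.get(pkg["name"], 0)}
        for pkg in batch
    ]


# Made-up sample data standing in for the API response and one indexing batch.
hits_by_name = build_hits_index([
    {"name": "jquery", "hits": 300},
    {"name": "vue", "hits": 100},
])
augmented = augment_batch([{"name": "vue"}, {"name": "left-pad"}], hits_by_name)
```

Because each lookup is a plain dict access, the batch loop never has to wait on an extra network call per package.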

@Haroenv
Collaborator

Haroenv commented Apr 24, 2018

Seems possible, feel free to do a PR @MartinKolarik :)

@MartinKolarik
Collaborator

@Haroenv 👍 unfortunately I don't have the time right now either but hopefully later...

@manekinekko
Author

@vvo @Haroenv as for the open PRs and issues, I think it makes more sense to do it client side, indeed. I'll figure out a way to do it on https://ngx.tools.

When it comes to augmenting the npm-search index with additional information from GitHub, I'll throw some "random" ideas here (I might be wrong in some aspects):

  • use authenticated REST calls in order to increase the rate limit from 60 to 5k req/hour (as Haroen mentioned).
  • partition the dataset (NPM packages) and periodically query GitHub for a subset of packages.
  • we could query GitHub info for popular packages first.
  • store the result in an intermediate database.
  • during the (re)indexing process, use that intermediate (cache) database to read and augment the index with GitHub's info, without slowing down the indexing process.
  • NOTE: depending on how the indexing process is done, we can skip the intermediate database and write the GitHub info directly to the index database.
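
The partitioning and cache ideas above could be sketched roughly as follows. A plain dict stands in for the intermediate database, and the function names and the `downloads` and `github` fields are hypothetical.

```python
from typing import Dict, Iterator, List


def popular_first_batches(packages: List[Dict], batch_size: int) -> Iterator[List[str]]:
    """Split package names into batches, most-downloaded packages first."""
    ordered = sorted(packages, key=lambda p: p["downloads"], reverse=True)
    for start in range(0, len(ordered), batch_size):
        yield [p["name"] for p in ordered[start:start + batch_size]]


def augment_from_cache(record: Dict, cache: Dict[str, Dict]) -> Dict:
    """During (re)indexing, read from the cache only; never call GitHub here."""
    info = cache.get(record["name"])
    return {**record, "github": info} if info is not None else record


# Made-up sample data.
packages = [
    {"name": "a", "downloads": 1},
    {"name": "b", "downloads": 10},
    {"name": "c", "downloads": 5},
]
batches = list(popular_first_batches(packages, batch_size=2))

cache = {"b": {"stars": 42}}  # stand-in for the intermediate database
indexed = [augment_from_cache({"name": name}, cache) for name in ("a", "b")]
```

A periodic job would walk the batches and fill the cache, while the indexer only reads from it, so a slow or rate-limited GitHub crawl does not hold up indexing.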

NPM has ≈ 700k packages. So, 700k / 5k = 140 hours ≈ 6 days. It'd take 6 days to process the 700k packages using one GitHub API key, at 5k packages per hour. We could improve on this by using 6 GitHub API keys and do it in 1 day. Right? We could even run these calls inside Cloud Functions and not pay for the infrastructure (lots of cloud providers have a free tier; GCP offers 2 million free calls / month).
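
The estimate above can be checked with a few lines of arithmetic. The sketch assumes one request per package and that every package maps to a GitHub repository; neither is guaranteed, and the function names are made up for the example.

```python
import math


def hours_to_crawl(packages: int, requests_per_hour_per_key: int,
                   keys: int = 1, requests_per_package: int = 1) -> float:
    """Hours needed to query every package once, given the per-key rate limit."""
    return packages * requests_per_package / (requests_per_hour_per_key * keys)


def keys_needed(packages: int, requests_per_hour_per_key: int,
                deadline_hours: float, requests_per_package: int = 1) -> int:
    """Smallest number of keys that finishes the crawl within the deadline."""
    total_requests = packages * requests_per_package
    return math.ceil(total_requests / (requests_per_hour_per_key * deadline_hours))


one_key_hours = hours_to_crawl(700_000, 5_000)           # 140.0 hours, just under 6 days
six_key_hours = hours_to_crawl(700_000, 5_000, keys=6)   # about 23.3 hours
keys_for_one_day = keys_needed(700_000, 5_000, deadline_hours=24)  # 6
```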

Alternatively, we could use GraphQL to query github, since one GraphQL call can replace multiple REST calls. A single complex GraphQL call could be the equivalent of thousands of REST requests.

I'm sure we'll figure it out ^_^

@vvo
Contributor

vvo commented Apr 25, 2018

Thanks for the nice architecture thoughts!

> Alternatively, we could use GraphQL to query github, since one GraphQL call can replace multiple REST calls. A single complex GraphQL call could be the equivalent of thousands of REST requests.

I did try out the GitHub GraphQL API, a bit strange at the beginning (I had to understand the actual GraphQL language), but afterwards feels super nice.

Example: https://github.com/vvo/zorgs/blob/master/src/zorgs/src/queries/repositoriesWithCommits.js

@manekinekko
Author

Very nice. In the same fashion, we could get the number of stars and issues like so:

{
  search(type: REPOSITORY, query: "user:angular", first: 3) {
    edges {
      node {
        ... on Repository {
          name
          stargazers {
            totalCount
          }
          forkCount
          issues {
            totalCount
          }
        }
      }
    }
  }
}

Which would give us:

{
  "data": {
    "search": {
      "edges": [
        {
          "node": {
            "name": "angular.js",
            "stargazers": {
              "totalCount": 58360
            },
            "forkCount": 28909,
            "issues": {
              "totalCount": 8785
            }
          }
        },
        {
          "node": {
            "name": "angular",
            "stargazers": {
              "totalCount": 35458
            },
            "forkCount": 8634,
            "issues": {
              "totalCount": 14507
            }
          }
        },
        {
          "node": {
            "name": "angular-cli",
            "stargazers": {
              "totalCount": 17163
            },
            "forkCount": 3753,
            "issues": {
              "totalCount": 7696
            }
          }
        }
      ]
    }
  }
}
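
A small sketch of how a response of this shape could be flattened into per-repository records before it is written to the index. The output keys `stars`, `forks`, and `issues` are hypothetical, and the sample is the first two nodes of the response above.

```python
from typing import Dict, List


def flatten_search_response(response: Dict) -> List[Dict]:
    """Turn the nested edges/node structure into one flat dict per repository."""
    rows = []
    for edge in response["data"]["search"]["edges"]:
        node = edge["node"]
        rows.append({
            "name": node["name"],
            "stars": node["stargazers"]["totalCount"],
            "forks": node["forkCount"],
            "issues": node["issues"]["totalCount"],
        })
    return rows


sample = {"data": {"search": {"edges": [
    {"node": {"name": "angular.js", "stargazers": {"totalCount": 58360},
              "forkCount": 28909, "issues": {"totalCount": 8785}}},
    {"node": {"name": "angular", "stargazers": {"totalCount": 35458},
              "forkCount": 8634, "issues": {"totalCount": 14507}}},
]}}}

rows = flatten_search_response(sample)
```

Note that `issues { totalCount }` without a filter counts open and closed issues together; to count only open issues, the query would need a state filter such as `issues(states: OPEN)`.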

@Haroenv
Collaborator

Haroenv commented Apr 25, 2018

Seems useful! It would be nice to try this out and see if we can get it merged; it could be added in github.js or another file. Ideally we would be able to get the issues, forks and stargazers of multiple (100/200) repositories at the same time.
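
One possible way to ask for several repositories in a single request is to give each `repository(owner:, name:)` lookup its own GraphQL alias. The sketch below only builds the query string; the `build_batch_query` helper and the `r0`, `r1`, ... alias scheme are assumptions, and whether 100/200 repositories fit within GitHub's query limits has not been verified here.

```python
import json
from typing import List, Tuple

# Fields requested for every repository; issues are limited to open ones.
FIELDS = "stargazers { totalCount } forkCount issues(states: OPEN) { totalCount }"


def build_batch_query(repos: List[Tuple[str, str]]) -> str:
    """Build one GraphQL query with an aliased repository lookup per (owner, name)."""
    parts = []
    for i, (owner, name) in enumerate(repos):
        # json.dumps yields a double-quoted, escaped string literal
        parts.append(
            f"  r{i}: repository(owner: {json.dumps(owner)}, "
            f"name: {json.dumps(name)}) {{ {FIELDS} }}"
        )
    return "{\n" + "\n".join(parts) + "\n}"


query = build_batch_query([("angular", "angular"), ("angular", "angular-cli")])
```

The response would then carry one entry per alias under `data`, which makes it straightforward to map results back to the packages of a batch.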
