
Fix timeouts and re-indexing issues #7

Open · jonchurch opened this issue Jan 27, 2020 · 10 comments

@jonchurch (Member)

@wesleytodd Per your comment expressjs/discussions#92 (comment):

> It is not, I had to turn off the indexing because it was timing out the GH action. I will take a look at manually re-running it this week and see if it still times out. I brought this up in the last package maintenance WG call, but I will mention it here as well: if anyone is interested in fixing the @pkgjs/statusboard so that it does not re-index all projects at once, that would be awesome. If you are interested in doing this work let me know and we can discuss the ways to do it.

Let's start a discussion about this here. I have cruised through this codebase and run it locally a few times, so I have some context as to how it works.

If I can't find the time to help, at least we'll have some record here of what should be done or is happening.

@wesleytodd (Member)

So the issue was that indexing all of the repos takes a while; in my tests it is around 12 minutes. There are a ton of perf improvements (like parallelizing some of the calls) which could help, but IMO there is one easier and more important change: it just shouldn't index all projects on every run.

Things in most of these repos don't change that often, so if we kept a flag for the last time each project was successfully indexed and capped the number of projects indexed per run at around 5, we would never hit the timeout. Since the action runs every 2 hours, that is 5 × 12 = 60 projects a day, so every project would still get re-indexed in under 24 hours.

Adding this would be as easy as writing a marker to the leveldb when a repo finishes indexing, then using that marker to pick the 5 oldest entries instead of iterating the whole list.
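A minimal sketch of that marker idea (hypothetical key names, assuming the `level` package with JSON value encoding; not the repo's actual schema):

```js
const level = require('level')
const db = level('./statusboard-db', { valueEncoding: 'json' })

// Call when a repo's index run completes successfully.
async function markIndexed (repoFullName) {
  await db.put(`lastIndexed:${repoFullName}`, Date.now())
}

// Pick the `limit` least recently indexed repos from the config list.
// A repo with no marker gets ts = 0, so it sorts to the front.
async function pickStalest (repoFullNames, limit = 5) {
  const withTimes = await Promise.all(repoFullNames.map(async (repo) => {
    let ts = 0
    try {
      ts = await db.get(`lastIndexed:${repo}`)
    } catch (err) {
      if (!err.notFound) throw err // a missing key just means "never indexed"
    }
    return { repo, ts }
  }))
  return withTimes
    .sort((a, b) => a.ts - b.ts)
    .slice(0, limit)
    .map(({ repo }) => repo)
}
```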

@jonchurch (Member Author) commented Jan 28, 2020

AFK, but I've been thinking about this.

I haven't benchmarked anything yet, so I don't know if this would matter, but for v1 would it make sense to cut out the travis.yaml, package.json, and requests to NPM?

The assumption here being that v1 focuses on getting issues from GitHub to populate the statusboard.

I'm currently reading some leveldb docs and the associated code in the repo to implement the lastIndexed value for projects 👍

An aside, but I previously worked on writing GraphQL queries to get the issues and labels. I have the queries committed on a branch of my fork, but (I'm assuming) some repos will still need to paginate past the 100-issue max that the GitHub GraphQL API limits you to.

@wesleytodd (Member)

> cut out the travis.yaml, package.json, and requests to NPM?

Yeah, it would make sense, and those are what I was talking about when I said "There are a ton of perf improvements". But in the end the GitHub data is where the most time is spent. By all means cut those calls out for now, but I am not sure that alone will get it reliably under time.

> I'm currently reading some leveldb docs and the associated code in the repo to implement the lastIndexed value for projects 👍

Awesome! Let me know if you have questions.

> paginate past the 100-issue max that the GitHub GraphQL API limits you to

This is why I used the REST API; it was just simpler and well documented. If we can use the GraphQL API then great!

@jonchurch (Member Author) commented Jan 29, 2020

The GraphQL API isn't super fun to use, but I was able to write some scratch code tonight that grabbed all the open issues across the 3 orgs in around 7 seconds when requesting the orgs sequentially.

There are only two repos across all the orgs that even have more than 100 open issues (multer and express), and 100 is the most you can grab from a repo in a single trip.

Here's the gist; I ran this from the statusboard project (it requires a GITHUB_TOKEN set in the env, and node-fetch).
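The shape of that scratch code was roughly as follows (a reconstruction, not the gist itself; the field selection is illustrative). It follows `endCursor` to paginate past the 100-per-page cap:

```js
const fetch = require('node-fetch')

// Cursor-paginated query for a repo's open issues via the GitHub GraphQL API.
const QUERY = `
  query ($owner: String!, $name: String!, $cursor: String) {
    repository(owner: $owner, name: $name) {
      issues(first: 100, states: OPEN, after: $cursor) {
        pageInfo { hasNextPage endCursor }
        nodes { number title labels(first: 10) { nodes { name } } }
      }
    }
  }`

async function fetchOpenIssues (owner, name) {
  const issues = []
  let cursor = null
  do {
    const res = await fetch('https://api.github.com/graphql', {
      method: 'POST',
      headers: {
        Authorization: `bearer ${process.env.GITHUB_TOKEN}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ query: QUERY, variables: { owner, name, cursor } })
    })
    const { data } = await res.json()
    const page = data.repository.issues
    issues.push(...page.nodes)
    cursor = page.pageInfo.hasNextPage ? page.pageInfo.endCursor : null
  } while (cursor)
  return issues
}
```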

I looked at the leveldb implementation for a while too, and I'm not ruling that out, but I wanted to revisit the GraphQL approach. I realized there would need to be some logic in place to handle a fresh build, or when a new repo is added to an org: if you're querying the db for the least recently updated project, that assumes all projects have already been indexed. So some logic would have to be written to check the db for a project's existence, and prioritize that one if a record wasn't found.

@wesleytodd (Member)

> The GraphQL API isn't super fun to use, but I was able to write some scratch code tonight that grabbed all the open issues across the 3 orgs in around 7 seconds when requesting the orgs sequentially.

The existing implementation also pulls closed issues; you will need to do this to track activity between indexes, since an issue can be opened and resolved very quickly.

> So some logic would have to be written to check the db for a project's existence, and prioritize that one if a record wasn't found.

Yep, the config file would be the source of truth, and projects with no last-indexed marker would sort to the top.

@jonchurch (Member Author) commented Apr 12, 2020

I'm thinking now that the biggest perf gains would be in running some of this code concurrently.

Currently the control flow iterates over each org and repo, doing work sequentially and essentially blocking until each repo is completed, then moving on to start over with the next org. I'm not sure how to switch to concurrency with the async generators.

I think this could build out a statusboard for all three express orgs in as little as a few minutes if we were able to run loadProjects concurrently for all repos in an org. Most repos currently take less than 30 seconds to process on their own.
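One way that might look (just a sketch against the names in the existing code, untested): drain each repo's loadProject generator concurrently, buffering the events, then yield them in order.

```js
// Sketch only: run loadProject for every repo in an org in parallel.
// Trades the streaming behavior for concurrent network calls, since
// each repo's events sit buffered in memory until its fetches finish.
async function * iterateOrgConcurrently (octokit, graphQL, config, repos) {
  const buffered = await Promise.all(repos.map(async (repo) => {
    const proj = new Project({ repoOwner: repo.owner, repoName: repo.name })
    const events = []
    for await (const evt of loadProject(octokit, graphQL, proj, config, repo)) {
      events.push(evt)
    }
    return events
  }))
  for (const events of buffered) {
    yield * events
  }
}
```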

I'm probably overthinking things; I'll come back and look at the code with fresh eyes on Monday.

To benchmark loadProject for each repo I made the following additions locally in lib/db/build-index.js:

```diff
 async function * iterateProjects (config) {
   // ...
   try {
     for await (let repo of github.getOrgRepos(graphQL, org.name)) {
+      const start = Date.now()
       const proj = new Project({
         repoOwner: repo.owner,
         repoName: repo.name
       })
       for await (let evt of loadProject(octokit, graphQL, proj, config, repo)) {
         yield evt
       }
+      console.log(`====FINISHED PROCESSING ${repo.name} in ${(Date.now() - start) / 1000} seconds`)
     }
   } catch (e) {
     yield projectDetail('ERROR', org, e)
   }
 }
```

@wesleytodd (Member)

Funny enough, I had to do something almost exactly like this at work, but on a much larger scale (around 8k projects), and it got me thinking about the next steps here. There is a very clear limit, the per-process memory limit, at which doing things in parallel breaks, and I think the express version of this would exceed that threshold. Up until that limit you are absolutely right: it greatly speeds things up.

FWIW, storing in leveldb is what I used to get past this limit here, and it is also what I did for my task at work. The downside in both cases is that the best way past the memory limit is to go multi-thread or multi-process, but leveldb is neither of these. That brings me to my next plan, which is to split into multiple leveldb instances, sharded by repo, to enable multi-process indexing.
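A sketch of the repo-based sharding (shard count and paths are made up for illustration):

```js
const level = require('level')
const { createHash } = require('crypto')

const SHARDS = 4
const dbs = new Map()

// Hash the repo name to one of N leveldb instances, so each shard
// can be owned by a separate worker process.
function dbForRepo (repoFullName) {
  const shard = createHash('sha1').update(repoFullName).digest()[0] % SHARDS
  if (!dbs.has(shard)) {
    dbs.set(shard, level(`./db/shard-${shard}`, { valueEncoding: 'json' }))
  }
  return dbs.get(shard)
}
```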

One other option would be to write the JSON to disk directly and use worker threads for the parallelism (again sharding by repo). The downside there is managing the writes to disk in a resilient and efficient way; this is more complicated than it seems, and leveldb does it well.

On a side note, check out this API for this kind of measurement: https://nodejs.org/api/perf_hooks.html
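For example, the Date.now() additions above could become marks and measures (a minimal perf_hooks example):

```js
const { performance, PerformanceObserver } = require('perf_hooks')

// Log every completed measure as it is recorded.
const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.log(`${entry.name}: ${(entry.duration / 1000).toFixed(2)}s`)
  }
})
obs.observe({ entryTypes: ['measure'] })

performance.mark('repo-start')
// ... index the repo ...
performance.mark('repo-end')
performance.measure('index repo', 'repo-start', 'repo-end')
```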

@jonchurch (Member Author) commented Apr 20, 2020

I wrote a comment last week that I never sent because it was too long, lol. Here's an abbreviated version.

I realized that the shortest path with the biggest speed gain right now was to parallelize the network requests on a per-org basis.

What I coded was ugly, imo, but kicking off all the network requests for an org at once got the three express orgs down to a build time of 4 minutes and change (total). I don't know how close I got to the memory limit, which is an important point.

I fetched the data for each repo and then handed it off to the generators for iteration.

@jonchurch (Member Author)

That said, with the GraphQL work it's down to 11 minutes or so; if it doesn't time out in a GitHub action, we're probably good to go on our MVP.

@wesleytodd (Member)

Yeah, that sounds good. The goal is to get it functional; no one cares as long as the information is in front of people! I did a review of #10 yesterday, so once we get that merged we can update https://github.com/expressjs/statusboard and merge that initial work PR.
