docs: extend troubleshooting for very large repositories #329

Merged
merged 2 commits into from
May 21, 2024

Contributor

@bahrmichael bahrmichael commented May 16, 2024

For sourcegraph/sourcegraph#62295

This PR updates the documentation with more tips for very large repositories.

Code Insights can run for a while and then tell the user that some data points are incomplete. This probably happens when queries against very large repositories cannot be computed reasonably fast.

In addition to this documentation update I'm working on giving users more information about which repositories lead to incomplete datapoints: sourcegraph/sourcegraph#62578


@sourcegraph/search-platform I poked a bit at the search backend when gathering this info, and would like to get your input if it's accurate, and if there may be other improvements to make complex queries run faster on very large repos :)

@mike-r-mclaughlin Could you review if this new info would be helpful for customers? I'm planning to expose the repositories that caused incomplete datapoints with sourcegraph/sourcegraph#62578. Then a customer can see which repository didn't compute, pick that one, optimize the query, and then run the big Code Insight again.

vercel bot commented May 16, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| sourcegraph-docs-v2 | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | May 17, 2024 8:35am |

@bahrmichael bahrmichael requested review from a team and mike-r-mclaughlin May 16, 2024 08:57
Member

@jtibshirani jtibshirani left a comment

Looks good to me, just left some minor suggestions.

@@ -39,3 +39,6 @@ next-env.d.ts

# search index file generated on build
/public/search.json

# IDEs
.idea
Member

I see you also have great taste in IDEs 😊


You can use Code Search to test the query against a particular timestamp in a given repository.

Since Code Insights computes twelve data points in the given time range,
Member

Unindexed search usually takes longer the further back you go in history. For older commits, more files differ from HEAD, so searcher needs to perform more brute-force file searches.

Could we suggest a specific time to target, like a worst-case? I'm not sure how far back Code Insights goes by default. We could even suggest rev:at.time(...) as a convenience.

Contributor Author

Thank you! I was hoping for some info like that, since I had a suspicion about older commits taking longer to search. I'll update some text above to work your suggestion in. Let me know if that sounds good :)
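To make the worst-case suggestion concrete, here is a minimal sketch of how one might pick the oldest of twelve sample points and build a Code Search query pinned to it with `rev:at.time(...)`. The even spacing, the helper names `sample_points` and `query_at`, the repository name, and the ISO date passed to `rev:at.time` are all assumptions for illustration; how Code Insights actually spaces its data points may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def sample_points(start: datetime, end: datetime, count: int = 12) -> List[datetime]:
    """Return `count` evenly spaced timestamps from start to end.

    Assumption: even spacing is only an approximation of how
    Code Insights distributes its data points.
    """
    if count < 2:
        raise ValueError("count must be at least 2")
    step = (end - start) / (count - 1)
    return [start + i * step for i in range(count)]


def query_at(repo: str, pattern: str, when: datetime) -> str:
    """Build a query string pinned to the revision at a point in time.

    Assumption: rev:at.time(...) accepts a YYYY-MM-DD date.
    """
    return f"repo:{repo} rev:at.time({when:%Y-%m-%d}) {pattern}"


# Hypothetical one-year range ending on the day this PR was opened.
end = datetime(2024, 5, 16, tzinfo=timezone.utc)
start = end - timedelta(days=365)
points = sample_points(start, end)

# The oldest point is the likely worst case for unindexed search.
worst_case = query_at(
    "github.com/example/big-repo",  # hypothetical repository
    "my_library file:package.json",
    points[0],
)
print(worst_case)
```

If the query at the oldest point finishes quickly in Code Search, the more recent points are likely to compute as well.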


For example, if you want to track the version of an npm dependency in your code base, searching for `my_library file:package.json` will compute much faster because there are fewer files to look at and fewer results to return.

We recommend that you make your query as precise as possible (and even omit results that may be relevant) until you reach a query that computes fast enough.
Member

Tiny suggestions:

  • Great that you mentioned file filters, maybe we could mention lang too
  • We could also mention the importance of quotes "..." if your search string contains whitespace
  • Maybe we shouldn't say "and even omit results that may be relevant" since we do really want these queries to be relevant :)

Contributor Author

Great ideas! I wasn't so sure about the "omit" part. Dropped that now. I've added some more tips, but added a disclaimer to the lang filter. From what I've seen in Language Stats Insights, this filter needs to load the file content and read it, and can therefore be a bit slower.

Member

The search language filter is implemented differently, and tends to run very quickly. (If you're interested in the technical details, the search lang filter first consults the file name, and avoids loading and analyzing content in the vast majority of cases.)
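The tips from this thread (file filters, the lang filter, and quoting search strings that contain whitespace) could be combined roughly as follows. This is only a sketch: `quote_term` and `build_query` are hypothetical helpers, and the backslash escaping of embedded quotes is an assumption rather than a statement about how the search backend parses queries.

```python
from typing import Optional


def quote_term(term: str) -> str:
    """Wrap a search string in double quotes when it contains whitespace."""
    if any(ch.isspace() for ch in term):
        # Assumption: backslash-escaping is acceptable inside a quoted string.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return term


def build_query(term: str, file: Optional[str] = None, lang: Optional[str] = None) -> str:
    """Combine a search string with optional file: and lang: filters."""
    parts = [quote_term(term)]
    if file:
        parts.append(f"file:{file}")
    if lang:
        parts.append(f"lang:{lang}")
    return " ".join(parts)


print(build_query("my_library", file="package.json"))
print(build_query("import my_library", lang="typescript"))
```

Narrowing by file or language keeps the query relevant while reducing the number of files that need to be searched.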

@bahrmichael bahrmichael merged commit 1c1b9a8 into main May 21, 2024
5 checks passed
@bahrmichael bahrmichael deleted the bahrmichael/2024-05-16-large-repos-2 branch May 21, 2024 09:05
3 participants