Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: incomplete datapoints can now resolve the affected repositories #62756

Merged
merged 13 commits into from
May 27, 2024

Conversation

bahrmichael
Copy link
Contributor

@bahrmichael bahrmichael commented May 17, 2024

Closes #62578

For #62295

Previously with our GraphQL api, you couldn't figure out which repositories caused incomplete datapoints. With this change you can now provide an argument to the incompleteDatapoints to not aggregate points for repositories, and then resolve the repositories for each datapoint.

This PR is needed to help debug incomplete datapoints in Code Insights. When customers create Code Insights for a large number of repositories, it's hard to understand how big the impact of incomplete datapoints is, and which repositories those issues are coming from. If you don't have access to the logs it's basically impossible to isolate problematic repositories.

Queries work as before, when you don't add the aggregateRepositories=false parameter or resolve the repository.
Screenshot 2024-05-24 at 12 22 58

When you add the aggregateRepositories=false parameter and resolve the repository, you get individual datapoints for each repository that had a problem.

Screenshot 2024-05-24 at 12 22 34

If you set aggregateRepositories=true and attempt to resolve the repository, it will be null.
Screenshot 2024-05-24 at 12 22 43

Test plan

  • Existing code paths are covered by CI
  • I will add more tests if this approach is accepted

@bahrmichael
Copy link
Contributor Author

The license check is a known issue: https://sourcegraph.slack.com/archives/C04MYFW01NV/p1715937672950199

By default, incomplete datapoints are aggregated across all repositories.
Setting this to false will allow resolving the repository.
"""
aggregateRepositories: Boolean = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: now that repositories is an array, do we still need this parameter? If a client doesn't care about the repository list, they can just exclude that from the list of fields in their query. Excluding this also removes the (documented but still maybe surprising) dependency between the repositories field and this argument

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Thank you for the reminder. I was able to clean it up, and things seem to work as expected. Since I can't find any problematic to the store method, it should be good as long as CI passes.

Comment on lines +880 to +882
if repoId.Valid {
mappedRepoIds[i] = int(repoId.Int64)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: the DB schema says repo_id is nullable, but that's kinda surprising to me. Do you understand why that is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found #45282 which inserts null here and mentions global queries. The repoId and repoName should be available though, based on the types that this incomplete insert runs on. I haven't found any places where the repoId and repoName on RecordSeriesPointArgs are not set. Maybe it's to reduce the number of inserts for global queries?

In the backend documentation it sounds like there should also be repo information no matter if it's global or not.

Its job is to periodically schedule a recording of 'current' values for Insights by enqueuing a recording using a global query. This only requires a single global query per insight regardless of the number of repositories,
and will return results for all the matched repositories. Each matched repository will still be recorded individually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find!

Not blocking, just thinking out loud to try to understand this better. What is a global code insight? It kinda makes sense that a global job wouldn't have a repo ID because it's running against everything, but when would we do that? Maybe there's a special case for an insight that only runs against public repositories, so we know that all users can view all the data, and don't need to keep track of which repo the points are for?

@bahrmichael bahrmichael enabled auto-merge (squash) May 27, 2024 10:27
@bahrmichael bahrmichael changed the title feat: incomplete datapoints can now resolve the affected repository feat: incomplete datapoints can now resolve the affected repositories May 27, 2024
@bahrmichael bahrmichael merged commit ff9ef6f into main May 27, 2024
9 checks passed
@bahrmichael bahrmichael deleted the bahrmichael/62578 branch May 27, 2024 10:32
bahrmichael added a commit to sourcegraph/docs that referenced this pull request May 31, 2024
This PR updates the documentation to explain how users can use a new
GraphQL field introduced with
sourcegraph/sourcegraph#62756 to identify
repositories that cause incomplete datapoints.

For sourcegraph/sourcegraph#62295

## Pull Request approval

Although pull request approval is not enforced for this repository in
order to reduce friction, merging without a review will generate a
ticket for the docs team to review your changes. So if possible, have
your pull request approved before merging.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Code insights: Extend GraphQL schema to expose repositories that cause incomplete datapoints
2 participants