
Batch big IN/NOT IN selections #575

Merged
merged 4 commits into from Mar 17, 2020

Conversation

Contributor

@pimeys pimeys commented Mar 11, 2020

This fixes the issues with joining and fetching a large selection of records, where the database can no longer handle queries of that size.

What it does: it walks the filter, and if it finds exactly ONE scalar filter that is an IN/NOT IN with more than 5000 items, it splits that filter into batches of at most 5000 items and runs the resulting queries in parallel.

So this is going to be batched:

```graphql
query {
  findManyUser(where: { id_lt: 10000 }) {
    posts(where: { id_lt: 10000 }) {
      id
    }
  }
}
```

The batch size can be controlled via QUERY_BATCH_SIZE environment variable for test purposes.
Hopefully on Friday I can get the last query working, and also write tests for this somehow.
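The splitting described above can be sketched roughly like this. This is a hypothetical illustration: the function and constant names are made up and are not the actual prisma-engines code.

```rust
// Sketch of the batching idea; names are illustrative, not the real API.
const DEFAULT_BATCH_SIZE: usize = 5000;

/// Batch size, overridable via the QUERY_BATCH_SIZE environment variable.
fn batch_size() -> usize {
    std::env::var("QUERY_BATCH_SIZE")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(DEFAULT_BATCH_SIZE)
}

/// Split the values of an IN/NOT IN filter into chunks of at most
/// `batch_size()` items; each chunk becomes its own query, and the
/// queries then run in parallel.
fn split_in_values(values: &[i64]) -> Vec<Vec<i64>> {
    // .max(1) guards against a zero batch size, which would make chunks() panic.
    values.chunks(batch_size().max(1)).map(|c| c.to_vec()).collect()
}

fn main() {
    let ids: Vec<i64> = (0..12_000).collect();
    let batches = split_in_values(&ids);
    assert_eq!(batches.len(), 3);
    assert_eq!(batches[0].len(), 5000);
    assert_eq!(batches[2].len(), 2000);
    println!("{} batches", batches.len());
}
```

With 12,000 ids and the default size of 5000, this yields two full batches and one of 2000.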

Contributor

@tomhoule tomhoule left a comment


I'm not familiar enough with the implementation of the filters yet to say if there's high level problems with the design, but it looks good to me 💯

One super nitpicky thing: the PR description is helpful; it would be nice to have it in the commit message for future reference (sometimes I do use that :p)

Contributor


I was thinking a bit more about this: if the list here is large (which it is if we are batching), we iterate over it element by element, but if we computed the indexes for the batches up front, we could iterate over those and push whole batches at a time instead of one element at a time. The per-element version pays costs like the conditional and the last_mut lookup (with its bounds check) on every item. But that's an optimization we can do later, once we've decided we do want to do it this way and the API settles down.
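A rough sketch of the two approaches being compared; the types and function names here are illustrative, not the actual code under review:

```rust
// Per-element batching, roughly what the PR does: every value pays for a
// length check and a bounds-checked last_mut() lookup.
fn batch_per_element(values: Vec<i64>, max: usize) -> Vec<Vec<i64>> {
    let mut batches: Vec<Vec<i64>> = vec![Vec::new()];
    for v in values {
        if batches.last().map_or(false, |b| b.len() >= max) {
            batches.push(Vec::new());
        }
        batches.last_mut().unwrap().push(v);
    }
    batches
}

// Index-based batching as suggested: compute the chunk boundaries once
// and copy whole slices, avoiding the per-element bookkeeping.
fn batch_by_index(values: &[i64], max: usize) -> Vec<Vec<i64>> {
    values.chunks(max).map(|c| c.to_vec()).collect()
}

fn main() {
    let values: Vec<i64> = (0..7).collect();
    // Both versions produce the same batches for non-empty input.
    assert_eq!(batch_per_element(values.clone(), 3), batch_by_index(&values, 3));
}
```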

Contributor Author


Yeah, this isn't the final version; it's the first version that somehow works. Now we need some discussion about this. The first limitation I'd like to lift is that this doesn't work if we have more than one IN statement in the filters.

Julius de Bruijn and others added 2 commits March 16, 2020 12:24
This fixes the issues with joining and fetching a large selection
of records, where the database can no longer handle queries of that size.

What it does: it walks the filter, and if it finds exactly ONE scalar
filter that is an `IN`/`NOT IN` with more than 5000 items, it splits that
filter into batches of at most 5000 items and runs the queries in parallel.

So this is going to be batched:

```graphql
query {
  findManyUser(where: { id_lt: 10000 }) {
    posts(where: { id_lt: 10000 }) {
      id
    }
  }
}
```

The batch size can be controlled via `QUERY_BATCH_SIZE` environment
variable for test purposes.
Contributor

@dpetrick dpetrick left a comment


Apart from the discussed ordering issues, this looks alright.

Contributor Author

pimeys commented Mar 16, 2020

Addresses #579

Contributor

@tomhoule tomhoule left a comment


My only doubt is whether this will interact badly with other features and we'd miss it, because the default is 5000 and most tests don't have that many records in the database.

libs/prisma-models/src/order_by.rs
```rust
}

if let Some(ref order_by) = order {
    records.order_by(order_by)
```
Contributor


Hmm so here we actually sort ourselves right?

Contributor Author


yep

@pimeys pimeys merged commit fb04976 into master Mar 17, 2020
@pimeys pimeys deleted the batching branch March 17, 2020 11:35
Contributor Author

pimeys commented Mar 17, 2020

> My only doubt is whether this will interact badly with other features and we'd miss it, because the default is 5000 and most tests don't have that many records in the database.

We'll keep an eye on this. For now, with lots of manual testing and the integration tests, I was able to find two problems:

  • The selection values should be deduplicated, which is fixed here
  • The ordering needs to happen in the application layer (connector), which is fixed here
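The two fixes could be sketched like this. The helper names and the record type are hypothetical, not the actual connector code:

```rust
use std::collections::HashSet;

// Fix 1: deduplicate the IN selection values before batching, keeping
// the first occurrence of each value in its original position.
fn dedup_values(values: Vec<i64>) -> Vec<i64> {
    let mut seen = HashSet::new();
    values.into_iter().filter(|v| seen.insert(*v)).collect()
}

// Fix 2: once the parallel batch results are merged, the database's
// ORDER BY no longer covers the combined set, so the application layer
// (the connector) sorts the merged records itself.
fn order_merged_records(records: &mut Vec<(i64, String)>) {
    records.sort_by_key(|(id, _)| *id);
}

fn main() {
    assert_eq!(dedup_values(vec![3, 1, 3, 2, 1]), vec![3, 1, 2]);

    let mut recs = vec![(2, "b".to_string()), (1, "a".to_string())];
    order_merged_records(&mut recs);
    assert_eq!(recs[0].0, 1);
}
```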
