Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize and batch merge pipelines list splits queries #4968

Merged
merged 1 commit into from May 15, 2024

Conversation

guilload
Copy link
Member

@guilload guilload commented May 9, 2024

Description

  • Add node_id column to splits table
  • Populate node_id column from splits.split_medata_json column
  • Add hash index on splits.node_id column
  • Batch list splits query when new spawning new indexing pipelines

How was this PR tested?

  • Updated unit tests
  • Ran this SQL query on Cicada which did not return null or empty values

@guilload guilload force-pushed the guilload/optimize-merge-pipeline-list-splits-queries branch 6 times, most recently from 46b4b27 to 6ef3f69 Compare May 10, 2024 19:58
@guilload guilload marked this pull request as ready for review May 10, 2024 20:02
@guilload guilload force-pushed the guilload/optimize-merge-pipeline-list-splits-queries branch from 6ef3f69 to f368b10 Compare May 10, 2024 20:11
@guilload guilload requested a review from fulmicoton May 10, 2024 20:17
@guilload
Copy link
Member Author

@fulmicoton, I still have to add at least three unit tests in indexing_service.rs and merge_pipeline.rs, but this PR is ready for review.

@guilload guilload force-pushed the guilload/optimize-merge-pipeline-list-splits-queries branch 4 times, most recently from 6d4e9b4 to d5e6233 Compare May 13, 2024 22:13
pub fn new(params: MergePipelineParams, spawn_ctx: &SpawnContext) -> Self {
pub fn new(
params: MergePipelineParams,
initial_immature_splits_opt: Option<Vec<SplitMetadata>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a case where we want to call this with None? What is it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I covered the spawn_pipelines branch, which is the one apply_indexing_plan uses, but I did not handlespawn_pipeline, which I think is used by the local CLI ingest. This is the reason for using an option.

I will refactor the code a little bit to eliminate the option. With some tweaking, spawn_pipeline can use spawn_pipelines under the hood.

ALTER TABLE splits
ADD COLUMN node_id VARCHAR(253);

UPDATE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
UPDATE
// Split metadata have been stable for quite a while, so we
// allow ourselves to do this, but please reader of the future,
// do not blindly reapply this pattern.
UPDATE

@guilload guilload force-pushed the guilload/optimize-merge-pipeline-list-splits-queries branch from d5e6233 to 2b92b06 Compare May 15, 2024 14:21
@guilload guilload force-pushed the guilload/optimize-merge-pipeline-list-splits-queries branch from 2b92b06 to 975daf4 Compare May 15, 2024 15:16
@guilload guilload merged commit 36f683a into main May 15, 2024
4 of 5 checks passed
@guilload guilload deleted the guilload/optimize-merge-pipeline-list-splits-queries branch May 15, 2024 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants