Optimize (left outer) (anti) semi join with other conditions #8339

Merged
merged 26 commits into from
Dec 6, 2023

Conversation

gengliqi
Contributor

@gengliqi gengliqi commented Nov 8, 2023

What problem does this PR solve?

Issue Number: close #8262

Problem Summary:
See #8262

What is changed and how it works?

  1. Semi join with other conditions can now finish early in some cases. It no longer needs to combine all matched rows before evaluating the other conditions.
  2. The block size used while evaluating the other conditions is fixed, even when the number of matched rows is huge, so semi join should no longer run out of memory (OOM) in this step.
  3. Semi join without other conditions is also optimized in this PR. The previous implementation added the matched right-table columns to the result, even though these columns are useless and were removed shortly afterwards. With this PR, these columns are no longer added.
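
The first two points above can be sketched as follows. This is only a minimal row-based illustration, not the code in this PR (which operates on columnar blocks). `Row`, `OtherCondition`, `semiJoinWithOtherCondition`, `max_batch_rows`, and `evaluated_rows` are hypothetical names introduced for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical row type: an equi-join key plus one value that the
// "other condition" inspects.
struct Row
{
    int key;
    int value;
};

// Hypothetical signature for the non-equi ("other") condition.
using OtherCondition = std::function<bool(const Row & left, const Row & right)>;

// Semi join sketch: returns one flag per left row, true if some right row has
// the same key AND satisfies other_cond. Matched rows are evaluated in batches
// of at most max_batch_rows, and probing of a left row stops after the first
// batch that contains a passing row.
std::vector<bool> semiJoinWithOtherCondition(
    const std::vector<Row> & left,
    const std::vector<Row> & right,
    const OtherCondition & other_cond,
    size_t max_batch_rows,
    size_t * evaluated_rows = nullptr)
{
    assert(max_batch_rows > 0);

    // Build side: hash the right table by join key.
    std::unordered_map<int, std::vector<Row>> hash_table;
    for (const Row & r : right)
        hash_table[r.key].push_back(r);

    std::vector<bool> result(left.size(), false);
    for (size_t i = 0; i < left.size(); ++i)
    {
        auto it = hash_table.find(left[i].key);
        if (it == hash_table.end())
            continue;

        const std::vector<Row> & matched = it->second;
        // Fixed-size batches bound the work (and memory) per evaluation step;
        // the loop condition gives the early exit once a match is found.
        for (size_t begin = 0; begin < matched.size() && !result[i]; begin += max_batch_rows)
        {
            const size_t end = std::min(matched.size(), begin + max_batch_rows);
            for (size_t j = begin; j < end; ++j)
            {
                if (evaluated_rows != nullptr)
                    ++*evaluated_rows;
                if (other_cond(left[i], matched[j]))
                    result[i] = true;
            }
        }
    }
    return result;
}
```

With a condition such as `a.C_PHONE > b.C_PHONE`, a left row that matches in the first batch never touches the remaining matched rows, which is where the early finish would come from in this sketch.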
// Before + TPC-H 1
mysql> select count(*) from customer a where exists (select 1 from customer b where a.C_NATIONKEY=b.C_NATIONKEY and a.C_PHONE > b.C_PHONE);
+----------+
| count(*) |
+----------+
|   149975 |
+----------+
1 row in set (16.04 sec)

// This PR + TPC-H 1
mysql> select count(*) from customer a where exists (select 1 from customer b where a.C_NATIONKEY=b.C_NATIONKEY and a.C_PHONE > b.C_PHONE);
+----------+
| count(*) |
+----------+
|   149975 |
+----------+
1 row in set (0.08 sec)

// Before + TPC-H 100
mysql> select count(*) from customer a where exists (select 1 from customer b where a.C_NATIONKEY=b.C_NATIONKEY and a.C_PHONE > b.C_PHONE);
ERROR 1105 (HY000): rpc error: code = Canceled desc = context canceled

// This PR + TPC-H 100
mysql> select count(*) from customer a where exists (select 1 from customer b where a.C_NATIONKEY=b.C_NATIONKEY and a.C_PHONE > b.C_PHONE);
+----------+
| count(*) |
+----------+
| 14999975 |
+----------+
1 row in set (0.89 sec)

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Optimize (left outer) (anti) semi join with other conditions

Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
@ti-chi-bot ti-chi-bot bot added the release-note-none and size/XXL labels Nov 8, 2023
Signed-off-by: gengliqi <gengliqiii@gmail.com>
@gengliqi
Contributor Author

gengliqi commented Nov 8, 2023

/run-all-tests

Signed-off-by: gengliqi <gengliqiii@gmail.com>
@gengliqi
Contributor Author

gengliqi commented Nov 9, 2023

/run-all-tests

@gengliqi gengliqi changed the title from "Rewrite (left outer) (anti) semi join to make it faster" to "Optimize (left outer) (anti) semi join with other conditions" Nov 9, 2023
Signed-off-by: gengliqi <gengliqiii@gmail.com>
@gengliqi
Contributor Author

gengliqi commented Nov 9, 2023

/run-all-tests

Signed-off-by: gengliqi <gengliqiii@gmail.com>
@gengliqi
Contributor Author

gengliqi commented Nov 9, 2023

/run-all-tests

Signed-off-by: gengliqi <gengliqiii@gmail.com>
@gengliqi
Contributor Author

/run-all-tests

@gengliqi
Contributor Author

/run-integration-test

Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
@gengliqi
Contributor Author

/rebuild

dbms/src/Interpreters/Join.cpp (outdated)
block.insert(src_column);
}

if constexpr (STRICTNESS == ASTTableJoin::Strictness::All)
Contributor

Does STRICTNESS == ASTTableJoin::Strictness::All mean that the join has other conditions?

Contributor Author

Yes

if constexpr (KIND == ASTTableJoin::Kind::LeftOuterSemi || KIND == ASTTableJoin::Kind::LeftOuterAnti)
{
    auto * left_semi_column = typeid_cast<ColumnNullable *>(added_columns[right_columns - 1].get());
    left_semi_column_data = &typeid_cast<ColumnVector<Int8> &>(left_semi_column->getNestedColumn()).getData();
Contributor

Should space be reserved for left_semi_column_data?

Contributor Author

Nice catch!

else
{
    auto result = res[i].getResult();
    if constexpr (KIND == ASTTableJoin::Kind::Semi || KIND == ASTTableJoin::Kind::Anti)
Contributor

Looks like for Semi and Anti join, the code branch of Any/All strictness can be merged into one if you define operator bool() on the result as result == SemiJoinResultType::TRUE_VALUE

Contributor Author

Good idea! I used a function, isTrueSemiJoinResult, instead, because result can be either an enum or a bool.
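
A helper along these lines could look like the sketch below. It is an assumption about the shape of such a function, not the code merged in this PR: only `SemiJoinResultType::TRUE_VALUE` and the name `isTrueSemiJoinResult` appear in this thread, while the other enumerators and `keepLeftRow` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed enum: only TRUE_VALUE is named in the review thread. The other
// enumerators are placeholders for this example.
enum class SemiJoinResultType : std::uint8_t
{
    FALSE_VALUE,
    TRUE_VALUE,
    NULL_VALUE,
};

// Two overloads give a uniform "did this row match?" check, whether the
// result is stored as a plain bool or as the three-valued enum.
constexpr bool isTrueSemiJoinResult(bool result)
{
    return result;
}

constexpr bool isTrueSemiJoinResult(SemiJoinResultType result)
{
    return result == SemiJoinResultType::TRUE_VALUE;
}

// Hypothetical merged branch: the same code decides whether to keep a left
// row for Semi (is_anti = false) and Anti (is_anti = true) joins, regardless
// of which result representation is in use.
template <bool is_anti, typename Result>
constexpr bool keepLeftRow(Result result)
{
    return isTrueSemiJoinResult(result) != is_anti;
}
```

Because a scoped enum does not convert implicitly to bool, the two overloads never conflict, and a caller such as `keepLeftRow<true>(result)` compiles unchanged for both representations.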

@ti-chi-bot ti-chi-bot bot added the needs-rebase label Nov 22, 2023
Contributor

ti-chi-bot bot commented Nov 22, 2023

PR needs rebase.

Signed-off-by: gengliqi <gengliqiii@gmail.com>
@gengliqi
Contributor Author

/run-all-tests

@windtalker
Contributor

Need to set probe_cache_column_threshold to 0 here for semi join

if (isNullAwareSemiFamily(kind))
    probe_cache_column_threshold = 0;

Signed-off-by: gengliqi <gengliqiii@gmail.com>
Contributor

@windtalker windtalker left a comment

LGTM

Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
Signed-off-by: gengliqi <gengliqiii@gmail.com>
@gengliqi
Contributor Author

gengliqi commented Dec 5, 2023

/run-all-tests

Contributor

@yibin87 yibin87 left a comment

Others LGTM

dbms/src/Interpreters/Join.cpp (outdated)
Contributor

ti-chi-bot bot commented Dec 6, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: windtalker, yibin87

Contributor

ti-chi-bot bot commented Dec 6, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-11-30 08:00:56.824986398 +0000 UTC m=+1082485.490212593: ☑️ agreed by windtalker.
  • 2023-12-06 02:48:09.681985647 +0000 UTC m=+1582118.347211842: ☑️ agreed by yibin87.

@gengliqi
Contributor Author

gengliqi commented Dec 6, 2023

/run-all-tests

@gengliqi
Contributor Author

gengliqi commented Dec 6, 2023

/run-all-tests

@gengliqi
Contributor Author

gengliqi commented Dec 6, 2023

/run-all-tests

@gengliqi
Contributor Author

gengliqi commented Dec 6, 2023

/run-all-tests

@gengliqi
Contributor Author

gengliqi commented Dec 6, 2023

/run-unit-test

@gengliqi gengliqi removed the needs-rebase label Dec 6, 2023
@ti-chi-bot ti-chi-bot bot merged commit 88f9912 into pingcap:master Dec 6, 2023
6 checks passed
@gengliqi gengliqi deleted the refine-semi-join branch December 6, 2023 09:25
Labels
approved lgtm release-note size/XXL
3 participants