Prefilter join build side when it's too large #22667

kaikalur · 2024-05-03T23:43:36Z

Description

Optimize the build side of a join using the distinct keys from the left (probe) side when the right side (and the left side too) is large.

Motivation and Context

SELECT ... FROM T1 JOIN T2 USING(x)

can be very slow and memory-intensive when T2 is big (and T1 is also big). The idea is to do something like a dynamic filter, except on the build side. The above query becomes:

SELECT ... FROM T1 JOIN (SELECT * FROM T2 WHERE x IN (SELECT DISTINCT x FROM T1)) T2 USING(x)

This has helped us tremendously in some of our production workloads, so we are making it an optimization.
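The rewrite can be pictured with a small in-memory model. This is only an illustrative sketch, assuming an inner equi-join on a single key; the helper names `hash_join` and `prefiltered_join` are hypothetical and are not part of Presto or of this PR. It shows that filtering the build side by the distinct probe keys leaves the join result unchanged while shrinking the hash table.

```python
from collections import defaultdict


def hash_join(probe_rows, build_rows, key):
    """Toy inner equi-join: hash the build side, then probe it.

    Returns the joined rows and the number of rows held in the hash table.
    """
    table = defaultdict(list)
    for b in build_rows:
        table[b[key]].append(b)
    joined = []
    for p in probe_rows:
        for b in table.get(p[key], []):
            joined.append({**b, **p})
    return joined, sum(len(rows) for rows in table.values())


def prefiltered_join(probe_rows, build_rows, key):
    """Same join, but the build side is first reduced to keys seen on the probe side."""
    distinct_keys = {p[key] for p in probe_rows}  # SELECT DISTINCT x FROM T1
    filtered = [b for b in build_rows if b[key] in distinct_keys]  # x IN (...)
    return hash_join(probe_rows, filtered, key)


# Hypothetical data: a small T1 and a much larger T2.
t1 = [{"x": 1, "a": "p"}, {"x": 2, "a": "q"}, {"x": 2, "a": "r"}]
t2 = [{"x": k, "b": k * 10} for k in range(1000)]

plain_rows, plain_build = hash_join(t1, t2, "x")
pre_rows, pre_build = prefiltered_join(t1, t2, "x")
```

In this toy setup both variants return the same three rows, but the plain join hashes all 1000 rows of `t2` while the prefiltered join hashes only the 2 rows whose key appears in `t1`. The sketch ignores the cost of the extra pass over `t1`.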

Impact

Test Plan

Added tests

Contributor checklist

Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

Added a new optimization for prefiltering the build side of a join with distinct keys from the probe side. This can be enabled with the ``join_prefilter_build_side`` session property. :pr:`22667`

agrawaldevesh · 2024-05-05T06:59:23Z

Awesome! Is this strictly opt-in, or can it be driven by HBO or CBO?

Is there also a way to fail the SELECT DISTINCT early if its cardinality is too big?

Finally, I did not follow: can it be applied to multiple equi-join keys too?
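To make the last two questions concrete, here is a hypothetical sketch of a distinct-key collector that gives up once a cap is exceeded and that treats several join columns as one composite key. The function name `bounded_distinct_keys` and the `max_keys` parameter are assumptions for illustration only; nothing in this thread confirms that the PR works this way.

```python
def bounded_distinct_keys(rows, key_columns, max_keys):
    """Collect distinct composite keys, or return None once the cap is exceeded.

    A None result signals the caller to skip prefiltering and run the plain join.
    """
    seen = set()
    for row in rows:
        seen.add(tuple(row[c] for c in key_columns))
        if len(seen) > max_keys:
            return None
    return seen


# Hypothetical probe-side rows with a two-column join key (x, y).
probe = [
    {"x": 1, "y": "a", "payload": 10},
    {"x": 1, "y": "a", "payload": 11},
    {"x": 2, "y": "b", "payload": 12},
]

within_cap = bounded_distinct_keys(probe, ["x", "y"], max_keys=5)
over_cap = bounded_distinct_keys(probe, ["x", "y"], max_keys=1)
```

Here `within_cap` holds the two distinct `(x, y)` pairs, while `over_cap` is `None` because the second distinct pair exceeds the cap of 1.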

elharo

Great idea!

presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java

ClarenceThreepwood

Can you share some performance numbers that you see in your workloads? Maybe even add a SqlBenchmark that showcases this optimization?

IIUC - this optimization reduces the size of the hash table that is built out of T2. In order to do this it adds a second table scan on T1 and then builds a second hash table to compute the distinct join key from T1. I'm curious where the benefit comes from? Is it just the improved performance of the semijoin?

Any thoughts on how this can be used in practice? I ask because this is not a cost-based decision and can degrade performance in many use cases.

presto-main/src/main/java/com/facebook/presto/sql/planner/optimizations/JoinPrefilter.java

yingsu00

I'm curious, if we already know the distinct keys in T1, why not just make it the build side? No need to calculate the hash values, just use the distinct values as the hash values. This way there is no need to scan T1 twice.

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

steveburnett · 2024-05-09T14:25:21Z

suggest minor revision of the release notes entry

== RELEASE NOTES ==

General Changes
* Add optimization for prefiltering the build side of a join with distinct keys from the probe side. This can be enabled with the ``join_prefilter_enabled`` session property. :pr:`22667`

kaikalur · 2024-05-09T14:43:13Z

> Can you share some performance numbers that you see in your workloads? Maybe even add a SqlBenchmark that showcases this optimization?
>
> IIUC - this optimization reduces the size of the hash table that is built out of T2. In order to do this it adds a second table scan on T1 and then builds a second hash table to compute the distinct join key from T1. I'm curious where the benefit comes from? Is it just the improved performance of the semijoin?
>
> Any thoughts on how this can be used in practice? I ask because this is not a cost-based decision and can degrade performance in many use cases.

Two potential cases: a) the build side is very big and only a few keys actually match, so we shuffle far fewer right-side rows; b) after the semijoin, the build side becomes small enough to broadcast, which can eliminate shuffling the full left side, which could carry a lot of payload.
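The two cases can be illustrated with a toy data-movement model. This is a rough sketch under stated assumptions, not Presto's cost model: `rows_moved` and `broadcast_limit` are hypothetical names, the row counts are made up, and the cost of the extra scan and semijoin is ignored.

```python
def rows_moved(probe_rows, build_rows, broadcast_limit):
    """Toy model returning (strategy, probe rows shuffled, build rows shuffled or replicated)."""
    if build_rows <= broadcast_limit:
        # Build side is small enough to replicate; the probe side stays where it is.
        return ("broadcast", 0, build_rows)
    # Otherwise both sides are repartitioned on the join key.
    return ("partitioned", probe_rows, build_rows)


# Hypothetical sizes: 10M probe rows, 50M build rows, broadcast allowed up to 100K rows.
without_prefilter = rows_moved(10_000_000, 50_000_000, 100_000)
case_a = rows_moved(10_000_000, 2_000_000, 100_000)  # few keys match, still partitioned
case_b = rows_moved(10_000_000, 50_000, 100_000)     # filtered build fits the broadcast limit
```

In case (a) the join stays partitioned but moves 2M build rows instead of 50M; in case (b) the join switches to broadcast and the 10M probe rows are not shuffled at all.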

kaikalur · 2024-05-09T14:43:51Z

> I'm curious, if we already know the distinct keys in T1, why not just make it the build side? No need to calculate the hash values, just use the distinct values as the hash values. This way there is no need to scan T1 twice.

You need to get the rest of the fields!

kaikalur · 2024-05-09T15:07:46Z

> Can you share some performance numbers that you see in your workloads? Maybe even add a SqlBenchmark that showcases this optimization?
>
> IIUC - this optimization reduces the size of the hash table that is built out of T2. In order to do this it adds a second table scan on T1 and then builds a second hash table to compute the distinct join key from T1. I'm curious where the benefit comes from? Is it just the improved performance of the semijoin?
>
> Any thoughts on how this can be used in practice? I ask because this is not a cost-based decision and can degrade performance in many use cases.

Two potential cases: a) the build side is very big and only a few keys actually match, so we shuffle far fewer right-side rows; b) after the semijoin, the build side becomes small enough to broadcast, which can eliminate shuffling the full left side, which could carry a lot of payload.

Added benchmark with:

select count(1) from part join lineitem using (partkey) where part.name like '%x%'

Original:
join_prefilter_build_side :: 2158.693 cpu ms :: 4.17MB peak memory :: in 75.2K, 0B, 34.8K/s, 0B/s :: out 381, 26.7KB, 176/s, 12.4KB/s

With optimization:
join_prefilter_build_side :: 2189.438 cpu ms :: 2.02MB peak memory :: in 90.2K, 0B, 41.2K/s, 0B/s :: out 381, 26.7KB, 174/s, 12.2KB/s

See the memory reduction: peak memory drops from 4.17MB to 2.02MB.
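For a quick read of those two runs, the relative changes can be computed directly from the reported figures. The snippet below is just arithmetic on the numbers quoted above; the dictionary keys and the `pct_change` helper are names chosen for this illustration.

```python
# Figures copied from the benchmark output above.
baseline = {"cpu_ms": 2158.693, "peak_mb": 4.17, "input_rows_k": 75.2}
optimized = {"cpu_ms": 2189.438, "peak_mb": 2.02, "input_rows_k": 90.2}


def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


for metric in baseline:
    print(f"{metric}: {pct_change(baseline[metric], optimized[metric]):+.1f}%")
# cpu_ms: +1.4%
# peak_mb: -51.6%
# input_rows_k: +19.9%
```

So in this single benchmark, peak memory roughly halves while CPU time rises by about 1.4% and the number of input rows processed rises by about 20%.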

kaikalur · 2024-05-09T16:38:48Z

Added a task for making it cost-based - needs tracking some new stats in HBO:

https://fburl.com/2aui8j1i

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

kaikalur · 2024-05-13T17:19:43Z

@ClarenceThreepwood - can you take a look when you get a chance?

ClarenceThreepwood

Please update the release note with the new name of the session property

presto-main/src/main/java/com/facebook/presto/sql/planner/optimizations/JoinPrefilter.java

ClarenceThreepwood · 2024-05-14T18:24:03Z

> Added a task for making it cost-based - needs tracking some new stats in HBO:
>
> https://fburl.com/2aui8j1i

Is this Meta-internal only?

kaikalur · 2024-05-14T18:44:21Z

> > Added a task for making it cost-based - needs tracking some new stats in HBO:
> > https://fburl.com/2aui8j1i
>
> Is this Meta-internal only?

Oops, sorry. Here is the correct link:

#22706

kaikalur · 2024-05-14T18:51:05Z

> Please update the release note with the new name of the session property

Done

kaikalur · 2024-05-14T18:53:04Z

addressed comments

ClarenceThreepwood · 2024-05-14T19:02:02Z

> > Please update the release note with the new name of the session property
>
> Done

It still has the old name here:
"Added a new optimization for prefiltering the build side of a join with distinct keys from the probe side. This can be enabled with the join_prefilter_enabled session property. :pr:22667"

kaikalur · 2024-05-14T19:12:09Z

> > > Please update the release note with the new name of the session property
> >
> > Done
>
> It still has the old name here:
> "Added a new optimization for prefiltering the build side of a join with distinct keys from the probe side. This can be enabled with the join_prefilter_enabled session property. :pr:22667"

OK for real this time lol - damn scrollbar!

kaikalur · 2024-05-14T20:18:59Z

OK all comments addressed (again)

kaikalur requested review from jaystarshot, feilong-liu and a team as code owners May 3, 2024 23:43

kaikalur requested a review from presto-oss May 3, 2024 23:43

kaikalur force-pushed the prefilter branch 6 times, most recently from 0eea294 to 6b56668 May 5, 2024 03:22

jaystarshot requested a review from ClarenceThreepwood May 5, 2024 03:28

elharo reviewed May 5, 2024

kaikalur force-pushed the prefilter branch 4 times, most recently from 0c67ec3 to 8cf9cdc May 5, 2024 22:09

ClarenceThreepwood requested changes May 8, 2024

yingsu00 reviewed May 8, 2024

feilong-liu reviewed May 8, 2024

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java (outdated)

kaikalur force-pushed the prefilter branch from 8cf9cdc to a3e46b8 May 9, 2024 14:39

kaikalur force-pushed the prefilter branch 2 times, most recently from 2db1b6e to 74cf802 May 9, 2024 15:06

kaikalur force-pushed the prefilter branch from 74cf802 to 572b908 May 9, 2024 15:41

kaikalur requested a review from ClarenceThreepwood May 9, 2024 15:44

jaystarshot reviewed May 9, 2024

presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java

feilong-liu approved these changes May 9, 2024

ClarenceThreepwood reviewed May 14, 2024

presto-main/src/main/java/com/facebook/presto/sql/planner/optimizations/JoinPrefilter.java

ClarenceThreepwood approved these changes May 14, 2024

kaikalur requested a review from pranjalssh May 15, 2024 00:03

pranjalssh approved these changes May 15, 2024

kaikalur merged commit 20f6640 into prestodb:master May 15, 2024
56 checks passed
