Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CTE Materialization documentation #22682

Merged
merged 1 commit into from May 15, 2024

Conversation

jaystarshot
Copy link
Member

Description

Fixes #22648

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* ... :pr:`12345`
* ... :pr:`12345`

Hive Connector Changes
* ... :pr:`12345`
* ... :pr:`12345`

If release note is NOT required, use:

== NO RELEASE NOTE ==

@jaystarshot jaystarshot marked this pull request as ready for review May 7, 2024 01:16
@jaystarshot jaystarshot requested review from steveburnett and a team as code owners May 7, 2024 01:16
@github-actions github-actions bot added the docs label May 7, 2024
@jaystarshot
Copy link
Member Author

cc: @feilong-liu @mlyublena @tdcmeehan @rschlussel for any suggestions

@jaystarshot jaystarshot force-pushed the cte-docs branch 2 times, most recently from 5d43a16 to ce3f455 Compare May 7, 2024 01:40
CTE Materialization
=================================

CTEs (Common Table Expressions) are defined as subqueries that appear in a WITH clause provided by the user.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Common Table Expressions (CTEs)

=================================

CTEs (Common Table Expressions) are defined as subqueries that appear in a WITH clause provided by the user.
Their repeated usage within a query can lead in unnecessary redundant computations, excessive data retrieval, and high resource consumption
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lead to
period after consumption

CTEs (Common Table Expressions) are defined as subqueries that appear in a WITH clause provided by the user.
Their repeated usage within a query can lead in unnecessary redundant computations, excessive data retrieval, and high resource consumption

To address this, Presto supports CTE Materialization. This feature allows intermediate CTEs to be materialized and effectively stored
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"materialized and effectively stored
and reused within" --> materialized, stored, and reused within

or maybe just "reused"

and reused within the scope of the same query.

This improves performance and is especially beneficial when the same CTE is utilized multiple times in a query.
These CTEs will be stored in temporary tables that are bucketed based on the first projection column
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

period

This improves performance and is especially beneficial when the same CTE is utilized multiple times in a query.
These CTEs will be stored in temporary tables that are bucketed based on the first projection column

To use this feature the connector used by the query must support the creation of temporary tables. Currently, only the :doc:`/connector/hive` connector offers this capability.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comma after feature

How to use CTE Materialization
--------------

The following configurations and session properties are provided to enable CTE materialization and fine-tune its settings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete "are provided to"

* **Allowed values:** ``ALL``, ``NONE``, ``HEURISTIC`, ``HEURISTIC_COMPLEX_QUERIES_ONLY``
* **Default value:** ``NONE``

Specifies the strategy to use for materializing Common Table Expressions (CTEs) in queries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete "to use"

* **Default value:** ``NONE``

Specifies the strategy to use for materializing Common Table Expressions (CTEs) in queries.
``NONE`` indicates that no CTEs will be materialized.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete "indicates that"


Specifies the strategy to use for materializing Common Table Expressions (CTEs) in queries.
``NONE`` indicates that no CTEs will be materialized.
``ALL`` indicates that all CTEs in the query will be materialized.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete "indicates that"

* **Minimum value:** ``0``
* **Default value:** ``4``

Used with CTE Materialization Strategy is set to ``HEURISTIC` or ``HEURISTIC_COMPLEX_QUERIES_ONLY``. CTES are only materialized if they are used greater than or equal to this number
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this sentence. Please rewrite

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like:

Suggested change
Used with CTE Materialization Strategy is set to ``HEURISTIC` or ``HEURISTIC_COMPLEX_QUERIES_ONLY``. CTES are only materialized if they are used greater than or equal to this number
When ``cte-materialization-strategy`` is set to ``HEURISTIC`` or ``HEURISTIC_COMPLEX_QUERIES_ONLY``, then CTEs will be materialized if they are appear in a query at least``cte-heuristic-replication-threshold`` number of times

* **Minimum value:** ``0``
* **Default value:** ``4``

Used with CTE Materialization Strategy is set to ``HEURISTIC` or ``HEURISTIC_COMPLEX_QUERIES_ONLY``. CTES are only materialized if they are used greater than or equal to this number
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like:

Suggested change
Used with CTE Materialization Strategy is set to ``HEURISTIC` or ``HEURISTIC_COMPLEX_QUERIES_ONLY``. CTES are only materialized if they are used greater than or equal to this number
When ``cte-materialization-strategy`` is set to ``HEURISTIC`` or ``HEURISTIC_COMPLEX_QUERIES_ONLY``, then CTEs will be materialized if they are appear in a query at least``cte-heuristic-replication-threshold`` number of times

To address this, Presto supports CTE Materialization. This feature allows intermediate CTEs to be materialized and effectively stored
and reused within the scope of the same query.

This improves performance and is especially beneficial when the same CTE is utilized multiple times in a query.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This improves performance and is especially beneficial when the same CTE is utilized multiple times in a query.
Materializing CTEs can improve performance when the same CTE is used multiple times in a query by reducing recomputation of the CTE. However, there is also a cost to writing to and reading from disk, so the optimization may not be beneficial for very simple CTEs or CTEs that are not used many times in a query.

and reused within the scope of the same query.

This improves performance and is especially beneficial when the same CTE is utilized multiple times in a query.
These CTEs will be stored in temporary tables that are bucketed based on the first projection column
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
These CTEs will be stored in temporary tables that are bucketed based on the first projection column
Materialized CTEs are stored in temporary tables that are bucketed based on the first projection column.

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, thanks for the doc! I made a lot of suggestions that are mostly repetitive. I had a few comments that I ask for your thoughts.

Also, I'm planning to reformat the Properties Reference page to remove the indents for content under each header.
Instead of cluttering this review with even more comments, I'll ask you here to remove the indents for the content of the property entries under the headings.

Here's a screenshot of my local build where I removed the indents for cte-materialization-strategy, showing the format change in the build - removing the gray vertical line - from that entry, but leaving the entry for cte-heuristic-replication-threshold to show what I'm asking you to do. Let me know if you have any questions.

Screenshot 2024-05-07 at 1 16 26 PM

presto-docs/src/main/sphinx/admin/cte-materialization.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/admin/cte-materialization.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/admin/cte-materialization.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/admin/cte-materialization.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/admin/cte-materialization.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/admin/cte-materialization.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/admin/cte-materialization.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/admin/cte-materialization.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/admin/cte-materialization.rst Outdated Show resolved Hide resolved
Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jaystarshot for this doc. Have some comments.

Specifies the strategy to use for materializing Common Table Expressions (CTEs) in queries.
``NONE`` indicates that no CTEs will be materialized.
``ALL`` indicates that all CTEs in the query will be materialized.
``HEURISTIC` greedily materializes the earliest parent CTE which is repeated >= cte_heuristic_replication_threshold times
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaystarshot : Do you intend to have a blog article or some doc giving more examples (queries and plans) demonstrating the behavior of these config ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is a good idea, i can add it as a blog or https://github.com/prestodb/presto/wiki


The number of partitions to be used for materializing Common Table Expressions (CTEs) in queries.
This setting determines how many buckets or writers should be used when materializing the CTEs, potentially affecting the performance of queries involving CTE materialization.
A higher number of partitions might improve parallelism but also increases overhead in terms of memory and network communication.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its strange that a property called partitioning actually affects the buckets. But please emphasize in the explanation that these are buckets and not partitions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually comes from exchange materialization where we use the hash partition count to count the number of buckets

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining! I thought this was strange, but the explanation makes sense.

This setting specifies which catalog should be used for CTE materialization and for determining how to partition the materialization of CTEs in queries.
This can also be specified on a per-query basis using the ``cte_partitioning_provider_catalog`` session property.

``cte-filter-and-projection-pushdown-enabled``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great to have an example of this as well. Do you have something in a design doc ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#21638 There is a good example on this ticket

* **Allowed values:** ``HIVE_COMPATIBLE``, ``PRESTO_NATIVE``
* **Default value:** ``PRESTO_NATIVE``

This setting specifies the Hash function type for cte materialization
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do each of these settings mean ? Can you elaborate ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cb69e89
Sample commit which added them.

Its used


I think it determines which hashing functions to use when inserting records into the buckets.

feilong-liu
feilong-liu previously approved these changes May 8, 2024
``NONE`` indicates that no CTEs will be materialized.
``ALL`` indicates that all CTEs in the query will be materialized.
``HEURISTIC` greedily materializes the earliest parent CTE which is repeated >= cte_heuristic_replication_threshold times
``HEURISTIC_COMPLEX_QUERIES_ONLY` greedily materializes the earliest parent CTE which meets the heuristic criteria and has a join or aggregate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may consider adding other operations, for example window operator which is also expensive for this strategy. But of course, this will be a separate PR from this one.


* **Type:** ``string``
* **Allowed values:** ``PAGEFILE``, ``ORC``, ``DWRF``, ``ALPHA``, ``PARQUET``, ``AVRO``, ``RCBINARY``, ``RCTEXT``, ``SEQUENCEFILE``, ``JSON``, ``TEXTFILE``, ``CSV``
* **Default value:** ``ORC``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if we want to change the default value if the recommendation is another one (again on a separate PR)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we can I will do it in a follow up PR. This is also used by exchange materialization and I did not want to break any default flow incase anyone was using the default value. Also prestissimo doesn't support PAGEFILE for now

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! A couple of nits for formatting issues, but from a docs standpoint it looks good. Happy to take another look when you're done.

presto-docs/src/main/sphinx/admin/cte-materialization.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/admin/cte-materialization.rst Outdated Show resolved Hide resolved
@jaystarshot
Copy link
Member Author

Thanks for the review

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix! In addition to the one comment about the value descriptions in cte-materialization-strategy, I have a request that would be a repetitive set of changes and is simpler to make here.

Suggest removing the indent throughout to remove the vertical stripe on the left. I'm trying to get to this throughout the Presto doc and it's taking time, but your new doc following this behavior would help.

Screenshots of the first one and a half entries (before, and after - the after also includes my suggested formatting for the first item's values) from my local environment to show what I mean.

(The second entry is unchanged in after, only cte-materialization-strategy acts as an example.)

before
Screenshot 2024-05-13 at 2 32 32 PM

after
Screenshot 2024-05-13 at 2 34 17 PM

presto-docs/src/main/sphinx/admin/cte-materialization.rst Outdated Show resolved Hide resolved
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Only a few nits.

presto-docs/src/main/sphinx/admin/cte-materialization.rst Outdated Show resolved Hide resolved

The number of partitions to be used for materializing Common Table Expressions (CTEs) in queries.
This setting determines how many buckets or writers should be used when materializing the CTEs, potentially affecting the performance of queries involving CTE materialization.
A higher number of partitions might improve parallelism but also increases overhead in terms of memory and network communication.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining! I thought this was strange, but the explanation makes sense.

presto-docs/src/main/sphinx/admin/cte-materialization.rst Outdated Show resolved Hide resolved
@jaystarshot jaystarshot force-pushed the cte-docs branch 2 times, most recently from 45d061c to e04ddb4 Compare May 15, 2024 20:56
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pulled updated branch, new local build of docs to review, everything looks good. Thanks!

@jaystarshot jaystarshot merged commit 2aa4d42 into prestodb:master May 15, 2024
57 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Add CTE materialization User documentation
6 participants