Prototype HLL buffers in manifest files to provide column distinct estimates. #6

Open
rdblue opened this issue Jan 31, 2018 · 2 comments

Comments

@rdblue
Contributor

rdblue commented Jan 31, 2018

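Per-file distinct counts aren't very valuable for cost-based optimization because they can't be easily merged: the same value can appear in several files, so the per-file counts can't be combined into a count for the whole scan. They should be removed. As a replacement, look into storing HLL (HyperLogLog) buffers, which can be merged, if they aren't too large.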
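
To illustrate the merge problem, here is a minimal HyperLogLog sketch in Python. It is only a sketch under stated assumptions: the class name `TinyHLL`, the SHA-1 based hash, and the precision `p=12` are all hypothetical choices for this example, and the register layout is not the airlift format or anything Iceberg stores. It shows that two per-file sketches merge by taking the register-wise maximum, which gives exactly the sketch that would have been built over the union of the files.

```python
import hashlib
import math


class TinyHLL:
    """Illustrative HyperLogLog; hypothetical, not the airlift or Iceberg layout."""

    def __init__(self, p=12):
        self.p = p                          # index bits (assumed precision)
        self.m = 1 << p                     # number of registers
        self.registers = bytearray(self.m)  # one byte per register for simplicity

    def add(self, value):
        # Derive a 64-bit hash from the first 8 bytes of SHA-1 (an arbitrary choice).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                    # top p bits select the register
        rest = h & ((1 << width) - 1)       # remaining bits
        # Rank is the 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Merging is a register-wise max; this is what plain counts cannot do.
        if self.p != other.p:
            raise ValueError("precision mismatch")
        merged = TinyHLL(self.p)
        merged.registers = bytearray(
            max(a, b) for a, b in zip(self.registers, other.registers)
        )
        return merged

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)    # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)  # linear counting for small cardinalities
        return raw


if __name__ == "__main__":
    file_a, file_b = TinyHLL(), TinyHLL()
    for v in range(60_000):
        file_a.add(v)
    for v in range(40_000, 100_000):
        file_b.add(v)
    # Adding exact per-file counts gives 120000, but only 100000 values are distinct.
    print("sum of per-file counts:", 60_000 + 60_000)
    print("merged HLL estimate:   ", round(file_a.merge(file_b).estimate()))
```

The overlap between the two files is what makes plain counts unusable: they can only bound the true distinct count, while the merged sketch estimates it directly.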

@rdblue
Contributor Author

rdblue commented Feb 16, 2018

Removed distinct counts in 75088f6.

@rdblue changed the title from "Remove distinct counts from manifests, possibly replace with HLL buffers" to "Prototype HLL buffers in manifest files to provide column distinct estimates." on Feb 16, 2018

@omalley
Contributor

omalley commented Mar 7, 2018

The Presto team has some code for HLL.

Format description - https://github.com/airlift/airlift/blob/master/stats/docs/hll.md
Code - https://github.com/airlift/airlift/tree/master/stats/src/main/java/io/airlift/stats/cardinality

I need to play with it, but the summaries can be pretty large.
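
For a rough sense of scale, the following back-of-the-envelope helper estimates how much space dense HLL buffers might add to manifests. Everything here is an assumption for illustration: the function names are hypothetical, the 6-bit dense register layout is the textbook one rather than the airlift encoding, and the column and file counts are made up. The linked format description remains the authority for actual sizes.

```python
import math


def dense_hll_bytes(p, bits_per_register=6):
    """Size of a dense buffer with 2**p registers (assumed textbook layout)."""
    registers = 1 << p
    return math.ceil(registers * bits_per_register / 8)


def relative_std_error(p):
    """Standard HyperLogLog error estimate, about 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(1 << p)


def manifest_overhead_bytes(p, columns, data_files):
    """Total bytes if every column of every data file carries one dense buffer."""
    return dense_hll_bytes(p) * columns * data_files


if __name__ == "__main__":
    # Hypothetical table: 20 columns, 1000 data files tracked in manifests.
    for p in (10, 11, 12):
        print(
            p,
            dense_hll_bytes(p),
            round(relative_std_error(p), 4),
            manifest_overhead_bytes(p, columns=20, data_files=1000),
        )
```

Under these assumptions, each extra bit of precision doubles the buffer size while reducing the error only by a factor of about 1.41, so the precision would need to be chosen with manifest size in mind.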

rdblue pushed a commit that referenced this issue Dec 18, 2018
This adds a new table property, write.folder-storage.path, that controls the location of new data files.
Parth-Brahmbhatt pushed a commit to Parth-Brahmbhatt/iceberg that referenced this issue Apr 12, 2019
This adds a new table property, write.folder-storage.path, that controls the location of new data files.