-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Histogram #42
Comments
Ah. I can see something from your code. Don't use the array digest. Use the avltree digest. Sent from my iPhone
|
Another phoned in comment, t-digest makes no pretense of recreating densities. It recreates the cdf. Try plotting cdf's against each other. Also make sure compression factor is relatively high (200 or more) since you are interested in behavior far from the tails. Sent from my iPhone
|
I think I figured this out. The plot that you have here obscures the spacing between centroids. To recover the PDF from the t-digest you have to adjust the weight of the centroid according to the distance to the neighbors. |
I tried with your previous suggestions, that is using avltree digest with high compression factor (200). This is the histogram (pdf) that it creates: Indeed, the cdf is captured really well. I didn't really fully understand your last comment. If I still want to plot pdf, how do I do what you described? Thanks! |
Finishing this off ... yes, weighting the counts correctly is crucial to proper presentation of a PDF given a t-digest. The basic idea is that each centroid approximates the count over a particular range. To convert this to a rectangle in an approximated PDF, you need to convert the count and the range to get the slope of the CDF. Getting the bounds of the centroid is a bit tricky, but there is a pretty good way to do it: Given: centroid index i, number of centroids n, counts k[i], positions x[i], smallest x_min and largest x_max samples seen Wanted: left x_left and right x_right bounds of centroid if i == 0: Then the rectangle of the PDF can be drawn from x_left to x_right with height k[i] / (x_right-x_left) Closing this for now. |
Hi! Not sure if opening a new Issue here is the right way to go about this. I'd like to understand why the histogram created through t-digest loses so much information.
The original histogram with 500 breaks (every 0.05). Double peak is clearly visible.
.
Histogram created using t-digest centroids's means and counts. Double peak is not visible.

The raw data that I used includes 49220071 data points from 0 to 60.
This is the code I used to produce histogram-from-tdigest
Is using
Centroid
'smean()
andcount()
even the right way to go about this?Thank you.
The text was updated successfully, but these errors were encountered: