-
Notifications
You must be signed in to change notification settings - Fork 8.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update histogram math #13680
base: main
Are you sure you want to change the base?
Update histogram math #13680
Conversation
Signed-off-by: Manik Rana <manikrana54@gmail.com>
Signed-off-by: Manik Rana <manikrana54@gmail.com>
Signed-off-by: Manik Rana <manikrana54@gmail.com>
Signed-off-by: Manik Rana <manikrana54@gmail.com>
Screen.Recording.2024-03-01.at.5.46.20.PM.mov |
Nice. I'd like @beorn7 to be the one to judge the histogram logic bits, but happy to review / :+1: it from a UI code perspective after that :) |
Great stuff. I'll have a detailed look ASAP (which might be Tuesday). Thank you very much. |
Sorry, drowning in work, it will take another day or so. |
I'm playing with this now. It looks generally very nice, but there seem to be some quirks. Here my observations (really just as an outside observer, without any insight into the code, as I'm not familiar with React): The text version is sorted fine now and displays the zero bucket just fine, too. In the graphical version, the zero bucket is displayed, but "left of zero", while it should be centered on zero (it would probably be fine to only display the half "right of zero" if there are no negative buckets): With the normalization of the height of the buckets, the number next to the Y axis appear pretty useless. Maybe we can just omit them? Showing the count in the bucket seems to be a fine way. (And there is always the text representation underneath for precise numbers and for those that cannot hover the mouse pointer for accessibility reasons.) Speaking of the height of the bars, they seem off. See the following two screenshots: The 1st one highlights a bucket with 44 observations, the 2nd one a bucket with 82 buckets. The numbers are also displayed in the some way in the text part. However, the bucket with count of 44 is higher than the bucket with a count of 82. It should be the other way round, and that by a lot (almost factor 2). Remember, with the exponential rendering, all buckets have the same width in the rendering, so their height should be proportional to their count (because ultimately the area of each bar has to be proportional to the count). So there is effectively no normalization necessary (you could even display the counts directly on the y axis.) I haven't checked the linear plot in detail yet. Let's first get the exponential plot right. However, I can say that this example app creates a normal distribution of observations. So the linear plot should result in a nice Gauss bell shape. Maybe that's even working, if squinting at But as said, let's take care of it once the exponential plot works. Another note for the zero bucket: By default, it is very narrow (and it could be set to a width of zero in some use cases). That means that the normalization for the linear plot would result in an extreme height (or even infinity in case of a zero width zero bucket). There should be some limitation on the displayed height of the zero bucket, maybe something like "display it if it is at most 10x higher than the highest other bucket, but just cut it off at the max height of any other bucket otherwise". When I deliberately put observations in a zero bucket with the narrow default width, I see an overly high bucket even in the exponential plot, which tells me that some normalisation is happening (but it should not, see above): And here the funny result with a zero bucket of width zero that actually contains a few observations of exactly zero: Final topic: gaps between buckets. There can be gaps between buckets (or in other word: empty buckets, that just aren't explicitly stated in the query result, because native histograms are "sparse"), and they seem to be displayed nicely in the linear view, but they aren't displayed at all in the exponential view. I think they should. Imagine the gaps are filled with buckets of population zero and following the same exponential bucketing schema. (A more subtle topic is the possible gap between the zero bucket and the first regular bucket, but it's probably fine to not draw this one, at least not in the exponential plot.) Sorry for all the comments. The zero bucket is really a beast to plot because it breaks the regularity. So I would propose the following plan:
|
(@beorn7 Side note: I never had the expectation that the area under each bar in the exponential view had to be adjusted to be proportional to the number of observations and was surprised to hear that. Is that a commonly expected thing when displaying exponetial histograms on an exponential axis? If we didn't do that, then the Y-axis would keep making sense...) |
That's the essence of the histogram (as it was used on paper, before computers were invented…). However, the basic idea with the exponential plot is that every bucket will be rendered with the same width, so now not only the area but also the height is proportional to the bucket counte. And indeed, in that case we can keep the Y axis. |
https://en.wikipedia.org/wiki/Histogram#/media/File:Travel_time_histogram_total_n_Stata.png is the proverbial histogram from Wikipedia . Note the unit on the y axis (people per time interval). |
Yes I'll remove them. One of your screenshots show the scales to be in the order of trillion ^ 2 😆 which doesn't seem useful.
this could be fixed by simply updating the position of the 0 tick:
On an exponential scale, it currently finds the bucket closest to zero and puts the 0 marker on that bucket's lower bounds. This is incorrect, I'll fix it and put the 0 tick on true zero.
This bit is accounted for
Okay so for this one, I made the assumption that these are frequency density histograms. This means that the height is is calculated by So if we went ahead an calculated the heights for the buckets you mentioned, you'd see that they're actually consistent:
here, bucket with count 44 is actually taller than that with count 82.
here's my rationale: Yes the widths do appear equal to eachother, but they are on an exponential scale. If we calculate the width of some bucket x on the linear scale, it would have the same value as bucket x on the exponential scale. The values of the range/width are the same, they simply appear differently on each scale. Because of this, height is some proportion of Do let me know if there are gaps in my theory.
Hmm, yes I'll try and figure something out for this.
That's the same example I'm using. Question tho (my statistics is fuzzy), are these random values supposed to exactly fit in a bell curve on linear scales? or should it look somewhat like a bell curve with some outliers. If yes to the latter, I belive it would then act somewhat like a galton board, and this is what's happening already happening. Again, do let me know if I'm wrong.
I didn't anticipate this 😅. I'll revisit the zero bucket. |
Sorry for the delay in my response. Let's clarify the scaling of the height of the buckets first: I think the linear case is pretty clear, and your implementation might be correct already. In the linear case, I think we should not have values on the Y axis, as they will generally be weird to read and understand. The hover info will tell the user about the number of observations in the bucket, and that's good enough. I do believe that, within statistical noise, the graphical appearance of the the histogram in linear view should approach a Gaussian bell curve. But let's do a final verification of all of that once we have nailed the exponential view, because the latter will be the one that users are using normally. So now about the exponential view: What you use as input in your calculations is the range directly, but since we have a logarithmic x axis, you also need to take the logarithm of the range. If you do that, you'll find out that now everything cancels and the height of the buckets is proportional to the count in the bucket. That's only the case because the buckets are exponential. (Once custom bucket layouts land here, we have to revisit the rendering.) In other words: The actually displayed area of the bars (or the "numbers of pixels" in the bar) should be proportional to the count in the bucket. So for the current bucket schemas, the exponential view is very simple: height == count, and you can put the count on the y axis. Yet another comparison: Imagine a linear bucket layout (one bucket from 0 to 1, next from 1 to 2, next from 2 to 3). If you render such a histogram on a linear x axis, again height would be proportional to count. Exponential buckets rendered on a linear x axis need the multiplication math for their correct area, but if the x axis is logarithmic itself, it all cancels out and you are back to height being proportional to the count. |
Another thought: You could also do the multiplication thing in the exponential view "strictly correct" right now (I assume you have to change the current code just slightly to take into account the logarithmic nature of the x axis by doing an exponentiation at the right place). This will appear needlessly complicated for now, but will help once custom bucket definitions are a thing. |
Sorry I've been away 😬 @beorn7 I'll get back to you on this soon (before the weekend or so) |
Signed-off-by: Manik Rana <manikrana54@gmail.com>
@beorn7 Again, sorry about the delay I have now set height proportional to count for the exponential scale and kept everything same for linear. Also removed the y axis labels for linear. |
Height looks good now. The only thing left for exponential view is to position the bars correctly. I try to summarize our discussion we had on Slack: I don't know a lot about how to code this all in CSS and Typescript etc., so I make a few fundamental assumption to create some pseudocode. Please fix my assumptions where necessary. I assume the leftmost point on the x-axis in the rendering is 0 and the rightmost point is 1 (maybe it's something like 0% to 100%, but formulating it in terms of 0 to 1 is easier below). The "standard bucket width"
This should yield the same result, no matter what bucket you pick, as long as you don't pick the zero bucket. (Note: There is the special case where only the zero bucket is populated, which has to be rendered just as a single bar centered on zero.) Every bucket will be rendered with this width What's left is to map bucket boundaries to points on the x axis. For that, let's define the following values:
Now it's easy to calculate the position of bucket boundaries etc. on that 0 to 1 scale for the whole panel:
In case negative or positive buckets don't exist, just set their respective start and end value to zero, and things should just work out. |
Note that the above always puts the zero bucket directly adjacent to the highest negative bucket and the lowest positive bucket, no matter if there is a gap in between. I think that's fine (as in most cases there will be a gap anyway, and it might be more confusing to show it than to gloss over it), but we can revisit that decision if needed. |
Hello from the bug-scrub! Do you think you will come back to this @Maniktherana ? |
Hey there, yes I'm working on it now 😬 |
Signed-off-by: Manik Rana <manikrana54@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens with these changes? (I haven't tried myself… 🦆)
const minPositive = buckets ? parseFloat(buckets[closestToZero.closestIdx + 1][1]) : 0; | ||
const maxNegative = buckets ? parseFloat(buckets[closestToZero.closestIdx - 1][2]) : 0; | ||
const minNegative = buckets ? parseFloat(buckets[0][1]) : 0; | ||
const startNegative = buckets ? Math.log(Math.abs(minNegative)) : 0; //start_neg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do the following like in my latest update:
const startNegative = buckets ? Math.log(Math.abs(minNegative)) : 0; //start_neg | |
const startNegative = buckets ? -Math.log(Math.abs(minNegative)) : 0; //start_neg |
const maxNegative = buckets ? parseFloat(buckets[closestToZero.closestIdx - 1][2]) : 0; | ||
const minNegative = buckets ? parseFloat(buckets[0][1]) : 0; | ||
const startNegative = buckets ? Math.log(Math.abs(minNegative)) : 0; //start_neg | ||
const endNegative = buckets ? Math.log(Math.abs(maxNegative)) : 0; //end_neg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const endNegative = buckets ? Math.log(Math.abs(maxNegative)) : 0; //end_neg | |
const endNegative = buckets ? -Math.log(Math.abs(maxNegative)) : 0; //end_neg |
const startPositive = buckets ? Math.log(minPositive) : 0; //start_pos | ||
const endPositive = buckets ? Math.log(maxPositive) : 0; //end_pos | ||
|
||
const widthNegative = startNegative - endNegative; //width_neg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the changes above, this becomes again the more intuitive:
const widthNegative = startNegative - endNegative; //width_neg | |
const widthNegative = endNegative - startNegative; //width_neg |
Math.log(Math.abs(parseFloat(buckets[buckets.length - 1][2]))) - | ||
Math.log(Math.abs(parseFloat(buckets[buckets.length - 1][1]))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are only negative buckets, or even only a zero bucket, this will break (just a note for the final code).
scale === 'linear' | ||
? ((left - rangeMin) / (rangeMax - rangeMin)) * 100 + '%' | ||
: left < 0 | ||
? (Math.log(Math.abs(left)) - startNegative / widthTotal) * 100 + '%' // negative buckets boundary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With recent updates and fixed parenthesis:
? (Math.log(Math.abs(left)) - startNegative / widthTotal) * 100 + '%' // negative buckets boundary | |
? -(Math.log(Math.abs(left)) - startNegative) / widthTotal * 100 + '%' // negative buckets boundary |
? ((left - rangeMin) / (rangeMax - rangeMin)) * 100 + '%' | ||
: left < 0 | ||
? (Math.log(Math.abs(left)) - startNegative / widthTotal) * 100 + '%' // negative buckets boundary | ||
: (Math.log(left) - startPositive + bw + widthNegative / widthTotal) * 100 + '%'; // positive buckets boundary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
: (Math.log(left) - startPositive + bw + widthNegative / widthTotal) * 100 + '%'; // positive buckets boundary | |
: (Math.log(left) - startPositive + bw + widthNegative) / widthTotal * 100 + '%'; // positive buckets boundary |
Signed-off-by: Manik Rana <manikrana54@gmail.com>
Signed-off-by: Manik Rana <manikrana54@gmail.com>
Signed-off-by: Manik Rana <manikrana54@gmail.com>
Things look quite good. So this is a perfect example, as far as I can see: The zero bucket is unpopulated here, but rendered correctly as an empty bucket centered on the 0 x-axis mark. However, with fewer and less populated buckets, things look weirdly off: The zero tick is not centered (although there is a centered line visible), and the lowest positive bucket is rendered over the highest negative bucket (the one closest to zero). Note that in this example, the buckets are not contiguous. There is a gap between the lowest and second lowest positive bucket. Maybe that's triggering some weird bug. Here is an example with an actually populated zero bucket (and, for extra difficulty, the zero bucket does not conicide with a usual bucket boundary, so that the buckets next to the zero bucket have a different width): It seem that the "closest to zero" detection has a problem. In different news: This is a case where the buckets do not all have the same width, even without custom bucket boundaries. But maybe we don't need to handle this quite special case. |
In this last case, are they truly of different width or are they overlapping? Also looks like the zero bucket goes from negative to positive ranges. Don't think that one is explicitly handled |
const minPositive = buckets ? parseFloat(buckets[closestToZero.closestIdx + 1][1]) : 0; | ||
const maxNegative = buckets ? parseFloat(buckets[closestToZero.closestIdx - 1][2]) : 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the zero bucket is not populated, it isn't included in the buckets list, so the closestIdx ± 1 will fail.
} | ||
} | ||
|
||
return { closest, closestIdx }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are not really using the returned closest
, are you?
I think this should instead return the index of the closest-to-zero negative bucket and the index of the closest-to-zero positive bucket separately, so you don't need the ±1 magic above (which fails if there is no zero bucket in buckets
).
const widthPositive = endPositive - startPositive; //width_pos | ||
const widthTotal = widthNegative + expBucketWidth + widthPositive; //width_total | ||
|
||
function findClosestToZero(numbers: number[]): closestToZeroType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect you need to pass in the full buckets (or the upper limit for the negative buckets and the lower limit for the positive buckets), and you must not get confused by the zero bucket (which might or might not exist).
They are truly different width. Buckets are never really overlapping.
Yes, that's the very definition of the zero bucket (it goes from positive to negative, or it is just zero with zero width). |
Another idea for |
agreed, |
WRT overlapping zero bucket: If you look at the numbers in your screenshot above, you'll see that no buckets are overlapping. This is a "feature" of the zero bucket. If you set it such that it doesn't coincide with the "natural" bucket boundary of the neighboring buckets, those buckets are cut short, and that's the one case where you get smaller buckets with an otherwise standard exponential schema. There are custom buckets in the making, and those will have variable bucket length anyway. So eventually, we have to handle them. Right now, we can ignore it for now, but we need to get that special case around the zero bucket solved. I think with the formulas developed so far, it should "just work", and the artifacts we see are caused by some implementation problems. |
Yes, and I think that needs to be fixed (and it's unexpected from the math). |
I played around a bit and found that updating this nudged it in the right direction: -const widthTotal = widthNegative + expBucketWidth + widthPositive; //width_total
+const widthTotal = widthNegative + 3 * expBucketWidth + widthPositive; //width_total |
But isn't that just ad-hoc compensation for a mistake that we have done elsewhere? I think with a fixed closestToZero, based on the bounds of the zero bucket, things should all fall into place nicely. |
You are correct. I also believe the same when it comes to positioning the buckets. Zero bucket calculation might also be affecting the widths |
Signed-off-by: Manik Rana <manikrana54@gmail.com>
@beorn7 updated some calculations:
It fixes some things but but theres still the issue of these overlapped buckets: but in some cases, theres no such issue: finally, sometimes a bucket may have ranges with very small boundaries (order of 10^-8 or so) and such buckets seem to occupy random positions. in this case, the highlighted bucket should be the right side of the zero marker: |
At first glance, I would say your first screenshot is already the correct behavior. Maybe we should do the "frequency density thing" in that case, but maybe not. A zero bucket overlapping with the neighboring buckets is a scenario that should be avoided generally, and I'm not sure there is a good way of treating it graphically. I will look at the details ASAP to come to a conclusion (which will be early next week, hopefully). I will also look into the issue with the last screenshot then. Floating point precision issues are probably hard to avoid in certain situations, and maybe they even play out a bit differently in the Javascript VM than in a native program… However, 1e-8 isn't that small in terms of IEEE 754. As said, I'll try to get to the bottom of it. And if anyone beats me to it before the start of next week, even better. |
Signed-off-by: Manik Rana <manikrana54@gmail.com>
Addresses #11269
Bucket count is now represented by area rather than height
Height is now =
count / range
, creating a height proportional to frequency density@beorn7