
Bug: points_per_hour artificially creates "fake" data points #1007

Open
AleXSR700 opened this issue Sep 14, 2023 · 20 comments

@AleXSR700

Hello everyone,
I am new to this card, but I noticed that the points_per_hour feature artificially creates data points if its value is higher than the real number of data points.

So if the database contains 20 data points in hour x but you set points_per_hour to 40, mini-graph-card will create 20 "fake" data points, causing unnecessary system load.

A lot of entities (almost all of mine) update on value change and not on a time basis. So it is not possible for the user to know how many points a given hour will have.

As mini-graph-card accesses the database and "knows" how many points there are in one hour, would it be possible to update the function to use points_per_hour as an upper limit rather than a forced absolute?

Thank you for your consideration. :)
Alex

@ildar170975
Collaborator

This is not a bug. This is called “interpolation”.

@AleXSR700
Author

Sorry Ildar, I respect your work and the support you give on the forum A LOT, but you are not being objective if you think this is a feature and not a bug.

The point of points_per_hour is to reduce system load.
The current implementation significantly increases the load in all cases where the user did not select fewer points than there are data points.
The cost is very poor charting (especially since the data points are spread evenly and no longer have anything to do with the source data).

Clearly, for this option to work properly, it should never create additional workload by creating additional data points. It should always be an upper limit, not a forced value.

So a sensible implementation would either
a) use it as an upper limit, so that no fake points are created, or
b) not spread data points evenly, but according to relevance: if there is only one peak in the last 5 min of an hour, you need two baseline data points before the peak and the rest for the peak.

Since b) is very complicated, a) is the reasonable approach.
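
A minimal sketch of what option a) could look like, with assumed names (rawPoints for the readings fetched from the recorder for one hour, pointsPerHour for the configured option) - this is not the card's actual code:

```js
// Option a) sketched: treat points_per_hour as an upper limit, never a forced count.
function samplePointsForHour(rawPoints, pointsPerHour) {
  // Never produce more points than the database actually holds.
  const target = Math.min(pointsPerHour, rawPoints.length);
  if (target === rawPoints.length) return rawPoints; // few enough points: use them as-is
  // Otherwise down-sample by picking evenly spaced real points (no fabricated ones).
  const step = rawPoints.length / target;
  return Array.from({ length: target }, (_, i) => rawPoints[Math.floor(i * step)]);
}
```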

@ildar170975
Collaborator

I absolutely have no intention to offend you.
When I say "not a bug" - I mean that the current behavior is by design.
Points are taken from the DB; assume there are 10 of them within an hour. Then a curve is built based on a set of "assumed points". The "assumed points" are based on real ones - but there may be either fewer or more of them, depending on points_per_hour.
Compare with history-graph - it is an absolutely precise graph, since it is made of real points.
And compare with the "sensor card" graph - it has an option similar to points_per_hour, and its graphs are very approximate.
So, mini-graph-card by design allows building either approximate graphs or "precise" ones - and the latter are still "not absolutely precise", because they are based on average_func anyway (if I recall this option properly), even when points_per_hour=120 for a sensor which changes every minute.
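
Roughly, that resampling can be pictured like this - a simplified sketch with assumed names, not the card's real implementation: the hour is split into points_per_hour equal buckets, readings inside a bucket are averaged (the average_func part), and an empty bucket just repeats the last known value, which is where the extra points come from.

```js
// Simplified illustration of fixed-rate resampling (not mini-graph-card's actual code).
// `readings` are {time, value} pairs (ms timestamps) within one hour, sorted by time.
function resampleHour(readings, pointsPerHour, hourStartMs) {
  const bucketMs = 3600000 / pointsPerHour;
  const result = [];
  let lastValue = readings.length ? readings[0].value : null;
  for (let i = 0; i < pointsPerHour; i++) {
    const from = hourStartMs + i * bucketMs;
    const inBucket = readings.filter(r => r.time >= from && r.time < from + bucketMs);
    if (inBucket.length) {
      // Roughly what an averaging function would do with the readings in this bucket.
      lastValue = inBucket.reduce((sum, r) => sum + r.value, 0) / inBucket.length;
    }
    // Empty bucket: repeat the last known value - these are the "assumed"/repeated points.
    result.push({ time: from, value: lastValue });
  }
  return result;
}
```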

I would be happy to have an option like "use raw data without processing". There was a FR for this, but it was postponed as "needs a design change".

@AleXSR700
Author

AleXSR700 commented Sep 14, 2023

Yes, using raw data is a valid alternative too.
But it would still mean that points_per_hour is a last-resort kind of feature.
The accuracy of the charts by default is very poor. They do not represent the real data very well.

So evenly spreading data points is probably the least ideal approach to solving the issue.

A better approach is grouping into means or grouping around delta values.
But I admit that that is difficult to code.

A raw-data option would make the most sense if combined with points_per_hour in a "simple"
if (points_per_hour > number of raw data points) use the raw data points

I love the look and slickness of mini-graph-card. I just feel that it lacks the ability to represent the data accurately.

@ildar170975
Collaborator

The accuracy of the charts by default is very poor. They do not represent the real data very well.

Agree.

For me, displaying raw data (with "step" transitions) would make this card 100% useful...
Currently I have to define points_per_hour at least equal to the actual update frequency (and all my sensors with graphs do have a fixed scan_interval).
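
For illustration, matching points_per_hour to a fixed scan_interval is simple arithmetic (a sketch with an assumed example interval):

```js
// Sketch: for a sensor updating every scan_interval seconds, the points_per_hour
// that matches the real update rate is simply:
const scanIntervalSeconds = 60;                            // assumed example: one reading per minute
const matchingPointsPerHour = 3600 / scanIntervalSeconds;  // -> 60
```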

@ildar170975
Collaborator

As a conclusion: I propose to convert this issue into a FR, maybe even 3 separate FRs:

  1. Add a "raw data" method - i.e. points are taken from the DB without dealing with points_per_hour or average_func.
  2. Change the default values for points_per_hour & smoothing: currently these default values create a graph which may differ a lot from history-graph.
  3. Change the way how ..... (please add your proposals).

@filmkorn

Is there a way to make the card request at most the number of points that are actually available in the DB - even if points_per_hour is a higher number?

Suppose there's 10 data points per hour at random intervals.

  • With points_per_hour: 50 I'd get only those 10 real data points. The card would interpolate those 10 depending on smoothing.
  • With points_per_hour: 2 the card would only get 2 points, at whatever values Home Assistant returns.

The reason I'm asking is that there's no per-entity points_per_hour setting that I can see. If you're showing data with vastly different fidelity, you'd want to use a points_per_hour setting that is closer to the high-fidelity data. However, it seems that Home Assistant then returns stepped data points for interpolated values:

[screenshot]
This graph is quite ugly. The upper 'Outdoor' data has data points only at the step changes shown in the graph.

I'd like to smooth out the upper 'Outdoor' curve while maintaining high fidelity in the lower 'Indoor' curve that shows a sharp peak. I cannot figure out how to do this without destroying all fidelity in the lower curve or otherwise having the upper curve show as stepped.

@ildar170975
Collaborator

ildar170975 commented Nov 29, 2023

This graph is quite ugly.

Then show a standard history-graph for these entities with the same hours_to_show.
Otherwise, how can people tell whether a graph is wrong or correct?

Defining a high value (at least equal to the real update frequency) for points_per_hour is a way of displaying "exact data" (like in history-graph).
"Smoothing" is the opposite goal.

@filmkorn

filmkorn commented Nov 30, 2023

I've done some further digging; this is more complex than I thought.

Many entities do not record values if the values have not changed. That makes a lot of sense for performance and storage reasons, as values that don't change seem redundant. However, once a value actually changes, there's a large gap between the data points. There's no good way to plot this unless you know the update frequency of the entity.

On a side note - it seems that home assistant once linearly interpolated the data points but this behaviour was changed with home-assistant/core#10590.

History (roughly the same time period):
[screenshot]

Coming back to the 'ugly graph' - it's easy to see in history graph explorer why spline or linear interpolation is not a good way to plot both "Outdoor" = pm25 and "Indoor" = "U.S. Air Quality Index" for the same time period.

[screenshot]

The blue line just before the peak of pm25 is incorrect, and only stepped interpolation makes sense here. This is because (in this case) esphome does not send unchanged values (by default, unless a heartbeat filter is used). And I believe Home Assistant also does not record unchanged values in the database. So there's essentially no way to tell at what time the value was last at 0 right before the first value of the peak.

For entities that update regularly (do not skip recording values if unchanged) but at large intervals, I think linear or spline interpolation does make sense, as that may more accurately represent the measured reality (i.e. air temperature does not change in steps). There are a lot of low-power Zigbee devices that only update sporadically or when a value has changed that could benefit from a representation of the interpolated (linear or spline) raw data.

Currently the only way I see to make a mini-graph-card of data points with large intervals and big changes NOT show the steps given by Home Assistant is to under-sample the data using points_per_hour or group_by - which greatly reduces the fidelity of the graph.

So I guess - as a FR: having the option to display raw data with configurable interpolation would be nice in some cases but incorrect in others.
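
To make that concrete, here is a small sketch (assumed names and example values, not any card's actual code) of the two ways to estimate a value between two sparse samples of a change-only sensor:

```js
// Two ways to estimate a value at time t between recorded samples a and b
// (each {time, value}, with a.time <= t <= b.time).
function linearAt(a, b, t) {
  const f = (t - a.time) / (b.time - a.time);
  return a.value + f * (b.value - a.value); // draws a ramp between the samples
}
function steppedAt(a, b, t) {
  return a.value; // hold the last recorded value until the next sample arrives
}

// Assumed example: pm25 reported 0, then nothing for two hours, then a spike to 80.
const before = { time: 0, value: 0 };
const spike = { time: 7200000, value: 80 };
console.log(linearAt(before, spike, 3600000));  // 40 - implies pollution the sensor never reported
console.log(steppedAt(before, spike, 3600000)); // 0  - matches what the sensor actually reported
```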

@ildar170975
Collaborator

in history graph explorer

Here I stopped reading, since this is not the standard card that was requested.

@filmkorn

Updated my previous response with a screenshot of the history (I assume this is equivalent to the history card but makes it easier to adjust the time period).

All I'm saying is that, after some more investigation, mini-graph-card does plot the sparse data more accurately than if it were interpolated (see the incorrect blue line from History Graph Explorer's default spline interpolation in my previous reply).

@ildar170975
Collaborator

ildar170975 commented Nov 30, 2023

@filmkorn
There are 2 methods to represent a graph between 2 points:
-- stepline (history-graph);
-- line (mini-graph-card).
The "line" method allows showing nice graphs (especially if smoothed) - but it shows truly "fake" data. "Fake" does not mean the data are wrong - it only means that these data were not read from a sensor.
The "stepline" method shows the exact data read from a sensor.

The mini-graph-card (m-g-c) does not show ONLY exact data read from a sensor ("actual data").
Instead, it creates an array of data which contains:
-- either some SUBSET of the "actual data" (if points_per_hour is less than the number of readings),
-- or an "enriched" SUPERSET of the "actual data" (otherwise).

Since m-g-c uses the "line" method, the SUBSET may or may not represent the actual curve accurately; it depends on the particular "actual data". In your example, the interpolated "history graph explorer" for "PM25" is "ugly" since it does not show the "sudden peak" - it is smoothed. In some other cases the SUBSET may reflect the actual trend.

The "enriched" SUPERSET allows mimicking a stepline method.
Assume we have a "speedtest" sensor with readings every 3 hours. Showing a graph with the "line" method would give the impression that there are actual readings every hour/minute - which is wrong. That is why a user may decide to set a high points_per_hour, which shows "steplines".

So, using a high points_per_hour is a trick to mimic a "stepline".
But it requires more resources.
Probably m-g-c should support points_per_hour per entity - i.e. to mimic "steplines" for some particular entity.

Imho the best solution would be supporting "actual data" (#366, #126, #538) - along with an added "stepline" method (no dedicated FR so far).


As for your graphs:
For me, m-g-c shows these graphs rather precisely (but I would use smoothing: false):

[screenshot]

Imho the "history graph explorer" graph is "ugly" since it is not as accurate as history-graph for these particular data.

@filmkorn

filmkorn commented Nov 30, 2023

Completely agree with you. To explain why I stumbled across this 'bug' report - I was trying to have a smooth / subsampled representation of the 'US air pollution index' (upper line) while keeping the accurate spike in pm25. This doesn't seem possible at the moment, as the only way to do it is to use the raw data and interpolate it vaguely like history graph explorer does. But - like you said -

the "history graph explorer" graph is "ugly"

And I fully agree. The spline before the spike of pm25 is simply wrong, as I wasn't slowly burning butter in the pan for several hours starting at midnight, as the h-g-e would suggest.

This is because, in this case, the raw data cannot be accurately interpolated, linearly or otherwise. So having the option to limit the array to the raw data and not create an 'enriched SUPERSET' is pointless in my particular case.

@ildar170975
Collaborator

ildar170975 commented Nov 30, 2023

this 'bug' report

Imho (said it already) this is not called a "bug" since the "enriched SUPERSET" is a "designed thing".
Yes, it creates plenty of SAME points; yes, this needs resources.
But currently this is the only way to mimic "steplines" - since there is no "use only actual data" support.

So having the option to limit the array to the raw data and not create an 'enriched SUPERSET' is pointless in my particular case.

I do not think I understood this expression...
Since you have a "spike data" - there are only 2 ways:
-- use m-g-c with a mimicked "stepline" method (high points_per_hour);
-- use history-graph with a precise "stepline" method.
Cannot say anything about other 3rd party plugins.

@filmkorn

use m-g-c with a mimicked "stepline" method (high points_per_hour);

This is my preferred option as m-g-c does look much nicer than other cards. I don't want to use a mix of history-graph and m-g-c on the same dashboard.

Here are my three feature requests for m-g-c:

  • Allow using the raw data, or limit the query so that it does not retrieve more points from hass than the raw data provides.
  • Add a stepline interpolation. Currently only spline (smoothing: true) and linear (smoothing: false) are supported. Adding this would avoid having to use a history card for this feature.
  • Allow setting points_per_hour and smoothing per entity instead of for the whole graph.

That said, I don't care enough about this to bother making feature requests.

@ildar170975
Collaborator

ildar170975 commented Nov 30, 2023

1st request - #366
2nd request - #1038
3rd request for smoothing - #1039

As for points_per_hour per-entity - I am not ready to create this FR yet.
Someone who needs it and has some ideas is more than welcome to do it.

Closing the issue now.
Yet the topic surely may be discussed here.

@AleXSR700
Author

AleXSR700 commented Dec 1, 2023

I am sorry, this thread should not be closed. Nothing is completed.
The card uselessly creates fake data points and artificially creates huge system load.

So the user has to choose between "loss of information" and excessive system load.
The user cannot know how many data points there are. The card does/can. So clearly the user has to choose between the above because of this bug.

Please read my posts before closing, as those issues are not resolved and the card remains unusable for real data tracking until they are. It is just a "fun", pretty card with absolutely no data-analysis usability.

@ildar170975 ildar170975 reopened this Dec 1, 2023
@ildar170975
Collaborator

ildar170975 commented Dec 1, 2023

I may reopen, no problem.
But creating these points is not a bug. This is done by design. These points are not fake; they repeat actual points.

If #366 is implemented, then we'll have 2 options:
-- display only actual data - no points_per_hour, no missed points, no interpolation, no repeated points;
-- display as it is CURRENTLY - allowing trends to be shown (and probably there will be no sense in using a higher points_per_hour value).

@AleXSR700
Author

AleXSR700 commented Dec 1, 2023

Hello Ildar,

They are fake because if the data point is at 7 pm and the card creates one at 8 pm, then that data point is fake.

If, e.g., your PowerDelta is 20, then you might not get an updated value for 2 hours. That does not mean that the value remained the same. It means that it did not change significantly enough.
That is a big difference.

So it is not a feature. And it makes no sense either. Creating lots of fake data points does not result in a better fit. That is simply false. If you want a better fit, then you need more fit options. Fake data points are not that.

The best representation is always the maximum of real data points. Any fake additional points will be biased and less accurate.
So it is a bug, because the maximum number of data points the card creates should equal the actual number of data points available.

The best approach would be a setting that lets you specify an upper limit,
i.e. define the maximum number of data points to plot, and if the real number is smaller, plot all real data points.

@ildar170975
Collaborator

ildar170975 commented Dec 1, 2023

Assume a device is polled every XX minutes.
We have this history-graph:
[screenshot]
Now, based on your logic - is the red point fake?

Answer: no, it is an acceptable assumption.
An actual physical sensor may measure some parameter every 10 sec; a device may process these data and generate a "point" every 30 sec; HA may poll the device every 1 minute.
Repeating the same value until the next acquired change is an acceptable assumption.
The same goes for generating the "enriched superset" with a higher points_per_hour.

I repeat for the 100th time: nobody says the current method is 100% true; one more method, #366, should be added - as an addition, not a replacement. But then BOTH methods will be assumptions ANYWAY.
