You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Describe the bug
We are currently facing a significant discrepancy in the results when creating a windowed table from a stream of IoT data in our ksqlDB setup. The observed output significantly differs from the expected results.
The stream includes a devicedatetime field, which we have designated as the timestamp field for creating windows.
CREATE STREAM STRTRACKERDATA_TRIM WITH (KAFKA_TOPIC='strtrackerdata_trim', PARTITIONS=50, REPLICAS=1, RETENTION_MS=3600000, TIMESTAMP='devicedatetime', VALUE_FORMAT='JSON') AS SELECT DEVICEID, REASON, RMSEVENTTYPE, DEVICEDATETIME, UNIQUEID, RECORDSTATUS FROM STRTRACKERDATA EMIT CHANGES;
In the table tblrmsabnormalevent_test, our usecase require grouping on windowed aggregation and for uniquely identify the records in the window. We have organized the data by applying the GROUP BY clause to the fields REASON, RMSEVENTTYPE, DEVICEID, and TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd','Asia/Dubai') to structure and store the data.
CREATE TABLE tblrmsabnormalevent_test WITH (KEY_FORMAT='JSON') AS
SELECT
DEVICEID,
REASON,
RMSEVENTTYPE,
AS_VALUE(DEVICEID) AS DEVICE_ID,
AS_VALUE(REASON) AS DEVICE_REASON,
AS_VALUE(RMSEVENTTYPE) AS DEVICE_RMSEVENTTYPE,
COUNT() AS EVENTCOUNT,
FROM_UNIXTIME(WINDOWSTART) as WINDOW_START,
FROM_UNIXTIME(WINDOWEND) as WINDOW_END,
TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd','Asia/Dubai') AS WINDOW_HOUR
FROM
STRTRACKERDATA_TRIM
WINDOW TUMBLING ( SIZE 1 DAY, GRACE PERIOD 1 HOUR )
GROUP BY REASON, RMSEVENTTYPE, DEVICEID,TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd','Asia/Dubai')
HAVING (
(reason=1096 AND count()>(37524))
OR (reason=1098 AND count()>(1024))
OR (reason=1097 AND count()>(424))
OR (reason NOT IN(1096,1098,1097) AND count()>(8*24))) emit final;
To Reproduce
Steps to reproduce the behavior, include:
The version of KSQL.
0.29
Sample source data.
In the stream strtrackerdata_trim - we have the following items
In the stream STRTRACKERDATA_TRIM - we have more than 1 crore records
expected behaviour:
When we are doing aggreation by grouping GROUP BY REASON, RMSEVENTTYPE, DEVICEID,TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd','Asia/Dubai') with windowing or data extraction and aggregation with external application we are getting similar result as below
CREATE TABLE rmsanalysis_withoutwindowing WITH (KEY_FORMAT='JSON') AS
SELECT
DEVICEID,
REASON,
RMSEVENTTYPE,
AS_VALUE(DEVICEID) AS DEVICE_ID,
AS_VALUE(REASON) AS DEVICE_REASON,
AS_VALUE(RMSEVENTTYPE) AS DEVICE_RMSEVENTTYPE,
COUNT() AS EVENTCOUNT,
TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd','Asia/Dubai') AS WINDOW_HOUR
FROM
STRTRACKERDATA_TRIM
GROUP BY REASON, RMSEVENTTYPE, DEVICEID,TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd','Asia/Dubai')
HAVING (
(reason=1096 AND count()>(37524))
OR (reason=1098 AND count()>(1024))
OR (reason=1097 AND count()>(424))
OR (reason NOT IN(1096,1098,1097) AND count()>(8*24))) emit final;
We don't understand where the issue was and why there is so much data Discrepancy. On running the aggregation we are seeing that aggregated values for the columns are coming incorrectly. The aggregated values seems to be lesser than the actual values expected.
The text was updated successfully, but these errors were encountered:
nagababu9652
changed the title
Data Discrepancy in Windowed Table Creation from IoT Data Stream
Data Discrepancy in Tumbling Windowed Table Creation from IoT Data Stream
Jan 30, 2024
Describe the bug
We are currently facing a significant discrepancy in the results when creating a windowed table from a stream of IoT data in our ksqlDB setup. The observed output significantly differs from the expected results.
The stream includes a devicedatetime field, which we have designated as the timestamp field for creating windows.
CREATE STREAM STRTRACKERDATA_TRIM WITH (KAFKA_TOPIC='strtrackerdata_trim', PARTITIONS=50, REPLICAS=1, RETENTION_MS=3600000, TIMESTAMP='devicedatetime', VALUE_FORMAT='JSON') AS SELECT DEVICEID, REASON, RMSEVENTTYPE, DEVICEDATETIME, UNIQUEID, RECORDSTATUS FROM STRTRACKERDATA EMIT CHANGES;
In the table tblrmsabnormalevent_test, our usecase require grouping on windowed aggregation and for uniquely identify the records in the window. We have organized the data by applying the GROUP BY clause to the fields REASON, RMSEVENTTYPE, DEVICEID, and TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd','Asia/Dubai') to structure and store the data.
CREATE TABLE tblrmsabnormalevent_test WITH (KEY_FORMAT='JSON') AS
SELECT
DEVICEID,
REASON,
RMSEVENTTYPE,
AS_VALUE(DEVICEID) AS DEVICE_ID,
AS_VALUE(REASON) AS DEVICE_REASON,
AS_VALUE(RMSEVENTTYPE) AS DEVICE_RMSEVENTTYPE,
COUNT() AS EVENTCOUNT,
FROM_UNIXTIME(WINDOWSTART) as WINDOW_START,
FROM_UNIXTIME(WINDOWEND) as WINDOW_END,
TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd','Asia/Dubai') AS WINDOW_HOUR
FROM
STRTRACKERDATA_TRIM
WINDOW TUMBLING ( SIZE 1 DAY, GRACE PERIOD 1 HOUR )
GROUP BY REASON, RMSEVENTTYPE, DEVICEID,TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd','Asia/Dubai')
HAVING (
(reason=1096 AND count()>(37524))
OR (reason=1098 AND count()>(1024))
OR (reason=1097 AND count()>(424))
OR (reason NOT IN(1096,1098,1097) AND count()>(8*24))) emit final;
To Reproduce
Steps to reproduce the behavior, include:
+--------------------------+--------------------------+--------------------------+--------------------------+--------------------------+
|DEVICEID |REASON |RMSEVENTTYPE |DEVICEDATETIME |UNIQUEID |RECORDSTATUS
+--------------------------+--------------------------+--------------------------+--------------------------+--------------------------+
|***********3337 |1096 |none |2024-01-25T13:02:31.662 |17202 |1 |
|***********8067 |1096 |none |2024-01-25T08:28:26.366 |21191 |1 |
|***********9377 |1096 |none |2024-01-25T12:59:26.000 |28145 |1 |
|***********6900 |1096 |none |2024-01-25T13:28:25.000 |63235 |1 |
In the stream STRTRACKERDATA_TRIM - we have more than 1 crore records
expected behaviour:
When we are doing aggreation by grouping GROUP BY REASON, RMSEVENTTYPE, DEVICEID,TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd','Asia/Dubai') with windowing or data extraction and aggregation with external application we are getting similar result as below
CREATE TABLE rmsanalysis_withoutwindowing WITH (KEY_FORMAT='JSON') AS
SELECT
DEVICEID,
REASON,
RMSEVENTTYPE,
AS_VALUE(DEVICEID) AS DEVICE_ID,
AS_VALUE(REASON) AS DEVICE_REASON,
AS_VALUE(RMSEVENTTYPE) AS DEVICE_RMSEVENTTYPE,
COUNT() AS EVENTCOUNT,
TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd','Asia/Dubai') AS WINDOW_HOUR
FROM
STRTRACKERDATA_TRIM
GROUP BY REASON, RMSEVENTTYPE, DEVICEID,TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd','Asia/Dubai')
HAVING (
(reason=1096 AND count()>(37524))
OR (reason=1098 AND count()>(1024))
OR (reason=1097 AND count()>(424))
OR (reason NOT IN(1096,1098,1097) AND count()>(8*24))) emit final;
select * from rmsanalysis_withoutwindowing where window_hour = '2024-01-26' and eventcount > 10000 emit changes;
+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
|REASON |RMSEVENTTYPE |DEVICEID |WINDOW_HOUR |DEVICE_ID |DEVICE_REASON |DEVICE_RMSEVENTTYPE|EVENTCOUNT |
+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+
|1102 |"accident" |869395034886220 |2024-01-26 |869395034886220 |1102 |"accident" |71840 |
|1096 |none |864004044727409 |2024-01-26 |864004044727409 |1096 |none |55813 |
|1096 |none |868259024679634 |2024-01-26 |868259024679634 |1096 |none |13476 |
|1098 |none |359804081928998 |2024-01-26 |359804081928998 |1098 |none |41384 |
|1096 |none |864004044727235 |2024-01-26 |864004044727235 |1096 |none |10258 |
|1096 |none |864004044732086 |2024-01-26 |864004044732086 |1096 |none |16627 |
|1098 |none |359804081279582 |2024-01-26 |359804081279582 |1098 |none |17590 |
Actual behaviour:
When we are running the same query for the windowed table we are not getting the records with eventcount > 10000 at all.
select * from tblrmsabnormalevent_test where window_hour = '2024-01-26' and eventcount > 10000 emit changes;
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
|REASON |RMSEVENTTYPE|DEVICEID |WINDOW_HOUR |WINDOWSTART |WINDOWEND |DEVICE_ID |DEVICE_REASO|DEVICE_RMSEV|EVENTCOUNT |WINDOW_START|WINDOW_END |
| | | | | | | |N |ENTTYPE | | | |
+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
Query terminated
when eventcount> 2000 we are getting below ouput.
We don't understand where the issue was and why there is so much data Discrepancy. On running the aggregation we are seeing that aggregated values for the columns are coming incorrectly. The aggregated values seems to be lesser than the actual values expected.
The text was updated successfully, but these errors were encountered: