Improving Trip Segmentation by reducing DB calls #958

humbleOldSage · 2024-02-06T22:38:08Z

The changes below that led to these performance upgrades are investigated in e-mission/e-mission-docs#1041 . They are :

db calls for transition and motion dataframes are moved upstream from is_tracking_restarted_in_range function and get_ongoing_motion_in_range in restart_checking.py to trip_segmentaiton.py. The old setting which had multiple db calls ( for each iteration ) now happen once in the improved setting.
All the other changes in trip_segmentation.py,dwell_segmentation_dist_filter.py and dwell_segmentation_time_filter.py are just to support the change in point 1 ( above).

The changes below that led to these performance upgrades are investigated in e-mission/e-mission-docs#1041 . They are : 1. db calls for transition and motion dataframes are moved upstream from `is_tracking_restarted_in_range` function and `get_ongoing_motion_in_range` in `restart_checking.py` to `trip_segmentaiton.py`. The old setting which had multiple db calls ( for each iteration ) now happen once in the improved setting. 2. All the other changes in `trip_segmentation.py`,`dwell_segmentation_dist_filter.py` and `dwell_segmentation_time_filter.py` are just to support the change in point 1 ( above).

humbleOldSage · 2024-02-06T23:00:33Z

Replying to the comment #956 (comment) here :

Please also separate the vectorization from the code that actually reduces DB calls so we can evaluate each of them independently

The current PR just has reduced DB calls. In this case, the runtime for androidWrapper is at ~2.1s ( as mentioned here e-mission/e-mission-docs#1041 (comment) in the issue) , IOS Wrapper at 0.4s and CombinedWrapper at ~2.6s.

Also, what are you testing this on?

I am using the TestTripSegmentation.py that read data from emission/tests/data/real_examples/shankari_2015-aug-27

Why do we have the magic number of 327?

When the data in emission/tests/data/real_examples/shankari_2015-aug-27 is setup, querying for 'background/filtered_location' from segment_into_trips indwell_segmentation_time_filter.py returns 327 points .

     filtered_points_pre_ts_diff_df = timeseries.get_data_df("background/filtered_location", time_query)

The for loop then iterates on these points

    for idx, row in filtered_points_df.iterrows():

humbleOldSage · 2024-02-06T23:17:59Z

there is a comment in the issue that indicates that there were some failures, but I don't see any updates on the fix. e-mission/e-mission-docs#1041 (comment)

This is in regards to the fact that the db call for "background/filtered_location":

 ts.get_data_df("background/filtered_location", time_query)

Is present in these 3 files trip_segmentation, dwell_segmentation_dist_filter.py and dwell_segmentation_time_filter.py. Since dwell_segmentation_dist_filter.py and dwell_segmentation_time_filter.py are downstream from trip_segmentation, the calls in these 2 should be easily replaceable by passing dataframe from trip_segementation and thus there should just be 1 db call. However, if we do so the test for combinedWrapper in TestTripSegmentation fails. So, for now there are separate calls in each function.

This causes an overhead of ~0.2s for androidWrapper and IosWrapper in the current test setup.

However, there is nothing else failing here.

shankari · 2024-02-08T02:22:26Z

The current PR just has reduced DB calls. In this case, the runtime for androidWrapper is at ~2.1s ( as mentioned here e-mission/e-mission-docs#1041 (comment) in the issue) , IOS Wrapper at 0.4s and CombinedWrapper at ~2.6s.

And what was it before? Just reporting the new number does not capture the difference. I don't see a comment in the issue for the time taken in androidWrapper before your change. I only see "It took ~6.8s to run the entire segment_current_trips from ..."

I am using the TestTripSegmentation.py that read data from emission/tests/data/real_examples/shankari_2015-aug-27

This is not a sufficient test. shankari_2015-aug-27 only contains android data. That is why you see that "IOS Wrapper at 0.4s", it is not actually running any significant iOS code. You have also flagged this as "Looking into the Android run time for now.". You need to test and report real numbers on iOS as well.

humbleOldSage · 2024-02-13T21:35:52Z

I am using the TestTripSegmentation.py that read data from emission/tests/data/real_examples/shankari_2015-aug-27

This is not a sufficient test. shankari_2015-aug-27 only contains android data.

On second look, for the iOSWrapper, the test reads data fromiphone_2015-11-06. All in all, there are indeed two real world example files, one for android, another for iOS in the test.

    etc.setupRealExample(self, "emission/tests/data/real_examples/shankari_2015-aug-27")
    self.androidUUID = self.testUUID

    .
    .
    .
    .
    with open("emission/tests/data/real_examples/iphone_2015-11-06") as fp:

That is why you see that "IOS Wrapper at 0.4s", it is not actually running any significant iOS code.

I believe the IosWrapper is at 0.4s due to lesser trips, which we come to know after running the pipeline. ( 2 in case of iOS vs 8 for android ).

I tried other real example files with platform as iOS. The other files with iOS data are :

iphone_2016-02-22 ,
iphone_3_2016-02-22,
iphone_2016-02-22,
sunil_2016-07-20

once the example is set (using all of them one by one), there are are either 0 or 2 trips for iOS, giving the similar runtime of ~0.4s - 0.5s.

humbleOldSage · 2024-02-13T22:17:00Z

The runtimes are :

	iOS	Android	Combined
Previous	~0.6s	~6.8s	~7.9s
Current.	~0.4s	~2.1s	~2.6s

shankari · 2024-03-16T00:59:35Z

@JGreenlee scalability improvement

shankari

I don't think that < 10 trips is going to give us a real sense of the scalability improvement. Please run more complex tests:

TestPipelineRealData for a combination of correctness and performance
against a real large dataset, after resetting the pipeline, for a rough estimate of performance.

shankari · 2024-03-16T05:20:27Z

emission/analysis/intake/segmentation/restart_checking.py

+    transition_df_start_idx=transition_df.ts.searchsorted(start_ts,side='left')    
+    transition_df_end_idx=transition_df.ts.searchsorted(end_ts,side='right')
+    transition_df_for_current=transition_df.iloc[transition_df_start_idx:transition_df_end_idx]


Is there a reason you are using this instead of something like

transition_df_for_current=transition_df[transition_df.ts >= start_ts && transition_df.ts <= end_ts]

or

transition_df_for_current=transition_df.query('ts >= start_ts & .ts <= end_ts')

First was the performance reason, O(log (n)) here vs O (n) in others.
Second, in case of query(), it creates a boolean series that marks rows to keep or discard, which happens internally, which increases temporary memory usage. For very large DataFrames, this can be an issue.

can use this one ,if log(n) vs O(n) is not an issue :

transition_df_for_current=transition_df[transition_df.ts >= start_ts && transition_df.ts <= end_ts]

shankari · 2024-03-16T05:21:42Z

emission/analysis/intake/segmentation/restart_checking.py

+    motion_df_start_idx=motion_df.ts.searchsorted(start_ts,side='left')    
+    motion_df_end_idx=motion_df.ts.searchsorted(end_ts,side='right')
+    filtered_motion_df=motion_df.iloc[motion_df_start_idx:motion_df_end_idx]


shankari · 2024-03-16T05:26:03Z

emission/analysis/intake/segmentation/trip_segmentation.py

@@ -7,6 +7,7 @@
 from builtins import *
 from builtins import object
 import logging
+import pandas as pd


why do we need to add this import?

Unused currently. Removed.

shankari · 2024-03-16T05:30:53Z

...ion/analysis/intake/segmentation/trip_segmentation_methods/dwell_segmentation_dist_filter.py

-    def has_trip_ended(self, lastPoint, currPoint, timeseries):
+    def has_trip_ended(self, lastPoint, currPoint, timeseries, motion_df):


if you are already passing in the motion_df here, why do you need to also pass in the timeseries?

it's being used here :

timeseries.invalidate_raw_entry(currPoint["_id"])

shankari · 2024-03-16T05:31:24Z

...ion/analysis/intake/segmentation/trip_segmentation_methods/dwell_segmentation_dist_filter.py

-            if eaisr.is_tracking_restarted_in_range(lastPoint.ts, currPoint.ts, timeseries):
+            if eaisr.is_tracking_restarted_in_range(lastPoint.ts, currPoint.ts, self.transition_df):


why is transition_df a module field?

shankari · 2024-03-16T05:31:50Z

...ion/analysis/intake/segmentation/trip_segmentation_methods/dwell_segmentation_dist_filter.py

-            ongoing_motion_in_range = eaisr.get_ongoing_motion_in_range(lastPoint.ts, currPoint.ts, timeseries)
+            ongoing_motion_in_range = eaisr.get_ongoing_motion_in_range(lastPoint.ts, currPoint.ts, motion_df)


while motion_df is not. They are both read upfront and should be treated the same way for clarity

shankari · 2024-03-16T05:34:25Z

...on/analysis/intake/segmentation/trip_segmentation_methods/trip_end_detection_corner_cases.py

-    non_still_motions = [ma for ma in motionInRange if ma["data"]["type"] not in ignore_modes_list and ma["data"]["confidence"] == 100] 
-    logging.debug("non_still_motions = %s" % [(ecwm.MotionTypes(ma["data"]["type"]), ma["data"]["confidence"], ma["data"]["fmt_time"]) for ma in non_still_motions])
+    non_still_motions=motionInRange[~motionInRange['type'].isin(ignore_modes_list) & (motionInRange['confidence'] ==100)]
+    #logging.debug("non_still_motions = %s" % [(ecwm.MotionTypes(ma["data"]["type"]), ma["data"]["confidence"], ma["data"]["fmt_time"]) for ma in non_still_motions])    logging.debug("non_still_motions = %s" % [(ecwm.MotionTypes(ma["data"]["type"]), ma["data"]["confidence"], ma["data"]["fmt_time"]) for ma in non_still_motions])


this change doesn't appear to be related to this fix.

shankari · 2024-03-16T05:35:09Z

...ion/analysis/intake/segmentation/trip_segmentation_methods/dwell_segmentation_time_filter.py

-    def has_trip_ended(self, prev_point, curr_point, timeseries, last10PointsDistances, last5MinsDistances, last5MinTimes):
+    def has_trip_ended(self, prev_point, curr_point, timeseries, last10PointsDistances, last5MinsDistances, last5MinTimes, transition_df, motion_df):


Ditto. if we are passing in the transition_df and motion_df, why do we also need the timeseries?

Here, it's unused. Cleared.

Code cleanup as per the comments.

humbleOldSage mentioned this pull request Feb 6, 2024

Improving Trip Segmentation by reducing DB calls #956

Closed

shankari requested changes Mar 16, 2024

View reviewed changes

Faster Trip Segmentation Cleanup

89c2387

Code cleanup as per the comments.

shankari mentioned this pull request May 22, 2024

Pipeline stuck for user because TRIP_SEGMENTATION did not mark 'completed' nor 'failed' e-mission/e-mission-docs#1075

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving Trip Segmentation by reducing DB calls #958

Improving Trip Segmentation by reducing DB calls #958

humbleOldSage commented Feb 6, 2024

humbleOldSage commented Feb 6, 2024

humbleOldSage commented Feb 6, 2024

shankari commented Feb 8, 2024

humbleOldSage commented Feb 13, 2024 •

edited

humbleOldSage commented Feb 13, 2024

shankari commented Mar 16, 2024

shankari left a comment

shankari Mar 16, 2024

humbleOldSage Mar 29, 2024

shankari Mar 16, 2024

shankari Mar 16, 2024

humbleOldSage Mar 29, 2024

shankari Mar 16, 2024

humbleOldSage Apr 3, 2024

shankari Mar 16, 2024

shankari Mar 16, 2024

shankari Mar 16, 2024

shankari Mar 16, 2024

humbleOldSage Apr 3, 2024

		def has_trip_ended(self, lastPoint, currPoint, timeseries):
		def has_trip_ended(self, lastPoint, currPoint, timeseries, motion_df):

		if eaisr.is_tracking_restarted_in_range(lastPoint.ts, currPoint.ts, timeseries):
		if eaisr.is_tracking_restarted_in_range(lastPoint.ts, currPoint.ts, self.transition_df):

		ongoing_motion_in_range = eaisr.get_ongoing_motion_in_range(lastPoint.ts, currPoint.ts, timeseries)
		ongoing_motion_in_range = eaisr.get_ongoing_motion_in_range(lastPoint.ts, currPoint.ts, motion_df)

		def has_trip_ended(self, prev_point, curr_point, timeseries, last10PointsDistances, last5MinsDistances, last5MinTimes):
		def has_trip_ended(self, prev_point, curr_point, timeseries, last10PointsDistances, last5MinsDistances, last5MinTimes, transition_df, motion_df):

Improving Trip Segmentation by reducing DB calls #958

Are you sure you want to change the base?

Improving Trip Segmentation by reducing DB calls #958

Conversation

humbleOldSage commented Feb 6, 2024

humbleOldSage commented Feb 6, 2024

humbleOldSage commented Feb 6, 2024

shankari commented Feb 8, 2024

humbleOldSage commented Feb 13, 2024 • edited

humbleOldSage commented Feb 13, 2024

shankari commented Mar 16, 2024

shankari left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

humbleOldSage commented Feb 13, 2024 •

edited