Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving Trip Segmentation by reducing DB calls #958

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

humbleOldSage
Copy link
Contributor

The changes below that led to these performance upgrades are investigated in e-mission/e-mission-docs#1041 . They are :

  1. db calls for transition and motion dataframes are moved upstream from is_tracking_restarted_in_range function and get_ongoing_motion_in_range in restart_checking.py to trip_segmentaiton.py. The old setting which had multiple db calls ( for each iteration ) now happen once in the improved setting.

  2. All the other changes in trip_segmentation.py,dwell_segmentation_dist_filter.py and dwell_segmentation_time_filter.py are just to support the change in point 1 ( above).

The changes  below  that led to these performance upgrades are investigated in   e-mission/e-mission-docs#1041 . They are :

1.  db calls for transition and motion dataframes are moved upstream from  `is_tracking_restarted_in_range` function  and `get_ongoing_motion_in_range` in `restart_checking.py` to `trip_segmentaiton.py`.  The old setting which had multiple db calls ( for each iteration ) now happen once in the improved setting.

2. All the other changes in `trip_segmentation.py`,`dwell_segmentation_dist_filter.py` and
 `dwell_segmentation_time_filter.py` are just to support the change  in point 1 ( above).
@humbleOldSage
Copy link
Contributor Author

Replying to the comment #956 (comment) here :

Please also separate the vectorization from the code that actually reduces DB calls so we can evaluate each of them independently

The current PR just has reduced DB calls. In this case, the runtime for androidWrapper is at ~2.1s ( as mentioned here e-mission/e-mission-docs#1041 (comment) in the issue) , IOS Wrapper at 0.4s and CombinedWrapper at ~2.6s.

Also, what are you testing this on?

I am using the TestTripSegmentation.py that read data from emission/tests/data/real_examples/shankari_2015-aug-27

Why do we have the magic number of 327?

When the data in emission/tests/data/real_examples/shankari_2015-aug-27 is setup, querying for 'background/filtered_location' from segment_into_trips indwell_segmentation_time_filter.py returns 327 points .

     filtered_points_pre_ts_diff_df = timeseries.get_data_df("background/filtered_location", time_query)

The for loop then iterates on these points

    for idx, row in filtered_points_df.iterrows():

@humbleOldSage
Copy link
Contributor Author

there is a comment in the issue that indicates that there were some failures, but I don't see any updates on the fix. e-mission/e-mission-docs#1041 (comment)

This is in regards to the fact that the db call for "background/filtered_location":

 ts.get_data_df("background/filtered_location", time_query)

Is present in these 3 files trip_segmentation, dwell_segmentation_dist_filter.py and dwell_segmentation_time_filter.py. Since dwell_segmentation_dist_filter.py and dwell_segmentation_time_filter.py are downstream from trip_segmentation, the calls in these 2 should be easily replaceable by passing dataframe from trip_segementation and thus there should just be 1 db call. However, if we do so the test for combinedWrapper in TestTripSegmentation fails. So, for now there are separate calls in each function.

This causes an overhead of ~0.2s for androidWrapper and IosWrapper in the current test setup.

However, there is nothing else failing here.

@shankari
Copy link
Contributor

shankari commented Feb 8, 2024

The current PR just has reduced DB calls. In this case, the runtime for androidWrapper is at ~2.1s ( as mentioned here e-mission/e-mission-docs#1041 (comment) in the issue) , IOS Wrapper at 0.4s and CombinedWrapper at ~2.6s.

And what was it before? Just reporting the new number does not capture the difference. I don't see a comment in the issue for the time taken in androidWrapper before your change. I only see "It took ~6.8s to run the entire segment_current_trips from ..."

I am using the TestTripSegmentation.py that read data from emission/tests/data/real_examples/shankari_2015-aug-27

This is not a sufficient test. shankari_2015-aug-27 only contains android data. That is why you see that "IOS Wrapper at 0.4s", it is not actually running any significant iOS code. You have also flagged this as "Looking into the Android run time for now.". You need to test and report real numbers on iOS as well.

@humbleOldSage
Copy link
Contributor Author

humbleOldSage commented Feb 13, 2024

I am using the TestTripSegmentation.py that read data from emission/tests/data/real_examples/shankari_2015-aug-27

This is not a sufficient test. shankari_2015-aug-27 only contains android data.

On second look, for the iOSWrapper, the test reads data fromiphone_2015-11-06. All in all, there are indeed two real world example files, one for android, another for iOS in the test.

    etc.setupRealExample(self, "emission/tests/data/real_examples/shankari_2015-aug-27")
    self.androidUUID = self.testUUID

    .
    .
    .
    .
    with open("emission/tests/data/real_examples/iphone_2015-11-06") as fp:

That is why you see that "IOS Wrapper at 0.4s", it is not actually running any significant iOS code.

I believe the IosWrapper is at 0.4s due to lesser trips, which we come to know after running the pipeline. ( 2 in case of iOS vs 8 for android ).

I tried other real example files with platform as iOS. The other files with iOS data are :

iphone_2016-02-22 ,
iphone_3_2016-02-22,
iphone_2016-02-22,
sunil_2016-07-20

once the example is set (using all of them one by one), there are are either 0 or 2 trips for iOS, giving the similar runtime of ~0.4s - 0.5s.

@humbleOldSage
Copy link
Contributor Author

The runtimes are :

iOS Android Combined
Previous ~0.6s ~6.8s ~7.9s
Current. ~0.4s ~2.1s ~2.6s

@shankari
Copy link
Contributor

@JGreenlee scalability improvement

Copy link
Contributor

@shankari shankari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that < 10 trips is going to give us a real sense of the scalability improvement. Please run more complex tests:

  • TestPipelineRealData for a combination of correctness and performance
  • against a real large dataset, after resetting the pipeline, for a rough estimate of performance.

Comment on lines +20 to +22
transition_df_start_idx=transition_df.ts.searchsorted(start_ts,side='left')
transition_df_end_idx=transition_df.ts.searchsorted(end_ts,side='right')
transition_df_for_current=transition_df.iloc[transition_df_start_idx:transition_df_end_idx]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you are using this instead of something like

    transition_df_for_current=transition_df[transition_df.ts >= start_ts && transition_df.ts <= end_ts]

or

    transition_df_for_current=transition_df.query('ts >= start_ts & .ts <= end_ts')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First was the performance reason, O(log (n)) here vs O (n) in others.
Second, in case of query(), it creates a boolean series that marks rows to keep or discard, which happens internally, which increases temporary memory usage. For very large DataFrames, this can be an issue.

can use this one ,if log(n) vs O(n) is not an issue :

transition_df_for_current=transition_df[transition_df.ts >= start_ts && transition_df.ts <= end_ts]

Comment on lines +39 to +41
motion_df_start_idx=motion_df.ts.searchsorted(start_ts,side='left')
motion_df_end_idx=motion_df.ts.searchsorted(end_ts,side='right')
filtered_motion_df=motion_df.iloc[motion_df_start_idx:motion_df_end_idx]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

@@ -7,6 +7,7 @@
from builtins import *
from builtins import object
import logging
import pandas as pd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to add this import?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused currently. Removed.

def has_trip_ended(self, lastPoint, currPoint, timeseries):
def has_trip_ended(self, lastPoint, currPoint, timeseries, motion_df):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are already passing in the motion_df here, why do you need to also pass in the timeseries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's being used here :

   timeseries.invalidate_raw_entry(currPoint["_id"])

if eaisr.is_tracking_restarted_in_range(lastPoint.ts, currPoint.ts, timeseries):
if eaisr.is_tracking_restarted_in_range(lastPoint.ts, currPoint.ts, self.transition_df):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is transition_df a module field?

ongoing_motion_in_range = eaisr.get_ongoing_motion_in_range(lastPoint.ts, currPoint.ts, timeseries)
ongoing_motion_in_range = eaisr.get_ongoing_motion_in_range(lastPoint.ts, currPoint.ts, motion_df)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while motion_df is not. They are both read upfront and should be treated the same way for clarity

Comment on lines 26 to 27
non_still_motions = [ma for ma in motionInRange if ma["data"]["type"] not in ignore_modes_list and ma["data"]["confidence"] == 100]
logging.debug("non_still_motions = %s" % [(ecwm.MotionTypes(ma["data"]["type"]), ma["data"]["confidence"], ma["data"]["fmt_time"]) for ma in non_still_motions])
non_still_motions=motionInRange[~motionInRange['type'].isin(ignore_modes_list) & (motionInRange['confidence'] ==100)]
#logging.debug("non_still_motions = %s" % [(ecwm.MotionTypes(ma["data"]["type"]), ma["data"]["confidence"], ma["data"]["fmt_time"]) for ma in non_still_motions]) logging.debug("non_still_motions = %s" % [(ecwm.MotionTypes(ma["data"]["type"]), ma["data"]["confidence"], ma["data"]["fmt_time"]) for ma in non_still_motions])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change doesn't appear to be related to this fix.

def has_trip_ended(self, prev_point, curr_point, timeseries, last10PointsDistances, last5MinsDistances, last5MinTimes):
def has_trip_ended(self, prev_point, curr_point, timeseries, last10PointsDistances, last5MinsDistances, last5MinTimes, transition_df, motion_df):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. if we are passing in the transition_df and motion_df, why do we also need the timeseries?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, it's unused. Cleared.

Code cleanup as per the comments.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Review done; Changes requested
2 participants