Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check unit test code #826

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Conversation

corinne-hcr
Copy link
Contributor

TestDataPreprocessing.py and TestGetUsers.py need to read trips. Also need help with setup and teardown function.

@shankari
Copy link
Contributor

Please submit a separate PR for the code refactoring since the unit tests are not complete.

@shankari shankari mentioned this pull request Jul 19, 2021
@corinne-hcr
Copy link
Contributor Author

Just for records.
1.I am currently reading in shankari_2016-06-20.expected_confirmed_trips. I can get 6 confirmed trips in total. But when the process goes to similarity code (filter_too_short function), the code cannot read start_place.
2.To be able to test the functions for analyzing common trips, I need to make common trips.
3.Here are codes for creating start_place(or end_place), suggested by @shankari

import emission.core.wrapper.place as ecwp
start_place = ecwp.Place()
start_place.location = ct.start_loc
start_place['_id'] = ct.start_place
import emisson.storage.timeseries.abstract_timeseries as esta
esta.update(start_place)

P.S. ct is an instance of a confirmed trip for which you are trying to create a start place

@corinne-hcr
Copy link
Contributor Author

I am still blocked by this:
start_place is an instance of Place(), contains the location the id. But esta.update(start_place) doesn't work.

AttributeError: module 'emission.storage.timeseries.abstract_timeseries' has no attribute 'update'

The definition of update() is pass.
I don't know how to solve that.

@shankari
Copy link
Contributor

pass is a static method in timeseries (esta.Timeseries.update) should work. Again, my suggestions are not intended to be runnable code - they give you an indication of the methods you should look at, but you need to actually understand the python and adapt as necessary.

@corinne-hcr
Copy link
Contributor Author

I run these codes to get filter_trips, butesta.TimeSeries.update(start_place) doesn't work.
The error is

start_lon = start_place.data.location["coordinates"][0]
AttributeError: 'NoneType' object has no attribute 'data'
import emission.core.wrapper.place as ecwp
import emission.storage.timeseries.abstract_timeseries as esta
self.readAndStoreTripsFromFile("emission/tests/data/real_examples/shankari_2016-06-20.expected_confirmed_trips")
ts = esta.TimeSeries.get_time_series(self.testUUID)
user = self.testUUID
trips = preprocess.read_data(user)
for trip in trips:
    ct = trip
    start_place = ecwp.Place()
    start_place.location = ct.data.start_loc
    start_place['_id'] = ct.data.start_place
    esta.TimeSeries.update(start_place)
filter_trips = preprocess.filter_data(trips, 100)
print(filter_trips)

@shankari
Copy link
Contributor

Again, my suggestions are not intended to be runnable code - they give you an indication of the methods you should look at, but you need to actually understand the code and adapt as necessary.

You are trying to read start_place.data.location. You are setting start_place.location. Make them be consistent and it will work.

If I need to give you runnable code, I might as well write the code myself.

@corinne-hcr
Copy link
Contributor Author

To be consistent like this?
But it still doesn't work

user = self.testUUID
trips = preprocess.read_data(user)
for trip in trips:
  ct = trip
  ct.data.start_place = ecwp.Place()
  ct.data.start_place.location = ct.data.start_loc
  ct.data.start_place.data['_id'] = ct.data.start_place
  esta.TimeSeries.update(ct.data.start_place.data)
  test = similarity.filter_too_short([ct], 100)

For this line,

ct.data.start_place = ecwp.Place()

the error is

AttributeError: property start_place is read-only

If I use

start_place.location = ct.start_loc
start_place.data['_id'] = ct.start_place

the error is

AttributeError: property start_loc is not defined for Entry

Since ct is an instance of a confirmed trip, so it is in Entry class. Only ct.data.start_loc can access start_loc

@shankari
Copy link
Contributor

shankari commented Jul 22, 2021

@corinne-hcr have you checked out the e-mission data model and the different wrapper classes?
https://github.com/e-mission/e-mission-server/tree/master/emission/core/wrapper

They are linked from the timeseries notebook (Timeseries_Sample.ipynb) that you experimented with earlier. The notebook also outlines the various ways to access the timeline objects as entry objects, data tables, etc. Please review them and become familiar with the data model.

To be consistent like this?

No. You don't want to modify the cleaned trip object. you want to read the cleaned trip object and modify the start place, since that is the object that you are creating.

Concretely. Entry objects should set data into the "data" sub-object.

    start_place["data"]["location"] = ct.data.start_loc

@shankari
Copy link
Contributor

Double-checking the code

def update(entry):

works with an entry.

The Entry class includes both data and metadata. Place is an example of data, so you need to call

def update_data(user_id, key, obj_id, data):

instead.

@shankari
Copy link
Contributor

shankari commented Jul 22, 2021

To clarify: my previous comment:

Concretely. Entry objects should set data into the "data" sub-object.

Assumed that you were creating an Entry object. Since you are creating a Place object that corresponds to data in an entry object with metadata.key = analysis/cleaned_place, you can use .location directly.

@shankari
Copy link
Contributor

Again, looking at the implementation of update_data in TimeSeries, it is a static method without a concrete implementation. you need to call the method in BuiltinTimeSeries, which has the actual implementation instead. Note that static methods are not overridden by subclasses.

@corinne-hcr
Copy link
Contributor Author

corinne-hcr commented Jul 23, 2021

I figured that I would get a different result if I changed similarity to extract coordinates directly from trip.data.start_loc["coordinates"](for example) instead of start_place = esda.get_entry(esda.CLEANED_PLACE_KEY, trip.data.start_place)
Here is the result from start_place = esda.get_entry(esda.CLEANED_PLACE_KEY, trip.data.start_place)
1st round
image

2nd round
image

Here is the result from trip.data.start_loc["coordinates"]
1st round
image

2nd round
image

Also, the way I extract coordinates for the 2nd round now is just extracting from the confirmed trip, not from place. Should I change?

@shankari
Copy link
Contributor

shankari commented Jul 23, 2021

I can't see the images, I am not sure that they were uploaded correctly, but that is an interesting finding.

Having said that:

  • the existing results are not immutable. they are not every good (so that we don't need to strive to meet them) and the choice of using the place v/s the trip locations is based on a decision by an undergrad who worked on the project for a summer. I don't think there was a lot of thought put into it, at least, there's no documentation of such thinking. This is why documentation is important :)
  • This is also why unit tests are important. If we had them, we would be able to see where the difference originated and see if it was meaningful, instead of looking at the final result and trying to make sense of it.

Given that the original choice of using the place locations was somewhat arbitrary anyway, I think it is fine to make this change unless the results are dramatically worse. I have already made the change in my local branch as I dig deep into the current similarity code to see how we can improve our results.

Part of the goal of having this kind of testing is to at least know that there are differences, and to reason about whether they are important. If the differences are meaningful, we fix the code. If they are not, and the new code is an improvement, we fix the tests. We are in the second case here.

@shankari
Copy link
Contributor

@corinne-hcr I still can't see the images

@shankari
Copy link
Contributor

@corinne-hcr I can see the images now. I don't see a significant difference between the two results; while individual points are moved around, that falls within the bounds of normal probability differences. If you are familiar with generating boxplots, and can generate a boxplot with the results, that would make it more clear. But generating the boxplot is not a priority.

shankari added a commit to shankari/e-mission-server that referenced this pull request Jul 24, 2021
Instead of looking up the place and getting it instead.

This has two advantages over the current implementation:
1. We don't have to make 2 separate database calls for each trip
    Note that we compute an nxn distance matrix, so this is likely to be a
    substantial savings
2. We can pass in a in-memory trip list. That makes it easier to write unit
    tests, and to use alternate load methods (e.g. for working with federated data
    e-mission/e-mission-eval-private-data@952c476

@corinne-hcr reported that the place location and the trip start/loc locations
are not identical. We don't have unit tests to verify this (alas!) but the top
level results are not changed significantly.

So the ROI seems high enough; we are going ahead with this change.
e-mission#826 (comment)
@shankari
Copy link
Contributor

shankari commented Jul 26, 2021

At a high level, there should typically be multiple tests for each function to test various scenarios that might happen, not just ones that happened to occur while processing one particular user in one particular dataset and one possible method. (e.g. corinne-hcr@b46a370) However, this should hopefully catch most regressions while refactoring so it is a good start!

@corinne-hcr Can you list out what additional tests you plan to complete?

I will review this tomorrow, I have a presentation to finish tonight to meet my own deadlines.

@corinne-hcr
Copy link
Contributor Author

Currently, the similarity code is not significant changed. @shankari added new ways of accessing. But the old ways should still work. So the test of similarity code should still be fine.

@shankari
Copy link
Contributor

shankari commented Jul 26, 2021

For the record, @corinne-hcr's questions (asked in private chat) were:

I realize you change and also wrote a test for similarity
Since you have changed similarity code, my test for the old one may be useless. Should I integrate the new similarity code to evaluation_pipeline? e.g. change the first_round code
There are so many changes now. I am confused what my next step is

My response was:

I explicitly did not change the similarity code significantly; your existing code should work without any changes (I was not comfortable making changes without unit tests)
I added new ways of accessing it which I use, but the old ways should still work

@corinne-hcr so do you have any additional questions? How many more tests do you plan to write and what is their ETA?

@corinne-hcr
Copy link
Contributor Author

I plan to write 3 more tests - get_score, second_round_of_clustering, evaluation_pipeline
They should be done by tomorrow's meeting.
I feel overwhelmed with so many changes at this point. I need some time tomorrow to digest the recent PR and changed ways. I don't know which way and what analysis I need to run and put in the paper. I think I will need to ask some questions after reading the codes. Please forgive me that I am probably not able to fully understand the new analysis in a very short time based on my knowledge and skillset.

@shankari
Copy link
Contributor

shankari commented Jul 26, 2021

@corinne-hcr I would start with e-mission/e-mission-eval-private-data@abf4f78 which compares the similarity code with different settings. In particular, the results around whether or not to filter and whether or not to use the cutoff. Note that I am now returning the labels from similarity (mapped to the original trip indices) so we don't need to maintain the "trip" data structure any more.

Note also that I have different results for the filtered_cutoff case than you do. I am pretty sure this is because of a bug in my h-score calculation code.
e-mission/e-mission-eval-private-data#28 (comment)

I don't care about that for the pipeline changes, since my selected settings are no_filter_no_cutoff, but you can feel free to do so to add additional analysis results to the paper!

@corinne-hcr
Copy link
Contributor Author

I don't care about that for the pipeline changes, since my selected settings are no_filter_no_cutoff, but you can feel free to do so to add additional analysis results to the paper!

I remember you mentioned single level in one of the PR. Do you mean that you just use 1 round of clustering instead of 2 rounds? If so, the evaluation_pipeline is not useful anymore?

@shankari
Copy link
Contributor

Not sure what you mean by single_level (link to code would be helpful). But yes, for the deployment on Monday night, I will use only the first round, and I will not filter or cutoff any trips, and I will not model only one split. I anticipate changing the evaluation pipeline significantly, as I said earlier.

@shankari shankari changed the base branch from random-forest-mode-detection to master September 23, 2023 04:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants