
feat: Add timeseries purge script #899

Open · wants to merge 1 commit into master

Conversation

paultranvan

When operating a server with a lot of users, the `Stage_timeseries` database can quickly become quite big.
In cases where only the `Stage_analysis_timeseries` is actually needed after the pipeline runs, the users' raw timeseries can be deleted to speed up the pipeline and reclaim some disk space.

@shankari I have concerns about the date handling. As far as I understand, `metadata.write_ts` is written by the device for docs coming from iOS/Android. If a user edits a trip, the date will change, right?
Now let's assume the following events:

  • T0: a user edits a trip on the mobile, with a metadata.write_ts = T0
  • T1: the pipeline is run on the server, producing a last_ts_run = T1
  • T2: the data from the user's mobile is synced to the server
  • T3: this purge script is run, removing all the timeseries with a metadata.write_ts < T1

But in this scenario, the edited trip with T0 < T1 is not deleted, because it is in the usercache collection, right? It will be moved into the timeseries collection only at the first step of the pipeline. And if a problem occurs during the pipeline execution, the last_ts_run date won't be updated for the CREATE_CONFIRMED_OBJECTS step.

Hope I understood correctly; please correct me if I'm wrong!

@shankari (Contributor) commented Feb 1, 2023

@paultranvan

In the OpenPATH data model, the timeseries, which consists of user inputs, cannot be recreated (since we can't go back in time), so it is extremely precious and immutable.
The analysis timeseries is what we infer from those user inputs, so it can be recreated multiple times: potentially with multiple algorithms (e.g. map matching, trajectory matching, etc.), for use cases that we haven't even considered yet (e.g. https://journals.sagepub.com/doi/full/10.1177/0361198118798286), and potentially with refinements to the algorithms.
This model allows us to reset the pipeline and re-run it in case we have errors, or improvements to the algorithms (e.g. e-mission/e-mission-docs#843).
Please see chapter 5 (specifically section 5.1.3) of my thesis for additional information:
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-180.html

So while running this script is clearly optional, the script itself is inherently contrary to the spirit of the project, and I am very reluctant to merge it.

Regardless, for a better understanding of the pipeline:

  1. The `last_ts_run` pipeline state is more of a log of when the pipeline was run. `last_processed_ts` is the one used for actual pipeline operations, including resetting.
  2. The user does not edit trip objects on the server. The user inputs are saved as separate timeseries objects (so they cannot be deleted) and matched to the auto-generated trips as part of the pipeline. Please see "use labels to override trip mode" e-mission-docs#476 for more discussion and documentation of the current implementation.

@shankari (Contributor) commented Feb 1, 2023

Although I don't think we should delete the timeseries, I would support moving them to a separate long-term/archival storage after a while. This would still allow pipeline resets as long as issues were detected quickly (before the timeseries data is archived), and could also allow for improved algorithms by copying the data back from the archival storage.

@shankari (Contributor) commented Feb 1, 2023

You might want to see the related e-mission/e-mission-docs#845

@TTalex (Contributor) commented Feb 2, 2023

Hey, here's some context related to this PR.

The way we find use cases in France is always linked to experimental projects financed by local authorities. The caveat that comes with this kind of project is that all gathered data has a limited lifetime: at the end of the program, all user data has to be deleted.

Given the duration of these projects (usually around a year), the likelihood that we rerun timeseries analyses or that new use cases appear is very low. In practice, it has never happened.
Relatedly, we noticed that deleting already-processed data was a quick way to significantly improve pipeline performance.

I'm wondering if we're an isolated case, or if other community members share this requirement of a limited data lifetime. If we're alone, I agree that this PR is of no use for the central repo; if not, maybe it should be part of the available toolkit.

Another element of context that is quite important is that this is the first step in our goal, with Cozy Cloud, of optional decentralization of OpenPATH. The goal of this PR is to provide a way of removing old data from the pipeline process. It doesn't exclude the idea of moving such data to long-term storage before that, although this would require another process not included in this PR. In a decentralized approach, such data would be stored in the personal cloud storage of each user, and users would be free to decide whether to save or delete it.

In case we are indeed an isolated case, do you think this PR would be useful if it included some kind of long-term storage before deletion? Perhaps by exporting the data to files (targz'ed CSVs?) to begin with?

@paultranvan (Author)

Hello @shankari,
I understand your point of view and your vision for the project.
To clarify what we are trying to do, let me add a bit more context, to be sure we understand each other. As you know, at Cozy Cloud, we are building a decentralized personal cloud where any user can store any data. It makes a lot of sense for us to offer users the possibility to retrieve their trips and better understand their habits and their carbon impact, find incentives to improve it, etc. And eventually, we are planning to implement distributed collaboration protocols to collectively learn about trip usage, while enforcing anonymity.

In our decentralized vision, the data should only remain in each Cozy instance, one per user. In the long-term vision, we would strive for a decentralized OpenPATH, but for now, we would run a centralized server to compute the trips, and each instance would periodically request the /getTrips endpoint to retrieve its own trips, on its own instance. Once that is done, there is no reason for us to keep the data on the OpenPATH server. Actually, purging the data is even a requirement for us, because:

  • a centralized server with a lot of sensitive user data is definitely attractive for hackers
  • keeping user data in a centralized place is the opposite of the personal Cloud vision
  • European regulation (GDPR) has very strict rules about data retention based on the purpose. If you don't need the data anymore, you have to remove it after a while (e.g. 90 days).

In our personal cloud use case, we do not need to keep the data on the centralized server, as long as we are not working on the pipeline execution itself, improving the algorithms, and so on. This is definitely not to be excluded in the future, but for now our short-term goal is to be able to operate a scalable OpenPATH server for thousands of users.

So maybe something that would make more sense for both of us would be a script that purges all the data except for the user profile + usercache? In that case, we would just need a way to store a "retrieval date" for each user, set once the data has been imported into a Cozy instance, to know when it is safe to start the purge.

@paultranvan (Author)

I didn't see @TTalex's comment before posting, but it is actually quite complementary :)

@shankari (Contributor) commented Feb 5, 2023

@asiripanich @lgharib @ericafenyo do you have a sense of whether your use cases require/prefer retention of raw timeseries data? Please see the discussion above for additional context.

@paultranvan we have a monthly meeting, scheduled at 9pm CET on the last Monday of each month, focused on technical coordination. Would you like to attend? If we don't hear back from the others by then, we can bring it up as a topic of discussion.

To summarize, I believe that the two options that have been proposed are:

  • long-term storage before deletion (@TTalex)
  • script that would purge all data except for profile + usercache (@paultranvan)

@shankari (Contributor) commented Feb 5, 2023

@TTalex @paultranvan a couple of follow-ups to your insightful comments:

**The NREL-hosted version of the OpenPATH server is already decentralized**

Each partner gets their own database, their own server instance, and their own public and admin dashboards. We support separate configs for each partner that are used to customize the platform to meet their needs. As a concrete example, consider the UPRM CIVIC vs. MassCEC deployments:

| Microservice | UPRM CIVIC | MM-MassCEC |
| --- | --- | --- |
| Join page | https://uprm-civic-openpath.nrel.gov/join/ | https://mm-masscec-openpath.nrel.gov/join/ |
| Config spec (note separate server URLs) | https://github.com/e-mission/nrel-openpath-deploy-configs/blob/main/configs/uprm-civic.nrel-op.json | https://github.com/e-mission/nrel-openpath-deploy-configs/blob/main/configs/mm-masscec.nrel-op.json |
| Public dashboard | https://uprm-civic-openpath.nrel.gov/public/ | https://mm-masscec-openpath.nrel.gov/public/ |

It is true that it is not decentralized at the level of individual users, but it is the same concept at a much more massive scale; e.g. see my recent issue e-mission/e-mission-docs#852

**NREL OpenPATH already supports anonymized trip usage analysis**

This is essentially the public dashboard, which aggregates trip information across multiple users. Note that you cannot anonymize fine-grained spatio-temporal information, particularly if it is linked to a particular user. We have monthly charts, and will likely soon add aggregate spatial analysis similar to the graphs below.

[image: Scatter plot of start/end points (Blue: all trips, Green: e-bike trips, Red: car-like trips)]

[image: Trajectories for e-bike trips ONLY at various zoom levels]

At this point, for the public dashboard, we plan to stay with coarse-grained static images to avoid issues with repeated queries. For the longer-term, we plan to automate analyses as much as possible in a "code against data" scenario so that we can understand the high level impact without needing access to individual spatio-temporal data.

Note that we also have a more sophisticated prototype that supports arbitrary code execution against user data (https://github.com/e-mission/e-mission-upc-aggregator) but it is not in our roadmap for now because:

  • we think that adding error for differential privacy could be challenging given the existing sources of error wrt imperfect algorithms. An energy estimate of $x \pm y$ is useless if $x \approx y$
  • we don't think that our partners have the time or the technical capability to write scripts to analyse the data, let alone make the scripts open source. They would also need some sample non-private data to write the scripts against. It has been much more successful for us to write the scripts ourselves (and make them open source, of course) so that partners only need to supply the narrative around the visualizations.

Our plan is to write the public and admin dashboards, figure out how they are being used, and then figure out how to tighten down the user spatio-temporal access to support those use cases.

@paultranvan what did you have in mind for "distributed collaboration protocols to collectively learn about trip usage, while enforcing anonymity"?

@shankari (Contributor) commented Feb 5, 2023

Also, @paultranvan @TTalex, wrt

> but for now, we would run a centralized server to compute the trips, and each instance would periodically request the /getTrips endpoint to retrieve its own trips, on its own instance.

we are going to remove the getTrips endpoint once the label and diary screen unification is complete. I brought this up during the last monthly meeting. I'm still not sure why you chose to use the getTrips endpoint instead of /datastreams/find_entries/<time_type>. It is not even that easy to use for exporting: it can only export one day at a time, and you don't know when the day is "complete". Regardless, it is clunky and slow, and we plan to remove it.

From my post-meeting notes:

  • As @ericafenyo pointed out, the diary page was slow because we were creating the entire geojson, including the full trajectory information, on the server and returning it one day at a time
  • We have replaced this with the label screen, which pulls only the trips first, displays them, and then fills in the trajectory information later – e.g. Review Unified Label/Diary Screen PRs e-mission-docs#779
    • I am not sure that Cordova is the cause of the slowness per se, so much as an incorrect access pattern.
  • We will remove the diary screen in a future PR and only support the label screen going forward

If you want to copy data over periodically in a principled fashion, you probably want to look at e-mission/e-mission-docs#643 and the export process developed for that project: #824 (sorry, the issue is mostly about the UI, which we ended up not completing)

@TTalex (Contributor) commented Feb 6, 2023

Thanks for the detailed replies, as always 🙂 !

@paultranvan (Author)

Thank you for all the insights.

> what did you have in mind for "distributed collaboration protocols to collectively learn about trip usage, while enforcing anonymity"?

I was thinking about the work we started with a PhD student: https://www.theses.fr/2019SACLV067
And we are pursuing this work with another thesis, on a more AI-related approach: https://dl.acm.org/doi/10.1145/3468791.3468821
Although we haven't thought about a trip use case yet, and have focused on the distributed protocol aspects rather than the data treatment itself.

> NREL OpenPATH already supports anonymized trip usage analysis

That's very interesting and might be useful for us in the future!

> I'm still not sure why you chose to use the getTrips endpoint instead of /datastreams/find_entries/<time_type>

Well, I'm unsure as well :)
From what I recall, /datastreams/find_entries/<time_type> was not as complete as getTrips, in the sense that it lacked detailed trajectory data such as GPS points, speeds, etc. Our need is an HTTP API endpoint to easily get all the GeoJSON data related to each trip so it can be displayed on the user side (in our case, in a dedicated web app). It seems getTrips was doing the job (but I noticed it was quite slow as well, and not being able to specify a time range is indeed not very practical).
I'm not sure what e-mission/e-mission-docs#779 does exactly, as the related PRs are quite big (especially e-mission/e-mission-phone#871), but from what you say, I assume "fills in the trajectory information later" means a request to /datastreams/find_entries/<time_type> on a particular key list?

@shankari (Contributor) commented Feb 15, 2023

@paultranvan

> Well, I'm unsure as well :)

😄

> From what I recall, /datastreams/find_entries/<time_type> was not as complete as getTrips, in the sense that it lacked detailed trajectory data such as GPS points, speeds, etc.

Yes, find_entries, called with analysis/confirmed_trip, returns the trips. You then need to make a second set of calls to retrieve the section and location objects.

> I'm not sure what e-mission/e-mission-docs#779 does exactly, as the related PRs are quite big (especially e-mission/e-mission-phone#871), but from what you say, I assume "fills in the trajectory information later" means a request to /datastreams/find_entries/<time_type> on a particular key list?

Yes, it is a call to /datastreams/find_entries/<time_type> with the analysis/recreated_location key and the start and end timestamps of the trip. These calls happen in the background and lazily load the trajectory information, so the user-perceived latency is quite small.

Having said all that, as part of the trip and place additions functionality for @asiripanich and @AlirezaRa94, and the upcoming diary and label unification, we are now planning to precompute composite_trip objects, which will combine trip, place, section, stop, and trajectories into one object to simplify the call pattern without increasing response time.

We currently don't envision the composite trips being in GeoJSON format, since we have to implement trip2geojson on the phone anyway, but we could be persuaded to change that.

Again, if you are actively working on OpenPATH, I would strongly encourage you to attend the monthly developer meetings so that we can discuss and finalize high-level changes as a community. Please let me know if I should send you an invite.

@asiripanich (Member)

For my studies, I use the `bin/debug/purge_user.py` script to purge the database once a participant has finished their data collection. That has been more than enough for my use cases.

@paultranvan (Author)

> planning to precompute composite_trip objects, which will combine trip, place, section, stop, and trajectories into one object to simplify the call pattern without increasing response time.
> We currently don't envision the composite trips being in GeoJSON format, since we have to implement trip2geojson on the phone anyway, but we could be persuaded to change that.

Ok, that's definitely something that would be interesting for us, particularly in the GeoJSON format! This is a requirement for us, as we are using this format in our CoachCO2 app, with data imported from the server through the dedicated connector.

> Again, if you are actively working on OpenPATH, I would strongly encourage you to attend the monthly developer meetings so that we can discuss and finalize high-level changes as a community.

Unfortunately, I'll be off for a few weeks... Let's talk about this when I come back 🙏

@shankari (Contributor)

> Ok, that's definitely something that would be interesting for us, particularly in the GeoJSON format! This is a requirement for us, as we are using this format in our CoachCO2 app, with data imported from the server through the dedicated connector.

Even if we return them in a non-GeoJSON format (which is what we are leaning towards now), you should be able to convert to GeoJSON in your connector code. We will convert to GeoJSON on the phone (the display layer) before display. There is a tradeoff between the amount of information exchanged and standardization; we can discuss the design further here.

> Unfortunately, I'll be off for a few weeks... Let's talk about this when I come back 🙏

We meet every month, so maybe you could attend the one in March? Let me know.
