
Purge timeseries (with csv export) #904

Open · wants to merge 2 commits into master

Conversation

@TTalex (Contributor) commented Mar 30, 2023

This PR is built atop PR #899

The idea is to offer a tool to purge entries from the timeseries database after they have been used by the pipeline process.
The goal is to reduce the database footprint of active users and improve overall pipeline performance.

This PR differs from PR #899 by:

  • Exporting data before deletion to a configurable CSV file, to avoid losing precious information (default file being /tmp/old_timeseries_<userUUID>.csv)
  • Adding a unit test file to confirm proper export and deletion using real data
  • Allowing the method to be called outside the command line
  • Adding the following command line parameters (and the corresponding method args):

| Short arg | Long arg | Description | Default value |
| --- | --- | --- | --- |
| -d | --dir_name | Target directory for exported CSV data | "/tmp" |
|  | --file_prefix | File prefix for exported CSV data | "old_timeseries_" |
|  | --unsafe_ignore_save | Skip CSV export of deleted data (not recommended; this operation is definitive) | False (unset) |
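The export-before-delete flow these parameters control could be sketched roughly as below. The function name `export_then_purge` and its in-memory handling of entries are illustrative only (they mirror the CLI parameters, not the PR's actual code, which operates on the timeseries collection):

```python
import csv
import os

def export_then_purge(entries, dir_name="/tmp", file_prefix="old_timeseries_",
                      user_uuid="some-uuid", unsafe_ignore_save=False):
    """Illustrative sketch: write entries to CSV, then delete them.

    Returns the export path, or None when --unsafe_ignore_save skipped it.
    """
    export_path = None
    if not unsafe_ignore_save:
        export_path = os.path.join(dir_name, f"{file_prefix}{user_uuid}.csv")
        # Collect the union of keys so heterogeneous entries still export.
        fieldnames = sorted({k for e in entries for k in e})
        with open(export_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(entries)
    # In the real script the deletion targets the timeseries database;
    # clearing the in-memory list here just mirrors that step.
    entries.clear()
    return export_path
```

The key property is ordering: the CSV is fully written before anything is deleted, so a failed export aborts before data loss.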

As a reminder, existing command args were:

| Short arg | Long arg | Description | Default value |
| --- | --- | --- | --- |
| -h |  | Help command |  |
| -e | --user_email | Targeted user email (mutually exclusive with -u) | (required) |
| -u | --user_uuid | Targeted user UUID (mutually exclusive with -e) | (required) |

Some dev decisions, subject to change:

  • Since the script targets specific user data, I opted to keep it dissociated from bin/debug/purge_user.py
  • The script was kept inside the bin folder; in practice its use might fit in a cronjob, which I would not characterize as a "debug" tool
  • The script only acts on the specific user, maybe an extension could be considered to retrieve all active users and run the method on each.
  • I have placed the unit test file in a new emission/tests/binTests folder, since I couldn't find any tests matching this one.
  • The default target directory is set to /tmp, which might break on non-unix machines
  • File prefix and target directory params are probably a bit too sensitive to leading and trailing slashes respectively, but this was deemed acceptable
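The last two concerns (a hard-coded /tmp on non-unix machines, and slash sensitivity) could both be softened with the standard library. A minimal sketch, assuming the script builds the export path itself; `default_export_path` is a hypothetical helper, not part of the PR:

```python
import os
import tempfile

def default_export_path(user_uuid, dir_name=None, file_prefix="old_timeseries_"):
    """Build the CSV export path with a portable temp-dir fallback.

    tempfile.gettempdir() resolves to /tmp on most unixes but also to a
    valid location on Windows, avoiding the hard-coded "/tmp" default.
    """
    if dir_name is None:
        dir_name = tempfile.gettempdir()
    # os.path.join absorbs a trailing slash on dir_name, so
    # "/data/exports" and "/data/exports/" behave the same.
    return os.path.join(dir_name, f"{file_prefix}{user_uuid}.csv")
```

This would make the --dir_name default platform-neutral while keeping /tmp as the effective default on Linux.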

Open to advice and suggestions.

paultranvan and others added 2 commits January 31, 2023 19:03
When operating a server, the `Stage_timeseries` database can become
quite big.
In the case where only the `Stage_analysis_timeseries` is actually
useful after the pipeline execution, the user's timeseries can be
deleted to speed up the pipeline and gain some disk space.
@shankari (Contributor) commented Dec 14, 2023

@MukuFlash03 this is the next task for you to work on. You should:

  • check to see if this works
  • add tests for it
  • figure out how to restore from the dump for people who want to work on it
  • discuss with cloud services where we can dump this data from an AWS perspective (e.g. arctic storage), and then
  • actually write the script to launch it periodically and put it into our internal repo

You should also consider what it would take to restore the data if/when we want to work with the raw data again.
Maybe we should store this as a mongodump instead to make it easier to restore. Maybe that should be a script option.
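The mongodump option suggested here might look like the following. All names are assumptions: the database/collection names are inferred from the commit message's mention of `Stage_timeseries`, and the query filter would need to match however `user_id` is actually encoded in the collection (in particular, a binary UUID would need extended-JSON encoding rather than `$oid`):

```shell
# Dump only this user's raw timeseries entries (names assumed, not verified)
mongodump --db Stage_database --collection Stage_timeseries \
  --query '{"user_id": {"$oid": "<user-object-id>"}}' \
  --out /tmp/old_timeseries_dump

# Restoring later is a single command, which is the advantage over CSV:
# BSON round-trips types (dates, ObjectIds, binary UUIDs) losslessly.
mongorestore --db Stage_database --collection Stage_timeseries \
  /tmp/old_timeseries_dump/Stage_database/Stage_timeseries.bson
```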

I anticipate this will take ~ 3 weeks to complete 😄

@MukuFlash03 (Contributor) commented Dec 28, 2023

This is working as is. The CSV file stores the required data from the timeseries DB.
Currently it's stored as a temporary file, but I'm keeping it on a permanent basis for now for testing purposes.
Next I will test command line usage too.
Will work on adding more tests and switching to mongodump next.

Following up on the discussion in this PR.
