
Purge timeseries (with csv export) #904

Open · wants to merge 2 commits into master

Conversation

@TTalex (Contributor) commented Mar 30, 2023

This PR is built atop PR #899

The idea is to offer a tool to purge entries from the timeseries database after they have been used by the pipeline process.
The goal is to reduce the database footprint of active users and improve overall pipeline performance.

This PR differs from PR #899 by:

  • Exporting data before deletion to a configurable CSV file, to avoid losing precious information (default file being /tmp/old_timeseries_<userUUID>.csv)
  • Adding a unit test file to confirm proper export and deletion using real data
  • Allowing the method to be called outside the command line
  • Adding the following command line parameters (and the corresponding method args):

| Short arg | Long arg | Description | Default value |
| --- | --- | --- | --- |
| -d | --dir_name | Target directory for exported CSV data | "/tmp" |
|  | --file_prefix | File prefix for exported CSV data | "old_timeseries_" |
|  | --unsafe_ignore_save | Skip CSV export of deleted data (not recommended; this operation is definitive) | False (unset) |
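The export-before-delete flow these parameters control could be sketched roughly as below. The function name `export_then_purge` and its in-memory handling of entries are illustrative only (they mirror the CLI parameters, not the PR's actual code, which operates on the timeseries collection):

```python
import csv
import os

def export_then_purge(entries, dir_name="/tmp", file_prefix="old_timeseries_",
                      user_uuid="some-uuid", unsafe_ignore_save=False):
    """Illustrative sketch: write entries to CSV, then delete them.

    Returns the export path, or None when --unsafe_ignore_save skipped it.
    """
    export_path = None
    if not unsafe_ignore_save:
        export_path = os.path.join(dir_name, f"{file_prefix}{user_uuid}.csv")
        # Collect the union of keys so heterogeneous entries still export.
        fieldnames = sorted({k for e in entries for k in e})
        with open(export_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(entries)
    # In the real script the deletion targets the timeseries database;
    # clearing the in-memory list here just mirrors that step.
    entries.clear()
    return export_path
```

The key property is ordering: the CSV is fully written before anything is deleted, so a failed export aborts before data loss.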

As a reminder, existing command args were:

| Short arg | Long arg | Description | Default value |
| --- | --- | --- | --- |
| -h |  | Help command |  |
| -e | --user_email | Targeted user email (mutually exclusive with -u) | (required) |
| -u | --user_uuid | Targeted user UUID (mutually exclusive with -e) | (required) |

Some dev decisions, subject to change:

  • Since the script targets specific user data, I opted to keep it dissociated from bin/debug/purge_user.py
  • The script was kept inside the bin folder; in practice its use might fit in a cronjob, which I would not characterize as a "debug" tool
  • The script only acts on the specific user, maybe an extension could be considered to retrieve all active users and run the method on each.
  • I have placed the unit test file in a new emission/tests/binTests folder, since I couldn't find any tests matching this one.
  • The default target directory is set to /tmp, which might break on non-unix machines
  • File prefix and target directory params are probably a bit too sensitive to leading and trailing slashes respectively, but this was deemed acceptable
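The last two concerns (a hard-coded /tmp on non-unix machines, and slash sensitivity) could both be softened with the standard library. A minimal sketch, assuming the script builds the export path itself; `default_export_path` is a hypothetical helper, not part of the PR:

```python
import os
import tempfile

def default_export_path(user_uuid, dir_name=None, file_prefix="old_timeseries_"):
    """Build the CSV export path with a portable temp-dir fallback.

    tempfile.gettempdir() resolves to /tmp on most unixes but also to a
    valid location on Windows, avoiding the hard-coded "/tmp" default.
    """
    if dir_name is None:
        dir_name = tempfile.gettempdir()
    # os.path.join absorbs a trailing slash on dir_name, so
    # "/data/exports" and "/data/exports/" behave the same.
    return os.path.join(dir_name, f"{file_prefix}{user_uuid}.csv")
```

This would make the --dir_name default platform-neutral while keeping /tmp as the effective default on Linux.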

Open to advice and suggestions.

paultranvan and others added 2 commits January 31, 2023 19:03
When operating a server, the `Stage_timeseries` database can become
quite big.
In the case where only the `Stage_analysis_timeseries` is actually
useful after the pipeline execution, the user's timeseries can be
deleted to speed up the pipeline and gain some disk space.
@shankari (Contributor) commented Dec 14, 2023

@MukuFlash03 this is the next task for you to work on. You should:

  • check to see if this works
  • add tests for it
  • figure out how to restore from the dump for people who want to work on it
  • discuss with cloud services where we can dump this data from an AWS perspective (e.g. arctic storage), and then
  • actually write the script to launch it periodically and put it into our internal repo

You should also consider what it would take to restore the data if/when we want to work with the raw data again.
Maybe we should store this as a mongodump instead to make it easier to restore. Maybe that should be a script option.
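The mongodump option suggested here might look like the following. All names are assumptions: the database/collection names are inferred from the commit message's mention of `Stage_timeseries`, and the query filter would need to match however `user_id` is actually encoded in the collection (in particular, a binary UUID would need extended-JSON encoding rather than `$oid`):

```shell
# Dump only this user's raw timeseries entries (names assumed, not verified)
mongodump --db Stage_database --collection Stage_timeseries \
  --query '{"user_id": {"$oid": "<user-object-id>"}}' \
  --out /tmp/old_timeseries_dump

# Restoring later is a single command, which is the advantage over CSV:
# BSON round-trips types (dates, ObjectIds, binary UUIDs) losslessly.
mongorestore --db Stage_database --collection Stage_timeseries \
  /tmp/old_timeseries_dump/Stage_database/Stage_timeseries.bson
```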

I anticipate this will take ~ 3 weeks to complete 😄

@MukuFlash03 (Contributor) commented Dec 28, 2023

This is working as is. The CSV file stores the required data from the timeseries DB.
Currently it's stored as a temporary file, but I'm keeping it on a permanent basis for now for testing purposes.
Next I will test command line usage too.
Will work on adding more tests and switching to mongodump next.

Following up on the discussion in this PR.
