Purge + Restore user timeseries data with long-term storage #952
base: master
Conversation
When operating a server, the `Stage_timeseries` database can grow quite large. In cases where only the `Stage_analysis_timeseries` is actually needed after pipeline execution, the user's timeseries can be purged to speed up the pipeline and reclaim disk space.
Also added the associated unit tests.
`print()` statements weren't being logged in AWS CloudWatch logs; `logging.debug()` statements are meant for this purpose. These statements may or may not show up in normal execution output, depending on the configured logger level.
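A minimal sketch of this logging pattern; the level, message, and values are illustrative, not taken from the codebase:

```python
import logging

# Messages below the configured level are dropped, so debug output can be
# silenced in normal runs and surfaced in CloudWatch when needed.
logging.basicConfig(level=logging.DEBUG)

# Unlike print(), this goes through the logging handlers that CloudWatch
# captures; whether it appears depends on the configured logger level.
logging.debug("Purging entries for user %s before ts %s", "test-user", 1700000000)
```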
Choosing JSON instead of CSV since:
1. CSV does not retain the nested, dict-like structure of MongoDB documents.
2. CSV also stores redundant empty NaN columns.
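As a small illustration of point 1, here is a hypothetical nested document shaped like a timeseries entry (the keys and values are made up); JSON round-trips it losslessly, while CSV would force it into flattened, sparse columns:

```python
import json

# Hypothetical document in the style of a Stage_timeseries entry
doc = {
    "metadata": {"key": "background/location", "write_ts": 1700000000.0},
    "data": {"loc": {"coordinates": [-122.08, 37.39]}, "ts": 1700000000.0},
}

# JSON preserves the nested dict structure exactly
assert json.loads(json.dumps(doc)) == doc

# A CSV export would instead need flattened columns such as
#   metadata.key, metadata.write_ts, data.loc.coordinates, data.ts, ...
# and documents with differing keys would leave empty (NaN) cells.
```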
CSV export is kept on hold for now, since restoring from CSV is complicated by the loss of data structure. This commit includes working code for export to a JSON file and import from a JSON file.
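A rough sketch of what the JSON export/import round trip might look like; the connection string, database, collection, and function names are placeholders, and `bson.json_util` (shipped with pymongo) is used so MongoDB-specific types such as `ObjectId` survive the round trip:

```python
import pymongo
from bson import json_util

client = pymongo.MongoClient("localhost")       # placeholder connection
coll = client.Stage_database.Stage_timeseries   # placeholder collection

def export_as_json(user_id, file_name):
    # json_util serializes ObjectId, datetime, etc. as extended JSON
    entries = list(coll.find({"user_id": user_id}))
    with open(file_name, "w") as fp:
        fp.write(json_util.dumps(entries))

def import_from_json(file_name):
    with open(file_name) as fp:
        entries = json_util.loads(fp.read())
    if entries:  # insert_many rejects an empty list
        coll.insert_many(entries)
```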
The default option for now is JSON, which is easier for data restore. The export flags are provided as a boolean dictionary, which calls the specific export function according to the set boolean flag.
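A sketch of what this boolean flag dictionary could look like, reusing `export_as_json` from the earlier sketch; the other function names are hypothetical stubs:

```python
def export_as_csv(user_id, file_name):        # hypothetical stub
    raise NotImplementedError("CSV export on hold")

def export_as_mongodump(user_id, file_name):  # hypothetical stub
    raise NotImplementedError("mongodump export planned")

# JSON stays enabled as the default; other formats are opt-in
export_flags = {"json": True, "csv": False, "mongodump": False}
export_functions = {"json": export_as_json,
                    "csv": export_as_csv,
                    "mongodump": export_as_mongodump}

def run_exports(user_id, base_name):
    # Call the specific export function for each flag that is set
    for fmt, enabled in export_flags.items():
        if enabled:
            export_functions[fmt](user_id, f"{base_name}.{fmt}")
```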
Thinking about the options to make available for the export file type: CSV, JSON, `mongodump`.
I'm proposing the default option to be JSON, which will occur irrespective of any other export type (CSV for now). Pros of this:
Pros of having `mongodump` as the default option:
Currently integrated CSV export as an optional export. Once `mongodump` is implemented and that data can be imported as well, `mongodump` also becomes an export option. I have grouped all export file types in a boolean dictionary, which then calls the specific export function if that flag is set. Next, working on adding `mongodump` as an option.
Built on and added tests for the normal data operations of `purge()` and `restore()`. Added edge-case tests:
1. Loading duplicate or already-existing data by calling the restore function again.
2. Loading from an empty JSON file containing no data.
Will add additional tests if needed.
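A condensed sketch of what those two edge-case tests might check; `restore` and `coll` are hypothetical stand-ins for the helpers in the sketches above:

```python
import json
import tempfile
import unittest

class TestPurgeRestoreEdgeCases(unittest.TestCase):
    def test_restore_twice_does_not_duplicate(self):
        # Hypothetical expectation: restoring the same export twice
        # should not double the number of stored entries.
        restore("export.json")
        count_before = coll.count_documents({})
        restore("export.json")
        self.assertEqual(coll.count_documents({}), count_before)

    def test_restore_from_empty_file(self):
        # An export file containing no data should restore as a no-op
        with tempfile.NamedTemporaryFile("w", suffix=".json",
                                         dir="/var/tmp", delete=False) as fp:
            json.dump([], fp)
        restore(fp.name)  # should neither raise nor insert anything
```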
Changed the file path of the empty JSON file used for testing to the generic /var/tmp instead of a local path.
The latest fixes work with JSON as the export file type, and the same file also works for importing the data correctly back into the database. With CSV files, however, it's more complicated, for the two reasons noted earlier: the loss of nested document structure, and the redundant empty NaN columns.
Also, with `mongodump`, I tried using the Python subprocess module to run the `mongodump` command as a terminal command, but there were issues with missing dependencies (such as the `mongodump` binary itself). I then had to figure out how to run it and/or save the output locally in the Docker container and extract it from there if needed for analysis on one's local system. With these in mind, JSON currently seems like a good option for basic export/import.
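For reference, a minimal sketch of the subprocess approach that was attempted; `--db` and `--out` are standard `mongodump` options, but the binary must actually be installed in the container for this to work, which is exactly the dependency issue described above:

```python
import shutil
import subprocess

def dump_with_mongodump(db_name, out_dir):
    # Fail up front if the mongodump binary is missing, rather than
    # with an opaque error from deep inside subprocess.
    if shutil.which("mongodump") is None:
        raise RuntimeError("mongodump binary not found in this container")
    subprocess.run(
        ["mongodump", "--db", db_name, "--out", out_dir],
        check=True,  # raise CalledProcessError on a non-zero exit
    )
```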
Yes, that is the export PR. We can also potentially use the associated pipeline state for incremental exports, since this is planned to be part of a running system. I would suggest using command line flags to indicate whether we want an incremental dump or a full dump.
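One way those command line flags could look, as a sketch; the flag names and default are illustrative:

```python
import argparse

parser = argparse.ArgumentParser(description="Export user timeseries data")
group = parser.add_mutually_exclusive_group()
group.add_argument("--full", action="store_true",
                   help="dump all entries for the user")
group.add_argument("--incremental", action="store_true",
                   help="dump only entries newer than the last export, "
                        "based on the associated pipeline state")
args = parser.parse_args()

# Illustrative choice: default to a full dump when neither flag is given
mode = "incremental" if args.incremental else "full"
print(f"Running a {mode} export")
```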
As discussed earlier, this needs to use the existing export method.
Building on top of PR #904, with the objective of integrating this into the current codebase.
Will be analyzing the file storage options (CSV, JSON), as well as including the option for `mongodump`/restore.
Long-term storage to AWS will be looked at as well.