
Purge + Restore user timeseries data with long-term storage #952

Open · wants to merge 9 commits into base: master

Conversation

MukuFlash03
Contributor

Building on top of PR #904 with the objective of integrating it into the current codebase.
Will be analyzing the file storage options (CSV, JSON) as well as adding an option for mongodump/restore.
Long-term storage to AWS will also be looked at.

paultranvan and others added 7 commits January 31, 2023 19:03
When operating a server, the `Stage_timeseries` database can become
quite large.
In the case where only the `Stage_analysis_timeseries` is actually
useful after the pipeline execution, the user's timeseries can be
deleted to speed up the pipeline and free up some disk space.
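
As a rough sketch of that purge idea (not this PR's actual code), deleting one user's raw entries with pymongo could look like the following; the URI, database name, and user id are placeholders:

```python
# Rough sketch only; URI and database name are placeholders.
from uuid import UUID
import pymongo

timeseries = pymongo.MongoClient("mongodb://localhost:27017/")["Stage_database"]["Stage_timeseries"]

def purge_user_timeseries(user_id: UUID):
    # Remove the raw entries for one user; Stage_analysis_timeseries is untouched.
    result = timeseries.delete_many({"user_id": user_id})
    print(f"Deleted {result.deleted_count} entries for user {user_id}")
```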
print() statements weren't being logged in the AWS CloudWatch logs.

logging.debug() statements are meant for this purpose.

These statements may or may not show up in normal execution output, depending on the configured logger level.
Choosing JSON instead of CSV since:
1. CSV does not retain the nested dict-like structure of MongoDB documents.
2. CSV also stores redundant empty NaN columns.
CSV export is kept on hold for now, as restoring from CSV is complicated due to the loss of data structure.
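
To make the structure loss concrete, here is a toy illustration (not code from this PR) using pandas; the document shape is made up:

```python
import json
import pandas as pd

# Made-up document shape, just to show the structure loss.
doc = {"metadata": {"key": "background/location"},
       "data": {"loc": {"coordinates": [-122.08, 37.39]}, "ts": 1704800000.0}}

# JSON round trip preserves the nested dicts and the coordinate list.
assert json.loads(json.dumps(doc)) == doc

# CSV flattens the nesting into dotted columns, and reading it back
# turns the coordinate list into the string "[-122.08, 37.39]".
pd.json_normalize(doc).to_csv("/var/tmp/doc.csv", index=False)
print(pd.read_csv("/var/tmp/doc.csv")["data.loc.coordinates"].iloc[0])
```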

This commit includes working code for export to a JSON file and import from a JSON file.
The default option for now is JSON, which is easier for data restore.

Export flags are provided as a boolean dictionary; the specific export function is called according to which flag is set.
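
A minimal sketch of that flag-dictionary dispatch (function names are illustrative, not the PR's actual helpers):

```python
# Sketch only: function names are illustrative, not the PR's actual helpers.
def export_json(entries, path):
    ...

def export_csv(entries, path):
    ...

# JSON is always exported; the other formats are opt-in via their flags.
EXPORT_FLAGS = {"json": True, "csv": False}
EXPORT_FUNCS = {"json": export_json, "csv": export_csv}

def run_exports(entries, base_path):
    for file_type, enabled in EXPORT_FLAGS.items():
        if enabled:
            EXPORT_FUNCS[file_type](entries, f"{base_path}.{file_type}")
```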
@MukuFlash03
Contributor Author

MukuFlash03 commented Jan 9, 2024

Thinking of the options to be made available for the type of export file - CSV, JSON, mongodump

  • CSV -> complicates restoring due to NaN field columns and loss of data structure
  • JSON -> works well for both export and import
  • Mongodump -> should be the same as JSON

I’m proposing that the default option be JSON, which will be exported irrespective of any other export type (CSV for now). Pros of this:

  • JSON export/import has currently been implemented successfully.
  • Avoids the readability complexities of mongodump from the user's perspective: a JSON file can be viewed in a text editor and the data can be interpreted more easily than with mongodump.

Pros of having mongodump as the default option:

  • Purge + restore should be much simpler, as it would just use the built-in functionality of loading a mongodump.

CSV export is currently integrated as an optional export. Once mongodump is implemented and that data can be imported as well, mongodump also becomes an export option.
This means a JSON file will always be generated, so we will get either just JSON or a combination of JSON and other file types (JSON + CSV or JSON + CSV + mongodump).

I have grouped all export data file types in a boolean dictionary, which can then call the specific export function if that flag is set.

Next working on adding Mongodump as an option.

Mahadik, Mukul Chandrakant added 2 commits January 10, 2024 19:00
Built on and added tests for normal data operations of purge() and restore().

Added edge cases tests:
1. Loading duplicate or already existing data by calling restore function again.
2. Loading from empty JSON file containing no data.

Will add additional tests if needed.
Changed the file path of the empty JSON file used for testing to the generic /var/tmp instead of a local path.
@MukuFlash03
Contributor Author

MukuFlash03 commented Jan 19, 2024

The latest fixes work with JSON as the export file type, and the same file also works for importing the data correctly back into the database.
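
For reference, a JSON export/import round trip of this kind can be sketched with pymongo's bson.json_util (this is an illustration, not the PR's code; the URI, database, and collection names are placeholders):

```python
# Illustration only; URI, database, and collection names are placeholders.
from bson import json_util
import pymongo

ts = pymongo.MongoClient("mongodb://localhost:27017/")["Stage_database"]["Stage_timeseries"]

def export_to_json(user_id, path):
    entries = list(ts.find({"user_id": user_id}))
    with open(path, "w") as f:
        # json_util keeps ObjectId and datetime fields restorable
        f.write(json_util.dumps(entries))

def restore_from_json(path):
    with open(path) as f:
        entries = json_util.loads(f.read())
    if entries:  # an empty export file simply restores nothing
        ts.insert_many(entries)
```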

However, with CSV files, it's a bit complicated for two reasons:

  1. The nested data structure is lost.
  2. Data is stored as strings, which converts types like arrays into strings too.

Also, for mongodump, I tried using the Python subprocess module to run the mongodump command as a terminal command, but there were issues with missing dependencies (such as the mongodump binary itself). I also had to figure out how to run it and/or save the dump file locally in the Docker container, and then extract it from the container if needed for analysis on one's local system.
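
For context, the subprocess approach I tried is roughly of this shape (a sketch with a placeholder URI and output path; it still requires the mongodump binary to be installed inside the container):

```python
import subprocess

def dump_timeseries(out_dir="/var/tmp/ts_dump"):
    # Raises FileNotFoundError if the mongodump binary is missing,
    # which was the dependency issue described above.
    subprocess.run(
        ["mongodump",
         "--uri", "mongodb://localhost:27017/Stage_database",  # placeholder URI
         "--collection", "Stage_timeseries",
         "--out", out_dir],
        check=True,
    )
```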

With these in mind, JSON currently seems like a good option for basic export/import.
I will also check out the already existing JSON export code.
I found this export function mentioned here

@shankari
Contributor

Yes, that is the export PR. We can also potentially use the associated pipeline state for incremental exports, since this is planned to be part of a running system. I would suggest using command-line flags to indicate whether we want an incremental dump or a full dump.
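
Something along these lines for the flags (a sketch only; the actual option names would have to fit the existing export script):

```python
import argparse

parser = argparse.ArgumentParser(description="Export user timeseries")
group = parser.add_mutually_exclusive_group()
group.add_argument("--full", action="store_true",
                   help="dump all entries for the user")
group.add_argument("--incremental", action="store_true",
                   help="dump only entries newer than the stored pipeline state")
args = parser.parse_args()
```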

@shankari
Contributor

As discussed earlier, this needs to use the existing export method.

Labels: none
Project status: Review done; Changes requested
4 participants