
Purge + Restore user timeseries data with long-term storage #952

Open · wants to merge 9 commits into base: master

Conversation

MukuFlash03
Contributor

Building on top of PR #904 with the objective of integrating it into the current codebase.
Will be analyzing the file storage options (CSV, JSON) as well as adding an option for mongodump/restore.
Long-term storage to AWS will also be looked at.

paultranvan and others added 7 commits January 31, 2023 19:03
When operating a server, the `Stage_timeseries` database can become
quite large.
In the case where only the `Stage_analysis_timeseries` is actually
useful after the pipeline execution, the user's timeseries can be
deleted to speed up the pipeline and free up some disk space.
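
As a rough sketch of that purge idea (not this PR's actual code), deleting one user's raw entries with pymongo could look like the following; the URI, database name, and user id are placeholders:

```python
# Rough sketch only; URI and database name are placeholders.
from uuid import UUID
import pymongo

timeseries = pymongo.MongoClient("mongodb://localhost:27017/")["Stage_database"]["Stage_timeseries"]

def purge_user_timeseries(user_id: UUID):
    # Remove the raw entries for one user; Stage_analysis_timeseries is untouched.
    result = timeseries.delete_many({"user_id": user_id})
    print(f"Deleted {result.deleted_count} entries for user {user_id}")
```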
print() statements weren't being logged in the AWS CloudWatch logs.

logging.debug() statements are meant for this purpose.

These statements may or may not show up in normal execution output, depending on the configured logger level.
Choosing JSON instead of CSV since:
1. CSV does not retain the nested dict-like structure of MongoDB documents.
2. CSV also stores redundant empty NaN columns.
CSV export is kept on hold for now, as restoring from CSV is complicated due to the loss of data structure.
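
To make the structure loss concrete, here is a toy illustration (not code from this PR) using pandas; the document shape is made up:

```python
import json
import pandas as pd

# Made-up document shape, just to show the structure loss.
doc = {"metadata": {"key": "background/location"},
       "data": {"loc": {"coordinates": [-122.08, 37.39]}, "ts": 1704800000.0}}

# JSON round trip preserves the nested dicts and the coordinate list.
assert json.loads(json.dumps(doc)) == doc

# CSV flattens the nesting into dotted columns, and reading it back
# turns the coordinate list into the string "[-122.08, 37.39]".
pd.json_normalize(doc).to_csv("/var/tmp/doc.csv", index=False)
print(pd.read_csv("/var/tmp/doc.csv")["data.loc.coordinates"].iloc[0])
```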

This commit includes working code for export to a JSON file and import from a JSON file.
The default option for now is JSON, which is easier for data restore.

Export flags are provided as a boolean dictionary; the specific export function is called according to which flag is set.
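
A minimal sketch of that flag-dictionary dispatch (function names are illustrative, not the PR's actual helpers):

```python
# Sketch only: function names are illustrative, not the PR's actual helpers.
def export_json(entries, path):
    ...

def export_csv(entries, path):
    ...

# JSON is always exported; the other formats are opt-in via their flags.
EXPORT_FLAGS = {"json": True, "csv": False}
EXPORT_FUNCS = {"json": export_json, "csv": export_csv}

def run_exports(entries, base_path):
    for file_type, enabled in EXPORT_FLAGS.items():
        if enabled:
            EXPORT_FUNCS[file_type](entries, f"{base_path}.{file_type}")
```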
@MukuFlash03
Contributor Author

MukuFlash03 commented Jan 9, 2024

Thinking of the options to be made available for the type of export file - CSV, JSON, mongodump

  • CSV -> complicates restoring due to NaN field columns and loss of data structure
  • JSON -> works well for both export and import
  • Mongodump -> should be the same as JSON

I’m proposing that the default option be JSON, which will be exported irrespective of any other export type (CSV for now). Pros of this:

  • JSON export/import has currently been implemented successfully.
  • Avoids the readability complexities of mongodump from the user's perspective: a JSON file can be viewed in a text editor and the data can be interpreted more easily than with mongodump.

Pros of having mongodump as the default option:

  • Purge + restore should be much simpler, as it would just use the built-in functionality of loading a mongodump.

CSV export is currently integrated as an optional export. Once mongodump is implemented and that data can be imported as well, mongodump also becomes an export option.
This means a JSON file will always be generated, so we will get either just JSON or a combination of JSON and other file types (JSON + CSV or JSON + CSV + mongodump).

I have grouped all export data file types in a boolean dictionary, which can then call the specific export function if that flag is set.

Next working on adding Mongodump as an option.

Mahadik, Mukul Chandrakant added 2 commits January 10, 2024 19:00
Built on and added tests for normal data operations of purge() and restore().

Added edge cases tests:
1. Loading duplicate or already existing data by calling restore function again.
2. Loading from empty JSON file containing no data.

Will add additional tests if needed.
Changed the file path of the empty JSON file used for testing to the generic /var/tmp instead of a local path.
@MukuFlash03
Contributor Author

MukuFlash03 commented Jan 19, 2024

The latest fixes work with JSON as the export file type, and the same file also works for importing the data correctly back into the database.
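
For reference, a JSON export/import round trip of this kind can be sketched with pymongo's bson.json_util (this is an illustration, not the PR's code; the URI, database, and collection names are placeholders):

```python
# Illustration only; URI, database, and collection names are placeholders.
from bson import json_util
import pymongo

ts = pymongo.MongoClient("mongodb://localhost:27017/")["Stage_database"]["Stage_timeseries"]

def export_to_json(user_id, path):
    entries = list(ts.find({"user_id": user_id}))
    with open(path, "w") as f:
        # json_util keeps ObjectId and datetime fields restorable
        f.write(json_util.dumps(entries))

def restore_from_json(path):
    with open(path) as f:
        entries = json_util.loads(f.read())
    if entries:  # an empty export file simply restores nothing
        ts.insert_many(entries)
```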

However, with CSV files, it's a bit complicated for two reasons:

  1. The nested data structure is lost.
  2. Data is stored as strings, which converts types like arrays into strings too.

Also, for mongodump, I tried using the Python subprocess module to run the mongodump command as a terminal command, but there were issues with missing dependencies (such as the mongodump binary itself). I also had to figure out how to run it and/or save the dump file locally in the Docker container, and then extract it from the container if needed for analysis on one's local system.
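
For context, the subprocess approach I tried is roughly of this shape (a sketch with a placeholder URI and output path; it still requires the mongodump binary to be installed inside the container):

```python
import subprocess

def dump_timeseries(out_dir="/var/tmp/ts_dump"):
    # Raises FileNotFoundError if the mongodump binary is missing,
    # which was the dependency issue described above.
    subprocess.run(
        ["mongodump",
         "--uri", "mongodb://localhost:27017/Stage_database",  # placeholder URI
         "--collection", "Stage_timeseries",
         "--out", out_dir],
        check=True,
    )
```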

With these in mind, JSON currently seems like a good option for basic export/import.
I will also check out the already existing JSON export code.
I found this export function mentioned here

@shankari
Contributor

Yes, that is the export PR. We can also potentially use the associated pipeline state for incremental exports, since this is planned to be part of a running system. I would suggest using command-line flags to indicate whether we want an incremental dump or a full dump.
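
Something along these lines for the flags (a sketch only; the actual option names would have to fit the existing export script):

```python
import argparse

parser = argparse.ArgumentParser(description="Export user timeseries")
group = parser.add_mutually_exclusive_group()
group.add_argument("--full", action="store_true",
                   help="dump all entries for the user")
group.add_argument("--incremental", action="store_true",
                   help="dump only entries newer than the stored pipeline state")
args = parser.parse_args()
```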

@shankari
Contributor

As discussed earlier, this needs to use the existing export method.

Labels: none
Project status: Review done; Changes requested
4 participants