Improve JSON processing performance #396

matteofigus · 2024-01-30T18:38:18Z

Description of changes:

Follow-up PR from #395 - move the optimisation logic outside of the JSON handler to simplify the logic and use less memory (the hashmap is now created once at the start of the job, rather than copied on each line iteration).
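A minimal sketch of the idea in this description, not the code in this PR: the lookup sets are built once before iterating, and each JSON line only performs membership checks. The helper names `build_match_index` and `filter_lines` and the simplified record shape are assumptions made for illustration.

```python
import json


def build_match_index(columns):
    """Build one lookup set per column identifier, once per job (hypothetical helper)."""
    index = []
    for column in columns:
        if column["Type"] == "Simple":
            keys = (column["Column"],)
            # Wrap simple ids in 1-tuples so both kinds share one lookup path.
            match_ids = {(m,) for m in column["MatchIds"]}
        else:
            keys = tuple(column["Columns"])
            # Composite ids become tuples so that they are hashable.
            match_ids = {tuple(m) for m in column["MatchIds"]}
        index.append((keys, match_ids))
    return index


def filter_lines(lines, index):
    """Yield only the JSON lines that match no identifier in the prebuilt index.

    Assumes the identifier values in each record are scalars (hashable).
    """
    for line in lines:
        record = json.loads(line)
        matched = any(
            tuple(record.get(k) for k in keys) in match_ids
            for keys, match_ids in index
        )
        if not matched:
            yield line
```

The index is passed into the line loop rather than rebuilt inside it, so its memory cost is paid once per job regardless of how many lines the file contains.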

I also included File Size in the log as it could be useful during troubleshooting.

PR Checklist:

Changelog updated
Unit tests (and integration tests if applicable) provided
All tests pass
Pre-commit checks pass
Debugging code removed
If releasing a new version, have you bumped the version in the main CFN template?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Increase the speed of the json_handler by migrating from a list to a set. Move from O(n) to O(1)
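As a rough illustration of the list-to-set change (a sketch with made-up data, not a benchmark of this project), membership in a list scans the elements one by one, while membership in a set is a hash lookup that takes constant time on average. Absolute timings will vary by machine.

```python
import timeit

# Made-up match ids; the probe is the last element, the worst case for a list scan.
match_ids_list = [str(i) for i in range(100_000)]
match_ids_set = set(match_ids_list)
probe = "99999"

list_time = timeit.timeit(lambda: probe in match_ids_list, number=100)
set_time = timeit.timeit(lambda: probe in match_ids_set, number=100)

print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```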

cmclel7 · 2024-02-14T15:14:48Z

backend/ecs_tasks/delete_files/parquet_handler.py

    else:
        for i in range(0, len(column["Columns"])):
            if is_column_type_decimal(schema, column["Columns"][i]):
-                for composite_match in column["MatchIds"]:
-                    composite_match[i] = Decimal(composite_match[i])
+                columns_copy = set()


I'm wondering if there is a way to do this without creating another copy of the column.

Do you think that there is a possibility for there to be particularly large columns here?

I ask mainly because we have seen some large simple match lists, although I haven't personally seen any use of composite matches.

The possibility is real, but composite matches are definitely uncommon. Also, this code effectively runs only when a column identifier is of type decimal, which is very uncommon too, so I think the chance of hitting both cases at once is very remote.

But your comment is valid. Before, we were operating on arrays, so re-iterating and casting the Decimals in this particular use-case wasn't memory intensive because we were modifying the existing data structure in place. Here we are doing the opposite: working out the hashmaps beforehand, and copying to an array only at this point.

I guess that if you have more than one decimal column identifier, this could do the copy multiple times, and perhaps I can optimise that to happen only once. I'll give it a go.

I rewrote some logic to optimise the multi-column decimal scenario. This is the best I can think of, because I don't think Python allows me to add/remove values in a set while iterating over it. I think the best option is to create a copy and write it back at the end. Thoughts @ctd @cmclel7?

Yeah, I think you need to build a new collection.
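A small sketch of why a new collection is needed, using made-up values: adding to a set while iterating over it changes its size, and CPython raises `RuntimeError` on the next iteration step. Building a fresh set from the old one avoids this.

```python
from decimal import Decimal

match_ids = {"1.5", "2.5"}
caught = None

try:
    for m in match_ids:
        # Adding an element changes the set's size while it is being iterated.
        match_ids.add(Decimal(m))
except RuntimeError as exc:
    caught = exc

# Building a new set leaves the original untouched during iteration.
match_ids = {"1.5", "2.5"}
casted = {Decimal(m) for m in match_ids}
```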

You could use generator expressions here as well, for instance:

    def cast_column_values(column, schema):
        """
        Method to cast stringified MatchIds to their actual types
        """
        if column["Type"] == "Simple":
            if is_column_type_decimal(schema, column["Column"]):
                column["MatchIds"] = set(Decimal(m) for m in column["MatchIds"])
        else:
            decimal_columns = set(
                i
                for i, col in enumerate(column["Columns"])
                if is_column_type_decimal(schema, col)
            )
            if decimal_columns:
                decimal_casted = set(
                    tuple(
                        Decimal(m) if i in decimal_columns else m
                        for i, m in enumerate(composite_match_tuple)
                    )
                    for composite_match_tuple in column["MatchIds"]
                )
                column["MatchIds"] = decimal_casted
        return column

tbf after writing this I'm not sure that it will actually perform better. It saves having to build an intermediate list, but rather than iterating over decimal_columns (and only looping that many times), it adds a membership check for each of the len(composite_match_tuple) elements. Swings and roundabouts perhaps.
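To make the trade-off concrete, here is a sketch of the two shapes being compared. The first function is a guess at the intermediate-list approach, not the actual earlier revision; both function names and the sample data are made up. They produce the same result and differ only in whether each tuple is copied to a list first or rebuilt element by element with a membership check.

```python
from decimal import Decimal


def cast_with_intermediate_list(match_ids, decimal_columns):
    """Copy each tuple to a list, then cast only the decimal positions."""
    result = set()
    for composite_match in match_ids:
        match_array = list(composite_match)
        for i in decimal_columns:
            match_array[i] = Decimal(match_array[i])
        result.add(tuple(match_array))
    return result


def cast_with_generator(match_ids, decimal_columns):
    """Build each tuple directly, checking every position for membership."""
    return set(
        tuple(Decimal(m) if i in decimal_columns else m for i, m in enumerate(t))
        for t in match_ids
    )


match_ids = {("1.50", "alice"), ("2.25", "bob")}
decimal_columns = {0}

same = cast_with_intermediate_list(match_ids, decimal_columns) == cast_with_generator(
    match_ids, decimal_columns
)
```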

Ah, I actually like your changes. Using a set for decimal_columns will have only a slight impact, since I would assume the array usually has around 2 or 3 elements, but if we consider doing that lookup up to millions of times, it could help.

I also like the idea of not using an intermediate array for match_array, so I incorporated this in my change. Thanks!

backend/ecs_tasks/delete_files/parquet_handler.py

Co-authored-by: Chris Deigan <ctd@users.noreply.github.com>

codecov-commenter · 2024-02-15T13:09:00Z

Codecov Report

All modified and coverable lines are covered by tests ✅

❗ No coverage uploaded for pull request base (master@3efdceb).
Report is 1 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff            @@
##             master     #396   +/-   ##
=========================================
  Coverage          ?   99.71%           
=========================================
  Files             ?       31           
  Lines             ?     1744           
  Branches          ?        0           
=========================================
  Hits              ?     1739           
  Misses            ?        5           
  Partials          ?        0


cmclel7 and others added 8 commits January 25, 2024 15:42

Change list of match ids to set for json_handler.py

30657c5

Increase the speed of the json_handler by migrating from a list to a set. Move from O(n) to O(1)

Update CHANGELOG.md

ca23bac

Bump Version 0.65 -> 0.66

05be129

Include optimisation for composite json matches

c497147

Merge branch 'master' of github.com:awslabs/amazon-s3-find-and-forget…

31be324

… into json-perf

Improve JSON performance and include filesize in the logs

f6defcb

Bump version

2a4f0b7

Cleanup test

a7d63cf

matteofigus marked this pull request as ready for review January 30, 2024 19:00

matteofigus requested a review from cmclel7 February 5, 2024 01:50

cmclel7 reviewed Feb 14, 2024

ctd reviewed Feb 14, 2024

Update backend/ecs_tasks/delete_files/parquet_handler.py

26e0d21

Co-authored-by: Chris Deigan <ctd@users.noreply.github.com>

matteofigus added 2 commits February 15, 2024 16:35

Don't copy the columns multiple time for multiple Decimal identifiers

ff3bd0c

Further improvements

ec1e662

matteofigus requested a review from cmclel7 February 20, 2024 11:14
