#1622: Fix issues with nested context manager calls #1626

Open
wants to merge 3 commits into base: main
Conversation

richardt-engineb

Summary

Problem statement

The Resource class is also a Context Manager. That is, it implements the __enter__() and __exit__() methods to allow the use of with Resource(...) statements.

Prior to this PR, there was no limit on nesting with statements on the same Resource, but this caused problems: while the second __enter__() allowed the Resource to already be open, the first __exit__() would close() the Resource while the higher-level context would expect it to still be open.

This would cause errors like "ValueError: I/O operation on closed file", or the iterator would appear to start from partway through a file rather than at the start of the file, and other similar behaviour depending on the exact locations of the nested functions.

This was made more complex because these with statements were often far removed from each other in the code, hidden behind iterators driven by generators, etc. They could also behave differently depending on the number of rows read, the type of Resource (local file vs inline, etc.), the different steps in a pipeline, and so on.

All this meant that the problem was rare, hard to reduce down to an obvious reproduction case, and not realistic to expect developers to understand while developing new functionality.
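The failure mode can be reproduced in miniature without frictionless at all. The sketch below uses a hypothetical NaiveResource class (not the real Resource) that mimics the pre-PR behaviour: a second __enter__() tolerates the already-open state, so the inner __exit__() closes the shared handle while the outer context still expects it to be open.

```python
import io

class NaiveResource:
    """Toy stand-in for a Resource-like context manager (hypothetical,
    not the real frictionless class). A nested __enter__() silently
    reuses the open handle, so the inner __exit__() closes it out from
    under the outer context."""

    def __init__(self, text):
        self._text = text
        self._handle = None

    def __enter__(self):
        if self._handle is None:   # nested enter reuses the open handle
            self._handle = io.StringIO(self._text)
        return self._handle

    def __exit__(self, *exc):
        if self._handle is not None:
            self._handle.close()   # closes even if an outer context is live
            self._handle = None

error_message = ""
resource = NaiveResource("a\nb\nc\n")
try:
    with resource as outer:
        with resource as inner:    # nested use of the SAME resource
            inner.readline()
        outer.readline()           # handle was closed by the inner exit
except ValueError as error:
    error_message = str(error)

print(error_message)               # I/O operation on closed file
```

In real code the two with statements were typically separated by several layers of generators, which is what made the resulting ValueError so hard to trace back to its cause.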

Solution

This PR prevents nested contexts from being created by raising an exception when the second, nested, with is attempted. This means that code that risks these issues can be quickly identified and resolved during development.

The best way to resolve it is to use Resource.to_copy() to create a copy, so that the nested with acts on an independent view of the same Resource, which is likely what was intended in most cases anyway.

This PR also updates a number of the internal uses of with to work on a copy of the Resource they are passed so that they are independent of any external code and what it might have done with the Resource prior to the library methods being called.
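A minimal sketch of the approach, using a hypothetical GuardedResource class (not the actual frictionless implementation): a nested enter raises instead of corrupting shared state, and to_copy() gives the nested with an independent view of the same data.

```python
import io

class GuardedResource:
    """Toy sketch of the guard described above (hypothetical class).
    Entering an already-entered resource raises; to_copy() returns an
    independent view over the same underlying data."""

    def __init__(self, text):
        self._text = text
        self._handle = None

    def to_copy(self):
        return GuardedResource(self._text)

    def __enter__(self):
        if self._handle is not None:
            raise RuntimeError(
                "Resource is already open; use to_copy() for nested use")
        self._handle = io.StringIO(self._text)
        return self._handle

    def __exit__(self, *exc):
        self._handle.close()
        self._handle = None

resource = GuardedResource("a\nb\n")
with resource as outer:
    with resource.to_copy() as inner:   # copy: contexts are independent
        inner_line = inner.readline()
    outer_line = outer.readline()       # outer handle is still open

# Nesting the same resource without a copy now fails fast:
try:
    with resource:
        with resource:
            pass
except RuntimeError as error:
    guard_message = str(error)
```

With the copy, both reads return the first line independently; without it, the error surfaces immediately at the nested enter rather than later as a mysterious closed-file failure.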

Breaking Change

This is technically a breaking change as any external code that was developed using nested with statements - possibly deliberately, but more likely unknowingly not falling into the error cases - will have to be updated to use to_copy() or similar.

However, the library functions have all been updated in a way that doesn't change their signature or their expected behaviour as documented by the unit tests. All pre-existing unit tests pass with no changes, and the unit tests added for the specific updated behaviour do not require any unusual constructs. It is still possible that some undocumented and untested side-effect behaviours differ from before, and any code relying on those may also be affected (e.g. to_petl() iterators are now independent rather than causing changes in each other).

So it is likely that very few actual impacts will occur in real-world code, and the exception thrown does its best to explain the issue and suggest resolutions.

Tests

  • All existing unit tests run and pass unchanged.
  • New unit tests were added to cover the updated behaviour:
    - These unit tests were confirmed to fail without the updates in this PR (where appropriate).
    - These unit tests now pass with the updated code.
  • The original script that identified the issue in #1622 ("frictionless transform unhandled exception on 200 line csv, but not 197 line csv") was run and now gives the correct result (all rows appropriately converted and saved to file).

@roll
Member

roll commented Jan 29, 2024

Thanks @richardt-engineb !

It's a complex change; I will ask for a peer review in the chat.

# Summary

This fixes formatting issues identified by the linter.  No functional changes are included.

# Tests

- `hatch run lint` passes
- `hatch run test` passes (and coverage is sufficient)
# Summary

The CI tests identified some issues that don't show up on a normal test run.  This commit fixes those issues.

It also highlighted that there were numerous areas
that didn't have sufficient test coverage for the case
that the caller had already opened the resource.

The indexer has some notable changes, but the biggest
area affected is the parsers when writing
from an already opened source.  This commit adds
unit tests for the index and all the parser formats for this case,
and fixes the code to support the lack of nested contexts.
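The pattern applied in those fixes can be illustrated with toy classes (hypothetical names, not the real frictionless API): a library-style function works on a copy of the resource it receives, so it behaves identically whether or not the caller has already opened it.

```python
import io

class ToyResource:
    """Hypothetical minimal resource used to illustrate the pattern;
    nested entry of the same instance is forbidden, as in this PR."""

    def __init__(self, text):
        self._text = text
        self._handle = None

    def to_copy(self):
        return ToyResource(self._text)

    def __enter__(self):
        if self._handle is not None:
            raise RuntimeError("already open: use to_copy()")
        self._handle = io.StringIO(self._text)
        return self._handle

    def __exit__(self, *exc):
        self._handle.close()
        self._handle = None

def write_upper(resource):
    """Library-style helper: operates on an independent copy, so a
    caller-held open context cannot collide with this one."""
    with resource.to_copy() as handle:
        return handle.read().upper()

source = ToyResource("a,b\n")
with source:                      # caller has already opened the resource
    result = write_upper(source)  # still works: the helper copies first
```

This is the shape of the added parser and indexer tests: call the writing path with a source that the test has already opened, and assert the result is unchanged.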

# Tests

- Set up the required databases for CI by copying the commands in the GitHub Actions workflows
- Run `hatch run +py=3.11 ci:test` and ensure all tests pass and coverage remains sufficient
- Run `hatch run test` in case it is different and ensure all tests pass and coverage remains sufficient

This also means that all linting etc. has been run too.
Member

@augusto-herrmann left a comment

I have never used nested with statements with Resource objects, so I do not think this would directly affect our own existing code.

However, by looking at the code in this PR I notice that a lot of internal code of the Frictionless Framework was changed to use Resource.to_copy() when using context managers. Does copying a Resource object entail copying the data as well, or just the metadata? If this does indeed create an in-memory copy of the whole resource, including data, it is likely to increase memory usage, which might be an issue if the Resource includes a lot of data.

It would be nice to make a test comparing memory usage and running time before and after the change, especially with large Resources.

@richardt-engineb
Author

Thanks for the comments @augusto-herrmann (and sorry about the delay in replying).

> I have never used nested with statements with Resource objects, so I do not think this would directly affect our own existing code.

It's good to hear that it wouldn't affect your own use of Frictionless. The main cases affected would be where library internals happen to be called in such a way that nested with statements occur, e.g. when using a pipeline with steps: if the steps re-use the same Resource object the whole way through (rather than copies), you can end up with nested with statements (usually related to to_petl()-based generators, etc.).

> However, by looking at the code in this PR I notice that a lot of internal code of the Frictionless Framework was changed to use Resource.to_copy() when using context managers. Does copying a Resource object entail copying the data as well, or just the metadata? If this does indeed create an in-memory copy of the whole resource, including data, it is likely to increase memory usage, which might be an issue if the Resource includes a lot of data.
>
> It would be nice to make a test comparing memory usage and running time before and after the change, especially with large Resources.

to_copy() is:

    def to_copy(self, **options: Any) -> Self:
        """Create a copy from the resource"""
        return super().to_copy(
            data=self.data,
            basepath=self.basepath,
            detector=self.detector,
            package=self.package,
            **options,
        )

The main thing in there that could take up space is indeed data, which could noticeably increase memory use. The mitigation is that with very large datasets, the data will most likely come from a file that is read lazily via generators/iterators rather than cached whole in memory. Conversely, anywhere that full in-memory data is used, the dataset is likely small enough that the memory increase is not a concern overall.
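To illustrate the memory point: the quoted to_copy() forwards data=self.data, i.e. a reference to the same object, so an in-memory payload is shared between the original and the copy rather than duplicated. A toy sketch of that behaviour (hypothetical TinyResource, not the real class):

```python
class TinyResource:
    """Minimal sketch mirroring the quoted to_copy(): the copy receives
    the same `data` object (a reference), not a duplicated payload."""

    def __init__(self, data):
        self.data = data

    def to_copy(self):
        return TinyResource(data=self.data)

rows = [["id", "name"], [1, "english"], [2, "german"]]
original = TinyResource(rows)
duplicate = original.to_copy()

assert duplicate is not original        # independent Resource object
assert duplicate.data is original.data  # but the data is shared, not copied
```

So the per-copy overhead is the (small) Resource object and its metadata, not a second copy of the rows themselves.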

Are there any tests you would specifically like to see to increase your confidence, or use-cases where huge in-memory data sets are being used in ways that this would make harder?

Even with this change, someone with huge in-memory datasets could use open and close explicitly to avoid copies. That wouldn't affect the internals, however, so it does become a question of correctness vs memory use. I did consider doing reference counting instead of copying, but that still had the problem that the state of the generators/iterators was inconsistent depending on what else had been called (e.g. pipeline ordering). For example, the tests/table/test_to_petl.py tests that I added do not pass without the to_copy() changes I made (and would not pass under a reference-counting approach).
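The reference-counting alternative can be sketched to show why it was rejected (hypothetical RefCountedResource, not proposed API): nesting no longer closes the handle early, but both contexts still share one read position, so iteration state leaks between them.

```python
import io

class RefCountedResource:
    """Sketch of the rejected reference-counting alternative: the handle
    survives nested exits, but inner and outer contexts still share a
    single read position."""

    def __init__(self, text):
        self._text = text
        self._handle = None
        self._refs = 0

    def __enter__(self):
        if self._refs == 0:
            self._handle = io.StringIO(self._text)
        self._refs += 1
        return self._handle

    def __exit__(self, *exc):
        self._refs -= 1
        if self._refs == 0:
            self._handle.close()
            self._handle = None

resource = RefCountedResource("a\nb\nc\n")
with resource as outer:
    with resource as inner:
        inner.readline()          # inner consumes the first line
    line = outer.readline()       # outer resumes partway through the file
```

Here `line` is the second line, not the first: no exception is raised, but the outer context silently starts partway through the data, which is exactly the subtle misbehaviour the copies avoid.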

I also considered adding a SAFE_ITERATORS or similar flag that would either copy or not copy depending on value, but that also seemed like asking for trouble in the future as when the issues happen they are very subtle and hard to understand. So it would almost need to be a DANGEROUS_THINGS_WILL_EVENTUALLY_GO_WRONG flag or something!

As I said, thanks for the feedback, and happy to look at any additional tests etc. that you think would conclusively show a meaningful difference that would convince you either way.
