frictionless transform unhandled exception on 200 line csv, but not 197 line csv #1622

Open
richardt-engineb opened this issue Jan 18, 2024 · 2 comments · May be fixed by #1626

Comments

@richardt-engineb

Overview

I am finding a very strange error when doing a transform (either in Python code or via the command line tool). Depending on the size of the input file, the transform either succeeds fine or throws an "I/O operation on closed file" exception. The number of lines required to trigger it seems to vary, even by execution environment.

On an M1 Mac Mini it's currently 198 lines that crashes and 197 lines that passes. On a gitpod instance (Ubuntu) it was around the same yesterday, but today it is more like 150. In our production code it can take 10k+ lines. But there is always a size above which this fails (and a size far short of e.g. settings.FIELD_SIZE_LIMIT).

Example Command Line

% frictionless transform data/crash-transform/data.csv --pipeline data/crash-transform/pipeline.json
╭─ Error ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ [step-error] Step is not valid: "cell_replace" raises "[step-error] Step is not valid: "table_normalize" raises "[source-error] The data source has not supported or has         │
│ inconsistent contents: I/O operation on closed file. " "                                                                                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Pipeline.json

{
  "steps": [
    { "type": "table-normalize" },
    { "type": "cell-replace", "pattern": "BLANK", "replace": "" },
    { "type": "cell-replace", "pattern": "blank", "replace": "" },
    { "type": "cell-replace", "pattern": "NULL", "replace": "" },
    { "type": "cell-replace", "pattern": "null", "replace": "" },
    {
      "name": "NewSumField",
      "type": "field-add",
      "formula": "Field1 + Field2"
    },
    { "name": "NewConstantField", "type": "field-add", "value": "NewValue" }
  ]
}

data.csv

Field1,Field2,Random1,Random2,Random3,Random4,Random5,Random6,Random7,Random8,Random9
0,0,BLANK,blank,NULL,null,5val0,6val0,7val0,8val0,9val0
1,10,1val1,2val1,3val1,4val1,5val1,6val1,7val1,8val1,9val1
2,20,1val2,2val2,3val2,4val2,5val2,6val2,7val2,8val2,9val2
3,30,BLANK,2val3,3val3,4val3,5val3,6val3,7val3,8val3,9val3
... extend as needed ...

Sample files

data.csv
pipeline.json

@richardt-engineb
Author

I've been doing some debugging of this and have some ideas about what it might be related to, but I'm relatively new to Frictionless so I may be confused about the architecture or the specifics.

resource.py __enter__ and __exit__ are not re-entrant

There seems to be an issue where __enter__ and __exit__ are not re-entrant but are called multiple times.
__enter__ is:

    # TODO: shall we guarantee here that it's at the beggining for the file?
    # TODO: maybe it's possible to do type narrowing here?
    def __enter__(self):
        if self.closed:
            self.open()
        return self

__exit__ is:

    def __exit__(self, type, value, traceback):  # type: ignore
        self.close()

So if you do something like (pseudo-code):

resource = Resource()
with resource as r1:
  # A1
  with resource as r2:
    # B1
  # A2

Then __enter__ is called twice. The first call does self.open(); the second just returns. When the inner with ends, __exit__ is called and thus self.close() is immediately called. So at point A2 you are still inside the outer with, but the resource is now closed.

If you start consuming a generator on the resource (e.g. row_stream()) at A1 it works, but if you continue it at A2 the resource is closed and the iteration will (eventually) fail.

Why it fails eventually (and not immediately)

I believe the number of rows matters because the CSV parser takes a sample and then iterates through chain(sample, self.loader.text_stream). So my belief is that if the file fits within the sample, the self.close() doesn't matter because the sample has already been read. But if you read beyond the size of the sample, you are into self.loader.text_stream, which has been closed as above.
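
A quick standalone illustration of this (plain Python and itertools, not frictionless itself; "data.csv" stands in for any file with a few lines) shows why the failure is deferred rather than immediate:

from itertools import chain

f = open("data.csv")
sample = [next(f) for _ in range(3)]  # parser-style buffered sample
stream = chain(sample, f)             # remaining lines read lazily

f.close()  # simulates the premature close() from the outer __exit__

for line in stream:
    print(line.rstrip())
# The three sampled lines print fine; the fourth iteration raises
# ValueError: I/O operation on closed file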

Other concerns

The comment # TODO: shall we guarantee here that it's at the beggining for the file? seems to suggest that the expectation is that every with should start at the beginning of the file, whereas the current code just returns self, which will be in whatever the current state of the file is (or the state of the parser -> loader, I think?).

Proposed solutions

I can see a number of ways to potentially fix this, but my limited understanding of the architecture and the central nature of resource.py make me cautious about going too far down one route without advice from more knowledgeable people. I'm happy to work on the options and produce a PR given some advice on the best way to go.

Option 1: Make __enter__ and __exit__ reference counted

We could add a reference count so that __exit__ only calls close() when the number of __exit__ calls matches the number of __enter__ calls. This would mean that the Resource was still open at A2 above. I have tested this and it seems to work (i.e. it can process large files without crashing).
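
A minimal sketch of the version I tested (the _enter_count attribute is my invention, not an existing field in resource.py):

def __enter__(self):
    if self.closed:
        self.open()
    self._enter_count = getattr(self, "_enter_count", 0) + 1
    return self

def __exit__(self, type, value, traceback):  # type: ignore
    self._enter_count -= 1
    if self._enter_count == 0:
        self.close()  # only the outermost `with` actually closes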

However, I then see other weird behaviour, like our code only starting its output at line 297 of the input rather than line 1. I suspect this is due to the "re-open in the current state (e.g. first 100 lines already read into the sample) rather than starting at the beginning" behaviour, but I haven't done enough debugging to prove it.

Option 2: Make enter return a deep clone of the Resource with a reset parser/loader as needed

This would mean that in the above pseudo-code r1 and r2 are actually different objects with different, independent views on the same file, and closing r2 would have no effect on r1. This seems broadly the safest, and would be in agreement with the comment in the code about being at the beginning of the file. However, it's also the most significant change and I'm not sure if it is in line with the expected architecture.

With enough nesting or chaining, you could also presumably reach a point where you can't open any more connections to the file (particularly remote files).

Option 3: something else

Perhaps we are using it wrong, or I'm entirely misunderstanding what is going wrong, so there may be a completely different and better way to fix it.

@richardt-engineb
Copy link
Author

Further Investigation: Option 2 seems correct, but design goals confirmation needed

Summary of proposal(s)

See details below, but from further investigation I believe that Option 2 - returning a clone/copy - is the correct solution, though there are still some decisions about the right way to do it. Essentially we could:

  • Option 2a: always RETURN A COPY from Resource::__enter__(). This is easy, but requires everyone to use the full with resource as target: form (and not just with resource:, which doesn't use the returned copy and instead expects the resource to be mutated in place). The code uses both styles of with at present.
  • Option 2b: make Resource::__enter__() single-use by throwing an exception if it's called multiple times (due to nested contexts). Callers would explicitly have to to_copy() before nesting withs. This is like 2a but potentially more explicit.
  • Option 2c: make Resource some kind of FACADE WITH INTERNAL CONTEXT STACK/CACHE, with a new context being pushed in __enter__() and popped in __exit__(). This is hard, but would allow both forms of with to work (with the state of the Resource being the state at the top of the context stack).

I believe that Option 2a is the simplest and best solution, with Option 2b a close second, but I'm keen to get thoughts from more experienced users before making a PR.

Simplified error demo for discussion

As discussed in the previous comment, there is a relatively simple reproduction case. The code for that is below (a minimal sketch assuming the v5 TableResource API); I'm also attaching a working demo (remove the .txt extension, which was added to allow GitHub to accept the upload): crash.py.txt

from itertools import islice
from frictionless.resources import TableResource

resource = TableResource(path="data.csv")
with resource as r1:
    # A1: iterate r1's row stream 100 times, printing each row
    for row in islice(r1.row_stream, 100):
        print(row)
    with resource as r2:
        # B1: iterate r2's row stream 50 times, printing each row
        for row in islice(r2.row_stream, 50):
            print(row)
    # A2: iterate r1's row stream another 100 times, printing each row
    for row in islice(r1.row_stream, 100):
        print(row)

With no changes, this crashes at # A2 with ValueError: I/O operation on closed file., because the end of the with resource as r2: block closes the resource. While this might seem like a case of "don't do that", it does actually happen in pipelines within the step functions etc.

Why not Option 1 (reference counting)?

By reference counting, I mean increment a counter whenever __enter__() is called, decrement it in __exit__(), and only call close() when the ref count reaches 0 in __exit__().

This prevents the crash, but the output of the above code would be:

  1. At A1 it outputs rows 1...100
  2. At B1 it outputs rows 101...150 (i.e. it continues from the state of r1)
  3. At A2 it outputs rows 151...250 (i.e. it again continues from the state of r2)

While this might be understandable in the above file, it often happens deeply nested in the pipeline, in ways that wouldn't be expected and that give the wrong result. E.g. it happens when cell-replace reads the file to get the header row, via the petl fieldnames call in return convert(table, fieldnames(table), *args, **kwargs) inside convertall().

This means that if we implement reference counting and run the pipeline above, it does run to completion but skips the first 297 rows of the output (those rows are consumed setting up the pipeline, and aren't available for yielding by the time the whole generator chain gets row_stream-ed to write the file).

Option 2 - cloning the resource - works as expected

Option 2a - simple return a copy

Change the __enter__() in resource.py to:

def __enter__(self):
    copy = self.to_copy()
    copy.open()
    return copy

This also prevents the crash, and the output of the above code would be:

  1. At A1 it outputs rows 1...100
  2. At B1 it outputs rows 1...50 (i.e. r2 is its own context that starts again at the start of the file)
  3. At A2 it outputs rows 101...200 (i.e. r1 continues where it left off at A1, and is not affected by B1)

However, more globally this works if and only if you go through and change all usages of with resource: to the with resource as copy: style. This is necessary so that (a) you actually use the copy in the subsequent code, and (b) you close the copy. If you use the basic with resource:, you are expecting the resource itself to be opened, and this code doesn't do that (for the same reason the existing code doing that can be an issue).

This can be resolved by going through the library and fixing all the cases to use the with ... as ... style, but it also requires all future code to follow that pattern. Note that, as far as I know, there is no way for __enter__() to know whether it's being used in the with ... as form or just with, so there's no way to throw an exception or similar to guarantee correct usage. Throwing a descriptive exception on closing an already-closed Resource might be a partial help, but more likely the code crashes in various "resource not open" places before that (though those could also be updated to give descriptive messages explaining the issue).

This would also be a breaking change for any external code that uses with without as, though I would argue that such code is already either subtly broken or works by chance rather than design, given the issues discussed above.

I have implemented this fix in a private copy and it does work for the pipeline given at the start, returning the correct rows (starting at row 1 as expected). So I am confident this can be made into a relatively neat PR.

Option 2b - make Resources explicitly single-use

We could make Resources explicitly single-use by throwing an exception if __enter__() is called on a resource that has previously been entered. This is similar to the above, but instead of handling it internally and forcing the use of with ... as, it makes the requirement external and forces the use of resource.to_copy() before any nested with.

The benefit is that you get an explicit exception when you do the wrong thing.

The downside is that you have to update every occurrence of a nested with, not just those that don't use with ... as ..., and it makes the code somewhat more verbose.

I have not implemented this, but it is really the same as 2a with the code changes in a different place.
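
For illustration, the guard might look something like this (the _entered flag is invented, and a real PR would presumably raise a frictionless error type rather than RuntimeError):

def __enter__(self):
    if getattr(self, "_entered", False):
        raise RuntimeError(
            "Resource is single-use as a context manager; "
            "call to_copy() before nesting 'with' statements"
        )
    self._entered = True
    if self.closed:
        self.open()
    return self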

Option 2c - caching context stack

You could in theory do something similar to 2a, but keep the copy on an internal stack: every time __enter__() is called you push a new context onto the stack, and every __exit__() pops one. The Resource class would use the latest state, and on exit its state would be restored to what it was before the previous enter.

This would make either form of with work, but is much more complicated, particularly with the various specialisations of Resource (TableResource, JsonResource, etc), the dynamic metadata, etc.

I have not implemented this, and I'm somewhat concerned about the scale and complexity of trying to implement an appropriate facade and context stack.
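
Very roughly, and with all the caveats above, the push/pop could look like this (entirely hypothetical; _context_stack is invented, and the hard part - delegating all reads to the top of the stack - is glossed over):

def __enter__(self):
    context = self.to_copy()
    context.open()
    self._context_stack.append(context)
    return self  # the facade; reads would delegate to the stack top

def __exit__(self, type, value, traceback):  # type: ignore
    self._context_stack.pop().close()  # fall back to the outer context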

Option 3 - something I've missed?

I'd also be very happy to hear that there's something obvious I've missed, and that there's a different and simpler way to fix it!

Discussion in relation to PEP 343 – The “with” Statement

These issues and possible solutions are somewhat discussed in the PEP that defined the with statement:
https://peps.python.org/pep-0343/#caching-context-managers

Many context managers (such as files and generator-based contexts) will be single-use objects. Once the __exit__() method has been called, the context manager will no longer be in a usable state (e.g. the file has been closed, or the underlying generator has finished execution).

Requiring a fresh manager object for each with statement is the easiest way to avoid problems with multi-threaded code and nested with statements trying to use the same context manager. It isn’t coincidental that all of the standard library context managers that support reuse come from the threading module - they’re all already designed to deal with the problems created by threaded and nested usage.

We could enforce Resources to be single-use objects, as they are essentially generators, which the PEP suggests should be single-use. But that seems to go against the architecture, as there are numerous places where with statements are nested inside other with statements (though disguised by the composable nature of pipeline steps etc.). Options 2a and 2b are both really a way to make resources single-use in practice - by using a new single-use copy for each nested with - either implicitly or explicitly.
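
Plain file objects show exactly this hazard, since nothing guards re-entry but their __exit__ closes unconditionally:

f = open("data.csv")
with f:
    with f:   # no error on re-entry...
        pass  # ...but exiting here closes the file
    f.read()  # ValueError: I/O operation on closed file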

Equally, the PEP does suggest the use of caching context managers, as per Option 2c, and it seems something like this is implemented for Decimal precision (see example 8 in https://peps.python.org/pep-0343/#examples). That does, however, seem a lot harder in the Resource case due to the complexity of Resources.

richardt-engineb added a commit to Engine-B/frictionless-py that referenced this issue Jan 25, 2024
# Summary

## Problem statement

The `Resource` class is also a [Context Manager](https://docs.python.org/3/reference/datamodel.html#context-managers).  That is, it implements the
`__enter__()` and `__exit__()` methods to allow the use of
`with Resource(...)` statements.

Prior to this PR, there was no limit on nesting `with` statements on
the same `Resource`, but this caused problems because while the second
`__enter__()` allowed the `Resource` to already be open, the first
`__exit__()` would `close()` the Resource while the higher-level context
would expect it to still be open.

This would cause errors like "ValueError: I/O operation on closed file",
or the iterator would appear to start from part way through a file rather
than at the start of the file, and other similar behaviour depending on
the exact locations of the nested functions.

This was made more complex because these `with` statements were often
far removed from each other in the code, hidden behind iterators driven
by generators, etc. They could also behave differently depending on the
number of rows read, the type of Resource (local file vs inline, etc.),
the different steps in a pipeline, and so on.

All this meant that the problem was rare, hard to reduce down to an
obvious reproduction case, and not realistic to expect developers to
understand while developing new functionality.

## Solution

This PR prevents nested contexts being created by throwing an exception
when the second, nested, `with` is attempted.  This means that code that
risks these issues can be quickly identified and resolved during development.

The best way to resolve it is to use `Resource.to_copy()` to copy so that
the nested `with` is acting on an independent view of the same Resource,
which is likely what is intended in most cases anyway.

This PR also updates a number of the internal uses of `with` to work
on a copy of the Resource they are passed so that they are independent
of any external code and what it might have done with the Resource prior
to the library methods being called.

## Breaking Change

This is technically a breaking change as any external code that was
developed using nested `with` statements - possibly deliberately, but
more likely unknowingly not falling into the error cases - will have to
be updated to use `to_copy()` or similar.

However, the library functions have all been updated in a way that
doesn't change their signature or their expected behaviour as documented
by the unit tests. All pre-existing unit tests pass with no changes,
and the unit tests added for the specific updated behaviour do not require
any unusual constructs.  It is still possible that some undocumented
and untested side-effect behaviours are different than before, and any
code relying on those may also be affected (e.g. `to_petl()` iterators are now independent rather than causing changes in each other).

So it is likely that very few actual impacts will occur in real-world
code, and the exception thrown does its best to explain the issue
and suggest resolutions.

# Tests

- All existing unit tests run and pass unchanged
- New unit tests were added to cover the updated behaviour
    - These unit tests were confirmed to fail without the updates
      in this PR (where appropriate).
    - These unit tests now pass with the updated code.
- The original script that identified the issue in frictionlessdata#1622 was run and
  now gives the correct result (all rows appropriately converted and
  saved to file)
richardt-engineb added a commit to Engine-B/frictionless-py that referenced this issue Jan 25, 2024
# Summary

The CI tests identified some issues that don't show up on a normal test run.  This commit fixes those issues.

It also highlighted that there were numerous areas
that didn't have sufficient test coverage for the case
where the caller has already opened the resource.

The indexer has some notable changes, but the biggest
area affected is the parsers when writing
from an already opened source.  This commit adds
unit tests for the indexer and all the parser formats for this case,
and fixes the code to support the lack of nested contexts.

# Tests

- Set up the required databases for CI by copying the commands in the GitHub actions
- Run `hatch run +py=3.11 ci:test` and ensure all tests pass and coverage remains sufficient
- Run `hatch run test` in case it is different and ensure all tests pass and coverage remains sufficient

This also means that all linting etc. has been run too.