Errors Describing and Opening Parquet Formatted File with Blank Values #1637

Open
kkalbaugh opened this issue Feb 5, 2024 · 0 comments

kkalbaugh commented Feb 5, 2024

I'm unable to open or describe Parquet files that contain blank values. I've also tried passing the missing values as '', with no change. Both the CLI and Python produce the same error. The same data works as expected in CSV format. I can also load the CSV as a resource and write it out as a Parquet file, but I then cannot read the generated .parquet file.

frictionless describe parquet_data.parquet --yaml --stats --field-missing-values='' --debug

or

from frictionless import Detector, Resource

detector = Detector(field_missing_values=[''])
resource = Resource('parquet_data.parquet', detector=detector)
resource.open()

Error output:

TypeError                                 Traceback (most recent call last)
Cell In[48], line 1
----> 1 resource = describe('/home/jovyan/work/TEST-Frictionless/parquet_data_v7.parquet')
      2 pprint(resource)

File /opt/conda/lib/python3.11/site-packages/frictionless/actions/describe.py:31, in describe(source, name, type, stats, **options)
     11 def describe(
     12     source: Optional[Any] = None,
     13     *,
   (...)
     17     **options: Any,
     18 ) -> Metadata:
     19     """Describe the data source
     20 
     21     Parameters:
   (...)
     29         Metadata: described metadata e.g. a Table Schema
     30     """
---> 31     return Resource.describe(source, name=name, type=type, stats=stats, **options)

File /opt/conda/lib/python3.11/site-packages/frictionless/resource/resource.py:567, in Resource.describe(cls, source, name, type, stats, **options)
    564     return package
    566 # Package
--> 567 resource.infer(stats=stats)
    568 if type == "package":
    569     package = Package(resources=[resource])

File /opt/conda/lib/python3.11/site-packages/frictionless/resources/table.py:466, in TableResource.infer(self, stats)
    464     note = "Resource.infer canot be used on a open resource"
    465     raise FrictionlessException(errors.ResourceError(note=note))
--> 466 with self:
    467     if not stats:
    468         return

File /opt/conda/lib/python3.11/site-packages/frictionless/resource/resource.py:261, in Resource.__enter__(self)
    259 def __enter__(self):
    260     if self.closed:
--> 261         self.open()
    262     return self

File /opt/conda/lib/python3.11/site-packages/frictionless/resources/table.py:167, in TableResource.open(self)
    165 self.__open_labels()
    166 self.__open_fragment()
--> 167 self.__open_schema()
    168 self.__open_header()
    169 self.__open_lookup()

File /opt/conda/lib/python3.11/site-packages/frictionless/resources/table.py:202, in TableResource.__open_schema(self)
    200 def __open_schema(self):
    201     self.metadata_assigned.add("schema")
--> 202     self.schema = self.detector.detect_schema(
    203         self.fragment,
    204         labels=self.labels,
    205         schema=self.schema,
    206         field_candidates=system.detect_field_candidates(),
    207     )
    208     self.stats.fields = len(self.schema.fields)

File /opt/conda/lib/python3.11/site-packages/frictionless/detector/detector.py:380, in Detector.detect_schema(self, fragment, labels, schema, field_candidates)
    378     continue
    379 source = cells[index] if len(cells) > index else None
--> 380 is_field_missing_value = source in self.field_missing_values
    381 if is_field_missing_value:
    382     max_score[index] -= 1

File missing.pyx:392, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous
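
For anyone reading the traceback: the last frictionless frame is a plain membership test, `source in self.field_missing_values`, and the error is raised from `NAType.__bool__`. A minimal sketch of how that can happen, without needing pandas or frictionless installed, is below. `FakeNA` is a hypothetical stand-in that copies only the two `pandas.NA` behaviours relevant here (comparisons return NA, and NA has no truth value), and `is_missing` is an illustrative helper, not anything from frictionless. Treating an NA-like cell as missing is an assumption about the desired behaviour, not a confirmed fix.

```python
class FakeNA:
    """Hypothetical stand-in for pandas.NA, for illustration only."""

    def __eq__(self, other):
        # Like pandas.NA, comparing with anything yields NA rather than a bool.
        return self

    def __hash__(self):
        # Defining __eq__ disables the default hash, so restore one.
        return 0

    def __bool__(self):
        # Same message as the one at the bottom of the traceback.
        raise TypeError("boolean value of NA is ambiguous")


def is_missing(cell, missing_values):
    """Membership test that tolerates NA-like cells (illustrative helper)."""
    try:
        return cell in missing_values
    except TypeError:
        # The == result could not be coerced to bool.
        # Assumption: an NA-like cell should count as a missing value.
        return True


NA = FakeNA()
missing_values = [""]

# `x in list` compares x with each item and coerces the result to bool,
# which is where an NA-like object fails.
try:
    NA in missing_values
    raised = False
    message = ""
except TypeError as exc:
    raised = True
    message = str(exc)

print(raised, message)
print(is_missing(NA, missing_values), is_missing("x", missing_values))
```

If this is indeed the mechanism, an empty `field_missing_values` list would not trigger the error, because the membership test would have no items to compare against.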

GitHub doesn't allow Parquet attachments, so I zipped the file up here:
parquet_data.zip
Here's the CSV:
parquet_data.csv
