wr.dynamodb.read_items gives a pyarrow error #2399

Closed
lmmx opened this issue Jul 20, 2023 · 1 comment · Fixed by #2401
Labels
bug Something isn't working

Comments

lmmx commented Jul 20, 2023

Describe the bug

I'm not quite sure where this problem is arising; I'm just attempting to read 10 rows from a DynamoDB table. It looks like the error is thrown before my query is even run, perhaps while resolving the table metadata itself?

Traceback (most recent call last):
  File "/home/louis/dev/testing/wrangler/wrangler_demo.py", line 3, in <module>
    items = wr.dynamodb.read_items(
  File "/home/louis/miniconda3/envs/wr310/lib/python3.10/site-packages/awswrangler/_utils.py", line 174, in inner
    return func(*args, **kwargs)
  File "/home/louis/miniconda3/envs/wr310/lib/python3.10/site-packages/awswrangler/dynamodb/_read.py", line 635, in read_items
    return _read_items(
  File "/home/louis/miniconda3/envs/wr310/lib/python3.10/site-packages/awswrangler/dynamodb/_read.py", line 384, in _read_items
    return _read_items_scan(
  File "/home/louis/miniconda3/envs/wr310/lib/python3.10/site-packages/awswrangler/dynamodb/_read.py", line 341, in _read_items_scan
    return _utils.table_refs_to_df(items, arrow_kwargs)
  File "/home/louis/miniconda3/envs/wr310/lib/python3.10/site-packages/awswrangler/_distributed.py", line 105, in wrapper
    return cls.dispatch_func(func)(*args, **kw)
  File "/home/louis/miniconda3/envs/wr310/lib/python3.10/site-packages/awswrangler/_utils.py", line 882, in table_refs_to_df
    return _table_to_df(pa.concat_tables(tables, promote=True), kwargs=kwargs)
  File "pyarrow/table.pxi", line 5371, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unable to merge: Field user_id has incompatible types: decimal128(4, 0) vs decimal128(6, 0)

How to Reproduce

import awswrangler as wr

items = wr.dynamodb.read_items(
    table_name="my-table",
    max_items_evaluated=10,  # limit the number of items to 10 for testing
    columns=["foo_id", "user_id", "bar_id"],  # specify the columns to read
)

print(items)

Expected behavior

The columns in this query are all N (Number) and should be integer dtype, but I can't even get the call to run. Looking at what the error implies, it seems to refer to the arrow decimal128 type's precision (a scale of 0 means there are no digits after the decimal point), and my interpretation is that the numbers returned span different orders of magnitude (powers of 10)?
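
For reference, here is a minimal sketch (outside of awswrangler, assuming pyarrow 12.x and made-up decimal values) that appears to reproduce the same ArrowInvalid purely at the pyarrow level, when two result pages infer different decimal128 precisions for the same column:

import pyarrow as pa
from decimal import Decimal

# Two pages of results where the same Number column is inferred with
# different decimal128 precisions (4 digits vs 6 digits, scale 0).
page_1 = pa.table({"user_id": pa.array([Decimal("1234")], type=pa.decimal128(4, 0))})
page_2 = pa.table({"user_id": pa.array([Decimal("123456")], type=pa.decimal128(6, 0))})

# Raises pyarrow.lib.ArrowInvalid: Unable to merge: Field user_id has
# incompatible types: decimal128(4, 0) vs decimal128(6, 0)
pa.concat_tables([page_1, page_2], promote=True)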

I just created a conda environment to test this library out as an alternative to boto3 client access to DynamoDB. Happy to try any suggestions for a fix.

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.10

AWS SDK for pandas version

3.2.1

Additional context

$ pip list
Package           Version
----------------- -------
awswrangler       3.2.1
boto3             1.28.7
botocore          1.31.7
jmespath          1.0.1
numpy             1.25.1
packaging         23.1
pandas            2.0.3
pip               23.1.2
pyarrow           12.0.1
python-dateutil   2.8.2
pytz              2023.3
s3transfer        0.6.1
setuptools        67.8.0
six               1.16.0
typing_extensions 4.7.1
tzdata            2023.3
urllib3           1.26.16
wheel             0.38.4
lmmx added the bug (Something isn't working) label Jul 20, 2023
jaidisido added a commit that referenced this issue Jul 20, 2023
Signed-off-by: Abdel Jaidi <jaidisido@gmail.com>
@jaidisido jaidisido linked a pull request Jul 20, 2023 that will close this issue
jaidisido (Contributor) commented

Thanks for raising this @lmmx. The issue here is that the user_id column appears to contain numbers with varying decimal precisions, so the pyarrow tables obtained from paginating end up with different schemas. #2401 fixes this and will be available in the next release. You will, however, have to pass the schema via pyarrow_additional_kwargs, for example:

import awswrangler as wr
import pyarrow as pa

schema = pa.schema([("foo_id", pa.int8()), ("user_id", pa.decimal128(6, 0)), ("bar_id", pa.int8())])

items = wr.dynamodb.read_items(
    table_name="my-table",
    max_items_evaluated=10,  # limit the number of items to 10 for testing
    columns=["foo_id", "user_id", "bar_id"],  # specify the columns to read
    pyarrow_additional_kwargs={"schema": schema},
)

Note that if you simply wish to obtain the items (without converting them to a pandas DataFrame), you can pass the as_dataframe=False argument. In that case you won't need to pass the pyarrow schema.
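
As a sketch of that alternative (reusing the hypothetical table and column names from the example above):

import awswrangler as wr

# Return the raw items as a list of dicts instead of a pandas DataFrame,
# so no pyarrow schema inference or table merging happens at all.
items = wr.dynamodb.read_items(
    table_name="my-table",  # hypothetical table name from the example above
    max_items_evaluated=10,
    columns=["foo_id", "user_id", "bar_id"],
    as_dataframe=False,
)

print(items)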

jaidisido added a commit that referenced this issue Jul 21, 2023
* fix: support pyarrow schema in DynamoDB read_items #2399
---------

Signed-off-by: Abdel Jaidi <jaidisido@gmail.com>