wr.dynamodb.read_items gives a pyarrow error #2399

Closed
lmmx opened this issue Jul 20, 2023 · 1 comment · Fixed by #2401
Labels
bug Something isn't working

Comments

lmmx commented Jul 20, 2023

Describe the bug

I'm not quite sure where this problem is arising; I'm just attempting to read 10 rows from a DynamoDB table. It looks like the error is thrown before my query is even run, perhaps while resolving the table metadata itself?

Traceback (most recent call last):
  File "/home/louis/dev/testing/wrangler/wrangler_demo.py", line 3, in <module>
    items = wr.dynamodb.read_items(
  File "/home/louis/miniconda3/envs/wr310/lib/python3.10/site-packages/awswrangler/_utils.py", line 174, in inner
    return func(*args, **kwargs)
  File "/home/louis/miniconda3/envs/wr310/lib/python3.10/site-packages/awswrangler/dynamodb/_read.py", line 635, in read_items
    return _read_items(
  File "/home/louis/miniconda3/envs/wr310/lib/python3.10/site-packages/awswrangler/dynamodb/_read.py", line 384, in _read_items
    return _read_items_scan(
  File "/home/louis/miniconda3/envs/wr310/lib/python3.10/site-packages/awswrangler/dynamodb/_read.py", line 341, in _read_items_scan
    return _utils.table_refs_to_df(items, arrow_kwargs)
  File "/home/louis/miniconda3/envs/wr310/lib/python3.10/site-packages/awswrangler/_distributed.py", line 105, in wrapper
    return cls.dispatch_func(func)(*args, **kw)
  File "/home/louis/miniconda3/envs/wr310/lib/python3.10/site-packages/awswrangler/_utils.py", line 882, in table_refs_to_df
    return _table_to_df(pa.concat_tables(tables, promote=True), kwargs=kwargs)
  File "pyarrow/table.pxi", line 5371, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unable to merge: Field user_id has incompatible types: decimal128(4, 0) vs decimal128(6, 0)

How to Reproduce

import awswrangler as wr

items = wr.dynamodb.read_items(
    table_name="my-table",
    max_items_evaluated=10,  # limit the number of items to 10 for testing
    columns=["foo_id", "user_id", "bar_id"],  # specify the columns to read
)

print(items)

Expected behavior

The columns in this query are all N (Number) and should be integer dtype, but I can't even get the call to run. Looking at what the error implies, it seems to refer to the arrow decimal128 type's precision (a scale of 0 means there are no digits after the decimal point), and my interpretation is that the numbers returned span different orders of magnitude (powers of 10)?
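
For reference, here is a minimal sketch (outside of awswrangler, assuming pyarrow 12.x and made-up decimal values) that appears to reproduce the same ArrowInvalid purely at the pyarrow level, when two result pages infer different decimal128 precisions for the same column:

import pyarrow as pa
from decimal import Decimal

# Two pages of results where the same Number column is inferred with
# different decimal128 precisions (4 digits vs 6 digits, scale 0).
page_1 = pa.table({"user_id": pa.array([Decimal("1234")], type=pa.decimal128(4, 0))})
page_2 = pa.table({"user_id": pa.array([Decimal("123456")], type=pa.decimal128(6, 0))})

# Raises pyarrow.lib.ArrowInvalid: Unable to merge: Field user_id has
# incompatible types: decimal128(4, 0) vs decimal128(6, 0)
pa.concat_tables([page_1, page_2], promote=True)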

I just created a conda environment to test this library out as an alternative to boto3 client access to DynamoDB. Happy to try any suggestions for a fix.

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.10

AWS SDK for pandas version

3.2.1

Additional context

$ pip list
Package           Version
----------------- -------
awswrangler       3.2.1
boto3             1.28.7
botocore          1.31.7
jmespath          1.0.1
numpy             1.25.1
packaging         23.1
pandas            2.0.3
pip               23.1.2
pyarrow           12.0.1
python-dateutil   2.8.2
pytz              2023.3
s3transfer        0.6.1
setuptools        67.8.0
six               1.16.0
typing_extensions 4.7.1
tzdata            2023.3
urllib3           1.26.16
wheel             0.38.4
lmmx added the bug (Something isn't working) label Jul 20, 2023
jaidisido added a commit that referenced this issue Jul 20, 2023
Signed-off-by: Abdel Jaidi <jaidisido@gmail.com>
@jaidisido jaidisido linked a pull request Jul 20, 2023 that will close this issue
jaidisido (Contributor) commented

Thanks for raising this @lmmx. The issue here is that the user_id column appears to contain numbers with varying decimal precisions, so the pyarrow tables obtained from paginating end up with different schemas. #2401 fixes this and will be available in the next release. You will, however, have to pass the schema via pyarrow_additional_kwargs, for example:

import awswrangler as wr
import pyarrow as pa

schema = pa.schema([("foo_id", pa.int8()), ("user_id", pa.decimal128(6, 0)), ("bar_id", pa.int8())])

items = wr.dynamodb.read_items(
    table_name="my-table",
    max_items_evaluated=10,  # limit the number of items to 10 for testing
    columns=["foo_id", "user_id", "bar_id"],  # specify the columns to read
    pyarrow_additional_kwargs={"schema": schema},
)

Note that if you simply wish to obtain the items (without converting them to a pandas DataFrame), you can pass the as_dataframe=False argument. In that case you won't need to pass the pyarrow schema.
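
As a sketch of that alternative (reusing the hypothetical table and column names from the example above):

import awswrangler as wr

# Return the raw items as a list of dicts instead of a pandas DataFrame,
# so no pyarrow schema inference or table merging happens at all.
items = wr.dynamodb.read_items(
    table_name="my-table",  # hypothetical table name from the example above
    max_items_evaluated=10,
    columns=["foo_id", "user_id", "bar_id"],
    as_dataframe=False,
)

print(items)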

jaidisido added a commit that referenced this issue Jul 21, 2023
* fix: support pyarrow schema in DynamoDB read_items #2399
---------

Signed-off-by: Abdel Jaidi <jaidisido@gmail.com>