Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48247][PYTHON] Use all values in a dict when inferring MapType schema #46547

Closed
wants to merge 2 commits into from

Conversation

HyukjinKwon
Copy link
Member

@HyukjinKwon HyukjinKwon commented May 13, 2024

What changes were proposed in this pull request?

This is similar with #36545. This PR proposes to infer the map types from all pairs instead of the first pair.

Why are the changes needed?

To have the consistent behaivor. e.g.,

>>> spark.createDataFrame([[1], [2], ["a"], ["c"]]).collect()
[Row(_1='1'), Row(_1='2'), Row(_1='a'), Row(_1='c')]

Does this PR introduce any user-facing change?

Yes. See below

Without Spark Connect:

>>> spark.createDataFrame([{"outer": {"payment": 200.5, "name": "A"}}]).collect()
[Row(outer={'name': 'A', 'payment': '200.5'})]
>>> spark.conf.set("spark.sql.pyspark.legacy.inferMapTypeFromFirstPair.enabled", True)
>>> spark.createDataFrame([{"outer": {"payment": 200.5, "name": "A"}}]).collect()
[Row(outer={'name': None, 'payment': 200.5})]

With Spark Conenct:

>>> spark.createDataFrame([{"outer": {"payment": 200.5, "name": "A"}}]).collect()
[Row(outer={'payment': '200.5', 'name': 'A'})]
>>> spark.conf.set("spark.sql.pyspark.legacy.inferMapTypeFromFirstPair.enabled", True)
>>> spark.createDataFrame([{"outer": {"payment": 200.5, "name": "A"}}]).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/connect/session.py", line 635, in createDataFrame
    _table = LocalDataToArrowConversion.convert(_data, _schema)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../spark/python/pyspark/sql/connect/conversion.py", line 378, in convert
    return pa.Table.from_arrays(pylist, schema=pa_schema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 3974, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1464, in pyarrow.lib._sanitize_arrays
  File "pyarrow/array.pxi", line 373, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 343, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'A' with type str: tried to convert to double

How was this patch tested?

Unittests added

Was this patch authored or co-authored using generative AI tooling?

No.

@HyukjinKwon HyukjinKwon requested a review from ueshin May 13, 2024 02:33
@HyukjinKwon HyukjinKwon changed the title [SPARK-48247][PYTHON] Use all values in a python dict when inferring MapType schema [SPARK-48247][PYTHON] Use all values in a dict when inferring MapType schema May 13, 2024
.internal()
.doc("PySpark's SparkSession.createDataFrame infers the key/value types of a map from all " +
"paris in the map by default. If this config is set to true, it restores the legacy " +
"behavior of only inferring the type from the first pair.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: first non-null pair

@xinrong-meng
Copy link
Member

LGTM, thank you!

@HyukjinKwon
Copy link
Member Author

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants