[SPARK-48247][PYTHON] Use all values in a dict when inferring MapType schema #46547

HyukjinKwon · 2024-05-13T02:33:15Z

What changes were proposed in this pull request?

This is similar to #36545. This PR proposes inferring map types from all pairs in a dict instead of only the first pair.
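
The difference between the two strategies can be sketched in plain Python. This is only an illustration, not the PySpark implementation: the helper names (infer_scalar, merge_types, infer_map_type) and the string type labels are hypothetical, and the rule that conflicting types fall back to string is an assumption based on the outputs shown in this description.

```python
def infer_scalar(value):
    """Hypothetical, simplified stand-in for per-value type inference."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")


def merge_types(a, b):
    """Merge two inferred types (assumption: conflicts fall back to string)."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    return "string"


def infer_map_type(d, first_pair_only=False):
    """Return (key_type, value_type) for a dict.

    first_pair_only=True mimics the legacy behavior: use the first pair
    whose key and value are both non-null and ignore the rest.
    """
    if first_pair_only:
        for k, v in d.items():
            if k is not None and v is not None:
                return (infer_scalar(k), infer_scalar(v))
        return ("null", "null")

    key_type, value_type = "null", "null"
    for k, v in d.items():
        key_type = merge_types(key_type, infer_scalar(k))
        value_type = merge_types(value_type, infer_scalar(v))
    return (key_type, value_type)


inner = {"payment": 200.5, "name": "A"}
print(infer_map_type(inner))                        # ('string', 'string')
print(infer_map_type(inner, first_pair_only=True))  # ('string', 'double')
```

With the legacy strategy, the value type is fixed to double by the first pair, so the string value "A" no longer fits the inferred schema. That matches the None value and the ArrowInvalid error shown in the user-facing examples in this description.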

Why are the changes needed?

To have consistent behavior, e.g.,

>>> spark.createDataFrame([[1], [2], ["a"], ["c"]]).collect()
[Row(_1='1'), Row(_1='2'), Row(_1='a'), Row(_1='c')]

Does this PR introduce any user-facing change?

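Yes. See the examples below; the legacy behavior can be restored by setting spark.sql.pyspark.legacy.inferMapTypeFromFirstPair.enabled to true.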

Without Spark Connect:

>>> spark.createDataFrame([{"outer": {"payment": 200.5, "name": "A"}}]).collect()
[Row(outer={'name': 'A', 'payment': '200.5'})]
>>> spark.conf.set("spark.sql.pyspark.legacy.inferMapTypeFromFirstPair.enabled", True)
>>> spark.createDataFrame([{"outer": {"payment": 200.5, "name": "A"}}]).collect()
[Row(outer={'name': None, 'payment': 200.5})]

With Spark Connect:

>>> spark.createDataFrame([{"outer": {"payment": 200.5, "name": "A"}}]).collect()
[Row(outer={'payment': '200.5', 'name': 'A'})]
>>> spark.conf.set("spark.sql.pyspark.legacy.inferMapTypeFromFirstPair.enabled", True)
>>> spark.createDataFrame([{"outer": {"payment": 200.5, "name": "A"}}]).collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/connect/session.py", line 635, in createDataFrame
    _table = LocalDataToArrowConversion.convert(_data, _schema)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../spark/python/pyspark/sql/connect/conversion.py", line 378, in convert
    return pa.Table.from_arrays(pylist, schema=pa_schema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 3974, in pyarrow.lib.Table.from_arrays
  File "pyarrow/table.pxi", line 1464, in pyarrow.lib._sanitize_arrays
  File "pyarrow/array.pxi", line 373, in pyarrow.lib.asarray
  File "pyarrow/array.pxi", line 343, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'A' with type str: tried to convert to double

How was this patch tested?

Unit tests were added.

Was this patch authored or co-authored using generative AI tooling?

No.

xinrong-meng · 2024-05-14T17:57:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+      .internal()
+      .doc("PySpark's SparkSession.createDataFrame infers the key/value types of a map from all " +
+        "paris in the map by default. If this config is set to true, it restores the legacy " +
+        "behavior of only inferring the type from the first pair.")


nit: first non-null pair

xinrong-meng · 2024-05-14T18:02:57Z

LGTM, thank you!

HyukjinKwon · 2024-05-14T23:22:14Z

Merged to master.

Use all values in a python dict when inferring MapType schema

3f9f2f0

github-actions bot added SQL DOCS PYTHON CONNECT labels May 13, 2024

HyukjinKwon requested a review from ueshin May 13, 2024 02:33

HyukjinKwon changed the title ~~[SPARK-48247][PYTHON] Use all values in a python dict when inferring MapType schema~~ [SPARK-48247][PYTHON] Use all values in a dict when inferring MapType schema May 13, 2024

xinrong-meng reviewed May 14, 2024

xinrong-meng approved these changes May 14, 2024

Update sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLC…

f90763d

…onf.scala

HyukjinKwon closed this in 42c1c8f May 14, 2024
