fix: infer narrowest numeric type when combining numeric columns #602

Merged
merged 11 commits into main from no_unpivot_node on Apr 17, 2024

Conversation

TrevorBergeron
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕


@TrevorBergeron TrevorBergeron requested review from a team as code owners April 10, 2024 16:37
@TrevorBergeron TrevorBergeron requested a review from tswast April 10, 2024 16:37
@product-auto-label product-auto-label bot added size: l Pull request size is large. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Apr 10, 2024
TrevorBergeron and others added 6 commits April 10, 2024 16:45

Collaborator

@tswast tswast left a comment

Thanks! A few questions though.

Also, since this makes some user-visible changes, refactor: isn't the right prefix, as commits with that prefix are hidden from the release notes. Let's do fix: retain integer column type when aggregating over integer columns, or similar.

@@ -354,9 +354,6 @@ def unpivot(
*,
passthrough_columns: typing.Sequence[str] = (),
index_col_ids: typing.Sequence[str] = ["index"],
dtype: typing.Union[
Collaborator

I recall this was added for matrix multiplication, is that right? Can you help me understand what this was for and why it's now safe to remove it?

Contributor Author

This was added so we could mix compatible numerics together in a single df.sum() or similar aggregation. You would manually pass the float dtype here and it would cast everything to that type. However, this isn't necessary as we can find the common supertype of the inputs and coerce to that automatically.
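For intuition, the user-visible effect matches ordinary pandas behavior (plain pandas shown here for illustration; this is not BigQuery DataFrames code): an all-integer aggregation stays integer, and the result only widens to float when an input actually requires it.

```python
import pandas as pd

pd.DataFrame({"a": [1, 2], "b": [3, 4]}).sum().dtype      # int64 (stays integer)
pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]}).sum().dtype  # float64 (common supertype)
```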

@@ -372,15 +369,62 @@ def unpivot(
Returns:
ArrayValue: The unpivoted ArrayValue
"""
# 1. construct array_value from row_labels, with offsets
Collaborator

I'm having a bit of trouble understanding this function. Any chance we could break it up into at least 3 private functions based on these three steps?

Collaborator

For example, I see below that this is taking the list of row labels and creating a local array. I think that was the part I was missing: the fact that the row labels are data we have locally.

Contributor Author

Factored out 2 private functions. And yes, we have the row labels locally here; usually these are the former column names that are being stacked.
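For readers following along, a rough sketch of what step 1 amounts to (illustrative only; `row_label`/`offsets` are placeholder names, not necessarily the ones used in the PR): the labels are plain local values, so they can be materialized as a small pyarrow table with an explicit offset column.

```python
import pyarrow as pa

row_labels = ["col_a", "col_b", "col_c"]  # the former column names being stacked
rows = [{"row_label": label} for label in row_labels]

labels_table = pa.Table.from_pylist(rows)
labels_table = labels_table.append_column(
    "offsets", pa.array(range(len(row_labels)), type=pa.int64())
)
# labels_table now has one row per label plus a 0..n-1 offset column.
```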

pa.Table.from_pylist(rows), session=self.session
).promote_offsets(unpivot_offset_id)

# 2. cross join labels_array with main table
Collaborator

Can you help me understand why this join is necessary -- ideally in those docstring comments when we split into functions. I'm guessing it's because we are decreasing the number of columns and increasing the number of rows, so the number of cells (rows X columns) should remain about the same?

else:
joined_array = labels_array.join(self, join_def=join)

# 3. Build output columns by switching between input columns
Collaborator

It might help to include the fact that we are going from many columns and fewer rows -> fewer columns and more rows here too.
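For illustration, the shape change described here is the usual wide-to-long unpivot (plain pandas shown; not the PR's code): rows multiply by the number of stacked columns while the value columns collapse, so rows x columns stays roughly constant, which is also why the cross join with the small labels table does not blow up the data size.

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "a": [10, 20], "b": [30, 40]})
long = wide.melt(id_vars="id", var_name="label", value_name="value")

wide.shape  # (2, 3): 2 rows, 2 value columns (+ id)
long.shape  # (4, 3): 4 rows, 1 value column (+ id, label)
```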

@@ -1346,6 +1365,25 @@ def clip_op(
)


@scalar_op_compiler.register_nary_op(ops.switch_op)
def switch_op(*cases_and_outputs: ibis_types.Value) -> ibis_types.Value:
# ibis can handle most type coercions, but we need to force bool -> int
Collaborator

I wonder if it's worth leaving them an issue to allow this implicit cast?

Contributor Author

Maybe? Even pandas isn't really consistent on whether bools are numeric or not.
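For illustration (these are standard pandas behaviors, not code from this PR), the inconsistency looks like this:

```python
import pandas as pd

s = pd.Series([True, False, True])

# Treated as numeric for reductions and for dtype checks...
s.sum()                                 # 2
pd.api.types.is_numeric_dtype(s.dtype)  # True

# ...but excluded when selecting "number" columns.
df = pd.DataFrame({"flag": s, "x": [1, 2, 3]})
list(df.select_dtypes(include="number").columns)  # ['x']
```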

@@ -148,6 +145,21 @@ def order_preserving(self) -> bool:
return False


@dataclasses.dataclass(frozen=True)
class NaryOp:
Collaborator

Thoughts on making this a superclass to Unary, Binary, Ternary?

Contributor Author

makes sense, done
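A minimal sketch of that arrangement (illustrative only; aside from the `output_type(*input_types)` shape visible in the diff below, the names and bodies here are assumptions rather than the repo's actual code):

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class NaryOp:
    """Operation over an arbitrary number of input columns."""

    def output_type(self, *input_types):
        raise NotImplementedError("abstract op defines no output type")


@dataclasses.dataclass(frozen=True)
class UnaryOp(NaryOp):
    """Fixed-arity specialization: exactly one input."""

    def output_type(self, *input_types):
        (input_type,) = input_types  # enforce arity of 1
        return input_type


@dataclasses.dataclass(frozen=True)
class BinaryOp(NaryOp):
    """Fixed-arity specialization: exactly two inputs."""
    # ...and similarly for TernaryOp
```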

@@ -664,6 +676,46 @@ def output_type(self, *input_types: dtypes.ExpressionType) -> dtypes.ExpressionT

clip_op = ClipOp()


class SwitchOp(NaryOp):
Collaborator

Contributor Author

renamed
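As context for the PR title, a toy sketch of "narrowest common supertype" selection for a case/switch-style op's output type (the dtype names and ordering here are assumptions for illustration, not the repo's actual type lattice):

```python
# narrow -> wide ordering over an assumed numeric ladder
_NUMERIC_ORDER = ["boolean", "Int64", "Float64"]


def narrowest_common_supertype(output_dtypes: list[str]) -> str:
    """Pick the narrowest dtype that can represent every case output."""
    return max(output_dtypes, key=_NUMERIC_ORDER.index)


narrowest_common_supertype(["Int64", "Int64"])    # 'Int64' (no forced float cast)
narrowest_common_supertype(["Int64", "Float64"])  # 'Float64' (widen only when needed)
```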

@TrevorBergeron TrevorBergeron requested a review from tswast April 16, 2024 16:15
@TrevorBergeron TrevorBergeron changed the title refactor: Remove unpivot node fix: retain integer column type when aggregating over integer columns Apr 16, 2024
@TrevorBergeron TrevorBergeron changed the title fix: retain integer column type when aggregating over integer columns fix: infer narrowest numeric type when combining numeric columns Apr 16, 2024
@TrevorBergeron TrevorBergeron enabled auto-merge (squash) April 17, 2024 21:39
@TrevorBergeron TrevorBergeron merged commit 8f9ece6 into main Apr 17, 2024
15 of 16 checks passed
@TrevorBergeron TrevorBergeron deleted the no_unpivot_node branch April 17, 2024 22:48
gcf-merge-on-green bot pushed a commit that referenced this pull request Apr 22, 2024

🤖 I have created a release *beep* *boop*
---


## [1.3.0](https://togithub.com/googleapis/python-bigquery-dataframes/compare/v1.2.0...v1.3.0) (2024-04-22)


### Features

* Add `Series.struct.dtypes` property ([#599](https://togithub.com/googleapis/python-bigquery-dataframes/issues/599)) ([d924ec2](https://togithub.com/googleapis/python-bigquery-dataframes/commit/d924ec2937c158644b5d1bbae4f82476de2c1655))
* Add fine tuning `fit()` for Palm2TextGenerator ([#616](https://togithub.com/googleapis/python-bigquery-dataframes/issues/616)) ([9c106bd](https://togithub.com/googleapis/python-bigquery-dataframes/commit/9c106bd24482620ef5ff3c85f94be9da76c49716))
* Add quantile statistic ([#613](https://togithub.com/googleapis/python-bigquery-dataframes/issues/613)) ([bc82804](https://togithub.com/googleapis/python-bigquery-dataframes/commit/bc82804da43c03c2311cd56f47a2316d3aae93d2))
* Expose `max_batching_rows` in `remote_function` ([#622](https://togithub.com/googleapis/python-bigquery-dataframes/issues/622)) ([240a1ac](https://togithub.com/googleapis/python-bigquery-dataframes/commit/240a1ac6fa914550bb6216cd5d179a36009f2657))
* Support primary key(s) in `read_gbq` by using as the `index_col` by default ([#625](https://togithub.com/googleapis/python-bigquery-dataframes/issues/625)) ([75bb240](https://togithub.com/googleapis/python-bigquery-dataframes/commit/75bb2409532e80de742030d05ffcbacacf5ffba2))
* Warn if location is set to unknown location ([#609](https://togithub.com/googleapis/python-bigquery-dataframes/issues/609)) ([3706b4f](https://togithub.com/googleapis/python-bigquery-dataframes/commit/3706b4f9dde65788b5e6343a6428fb1866499461))


### Bug Fixes

* Address technical writers fb ([#611](https://togithub.com/googleapis/python-bigquery-dataframes/issues/611)) ([9f8f181](https://togithub.com/googleapis/python-bigquery-dataframes/commit/9f8f181279133abdb7da3aa045df6fa278587013))
* Infer narrowest numeric type when combining numeric columns ([#602](https://togithub.com/googleapis/python-bigquery-dataframes/issues/602)) ([8f9ece6](https://togithub.com/googleapis/python-bigquery-dataframes/commit/8f9ece6d13f57f02d677bf0e3fea97dea94ae240))
* Use exact median implementation by default ([#619](https://togithub.com/googleapis/python-bigquery-dataframes/issues/619)) ([9d205ae](https://togithub.com/googleapis/python-bigquery-dataframes/commit/9d205aecb77f35baeec82a8f6e1b72c2d852ca46))


### Documentation

* Fix rendering of examples for multiple apis ([#620](https://togithub.com/googleapis/python-bigquery-dataframes/issues/620)) ([9665e39](https://togithub.com/googleapis/python-bigquery-dataframes/commit/9665e39ef288841f03a9d823bd2210ef58394ad3))
* Set `index_cols` in `read_gbq` as a best practice ([#624](https://togithub.com/googleapis/python-bigquery-dataframes/issues/624)) ([70015b7](https://togithub.com/googleapis/python-bigquery-dataframes/commit/70015b79e8cff16ff1b36c5e3f019fe099750a9d))

---
This PR was generated with [Release Please](https://togithub.com/googleapis/release-please). See [documentation](https://togithub.com/googleapis/release-please#release-please).