fix: infer narrowest numeric type when combining numeric columns #602

Merged
merged 11 commits into main from no_unpivot_node on Apr 17, 2024

Conversation

TrevorBergeron
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕


@TrevorBergeron TrevorBergeron requested review from a team as code owners April 10, 2024 16:37
@TrevorBergeron TrevorBergeron requested a review from tswast April 10, 2024 16:37
@product-auto-label product-auto-label bot added size: l Pull request size is large. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Apr 10, 2024
TrevorBergeron and others added 6 commits April 10, 2024 16:45

Collaborator

@tswast tswast left a comment

Thanks! A few questions though.

Also, since this makes some user-visible changes, refactor: isn't the right prefix, as commits with that prefix are hidden from the release notes. Let's do fix: retain integer column type when aggregating over integer columns, or similar.

@@ -354,9 +354,6 @@ def unpivot(
*,
passthrough_columns: typing.Sequence[str] = (),
index_col_ids: typing.Sequence[str] = ["index"],
dtype: typing.Union[
Collaborator

I recall this was added for matrix multiplication, is that right? Can you help me understand what this was for and why it's now safe to remove it?

Contributor Author

This was added so we could mix compatible numerics together in a single df.sum() or similar aggregation. You would manually pass the float dtype here and it would cast everything to that type. However, this isn't necessary as we can find the common supertype of the inputs and coerce to that automatically.
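For intuition, the user-visible effect matches ordinary pandas behavior (plain pandas shown here for illustration; this is not BigQuery DataFrames code): an all-integer aggregation stays integer, and the result only widens to float when an input actually requires it.

```python
import pandas as pd

pd.DataFrame({"a": [1, 2], "b": [3, 4]}).sum().dtype      # int64 (stays integer)
pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]}).sum().dtype  # float64 (common supertype)
```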

@@ -372,15 +369,62 @@ def unpivot(
Returns:
ArrayValue: The unpivoted ArrayValue
"""
# 1. construct array_value from row_labels, with offsets
Collaborator

I'm having a bit of trouble understanding this function. Any chance we could break it up into at least 3 private functions based on these three steps?

Collaborator

For example, I see below that this is taking the list of row labels and creating a local array. I think that was the part I was missing: the fact that the row labels are data we have locally.

Contributor Author

Factored out 2 private functions. And yes, we have the row labels locally here; usually these are the former column names that are being stacked.
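For readers following along, a rough sketch of what step 1 amounts to (illustrative only; `row_label`/`offsets` are placeholder names, not necessarily the ones used in the PR): the labels are plain local values, so they can be materialized as a small pyarrow table with an explicit offset column.

```python
import pyarrow as pa

row_labels = ["col_a", "col_b", "col_c"]  # the former column names being stacked
rows = [{"row_label": label} for label in row_labels]

labels_table = pa.Table.from_pylist(rows)
labels_table = labels_table.append_column(
    "offsets", pa.array(range(len(row_labels)), type=pa.int64())
)
# labels_table now has one row per label plus a 0..n-1 offset column.
```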

pa.Table.from_pylist(rows), session=self.session
).promote_offsets(unpivot_offset_id)

# 2. cross join labels_array with main table
Collaborator

Can you help me understand why this join is necessary -- ideally in those docstring comments when we split into functions. I'm guessing it's because we are decreasing the number of columns and increasing the number of rows, so the number of cells (rows X columns) should remain about the same?

else:
joined_array = labels_array.join(self, join_def=join)

# 3. Build output columns by switching between input columns
Collaborator

It might help to include the fact that we are going from many columns and fewer rows -> fewer columns and more rows here too.
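For illustration, the shape change described here is the usual wide-to-long unpivot (plain pandas shown; not the PR's code): rows multiply by the number of stacked columns while the value columns collapse, so rows x columns stays roughly constant, which is also why the cross join with the small labels table does not blow up the data size.

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "a": [10, 20], "b": [30, 40]})
long = wide.melt(id_vars="id", var_name="label", value_name="value")

wide.shape  # (2, 3): 2 rows, 2 value columns (+ id)
long.shape  # (4, 3): 4 rows, 1 value column (+ id, label)
```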

@@ -1346,6 +1365,25 @@ def clip_op(
)


@scalar_op_compiler.register_nary_op(ops.switch_op)
def switch_op(*cases_and_outputs: ibis_types.Value) -> ibis_types.Value:
# ibis can handle most type coercions, but we need to force bool -> int
Collaborator

I wonder if it's worth leaving them an issue to allow this implicit cast?

Contributor Author

Maybe? Even pandas isn't really consistent on whether bools are numeric or not.
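For illustration (these are standard pandas behaviors, not code from this PR), the inconsistency looks like this:

```python
import pandas as pd

s = pd.Series([True, False, True])

# Treated as numeric for reductions and for dtype checks...
s.sum()                                 # 2
pd.api.types.is_numeric_dtype(s.dtype)  # True

# ...but excluded when selecting "number" columns.
df = pd.DataFrame({"flag": s, "x": [1, 2, 3]})
list(df.select_dtypes(include="number").columns)  # ['x']
```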

@@ -148,6 +145,21 @@ def order_preserving(self) -> bool:
return False


@dataclasses.dataclass(frozen=True)
class NaryOp:
Collaborator

Thoughts on making this a superclass to Unary, Binary, Ternary?

Contributor Author

makes sense, done
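A minimal sketch of that arrangement (illustrative only; aside from the `output_type(*input_types)` shape visible in the diff below, the names and bodies here are assumptions rather than the repo's actual code):

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class NaryOp:
    """Operation over an arbitrary number of input columns."""

    def output_type(self, *input_types):
        raise NotImplementedError("abstract op defines no output type")


@dataclasses.dataclass(frozen=True)
class UnaryOp(NaryOp):
    """Fixed-arity specialization: exactly one input."""

    def output_type(self, *input_types):
        (input_type,) = input_types  # enforce arity of 1
        return input_type


@dataclasses.dataclass(frozen=True)
class BinaryOp(NaryOp):
    """Fixed-arity specialization: exactly two inputs."""
    # ...and similarly for TernaryOp
```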

@@ -664,6 +676,46 @@ def output_type(self, *input_types: dtypes.ExpressionType) -> dtypes.ExpressionT

clip_op = ClipOp()


class SwitchOp(NaryOp):
Collaborator

Contributor Author

renamed
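As context for the PR title, a toy sketch of "narrowest common supertype" selection for a case/switch-style op's output type (the dtype names and ordering here are assumptions for illustration, not the repo's actual type lattice):

```python
# narrow -> wide ordering over an assumed numeric ladder
_NUMERIC_ORDER = ["boolean", "Int64", "Float64"]


def narrowest_common_supertype(output_dtypes: list[str]) -> str:
    """Pick the narrowest dtype that can represent every case output."""
    return max(output_dtypes, key=_NUMERIC_ORDER.index)


narrowest_common_supertype(["Int64", "Int64"])    # 'Int64' (no forced float cast)
narrowest_common_supertype(["Int64", "Float64"])  # 'Float64' (widen only when needed)
```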

@TrevorBergeron TrevorBergeron requested a review from tswast April 16, 2024 16:15
@TrevorBergeron TrevorBergeron changed the title refactor: Remove unpivot node fix: retain integer column type when aggregating over integer columns Apr 16, 2024
@TrevorBergeron TrevorBergeron changed the title fix: retain integer column type when aggregating over integer columns fix: infer narrowest numeric type when combining numeric columns Apr 16, 2024
@TrevorBergeron TrevorBergeron enabled auto-merge (squash) April 17, 2024 21:39
@TrevorBergeron TrevorBergeron merged commit 8f9ece6 into main Apr 17, 2024
15 of 16 checks passed
@TrevorBergeron TrevorBergeron deleted the no_unpivot_node branch April 17, 2024 22:48
gcf-merge-on-green bot pushed a commit that referenced this pull request Apr 22, 2024

🤖 I have created a release *beep* *boop*
---


## [1.3.0](https://togithub.com/googleapis/python-bigquery-dataframes/compare/v1.2.0...v1.3.0) (2024-04-22)


### Features

* Add `Series.struct.dtypes` property ([#599](https://togithub.com/googleapis/python-bigquery-dataframes/issues/599)) ([d924ec2](https://togithub.com/googleapis/python-bigquery-dataframes/commit/d924ec2937c158644b5d1bbae4f82476de2c1655))
* Add fine tuning `fit()` for Palm2TextGenerator ([#616](https://togithub.com/googleapis/python-bigquery-dataframes/issues/616)) ([9c106bd](https://togithub.com/googleapis/python-bigquery-dataframes/commit/9c106bd24482620ef5ff3c85f94be9da76c49716))
* Add quantile statistic ([#613](https://togithub.com/googleapis/python-bigquery-dataframes/issues/613)) ([bc82804](https://togithub.com/googleapis/python-bigquery-dataframes/commit/bc82804da43c03c2311cd56f47a2316d3aae93d2))
* Expose `max_batching_rows` in `remote_function` ([#622](https://togithub.com/googleapis/python-bigquery-dataframes/issues/622)) ([240a1ac](https://togithub.com/googleapis/python-bigquery-dataframes/commit/240a1ac6fa914550bb6216cd5d179a36009f2657))
* Support primary key(s) in `read_gbq` by using as the `index_col` by default ([#625](https://togithub.com/googleapis/python-bigquery-dataframes/issues/625)) ([75bb240](https://togithub.com/googleapis/python-bigquery-dataframes/commit/75bb2409532e80de742030d05ffcbacacf5ffba2))
* Warn if location is set to unknown location ([#609](https://togithub.com/googleapis/python-bigquery-dataframes/issues/609)) ([3706b4f](https://togithub.com/googleapis/python-bigquery-dataframes/commit/3706b4f9dde65788b5e6343a6428fb1866499461))


### Bug Fixes

* Address technical writers fb ([#611](https://togithub.com/googleapis/python-bigquery-dataframes/issues/611)) ([9f8f181](https://togithub.com/googleapis/python-bigquery-dataframes/commit/9f8f181279133abdb7da3aa045df6fa278587013))
* Infer narrowest numeric type when combining numeric columns ([#602](https://togithub.com/googleapis/python-bigquery-dataframes/issues/602)) ([8f9ece6](https://togithub.com/googleapis/python-bigquery-dataframes/commit/8f9ece6d13f57f02d677bf0e3fea97dea94ae240))
* Use exact median implementation by default ([#619](https://togithub.com/googleapis/python-bigquery-dataframes/issues/619)) ([9d205ae](https://togithub.com/googleapis/python-bigquery-dataframes/commit/9d205aecb77f35baeec82a8f6e1b72c2d852ca46))


### Documentation

* Fix rendering of examples for multiple apis ([#620](https://togithub.com/googleapis/python-bigquery-dataframes/issues/620)) ([9665e39](https://togithub.com/googleapis/python-bigquery-dataframes/commit/9665e39ef288841f03a9d823bd2210ef58394ad3))
* Set `index_cols` in `read_gbq` as a best practice ([#624](https://togithub.com/googleapis/python-bigquery-dataframes/issues/624)) ([70015b7](https://togithub.com/googleapis/python-bigquery-dataframes/commit/70015b79e8cff16ff1b36c5e3f019fe099750a9d))

---
This PR was generated with [Release Please](https://togithub.com/googleapis/release-please). See [documentation](https://togithub.com/googleapis/release-please#release-please).