
Filters not being pushed down to pyarrow dataset #16248

Open
adriangb opened this issue May 15, 2024 · 11 comments
Labels
bug (Something isn't working) · needs triage (Awaiting prioritization by a maintainer) · python (Related to Python Polars)

Comments

adriangb commented May 15, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from datetime import datetime, timezone
import shutil

import datafusion
import polars
import pyarrow
import pyarrow.compute
import pyarrow.dataset
import pyarrow.fs


shutil.rmtree('data', ignore_errors=True)  # proof that there's no data in the data directory

format = pyarrow.dataset.ParquetFileFormat()
filesystem = pyarrow.fs.SubTreeFileSystem('data', pyarrow.fs.LocalFileSystem())
fragments = [
    # note that the partition_expression is totally wrong
    format.make_fragment(
        '1.parquet',
        filesystem=filesystem,
        partition_expression=(pyarrow.dataset.field('a') <= pyarrow.compute.scalar(1)),
    )
]

dataset = pyarrow.dataset.FileSystemDataset(
    fragments, pyarrow.schema([pyarrow.field('a', pyarrow.int64())]), format, filesystem
)

fragments = list(dataset.get_fragments(pyarrow.dataset.field('a') == pyarrow.scalar(2)))
assert fragments == []
# using polars as the query engine but should be the same for any other engine
df = polars.scan_pyarrow_dataset(dataset)
df = df.filter(polars.col('a') == 2)
assert df.collect().shape[0] == 0

ctx = datafusion.SessionContext()
ctx.register_dataset('dataset', dataset)

df = ctx.sql('SELECT * FROM dataset WHERE a = 2')
assert df.collect() == [], fragments


format = pyarrow.dataset.ParquetFileFormat()
filesystem = pyarrow.fs.SubTreeFileSystem('data', pyarrow.fs.LocalFileSystem())
fragments = [
    # note that the partition_expression is totally wrong
    format.make_fragment(
        '1.parquet',
        filesystem=filesystem,
        partition_expression=(
            pyarrow.dataset.field('a') == pyarrow.scalar(datetime(2000, 1, 1, tzinfo=timezone.utc), pyarrow.timestamp('ns', '+00:00'))
        ),
    )
]

dataset = pyarrow.dataset.FileSystemDataset(
    fragments, pyarrow.schema([pyarrow.field('a', pyarrow.timestamp('ns', '+00:00'))]), format, filesystem
)

fragments = list(
    dataset.get_fragments(
        pyarrow.dataset.field('a')
        == pyarrow.scalar(datetime(2024, 1, 1, tzinfo=timezone.utc), pyarrow.timestamp('ns', '+00:00'))
    )
)
assert fragments == [], fragments

df = polars.scan_pyarrow_dataset(dataset)
df = df.filter(polars.col('a') == datetime(2024, 1, 1, tzinfo=timezone.utc))
print(df.explain())
assert df.collect().shape[0] == 0

ctx = datafusion.SessionContext()
ctx.register_dataset('dataset', dataset)

df = ctx.sql("SELECT * FROM dataset WHERE a = '2024-01-01T00:00:00+00:00'")
assert df.collect() == []
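As a side note, one way to see whether a filter actually reaches the dataset, without reading query plans, is a small duck-typed probe. This is a hypothetical debugging helper, not part of the original report, and it assumes the engine eventually calls `to_table(filter=...)` on the dataset; engines that go through `scanner()` or `to_batches()` would need the corresponding methods intercepted as well:

```python
# Hypothetical debugging helper: wraps a dataset and records which
# filter, if any, a query engine pushes into it.
class FilterProbe:
    def __init__(self, dataset):
        self._dataset = dataset
        self.pushed_filters = []

    def __getattr__(self, name):
        # Delegate everything else (schema, get_fragments, ...) untouched.
        return getattr(self._dataset, name)

    def to_table(self, *args, filter=None, **kwargs):
        self.pushed_filters.append(filter)  # record the pushed-down filter
        return self._dataset.to_table(*args, filter=filter, **kwargs)
```

For example, scanning `FilterProbe(dataset)` instead of `dataset` and then collecting would leave the pushed predicate (or `None`) in `probe.pushed_filters`.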

Log output

No response

Issue description

The integer filter is pushed down correctly but the datetime/timestamp filter is not (polars attempts to read the file).

Expected behavior

The datetime filter is pushed down.
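The asymmetry between the two cases can be sketched with a toy expression translator (all names here are illustrative, not polars internals): a plain comparison translates cleanly, but once polars rewrites the datetime comparison with `Cast` nodes on both sides, a translator with no `Cast` arm has to give up and no filter is pushed down:

```python
from dataclasses import dataclass

# Hypothetical, simplified expression AST mirroring the two shapes:
# a bare comparison (the integer case) and the same comparison with
# Cast nodes wrapped around both sides (the datetime case).

@dataclass
class Col:
    name: str

@dataclass
class Lit:
    value: object

@dataclass
class Cast:
    inner: object
    dtype: str

@dataclass
class Eq:
    left: object
    right: object

def to_pyarrow_filter(expr):
    """Translate to a pyarrow dataset filter string, or return None
    (meaning: no pushdown) on the first unsupported node."""
    if isinstance(expr, Col):
        return f"pa.dataset.field({expr.name!r})"
    if isinstance(expr, Lit):
        return repr(expr.value)
    if isinstance(expr, Eq):
        left = to_pyarrow_filter(expr.left)
        right = to_pyarrow_filter(expr.right)
        if left is None or right is None:
            return None
        return f"({left} == {right})"
    return None  # Cast (and anything else unhandled) aborts the translation

# The integer predicate translates...
assert to_pyarrow_filter(Eq(Col("a"), Lit(2))) is not None
# ...but the Cast-wrapped datetime predicate does not.
wrapped = Eq(Cast(Col("a"), "datetime[us, UTC]"),
             Cast(Lit("2024-01-01"), "datetime[us, UTC]"))
assert to_pyarrow_filter(wrapped) is None
```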

Installed versions

--------Version info---------
Polars:               0.20.26
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.12.3 (main, Apr  9 2024, 08:09:14) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            0.17.4
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              16.0.0
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

adriangb added the bug, needs triage, and python labels on May 15, 2024
adriangb (Author)

Looking at the source code, it seems like this should be handled:

AnyValue::Datetime(v, tu, tz) => Some(to_py_datetime(v, &tu, tz.as_ref())),

I'm not familiar enough with the polars codebase to diagnose further.

ion-elgreco (Contributor)

What do you see when you do df.explain() on the polars plan?

adriangb (Author) commented May 18, 2024

FILTER [(col("a").cast(Datetime(Microseconds, Some("UTC")))) == (2024-01-01 00:00:00.dt.replace_time_zone([String(earliest)]))] FROM

  PYTHON SCAN 
  PROJECT */1 COLUMNS

(By the way, this is a self-contained example with no external data; you can run it and play with it.)

ion-elgreco (Contributor) commented May 18, 2024

Maybe the feature flags are just missing; I can't see them in the Python Cargo.toml, nor can I see polars-plan/dtype-datetime in the polars crate's Cargo.toml.

adriangb (Author)

Isn’t this it?

"dtype-full",

ion-elgreco (Contributor)

I only see it through here:

"polars-plan/temporal",
"polars-expr/temporal",

adriangb (Author)

I guess the easiest thing to do is to compile with a panic in there and see if it's even being hit.

ion-elgreco (Contributor)

@adriangb I found a couple of issues:

  • filter(expr, expr) is not pushed down, but filter(expr & expr) is
  • filter(pl.col('foo') == pl.date(1970, 1, 1)) is not pushed down because it hits a Cast on the RHS
  • filter(pl.col('foo') == datetime.date) works
  • filter(pl.col('foo') == datetime.datetime) doesn't work
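The first bullet can be sketched with a hypothetical conjunction splitter (all names below are illustrative, not polars internals): a translator that flattens and-ed conjuncts, pushes the ones it can express, and keeps the rest as a residual in-engine filter would treat both spellings uniformly instead of refusing pushdown outright:

```python
from dataclasses import dataclass

@dataclass
class And:
    left: object
    right: object

@dataclass
class Pred:
    text: str
    pushable: bool  # whether the translator can express this in pyarrow

def split_conjunction(expr):
    """Flatten nested And nodes into a flat list of conjuncts."""
    if isinstance(expr, And):
        return split_conjunction(expr.left) + split_conjunction(expr.right)
    return [expr]

def plan_pushdown(expr):
    """Push what we can; everything else stays as an in-engine filter."""
    conjuncts = split_conjunction(expr)
    pushed = [c.text for c in conjuncts if c.pushable]
    residual = [c.text for c in conjuncts if not c.pushable]
    return pushed, residual

pushed, residual = plan_pushdown(
    And(Pred("a == 2", True), Pred("b.cast(us, UTC) == ts", False))
)
assert pushed == ["a == 2"]
assert residual == ["b.cast(us, UTC) == ts"]
```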

ion-elgreco (Contributor)

datetime fails because the LHS is hitting a Cast as well.

adriangb (Author)

Thank you for the investigation @ion-elgreco! Would you like to work on this or should I tag previous committers of crates/polars-plan/src/logical_plan/pyarrow.rs?

ion-elgreco (Contributor)

@adriangb I started on it but didn't have time to finish, and won't have time soon, so feel free to pick it up from there:

#16500

The cast arm still needs to be wrapped in a .cast(pyarrow.dtype)
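As a sketch of what that wrapping could look like (the function name and the dtype mapping below are assumptions for illustration, not the actual pyarrow.rs code): if the translator emits pyarrow expression strings, the Cast arm would translate its child and wrap the result in `.cast(<pyarrow type>)`, refusing pushdown when it has no mapping for the dtype:

```python
# Hypothetical dtype mapping from a polars dtype's debug string to the
# pyarrow type expression to emit; incomplete on purpose.
DTYPE_TO_PYARROW = {
    "Int64": "pyarrow.int64()",
    'Datetime(Nanoseconds, Some("UTC"))': "pyarrow.timestamp('ns', tz='UTC')",
}

def translate_cast(child_expr: str, polars_dtype: str):
    """Wrap an already-translated child expression in a pyarrow-side cast."""
    pa_type = DTYPE_TO_PYARROW.get(polars_dtype)
    if pa_type is None:
        return None  # unknown dtype: refuse pushdown rather than guess
    return f"({child_expr}).cast({pa_type})"

# A cast like the one in the reported plan becomes a pyarrow-side cast
# instead of aborting the translation.
assert (translate_cast("pa.dataset.field('a')", "Int64")
        == "(pa.dataset.field('a')).cast(pyarrow.int64())")
```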
