
Support pyarrow #1058

Open
dhirschfeld opened this issue Mar 9, 2020 · 14 comments

@dhirschfeld

pyarrow is a high-performance (zero-copy) serialisation library designed for multi-language interop, with very fast conversion to/from pandas DataFrames.

It would be great if psycopg3 could also support returning pyarrow.Table objects as well as Python tuples.

@dhirschfeld
Author

turbodbc is an existing library that can return pyarrow.Table objects; however, it is built on top of the ODBC interface, and it specifically calls out the postgres implementation for not being performant.

In my own testing I've seen nearly an order of magnitude performance improvement when reading large datasets from SQL Server / Oracle into pandas DataFrames and I'd love to see the same benefits from a postgres native driver.

@dvarrazzo
Member

Hello, thank you for the use-case input.

I guess let's start with pandas dataframes first. I'm no pandas expert: what do people want to store — that is, a binary blob, or should the data frame become a table?

If you can provide some docs link that could be helpful.

In general, adding specific adapters or cursor types should be supported (as it already is in psycopg2). I think the biggest win here would be passing binary data to/from the db, but I'd first like an idea of what that data looks like.

@dvarrazzo added this to the psycopg3 milestone Mar 9, 2020
@dhirschfeld
Author

> what do people want to store: a binary blob or the data frame becomes a table?

The DataFrame (pyarrow.Table) would be inserted as rows into a postgres table (with a compatible schema) with the benefit being improved serialisation performance.

For me, a huge drawcard is the performance improvement when reading data from a database as you can construct the arrow.Table in C/C++ and pass it directly to Python without having to construct PyObjects for every value in the result set.

e.g. with turbodbc:

[turbodbc benchmark screenshot]

I guess ideally a pyarrow.Table would be constructed in C/C++ from the binary output of libpq and passed to Python. I'm not at all sure how postgres works internally, though, so it might be necessary to re-implement some of libpq's functionality. 🤔

Some more background on the performance improvements:

Documentation on the arrow columnar format and its C-data interface:

@dhirschfeld
Author

As mentioned, I'm just a layperson (with an interest in database query performance!)

You'll get more informed opinions/advice if @wesm or @xhochy were inclined to comment...

@xhochy

xhochy commented Mar 10, 2020

I have working code that uses COPY BINARY with psycopg2 to load data to/from Postgres from/to Arrow/Pandas. My plan is to open-source this in the next 1-3 months. You could maybe get better performance by integrating directly with libpq, but I wanted a working implementation which doesn't have a binary dependency on psycopg2.
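The COPY BINARY route described above can be sketched in pure Python. This is a minimal sketch under stated assumptions: in a real tool the payload would come from something like psycopg2's `cursor.copy_expert("COPY t TO STDOUT (FORMAT binary)", buf)`, and the decoded columns would then be handed to pyarrow; here the parser is exercised on a hand-built payload instead, and `parse_copy_binary` is an illustrative name, not an API from any of the libraries mentioned.

```python
import struct

SIGNATURE = b"PGCOPY\n\xff\r\n\x00"  # COPY BINARY file signature (11 bytes)

def parse_copy_binary(buf, n_cols):
    """Parse a PostgreSQL COPY BINARY payload into column-oriented lists.

    Returns n_cols lists of raw field bytes (None for NULL). Decoding each
    field (int8, float8, text, ...) is type-specific and left to the caller.
    """
    assert buf[:11] == SIGNATURE
    pos = 11
    flags, ext_len = struct.unpack_from(">II", buf, pos)
    pos += 8 + ext_len                      # skip flags + header extension
    columns = [[] for _ in range(n_cols)]
    while True:
        (field_count,) = struct.unpack_from(">h", buf, pos)
        pos += 2
        if field_count == -1:               # file trailer
            break
        for col in range(field_count):
            (flen,) = struct.unpack_from(">i", buf, pos)
            pos += 4
            if flen == -1:
                columns[col].append(None)   # NULL marker
            else:
                columns[col].append(buf[pos:pos + flen])
                pos += flen
    return columns

# Synthetic payload: one int8 (64-bit integer) column with values 1 and NULL.
payload = (
    SIGNATURE
    + struct.pack(">II", 0, 0)              # flags, no header extension
    + struct.pack(">hiq", 1, 8, 1)          # row: 1 field, 8 bytes, value 1
    + struct.pack(">hi", 1, -1)             # row: 1 field, NULL
    + struct.pack(">h", -1)                 # trailer
)
cols = parse_copy_binary(payload, 1)
ints = [None if v is None else struct.unpack(">q", v)[0] for v in cols[0]]
print(ints)  # [1, None]
```

Since the per-field layout is just big-endian lengths and raw bytes, the same loop ported to C/Cython can write straight into Arrow buffers without creating a Python object per value.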

@dhirschfeld
Author

dhirschfeld commented Mar 10, 2020

That's very exciting news @xhochy! If @dvarrazzo does decide to support pyarrow in psycopg3, it should hopefully provide some inspiration as to what is required.

@mangecoeur

Would love to see this! Did some experimentation over a year ago (https://github.com/mangecoeur/pgarrow - no guarantee that anything still works today, plus I'm rubbish at C/Cython) and there's at least a 3x performance improvement to be had. I think there is value in building this into psycopg3, since bulk load/read is a big use-case these days.

There is also https://github.com/heterodb/pg2arrow (in C) which converts between arrow and libpq (it was already proposed in this issue https://issues.apache.org/jira/browse/ARROW-1106?src=confmacro). Perhaps that could be adapted into psycopg with a nice python API?

@wesm

wesm commented May 10, 2020

FYI: the Arrow C data interface is a perfect match for a libpq bridge that doesn't require taking on any C++ library dependencies:

http://arrow.apache.org/blog/2020/05/03/introducing-arrow-c-data-interface/

Some other database engines are starting to look at this as a way to pass simple datasets to pyarrow at C call sites

@kylebarron

@xhochy did you end up open sourcing that code?

@bopindux

> I have working code that uses COPY BINARY with psycopg2 to load data from Postgres to/from Arrow/Pandas. My plan is to opensource this in the next 1-3 months.

Hi @xhochy, I would be interested in your code as well. Any news on that?

@char101

char101 commented Dec 14, 2022

Current methods to get a pandas/polars/arrow dataframe from a postgresql query result:

  1. conventional method: pandas read_sql/read_sql_query or create pandas dataframe from dict from psycopg query result
  2. connectorx
  3. duckdb via postgres_scanner

Problems with the typical conventional approach:

  1. row-oriented results which need to be stored into a column-oriented memory store
  2. conversion from C bytes to Python objects and back to bytes, for each column in each row

Where psycopg could greatly help in this case is in skipping the C-value-to-Python-object conversion. The basic approach would be:

  1. create a static buffer (by first querying the result length) or an expandable buffer for each column
  2. for each column value in each row, write the value into that column's buffer
  3. cast the buffers into a pandas dataframe/arrow table data structure
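The three steps above can be sketched with stdlib typed buffers. This is a pure-Python sketch of the data layout only (with an illustrative function name); the point of the proposal is that step 2 would run in C inside psycopg, directly from libpq's results, without materializing a Python object per value:

```python
from array import array

def rows_to_columns(rows, typecodes):
    # 1. an expandable typed buffer per column ("q" = int64, "d" = float64)
    bufs = [array(t) for t in typecodes]
    # 2. write each value into its column's buffer
    for row in rows:
        for buf, val in zip(bufs, row):
            buf.append(val)
    # 3. each buffer exposes contiguous memory that numpy/pyarrow can wrap
    #    without copying, e.g. numpy.frombuffer(buf, dtype="int64")
    return bufs

cols = rows_to_columns([(1, 2.5), (2, 3.5)], "qd")
print([list(c) for c in cols])  # [[1, 2], [2.5, 3.5]]
```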

@SamuelMarks
Contributor

FYI: I'm working on a Parquet -> SQLAlchemy tool that also uses pyarrow to batch-insert data with COPY FROM: https://github.com/offscale/cdd-python-gae
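For that COPY FROM path, the batch insert boils down to streaming rows in COPY's wire format. A minimal sketch using the text format follows; the function name is illustrative, and real string values would additionally need backslash, tab, and newline escaping per the COPY documentation:

```python
import io

def rows_to_copy_text(rows):
    """Serialize rows into PostgreSQL COPY text format: tab-delimited
    fields, \\N for NULL, one line per row. The resulting buffer can be
    fed to cursor.copy_expert("COPY t FROM STDIN", buf) in psycopg2."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(r"\N" if v is None else str(v) for v in row))
        buf.write("\n")
    buf.seek(0)
    return buf

buf = rows_to_copy_text([(1, "alice"), (2, None)])
print(repr(buf.getvalue()))  # '1\talice\n2\t\\N\n'
```

A single COPY round-trip like this avoids per-row INSERT statements, which is where most of the bulk-load speedup comes from even before switching to the binary format.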

@judahrand

The Postgres ADBC driver might also be useful in this space.

@SamuelMarks
Contributor

Looks nice. I'll wait for it to have JSON, timezoned datetime, and array support before switching, though. In the meantime I'm using my own repo, which under the hood uses PostgreSQL's binary protocol by way of https://github.com/altaurog/pgcopy
