Inefficient statement-level batching #527

Open · jrudolph opened this issue Jul 5, 2022 · 15 comments
Labels: status: waiting-for-triage (An issue we've not yet triaged)

@jrudolph commented Jul 5, 2022

Executing batches via Statement.add() and bind(…) is significantly slower than using Connection.createBatch(), even though using bindings should be faster than recreating individual textual SQL statements and sending them over the connection.

As far as I could tell from debugging with Wireshark, Connection.createBatch will send all the (mostly duplicated) SQL statements back-to-back in a single packet. Statement batches, on the other hand, do not send all of the bind/execute commands in one go; instead, each statement is run one by one. Compare this to JDBC statement-level batching, where all the bind and execute commands are also sent back-to-back in one go. For this reason (though not only this one), batched JDBC statements are also quite a bit faster than r2dbc.

(In akka/akka-persistence-r2dbc@2ea4121 we were considering actually switching to connection-level batching, but issuing statements without bind, i.e. creating individual textual SQL statements, seems too big a risk / anti-pattern to use as a workaround for this issue in the driver.)
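
For reference, a minimal sketch of the two batching styles being compared, written against the R2DBC SPI (the events table and its columns are made up for illustration):

import io.r2dbc.spi.Batch;
import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;
import io.r2dbc.spi.Statement;
import org.reactivestreams.Publisher;

// Statement-level batching with bind parameters (the slow path reported here):
static Publisher<? extends Result> viaStatement(Connection connection) {
    Statement statement = connection
        .createStatement("INSERT INTO events (id, payload) VALUES ($1, $2)")
        .bind("$1", 1L).bind("$2", "a")
        .add()                            // finish the first binding set
        .bind("$1", 2L).bind("$2", "b");  // the last set needs no add()
    return statement.execute();
}

// Connection-level batching with inline SQL (the fast path observed above:
// the statements go out back-to-back in a single packet):
static Publisher<? extends Result> viaBatch(Connection connection) {
    Batch batch = connection.createBatch()
        .add("INSERT INTO events (id, payload) VALUES (1, 'a')")
        .add("INSERT INTO events (id, payload) VALUES (2, 'b')");
    return batch.execute();
}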

@jrudolph (Author) commented Jul 5, 2022

(I can try to create another reproduction including Wireshark output if you need it, but it should be pretty easy to reproduce.)

@patriknw (Contributor) commented Jul 5, 2022

I think it is running the bindings one-by-one here:

return bindings.asFlux()
    .map(it -> {
        Flux<BackendMessage> messages = collectBindingParameters(it)
            .flatMapMany(values -> ExtendedFlowDelegate.runQuery(this.resources, factory, sql, it, values, this.fetchSize))
            .doOnComplete(() -> tryNextBinding(iterator, bindings, canceled));
        return PostgresqlResult.toResult(this.resources, messages, factory);
    })

@mp911de (Collaborator) commented Jul 5, 2022

The performance you experience pretty much depends on the input variance. Parametrized queries use the extended flow to prepare a query on the server side and then run the prepared query with the bindings provided to the statement. Queries that do not change and are called with the same parameter types should typically yield a similar performance profile as inline-parametrized (static) statements.

The catch with Statement.add(…) is that each Statement may result in a number of results that are fetched individually (open, consume, close cursor). Therefore, we cannot pipeline server requests and send all bindings at once, but must resort to one-by-one processing. On the other hand, Batch collects all statements and sends the entire SQL text (;-concatenated SQL) to the server which is just one call instead of a multi-roundtrip message exchange.

I'm actually wondering why the JDBC driver would yield a better performance profile as it should work exactly the same way. Both drivers have the same amount of information and should be able to have a similar throughput. Maybe @davecramer has some more insights.

@jrudolph (Author) commented Jul 5, 2022

The catch with Statement.add(…) is that each Statement may result in a number of results that are fetched individually (open, consume, close cursor).

Good point. I guess batch statements mostly make sense for updates where you don't care about the result, so that might be the difference? Also, there's a difference in the API: JDBC only allows consuming the data from the first statement, while r2dbc seems to allow consumption of all results?

@davecramer (Member)

I'm actually wondering why the JDBC driver would yield a better performance profile as it should work exactly the same way. Both drivers have the same amount of information and should be able to have a similar throughput. Maybe @davecramer has some more insights.

I haven't looked at this in a while, but if @jrudolph is correct, we send all the requests at once, which minimizes the round trips.

@mp911de (Collaborator) commented Jul 6, 2022

r2dbc seems to allow consumption of all results?

Yes, this is true. I need to think about proposals on how we could optimize here.

Right now, I can imagine that pipelining would work if the fetch size is zero (fetch everything), so we can request all data at once (sending all the requests within a small number of packets) without employing a multi-round-trip conversation that takes backpressure into account.
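
For illustration, the fetch size can already be hinted per statement through the SPI's Statement.fetchSize; a minimal sketch, assuming zero means "fetch all rows" as described above (table name is illustrative):

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;
import org.reactivestreams.Publisher;

static Publisher<? extends Result> pipeliningCandidate(Connection connection) {
    return connection.createStatement("INSERT INTO events (id) VALUES ($1)")
        .bind("$1", 1L).add()
        .bind("$1", 2L)
        .fetchSize(0) // 0 = fetch everything; no per-result cursor round trips
        .execute();
}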

@patriknw (Contributor) commented Jul 6, 2022

I need to think about proposals on how we could optimize here.

If it matters, in my opinion it's most important to optimize this for batch inserts. For inserts you don't care about the result value; I guess it's always 1. If one statement fails, the entire batch (transaction) fails.

@davecramer (Member)

What we (pgjdbc) do is rewrite the inserts into one insert using multi-value writes, i.e.

insert into x values (1)
insert into x values (2)

becomes

insert into x values (1),(2) 
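
For context, a hedged JDBC sketch of how pgjdbc users opt into this rewrite: the reWriteBatchedInserts connection property enables it for batched inserts (URL, credentials, and table are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class RewriteBatchDemo {
    public static void main(String[] args) throws Exception {
        // reWriteBatchedInserts=true turns batched single-row inserts
        // into one multi-value insert on the wire.
        String url = "jdbc:postgresql://localhost:5432/test?reWriteBatchedInserts=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "secret");
             PreparedStatement ps = conn.prepareStatement("insert into x values (?)")) {
            ps.setInt(1, 1);
            ps.addBatch();
            ps.setInt(1, 2);
            ps.addBatch();
            ps.executeBatch(); // sent as: insert into x values (1),(2)
        }
    }
}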

@mp911de (Collaborator) commented Jul 6, 2022

Interesting. R2DBC drivers refrain from rewriting SQL in any form, as we do not want to get ourselves into the SQL parsing (sigh) or rewriting business. In any case, a multi-value insert can be passed into the driver directly, because it is SQL after all.

@davecramer (Member)

We don't parse it either; this is one case where we rewrite it, though. It can also be rewritten as an array:

insert into x values (unnest(array[1,2,3,4]));
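
A sketch of how the array form could pair with a single bind parameter in r2dbc, assuming the driver maps a Java Integer[] to a Postgres integer array (table name as in the example above):

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;
import org.reactivestreams.Publisher;

static Publisher<? extends Result> insertViaUnnest(Connection connection) {
    // One bind parameter regardless of row count, so the SQL text
    // (and any prepared-statement cache entry) stays stable.
    return connection.createStatement("insert into x select unnest($1)")
        .bind("$1", new Integer[]{1, 2, 3, 4})
        .execute();
}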

@patriknw (Contributor) commented Jul 6, 2022

a multi-value insert can be passed into the driver directly because it is SQL after all

I understand the examples above with literal values, but could you give some hints on how to do that with bind parameters? Would it be just one bind with very many bind parameters? If that is the case, the number of bind parameters would also depend on how many rows to insert, which probably isn't great for prepared statement caching?

Or would that be used with only one bind parameter and the array syntax you showed?

@mp911de (Collaborator) commented Jul 6, 2022

connection.createStatement("insert into x values ($1),($2)")
    .bind("$1", …)
    .bind("$2", …)
    .execute()

should do.

@davecramer (Member)

Prepared statement caching doesn't buy you that much in Postgres. I expect one round trip, which would be more than made up for by doing the inserts all in one statement.

@patriknw (Contributor) commented Jul 6, 2022

Taking the example from https://www.postgresql.org/docs/current/dml-insert.html

INSERT INTO products (product_no, name, price) VALUES
    ($1, $2, $3),
    ($4, $5, $6),
    ($7, $8, $9);

would require 9 bind parameters for 3 rows, 12 for 4, and so on? Or did I misunderstand the syntax?

Thanks for the hint, but it feels like an extreme workaround for such a common use case as batch inserts. I hope it can be solved in a better way in the driver itself.
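
To make the arithmetic concrete, a hedged sketch of generating the multi-value INSERT and its 3n bind parameters programmatically (Product is a hypothetical row holder, not driver API):

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Statement;
import java.math.BigDecimal;
import java.util.List;

record Product(int productNo, String name, BigDecimal price) {}

static Statement multiValueInsert(Connection connection, List<Product> rows) {
    StringBuilder sql = new StringBuilder("INSERT INTO products (product_no, name, price) VALUES ");
    for (int i = 0; i < rows.size(); i++) {
        int base = i * 3; // 3 parameters per row: 9 for 3 rows, 12 for 4, ...
        if (i > 0) {
            sql.append(", ");
        }
        sql.append("($").append(base + 1)
           .append(", $").append(base + 2)
           .append(", $").append(base + 3).append(")");
    }
    Statement statement = connection.createStatement(sql.toString());
    for (int i = 0; i < rows.size(); i++) {
        Product p = rows.get(i);
        int base = i * 3;
        statement.bind("$" + (base + 1), p.productNo())
                 .bind("$" + (base + 2), p.name())
                 .bind("$" + (base + 3), p.price());
    }
    return statement; // caveat: a different SQL text (and cache entry) per distinct row count
}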

@davecramer (Member)

Thanks for the hint, but it feels like an extreme workaround for such a common use case as batch inserts. I hope it can be solved in a better way in the driver itself.

which is why the JDBC driver does it for you
