Inefficient statement-level batching #527

Open · jrudolph opened this issue Jul 5, 2022 · 15 comments
Labels: status: waiting-for-triage (An issue we've not yet triaged)

@jrudolph commented Jul 5, 2022

Executing batches via Statement.add() and bind(…) is significantly slower than using Connection.createBatch(), even though using bindings should be faster than recreating individual textual SQL statements and sending them over the connection.

As far as I could tell from debugging with Wireshark, Connection.createBatch will send all the (mostly duplicated) SQL statements back-to-back in a single packet. Statement batches, on the other hand, do not send all of the bind/execute commands in one go; instead, each statement is run one by one. Compare this to JDBC statement-level batching, where all the bind and execute commands are also sent back-to-back in one go. For this reason (though not only this one), batched JDBC statements are also quite a bit faster than r2dbc.

(In akka/akka-persistence-r2dbc@2ea4121 we were considering actually switching to connection-level batching, but issuing statements without bind, i.e. creating individual textual SQL statements, seems too big a risk / anti-pattern to use as a workaround for this issue in the driver.)
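
For reference, a minimal sketch of the two batching styles being compared, written against the R2DBC SPI (the events table and its columns are made up for illustration):

import io.r2dbc.spi.Batch;
import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;
import io.r2dbc.spi.Statement;
import org.reactivestreams.Publisher;

// Statement-level batching with bind parameters (the slow path reported here):
static Publisher<? extends Result> viaStatement(Connection connection) {
    Statement statement = connection
        .createStatement("INSERT INTO events (id, payload) VALUES ($1, $2)")
        .bind("$1", 1L).bind("$2", "a")
        .add()                            // finish the first binding set
        .bind("$1", 2L).bind("$2", "b");  // the last set needs no add()
    return statement.execute();
}

// Connection-level batching with inline SQL (the fast path observed above:
// the statements go out back-to-back in a single packet):
static Publisher<? extends Result> viaBatch(Connection connection) {
    Batch batch = connection.createBatch()
        .add("INSERT INTO events (id, payload) VALUES (1, 'a')")
        .add("INSERT INTO events (id, payload) VALUES (2, 'b')");
    return batch.execute();
}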

@jrudolph (Author) commented Jul 5, 2022

(I can try to create another reproduction including Wireshark output if you need it, but it should be pretty easy to reproduce.)

@patriknw (Contributor) commented Jul 5, 2022

I think it is running the bindings one-by-one here:

return bindings.asFlux()
    .map(it -> {
        Flux<BackendMessage> messages = collectBindingParameters(it)
            .flatMapMany(values -> ExtendedFlowDelegate.runQuery(this.resources, factory, sql, it, values, this.fetchSize))
            .doOnComplete(() -> tryNextBinding(iterator, bindings, canceled));
        return PostgresqlResult.toResult(this.resources, messages, factory);
    })

@mp911de (Collaborator) commented Jul 5, 2022

The performance you experience pretty much depends on the input variance. Parametrized queries use the extended flow to prepare a query on the server side and then run the prepared query with the bindings provided to the statement. Queries that do not change and are called with the same parameter types should typically yield a similar performance profile as inline-parametrized (static) statements.

The catch with Statement.add(…) is that each Statement may result in a number of results that are fetched individually (open, consume, close cursor). Therefore, we cannot pipeline server requests and send all bindings at once, but must resort to one-by-one processing. On the other hand, Batch collects all statements and sends the entire SQL text (;-concatenated SQL) to the server which is just one call instead of a multi-roundtrip message exchange.

I'm actually wondering why the JDBC driver would yield a better performance profile as it should work exactly the same way. Both drivers have the same amount of information and should be able to have a similar throughput. Maybe @davecramer has some more insights.

@jrudolph (Author) commented Jul 5, 2022

The catch with Statement.add(…) is that each Statement may result in a number of results that are fetched individually (open, consume, close cursor).

Good point. I guess batch statements mostly make sense for updates where you don't care about the result, so that might be the difference? Also, there's a difference in the API: JDBC only allows consuming the data from the first statement, while r2dbc seems to allow consumption of all results?

@davecramer (Member)

I'm actually wondering why the JDBC driver would yield a better performance profile as it should work exactly the same way. Both drivers have the same amount of information and should be able to have a similar throughput. Maybe @davecramer has some more insights.

I haven't looked at this in a while, but if @jrudolph is correct, we send all the requests at once, which minimizes the round trips.

@mp911de (Collaborator) commented Jul 6, 2022

r2dbc seems to allow consumption of all results?

Yes, this is true. I need to think about proposals on how we could optimize here.

Right now, I can imagine that pipelining would work if the fetch size is zero (fetch everything), so we can request all data at once (sending all the requests within a small number of packets) without employing a multi-round-trip conversation that takes backpressure into account.
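
For illustration, the fetch size can already be hinted per statement through the SPI's Statement.fetchSize; a minimal sketch, assuming zero means "fetch all rows" as described above (table name is illustrative):

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;
import org.reactivestreams.Publisher;

static Publisher<? extends Result> pipeliningCandidate(Connection connection) {
    return connection.createStatement("INSERT INTO events (id) VALUES ($1)")
        .bind("$1", 1L).add()
        .bind("$1", 2L)
        .fetchSize(0) // 0 = fetch everything; no per-result cursor round trips
        .execute();
}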

@patriknw (Contributor) commented Jul 6, 2022

I need to think about proposals on how we could optimize here.

If it matters, in my opinion it's most important to optimize this for batch inserts. For inserts you don't care about the result value; I guess it's always 1. If one statement fails, the entire batch (transaction) fails.

@davecramer (Member)

What we (pgjdbc) do is rewrite the inserts into one insert using multi-value writes, i.e.

insert into x values (1)
insert into x values (2)

becomes

insert into x values (1),(2) 
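
For context, a hedged JDBC sketch of how pgjdbc users opt into this rewrite: the reWriteBatchedInserts connection property enables it for batched inserts (URL, credentials, and table are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class RewriteBatchDemo {
    public static void main(String[] args) throws Exception {
        // reWriteBatchedInserts=true turns batched single-row inserts
        // into one multi-value insert on the wire.
        String url = "jdbc:postgresql://localhost:5432/test?reWriteBatchedInserts=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "secret");
             PreparedStatement ps = conn.prepareStatement("insert into x values (?)")) {
            ps.setInt(1, 1);
            ps.addBatch();
            ps.setInt(1, 2);
            ps.addBatch();
            ps.executeBatch(); // sent as: insert into x values (1),(2)
        }
    }
}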

@mp911de (Collaborator) commented Jul 6, 2022

Interesting. R2DBC drivers refrain from rewriting SQL in any form, as we do not want to get ourselves into the SQL parsing (sigh) or rewriting business. In any case, a multi-value insert can be passed into the driver directly, because it is SQL after all.

@davecramer (Member)

We don't parse it either; this is one case where we rewrite it, though. It can also be rewritten as an array:

insert into x values (unnest(array[1,2,3,4]));
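
A sketch of how the array form could pair with a single bind parameter in r2dbc, assuming the driver maps a Java Integer[] to a Postgres integer array (table name as in the example above):

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Result;
import org.reactivestreams.Publisher;

static Publisher<? extends Result> insertViaUnnest(Connection connection) {
    // One bind parameter regardless of row count, so the SQL text
    // (and any prepared-statement cache entry) stays stable.
    return connection.createStatement("insert into x select unnest($1)")
        .bind("$1", new Integer[]{1, 2, 3, 4})
        .execute();
}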

@patriknw (Contributor) commented Jul 6, 2022

a multi-value insert can be passed into the driver directly because it is SQL after all

I understand the examples above with literal values, but could you give some hints on how to do that with bind parameters? Would it be just one bind with very many bind parameters? If that is the case, the number of bind parameters would also depend on how many rows to insert, which probably isn't great for prepared statement caching?

Or would that be used with only one bind parameter and the array syntax you showed?

@mp911de (Collaborator) commented Jul 6, 2022

connection.createStatement("insert into x values ($1),($2)")
    .bind("$1", …)
    .bind("$2", …)
    .execute()

should do.

@davecramer (Member)

Prepared statement caching doesn't buy you that much in Postgres. I expect one round trip, which would be more than made up for by doing the inserts all in one statement.

@patriknw (Contributor) commented Jul 6, 2022

Taking the example from https://www.postgresql.org/docs/current/dml-insert.html

INSERT INTO products (product_no, name, price) VALUES
    ($1, $2, $3),
    ($4, $5, $6),
    ($7, $8, $9);

would require 9 bind parameters for 3 rows, 12 for 4, and so on? Or did I misunderstand the syntax?

Thanks for the hint, but it feels like an extreme workaround for such a common use case as batch inserts. I hope it can be solved in a better way in the driver itself.
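
To make the arithmetic concrete, a hedged sketch of generating the multi-value INSERT and its 3n bind parameters programmatically (Product is a hypothetical row holder, not driver API):

import io.r2dbc.spi.Connection;
import io.r2dbc.spi.Statement;
import java.math.BigDecimal;
import java.util.List;

record Product(int productNo, String name, BigDecimal price) {}

static Statement multiValueInsert(Connection connection, List<Product> rows) {
    StringBuilder sql = new StringBuilder("INSERT INTO products (product_no, name, price) VALUES ");
    for (int i = 0; i < rows.size(); i++) {
        int base = i * 3; // 3 parameters per row: 9 for 3 rows, 12 for 4, ...
        if (i > 0) {
            sql.append(", ");
        }
        sql.append("($").append(base + 1)
           .append(", $").append(base + 2)
           .append(", $").append(base + 3).append(")");
    }
    Statement statement = connection.createStatement(sql.toString());
    for (int i = 0; i < rows.size(); i++) {
        Product p = rows.get(i);
        int base = i * 3;
        statement.bind("$" + (base + 1), p.productNo())
                 .bind("$" + (base + 2), p.name())
                 .bind("$" + (base + 3), p.price());
    }
    return statement; // caveat: a different SQL text (and cache entry) per distinct row count
}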

@davecramer (Member)

Thanks for the hint, but it feels like an extreme workaround for such a common use case as batch inserts. I hope it can be solved in a better way in the driver itself.

which is why the JDBC driver does it for you
