
fix: return session to the pool after commit/rollback #10673

Open · wants to merge 3 commits into base: main
Conversation

olavloite (Contributor):

Transactions would hold on to the session they were using after having been committed or rolled back, and would not return it to the pool until the transaction was disposed. This could cause higher than expected session usage, and could in theory also exhaust the session pool if an application executed a large number of transactions in a loop without disposing them within that loop.

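To illustrate the pattern described above, here is a hypothetical repro sketch (not code from this PR; the connection string, table, and column names are placeholders):

```csharp
using System.Collections.Generic;
using Google.Cloud.Spanner.Data;

// Hypothetical sketch: transactions are committed inside a loop but only
// disposed after it. Before this fix, each committed transaction kept holding
// its session until Dispose, so a long loop could drain the session pool.
using var connection = new SpannerConnection("Data Source=projects/p/instances/i/databases/d");
await connection.OpenAsync();
var transactions = new List<SpannerTransaction>();
for (var i = 0; i < 10_000; i++)
{
    var transaction = await connection.BeginTransactionAsync();
    var cmd = connection.CreateInsertCommand("MyTable");
    cmd.Parameters.Add("Id", SpannerDbType.Int64, i);
    cmd.Transaction = transaction;
    await cmd.ExecuteNonQueryAsync();
    await transaction.CommitAsync();
    transactions.Add(transaction); // disposed only after the loop ends
}
transactions.ForEach(t => t.Dispose());
// With this change, each session returns to the pool at CommitAsync time
// rather than at Dispose time.
```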
@olavloite olavloite marked this pull request as ready for review July 20, 2023 12:56
@olavloite olavloite requested a review from a team as a code owner July 20, 2023 12:56
@@ -287,6 +287,7 @@ public async Task CommitTimestampAsync()
// Insert a second row
cmd.Parameters["Id"].Value = 11L;
cmd.Parameters["Name"].Value = "Demo 2";
cmd.Transaction = transaction;
olavloite (Contributor Author):

This sample did not work as intended. The command did not actually use the retriable transaction in this block, which meant that:

  1. The command would use the transaction from the previous block.
  2. But as the command being executed in this transaction is a mutation, it did not really execute any command; mutations are buffered in the client and included in the Commit call.
  3. The Commit RPC would use the transaction ID from the previous transaction block. That transaction had already been committed on Cloud Spanner, which would cause Cloud Spanner to just return the commit response from that transaction.
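For reference, the fix in the sample is to attach the command to the transaction created inside the current retriable block, roughly like this (a sketch assuming the sample's surrounding connection; the table name is hypothetical):

```csharp
await connection.RunWithRetriableTransactionAsync(async transaction =>
{
    var cmd = connection.CreateInsertCommand("TestTable");
    cmd.Parameters.Add("Id", SpannerDbType.Int64, 11L);
    cmd.Parameters.Add("Name", SpannerDbType.String, "Demo 2");
    // Attach the command to *this* block's transaction so the mutation is
    // buffered into this transaction's Commit RPC, not the previous one's.
    cmd.Transaction = transaction;
    await cmd.ExecuteNonQueryAsync();
});
```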

@amanda-tarafa amanda-tarafa self-assigned this Jul 21, 2023
amanda-tarafa (Contributor) left a comment:

I've left a few minor comments.

But this is a breaking change: right now calling code can change the DisposeBehavior of the transaction right before disposing, and that will have an effect on how the session is returned to the pool, etc.
With this change, users would have to change the DisposeBehavior before committing or rolling back for the same effect.

I don't think this would be a common use case, but we have no way of knowing.
@jskeet, what do you think?

My opinion is that this change is desirable (it's really not a bug, it's a bad API surface), but being breaking it needs to wait for a new major version. We have other breaking changes to make around transaction and session disposal for the next major version, and this could be part of that.

return Task.FromResult(_commitResponse.CommitTimestamp.ToDateTime());
}
CheckNotHasCommittedOrRolledBack();
_hasCommittedOrRolledBack = true;
amanda-tarafa (Contributor):

I think you should only set this if the commit or rollback succeeded. If this returns, say, a 500, users could retry the commit, right? And the same for the rollback implementation.

olavloite (Contributor Author):

I'm not sure. There are two competing things here:

  1. Marking it as committed/rolled back here will allow us to dispose it directly after the RPC invocation, regardless of whether the invocation succeeded, but won't allow any manual retries of the Commit method invocation.
  2. Waiting with marking it committed until the RPC has succeeded means that we can't dispose the transaction if the RPC fails. That would also include errors that happened because the application tried to insert data into a non-existing table.

The reason that I think it's better to aggressively mark it as committed and dispose it is:

  1. Most transient errors, e.g. UNAVAILABLE, are retried transparently by Gax, and so won't bubble up here if the retry succeeded. And if it bubbles up, it is normally not something that should be retried.
  2. Most errors that will bubble up here will not be transient, but should either cause the entire transaction to be retried (e.g. ABORTED) or just indicate that the transaction cannot succeed (e.g. ALREADY_EXISTS errors if the transaction tried to insert a duplicate row).

The only thing that we are blocking with this would be if someone has added manual retries of actual transient errors, instead of relying on the standard retry mechanism built into the library.

amanda-tarafa (Contributor):

> The only thing that we are blocking with this would be if someone has added manual retries of actual transient errors, instead of relying on the standard retry mechanism built into the library.

That, and users who continue to retry transient errors beyond the standard retry mechanism. Both would constitute breaking changes, so if we were to do this we'd need to include it in a new major version. Note that this is not the only breaking change this PR introduces; see my notes on the previous review.

But even if we were to accept the breaking change, as a user I would still find it unexpected to have my transaction rolled back in the presence of transient errors, even if those were already retried by the library.
I think it would be fine to roll back in the presence of non-transient errors like the ALREADY_EXISTS example you mentioned earlier, although even this would be a breaking change.

In general I agree with the aim to release the session as soon as possible. But I also think that we should be cautious and continue to support some reasonable use cases, like users wanting to retry some transient errors further. We really haven't gotten any reports stemming from sessions not being released soon enough, so I'd rather we don't go to the other extreme.

olavloite (Contributor Author):

> That, and users who continue to retry transient errors beyond the standard retry mechanism. Both would constitute breaking changes, so if we were to do this we'd need to include it in a new major version. Note that this is not the only breaking change this PR introduces; see my notes on the previous review.

Yes, I agree that this is a breaking change.

> ...continue to support some reasonable use cases like users wanting to retry further some transient errors.

I would prefer to keep this as simple as possible, and accept that we don't support that. Currently, the client library itself does not determine whether an error is transient or not. Instead, we have the following layering:

  1. If a gRPC invocation fails, Gax determines whether an error is transient and retryable. If it is, and the retry settings allow it to be retried (e.g. the deadline has not been exceeded), then Gax retries it. This is all transparent to the client library.
  2. Any error that escapes 1, either because Gax did not consider it retryable or because the retry settings disallowed it to be retried further, is bubbled up to the client application.

The only errors that the client library handles specifically in some cases are Aborted errors. These cause the entire transaction to be retried when a RetriableTransaction is used.

If we were to support that the user can retry transient errors after Gax has given up, then we need some way to determine what constitutes a transient error in the client library, and add that as an extra layer between the two mentioned above. Do we consider all error codes that have been registered as retryable for the Commit RPC retryable? E.g. should a DEADLINE_EXCEEDED error be considered transient in this case if it is registered as a retryable error code, and Gax has given up because the total timeout has been exceeded?
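As a concrete illustration of that layering (a hedged sketch, not code from this PR; the table, column names, and DML text are made up):

```csharp
using Google.Cloud.Spanner.Data;

try
{
    // Layer 2: only errors that escape Gax's transparent retries (layer 1)
    // surface here as SpannerExceptions.
    await connection.RunWithRetriableTransactionAsync(async transaction =>
    {
        // An ABORTED error inside this delegate causes the client library to
        // retry the whole delegate; other errors propagate to the caller.
        var cmd = connection.CreateDmlCommand("UPDATE MyTable SET Value = @v WHERE Id = @id");
        cmd.Parameters.Add("v", SpannerDbType.Int64, 1L);
        cmd.Parameters.Add("id", SpannerDbType.Int64, 1L);
        cmd.Transaction = transaction;
        await cmd.ExecuteNonQueryAsync();
    });
}
catch (SpannerException e) when (e.ErrorCode == ErrorCode.DeadlineExceeded)
{
    // By this point Gax has already exhausted its retry budget; whether an
    // application-level retry is still meaningful is the open question here.
}
```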

> We really haven't gotten any reports stemming from sessions not being released soon enough

There's suspicion that this was a contributing factor in a case where the application got itself completely stuck. (Not the error handling, but the late return of sessions to the pool in general.)

amanda-tarafa (Contributor):

I still prefer not to roll back at all on failure, even any failure, given that Spanner itself does not roll back failed transactions. But I do see your points. If we go with the "we roll back on any failure" approach, we need to document that very clearly, not just because it's currently a breaking change, but also because, in my opinion, it'll be very unexpected behaviour for users.

Also, note that even if we don't roll back (and release) failed transactions, with this change, successful or explicitly rolled-back transactions (hopefully way more common than failed transactions) will still be released sooner. My point is that the already scarce issues on "sessions not being released soon enough" should be greatly reduced even if we don't roll back failed transactions.

Also, should Spanner consider automatically rolling back unrecoverable transactions and sending a signal back to clients?

@jskeet for your thoughts on this

olavloite (Contributor Author):

> Also, should Spanner consider automatically rolling back unrecoverable transactions and sending a signal back to clients?

Spanner actually does that (and my example with an ALREADY_EXISTS error was a bad one in this case, as Spanner actually considers that an unrecoverable error). So simple errors, like for example syntax errors, allow you to continue with the transaction as long as you catch them in application code. Other errors, like a constraint violation for an insert statement, are treated by Spanner as unrecoverable errors. Trying to commit the transaction after catching such an error in application code will still fail with the same error.

Spanner does not, however, include any information in the error that indicates whether it is an unrecoverable error or not. You will only know that if you try to continue with the transaction.
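A hedged sketch of that difference (table, column, and statements are made up for illustration):

```csharp
using Google.Cloud.Spanner.Data;

await connection.RunWithRetriableTransactionAsync(async transaction =>
{
    try
    {
        // A simple statement error (here a deliberate syntax error): Spanner
        // lets the transaction continue if the application catches it.
        var bad = connection.CreateDmlCommand("UPDTAE Singers SET Active = true");
        bad.Transaction = transaction;
        await bad.ExecuteNonQueryAsync();
    }
    catch (SpannerException)
    {
        // Recoverable: we can carry on with the same transaction.
    }

    var good = connection.CreateDmlCommand("UPDATE Singers SET Active = true WHERE SingerId = 1");
    good.Transaction = transaction;
    await good.ExecuteNonQueryAsync();
    // Had the earlier error been one Spanner treats as unrecoverable (e.g. a
    // constraint violation), Commit would fail again with the same error.
});
```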

amanda-tarafa (Contributor):

Honestly, that makes me think that the appropriate thing would be for Spanner to signal back to the client which transactions have been rolled back. For those we can release the session back to the pool, and not for the others. It's the backend that can make that decision safely, I think.

Given that, regardless, this is a breaking change (because of the dispose behaviour aspect), so we'd want to include it in the next major version, which won't happen immediately: do you think Spanner would consider sending that kind of signal back to clients?

jskeet (Collaborator):

I'd definitely like more information from the Spanner server.

One aspect to drill down on:

> There's suspicion that this was a contributing factor in a case where the application got itself completely stuck. (Not the error handling, but the late return of sessions to the pool in general.)

I don't remember the details, but if that was due to a customer bug (e.g. not disposing of SpannerTransaction or similar) then I'm less convinced it's needed - at least for that reason. Typically, papering over the symptoms of a customer bug just means some other symptom for the same bug comes up later.

That doesn't mean it's not worth doing - but I wouldn't want "it masks one impact of a customer bug" to be a justification.

// Disposing the transaction will return the session to the pool.
// Retriable transactions reuse the same session for a new transaction if the transaction is aborted
// by Cloud Spanner, and will make sure the session is returned to the pool when done.
if (!_isRetriable)
amanda-tarafa (Contributor):

Dispose already checks for this. No need to repeat that here and in rollback.

olavloite (Contributor Author):

Done

// timestamp, and there's no need for us to repeat the RPC here when we already know the
// result. This also allows us to more aggressively return the session to the pool after
// committing/rolling back a transaction.
if (_hasCommittedOrRolledBack && _commitResponse != null)
amanda-tarafa (Contributor):

Here maybe just check that _commitResponse is not null; that's only possible if there was a commit.

olavloite (Contributor Author):

Done

// by Cloud Spanner, and will make sure the session is returned to the pool when done.
if (!_isRetriable)
{
Dispose();
amanda-tarafa (Contributor):

And if you set the _hasCommittedOrRolledBack flag as I'm suggesting, you need to dispose of the transaction only on success.

olavloite (Contributor Author):

Agree, but see my comment above. I don't think manually retrying the Commit-method is something that we need to support, especially as it would mean that we would not be able to dispose the transaction and return the session to the pool after a valid error message, like ALREADY_EXISTS if the user tried to insert a duplicate row using mutations (mutations are sent together with the Commit RPC).

amanda-tarafa (Contributor) left a comment:

I've replied to your comments. But also highlighting this from my previous review:

> But this is a breaking change: right now calling code can change the DisposeBehavior of the transaction right before disposing, and that will have an effect on how the session is returned to the pool, etc.
> With this change, users would have to change the DisposeBehavior before committing or rolling back for the same effect.
>
> I don't think this would be a common use case, but we have no way of knowing.
> @jskeet, what do you think?
>
> My opinion is that this change is desirable (it's really not a bug, it's a bad API surface), but being breaking it needs to wait for a new major version. We have other breaking changes to make around transaction and session disposal for the next major version, and this could be part of that.


jskeet (Collaborator) left a comment:

I'm still in two minds about the change here. I've added comments as far as I'm confident doing so, but I don't have enough knowledge of the details to provide more guidance :(

await connection.OpenAsync();

using var tx = await connection.BeginTransactionAsync();
using var cmd = connection.CreateSelectCommand($"SELECT Int64Value FROM {_fixture.TableName} WHERE K=@k");
jskeet (Collaborator):

Is it relevant that we're not disposing of cmd until the end of the method? (I love the new simplified using statement, but I wonder whether, for this and the above test, it would be worth using the "old style" approach to make it clear when Dispose is called.)

@@ -144,7 +226,7 @@ public async Task AbortedThrownCorrectly()
// connection 2 reads again -- abort should be thrown.

// Note: deeply nested using statements to ensure that we dispose of everything even in the case of failure,
// but we manually dispose of both tx1 and connection1.
// but we manually dispose of both tx1 and connection1.
jskeet (Collaborator):

I'll create a separate PR to separate all the whitespace changes from actual changes.

jskeet (Collaborator):

I've merged the change - if you rebase to HEAD now, the whitespace changes should go away.

