
fix: return session to the pool after commit/rollback #10673

Open · wants to merge 3 commits into base: main
Conversation

olavloite (Contributor):

Transactions would hold on to the session they were using after having been committed or rolled back, and would not return it to the pool until the transaction was disposed. This could cause higher than expected session usage, and could in theory also exhaust the session pool if an application executed a large number of transactions in a loop without disposing them within that loop.

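To illustrate the pattern described above, here is a hypothetical repro sketch (not code from this PR; the connection string, table, and column names are placeholders):

```csharp
using System.Collections.Generic;
using Google.Cloud.Spanner.Data;

// Hypothetical sketch: transactions are committed inside a loop but only
// disposed after it. Before this fix, each committed transaction kept holding
// its session until Dispose, so a long loop could drain the session pool.
using var connection = new SpannerConnection("Data Source=projects/p/instances/i/databases/d");
await connection.OpenAsync();
var transactions = new List<SpannerTransaction>();
for (var i = 0; i < 10_000; i++)
{
    var transaction = await connection.BeginTransactionAsync();
    var cmd = connection.CreateInsertCommand("MyTable");
    cmd.Parameters.Add("Id", SpannerDbType.Int64, i);
    cmd.Transaction = transaction;
    await cmd.ExecuteNonQueryAsync();
    await transaction.CommitAsync();
    transactions.Add(transaction); // disposed only after the loop ends
}
transactions.ForEach(t => t.Dispose());
// With this change, each session returns to the pool at CommitAsync time
// rather than at Dispose time.
```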
@olavloite olavloite marked this pull request as ready for review July 20, 2023 12:56
@olavloite olavloite requested a review from a team as a code owner July 20, 2023 12:56
@@ -287,6 +287,7 @@ public async Task CommitTimestampAsync()
// Insert a second row
cmd.Parameters["Id"].Value = 11L;
cmd.Parameters["Name"].Value = "Demo 2";
cmd.Transaction = transaction;
olavloite (Contributor Author):

This sample did not work as intended. The command did not actually use the retriable transaction in this block, which meant that:

  1. The command would use the transaction from the previous block.
  2. But as the command being executed in this transaction is a mutation, it did not really execute any command; mutations are buffered in the client and included in the Commit call.
  3. The Commit RPC would use the transaction ID from the previous transaction block. That transaction had already been committed on Cloud Spanner, which would cause Cloud Spanner to just return the commit response from that transaction.
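For reference, the fix in the sample is to attach the command to the transaction created inside the current retriable block, roughly like this (a sketch assuming the sample's surrounding connection; the table name is hypothetical):

```csharp
await connection.RunWithRetriableTransactionAsync(async transaction =>
{
    var cmd = connection.CreateInsertCommand("TestTable");
    cmd.Parameters.Add("Id", SpannerDbType.Int64, 11L);
    cmd.Parameters.Add("Name", SpannerDbType.String, "Demo 2");
    // Attach the command to *this* block's transaction so the mutation is
    // buffered into this transaction's Commit RPC, not the previous one's.
    cmd.Transaction = transaction;
    await cmd.ExecuteNonQueryAsync();
});
```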

@amanda-tarafa amanda-tarafa self-assigned this Jul 21, 2023
amanda-tarafa (Contributor) left a comment:

I've left a few minor comments.

But this is a breaking change: right now calling code can change the DisposeBehavior of the transaction right before disposing, and that will have an effect on how the session is returned to the pool, etc.
With this change, users would have to change the DisposeBehavior before committing or rolling back for the same effect.

I don't think this would be a common use case, but we have no way of knowing.
@jskeet, what do you think?

My opinion is that this change is desirable (it's really not a bug, it's a bad API surface), but being breaking it needs to wait for a new major version. We have other breaking changes to make around transaction and session disposal for the next major version, and this could be part of that.

return Task.FromResult(_commitResponse.CommitTimestamp.ToDateTime());
}
CheckNotHasCommittedOrRolledBack();
_hasCommittedOrRolledBack = true;
amanda-tarafa (Contributor):

I think you should only set this if the commit or rollback succeeded. If this returns, say, a 500, users could retry the commit, right? And the same for the rollback implementation.

olavloite (Contributor Author):

I'm not sure. There are two competing things here:

  1. Marking it as committed/rolled back here will allow us to dispose it directly after the RPC invocation, regardless of whether the invocation succeeded, but won't allow any manual retries of the Commit method invocation.
  2. Waiting with marking it committed until the RPC has succeeded means that we can't dispose the transaction if the RPC fails. That would also include errors that happened because the application tried to insert data into a non-existing table.

The reason that I think it's better to aggressively mark it as committed and dispose it is:

  1. Most transient errors, e.g. UNAVAILABLE, are retried transparently by Gax, and so won't bubble up here if the retry succeeded. And if it bubbles up, it is normally not something that should be retried.
  2. Most errors that will bubble up here will not be transient, but should either cause the entire transaction to be retried (e.g. ABORTED) or just indicate that the transaction cannot succeed (e.g. ALREADY_EXISTS errors if the transaction tried to insert a duplicate row).

The only thing that we are blocking with this would be if someone has added manual retries of actual transient errors, instead of relying on the standard retry mechanism built into the library.

amanda-tarafa (Contributor):

> The only thing that we are blocking with this would be if someone has added manual retries of actual transient errors, instead of relying on the standard retry mechanism built into the library.

That, and users who continue to retry transient errors beyond the standard retry mechanism. Both would constitute breaking changes, so if we were to do this we'd need to include it in a new major version. Note that this is not the only breaking change this PR introduces; see my notes on the previous review.

But even if we were to accept the breaking change, as a user I would still find it unexpected to have my transaction rolled back in the presence of transient errors, even if those were already retried by the library.
I think it would be fine to roll back in the presence of non-transient errors like the ALREADY_EXISTS example you mentioned earlier, although even this would be a breaking change.

In general I agree with the aim to release the session as soon as possible. But I also think that we should be cautious and continue to support some reasonable use cases, like users wanting to retry some transient errors further. We really haven't gotten any reports stemming from sessions not being released soon enough, so I'd rather we don't go to the other extreme.

olavloite (Contributor Author):

> That, and users who continue to retry transient errors beyond the standard retry mechanism. Both would constitute breaking changes, so if we were to do this we'd need to include it in a new major version. Note that this is not the only breaking change this PR introduces; see my notes on the previous review.

Yes, I agree that this is a breaking change.

> ...continue to support some reasonable use cases like users wanting to retry further some transient errors.

I would prefer to keep this as simple as possible, and accept that we don't support that. Currently, the client library itself does not determine whether an error is transient or not. Instead, we have the following layering:

  1. If a gRPC invocation fails, Gax determines whether an error is transient and retryable. If it is, and the retry settings allow it to be retried (e.g. the deadline has not been exceeded), then Gax retries it. This is all transparent to the client library.
  2. Any error that escapes 1, either because Gax did not consider it retryable or because the retry settings disallowed it to be retried further, is bubbled up to the client application.

The only errors that the client library handles specifically in some cases are Aborted errors. These cause the entire transaction to be retried when a RetriableTransaction is used.

If we were to support that the user can retry transient errors after Gax has given up, then we need some way to determine what constitutes a transient error in the client library, and add that as an extra layer between the two mentioned above. Do we consider all error codes that have been registered as retryable for the Commit RPC retryable? E.g. should a DEADLINE_EXCEEDED error be considered transient in this case if it is registered as a retryable error code, and Gax has given up because the total timeout has been exceeded?
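As a concrete illustration of that layering (a hedged sketch, not code from this PR; the table, column names, and DML text are made up):

```csharp
using Google.Cloud.Spanner.Data;

try
{
    // Layer 2: only errors that escape Gax's transparent retries (layer 1)
    // surface here as SpannerExceptions.
    await connection.RunWithRetriableTransactionAsync(async transaction =>
    {
        // An ABORTED error inside this delegate causes the client library to
        // retry the whole delegate; other errors propagate to the caller.
        var cmd = connection.CreateDmlCommand("UPDATE MyTable SET Value = @v WHERE Id = @id");
        cmd.Parameters.Add("v", SpannerDbType.Int64, 1L);
        cmd.Parameters.Add("id", SpannerDbType.Int64, 1L);
        cmd.Transaction = transaction;
        await cmd.ExecuteNonQueryAsync();
    });
}
catch (SpannerException e) when (e.ErrorCode == ErrorCode.DeadlineExceeded)
{
    // By this point Gax has already exhausted its retry budget; whether an
    // application-level retry is still meaningful is the open question here.
}
```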

> We really haven't gotten any reports stemming from sessions not being released soon enough

There's suspicion that this was a contributing factor in a case where the application got itself completely stuck. (Not the error handling, but the late return of sessions to the pool in general.)

amanda-tarafa (Contributor):

I still prefer not to roll back at all on failure, even any failure, given that Spanner itself does not roll back failed transactions. But I do see your points. If we go with the "we roll back on any failure" approach, we need to document that very clearly, not just because it's currently a breaking change, but also because, in my opinion, it'll be very unexpected behaviour for users.

Also, note that even if we don't roll back (and release) failed transactions, with this change, successful or explicitly rolled-back transactions (hopefully way more common than failed transactions) will still be released sooner. My point is that the already scarce issues on "sessions not being released soon enough" should be greatly reduced even if we don't roll back failed transactions.

Also, should Spanner consider automatically rolling back unrecoverable transactions and sending a signal back to clients?

@jskeet for your thoughts on this

olavloite (Contributor Author):

> Also, should Spanner consider automatically rolling back unrecoverable transactions and sending a signal back to clients?

Spanner actually does that (and my example with an ALREADY_EXISTS error was a bad one in this case, as Spanner actually considers that an unrecoverable error). So simple errors, like for example syntax errors, allow you to continue with the transaction as long as you catch them in application code. Other errors, like a constraint violation for an insert statement, are treated by Spanner as unrecoverable errors. Trying to commit the transaction after catching such an error in application code will still fail with the same error.

Spanner does not, however, include any information in the error that indicates whether it is an unrecoverable error or not. You will only know that if you try to continue with the transaction.
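A hedged sketch of that difference (table, column, and statements are made up for illustration):

```csharp
using Google.Cloud.Spanner.Data;

await connection.RunWithRetriableTransactionAsync(async transaction =>
{
    try
    {
        // A simple statement error (here a deliberate syntax error): Spanner
        // lets the transaction continue if the application catches it.
        var bad = connection.CreateDmlCommand("UPDTAE Singers SET Active = true");
        bad.Transaction = transaction;
        await bad.ExecuteNonQueryAsync();
    }
    catch (SpannerException)
    {
        // Recoverable: we can carry on with the same transaction.
    }

    var good = connection.CreateDmlCommand("UPDATE Singers SET Active = true WHERE SingerId = 1");
    good.Transaction = transaction;
    await good.ExecuteNonQueryAsync();
    // Had the earlier error been one Spanner treats as unrecoverable (e.g. a
    // constraint violation), Commit would fail again with the same error.
});
```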

amanda-tarafa (Contributor):

Honestly, that makes me think that the appropriate thing would be for Spanner to signal back to the client which transactions have been rolled back. For those we can release the session back to the pool, and not for the others. It's the backend that can make that decision safely, I think.

Given that, regardless, this is a breaking change (because of the dispose behaviour aspect), so we'd want to include it in the next major version, which won't happen immediately: do you think Spanner would consider sending that kind of signal back to clients?

jskeet (Collaborator):

I'd definitely like more information from the Spanner server.

One aspect to drill down on:

> There's suspicion that this was a contributing factor in a case where the application got itself completely stuck. (Not the error handling, but the late return of sessions to the pool in general.)

I don't remember the details, but if that was due to a customer bug (e.g. not disposing of SpannerTransaction or similar) then I'm less convinced it's needed - at least for that reason. Typically, papering over the symptoms of a customer bug just means some other symptom for the same bug comes up later.

That doesn't mean it's not worth doing - but I wouldn't want "it masks one impact of a customer bug" to be a justification.

// Disposing the transaction will return the session to the pool.
// Retriable transactions reuse the same session for a new transaction if the transaction is aborted
// by Cloud Spanner, and will make sure the session is returned to the pool when done.
if (!_isRetriable)
amanda-tarafa (Contributor):

Dispose already checks for this. No need to repeat that here and in rollback.

olavloite (Contributor Author):

Done

// timestamp, and there's no need for us to repeat the RPC here when we already know the
// result. This also allows us to more aggressively return the session to the pool after
// committing/rolling back a transaction.
if (_hasCommittedOrRolledBack && _commitResponse != null)
amanda-tarafa (Contributor):

Here maybe just check that _commitResponse is not null; that's only possible if there was a commit.

olavloite (Contributor Author):

Done

// by Cloud Spanner, and will make sure the session is returned to the pool when done.
if (!_isRetriable)
{
Dispose();
amanda-tarafa (Contributor):

And if you set the _hasCommittedOrRolledBack flag as I'm suggesting, you need to dispose of the transaction only on success.

olavloite (Contributor Author):

Agree, but see my comment above. I don't think manually retrying the Commit-method is something that we need to support, especially as it would mean that we would not be able to dispose the transaction and return the session to the pool after a valid error message, like ALREADY_EXISTS if the user tried to insert a duplicate row using mutations (mutations are sent together with the Commit RPC).

amanda-tarafa (Contributor) left a comment:

I've replied to your comments. But also highlighting this from my previous review:

> But this is a breaking change: right now calling code can change the DisposeBehavior of the transaction right before disposing, and that will have an effect on how the session is returned to the pool, etc.
> With this change, users would have to change the DisposeBehavior before committing or rolling back for the same effect.
>
> I don't think this would be a common use case, but we have no way of knowing.
> @jskeet, what do you think?
>
> My opinion is that this change is desirable (it's really not a bug, it's a bad API surface), but being breaking it needs to wait for a new major version. We have other breaking changes to make around transaction and session disposal for the next major version, and this could be part of that.


jskeet (Collaborator) left a comment:

I'm still in two minds about the change here. I've added comments as far as I'm confident doing so, but I don't have enough knowledge of the details to provide more guidance :(

await connection.OpenAsync();

using var tx = await connection.BeginTransactionAsync();
using var cmd = connection.CreateSelectCommand($"SELECT Int64Value FROM {_fixture.TableName} WHERE K=@k");
jskeet (Collaborator):

Is it relevant that we're not disposing of cmd until the end of the method? (I love the new simplified using statement, but I wonder whether, for this and the above test, it would be worth using the "old style" approach to make it clear when Dispose is called.)

@@ -144,7 +226,7 @@ public async Task AbortedThrownCorrectly()
// connection 2 reads again -- abort should be thrown.

// Note: deeply nested using statements to ensure that we dispose of everything even in the case of failure,
// but we manually dispose of both tx1 and connection1.
// but we manually dispose of both tx1 and connection1.
jskeet (Collaborator):

I'll create a separate PR to separate all the whitespace changes from actual changes.

jskeet (Collaborator):

I've merged the change - if you rebase to HEAD now, the whitespace changes should go away.

