Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postgres Storage Backend Connection Failure #12415

Closed
mdgreenfield opened this issue Aug 24, 2021 · 2 comments · Fixed by #12413
Closed

Postgres Storage Backend Connection Failure #12415

mdgreenfield opened this issue Aug 24, 2021 · 2 comments · Fixed by #12413
Labels

Comments

@mdgreenfield
Copy link
Contributor

mdgreenfield commented Aug 24, 2021

Describe the bug

We've encountered an issue when using TLS enabled postgres connections using Azure Database for PostgreSQL w/ golang 1.15 or newer and lib/pq versions prior to v1.9.0. When this issue occurs, connections will fail, but not be purged from the lib/pq pool and will continue to be used by the Vault unsuccessfully until Vault is restarted.

1 error occurred:
	* failed to persist lease entry: write tcp <vaultLeaderNode>:58492-><postgresNode>:5432: write: connection reset by peer

When this error occures we're able to SSH into the vaultLeaderNode and successfully establish connections to the postgresNode:5432 showing that new connections can be established without issue.

$ nc -zvw3 <postgresNode> 5432
Connection to <postgresNode> 5432 port [tcp/postgresql] succeeded!

Expected behavior

The connection pool should recover from failed connections.

Environment:

  • Vault Server Version (retrieve with vault status): 1.7.2

Vault server configuration file(s):

backend "postgresql" {

  connection_url = "host=<%= @postgres_host %> user=<%= @postgres_user %> password=<%= @postgres_password %> dbname=<%= @postgres_dbname %> sslmode=<%= @sslmode -%>"
}

Additional context
It may be prudent to also allow operators to configure the Postgres SetConnMaxLifetime similar to how SetMaxOpenConns and SetMaxIdleConns can be configured.

@doctork9999
Copy link

doctork9999 commented Sep 24, 2021

I ran into the same issue here quite strange that it happens everytime when the Azure Postgres Database is restarted, the broken connections are lingering around and not evicted. I ended up testing the same scenario locally as well as other cloud providers, i.e. AWS and don't see this symptom. The PR seems to explain the part that the new TLS error was introduced and the additional error handling in the lib/pq but why the symptom did not happen everywhere else?

@doctork9999
Copy link

doctork9999 commented Sep 24, 2021

Running some tests and now I can see why it only occurs in Azure, it looks like when the connections are closed in Azure, they are closed in quite an odd way, there was no FIN message to notify client to close the connections.

Screenshot from 2021-09-24 19-25-50

Screenshot from 2021-09-24 19-26-22

I see that this commit still serves its purpose to introduce the permamentError type to the error handling in the tls module. Dig deeper and find that when the connections are closed normally with the FIN message notifying back to the clients over the network sockets, they appear to be closed off normally and when tls.conn attempts to write data to those connections that are already closed, the error it gets is not of the type net.Error, instead, it is io.EOF, so it falls into the else condition of function setErrorLocked and returns back to the lib/pq with error type io.EOF instead of the wrapper permamentError.

golang/go@614a713

In the scenario where FIN message is received and the read function returns with the error type of io.EOF, the lib/pq in the latest version of Vault, which is still at 1.8.0, is able to handle the error OK and the closed connection is marked as bad for eviction.

https://github.com/lib/pq/blob/v1.8.0/error.go#L498-L503

It receives io.EOF on read function because the write function on the closed connection does not fail, nor returning any error and the implementation of the simpleExec and simpleQuery continues to read data from the closed connection which eventually receives io.EOF that is handled by the errRecover.

https://github.com/lib/pq/blob/v1.8.0/conn.go#L609-L637
https://github.com/lib/pq/blob/v1.8.0/conn.go#L639-L698

It is also an interesting note that when the connections are closed off with FIN, the send function in this PR will never fail, as per above details, the extra error handling logic is probably not required (it is still good to have for different scenario).

https://github.com/lib/pq/pull/1013/files

It is odd when I see Azure Postgres keeps closing the connections this way, hope that someone will let them know where the underlying cause of this odd symptom is from exactly and get it fixed but the update to the newer version of lib/pq should be able to cover this flaw from the lower level as well so hope it comes out soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants