Vault does not recover Postgres Connection #12194

Closed
adrianliechti opened this issue Jul 28, 2021 · 15 comments

Comments

@adrianliechti

Hi Vault Team,

We run Vault with an Azure Postgres DB (version 11) as the storage backend and Azure Key Vault for auto-unsealing. Vault itself runs on a Kubernetes cluster as a single container.
After some days (usually in the same rhythm on our prod and staging platforms), Vault stops working.

2021-07-28T08:29:00.866Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.2.42:58132->xxxxxxx:5432: write: connection reset by peer"
2021-07-28T08:29:30.866Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.2.42:58132->xxxxxxx:5432: write: connection reset by peer"
2021-07-28T08:30:00.866Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.2.42:58132->xxxxxxx:5432: write: connection reset by peer"
2021-07-28T08:30:30.866Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.2.42:58132->xxxxxxx:5432: write: connection reset by peer"

Describe the bug

Vault seems to lose the connection to the DB and does not recover by itself.
A restart of Vault usually helps.

To Reproduce

We were able to reproduce the same behavior with these steps:

  • Run Vault as usual
  • Restart the Azure Postgres DB in the Azure Portal
  • Check the logs...

Is this a known issue?
If there are workarounds via connection string tweaks, we are happy to implement and test them.
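
In the meantime, a small stand-alone loop outside Vault could show whether a plain lib/pq pool sees the same resets after such a restart. A rough sketch (the connection string, probe query, and interval are illustrative only):

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // same Postgres driver family Vault's storage backend uses
)

func main() {
	// Illustrative connection string; replace host and credentials with real values.
	db, err := sql.Open("postgres",
		"postgres://vault:secret@example.postgres.database.azure.com:5432/vault?sslmode=require&connect_timeout=10")
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Each iteration reuses an idle pooled connection if one exists; after the
		// database restart this is where a stale socket would surface as
		// "connection reset by peer".
		if _, err := db.Exec("SELECT 1"); err != nil {
			log.Printf("probe failed: %v", err)
		} else {
			log.Println("probe ok")
		}

		time.Sleep(30 * time.Second)
	}
}

If that loop recovers on its own after the restart while Vault keeps logging resets, that would point at how the pool inside Vault's backend handles broken connections rather than at Azure itself.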

@YBDC-ZhangYubo

YBDC-ZhangYubo commented Aug 5, 2021

Update from further investigation on 2021-08-11:
If Vault uses a local Postgres DB, everything is fine.
If Vault uses an Azure Postgres DB, the issue occurs.

@adrianliechti
Author

Found this gist and the referenced Azure post.

We might give that a try:

https://gist.github.com/moraisaugusto/0a5870c8751649708e5a3c6e0f154113

@YBDC-ZhangYubo

> Found this gist and the referenced Azure post.
>
> We might give that a try:
> https://gist.github.com/moraisaugusto/0a5870c8751649708e5a3c6e0f154113

Checked the configuration of our Azure Postgres DB; both statement_timeout and idle_in_transaction_session_timeout are zero.
Our local Postgres DB has those set to zero too, so we cannot find a meaningful difference.

@adrianliechti
Author

It happened again, at close to the same moment on two totally separate platforms.
This really could mean it happens after a specific amount of time.

Would it be possible to enhance the readiness probe in the Helm chart to something like a test write / test read? Currently the deployment thinks everything is alright.

Or to include a config option to crash when the connection is lost?
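
For reference, a rough sketch of what such a test-write probe could look like as a tiny sidecar, assuming VAULT_ADDR and a suitable token are provided via the environment; the secret path and port below are placeholders:

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/hashicorp/vault/api"
)

func main() {
	// VAULT_ADDR, VAULT_TOKEN, etc. are read from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		// A test write has to reach the storage backend (a read might be served
		// from cache); if the Postgres connection is broken this fails and the
		// pod is reported unready. "secret/data/ping" is a placeholder path.
		_, err := client.Logical().Write("secret/data/ping", map[string]interface{}{
			"data": map[string]interface{}{"value": time.Now().UTC().Format(time.RFC3339)},
		})
		if err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}

The readiness probe in the chart could then simply hit /ready on that sidecar.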

@hsimon-hashicorp
Contributor

Thanks @adrianliechti - I'll check with the team to see if we can't get to the bottom of this.

@hsimon-hashicorp
Contributor

Hi @adrianliechti - first, this is a "community-supported" backend, meaning that developers in the community created and maintain it. Also, it would be good to see Vault logs and diagnostics from the time of the PostgresDB failure, including information about open connections to PostgresDB. Have you set max_idle_connections on your PostgresDB? If it's a non-zero number, you may be hitting a cap there. Hope this helps!

@adrianliechti
Author

@hsimon-hashicorp we really appreciate it!

I will double-check the logs & recordings and ask my colleague to gather some more insights at the moment it happens.
As far as I know, it can be reproduced (or at least the same issue appears) by simply stopping/starting the database instance in Azure, so it should be easy to get this info in our dev environment too.

Otherwise we have to wait quite some time (~2 weeks) each time we try something :)

FYI: for the moment we created an old-school watchdog...

package main

import (
	"context"
	"errors"
	"log"
	"os"
	"path/filepath"
	"strings"
	"time"

	"github.com/hashicorp/vault/api"

	core "k8s.io/api/core/v1"
	meta "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
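	// Watchdog loop: once the Vault pods report ready, perform a test write every
	// 10 seconds; after more than three consecutive failures, delete the Vault
	// pods (so Kubernetes recreates them) and back off for three minutes.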
	vaultNamespace := os.Getenv("VAULT_NAMESPACE")
	vaultInstance := os.Getenv("VAULT_INSTANCE")

	if vaultNamespace == "" || vaultInstance == "" {
		panic("invalid config")
	}

	var config api.Config

	if err := config.ReadEnvironment(); err != nil {
		panic(err)
	}

	client, err := api.NewClient(&config)

	if err != nil {
		panic(err)
	}

	kubectl, err := newClient()

	if err != nil {
		panic(err)
	}

	threshold := 0
	selector := appSelector("vault", vaultInstance)

	time.Sleep(20 * time.Second)

	for {
		time.Sleep(10 * time.Second)

		ready, err := podsReady(kubectl, vaultNamespace, selector)

		if err != nil {
			log.Println("failed to query pod status")
			log.Println(err)

			continue
		}

		if !ready {
			log.Println("pods not ready yet, waiting another 10s")

			continue
		}

		data := map[string]interface{}{
			"data": map[string]interface{}{
				"value": time.Now().UTC().Format(time.RFC3339),
			},
		}

		_, err = client.Logical().Write("devcloud/data/ping", data)

		if err == nil {
			threshold = 0
			continue
		}

		threshold++

		if threshold <= 3 {
			log.Println(err)
			continue
		}

		log.Println("failed to write data, restarting vault")

		if err := kubectl.CoreV1().Pods(vaultNamespace).DeleteCollection(context.Background(), meta.DeleteOptions{}, meta.ListOptions{
			LabelSelector: selector,
		}); err != nil {
			log.Println("failed to delete vault pod")
			log.Println(err)

			continue
		}

		threshold = 0

		log.Println("vault pod deleted, waiting 3 minute")

		time.Sleep(3 * time.Minute)
	}
}

func newClient() (*kubernetes.Clientset, error) {
	if client, err := newClientFromInCluster(); err == nil {
		return client, nil
	}

	if client, err := newClientFromOutOfCluster(); err == nil {
		return client, nil
	}

	return nil, errors.New("unable to initialize kubectl")
}

func newClientFromInCluster() (*kubernetes.Clientset, error) {
	config, err := rest.InClusterConfig()

	if err != nil {
		return nil, err
	}

	clientset, err := kubernetes.NewForConfig(config)

	if err != nil {
		return nil, err
	}

	return clientset, nil
}

func newClientFromOutOfCluster() (*kubernetes.Clientset, error) {
	home, err := os.UserHomeDir()

	if err != nil {
		return nil, err
	}

	kubeconfig := filepath.Join(home, ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)

	if err != nil {
		return nil, err
	}

	clientset, err := kubernetes.NewForConfig(config)

	if err != nil {
		return nil, err
	}

	return clientset, nil
}

func appSelector(name, instance string) string {
	selector := make([]string, 0)

	if name != "" {
		selector = append(selector, "app.kubernetes.io/name="+name)
	}

	if instance != "" {
		selector = append(selector, "app.kubernetes.io/instance="+instance)
	}

	return strings.Join(selector, ",")
}

func podsReady(kubectl *kubernetes.Clientset, namespace, selector string) (bool, error) {
	collection, err := kubectl.CoreV1().Pods(namespace).List(context.Background(), meta.ListOptions{
		LabelSelector: selector,
	})

	if err != nil {
		return false, err
	}

	for _, pod := range collection.Items {
		if pod.Status.Phase != core.PodRunning {
			return false, nil
		}

		for _, c := range pod.Status.Conditions {
			if c.Type == core.ContainersReady && c.Status != core.ConditionTrue {
				return false, nil
			}
		}

		for _, cs := range pod.Status.ContainerStatuses {
			if !cs.Ready {
				return false, nil
			}
		}
	}

	return true, nil
}

@hsimon-hashicorp
Contributor

Wow! Thanks for sharing that, and awesome work in the meantime. I'll check back periodically to see if you've been able to get that data. vault debug in your dev environment might provide some clues too: you can run it against a running instance, but it'll throw some warnings about not being able to bind to the port (because it's already running). Hopefully together we can figure out a cause and a solution!

@hsimon-hashicorp
Contributor

Hi @adrianliechti - I just wanted to check in and see if you'd been able to run a vault debug, and if we can get that information. Please let me know how I can be of further assistance. Thanks!

@adrianliechti
Author

Hi @hsimon-hashicorp

Thank you so much for coming back to this!
We had another failure this Monday, still with the same config.

We plan to update to the latest Vault version this evening.

==> Vault server configuration:

             Api Address: http://10.240.0.114:8200
                     Cgo: disabled
         Cluster Address: https://vault-0.vault-internal:8201
              Go Version: go1.16.7
              Listener 1: tcp (addr: "[::]:8200", cluster address: "[::]:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
               Log Level: info
                   Mlock: supported: true, enabled: false
           Recovery Mode: false
                 Storage: postgresql (HA disabled)
                 Version: Vault v1.8.3
             Version Sha: 73e85c3c21dfd1e835ded0053f08e3bd73a24ad6

==> Vault server started! Log data will stream in below:

2021-10-06T16:41:55.700Z [INFO]  proxy environment: http_proxy="" https_proxy="" no_proxy=""
2021-10-06T16:41:56.513Z [INFO]  core: stored unseal keys supported, attempting fetch
2021-10-06T16:41:56.607Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=[::]:8201
2021-10-06T16:41:56.608Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2021-10-06T16:41:56.613Z [INFO]  core: post-unseal setup starting
2021-10-06T16:41:56.635Z [INFO]  core: loaded wrapping token key
2021-10-06T16:41:56.635Z [INFO]  core: successfully setup plugin catalog: plugin-directory=""
2021-10-06T16:41:56.649Z [INFO]  core: successfully mounted backend: type=system path=sys/
2021-10-06T16:41:56.649Z [INFO]  core: successfully mounted backend: type=identity path=identity/
2021-10-06T16:41:56.656Z [INFO]  core: successfully mounted backend: type=kv path=devcloud/
2021-10-06T16:41:56.656Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2021-10-06T16:41:56.705Z [INFO]  core: successfully enabled credential backend: type=token path=token/
2021-10-06T16:41:56.723Z [INFO]  rollback: starting rollback manager
2021-10-06T16:41:56.723Z [INFO]  core: restoring leases
2021-10-06T16:41:56.741Z [INFO]  identity: entities restored
2021-10-06T16:41:56.748Z [INFO]  identity: groups restored
2021-10-06T16:41:56.813Z [INFO]  core: usage gauge collection is disabled
2021-10-06T16:41:56.822Z [INFO]  core: post-unseal setup complete
2021-10-06T16:41:56.822Z [INFO]  core: vault is unsealed
2021-10-06T16:41:56.822Z [INFO]  core: unsealed with stored key
2021-10-06T16:41:57.004Z [INFO]  expiration: lease restore complete
2021-10-10T15:01:26.823Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.0.114:56270->104.40.169.187:5432: write: connection reset by peer"
2021-10-10T15:01:56.630Z [ERROR] core: error in barrier auto rotation: error="failed to persist keyring: write tcp 10.240.0.114:56260->104.40.169.187:5432: write: connection reset by peer"
2021-10-10T15:01:56.823Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.0.114:56270->104.40.169.187:5432: write: connection reset by peer"
2021-10-10T15:02:26.823Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.0.114:56260->104.40.169.187:5432: write: connection reset by peer"
2021-10-10T15:02:56.823Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.0.114:56270->104.40.169.187:5432: write: connection reset by peer"
2021-10-10T15:03:26.823Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.0.114:56260->104.40.169.187:5432: write: connection reset by peer"

Unfortunately we had a super busy sprint and limited resources for a deep dive into the issue.

But here are some screenshots of the database metrics

Active connections are not exhausted. The incident happened after the second drop from 6 to 4, and the next short step from 6 to 7.

[Screenshot 2021-10-13 at 11:08:23 – active connections]

Failed connections are at 0.

[Screenshot 2021-10-13 at 11:08:57 – failed connections]

And the CPU is very bored.

[Screenshot 2021-10-13 at 11:09:32 – CPU usage]

To me it looks like the pg driver might fail to detect closed connections, or it is not removing them from the pool and keeps reusing them.
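
For what it's worth, in plain database/sql the usual mitigation for silently dropped idle connections is to cap the connection lifetime so the pool retires them before a gateway can reset them. A minimal sketch of that mechanism (Vault's postgresql backend manages its own pool internally, so this is only an illustration, not Vault configuration; connection string and durations are made up):

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	// Illustrative connection string only.
	db, err := sql.Open("postgres",
		"postgres://vault:secret@example.postgres.database.azure.com:5432/vault?sslmode=require")
	if err != nil {
		log.Fatal(err)
	}

	// Retire connections well before any server/gateway idle timeout so the
	// pool never hands out a socket that was already reset on the far end.
	db.SetConnMaxLifetime(5 * time.Minute)
	db.SetConnMaxIdleTime(1 * time.Minute) // available since Go 1.15
	db.SetMaxIdleConns(2)

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("connected")
}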

@calvn
Member

calvn commented Oct 13, 2021

Can you try with Vault 1.8.4? We recently upgraded lib/pq in #12413, which aimed to address this.

@adrianliechti
Author

Yes, we will. I think the team was waiting for a new Helm chart, but we will override the image version directly and use it that way.

Thanks a ton for all your interest!

@hsimon-hashicorp
Contributor

Hi @adrianliechti - wanted to circle back on this and see if you're still having issues. Thanks!

@matthiasfehr

I worked on this issue together with @adrianliechti, and I can confirm that we haven't experienced any issues since we installed version 1.8.4 a few weeks ago. Thanks a lot for your superb support!

@hsimon-hashicorp
Contributor

@matthiasfehr Thanks so much for checking again and getting back to us! :)
