Vault does not recover Postgres Connection #12194

Closed
adrianliechti opened this issue Jul 28, 2021 · 15 comments

Comments

@adrianliechti

Hi Vault Team,

We run Vault with an Azure Postgres DB (version 11) as the storage backend and Azure Key Vault for auto-unsealing. Vault itself runs on a Kubernetes cluster as a single container.
After some days (usually in the same rhythm on our prod and staging platforms), Vault stops working.

2021-07-28T08:29:00.866Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.2.42:58132->xxxxxxx:5432: write: connection reset by peer"
2021-07-28T08:29:30.866Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.2.42:58132->xxxxxxx:5432: write: connection reset by peer"
2021-07-28T08:30:00.866Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.2.42:58132->xxxxxxx:5432: write: connection reset by peer"
2021-07-28T08:30:30.866Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.2.42:58132->xxxxxxx:5432: write: connection reset by peer"

Describe the bug

Vault seems to lose the connection to the DB and does not recover by itself.
A restart of Vault usually helps.

To Reproduce

We were able to reproduce the same behavior with these steps:

  • Run Vault as usual
  • Restart the Azure Postgres DB in the Azure Portal
  • Check the logs...

Is this a known issue?
If there are workarounds via connection string tweaks, we are happy to implement and test them.
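
In the meantime, a small stand-alone loop outside Vault could show whether a plain lib/pq pool sees the same resets after such a restart. A rough sketch (the connection string, probe query, and interval are illustrative only):

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // same Postgres driver family Vault's storage backend uses
)

func main() {
	// Illustrative connection string; replace host and credentials with real values.
	db, err := sql.Open("postgres",
		"postgres://vault:secret@example.postgres.database.azure.com:5432/vault?sslmode=require&connect_timeout=10")
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Each iteration reuses an idle pooled connection if one exists; after the
		// database restart this is where a stale socket would surface as
		// "connection reset by peer".
		if _, err := db.Exec("SELECT 1"); err != nil {
			log.Printf("probe failed: %v", err)
		} else {
			log.Println("probe ok")
		}

		time.Sleep(30 * time.Second)
	}
}

If that loop recovers on its own after the restart while Vault keeps logging resets, that would point at how the pool inside Vault's backend handles broken connections rather than at Azure itself.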

@YBDC-ZhangYubo

YBDC-ZhangYubo commented Aug 5, 2021

Update from further investigation on 2021-08-11:
If Vault uses a local Postgres DB, everything is fine.
If Vault uses an Azure Postgres DB, the issue occurs.

@adrianliechti
Author

Found this gist and the referenced Azure post.

We might give that a try:

https://gist.github.com/moraisaugusto/0a5870c8751649708e5a3c6e0f154113

@YBDC-ZhangYubo

> Found this gist and the referenced Azure post.
>
> We might give that a try:
> https://gist.github.com/moraisaugusto/0a5870c8751649708e5a3c6e0f154113

Checked the configuration of our Azure Postgres DB; both statement_timeout and idle_in_transaction_session_timeout are zero.
Our local Postgres DB has those set to zero too, so we cannot find a meaningful difference.

@adrianliechti
Author

It happened again, at close to the same moment on two totally separate platforms.
This really could mean it happens after a specific amount of time.

Would it be possible to enhance the readiness probe in the Helm chart to something like a test write / test read? Currently the deployment thinks everything is alright.

Or to include a config option to crash when the connection is lost?
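
For reference, a rough sketch of what such a test-write probe could look like as a tiny sidecar, assuming VAULT_ADDR and a suitable token are provided via the environment; the secret path and port below are placeholders:

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/hashicorp/vault/api"
)

func main() {
	// VAULT_ADDR, VAULT_TOKEN, etc. are read from the environment.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		// A test write has to reach the storage backend (a read might be served
		// from cache); if the Postgres connection is broken this fails and the
		// pod is reported unready. "secret/data/ping" is a placeholder path.
		_, err := client.Logical().Write("secret/data/ping", map[string]interface{}{
			"data": map[string]interface{}{"value": time.Now().UTC().Format(time.RFC3339)},
		})
		if err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}

The readiness probe in the chart could then simply hit /ready on that sidecar.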

@hsimon-hashicorp
Contributor

Thanks @adrianliechti - I'll check with the team to see if we can't get to the bottom of this.

@hsimon-hashicorp
Contributor

Hi @adrianliechti - first, this is a "community-supported" backend, meaning that developers in the community created and maintain it. Also, it would be good to see Vault logs and diagnostics from the time of the PostgresDB failure, including information about open connections to PostgresDB. Have you set max_idle_connections on your PostgresDB? If it's a non-zero number, you may be hitting a cap there. Hope this helps!

@adrianliechti
Author

@hsimon-hashicorp we really appreciate it!

I will double-check the logs & recordings and ask my colleague to gather some more insights at the moment it happens.
As far as I know, it can be reproduced (or at least the same issue appears) by simply stopping/starting the database instance in Azure, so it should be easy to get this info in our dev environment too.

Otherwise we have to wait quite some time (~2 weeks) each time we try something :)

FYI: for the moment we created an old-school watchdog...

package main

import (
	"context"
	"errors"
	"log"
	"os"
	"path/filepath"
	"strings"
	"time"

	"github.com/hashicorp/vault/api"

	core "k8s.io/api/core/v1"
	meta "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
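	// Watchdog loop: once the Vault pods report ready, perform a test write every
	// 10 seconds; after more than three consecutive failures, delete the Vault
	// pods (so Kubernetes recreates them) and back off for three minutes.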
	vaultNamespace := os.Getenv("VAULT_NAMESPACE")
	vaultInstance := os.Getenv("VAULT_INSTANCE")

	if vaultNamespace == "" || vaultInstance == "" {
		panic("invalid config")
	}

	var config api.Config

	if err := config.ReadEnvironment(); err != nil {
		panic(err)
	}

	client, err := api.NewClient(&config)

	if err != nil {
		panic(err)
	}

	kubectl, err := newClient()

	if err != nil {
		panic(err)
	}

	threshold := 0
	selector := appSelector("vault", vaultInstance)

	time.Sleep(20 * time.Second)

	for {
		time.Sleep(10 * time.Second)

		ready, err := podsReady(kubectl, vaultNamespace, selector)

		if err != nil {
			log.Println("failed to query pod status")
			log.Println(err)

			continue
		}

		if !ready {
			log.Println("pods not ready yet, waiting another 10s")

			continue
		}

		data := map[string]interface{}{
			"data": map[string]interface{}{
				"value": time.Now().UTC().Format(time.RFC3339),
			},
		}

		_, err = client.Logical().Write("devcloud/data/ping", data)

		if err == nil {
			threshold = 0
			continue
		}

		threshold++

		if threshold <= 3 {
			log.Println(err)
			continue
		}

		log.Println("failed to write data, restarting vault")

		if err := kubectl.CoreV1().Pods(vaultNamespace).DeleteCollection(context.Background(), meta.DeleteOptions{}, meta.ListOptions{
			LabelSelector: selector,
		}); err != nil {
			log.Println("failed to delete vault pod")
			log.Println(err)

			continue
		}

		threshold = 0

		log.Println("vault pod deleted, waiting 3 minute")

		time.Sleep(3 * time.Minute)
	}
}

func newClient() (*kubernetes.Clientset, error) {
	if client, err := newClientFromInCluster(); err == nil {
		return client, nil
	}

	if client, err := newClientFromOutOfCluster(); err == nil {
		return client, nil
	}

	return nil, errors.New("unable to initialize kubectl")
}

func newClientFromInCluster() (*kubernetes.Clientset, error) {
	config, err := rest.InClusterConfig()

	if err != nil {
		return nil, err
	}

	clientset, err := kubernetes.NewForConfig(config)

	if err != nil {
		return nil, err
	}

	return clientset, nil
}

func newClientFromOutOfCluster() (*kubernetes.Clientset, error) {
	home, err := os.UserHomeDir()

	if err != nil {
		return nil, err
	}

	kubeconfig := filepath.Join(home, ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)

	if err != nil {
		return nil, err
	}

	clientset, err := kubernetes.NewForConfig(config)

	if err != nil {
		return nil, err
	}

	return clientset, nil
}

func appSelector(name, instance string) string {
	selector := make([]string, 0)

	if name != "" {
		selector = append(selector, "app.kubernetes.io/name="+name)
	}

	if instance != "" {
		selector = append(selector, "app.kubernetes.io/instance="+instance)
	}

	return strings.Join(selector, ",")
}

func podsReady(kubectl *kubernetes.Clientset, namespace, selector string) (bool, error) {
	collection, err := kubectl.CoreV1().Pods(namespace).List(context.Background(), meta.ListOptions{
		LabelSelector: selector,
	})

	if err != nil {
		return false, err
	}

	for _, pod := range collection.Items {
		if pod.Status.Phase != core.PodRunning {
			return false, nil
		}

		for _, c := range pod.Status.Conditions {
			if c.Type == core.ContainersReady && c.Status != core.ConditionTrue {
				return false, nil
			}
		}

		for _, cs := range pod.Status.ContainerStatuses {
			if !cs.Ready {
				return false, nil
			}
		}
	}

	return true, nil
}

@hsimon-hashicorp
Contributor

Wow! Thanks for sharing that, and awesome work in the meantime. I'll check back periodically to see if you've been able to get that data. vault debug in your dev environment might provide some clues too: you can run it against a running instance, but it'll throw some warnings about not being able to bind to the port (because it's already running). Hopefully together we can figure out a cause and a solution!

@hsimon-hashicorp
Contributor

Hi @adrianliechti - I just wanted to check in and see if you'd been able to run a vault debug, and if we can get that information. Please let me know how I can be of further assistance. Thanks!

@adrianliechti
Author

Hi @hsimon-hashicorp

Thank you so much for coming back to this!
We had another failure this Monday, still with the same config.

We plan to update to the latest Vault version this evening.

==> Vault server configuration:

             Api Address: http://10.240.0.114:8200
                     Cgo: disabled
         Cluster Address: https://vault-0.vault-internal:8201
              Go Version: go1.16.7
              Listener 1: tcp (addr: "[::]:8200", cluster address: "[::]:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
               Log Level: info
                   Mlock: supported: true, enabled: false
           Recovery Mode: false
                 Storage: postgresql (HA disabled)
                 Version: Vault v1.8.3
             Version Sha: 73e85c3c21dfd1e835ded0053f08e3bd73a24ad6

==> Vault server started! Log data will stream in below:

2021-10-06T16:41:55.700Z [INFO]  proxy environment: http_proxy="" https_proxy="" no_proxy=""
2021-10-06T16:41:56.513Z [INFO]  core: stored unseal keys supported, attempting fetch
2021-10-06T16:41:56.607Z [INFO]  core.cluster-listener.tcp: starting listener: listener_address=[::]:8201
2021-10-06T16:41:56.608Z [INFO]  core.cluster-listener: serving cluster requests: cluster_listen_address=[::]:8201
2021-10-06T16:41:56.613Z [INFO]  core: post-unseal setup starting
2021-10-06T16:41:56.635Z [INFO]  core: loaded wrapping token key
2021-10-06T16:41:56.635Z [INFO]  core: successfully setup plugin catalog: plugin-directory=""
2021-10-06T16:41:56.649Z [INFO]  core: successfully mounted backend: type=system path=sys/
2021-10-06T16:41:56.649Z [INFO]  core: successfully mounted backend: type=identity path=identity/
2021-10-06T16:41:56.656Z [INFO]  core: successfully mounted backend: type=kv path=devcloud/
2021-10-06T16:41:56.656Z [INFO]  core: successfully mounted backend: type=cubbyhole path=cubbyhole/
2021-10-06T16:41:56.705Z [INFO]  core: successfully enabled credential backend: type=token path=token/
2021-10-06T16:41:56.723Z [INFO]  rollback: starting rollback manager
2021-10-06T16:41:56.723Z [INFO]  core: restoring leases
2021-10-06T16:41:56.741Z [INFO]  identity: entities restored
2021-10-06T16:41:56.748Z [INFO]  identity: groups restored
2021-10-06T16:41:56.813Z [INFO]  core: usage gauge collection is disabled
2021-10-06T16:41:56.822Z [INFO]  core: post-unseal setup complete
2021-10-06T16:41:56.822Z [INFO]  core: vault is unsealed
2021-10-06T16:41:56.822Z [INFO]  core: unsealed with stored key
2021-10-06T16:41:57.004Z [INFO]  expiration: lease restore complete
2021-10-10T15:01:26.823Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.0.114:56270->104.40.169.187:5432: write: connection reset by peer"
2021-10-10T15:01:56.630Z [ERROR] core: error in barrier auto rotation: error="failed to persist keyring: write tcp 10.240.0.114:56260->104.40.169.187:5432: write: connection reset by peer"
2021-10-10T15:01:56.823Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.0.114:56270->104.40.169.187:5432: write: connection reset by peer"
2021-10-10T15:02:26.823Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.0.114:56260->104.40.169.187:5432: write: connection reset by peer"
2021-10-10T15:02:56.823Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.0.114:56270->104.40.169.187:5432: write: connection reset by peer"
2021-10-10T15:03:26.823Z [ERROR] core: writing request counters to barrier: err="failed to save request counters: write tcp 10.240.0.114:56260->104.40.169.187:5432: write: connection reset by peer"

Unfortunately we had a super busy sprint and limited resources for a deep dive into the issue.

But here are some screenshots of the database metrics

Active connections are not exhausted. The incident happened after the second drop from 6 to 4, and the next short step from 6 to 7.

[Screenshot 2021-10-13 at 11:08:23 – active connections]

Failed connections are at 0.

[Screenshot 2021-10-13 at 11:08:57 – failed connections]

And the CPU is very bored.

[Screenshot 2021-10-13 at 11:09:32 – CPU usage]

To me it looks like the pg driver might fail to detect closed connections, or it is not removing them from the pool and keeps reusing them.
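
For what it's worth, in plain database/sql the usual mitigation for silently dropped idle connections is to cap the connection lifetime so the pool retires them before a gateway can reset them. A minimal sketch of that mechanism (Vault's postgresql backend manages its own pool internally, so this is only an illustration, not Vault configuration; connection string and durations are made up):

package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	// Illustrative connection string only.
	db, err := sql.Open("postgres",
		"postgres://vault:secret@example.postgres.database.azure.com:5432/vault?sslmode=require")
	if err != nil {
		log.Fatal(err)
	}

	// Retire connections well before any server/gateway idle timeout so the
	// pool never hands out a socket that was already reset on the far end.
	db.SetConnMaxLifetime(5 * time.Minute)
	db.SetConnMaxIdleTime(1 * time.Minute) // available since Go 1.15
	db.SetMaxIdleConns(2)

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("connected")
}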

@calvn
Member

calvn commented Oct 13, 2021

Can you try with Vault 1.8.4? We recently upgraded lib/pq in #12413, which aimed to address this.

@adrianliechti
Author

Yes, we will. I think the team was waiting for a new Helm chart, but we will override the image version directly and use it that way.

Thanks a ton for all your interest!

@hsimon-hashicorp
Contributor

Hi @adrianliechti - wanted to circle back on this and see if you're still having issues. Thanks!

@matthiasfehr

I worked on this issue together with @adrianliechti, and I can confirm that we haven't experienced any issues since we installed version 1.8.4 a few weeks ago. Thanks a lot for your superb support!

@hsimon-hashicorp
Contributor

@matthiasfehr Thanks so much for checking again and getting back to us! :)
