
Huge regression: Upgrading HA DynamoDB backend based Vault cluster to 1.11.0 and upwards breaks functional cluster #17737

Closed
obourdon opened this issue Oct 29, 2022 · 26 comments · Fixed by #19721
Labels
bug (Used to indicate a potential bug), core/ha (specific to high-availability), storage/dynamodb

Comments

@obourdon

Describe the bug
A fully functional HA Vault cluster whose backend is DynamoDB can be upgraded from 1.9.4 up to 1.10.7 without any problem.
Upgrading one more time, to 1.11.0, makes the cluster non-functional.

To Reproduce
Steps to reproduce the behavior:

  1. Run curl http://127.0.0.1:8200/v1/sys/leader
{"ha_enabled":true,"is_self":false,"active_time":"0001-01-01T00:00:00Z","leader_address":"","leader_cluster_address":"","performance_standby":false,"performance_standby_last_remote_wal":0}
  2. Run curl -sk --header "X-Vault-Token: XXXX" https://vault.internal.mycluster.io/v1/sys/ha-status | jq -rS .
{
  "errors": [
    "local node not active but active cluster node not found"
  ]
}

Expected behavior
With version 1.10.7 and below we get:

{"ha_enabled":true,"is_self":false,"active_time":"0001-01-01T00:00:00Z","leader_address":"https://vault.internal.mycluster.io","leader_cluster_address":"https://vault.internal.mycluster.io:444","performance_standby":false,"performance_standby_last_remote_wal":0}

and

{
  "auth": null,
  "data": {
    "nodes": [
      {
        "active_node": true,
        "api_address": "https://vault.internal.mycluster.io",
        "cluster_address": "https://vault.internal.mycluster.io:444",
        "hostname": "vault-10-0-14-90",
        "last_echo": null
      },
      {
        "active_node": false,
        "api_address": "https://vault.internal.mycluster.io",
        "cluster_address": "https://vault.internal.mycluster.io:444",
        "hostname": "vault-10-0-32-220",
        "last_echo": "2022-10-29T09:30:43.564133091Z"
      }
    ]
  },
  "lease_duration": 0,
  "lease_id": "",
  "nodes": [
    {
      "active_node": true,
      "api_address": "https://vault.internal.mycluster.io",
      "cluster_address": "https://vault.internal.mycluster.io:444",
      "hostname": "vault-10-0-14-90",
      "last_echo": null
    },
    {
      "active_node": false,
      "api_address": "https://vault.internal.mycluster.io",
      "cluster_address": "https://vault.internal.mycluster.io:444",
      "hostname": "vault-10-0-32-220",
      "last_echo": "2022-10-29T09:30:43.564133091Z"
    }
  ],
  "renewable": false,
  "request_id": "5de2e940-d6de-9b06-a6da-2293c5aa0e79",
  "warnings": null,
  "wrap_info": null
}

Environment:

  • Vault Server Version (retrieve with vault status): 1.11.0
  • Vault CLI Version (retrieve with vault version): 1.11.0
  • Server Operating System/Architecture: Amazon Linux or any Linux based system

Vault server configuration file(s):

listener "tcp" {
  address = "0.0.0.0:8200"
  tls_disable = true
}

# Mandatory for enabling HA enabled DynamoDB storage backend
api_addr = "https://vault.internal.mycluster.io"

storage "dynamodb" {
  ha_enabled = "true"
  table      = "mycluster-vault-data"
  region     = "eu-west-1"
}

seal "awskms" {
  kms_key_id = "XXXXXXXXXXXX"
}

ui = true

Additional context
The functional cluster was originally running Vault 1.9.4 and I first tried to upgrade it directly to 1.12.0, which failed with the behaviour described above.
I therefore decided to roll back to 1.9.4 and to upgrade one version at a time until it broke.
The following versions were tested successfully:

  • 1.9.5, 1.9.6, 1.9.7, 1.9.8, 1.9.9, 1.9.10
  • 1.10.0, 1.10.1, 1.10.2, 1.10.3, 1.10.4, 1.10.5, 1.10.6, 1.10.7

1.11.0 and upwards just break the cluster.

Note also that, as the Vault cluster resides within an AWS infrastructure, it sits behind a properly configured Network Load Balancer, as can be seen in the attached screenshots.

Putting the Vault processes in debug/trace mode did not surface any useful information that could help solve the issue.

I could not see anything in the 1.11.0 changelog that could explain this either.

@obourdon (Author)

(Attached screenshots showing the Network Load Balancer configuration: "Screenshot 2022-10-29 at 11 37 10", "Screenshot 2022-10-29 at 11 37 23")

@obourdon (Author)

obourdon commented Oct 31, 2022

Doing a git diff tag-v1.10.7..tag-v1.11.0 -- vault/ha.go shows the code change which seems to be the source of the issue,

with the corresponding commit 106f548a41c5aa3a7310e1b53f2196f989913eff

Commenting out the return line just makes the cluster functional again

@ncabatoff any insights on this please ?

@obourdon (Author)

obourdon commented Oct 31, 2022

Maybe the test if adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr { is just incomplete: the comment says it checks whether we are the active node, but that does not seem to be really verified. Behind a load balancer like in AWS, the addresses being compared are "statically" defined in Route53, and only the load balancer distributes the requests evenly across all nodes :-(

This explains why the test is always true in AWS behind a LB and therefore why the Vault HA cluster is not functional any more.
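For illustration, here is a minimal, self-contained sketch of the failure mode (this is not the actual Vault source; the struct and function names are made up, only the shape of the quoted condition is reproduced):

```go
package main

import "fmt"

// advertisement stands in for the two fields being compared in vault/ha.go
// (RedirectAddr / ClusterAddr); illustrative only, not Vault code.
type advertisement struct {
	RedirectAddr string
	ClusterAddr  string
}

// looksLikeSelf mirrors the shape of the quoted condition:
// adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr
func looksLikeSelf(adv advertisement, localRedirect, localCluster string) bool {
	return adv.ClusterAddr == localCluster && adv.RedirectAddr == localRedirect
}

func main() {
	// Behind the NLB, every node is configured with the same api_addr, so the
	// active node's advertisement is indistinguishable from the standby's own.
	leaderAdv := advertisement{
		RedirectAddr: "https://vault.internal.mycluster.io",
		ClusterAddr:  "https://vault.internal.mycluster.io:444",
	}
	standbyRedirect := "https://vault.internal.mycluster.io"
	standbyCluster := "https://vault.internal.mycluster.io:444"

	// Prints true: the standby treats the leader advertisement as its own
	// entry and never reports a leader_address.
	fmt.Println(looksLikeSelf(leaderAdv, standbyRedirect, standbyCluster))
}
```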

@obourdon (Author)

obourdon commented Oct 31, 2022

$ dig +short vault.internal.mycluster.io
10.0.15.206
10.0.36.214
10.0.19.203

and therefore

RedirectAddr:https://vault.internal.mycluster.io ClusterAddr:https://vault.internal.mycluster.io:444

appears in the received JSON advertisement structure, and the same "addresses" are present in the Core structure.

@obourdon (Author)

obourdon commented Oct 31, 2022

As a first suggestion, a more complete test which might solve the issue could do a
url.ParseRequestURI on the string. If that fails, the comparison can go on as is; otherwise,
net.LookupIP(u.Host) can be checked for the number of records it contains ...

My very basic 2 cents
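A rough sketch of that idea (purely illustrative, not a proposed patch; the function name is invented): only take the "this advertisement is ourself" shortcut if the compared address does not resolve to multiple records, which would suggest a shared load-balancer name rather than a single node.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// resolvesToManyHosts returns true when addr parses as a URL and its hostname
// resolves to more than one IP, hinting at a load balancer / round-robin DNS
// name shared by several nodes rather than a single node's address.
func resolvesToManyHosts(addr string) bool {
	u, err := url.ParseRequestURI(addr)
	if err != nil {
		// Not a URL: keep relying on the plain string comparison.
		return false
	}
	ips, err := net.LookupIP(u.Hostname())
	if err != nil {
		return false
	}
	return len(ips) > 1
}

func main() {
	// With the Route53 record from the dig output above, this would resolve
	// to three addresses and the shortcut could be skipped.
	fmt.Println(resolvesToManyHosts("https://vault.internal.mycluster.io:444"))
}
```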

@ccapurso added the bug, storage/dynamodb and core/ha labels on Oct 31, 2022
@obourdon (Author)

@ccapurso many thanks for categorising this

@obourdon (Author)

obourdon commented Nov 1, 2022

AFAICT, having DynamoDB as the Vault backend has nothing to do with the issue. I am pretty sure I can reproduce this with the Raft backend. Will try to demonstrate this if time permits.

@ArchiFleKs

Hi @obourdon, we saw the same issue today on our cluster running 1.11.3 with a DynamoDB backend on AWS behind an NLB.

Do you know how to reproduce this bug? We have been running 1.11.x for the past month and only noticed it today.

@obourdon (Author)

obourdon commented Nov 3, 2022

@ArchiFleKs on my side, starting from a working HA cluster (Vault <= 1.10.7), I just replace the vault binary on any one node with any version >= 1.11.0, relaunch the associated Linux service, then run curl http://127.0.0.1:8200/v1/sys/leader on the modified node. I get an answer without any leader info (i.e. empty strings in leader_address and leader_cluster_address of the JSON response).
Note that the other nodes are still functional (at least they are in my setup).

I also suspect that this might change if the node you replace the Vault binary on is the current leader/active node. Any Vault call which is targeted at the modified node will fail (in my case 1 out of 3 times, because the cluster has 3 nodes and the LB dispatches requests evenly).

The situation can be completely reversed, even with an entirely failing cluster, by replacing Vault with any version <= 1.10.7 on all failing nodes (at least it was on my setup, where I now have everything back to normal).

@ArchiFleKs

> @ArchiFleKs on my side, starting from a working HA cluster (Vault <= 1.10.7), I just replace the vault binary on any one node with any version >= 1.11.0, relaunch the associated Linux service, then run curl http://127.0.0.1:8200/v1/sys/leader on the modified node. I get an answer without any leader info (i.e. empty strings in leader_address and leader_cluster_address of the JSON response). Note that the other nodes are still functional (at least they are in my setup).
>
> I also suspect that this might change if the node you replace the Vault binary on is the current leader/active node. Any Vault call which is targeted at the modified node will fail (in my case 1 out of 3 times, because the cluster has 3 nodes and the LB dispatches requests evenly).
>
> The situation can be completely reversed, even with an entirely failing cluster, by replacing Vault with any version <= 1.10.7 on all failing nodes (at least it was on my setup, where I now have everything back to normal).

Thanks for your inputs. We are currently in the process of downgrading; what about 1.10.8, released 2 days ago?

@obourdon (Author)

obourdon commented Nov 4, 2022

@ArchiFleKs it seems like it should work perfectly, just like 1.10.7 (at least it does not include the failing 1.11.0 code).

@obourdon (Author)

obourdon commented Nov 16, 2022

I do not see any activity on this huge regression issue. Can someone please take care of this?

@obourdon changed the title from "Upgrading HA DynamoDB backend based Vault cluster to 1.11.0 and upwards breaks functional cluster" to "Huge regression: Upgrading HA DynamoDB backend based Vault cluster to 1.11.0 and upwards breaks functional cluster" on Nov 16, 2022
@obourdon (Author)

Still no feedback on this ????

@ccapurso (Contributor)

ccapurso commented Dec 6, 2022

Hi, @obourdon. Thank you for your patience.

The DynamoDB storage backend is community-supported and thus in-depth knowledge is a bit limited on the Vault team. You mentioned potentially attempting to reproduce this issue using Integrated Storage (raft). Were you able to accomplish that?

@ncabatoff (Collaborator)

@obourdon I think the reason this change is impacting you is that every node has the same cluster_address, https://vault.internal.mycluster.io:444. Do you know why that might be? I recognize that cluster_addr isn't needed for DynamoDB-based clusters, but I believe it should default to a unique value for each node if you don't configure it, specifically listenerIP:apiPort+1.

> Maybe the test if adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr { is just incomplete: the comment says it checks whether we are the active node, but that does not seem to be really verified

We know we're not the active node because at the top of the function we return if !c.standby. An OSS Vault cluster node is either active or standby, so if it's not standby it must be active.

@ArchiFleKs

I encountered the same issue, but we are using a specific cluster_addr per node, and I cannot reproduce it right now.

@ncabatoff (Collaborator)

> I encountered the same issue, but we are using a specific cluster_addr per node, and I cannot reproduce it right now.

If you're referring to the error "local node not active but active cluster node not found", I would not assume that's the same issue as this one. That's a whole class of HA failures; if you search for it you'll see it comes up in various circumstances. It can be a perfectly legitimate intermittent error even in a mostly healthy cluster; I would only worry about it if it persists or recurs frequently.

@obourdon (Author)

I must admit that I do not get some of the points described above.

AFAIU, the only way to describe the cluster address when behind a load balancer using an AWS ASG is to specify the LB address, as you do not know in advance what your instances' IPs will be, and if one instance fails and a new one is respawned by the ASG, you should not have to change your Vault cluster configuration files.

Looking at the location of the failing code, I indeed do not think that the issue is related to the fact that the backend is based on DynamoDB, and I still do not understand why this change was done in the first place and what it is supposed to fix.

@ArchiFleKs

> I encountered the same issue, but we are using a specific cluster_addr per node, and I cannot reproduce it right now.

Do you mean that you have the same issue even though you have a different cluster_addr in each Vault cluster node config file? Your sentence is quite confusing because of the "cannot reproduce right now" at the end of it.

@ArchiFleKs

@obourdon Yes, I encountered the same issue at the same time you reported this bug, but since then I have kept Vault > 1.10 running in our dev environment and I was not able to reproduce the issue, nor did I encounter it a second time.

And yes, we are using a different cluster_addr which is auto-computed when a new instance boots up in the ASG, so our configuration looks roughly like the sketch below.
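(Illustrative sketch only, not the exact configuration from this comment; the INSTANCE_PRIVATE_IP placeholder is assumed to be substituted at boot, for example by a user-data/cloud-init template.)

```hcl
# Each node keeps the shared LB name as its api_addr but advertises its own
# private IP as cluster_addr, so the cluster address is unique per node.
# INSTANCE_PRIVATE_IP is a hypothetical placeholder filled in at instance boot.
api_addr     = "https://vault.internal.mycluster.io"
cluster_addr = "https://INSTANCE_PRIVATE_IP:8201"
```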

@ArchiFleKs

> If you're referring to the error "local node not active but active cluster node not found", I would not assume that's the same issue as this one. That's a whole class of HA failures; if you search for it you'll see it comes up in various circumstances. It can be a perfectly legitimate intermittent error even in a mostly healthy cluster; I would only worry about it if it persists or recurs frequently.

@ncabatoff you're right, that's what led me to believe it might not have been the same issue despite using the same backend (DynamoDB). And it might have been just a transient issue, because I was not able to reproduce it in the end.

@obourdon (Author)

@ArchiFleKs many thanks for all the details above, and indeed we also currently keep the Vault version strictly under 1.11 to make sure that the source code does not contain the issue.

It would be very nice if Vault could support the same GetPrivateIP Golang templating configuration option as Consul does, but it seems that this idea was rejected :-(

@ncabatoff if you could spend some time explaining in more detail why this code was changed in the first place, as it is leading to a definitely NOT transient but permanent and definitive Vault cluster failure, we would greatly appreciate it.

@obourdon (Author)

obourdon commented Jan 2, 2023

still no activity on this one

@ncabatoff (Collaborator)

> @ArchiFleKs many thanks for all the details above, and indeed we also currently keep the Vault version strictly under 1.11 to make sure that the source code does not contain the issue.
>
> It would be very nice if Vault could support the same GetPrivateIP Golang templating configuration option as Consul does, but it seems that this idea was rejected :-(

We wound up doing it in #9109.
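For readers landing here: assuming the go-sockaddr templating that PR refers to applies to the top-level address fields, a per-node configuration could presumably look something like this (hedged example, not taken from the PR or its documentation):

```hcl
# Assumed go-sockaddr templating for the top-level addresses (per #9109):
# each node derives unique addresses from its own private IP instead of
# pointing every node at the shared load-balancer name.
api_addr     = "http://{{ GetPrivateIP }}:8200"
cluster_addr = "https://{{ GetPrivateIP }}:8201"
```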

> @ncabatoff if you could spend some time explaining in more detail why this code was changed in the first place, as it is leading to a definitely NOT transient but permanent and definitive Vault cluster failure, we would greatly appreciate it.

It was to prevent attempting to connect to ourself, which was causing problems in some tests.

@obourdon (Author)

obourdon commented Jan 19, 2023

@ncabatoff could you be a bit more specific, point us to the failing tests, and explain how to potentially run them locally on our development machines?
I am really wondering whether this fix is necessary, as it induces a huge regression, and maybe the tests mentioned are the ones which should be corrected (it sometimes happens on my side that the tests I wrote are bad/badly designed in the first place).

@obourdon (Author)

obourdon commented Mar 6, 2023

Any update on this?

@ncabatoff (Collaborator)

I missed the release window on this one; please note that the fix won't be in the next releases, but rather in the ones next month.
