
Huge regression: Upgrading HA DynamoDB backend based Vault cluster to 1.11.0 and upwards breaks functional cluster #17737

Closed
obourdon opened this issue Oct 29, 2022 · 26 comments · Fixed by #19721
Labels
bug (Used to indicate a potential bug), core/ha (specific to high-availability), storage/dynamodb

Comments

@obourdon

Describe the bug
A fully functional HA Vault cluster whose backend is DynamoDB can be upgraded from 1.9.4 up to 1.10.7 without any problem.
Upgrading one more time, to 1.11.0, makes the cluster non-functional.

To Reproduce
Steps to reproduce the behavior:

  1. Run curl http://127.0.0.1:8200/v1/sys/leader
{"ha_enabled":true,"is_self":false,"active_time":"0001-01-01T00:00:00Z","leader_address":"","leader_cluster_address":"","performance_standby":false,"performance_standby_last_remote_wal":0}
  2. Run curl -sk --header "X-Vault-Token: XXXX" https://vault.internal.mycluster.io/v1/sys/ha-status | jq -rS .
{
  "errors": [
    "local node not active but active cluster node not found"
  ]
}

Expected behavior
With version 1.10.7 and below we get:

{"ha_enabled":true,"is_self":false,"active_time":"0001-01-01T00:00:00Z","leader_address":"https://vault.internal.mycluster.io","leader_cluster_address":"https://vault.internal.mycluster.io:444","performance_standby":false,"performance_standby_last_remote_wal":0}

and

{
  "auth": null,
  "data": {
    "nodes": [
      {
        "active_node": true,
        "api_address": "https://vault.internal.mycluster.io",
        "cluster_address": "https://vault.internal.mycluster.io:444",
        "hostname": "vault-10-0-14-90",
        "last_echo": null
      },
      {
        "active_node": false,
        "api_address": "https://vault.internal.mycluster.io",
        "cluster_address": "https://vault.internal.mycluster.io:444",
        "hostname": "vault-10-0-32-220",
        "last_echo": "2022-10-29T09:30:43.564133091Z"
      }
    ]
  },
  "lease_duration": 0,
  "lease_id": "",
  "nodes": [
    {
      "active_node": true,
      "api_address": "https://vault.internal.mycluster.io",
      "cluster_address": "https://vault.internal.mycluster.io:444",
      "hostname": "vault-10-0-14-90",
      "last_echo": null
    },
    {
      "active_node": false,
      "api_address": "https://vault.internal.mycluster.io",
      "cluster_address": "https://vault.internal.mycluster.io:444",
      "hostname": "vault-10-0-32-220",
      "last_echo": "2022-10-29T09:30:43.564133091Z"
    }
  ],
  "renewable": false,
  "request_id": "5de2e940-d6de-9b06-a6da-2293c5aa0e79",
  "warnings": null,
  "wrap_info": null
}

Environment:

  • Vault Server Version (retrieve with vault status): 1.11.0
  • Vault CLI Version (retrieve with vault version): 1.11.0
  • Server Operating System/Architecture: Amazon Linux or any Linux based system

Vault server configuration file(s):

listener "tcp" {
  address = "0.0.0.0:8200"
  tls_disable = true
}

# Mandatory for enabling HA enabled DynamoDB storage backend
api_addr = "https://vault.internal.mycluster.io"

storage "dynamodb" {
  ha_enabled = "true"
  table      = "mycluster-vault-data"
  region     = "eu-west-1"
}

seal "awskms" {
  kms_key_id = "XXXXXXXXXXXX"
}

ui = true

Additional context
The functional cluster was originally running Vault 1.9.4 and I first tried to upgrade it directly to 1.12.0, which failed with the behaviour described above.
I therefore decided to roll back to 1.9.4 and to upgrade one version at a time until it broke.
The following versions were tested successfully:

  • 1.9.5, 1.9.6, 1.9.7, 1.9.8, 1.9.9, 1.9.10
  • 1.10.0, 1.10.1, 1.10.2, 1.10.3, 1.10.4, 1.10.5, 1.10.6, 1.10.7

1.11.0 and upwards just break the cluster.

Note also that, as the Vault cluster resides within an AWS infrastructure, it sits behind a properly configured Network Load Balancer, as can be seen in the attached screenshots.

Putting the Vault processes in debug/trace mode did not surface any useful information that could help solve the issue.

I could not see anything in the 1.11.0 changelog that could explain this either.

@obourdon (Author)

(Attached screenshots showing the Network Load Balancer configuration: "Screenshot 2022-10-29 at 11 37 10", "Screenshot 2022-10-29 at 11 37 23")

@obourdon (Author)

obourdon commented Oct 31, 2022

Doing a git diff tag-v1.10.7..tag-v1.11.0 -- vault/ha.go shows the code change which seems to be the source of the issue,

with the corresponding commit 106f548a41c5aa3a7310e1b53f2196f989913eff

Commenting out the return line just makes the cluster functional again

@ncabatoff any insights on this please ?

@obourdon (Author)

obourdon commented Oct 31, 2022

Maybe the test if adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr { is just incomplete: the comment says it checks whether we are the active node, but that does not seem to be really verified. Behind a load balancer like in AWS, the addresses being compared are "statically" defined in Route53, and only the load balancer distributes the requests evenly across all nodes :-(

This explains why the test is always true in AWS behind a LB and therefore why the Vault HA cluster is not functional any more.
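For illustration, here is a minimal, self-contained sketch of the failure mode (this is not the actual Vault source; the struct and function names are made up, only the shape of the quoted condition is reproduced):

```go
package main

import "fmt"

// advertisement stands in for the two fields being compared in vault/ha.go
// (RedirectAddr / ClusterAddr); illustrative only, not Vault code.
type advertisement struct {
	RedirectAddr string
	ClusterAddr  string
}

// looksLikeSelf mirrors the shape of the quoted condition:
// adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr
func looksLikeSelf(adv advertisement, localRedirect, localCluster string) bool {
	return adv.ClusterAddr == localCluster && adv.RedirectAddr == localRedirect
}

func main() {
	// Behind the NLB, every node is configured with the same api_addr, so the
	// active node's advertisement is indistinguishable from the standby's own.
	leaderAdv := advertisement{
		RedirectAddr: "https://vault.internal.mycluster.io",
		ClusterAddr:  "https://vault.internal.mycluster.io:444",
	}
	standbyRedirect := "https://vault.internal.mycluster.io"
	standbyCluster := "https://vault.internal.mycluster.io:444"

	// Prints true: the standby treats the leader advertisement as its own
	// entry and never reports a leader_address.
	fmt.Println(looksLikeSelf(leaderAdv, standbyRedirect, standbyCluster))
}
```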

@obourdon (Author)

obourdon commented Oct 31, 2022

$ dig +short vault.internal.mycluster.io
10.0.15.206
10.0.36.214
10.0.19.203

and therefore

RedirectAddr:https://vault.internal.mycluster.io ClusterAddr:https://vault.internal.mycluster.io:444

appears in the received JSON advertisement structure, and the same "addresses" are present in the Core structure.

@obourdon (Author)

obourdon commented Oct 31, 2022

As a first suggestion, a more complete test which might solve the issue could do a
url.ParseRequestURI on the string. If that fails, the comparison can go on as is; otherwise,
net.LookupIP(u.Host) can be checked for the number of records it contains ...

My very basic 2 cents
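A rough sketch of that idea (purely illustrative, not a proposed patch; the function name is invented): only take the "this advertisement is ourself" shortcut if the compared address does not resolve to multiple records, which would suggest a shared load-balancer name rather than a single node.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// resolvesToManyHosts returns true when addr parses as a URL and its hostname
// resolves to more than one IP, hinting at a load balancer / round-robin DNS
// name shared by several nodes rather than a single node's address.
func resolvesToManyHosts(addr string) bool {
	u, err := url.ParseRequestURI(addr)
	if err != nil {
		// Not a URL: keep relying on the plain string comparison.
		return false
	}
	ips, err := net.LookupIP(u.Hostname())
	if err != nil {
		return false
	}
	return len(ips) > 1
}

func main() {
	// With the Route53 record from the dig output above, this would resolve
	// to three addresses and the shortcut could be skipped.
	fmt.Println(resolvesToManyHosts("https://vault.internal.mycluster.io:444"))
}
```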

@ccapurso added the bug, storage/dynamodb and core/ha labels on Oct 31, 2022
@obourdon (Author)

@ccapurso many thanks for categorising this

@obourdon (Author)

obourdon commented Nov 1, 2022

AFAICT, having DynamoDB as the Vault backend has nothing to do with the issue. I am pretty sure I can reproduce this with the Raft backend. Will try to demonstrate this if time permits.

@ArchiFleKs

Hi @obourdon, we saw the same issue today on our cluster running 1.11.3 with a DynamoDB backend on AWS behind an NLB.

Do you know how to reproduce this bug? We have been running 1.11.x for the past month and only noticed it today.

@obourdon (Author)

obourdon commented Nov 3, 2022

@ArchiFleKs on my side, starting from a working HA cluster (Vault <= 1.10.7), I just replace the vault binary on any one node with any version >= 1.11.0, relaunch the associated Linux service, then run curl http://127.0.0.1:8200/v1/sys/leader on the modified node. I get an answer without any leader info (i.e. empty strings in leader_address and leader_cluster_address of the JSON response).
Note that the other nodes are still functional (at least they are in my setup).

I also suspect that this might change if the node you replace the Vault binary on is the current leader/active node. Any Vault call which is targeted at the modified node will fail (in my case 1 out of 3 times, because the cluster has 3 nodes and the LB dispatches requests evenly).

The situation can be completely reversed, even with an entirely failing cluster, by replacing Vault with any version <= 1.10.7 on all failing nodes (at least it was on my setup, where I now have everything back to normal).

@ArchiFleKs

> @ArchiFleKs on my side, starting from a working HA cluster (Vault <= 1.10.7), I just replace the vault binary on any one node with any version >= 1.11.0, relaunch the associated Linux service, then run curl http://127.0.0.1:8200/v1/sys/leader on the modified node. I get an answer without any leader info (i.e. empty strings in leader_address and leader_cluster_address of the JSON response). Note that the other nodes are still functional (at least they are in my setup).
>
> I also suspect that this might change if the node you replace the Vault binary on is the current leader/active node. Any Vault call which is targeted at the modified node will fail (in my case 1 out of 3 times, because the cluster has 3 nodes and the LB dispatches requests evenly).
>
> The situation can be completely reversed, even with an entirely failing cluster, by replacing Vault with any version <= 1.10.7 on all failing nodes (at least it was on my setup, where I now have everything back to normal).

Thanks for your inputs. We are currently in the process of downgrading; what about 1.10.8, released 2 days ago?

@obourdon (Author)

obourdon commented Nov 4, 2022

@ArchiFleKs it seems like it should work perfectly, just like 1.10.7 (at least it does not include the failing 1.11.0 code).

@obourdon (Author)

obourdon commented Nov 16, 2022

I do not see any activity on this huge regression issue. Can someone please take care of this?

@obourdon changed the title from "Upgrading HA DynamoDB backend based Vault cluster to 1.11.0 and upwards breaks functional cluster" to "Huge regression: Upgrading HA DynamoDB backend based Vault cluster to 1.11.0 and upwards breaks functional cluster" on Nov 16, 2022
@obourdon (Author)

Still no feedback on this ????

@ccapurso (Contributor)

ccapurso commented Dec 6, 2022

Hi, @obourdon. Thank you for your patience.

The DynamoDB storage backend is community-supported and thus in-depth knowledge is a bit limited on the Vault team. You mentioned potentially attempting to reproduce this issue using Integrated Storage (raft). Were you able to accomplish that?

@ncabatoff (Collaborator)

@obourdon I think the reason this change is impacting you is that every node has the same cluster_address, https://vault.internal.mycluster.io:444. Do you know why that might be? I recognize that cluster_addr isn't needed for DynamoDB-based clusters, but I believe it should default to a unique value for each node if you don't configure it, specifically listenerIP:apiPort+1.

> Maybe the test if adv.ClusterAddr == c.ClusterAddr() && adv.RedirectAddr == c.redirectAddr { is just incomplete: the comment says it checks whether we are the active node, but that does not seem to be really verified

We know we're not the active node because at the top of the function we return if !c.standby. An OSS Vault cluster node is either active or standby, so if it's not standby it must be active.

@ArchiFleKs

I encountered the same issue, but we are using a specific cluster_addr per node, and I cannot reproduce it right now.

@ncabatoff (Collaborator)

> I encountered the same issue, but we are using a specific cluster_addr per node, and I cannot reproduce it right now.

If you're referring to the error "local node not active but active cluster node not found", I would not assume that's the same issue as this one. That's a whole class of HA failures; if you search for it you'll see it comes up in various circumstances. It can be a perfectly legitimate intermittent error even in a mostly healthy cluster; I would only worry about it if it persists or recurs frequently.

@obourdon (Author)

I must admit that I do not get some of the points described above.

AFAIU, the only way to describe the cluster address when behind a load balancer using an AWS ASG is to specify the LB address, as you do not know in advance what your instances' IPs will be, and if one instance fails and a new one is respawned by the ASG, you should not have to change your Vault cluster configuration files.

Looking at the location of the failing code, I indeed do not think that the issue is related to the fact that the backend is based on DynamoDB, and I still do not understand why this change was done in the first place and what it is supposed to fix.

@ArchiFleKs

> I encountered the same issue, but we are using a specific cluster_addr per node, and I cannot reproduce it right now.

Do you mean that you have the same issue even though you have a different cluster_addr in each Vault cluster node config file? Your sentence is quite confusing because of the "cannot reproduce right now" at the end of it.

@ArchiFleKs

@obourdon Yes, I encountered the same issue at the same time you reported this bug, but since then I have kept Vault > 1.10 running in our dev environment and I was not able to reproduce the issue, nor did I encounter it a second time.

And yes, we are using a different cluster_addr which is auto-computed when a new instance boots up in the ASG, so our configuration looks roughly like the sketch below.
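(Illustrative sketch only, not the exact configuration from this comment; the INSTANCE_PRIVATE_IP placeholder is assumed to be substituted at boot, for example by a user-data/cloud-init template.)

```hcl
# Each node keeps the shared LB name as its api_addr but advertises its own
# private IP as cluster_addr, so the cluster address is unique per node.
# INSTANCE_PRIVATE_IP is a hypothetical placeholder filled in at instance boot.
api_addr     = "https://vault.internal.mycluster.io"
cluster_addr = "https://INSTANCE_PRIVATE_IP:8201"
```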

@ArchiFleKs

> If you're referring to the error "local node not active but active cluster node not found", I would not assume that's the same issue as this one. That's a whole class of HA failures; if you search for it you'll see it comes up in various circumstances. It can be a perfectly legitimate intermittent error even in a mostly healthy cluster; I would only worry about it if it persists or recurs frequently.

@ncabatoff you're right, that's what led me to believe it might not have been the same issue despite using the same backend (DynamoDB). And it might have been just a transient issue, because I was not able to reproduce it in the end.

@obourdon (Author)

@ArchiFleKs many thanks for all the details above, and indeed we also currently keep the Vault version strictly under 1.11 to make sure that the source code does not contain the issue.

It would be very nice if Vault could support the same GetPrivateIP Golang templating configuration option as Consul does, but it seems that this idea was rejected :-(

@ncabatoff if you could spend some time explaining in more detail why this code was changed in the first place, as it is leading to a definitely NOT transient but permanent and definitive Vault cluster failure, we would greatly appreciate it.

@obourdon (Author)

obourdon commented Jan 2, 2023

still no activity on this one

@ncabatoff (Collaborator)

> @ArchiFleKs many thanks for all the details above, and indeed we also currently keep the Vault version strictly under 1.11 to make sure that the source code does not contain the issue.
>
> It would be very nice if Vault could support the same GetPrivateIP Golang templating configuration option as Consul does, but it seems that this idea was rejected :-(

We wound up doing it in #9109.
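For readers landing here: assuming the go-sockaddr templating that PR refers to applies to the top-level address fields, a per-node configuration could presumably look something like this (hedged example, not taken from the PR or its documentation):

```hcl
# Assumed go-sockaddr templating for the top-level addresses (per #9109):
# each node derives unique addresses from its own private IP instead of
# pointing every node at the shared load-balancer name.
api_addr     = "http://{{ GetPrivateIP }}:8200"
cluster_addr = "https://{{ GetPrivateIP }}:8201"
```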

> @ncabatoff if you could spend some time explaining in more detail why this code was changed in the first place, as it is leading to a definitely NOT transient but permanent and definitive Vault cluster failure, we would greatly appreciate it.

It was to prevent attempting to connect to ourself, which was causing problems in some tests.

@obourdon (Author)

obourdon commented Jan 19, 2023

@ncabatoff could you be a bit more specific, point us to the failing tests, and explain how to potentially run them locally on our development machines?
I am really wondering whether this fix is necessary, as it induces a huge regression, and maybe the tests mentioned are the ones which should be corrected (it sometimes happens on my side that the tests I wrote are bad/badly designed in the first place).

@obourdon (Author)

obourdon commented Mar 6, 2023

Any update on this?

@ncabatoff (Collaborator)

I missed the release window on this one; please note that the fix won't be in the next releases, but rather in the ones next month.
