Incomplete snapshots returned by snapshot save command #12168

Closed
needle82 opened this issue Jul 26, 2021 · 17 comments · Fixed by #12388
Labels: bug, storage/raft

Comments

@needle82

Describe the bug
I have Vault 1.7.2 running in an OpenShift environment, and every night I take a snapshot with a REST call. I wanted to restore one of these snapshots to another environment by posting it to snapshot-force, but I keep getting {"errors":["1 error occurred:\n\t* failed to read snapshot file: failed to read or write snapshot data: unexpected EOF\n\n"]}. I tried different backups and found that older ones are recoverable. I'm also not able to restore snapshots taken manually via the CLI.

To Reproduce
Create a new StatefulSet with Vault 1.7.2 configured to use Raft storage. Initialize it with vault operator init; Vault becomes unsealed via the auto-unseal mechanism. Set up port forwarding for 8200:8200 and, from the local machine, call curl -vvv -L --header "X-Vault-Token: token" -X POST --data-binary @backup.snap http://localhost:8200/v1/sys/storage/raft/snapshot-force
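For reference, the nightly snapshot being restored above is taken over the REST API; a minimal sketch of such a call (the token and the output filename are placeholders):

$ # illustrative only: save a Raft snapshot via the same endpoint the CLI uses
$ curl --header "X-Vault-Token: token" \
       http://localhost:8200/v1/sys/storage/raft/snapshot -o backup.snap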

Expected behavior
Vault is rewritten with old data.

Environment:

  • Vault 1.7.2
  • OpenShift 4, Kubernetes v1.20.0+87cc9a4

Vault server configuration file(s):

storage "raft" {
  path = "/vault/data"
}

listener "tcp" {
  address     = "[::]:8200"
  tls_disable = 1
}

disable_mlock = true
ui = true

seal "transit" {
          disable_renewal = "false"
          key_name = "autounseal"
          mount_path = "transit/"
          tls_skip_verify = "true"
}

autopilot {
    cleanup_dead_servers = "true"
    last_contact_threshold = "200ms"
    last_contact_failure_threshold = "10m"
    max_trailing_logs = 250
    min_quorum = 3
    server_stabilization_time = "10s"
}

Additional context
No problems reported in the logs while taking the snapshot:
2021-07-26T13:50:06.361Z [INFO] storage.raft: starting snapshot up to: index=106948
2021-07-26T13:50:06.362Z [INFO] storage.raft: compacting logs: from=96701 to=96708
2021-07-26T13:50:06.364Z [INFO] storage.raft: snapshot complete up to: index=106948

@ncabatoff
Collaborator

Hi @needle82,

It's possible you're timing out, hence the EOF. Have you tried setting VAULT_CLIENT_TIMEOUT as per https://www.vaultproject.io/api/system/storage/raft#large-raft-snapshots-1 ?
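For example, something along these lines when restoring with the CLI (the value is arbitrary and only meant to rule a timeout in or out):

$ # sketch: raise the client timeout, then retry the restore
$ export VAULT_CLIENT_TIMEOUT=300s
$ vault operator raft snapshot restore -force backup.snap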

@needle82
Author

Hi @ncabatoff, not yet. The snapshot is pretty small, just 2.2 MB, so I didn't expect a timeout here. If that could be the root cause, what timeout would you propose? Thank you!

@ncabatoff
Collaborator

The default timeout is 60s. Are your restore attempts failing after almost exactly one minute? If so, this is the likely cause. Try something much bigger, e.g. 30m. I don't expect it to take anywhere near that long, but this should serve to either confirm or deny this as the source of the problem.

@needle82
Author

Yes, I can confirm it's a matter of a couple of seconds in both directions, so there's no chance I'm hitting timeouts here. From my perspective it looks like, as the data grew, some bad bytes got into the archive while packing, because backups from before the 19th of July can be restored this way (they were about 1.4 MB in size).

@ncabatoff
Collaborator

Hmm, that's odd. Is there anything in the logs of the server receiving the restore request?

If you look inside the snapshot file (it's a gzipped tar file), does it contain all these files?

$ tar ztvf snap.bin 
-rw-------  0 0      0         307 26 Jul 11:34 meta.json
-rw-------  0 0      0      121514 26 Jul 11:34 state.bin
-rw-------  0 0      0         152 26 Jul 11:34 SHA256SUMS
-rw-------  0 0      0         185 26 Jul 11:34 SHA256SUMS.sealed

@needle82
Author

needle82 commented Jul 26, 2021

Hello @ncabatoff , it looks like the archive is damaged.

$ tar tzvf temp_.snap
-rw------- 0/0 530 2021-07-26 16:50 meta.json
-rw------- 0/0 2714007 2021-07-26 16:50 state.bin

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

@ncabatoff
Collaborator

Ah, so maybe you need to increase the client timeout when taking the snapshots as well, in addition to when restoring them.
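For instance, assuming the snapshot is taken with the CLI, a sketch of doing that (30m is deliberately generous and only for diagnosis):

$ # sketch: take the snapshot with a much larger client timeout
$ VAULT_CLIENT_TIMEOUT=30m vault operator raft snapshot save backup.snap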

@ncabatoff
Collaborator

Also note that this:

2021-07-26T13:50:06.361Z [INFO] storage.raft: starting snapshot up to: index=106948
2021-07-26T13:50:06.362Z [INFO] storage.raft: compacting logs: from=96701 to=96708
2021-07-26T13:50:06.364Z [INFO] storage.raft: snapshot complete up to: index=106948

is an internal operation used to manage the storage on disk. It's not in response to a snapshot API call.

@needle82
Author

needle82 commented Jul 26, 2021

And one more piece of live evidence from the Raft leader:

[screenshot of the Raft leader's log during a snapshot]

It really took no more than two seconds to create the snapshot. But I'm going to check the timeout suggestion right now.

UPD: Didn't help :(

@needle82
Author

Also note that this:

2021-07-26T13:50:06.361Z [INFO] storage.raft: starting snapshot up to: index=106948
2021-07-26T13:50:06.362Z [INFO] storage.raft: compacting logs: from=96701 to=96708
2021-07-26T13:50:06.364Z [INFO] storage.raft: snapshot complete up to: index=106948

is an internal operation used to manage the storage on disk. It's not in response to a snapshot API call.

I can see these three lines only when calling the snapshot command (with the CLI or REST; as far as I understand, the CLI uses the same REST calls):

2021-07-26T19:09:01.058Z [INFO] storage.raft: starting snapshot up to: index=107713
2021-07-26T19:09:01.061Z [INFO] storage.raft: no logs to truncate
2021-07-26T19:09:01.061Z [INFO] storage.raft: snapshot complete up to: index=107713

@ncabatoff
Collaborator

I can see these three lines only when calling the snapshot command (with the CLI or REST; as far as I understand, the CLI uses the same REST calls):

My mistake! The log messages can occur in response to a snapshot save request. But they can also occur without an external trigger.

@ncabatoff
Collaborator

Back to the issue at hand: the vault operator raft snapshot save command shouldn't be writing incomplete snapshot files without reporting an error. I'm not sure at this point whether the problem exists on the server side or the client side. I gather you're not seeing any errors in the server log?
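In the meantime, a quick client-side way to spot a truncated snapshot is to test the gzip stream and list the archive contents; a minimal sketch (backup.snap is a placeholder name):

$ # sketch: a complete snapshot should list meta.json, state.bin, SHA256SUMS and SHA256SUMS.sealed
$ gzip -t backup.snap && tar ztf backup.snap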

@ncabatoff added the bug label and removed the waiting-for-response label on Jul 26, 2021
@needle82
Author

Yes, exactly. On the CLI side I see nothing (the exit code is 0); on the server side I see only the three INFO messages shown above.

@ncabatoff changed the title from "Not able to restore raft snapshot" to "Incomplete snapshots returned by snapshot save command" on Jul 26, 2021
@ncabatoff
Collaborator

I think I can see how this might happen. I don't know why it's happening in your case, but I have a plausible code path that could lead to an error being swallowed. If true, this is a very serious bug, so I'm going to start work on reproducing and fixing it immediately.

If you have an audit log, I'd be interested to see what the response audit record looks like. If you don't have an audit log, I'd be much obliged if you set one up temporarily. It might yield some additional clues.
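A minimal sketch of enabling (and later removing) a temporary file audit device, assuming the log path is writable inside the container:

$ # sketch: temporary file audit device
$ vault audit enable file file_path=/vault/audit/audit.log
$ # ...reproduce the snapshot save, inspect the log, then remove the device
$ vault audit disable file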

@randeepkhalsa pinned this issue on Jul 26, 2021
@needle82
Author

needle82 commented Jul 26, 2021

Hi @ncabatoff. I think I solved my issue, but there is still a bug. I enabled auditing and got the following:
{"time":"2021-07-26T19:50:40.003307588Z","type":"response","auth":{"client_token":"hmac-sha256:","accessor":"hmac-sha256:","display_name":"token","policies":["root"],"token_policies":["root"],"token_type":"service","token_issue_time":"2021-06-23T07:18:00Z"},"request":{"id":"68f836d5-91d2-6bb8-573e-fdb5ff98b384","operation":"read","mount_type":"system","client_token":"hmac-sha256:","client_token_accessor":"","namespace":{"id":"root"},"path":"sys/storage/raft/snapshot","remote_address":"127.0.0.1"},"response":{"mount_type":"system"},"error":"1 error occurred:\n\t* failed to write snapshot file: failed to seal snapshot hashes: Error making API request.\n\nURL: PUT https://vault-unseal-test.remote:443/v1/transit/encrypt/autounseal\nCode: 403. Errors:\n\n* permission denied\n\n"}

So the auto-unseal token wasn't auto-renewed (I have no idea why) and I got an error while sealing the snapshot.
I reissued the auto-unseal token and recreated the pods one by one, waiting for elections in between. Once all pods had the new auto-unseal token, I went into the leader container and took a consistent backup.

So the issue is that Vault exits successfully despite an error while sealing. Also, could you please recommend how to prevent the auto-unseal token from expiring? I created it in the usual way and expected auto-renewal.

Thank you for your help!

@ncabatoff
Collaborator

Yeah, I had a hunch it was a seal-related issue, because the two missing files are the SHA256SUMS file, which is encrypted by the seal, and another file that contains the SHA-256 sum of the SHA256SUMS file.

There's still a bug in the snapshot API: we should never be returning an incomplete snapshot without error. I'm keeping this ticket open to address that.

It's debatable whether an unusable seal should provoke Vault to do something. We typically get people asking for the opposite: they don't want failures caused by transient issues reaching the service backing their seal. And depending on the seal type, they might not want any avoidable traffic going to it.

As to your auto-unseal token expiration: I don't know offhand, but it doesn't sound like a bug, at least not yet. You'll have to do some investigation to see why renewal isn't occurring (maybe the role that generated the token isn't configured to allow token renewal?). If you need help in that area I suggest taking your questions to https://discuss.hashicorp.com/c/vault/30, since we prefer to reserve GitHub issues for bug reports and feature requests.
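For illustration only, one common pattern is to issue the auto-unseal token as an orphan, periodic token, since periodic tokens can be renewed indefinitely; a sketch, where the autounseal policy name is an assumption based on the config above:

$ # sketch: orphan periodic token for the transit auto-unseal (policy name is assumed)
$ vault token create -orphan -policy="autounseal" -period=24h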

@needle82
Author

Thanks a lot for your help!
