Incomplete snapshots returned by snapshot save command #12168

Closed
needle82 opened this issue Jul 26, 2021 · 17 comments · Fixed by #12388
Labels: bug, storage/raft

Comments

@needle82

Describe the bug
I have Vault 1.7.2 running in an OpenShift environment, and every night I take a snapshot with a REST call. I wanted to restore one of these snapshots to another environment by posting it to snapshot-force, but I keep getting {"errors":["1 error occurred:\n\t* failed to read snapshot file: failed to read or write snapshot data: unexpected EOF\n\n"]}. I tried different backups and found that older ones are recoverable. I'm also not able to restore snapshots taken manually via the CLI.

To Reproduce
Create a new StatefulSet with Vault 1.7.2 configured to use Raft storage. Initialize it with vault operator init; Vault becomes unsealed via the auto-unseal mechanism. Set up port forwarding for 8200:8200 and, from the local machine, call curl -vvv -L --header "X-Vault-Token: token" -X POST --data-binary @backup.snap http://localhost:8200/v1/sys/storage/raft/snapshot-force
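For reference, the nightly snapshot being restored above is taken over the REST API; a minimal sketch of such a call (the token and the output filename are placeholders):

$ # illustrative only: save a Raft snapshot via the same endpoint the CLI uses
$ curl --header "X-Vault-Token: token" \
       http://localhost:8200/v1/sys/storage/raft/snapshot -o backup.snap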

Expected behavior
Vault is rewritten with old data.

Environment:

  • Vault 1.7.2
  • OpenShift 4, Kubernetes v1.20.0+87cc9a4

Vault server configuration file(s):

storage "raft" {
  path = "/vault/data"
}

listener "tcp" {
  address     = "[::]:8200"
  tls_disable = 1
}

disable_mlock = true
ui = true

seal "transit" {
          disable_renewal = "false"
          key_name = "autounseal"
          mount_path = "transit/"
          tls_skip_verify = "true"
}

autopilot {
    cleanup_dead_servers = "true"
    last_contact_threshold = "200ms"
    last_contact_failure_threshold = "10m"
    max_trailing_logs = 250
    min_quorum = 3
    server_stabilization_time = "10s"
}

Additional context
No problems reported in the logs while taking the snapshot:
2021-07-26T13:50:06.361Z [INFO] storage.raft: starting snapshot up to: index=106948
2021-07-26T13:50:06.362Z [INFO] storage.raft: compacting logs: from=96701 to=96708
2021-07-26T13:50:06.364Z [INFO] storage.raft: snapshot complete up to: index=106948

@ncabatoff
Collaborator

Hi @needle82,

It's possible you're timing out, hence the EOF. Have you tried setting VAULT_CLIENT_TIMEOUT as per https://www.vaultproject.io/api/system/storage/raft#large-raft-snapshots-1 ?
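For example, something along these lines when restoring with the CLI (the value is arbitrary and only meant to rule a timeout in or out):

$ # sketch: raise the client timeout, then retry the restore
$ export VAULT_CLIENT_TIMEOUT=300s
$ vault operator raft snapshot restore -force backup.snap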

@needle82
Author

Hi @ncabatoff, not yet. The snapshot is pretty small, just 2.2 MB, so I didn't expect a timeout here. If that could be the root cause, what timeout would you propose? Thank you!

@ncabatoff
Collaborator

The default timeout is 60s. Are your restore attempts failing after almost exactly one minute? If so, this is the likely cause. Try something much bigger, e.g. 30m. I don't expect it to take anywhere near that long, but this should serve to either confirm or deny this as the source of the problem.

@needle82
Author

Yes, I can confirm it's a matter of a couple of seconds in both directions, so there's no chance I'm hitting timeouts here. From my perspective it looks like, as the data grew, some bad bytes got into the archive while packing, because backups from before the 19th of July can be restored this way (they were about 1.4 MB in size).

@ncabatoff
Collaborator

Hmm, that's odd. Is there anything in the logs of the server receiving the restore request?

If you look inside the snapshot file (it's a gzipped tar file), does it contain all these files?

$ tar ztvf snap.bin 
-rw-------  0 0      0         307 26 Jul 11:34 meta.json
-rw-------  0 0      0      121514 26 Jul 11:34 state.bin
-rw-------  0 0      0         152 26 Jul 11:34 SHA256SUMS
-rw-------  0 0      0         185 26 Jul 11:34 SHA256SUMS.sealed

@needle82
Author

needle82 commented Jul 26, 2021

Hello @ncabatoff , it looks like the archive is damaged.

$ tar tzvf temp_.snap
-rw------- 0/0 530 2021-07-26 16:50 meta.json
-rw------- 0/0 2714007 2021-07-26 16:50 state.bin

gzip: stdin: unexpected end of file
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

@ncabatoff
Collaborator

Ah, so maybe you need to increase the client timeout when taking the snapshots as well, in addition to when restoring them.
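For instance, assuming the snapshot is taken with the CLI, a sketch of doing that (30m is deliberately generous and only for diagnosis):

$ # sketch: take the snapshot with a much larger client timeout
$ VAULT_CLIENT_TIMEOUT=30m vault operator raft snapshot save backup.snap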

@ncabatoff
Collaborator

Also note that this:

2021-07-26T13:50:06.361Z [INFO] storage.raft: starting snapshot up to: index=106948
2021-07-26T13:50:06.362Z [INFO] storage.raft: compacting logs: from=96701 to=96708
2021-07-26T13:50:06.364Z [INFO] storage.raft: snapshot complete up to: index=106948

is an internal operation used to manage the storage on disk. It's not in response to a snapshot API call.

@needle82
Author

needle82 commented Jul 26, 2021

And one more piece of live evidence from the Raft leader:

[screenshot of the Raft leader's log during a snapshot]

It really took no more than two seconds to create the snapshot. But I'm going to check the timeout suggestion right now.

UPD: Didn't help :(

@needle82
Author

Also note that this:

2021-07-26T13:50:06.361Z [INFO] storage.raft: starting snapshot up to: index=106948
2021-07-26T13:50:06.362Z [INFO] storage.raft: compacting logs: from=96701 to=96708
2021-07-26T13:50:06.364Z [INFO] storage.raft: snapshot complete up to: index=106948

is an internal operation used to manage the storage on disk. It's not in response to a snapshot API call.

I can see these three lines only when calling the snapshot command (with the CLI or REST; as far as I understand, the CLI uses the same REST calls):

2021-07-26T19:09:01.058Z [INFO] storage.raft: starting snapshot up to: index=107713
2021-07-26T19:09:01.061Z [INFO] storage.raft: no logs to truncate
2021-07-26T19:09:01.061Z [INFO] storage.raft: snapshot complete up to: index=107713

@ncabatoff
Collaborator

I can see these three lines only when calling the snapshot command (with the CLI or REST; as far as I understand, the CLI uses the same REST calls):

My mistake! The log messages can occur in response to a snapshot save request. But they can also occur without an external trigger.

@ncabatoff
Collaborator

Back to the issue at hand: the vault operator raft snapshot save command shouldn't be writing incomplete snapshot files without reporting an error. I'm not sure at this point whether the problem exists on the server side or the client side. I gather you're not seeing any errors in the server log?
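In the meantime, a quick client-side way to spot a truncated snapshot is to test the gzip stream and list the archive contents; a minimal sketch (backup.snap is a placeholder name):

$ # sketch: a complete snapshot should list meta.json, state.bin, SHA256SUMS and SHA256SUMS.sealed
$ gzip -t backup.snap && tar ztf backup.snap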

@ncabatoff added the bug label and removed the waiting-for-response label on Jul 26, 2021
@needle82
Author

Yes, exactly. On the CLI side I see nothing (the exit code is 0); on the server side I see only the three INFO messages shown above.

@ncabatoff changed the title from "Not able to restore raft snapshot" to "Incomplete snapshots returned by snapshot save command" on Jul 26, 2021
@ncabatoff
Collaborator

I think I can see how this might happen. I don't know why it's happening in your case, but I have a plausible code path that could lead to an error being swallowed. If true, this is a very serious bug, so I'm going to start work on reproducing and fixing it immediately.

If you have an audit log, I'd be interested to see what the response audit record looks like. If you don't have an audit log, I'd be much obliged if you set one up temporarily. It might yield some additional clues.
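A minimal sketch of enabling (and later removing) a temporary file audit device, assuming the log path is writable inside the container:

$ # sketch: temporary file audit device
$ vault audit enable file file_path=/vault/audit/audit.log
$ # ...reproduce the snapshot save, inspect the log, then remove the device
$ vault audit disable file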

@randeepkhalsa pinned this issue on Jul 26, 2021
@needle82
Author

needle82 commented Jul 26, 2021

Hi @ncabatoff. I think I solved my issue, but there is still a bug. I enabled auditing and got the following:
{"time":"2021-07-26T19:50:40.003307588Z","type":"response","auth":{"client_token":"hmac-sha256:","accessor":"hmac-sha256:","display_name":"token","policies":["root"],"token_policies":["root"],"token_type":"service","token_issue_time":"2021-06-23T07:18:00Z"},"request":{"id":"68f836d5-91d2-6bb8-573e-fdb5ff98b384","operation":"read","mount_type":"system","client_token":"hmac-sha256:","client_token_accessor":"","namespace":{"id":"root"},"path":"sys/storage/raft/snapshot","remote_address":"127.0.0.1"},"response":{"mount_type":"system"},"error":"1 error occurred:\n\t* failed to write snapshot file: failed to seal snapshot hashes: Error making API request.\n\nURL: PUT https://vault-unseal-test.remote:443/v1/transit/encrypt/autounseal\nCode: 403. Errors:\n\n* permission denied\n\n"}

So the auto-unseal token wasn't auto-renewed (I have no idea why) and I got an error while sealing the snapshot.
I reissued the auto-unseal token and recreated the pods one by one, waiting for elections in between. Once all pods had the new auto-unseal token, I went into the leader container and took a consistent backup.

So the issue is that Vault exits successfully despite an error while sealing. Also, could you please recommend how to prevent the auto-unseal token from expiring? I created it in the usual way and expected auto-renewal.

Thank you for your help!

@ncabatoff
Collaborator

Yeah, I had a hunch it was a seal-related issue, because the two missing files are the SHA256SUMS file, which is encrypted by the seal, and another file that contains the SHA-256 sum of the SHA256SUMS file.

There's still a bug in the snapshot API: we should never be returning an incomplete snapshot without error. I'm keeping this ticket open to address that.

It's debatable whether an unusable seal should provoke Vault to do something. We typically get people asking for the opposite: they don't want failures caused by transient issues reaching the service backing their seal. And depending on the seal type, they might not want any avoidable traffic going to it.

As to your auto-unseal token expiration: I don't know offhand, but it doesn't sound like a bug, at least not yet. You'll have to do some investigation to see why renewal isn't occurring (maybe the role that generated the token isn't configured to allow token renewal?). If you need help in that area I suggest taking your questions to https://discuss.hashicorp.com/c/vault/30, since we prefer to reserve GitHub issues for bug reports and feature requests.
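For illustration only, one common pattern is to issue the auto-unseal token as an orphan, periodic token, since periodic tokens can be renewed indefinitely; a sketch, where the autounseal policy name is an assumption based on the config above:

$ # sketch: orphan periodic token for the transit auto-unseal (policy name is assumed)
$ vault token create -orphan -policy="autounseal" -period=24h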

@needle82
Author

Thanks a lot for your help!
