Crash resiliency: deleting incomplete layers doesn’t reliably happen #1136

Closed
mtrmac opened this issue Feb 16, 2022 · 10 comments · Fixed by #1145 or #1407

mtrmac commented Feb 16, 2022

Consider this sequence of events, using the overlay graph driver.

  • The user initiates a pull of an image that contains two layers, parentLayer and childLayer.
  • While creating parentLayer, the WIP layer object is recorded in layers.json with incompleteFlag.
  • Afterwards, during ApplyDiff, the pull process is forcibly killed (so that it can’t do its own cleanup).
  • Result: layers.json contains a record of the layer, with incompleteFlag; the overlay graph driver contains an incomplete/inconsistent layer, but a $parentLayer/link file and an l/$link symbolic link exist. This is all as expected.

  • The user initiates a pull of the same image again.
  • (Just like the first time), the pull first checks for pre-existing layers in storage, via Store.Layer(parentLayer). This locks the layerStore read-only. Thus, the first layerStore.ReloadIfChanged does trigger a layerStore.Load(), but that does not clean up incomplete layers; layerStore.lockFile.lw was, however, updated to match the lock file contents.
  • Consequently, the record of the incomplete layer continues to exist, and Store.Layer reports that parentLayer exists.
  • Pull proceeds, assuming that parentLayer exists, and starts creating childLayer.
  • While creating childLayer, the layerStore is locked read-write, but because nothing has changed on disk and layerStore.lockFile.lw matches (within the same process), layerStore.ReloadIfChanged does nothing: it does not enter layerStore.Load(), so the “delete incomplete layers” code is never reached (see the sketch at the end of this comment). Consequently, parentLayer continues to exist in its incomplete state.
  • This allows creation of childLayer to succeed. $childLayer/lower is created, and includes the short link from $parentLayer/link.
  • Result: The whole pull is reported as successful. The image, though, contains an incomplete layer, with incomplete/inconsistent contents.

  • Next, the user does something that doesn’t start with a read-only lock of layerStore. That finally triggers layerStore.Load to delete incomplete layers — and now parentLayer is deleted, resulting in a broken parent link from childLayer to parentLayer.
  • For example, podman run theSameImage works for this purpose. That deletes the layer and fails with Error: layer not known (with a currently unclear call stack).

  • One more podman run theSameImage causes the missing layer to be noticed, with
ERRO[0000] Image theSameImage exists in local storage but may be corrupted: layer not known 
  • … and that triggers a re-pull.
  • This re-pull correctly detects that parentLayer is missing, and creates it afresh, with a new $parentLayer/link value.
  • But, childLayer is not missing, and the previous one is just reused. $childLayer/lower continues to contain the old $parentLayer/link value.
  • Finally, when trying to actually use childLayer, this manifests in
WARN[0093] Can't read link "/var/lib/containers/storage/overlay/l/UDGNJ5CR2MQ2QQDGYYK2W4WCBR" because it does not exist. A storage corruption might have occurred, attempting to recreate the missing symlinks. It might be best wipe the storage to avoid further errors due to storage corruption. 
Error: readlink /var/lib/containers/storage/overlay/l/UDGNJ5CR2MQ2QQDGYYK2W4WCBR: no such file or directory
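
To make the reload behaviour above easier to follow, here is a minimal, self-contained Go sketch (the types and functions are simplified stand-ins, not the actual containers/storage code): a read-only Load cannot delete incomplete layers but still records the on-disk state as seen, so a later write-locked ReloadIfChanged concludes that nothing has changed and never reaches the cleanup.

```go
package main

import "fmt"

// Layer is a trimmed-down stand-in for the real layer record; only the
// incomplete flag matters here.
type Layer struct {
	ID         string
	Incomplete bool
}

// layerStore mimics the relevant behaviour: Load prunes incomplete layers
// only when the caller holds the write lock, and ReloadIfChanged skips Load
// entirely when the recorded lock-file state already matches what is on disk.
type layerStore struct {
	layers      []Layer
	onDiskState int // stand-in for the lock file's "last writer" token
	loadedState int // what this process last saw (lockFile.lw in the real code)
}

func (s *layerStore) Load(writeLocked bool) {
	if writeLocked {
		kept := s.layers[:0]
		for _, l := range s.layers {
			if !l.Incomplete {
				kept = append(kept, l)
			}
		}
		s.layers = kept
	}
	// Even a read-only Load records the on-disk state as "seen".
	s.loadedState = s.onDiskState
}

func (s *layerStore) ReloadIfChanged(writeLocked bool) {
	if s.loadedState == s.onDiskState {
		return // nothing changed on disk, so Load (and its cleanup) never runs
	}
	s.Load(writeLocked)
}

func main() {
	s := &layerStore{
		layers:      []Layer{{ID: "parentLayer", Incomplete: true}},
		onDiskState: 1, // the crashed pull left this state behind
	}

	// Step 1: Store.Layer(parentLayer) locks the layerStore read-only first.
	s.ReloadIfChanged(false) // Load runs but cannot clean up; the state is now "seen"

	// Step 2: creating childLayer takes the write lock, but nothing changed on
	// disk since step 1, so Load is skipped and the cleanup never happens.
	s.ReloadIfChanged(true)

	fmt.Printf("layers after the write-locked reload: %+v\n", s.layers)
	// The incomplete parentLayer is still present.
}
```
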
mtrmac commented Feb 16, 2022

@nalind @vrothberg This is a locking design issue I’m not immediately sure how to handle.

A cheap kludge would be to have the read-only layerStore.Load pretend that incomplete layers don’t exist at all, but that could only make diagnosing such situations even harder.

Maybe have layerStore maintain a cleanupNotDone bit of state, so that the first write-locked use of layerStore does trigger a cleanup?
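
As a rough, hypothetical sketch of that idea (invented names; nothing here is the real containers/storage API): a read-only load would only record that a cleanup is still owed, and the first write-locked reload would perform it even though the on-disk state is unchanged.

```go
package main

import "fmt"

// layerStore carries a pending-cleanup bit: a read-only load that sees
// incomplete layers sets it, and the next write-locked reload acts on it.
type layerStore struct {
	incompleteIDs  []string
	cleanupPending bool
	stateSeen      bool // stand-in for "lock file unchanged since the last Load"
}

func (s *layerStore) loadReadOnly() {
	if len(s.incompleteIDs) > 0 {
		s.cleanupPending = true // cannot delete now; remember to do it later
	}
	s.stateSeen = true
}

func (s *layerStore) reloadWriteLocked() {
	if s.stateSeen && !s.cleanupPending {
		return // nothing changed and no cleanup owed, as today
	}
	fmt.Println("deleting incomplete layers:", s.incompleteIDs)
	s.incompleteIDs = nil
	s.cleanupPending = false
	s.stateSeen = true
}

func main() {
	s := &layerStore{incompleteIDs: []string{"parentLayer"}}
	s.loadReadOnly()      // first access is read-only (Store.Layer)
	s.reloadWriteLocked() // the next write-locked use now performs the cleanup
	fmt.Println("remaining incomplete layers:", s.incompleteIDs)
}
```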

vrothberg (Member) commented

> Maybe have layerStore maintain a cleanupNotDone bit of state, so that the first write-locked use of layerStore does trigger a cleanup?

It sounds like a new state would help. But I only parsed your summary and did not take a look at the code. I'd easily need a full workday to catch up on this issue.

mtrmac commented Feb 17, 2022

> cleanupNotDone

One concern with this approach is that a single process would still see a layer disappear during its lifetime. Notably, a pull would first see a layer as present in TryReusingBlob, but trying to commit a child layer on top of it would later fail.

That’s not, in principle, a new concern — layers can always disappear while the store is unlocked. But it would now be more likely to be user-noticeable.

Alternatively, should every process start by taking a read-write lock and cleaning up in-progress layers, even if the operation is purely read-only? That seems vaguely reminiscent of earlier read-write/read-only discussions like #473 (which, to be explicit, I haven’t re-read just now).
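
For comparison, a rough sketch of that alternative, with invented names and an in-process sync.RWMutex standing in for the real cross-process lock file:

```go
package main

import (
	"fmt"
	"sync"
)

// Every process, even one doing purely read-only work, briefly takes the
// write lock once at startup to remove incomplete layers left behind by a
// crashed writer.
type layerStore struct {
	mu         sync.RWMutex
	incomplete []string
}

func (s *layerStore) startupCleanup() {
	s.mu.Lock() // write lock, even for an otherwise read-only caller
	defer s.mu.Unlock()
	fmt.Println("removing incomplete layers:", s.incomplete)
	s.incomplete = nil
}

func (s *layerStore) readOnlyUse() {
	s.mu.RLock()
	defer s.mu.RUnlock()
	fmt.Println("incomplete layers visible to readers:", s.incomplete)
}

func main() {
	s := &layerStore{incomplete: []string{"parentLayer"}}
	s.startupCleanup() // crash debris is gone before any real work starts
	s.readOnlyUse()    // readers no longer see the incomplete layer
}
```

The cost, as in those earlier read-write/read-only discussions, is that even purely read-only callers would briefly contend for the write lock at startup.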

mtrmac added a commit to mtrmac/storage that referenced this issue Feb 21, 2022
... to help diagnosing later possible broken references
to this layer; compare containers#1136 .

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
mtrmac commented Feb 23, 2022

#1145 doesn’t fix this.

rhatdan commented Aug 9, 2022

@nalind what should we do about this?

wking added a commit to wking/cincinnati-graph-data that referenced this issue Sep 2, 2022
1. [1] landed after 4.11.1 and was shipped in 4.11.2, adding a 20s
   timeout to baremetalRuntimeCfgImage 'podman run ...' calls.
2. The baremetalRuntimeCfgImage stuff gets enabled for a number of
   infrastructure providers [2].
3. A Podman bug in TERM handling means that timeout TERMs can result
   in corrupted storage [3].
4. That corruption bubbles up with errors like:

     Can't read link "/var/lib/containers/storage/overlay/..." because it does not exist. A storage corruption might have occurred

   or maybe:

     Image ... exists in local storage but may be corrupted (remove the image to resolve the issue): layer not known

   or maybe both [4].
5. 4.11.3 and later fix the regression by separating the possibly-slow
   image pull from the container run [4].

[1]: https://github.com/openshift/machine-config-operator/pull/3287/files#diff-255f8a4599166f31961853ea8626f969ca4231c55aacbc20a5bb3ceb640f911dR48
[2]: https://github.com/openshift/machine-config-operator/blob/d33d8dc3d2cad2247f67dff5989256315000e2d1/pkg/controller/template/render.go#L512
     Commit from:

       $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.11.2-x86_64 | grep machine-config-operator
         machine-config-operator                        https://github.com/openshift/machine-config-operator                        d33d8dc3d2cad2247f67dff5989256315000e2d1
[3]: containers/storage#1136
[4]: https://issues.redhat.com/browse/OCPBUGS-631
mtrmac commented Sep 9, 2022

Note also #1322, about the whole concept of read-only locking of the store object.

mtrmac commented Sep 15, 2022

See #1332 (comment) for a very vague sketch of how this could be fixed.

mtrmac commented Oct 14, 2022

(Warning: untested.)

An originally unappreciated component of the failure is that before the

> (Just like the first time), the pull first checks for pre-existing layers in storage, via Store.Layer(parentLayer).

step, the layerStore is initialized in newLayerStore. At that point, we were calling Load without any lock held at all [which might not be correct WRT concurrent writers, I’m not immediately sure], and Load’s if r.lockfile.Locked() condition caused it not to delete the incomplete layers.

That has recently changed in #1351: so the simple reproducers above should now, AFAICS, trigger deleting incomplete layers. That needs testing (and if true, we will need a more complex reproducer).

We still need to fix this in case the incomplete layer is created by a concurrent process during the lifetime of our process (e.g. if two processes are concurrently pulling images that share layers, and one of them crashes).


TBD: We need a different solution for incomplete layers in read-only additional stores. Maybe just remove the incomplete layer from the in-memory state completely, and pretend it doesn’t exist; that might allow the layer to be pulled into the primary store. At least assuming correct writers to the additional store (which is not guaranteed as of today), there should be no child layers or images referring to those incomplete layers.
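
A rough sketch of that additional-store idea, under the assumption stated above (correct writers to the additional store); layerRecord and loadAdditionalStore are invented names, not containers/storage code:

```go
package main

import "fmt"

// The additional store is read-only, so loading it simply filters incomplete
// layers out of the in-memory view. Callers then behave as if those layers do
// not exist and can pull them into the primary store instead.
type layerRecord struct {
	ID         string
	Incomplete bool
}

func loadAdditionalStore(onDisk []layerRecord) []layerRecord {
	visible := make([]layerRecord, 0, len(onDisk))
	for _, l := range onDisk {
		if l.Incomplete {
			continue // hide rather than delete: the store is read-only
		}
		visible = append(visible, l)
	}
	return visible
}

func main() {
	onDisk := []layerRecord{
		{ID: "parentLayer", Incomplete: true},
		{ID: "otherLayer"},
	}
	fmt.Println("layers visible to callers:", loadAdditionalStore(onDisk))
}
```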

mtrmac commented Nov 4, 2022

> That has recently changed in #1351: so the simple reproducers above should now, AFAICS, trigger deleting incomplete layers. That needs testing

Confirmed.

> (and if true, we will need a more complex reproducer.)

  • podman pull a multi-layer image; after it loads the layer store, but before actually starting to pull (e.g. at the very start of store.layersByMappedDigest, before taking any cross-process locks), block the process
  • Launch a concurrent podman pull of that image, let it create an incomplete layer, then terminate that concurrent pull
  • Resume the original podman pull, so that its first access is read-only.

Then it is observable that the incomplete layer is not deleted before a child layer is created, and the pulled image is corrupt; the pull itself already complains:

WARNING: Image f643c72bc252 exists in local storage but may be corrupted (remove the image to resolve the issue): size for layer "bacd3af13903e13a43fe87b6944acd1ff21024132aad6e74b4452d984fb1a99a" is unknown, failing getSize()

A subsequent podman pull now does clean up the old layer (i.e. we don’t need a different podman run step to trigger the cleanup), and triggers

ERRO[0031] Image quay.io/libpod/ubuntu exists in local storage but may be corrupted (remove the image to resolve the issue): layer not known 

and that causes the original process to pull the image again.

The ultimate failure is the same: child layers’ /lower files point at an old link/… file that was a part of the incomplete layer and no longer exists:

WARN[0010] Can't read link "/var/lib/containers/storage/overlay/l/XK4MZAZWZI47KAL2UTD7LH4D6P" because it does not exist. A storage corruption might have occurred, attempting to recreate the missing symlinks. It might be best wipe the storage to avoid further errors due to storage corruption. 
Error: readlink /var/lib/containers/storage/overlay/l/XK4MZAZWZI47KAL2UTD7LH4D6P: no such file or directory

mtrmac commented Nov 4, 2022

Trying the above reproducer, #1407 indeed seems to fix it: the resumed process, on its first read-only access, notices that the store is corrupt, retries with a write lock, deletes the incomplete layer immediately, and creates it from scratch.
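
For reference, a rough sketch of the shape of that fix (invented names; not the actual #1407 code): the read-only load path reports corruption it cannot repair, and the caller retries the load under the write lock, removing the incomplete layer before anything new is built on top of it.

```go
package main

import (
	"errors"
	"fmt"
)

// errNeedsRepair signals that a read-only load saw state it may not modify.
var errNeedsRepair = errors.New("store has incomplete layers; write lock needed")

type layerStore struct {
	incomplete []string
}

func (s *layerStore) loadReadOnly() error {
	if len(s.incomplete) > 0 {
		return errNeedsRepair // a read-only caller cannot fix this itself
	}
	return nil
}

func (s *layerStore) loadWriteLocked() {
	fmt.Println("deleting incomplete layers:", s.incomplete)
	s.incomplete = nil
}

func (s *layerStore) reload() {
	if err := s.loadReadOnly(); errors.Is(err, errNeedsRepair) {
		// Upgrade to the write lock and repair before continuing.
		s.loadWriteLocked()
	}
}

func main() {
	s := &layerStore{incomplete: []string{"parentLayer"}}
	s.reload() // the first (read-only) access triggers the repair
	fmt.Println("remaining incomplete layers:", s.incomplete)
}
```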
