Crash resiliency: deleting incomplete layers doesn’t reliably happen #1136

Closed
mtrmac opened this issue Feb 16, 2022 · 10 comments · Fixed by #1145 or #1407

mtrmac commented Feb 16, 2022

Consider this sequence of events, using the overlay graph driver.

  • The user initiates a pull of an image that contains two layers, parentLayer and childLayer.
  • While creating parentLayer, the WIP layer object is recorded in layers.json with incompleteFlag.
  • Afterwards, during ApplyDiff, the pull process is forcibly killed (so that it can’t do its own cleanup).
  • Result: layers.json contains a record of the layer, with incompleteFlag; the overlay graph driver contains an incomplete/inconsistent layer, but a $parentLayer/link file and an l/$link symbolic link exist. This is all as expected.

  • The user initiates a pull of the same image again.
  • (Just like the first time), the pull first checks for pre-existing layers in storage, via Store.Layer(parentLayer). This locks the layerStore read-only. Thus, the first layerStore.ReloadIfChanged does trigger a layerStore.Load(), but that does not clean up incomplete layers; layerStore.lockFile.lw was, however, updated to match the lock file contents.
  • Consequently, the record of the incomplete layer continues to exist, and Store.Layer reports that parentLayer exists.
  • Pull proceeds, assuming that parentLayer exists, and starts creating childLayer.
  • While creating childLayer, the layerStore is locked read-write, but because nothing has changed on disk and layerStore.lockFile.lw matches (within the same process), layerStore.ReloadIfChanged does nothing: it does not enter layerStore.Load(), so the “delete incomplete layers” code is never reached (see the sketch at the end of this comment). Consequently, parentLayer continues to exist in its incomplete state.
  • This allows creation of childLayer to succeed. $childLayer/lower is created, and includes the short link from $parentLayer/link.
  • Result: The whole pull is reported as successful. The image, though, contains an incomplete layer, with incomplete/inconsistent contents.

  • Next, the user does something that doesn’t start with a read-only lock of layerStore. That finally triggers layerStore.Load to delete incomplete layers — and now parentLayer is deleted, resulting in a broken parent link from childLayer to parentLayer.
  • For example, podman run theSameImage works for this purpose. That deletes the layer and fails with Error: layer not known (with a currently unclear call stack).

  • One more podman run theSameImage causes the missing layer to be noticed, with
ERRO[0000] Image theSameImage exists in local storage but may be corrupted: layer not known 
  • … and that triggers a re-pull.
  • This re-pull correctly detects that parentLayer is missing, and creates it afresh, with a new $parentLayer/link value.
  • But, childLayer is not missing, and the previous one is just reused. $childLayer/lower continues to contain the old $parentLayer/link value.
  • Finally, when trying to actually use childLayer, this manifests in
WARN[0093] Can't read link "/var/lib/containers/storage/overlay/l/UDGNJ5CR2MQ2QQDGYYK2W4WCBR" because it does not exist. A storage corruption might have occurred, attempting to recreate the missing symlinks. It might be best wipe the storage to avoid further errors due to storage corruption. 
Error: readlink /var/lib/containers/storage/overlay/l/UDGNJ5CR2MQ2QQDGYYK2W4WCBR: no such file or directory
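
To make the reload behaviour above easier to follow, here is a minimal, self-contained Go sketch (the types and functions are simplified stand-ins, not the actual containers/storage code): a read-only Load cannot delete incomplete layers but still records the on-disk state as seen, so a later write-locked ReloadIfChanged concludes that nothing has changed and never reaches the cleanup.

```go
package main

import "fmt"

// Layer is a trimmed-down stand-in for the real layer record; only the
// incomplete flag matters here.
type Layer struct {
	ID         string
	Incomplete bool
}

// layerStore mimics the relevant behaviour: Load prunes incomplete layers
// only when the caller holds the write lock, and ReloadIfChanged skips Load
// entirely when the recorded lock-file state already matches what is on disk.
type layerStore struct {
	layers      []Layer
	onDiskState int // stand-in for the lock file's "last writer" token
	loadedState int // what this process last saw (lockFile.lw in the real code)
}

func (s *layerStore) Load(writeLocked bool) {
	if writeLocked {
		kept := s.layers[:0]
		for _, l := range s.layers {
			if !l.Incomplete {
				kept = append(kept, l)
			}
		}
		s.layers = kept
	}
	// Even a read-only Load records the on-disk state as "seen".
	s.loadedState = s.onDiskState
}

func (s *layerStore) ReloadIfChanged(writeLocked bool) {
	if s.loadedState == s.onDiskState {
		return // nothing changed on disk, so Load (and its cleanup) never runs
	}
	s.Load(writeLocked)
}

func main() {
	s := &layerStore{
		layers:      []Layer{{ID: "parentLayer", Incomplete: true}},
		onDiskState: 1, // the crashed pull left this state behind
	}

	// Step 1: Store.Layer(parentLayer) locks the layerStore read-only first.
	s.ReloadIfChanged(false) // Load runs but cannot clean up; the state is now "seen"

	// Step 2: creating childLayer takes the write lock, but nothing changed on
	// disk since step 1, so Load is skipped and the cleanup never happens.
	s.ReloadIfChanged(true)

	fmt.Printf("layers after the write-locked reload: %+v\n", s.layers)
	// The incomplete parentLayer is still present.
}
```
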
mtrmac commented Feb 16, 2022

@nalind @vrothberg This is a locking design issue I’m not immediately sure how to handle.

A cheap kludge would be to have the read-only layerStore.Load pretend that incomplete layers don’t exist at all, but that could only make diagnosing such situations even harder.

Maybe have layerStore maintain a cleanupNotDone bit of state, so that the first write-locked use of layerStore does trigger a cleanup?
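
As a rough, hypothetical sketch of that idea (invented names; nothing here is the real containers/storage API): a read-only load would only record that a cleanup is still owed, and the first write-locked reload would perform it even though the on-disk state is unchanged.

```go
package main

import "fmt"

// layerStore carries a pending-cleanup bit: a read-only load that sees
// incomplete layers sets it, and the next write-locked reload acts on it.
type layerStore struct {
	incompleteIDs  []string
	cleanupPending bool
	stateSeen      bool // stand-in for "lock file unchanged since the last Load"
}

func (s *layerStore) loadReadOnly() {
	if len(s.incompleteIDs) > 0 {
		s.cleanupPending = true // cannot delete now; remember to do it later
	}
	s.stateSeen = true
}

func (s *layerStore) reloadWriteLocked() {
	if s.stateSeen && !s.cleanupPending {
		return // nothing changed and no cleanup owed, as today
	}
	fmt.Println("deleting incomplete layers:", s.incompleteIDs)
	s.incompleteIDs = nil
	s.cleanupPending = false
	s.stateSeen = true
}

func main() {
	s := &layerStore{incompleteIDs: []string{"parentLayer"}}
	s.loadReadOnly()      // first access is read-only (Store.Layer)
	s.reloadWriteLocked() // the next write-locked use now performs the cleanup
	fmt.Println("remaining incomplete layers:", s.incompleteIDs)
}
```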

vrothberg (Member) commented

> Maybe have layerStore maintain a cleanupNotDone bit of state, so that the first write-locked use of layerStore does trigger a cleanup?

It sounds like a new state would help. But I only parsed your summary and did not take a look at the code. I'd easily need a full workday to catch up on this issue.

mtrmac commented Feb 17, 2022

> cleanupNotDone

One concern with this approach is that a single process would still see a layer disappear during its lifetime. Notably, a pull would first see a layer as present in TryReusingBlob, but trying to commit a child layer on top of it would later fail.

That’s not, in principle, a new concern — layers can always disappear while the store is unlocked. But it would now be more likely to be user-noticeable.

Alternatively, should every process start by taking a read-write lock and cleaning up in-progress layers, even if the operation is purely read-only? That seems vaguely reminiscent of earlier read-write/read-only discussions like #473 (which, to be explicit, I haven’t re-read just now).
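
For comparison, a rough sketch of that alternative, with invented names and an in-process sync.RWMutex standing in for the real cross-process lock file:

```go
package main

import (
	"fmt"
	"sync"
)

// Every process, even one doing purely read-only work, briefly takes the
// write lock once at startup to remove incomplete layers left behind by a
// crashed writer.
type layerStore struct {
	mu         sync.RWMutex
	incomplete []string
}

func (s *layerStore) startupCleanup() {
	s.mu.Lock() // write lock, even for an otherwise read-only caller
	defer s.mu.Unlock()
	fmt.Println("removing incomplete layers:", s.incomplete)
	s.incomplete = nil
}

func (s *layerStore) readOnlyUse() {
	s.mu.RLock()
	defer s.mu.RUnlock()
	fmt.Println("incomplete layers visible to readers:", s.incomplete)
}

func main() {
	s := &layerStore{incomplete: []string{"parentLayer"}}
	s.startupCleanup() // crash debris is gone before any real work starts
	s.readOnlyUse()    // readers no longer see the incomplete layer
}
```

The cost, as in those earlier read-write/read-only discussions, is that even purely read-only callers would briefly contend for the write lock at startup.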

mtrmac added a commit to mtrmac/storage that referenced this issue Feb 21, 2022
... to help diagnosing later possible broken references
to this layer; compare containers#1136 .

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
mtrmac commented Feb 23, 2022

#1145 doesn’t fix this.

rhatdan commented Aug 9, 2022

@nalind what should we do about this?

wking added a commit to wking/cincinnati-graph-data that referenced this issue Sep 2, 2022
1. [1] landed after 4.11.1 and was shipped in 4.11.2, adding a 20s
   timeout to baremetalRuntimeCfgImage 'podman run ...' calls.
2. The baremetalRuntimeCfgImage stuff gets enabled for a number of
   infrastructure providers [2].
3. A Podman bug in TERM handling means that timeout TERMs can result
   in corrupted storage [3].
4. That corruption bubbles up with errors like:

     Can't read link "/var/lib/containers/storage/overlay/..." because it does not exist. A storage corruption might have occurred

   or maybe:

     Image ... exists in local storage but may be corrupted (remove the image to resolve the issue): layer not known

   or maybe both [4].
5. 4.11.3 and later fix the regression by separating the possibly-slow
   image pull from the container run [4].

[1]: https://github.com/openshift/machine-config-operator/pull/3287/files#diff-255f8a4599166f31961853ea8626f969ca4231c55aacbc20a5bb3ceb640f911dR48
[2]: https://github.com/openshift/machine-config-operator/blob/d33d8dc3d2cad2247f67dff5989256315000e2d1/pkg/controller/template/render.go#L512
     Commit from:

       $ oc adm release info --commits quay.io/openshift-release-dev/ocp-release:4.11.2-x86_64 | grep machine-config-operator
         machine-config-operator                        https://github.com/openshift/machine-config-operator                        d33d8dc3d2cad2247f67dff5989256315000e2d1
[3]: containers/storage#1136
[4]: https://issues.redhat.com/browse/OCPBUGS-631
mtrmac commented Sep 9, 2022

Note also #1322, about the whole concept of read-only locking of the store object.

mtrmac commented Sep 15, 2022

See #1332 (comment) for a very vague sketch of how this could be fixed.

mtrmac commented Oct 14, 2022

(Warning: untested.)

An originally unappreciated component of the failure is that before the

> (Just like the first time), the pull first checks for pre-existing layers in storage, via Store.Layer(parentLayer).

step, the layerStore is initialized in newLayerStore. At that point, we were calling Load without any lock held at all [which might not be correct WRT concurrent writers, I’m not immediately sure], and Load’s if r.lockfile.Locked() condition caused it not to delete the incomplete layers.

That has recently changed in #1351: so the simple reproducers above should now, AFAICS, trigger deleting incomplete layers. That needs testing (and if true, we will need a more complex reproducer).

We still need to fix this in case the incomplete layer is created by a concurrent process during the lifetime of our process (e.g. if two processes are concurrently pulling images that share layers, and one of them crashes).


TBD: We need a different solution for incomplete layers in read-only additional stores. Maybe just remove the incomplete layer from the in-memory state completely, and pretend it doesn’t exist; that might allow the layer to be pulled into the primary store. At least assuming correct writers to the additional store (which is not guaranteed as of today), there should be no child layers or images referring to those incomplete layers.
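
A rough sketch of that additional-store idea, under the assumption stated above (correct writers to the additional store); layerRecord and loadAdditionalStore are invented names, not containers/storage code:

```go
package main

import "fmt"

// The additional store is read-only, so loading it simply filters incomplete
// layers out of the in-memory view. Callers then behave as if those layers do
// not exist and can pull them into the primary store instead.
type layerRecord struct {
	ID         string
	Incomplete bool
}

func loadAdditionalStore(onDisk []layerRecord) []layerRecord {
	visible := make([]layerRecord, 0, len(onDisk))
	for _, l := range onDisk {
		if l.Incomplete {
			continue // hide rather than delete: the store is read-only
		}
		visible = append(visible, l)
	}
	return visible
}

func main() {
	onDisk := []layerRecord{
		{ID: "parentLayer", Incomplete: true},
		{ID: "otherLayer"},
	}
	fmt.Println("layers visible to callers:", loadAdditionalStore(onDisk))
}
```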

mtrmac commented Nov 4, 2022

> That has recently changed in #1351: so the simple reproducers above should now, AFAICS, trigger deleting incomplete layers. That needs testing

Confirmed.

> (and if true, we will need a more complex reproducer.)

  • podman pull a multi-layer image; after it loads the layer store, but before actually starting to pull (e.g. at the very start of store.layersByMappedDigest, before taking any cross-process locks), block the process
  • Launch a concurrent podman pull of that image, let it create an incomplete layer, then terminate that concurrent pull
  • Resume the original podman pull, so that its first access is read-only.

Then it is observable that the incomplete layer is not deleted before a child layer is created, and the pulled image is corrupt; the pull itself already complains:

WARNING: Image f643c72bc252 exists in local storage but may be corrupted (remove the image to resolve the issue): size for layer "bacd3af13903e13a43fe87b6944acd1ff21024132aad6e74b4452d984fb1a99a" is unknown, failing getSize()

A subsequent podman pull now does clean up the old layer (i.e. we don’t need a different podman run step to trigger the cleanup), and triggers

ERRO[0031] Image quay.io/libpod/ubuntu exists in local storage but may be corrupted (remove the image to resolve the issue): layer not known 

and that causes the original process to pull the image again.

The ultimate failure is the same: child layers’ /lower files point at an old link/… file that was a part of the incomplete layer and no longer exists:

WARN[0010] Can't read link "/var/lib/containers/storage/overlay/l/XK4MZAZWZI47KAL2UTD7LH4D6P" because it does not exist. A storage corruption might have occurred, attempting to recreate the missing symlinks. It might be best wipe the storage to avoid further errors due to storage corruption. 
Error: readlink /var/lib/containers/storage/overlay/l/XK4MZAZWZI47KAL2UTD7LH4D6P: no such file or directory

mtrmac commented Nov 4, 2022

Trying the above reproducer, #1407 indeed seems to fix it: the resumed process, on its first read-only access, notices that the store is corrupt, retries with a write lock, deletes the incomplete layer immediately, and creates it from scratch.
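
For reference, a rough sketch of the shape of that fix (invented names; not the actual #1407 code): the read-only load path reports corruption it cannot repair, and the caller retries the load under the write lock, removing the incomplete layer before anything new is built on top of it.

```go
package main

import (
	"errors"
	"fmt"
)

// errNeedsRepair signals that a read-only load saw state it may not modify.
var errNeedsRepair = errors.New("store has incomplete layers; write lock needed")

type layerStore struct {
	incomplete []string
}

func (s *layerStore) loadReadOnly() error {
	if len(s.incomplete) > 0 {
		return errNeedsRepair // a read-only caller cannot fix this itself
	}
	return nil
}

func (s *layerStore) loadWriteLocked() {
	fmt.Println("deleting incomplete layers:", s.incomplete)
	s.incomplete = nil
}

func (s *layerStore) reload() {
	if err := s.loadReadOnly(); errors.Is(err, errNeedsRepair) {
		// Upgrade to the write lock and repair before continuing.
		s.loadWriteLocked()
	}
}

func main() {
	s := &layerStore{incomplete: []string{"parentLayer"}}
	s.reload() // the first (read-only) access triggers the repair
	fmt.Println("remaining incomplete layers:", s.incomplete)
}
```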
