Skip to content

Commit

Permalink
*: introduce image_pull_with_sync_fs in CRI
Browse files Browse the repository at this point in the history
It's to ensure the data integrity during unexpected power failure.

Background:

Since release 1.3, in Linux system, containerD unpacks and writes files into
overlayfs snapshot directly. It doesn’t involve any mount-umount operations
so that the performance of pulling image has been improved.

As we know, the umount syscall for overlayfs will force kernel to flush
all the dirty pages into disk. Without umount syscall, the files’ data relies
on kernel’s writeback threads or filesystem's commit setting (for
instance, ext4 filesystem).

The files in committed snapshot can be loss after unexpected power failure.
However, the snapshot has been committed and the metadata also has been
fsynced. There is data inconsistency between snapshot metadata and files
in that snapshot.

We, containerd, received several issues about data loss after unexpected
power failure.

* #5854
* #3369 (comment)

Solution:

* Option 1: SyncFs after unpack

Linux platform provides [syncfs][syncfs] syscall to synchronize just the
filesystem containing a given file.

* Option 2: Fsync directories recursively and fsync on regular file

The fsync doesn't support symlink/block device/char device files. We
need to use fsync the parent directory to ensure that entry is
persisted.

However, based on [xfstest-dev][xfstest-dev], there is no case to ensure
fsync-on-parent can persist the special file's metadata, for example,
uid/gid, access mode.

Checkout [generic/690][generic/690]: Syncing parent dir can persist
symlink. But for f2fs, it needs special mount option. And it doesn't say
that uid/gid can be persisted. All the details are behind the
implemetation.

> NOTE: All the related test cases has `_flakey_drop_and_remount` in
[xfstest-dev].

Based on discussion about [Documenting the crash-recovery guarantees of Linux file systems][kernel-crash-recovery-data-integrity],
we can't rely on Fsync-on-parent.

* Option 1 is winner

This patch is using option 1.

There is test result based on [test-tool][test-tool].
All the networking traffic created by pull is local.

  * Image: docker.io/library/golang:1.19.4 (992 MiB)
    * Current: 5.446738579s
      * WIOS=21081, WBytes=1329741824, RIOS=79, RBytes=1197056
    * Option 1: 6.239686088s
      * WIOS=34804, WBytes=1454845952, RIOS=79, RBytes=1197056
    * Option 2: 1m30.510934813s
      * WIOS=42143, WBytes=1471397888, RIOS=82, RBytes=1209344

  * Image: docker.io/tensorflow/tensorflow:latest (1.78 GiB, ~32590 Inodes)
    * Current: 8.852718042s
      * WIOS=39417, WBytes=2412818432, RIOS=2673, RBytes=335987712
    * Option 1: 9.683387174s
      * WIOS=42767, WBytes=2431750144, RIOS=89, RBytes=1238016
    * Option 2: 1m54.302103719s
      * WIOS=54403, WBytes=2460528640, RIOS=1709, RBytes=208237568

The Option 1 will increase `wios`. So, the `image_pull_with_sync_fs` is
option in CRI plugin.

[syncfs]: <https://man7.org/linux/man-pages/man2/syncfs.2.html>
[xfstest-dev]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git>
[generic/690]: <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/tests/generic/690?h=v2023.11.19>
[kernel-crash-recovery-data-integrity]: <https://lore.kernel.org/linux-fsdevel/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu/>
[test-tool]: <https://github.com/fuweid/go-dmflakey/blob/a17fb2010db22654b3e54cf506b0dbb5ef7b33ca/contrib/syncfs/containerd/main_test.go#L51>

Signed-off-by: Wei Fu <fuweid89@gmail.com>
(cherry picked from commit 23278c8)
Signed-off-by: Wei Fu <fuweid89@gmail.com>
  • Loading branch information
fuweid committed Feb 6, 2024
1 parent 4caf440 commit ea0a92e
Show file tree
Hide file tree
Showing 12 changed files with 62 additions and 4 deletions.
1 change: 1 addition & 0 deletions contrib/diffservice/service.go
Original file line number Diff line number Diff line change
Expand Up @@ -61,6 +61,7 @@ func (s *service) Apply(ctx context.Context, er *diffapi.ApplyRequest) (*diffapi
}
opts = append(opts, diff.WithPayloads(payloads))
}
opts = append(opts, diff.WithSyncFs(er.SyncFs))

ocidesc, err = s.applier.Apply(ctx, desc, mounts, opts...)
if err != nil {
Expand Down
2 changes: 1 addition & 1 deletion diff/apply/apply.go
Original file line number Diff line number Diff line change
Expand Up @@ -92,7 +92,7 @@ func (s *fsApplier) Apply(ctx context.Context, desc ocispec.Descriptor, mounts [
r: io.TeeReader(processor, digester.Hash()),
}

if err := apply(ctx, mounts, rc); err != nil {
if err := apply(ctx, mounts, rc, config.SyncFs); err != nil {
return emptyDesc, err
}

Expand Down
4 changes: 3 additions & 1 deletion diff/apply/apply_darwin.go
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ import (
"github.com/containerd/containerd/mount"
)

func apply(ctx context.Context, mounts []mount.Mount, r io.Reader) error {
func apply(ctx context.Context, mounts []mount.Mount, r io.Reader, _sync bool) error {
// We currently do not support mounts nor bind mounts on MacOS in the containerd daemon.
// Using this as an exception to enable native snapshotter and allow further research.
if len(mounts) == 1 && mounts[0].Type == "bind" {
Expand All @@ -38,6 +38,8 @@ func apply(ctx context.Context, mounts []mount.Mount, r io.Reader) error {
path := mounts[0].Source
_, err := archive.Apply(ctx, path, r, opts...)
return err

// TODO: Do we need to sync all the filesystems?
}

return mount.WithTempMount(ctx, mounts, func(root string) error {
Expand Down
30 changes: 29 additions & 1 deletion diff/apply/apply_linux.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,15 +20,18 @@ import (
"context"
"fmt"
"io"
"os"
"strings"

"github.com/containerd/containerd/archive"
"github.com/containerd/containerd/errdefs"
"github.com/containerd/containerd/mount"
"github.com/containerd/containerd/pkg/userns"

"golang.org/x/sys/unix"
)

func apply(ctx context.Context, mounts []mount.Mount, r io.Reader) error {
func apply(ctx context.Context, mounts []mount.Mount, r io.Reader, sync bool) (retErr error) {
switch {
case len(mounts) == 1 && mounts[0].Type == "overlay":
// OverlayConvertWhiteout (mknod c 0 0) doesn't work in userns.
Expand All @@ -50,6 +53,9 @@ func apply(ctx context.Context, mounts []mount.Mount, r io.Reader) error {
opts = append(opts, archive.WithParents(parents))
}
_, err = archive.Apply(ctx, path, r, opts...)
if err == nil && sync {
err = doSyncFs(path)
}
return err
case len(mounts) == 1 && mounts[0].Type == "aufs":
path, parents, err := getAufsPath(mounts[0].Options)
Expand All @@ -67,6 +73,14 @@ func apply(ctx context.Context, mounts []mount.Mount, r io.Reader) error {
}
_, err = archive.Apply(ctx, path, r, opts...)
return err
case sync && len(mounts) == 1 && mounts[0].Type == "bind":
defer func() {
if retErr != nil {
return
}

retErr = doSyncFs(mounts[0].Source)
}()
}
return mount.WithTempMount(ctx, mounts, func(root string) error {
_, err := archive.Apply(ctx, root, r)
Expand Down Expand Up @@ -130,3 +144,17 @@ func getAufsPath(options []string) (upper string, lower []string, err error) {
}
return
}

func doSyncFs(file string) error {
fd, err := os.Open(file)
if err != nil {
return fmt.Errorf("failed to open %s: %w", file, err)
}
defer fd.Close()

_, _, errno := unix.Syscall(unix.SYS_SYNCFS, fd.Fd(), 0, 0)
if errno != 0 {
return fmt.Errorf("failed to syncfs for %s: %w", file, errno)
}
return nil
}
3 changes: 2 additions & 1 deletion diff/apply/apply_other.go
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,8 @@ import (
"github.com/containerd/containerd/mount"
)

func apply(ctx context.Context, mounts []mount.Mount, r io.Reader) error {
func apply(ctx context.Context, mounts []mount.Mount, r io.Reader, _sync bool) error {
// TODO: for windows, how to sync?
return mount.WithTempMount(ctx, mounts, func(root string) error {
_, err := archive.Apply(ctx, root, r)
return err
Expand Down
10 changes: 10 additions & 0 deletions diff/diff.go
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,8 @@ type Comparer interface {
type ApplyConfig struct {
// ProcessorPayloads specifies the payload sent to various processors
ProcessorPayloads map[string]typeurl.Any
// SyncFs is to synchronize the underlying filesystem containing files
SyncFs bool
}

// ApplyOpt is used to configure an Apply operation
Expand Down Expand Up @@ -133,3 +135,11 @@ func WithSourceDateEpoch(tm *time.Time) Opt {
return nil
}
}

// WithSyncFs sets sync flag to the config.
func WithSyncFs(sync bool) ApplyOpt {
return func(_ context.Context, _ ocispec.Descriptor, c *ApplyConfig) error {
c.SyncFs = sync
return nil
}
}
1 change: 1 addition & 0 deletions diff/proxy/differ.go
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,7 @@ func (r *diffRemote) Apply(ctx context.Context, desc ocispec.Descriptor, mounts
Diff: fromDescriptor(desc),
Mounts: fromMounts(mounts),
Payloads: payloads,
SyncFs: config.SyncFs,
}
resp, err := r.client.Apply(ctx, req)
if err != nil {
Expand Down
8 changes: 8 additions & 0 deletions image.go
Original file line number Diff line number Diff line change
Expand Up @@ -336,6 +336,14 @@ func WithUnpackDuplicationSuppressor(suppressor kmutex.KeyedLocker) UnpackOpt {
}
}

// WithUnpackApplyOpts appends new apply options on the UnpackConfig.
func WithUnpackApplyOpts(opts ...diff.ApplyOpt) UnpackOpt {
return func(ctx context.Context, uc *UnpackConfig) error {
uc.ApplyOpts = append(uc.ApplyOpts, opts...)
return nil
}
}

func (i *image) Unpack(ctx context.Context, snapshotterName string, opts ...UnpackOpt) error {
ctx, done, err := i.client.WithLease(ctx)
if err != nil {
Expand Down
3 changes: 3 additions & 0 deletions pkg/cri/config/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -385,6 +385,9 @@ type PluginConfig struct {
//
// For example, the value can be '5h', '2h30m', '10s'.
DrainExecSyncIOTimeout string `toml:"drain_exec_sync_io_timeout" json:"drainExecSyncIOTimeout"`
// ImagePullWithSyncFs is an experimental setting. It's to force sync
// filesystem during unpacking to ensure that data integrity.
ImagePullWithSyncFs bool `toml:"image_pull_with_sync_fs" json:"imagePullWithSyncFs"`
}

// X509KeyPairStreaming contains the x509 configuration for streaming
Expand Down
1 change: 1 addition & 0 deletions pkg/cri/config/config_unix.go
Original file line number Diff line number Diff line change
Expand Up @@ -109,5 +109,6 @@ func DefaultConfig() PluginConfig {
CDISpecDirs: []string{"/etc/cdi", "/var/run/cdi"},
ImagePullProgressTimeout: defaultImagePullProgressTimeoutDuration.String(),
DrainExecSyncIOTimeout: "0s",
ImagePullWithSyncFs: false,
}
}
2 changes: 2 additions & 0 deletions pkg/cri/server/image_pull.go
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@ import (
runtime "k8s.io/cri-api/pkg/apis/runtime/v1"

"github.com/containerd/containerd"
"github.com/containerd/containerd/diff"
"github.com/containerd/containerd/errdefs"
containerdimages "github.com/containerd/containerd/images"
"github.com/containerd/containerd/log"
Expand Down Expand Up @@ -168,6 +169,7 @@ func (c *criService) PullImage(ctx context.Context, r *runtime.PullImageRequest)
containerd.WithImageHandler(imageHandler),
containerd.WithUnpackOpts([]containerd.UnpackOpt{
containerd.WithUnpackDuplicationSuppressor(c.unpackDuplicationSuppressor),
containerd.WithUnpackApplyOpts(diff.WithSyncFs(c.config.ImagePullWithSyncFs)),
}),
}

Expand Down
1 change: 1 addition & 0 deletions services/diff/local.go
Original file line number Diff line number Diff line change
Expand Up @@ -109,6 +109,7 @@ func (l *local) Apply(ctx context.Context, er *diffapi.ApplyRequest, _ ...grpc.C
}
opts = append(opts, diff.WithPayloads(payloads))
}
opts = append(opts, diff.WithSyncFs(er.SyncFs))

for _, differ := range l.differs {
ocidesc, err = differ.Apply(ctx, desc, mounts, opts...)
Expand Down

0 comments on commit ea0a92e

Please sign in to comment.