
proxy: Switch from proxy.ttl=0 to proxy.ttl > 0 leaks blobs that never get garbage collected #4249

Open
ialidzhikov opened this issue Jan 10, 2024 · 2 comments

Comments


Description

We are providing a service that deploys and manages pull through caches.
We are using the storage.delete.enabled option to denote whether the "garbage collection of images" for the pull through cache is enabled or not.
This setting on our side is mutable - users can control whether the GC (under the hood the storage.delete.enabled option) is enabled or not.
However, there is a case where the switch between the two modes doesn't behave as expected. If you run a proxy with storage.delete.enabled=false, it still adds entries to the scheduler-state.json file. After the ttl has passed, it tries to remove the corresponding blobs/manifests. With storage.delete.enabled=false the blob doesn't get removed, but the corresponding entry in scheduler-state.json is removed anyway. Hence, the blob can never be garbage collected, even if the proxy is later restarted with storage.delete.enabled=true.

Reproduce

  1. Run a registry in proxy mode, with an upstream configured and with GC disabled (storage.delete.enabled=false). The following config is used:

  2. Pull the alpine:3.14.0 image from the registry proxy.

# ctr -n k8s.io images pull docker.io/library/alpine:3.14.0

Make sure your scheduler-state.json file looks similar to:

{
    "library/alpine@sha256:1dc785547989b0db1c3cd9949c57574393e69bea98bfe044b0588e24721aa402": {
        "Key": "library/alpine@sha256:1dc785547989b0db1c3cd9949c57574393e69bea98bfe044b0588e24721aa402",
        "ExpiryData": "2024-01-17T10:40:19.747823626Z",
        "EntryType": 0
    },
    "library/alpine@sha256:51b67269f354137895d43f3b3d810bfacd3945438e94dc5ac55fdac340352f48": {
        "Key": "library/alpine@sha256:51b67269f354137895d43f3b3d810bfacd3945438e94dc5ac55fdac340352f48",
        "ExpiryData": "2024-01-17T10:40:18.48521875Z",
        "EntryType": 1
    },
    "library/alpine@sha256:a70bcfbd89c9620d4085f6bc2a3e2eef32e8f3cdf5a90e35a1f95dcbd7f71548": {
        "Key": "library/alpine@sha256:a70bcfbd89c9620d4085f6bc2a3e2eef32e8f3cdf5a90e35a1f95dcbd7f71548",
        "ExpiryData": "2024-01-17T10:40:18.959229542Z",
        "EntryType": 1
    },
    "library/alpine@sha256:c303524923177661067f7eb378c3dd5277088c2676ebd1cd78e68397bb80fdbf": {
        "Key": "library/alpine@sha256:c303524923177661067f7eb378c3dd5277088c2676ebd1cd78e68397bb80fdbf",
        "ExpiryData": "2024-01-17T10:40:20.352161126Z",
        "EntryType": 0
    }
}
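As an aside, the EntryType values in this file can be decoded by matching them against the scheduler source quoted further below: assuming the usual iota ordering (entryTypeBlob = 0, entryTypeManifest = 1), a minimal sketch for inspecting a state file could look like this. The struct and function names here are hypothetical, not the registry's own:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// schedulerEntry mirrors the on-disk shape of scheduler-state.json entries
// as shown above. Field names match the JSON keys.
type schedulerEntry struct {
	Key        string    `json:"Key"`
	ExpiryData time.Time `json:"ExpiryData"`
	EntryType  int       `json:"EntryType"`
}

// countEntries classifies entries, assuming EntryType 0 = blob, 1 = manifest.
func countEntries(state []byte) (blobs, manifests int, err error) {
	var entries map[string]schedulerEntry
	if err = json.Unmarshal(state, &entries); err != nil {
		return 0, 0, err
	}
	for _, e := range entries {
		switch e.EntryType {
		case 0:
			blobs++
		case 1:
			manifests++
		}
	}
	return blobs, manifests, nil
}

func main() {
	// Abbreviated stand-in for the scheduler-state.json content above.
	state := []byte(`{
  "library/alpine@sha256:aaa": {"Key": "library/alpine@sha256:aaa", "ExpiryData": "2024-01-17T10:40:19Z", "EntryType": 0},
  "library/alpine@sha256:bbb": {"Key": "library/alpine@sha256:bbb", "ExpiryData": "2024-01-17T10:40:18Z", "EntryType": 1}
}`)
	blobs, manifests, err := countEntries(state)
	if err != nil {
		panic(err)
	}
	fmt.Printf("blobs=%d manifests=%d\n", blobs, manifests)
}
```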
  3. Edit the scheduler-state.json file: set one of the blob layers to expire soon by updating its ExpiryData field. Let's do it for the library/alpine@sha256:1dc785547989b0db1c3cd9949c57574393e69bea98bfe044b0588e24721aa402 blob.

Then restart the proxy. Once the expiry time has passed, it should try to delete the blob:

time="2024-01-10T10:45:19.748260501Z" level=error msg="Scheduler error returned from OnExpire(library/alpine@sha256:1dc785547989b0db1c3cd9949c57574393e69bea98bfe044b0588e24721aa402): operation unsupported" go.version=go1.20.8 instance.id=1fceb0ff-4044-4e41-8119-0584e06cdba4 service=registry version=2.8.3

As expected, the blob is not removed because of the storage.delete.enabled=false option.
However, the corresponding entry does get removed from scheduler-state.json:

{
    "library/alpine@sha256:51b67269f354137895d43f3b3d810bfacd3945438e94dc5ac55fdac340352f48": {
        "Key": "library/alpine@sha256:51b67269f354137895d43f3b3d810bfacd3945438e94dc5ac55fdac340352f48",
        "ExpiryData": "2024-01-17T10:40:18.48521875Z",
        "EntryType": 1
    },
    "library/alpine@sha256:a70bcfbd89c9620d4085f6bc2a3e2eef32e8f3cdf5a90e35a1f95dcbd7f71548": {
        "Key": "library/alpine@sha256:a70bcfbd89c9620d4085f6bc2a3e2eef32e8f3cdf5a90e35a1f95dcbd7f71548",
        "ExpiryData": "2024-01-17T10:40:18.959229542Z",
        "EntryType": 1
    },
    "library/alpine@sha256:c303524923177661067f7eb378c3dd5277088c2676ebd1cd78e68397bb80fdbf": {
        "Key": "library/alpine@sha256:c303524923177661067f7eb378c3dd5277088c2676ebd1cd78e68397bb80fdbf",
        "ExpiryData": "2024-01-17T10:40:20.352161126Z",
        "EntryType": 0
    }
}

At this point the blob leaks on the file system, and the proxy can no longer garbage collect it, even if storage.delete.enabled is set to true in the future.
The corresponding source code is:

func (ttles *TTLExpirationScheduler) startTimer(entry *schedulerEntry, ttl time.Duration) *time.Timer {
	return time.AfterFunc(ttl, func() {
		ttles.Lock()
		defer ttles.Unlock()

		var f expiryFunc
		switch entry.EntryType {
		case entryTypeBlob:
			f = ttles.onBlobExpire
		case entryTypeManifest:
			f = ttles.onManifestExpire
		default:
			f = func(reference.Reference) error {
				return fmt.Errorf("scheduler entry type")
			}
		}

		ref, err := reference.Parse(entry.Key)
		if err == nil {
			if err := f(ref); err != nil {
				dcontext.GetLogger(ttles.ctx).Errorf("Scheduler error returned from OnExpire(%s): %s", entry.Key, err)
			}
		} else {
			dcontext.GetLogger(ttles.ctx).Errorf("Error unpacking reference: %s", err)
		}

		delete(ttles.entries, entry.Key)
		ttles.indexDirty = true
	})
}

Note that the entry is deleted from ttles.entries unconditionally, even when the expiry callback f returns an error.
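One possible direction (a sketch only, not a tested patch against the actual TTLExpirationScheduler) would be to keep the entry in the scheduler state when the expiry callback fails, so it can be retried after a restart with deletion enabled. A simplified standalone illustration of that pattern, with all names hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// scheduler is a drastically simplified stand-in for TTLExpirationScheduler:
// just a map from entry key to a reference string.
type scheduler struct {
	entries map[string]string
}

// errUnsupported mimics the "operation unsupported" error returned when
// storage.delete.enabled=false.
var errUnsupported = errors.New("operation unsupported")

// onExpire runs the expiry callback and, unlike the current code, only drops
// the entry from the state when the callback succeeds.
func (s *scheduler) onExpire(key string, expire func(string) error) {
	if err := expire(key); err != nil {
		fmt.Printf("Scheduler error returned from OnExpire(%s): %s\n", key, err)
		return // keep the entry so the blob can be retried later
	}
	delete(s.entries, key)
}

func main() {
	s := &scheduler{entries: map[string]string{
		"library/alpine@sha256:1dc7": "blob",
	}}

	// Deletion disabled: the callback fails, so the entry must survive.
	s.onExpire("library/alpine@sha256:1dc7", func(string) error { return errUnsupported })
	fmt.Println("entries left:", len(s.entries))

	// Deletion enabled: the callback succeeds, so the entry is removed.
	s.onExpire("library/alpine@sha256:1dc7", func(string) error { return nil })
	fmt.Println("entries left:", len(s.entries))
}
```

With this change, an entry whose deletion fails stays in scheduler-state.json and is rescheduled on the next restart, instead of silently leaking.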

Expected behavior

I would expect the switch from storage.delete.enabled=false to storage.delete.enabled=true to garbage collect all existing blobs/manifests at time.Now() + ttl, where time.Now() is the moment of the switch to storage.delete.enabled=true (the enablement of garbage collection).

registry version

# registry --version
registry github.com/docker/distribution 2.8.3

Additional Info

No response

@ialidzhikov

I also wanted to discuss the storage.delete.enabled field. I feel it is not designed for the purpose we use it for. The proxy source code gives the impression that it is designed to always run with GC enabled.
If the proxy has a strict requirement that storage.delete.enabled is always true, then this could be validated on startup, and the proxy could fail to start if the field is set to false.
On the other hand, I think it is useful to have a setting that allows controlling whether GC is enabled or not.
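The stricter alternative above amounts to a one-line startup check. A minimal sketch, with a hypothetical validate function standing in for the registry's config validation:

```go
package main

import (
	"errors"
	"fmt"
)

// validate fails fast when the registry runs as a pull-through proxy but
// blob/manifest deletion is disabled, instead of leaking blobs at runtime.
func validate(proxyEnabled, deleteEnabled bool) error {
	if proxyEnabled && !deleteEnabled {
		return errors.New("proxy requires storage.delete.enabled=true")
	}
	return nil
}

func main() {
	fmt.Println(validate(true, false)) // misconfigured proxy: startup error
	fmt.Println(validate(true, true))  // valid proxy configuration
}
```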

@ialidzhikov

For v3.0.0, where the ttl is configurable, the same issue can be stated as:
proxy: Switch from proxy.ttl=0 to proxy.ttl > 0 leaks blobs that never get garbage collected

@ialidzhikov ialidzhikov changed the title proxy: Switch from storage.delete.enabled=false to storage.delete.enabled=true leaks blobs that never get garbage collected proxy: Switch from proxy.ttl=0 to proxy.ttl > 0 leaks blobs that never get garbage collected Feb 8, 2024