[FEATURE]: Deploy waitFor option #5790

Open
salotz opened this issue Feb 28, 2024 · 0 comments

Feature Request

Background / Motivation

TL;DR: some Kubernetes resources are created immediately and return successfully from kubectl apply but actually require some time to process. Sometimes this is okay, since much of Kubernetes is designed to wait for things to become available (although it makes for messy logs), but some things do not work this way, and you absolutely must not deploy a resource until some status on another resource reaches a desired state.

It would be good to be able to express this directly in Garden Deploy action specs.
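
As a quick illustration of the general pattern (the snapshot name and file here are hypothetical), kubectl apply returns as soon as the API server accepts the object, even though the resource may not be usable yet:

# apply succeeds immediately...
$ kubectl apply -f snapshot.yaml
volumesnapshot.snapshot.storage.k8s.io/my-snapshot created

# ...but the resource is still being processed in the background
$ kubectl get volumesnapshot my-snapshot -o jsonpath='{.status.readyToUse}'
false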


I am deploying read-only PVCs on GKE, which requires you to first generate a read/write PVC and then derive a new PVC from it with a read-only accessMode. It seems that the best/most efficient way to do this is with VolumeSnapshot resources in Kubernetes (which are supported by GKE).

My "big picture" goal is to download data used by my test suite from a bucket (the data is not in the git repo) and make it available in the cluster for test runner pods to mount. Because there can be many of these pods, the volume needs to support multiple readers.

I had the following Garden actions to do this (roughly):

---
kind: Deploy
type: kubernetes
name: test-data-init-pvc
description: |
  Persistent volume claim that will be written to. Read only snapshots
  will be generated from this.

spec:
  files:
    - ./dev/k8s/acceptance-tests/test-data-init-pvc.yaml
---
kind: Run
name: rclone-copy-test-data
type: kubernetes-exec
description: |
  Action to actually perform test data copy to in-cluster volume.

dependencies:
  - deploy.test-data-init-pvc
  [...]
  command:
    - rclone
    - sync
    - my-bucket:${var.test_data_bucket_name}
    - /app/test_data

---
kind: Deploy
name: test-data-snapshot
type: kubernetes
description: |
  Deploy the snapshot resource to snapshot our initialized data volume.

disabled: ${environment.name == 'local'}

dependencies:
  - run.rclone-copy-test-data

include:
  - ./dev/k8s/acceptance-tests

spec:
  files:
    - ./dev/k8s/acceptance-tests/test-data-snapshot.yaml


---
kind: Deploy
name: test-data-pvc
type: kubernetes
description: |
  Read only PVC of the test data, built from the snapshot.

disabled: ${environment.name == 'local'}

dependencies:
  - deploy.test-data-snapshot

include:
  - ./dev/k8s/acceptance-tests
spec:
  files:
    - ./dev/k8s/acceptance-tests/test-data-readonly-pvc.yaml

And the associated manifests (jammed into a single document):

---
apiVersion: v1
kind: PersistentVolumeClaim

metadata:
  name: test-data-pvc-init
  labels:
    app: test-data-init

spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${var.test_data_storage_size}

---
# initialize the snapshot class to use
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: test-data-snapshotclass
driver: pd.csi.storage.gke.io
deletionPolicy: Delete

---
# the actual snapshot spec
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-data-snapshot
spec:
  volumeSnapshotClassName: test-data-snapshotclass
  source:
    persistentVolumeClaimName: test-data-pvc-init
      
---
## PVC for read only access to the test data (via snapshot)
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: test-data-pvc-ro
spec:
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: test-data-snapshot
  accessModes:
    - ReadOnlyMany
  storageClassName: premium-rwo
  resources:
    requests:
      storage: ${var.test_data_storage_size}

The (subtle) problem I ran into is that the test-data-pvc deploy runs immediately after test-data-snapshot. The VolumeSnapshot resource is created successfully, but the backend processing that generates the VolumeSnapshotContents resource and marks the snapshot "ready" takes a while.

If you "restore" the snapshot by binding a PVC to it before .status.readyToUse is true, you will create a new volume in an inconsistent state that does not correct itself when the snapshot finishes. You must wait until the snapshot is finished before binding the PVC.

At the terminal I can run this to wait:

kubectl -n $ns wait --for=jsonpath='{.status.readyToUse}'=true VolumeSnapshot/my-snapshot
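
Note that kubectl wait defaults to a 30 second timeout, so for a snapshot of a larger volume you would probably also want to raise that, e.g.:

kubectl -n $ns wait --timeout=10m --for=jsonpath='{.status.readyToUse}'=true VolumeSnapshot/my-snapshot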

And so I can use a Garden Run action to create a gate:

---
kind: Run
name: test-data-wait-snapshot
type: exec
description: |
  Run 'kubectl wait' to wait for the snapshot to be ready before proceeding.

disabled: ${environment.name == 'local'}
dependencies:
  - deploy.test-data-snapshot

spec:
  command:
    - kubectl
    - "-n=${environment.namespace}"
    - wait
    - --for=jsonpath={.status.readyToUse}=true
    - VolumeSnapshot/test-data-snapshot

This works, but it is pretty verbose and requires using an exec action type, which reduces portability (it depends on kubectl being installed and configured on whatever machine runs Garden).

What should the user be able to do?

As part of a Deploy action, you should be able to specify values (something similar to kubectl wait) that Garden will wait for, in addition to the normal waiting it does (I'm not sure exactly how that is implemented).

For instance, the above snapshot action might look like:

---
kind: Deploy
name: test-data-snapshot
type: kubernetes
description: |
  Deploy the snapshot resource to snapshot our initialized data volume.

disabled: ${environment.name == 'local'}

dependencies:
  - run.rclone-copy-test-data

include:
  - ./dev/k8s/acceptance-tests

spec:
  files:
    - ./dev/k8s/acceptance-tests/test-data-snapshot.yaml

  waitFor:
    - resource:
        kind: VolumeSnapshot
        name: test-data-snapshot
      for:
        jsonpath: '{.status.readyToUse}'
        targetValue: 'true'
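
With something like this in place, the downstream test-data-pvc Deploy could depend on the snapshot Deploy directly, with no gating Run action in between (a sketch against the hypothetical waitFor spec above):

---
kind: Deploy
name: test-data-pvc
type: kubernetes

disabled: ${environment.name == 'local'}

dependencies:
  # safe now: the snapshot Deploy would not complete until readyToUse is true
  - deploy.test-data-snapshot

spec:
  files:
    - ./dev/k8s/acceptance-tests/test-data-readonly-pvc.yaml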

Why do they want to do this? What problem does it solve?

Many Kubernetes resources have an extra component of eventual consistency, which can make writing reproducible Garden configs difficult. This isn't the only example of this I have run into over time, just the most difficult.

Suggested Implementation(s)

You could directly use kubectl wait with the kubectl bundled with Garden, or you could implement it in-cluster via a pod that queries the Kubernetes API directly.
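
For the first option, a waitFor entry like the one above would translate fairly mechanically into a kubectl wait invocation (a sketch; the exact flags and timeout plumbing are assumptions on my part):

kubectl -n ${environment.namespace} wait \
  --for=jsonpath={.status.readyToUse}=true \
  --timeout=300s \
  VolumeSnapshot/test-data-snapshot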

How important is this feature for you/your team?

Somewhere between these two, since there is a reasonable workaround:

🌵 Not having this feature makes using Garden painful

🌹 It’s a nice to have, but nice things are nice 🙂
