InstaSlice facilitates the use of Dynamic Resource Allocation (DRA) on Kubernetes clusters for GPU sharing.
In its initial release, InstaSlice supports the allocation of MIG slices on NVIDIA A100 GPUs. It makes it possible to deploy pods with MIG slice requirements expressed as extended resources to a DRA-enabled cluster. In particular, it enables cluster administrators to transparently replace the MIG manager from the NVIDIA GPU operator with the NVIDIA DRA driver without requiring changes to pod specs.
See this demonstration for a detailed comparison of MIG slicing using MIG manager vs. DRA driver vs. InstaSlice.
InstaSlice implements a mutating webhook for pods that automatically rewrites resource limits on containers into DRA resource claims. For instance, at creation time, InstaSlice rewrites the following pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: sample
spec:
  restartPolicy: Never
  containers:
  - name: busybox
    image: quay.io/project-codeflare/busybox:1.36
    command: ["sh", "-c", "sleep 5"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1
into the following pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: sample
spec:
  restartPolicy: Never
  containers:
  - name: busybox
    image: quay.io/project-codeflare/busybox:1.36
    command: ["sh", "-c", "sleep 5"]
    resources:
      claims:
      - name: ae9a7e7e-e955-4870-859c-12b83927b2bd
  resourceClaims:
  - name: ae9a7e7e-e955-4870-859c-12b83927b2bd
    source:
      resourceClaimTemplateName: mig-1g.5gb
The latter spec assumes the following resource claim templates and parameters are already deployed to the pod namespace:
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: MigDeviceClaimParameters
metadata:
  name: mig-1g.5gb
spec:
  profile: 1g.5gb
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: mig-1g.5gb
spec:
  spec:
    resourceClassName: gpu.nvidia.com
    parametersRef:
      apiGroup: gpu.resource.nvidia.com
      kind: MigDeviceClaimParameters
      name: mig-1g.5gb
The deployment instructions below cover this prerequisite.
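The rewrite described above can be sketched in a few lines of Python. This is only an illustration modeled on the two pod specs shown, not InstaSlice's actual webhook code. The function name `rewrite_pod`, the use of random UUIDs as claim names, and the handling of counts greater than one (one claim per requested slice) are assumptions.

```python
import copy
import uuid

MIG_PREFIX = "nvidia.com/mig-"


def rewrite_pod(pod, new_name=lambda: str(uuid.uuid4())):
    """Return a copy of `pod` with MIG limits replaced by DRA resource claims.

    Hypothetical helper: it mirrors the before/after specs in this README
    and makes no claim about how the real webhook is implemented.
    """
    pod = copy.deepcopy(pod)  # leave the caller's object untouched
    spec = pod["spec"]
    for container in spec.get("containers", []):
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        for key in [k for k in limits if k.startswith(MIG_PREFIX)]:
            count = int(limits.pop(key))
            # "nvidia.com/mig-1g.5gb" -> template name "mig-1g.5gb"
            template = key[len("nvidia.com/"):]
            for _ in range(count):  # assumption: one claim per requested slice
                claim = new_name()
                resources.setdefault("claims", []).append({"name": claim})
                spec.setdefault("resourceClaims", []).append(
                    {
                        "name": claim,
                        "source": {"resourceClaimTemplateName": template},
                    }
                )
        if not limits:
            # drop the now-empty limits section, as in the rewritten spec
            resources.pop("limits", None)
    return pod


if __name__ == "__main__":
    sample = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "sample"},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "busybox",
                    "resources": {"limits": {"nvidia.com/mig-1g.5gb": 1}},
                }
            ],
        },
    }
    print(rewrite_pod(sample))
```

The claim name generator is passed in as a parameter so that the rewrite can be exercised deterministically in tests.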
InstaSlice assumes a DRA-enabled Kubernetes cluster. It has been tested against Kubernetes v1.27.
For development or testing purposes, InstaSlice can run on a cluster without GPUs with a minimal configuration (option 1). In order to run pods on MIG slices, a GPU-enabled, DRA-enabled cluster running the NVIDIA DRA driver is necessary (option 2).
A cluster capable of running InstaSlice can be obtained using kind v0.19 with the provided cluster configuration.
kind create cluster --config hack/kind-config.yaml
InstaSlice assumes CRDs from the NVIDIA DRA driver are installed on the cluster:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-dra-driver/b6c7aae2b87d857668f417689462da752090406f/deployments/helm/k8s-dra-driver/crds/gpu.resource.nvidia.com_migdeviceclaimparameters.yaml
On such a cluster, InstaSlice can rewrite pod specs, but the cluster cannot satisfy GPU resource claims, so pods remain pending indefinitely.
In order to dynamically create and destroy MIG slices on NVIDIA GPUs, a
GPU-enabled, DRA-enabled cluster running the NVIDIA DRA driver is necessary.
Please refer to
https://github.com/NVIDIA/k8s-dra-driver
for further instructions. Please note that InstaSlice has been developed and
tested against commit b6c7aae
of this driver.
InstaSlice assumes cert-manager is deployed on the cluster:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.3/cert-manager.yaml
A prebuilt InstaSlice image is available from quay.io/ibm/instaslice.
To build and push an InstaSlice image run:
make docker-build docker-push IMG=<some-registry>/instaslice:<some-tag>
Alternatively, to build and push a multi-architecture InstaSlice image run:
make docker-buildx IMG=<some-registry>/instaslice:<some-tag>
To deploy InstaSlice on the Kubernetes cluster, run the command below, replacing the image name if you built your own image:
make deploy IMG=quay.io/ibm/instaslice:latest
InstaSlice relies on preconfigured resource claim templates. These templates must be deployed to each namespace where pods using InstaSlice will be deployed.
To deploy the templates to a given namespace run:
kubectl apply -f hack/mig-profiles.yaml --namespace <some-namespace>
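For illustration, the shape of such templates can be generated from a MIG profile name. The sketch below follows the `mig-1g.5gb` example shown earlier in this README; it is not the content of `hack/mig-profiles.yaml`, which is not reproduced here, and the helper name `render_mig_templates` is hypothetical.

```python
def render_mig_templates(profile: str) -> str:
    """Render claim parameters and a claim template for one MIG profile.

    Hypothetical helper modeled on the mig-1g.5gb example in this README.
    """
    name = f"mig-{profile}"
    return f"""\
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: MigDeviceClaimParameters
metadata:
  name: {name}
spec:
  profile: {profile}
---
apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: {name}
spec:
  spec:
    resourceClassName: gpu.nvidia.com
    parametersRef:
      apiGroup: gpu.resource.nvidia.com
      kind: MigDeviceClaimParameters
      name: {name}
"""


if __name__ == "__main__":
    # Two A100 40GB profiles, chosen only as examples.
    profiles = ["1g.5gb", "2g.10gb"]
    print("---\n".join(render_mig_templates(p) for p in profiles))
```

The printed YAML could then be piped to `kubectl apply -f - --namespace <some-namespace>`.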
To deploy an example pod on the cluster run:
kubectl apply -f samples/sample.yaml
Check the resulting pod spec using:
kubectl get -o yaml pod sample
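The check can also be scripted. The sketch below assumes the pod is fetched as JSON (for instance with `kubectl get -o json pod sample`) and parsed into a dictionary; the function name `is_rewritten` and the exact conditions it tests are assumptions about what a successful rewrite looks like, based on the example specs above.

```python
MIG_PREFIX = "nvidia.com/mig-"


def is_rewritten(pod: dict) -> bool:
    """Return True if the pod uses resource claims instead of MIG limits.

    Hypothetical check: no container still lists a MIG extended resource,
    and every container claim refers to a pod-level resource claim.
    """
    spec = pod.get("spec", {})
    pod_claims = {claim["name"] for claim in spec.get("resourceClaims", [])}
    if not pod_claims:
        return False
    for container in spec.get("containers", []):
        resources = container.get("resources", {})
        for section in ("limits", "requests"):
            if any(k.startswith(MIG_PREFIX) for k in resources.get(section, {})):
                return False
        for claim in resources.get("claims", []):
            if claim["name"] not in pod_claims:
                return False
    return True


if __name__ == "__main__":
    original = {
        "spec": {
            "containers": [
                {"resources": {"limits": {"nvidia.com/mig-1g.5gb": 1}}}
            ]
        }
    }
    print(is_rewritten(original))
```

To run it against a live cluster, the dictionary can be obtained with `json.load(sys.stdin)` while piping in the output of `kubectl get -o json pod sample`.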
Delete the pod with:
kubectl delete -f samples/sample.yaml
To uninstall InstaSlice from the cluster run:
make undeploy
Copyright 2024 IBM Corporation.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.