Serving Large Language Models on Google Kubernetes Engine (GKE) using NVIDIA Triton Inference Server with FasterTransformer
This repository compiles prescriptive guidance and code samples to serve Large Language Models (LLMs) such as Unified Language Learner (UL2) on a Google Kubernetes Engine (GKE) cluster with GPUs running NVIDIA Triton Inference Server with the FasterTransformer backend.
- NVIDIA Triton Inference Server is an open-source inference serving solution from NVIDIA that simplifies and standardizes the inference serving process, supporting multiple frameworks and optimizations for both CPUs and GPUs.
- The NVIDIA FasterTransformer library implements an accelerated engine for inference of transformer-based models, spanning multiple GPUs and nodes in a distributed manner.
The solution provides a standardized Terraform template to deploy Triton Inference Server on GKE and integrate it with other Google Cloud managed services.
The solution demonstrates deploying the UL2 (20B parameter) model on a GKE cluster with GPUs. Assuming JAX-based checkpoints of a pre-trained or fine-tuned UL2 model are available, the workflow has the following steps:
- Set up the environment to run Triton server on a GKE cluster
- Convert the JAX checkpoint to a FasterTransformer checkpoint
- Serve the resulting model on GPUs using NVIDIA Triton Inference Server with the FasterTransformer backend
- Run the evaluation script with test instances to compute model evaluation metrics
You have the following ways to access JAX-based checkpoints for UL2:
- You can find pre-trained UL2 checkpoints here.
- We fine-tuned the UL2 model with the XSum dataset and made the checkpoints available in a Google Cloud Storage bucket at gs://se-checkpoints/ul2-xsum/.
NOTE: You can refer to the following solution accelerator to fine-tune the UL2 model with custom datasets using the T5X framework, which creates JAX-based checkpoints.
cd ~
git clone https://github.com/RajeshThallam/fastertransformer-converter
cd ~/fastertransformer-converter
Follow the environment setup guide here to create a GKE cluster running NVIDIA Triton on a GPU node pool using the standardized Terraform template. The setup performs the following steps:
- Enable APIs
- Run Terraform to provision the required resources
- Deploy Ingress Gateway
- Deploy NVIDIA GPU drivers
- Configure and deploy Triton Inference Server
- Run health check to validate the Triton deployment
As part of the environment setup, you configure the following environment variables:
export PROJECT_ID=my-project-id
export REGION=us-central1
export ZONE=us-central1-a
export NETWORK_NAME=my-gke-network
export SUBNET_NAME=my-gke-subnet
export GCS_BUCKET_NAME=my-triton-repository
export GKE_CLUSTER_NAME=my-ft-gke
export TRITON_SA_NAME=triton-sa
export TRITON_NAMESPACE=triton
export MACHINE_TYPE=a2-highgpu-1g
export ACCELERATOR_TYPE=nvidia-tesla-a100
export ACCELERATOR_COUNT=1
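Optionally, before provisioning, you can confirm that the configured zone offers the selected accelerator type. This is a generic gcloud check using the variables set above:
# Check that the selected accelerator type is available in the configured zone
gcloud compute accelerator-types list --filter="zone:${ZONE}"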
After provisioning the GKE cluster, configure access to the cluster:
gcloud container clusters get-credentials ${GKE_CLUSTER_NAME} --project ${PROJECT_ID} --zone ${ZONE}
kubectl create clusterrolebinding cluster-admin-binding --clusterrole cluster-admin --user "$(gcloud config get-value account)"
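As a quick check that kubectl is pointed at the new cluster and that the GPU node pool exposes GPUs (the nvidia.com/gpu resource shows up once the NVIDIA drivers from the setup are deployed), you can run:
# List the cluster nodes
kubectl get nodes
# Confirm GPUs are advertised as allocatable resources
kubectl describe nodes | grep -i "nvidia.com/gpu"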
The checkpoint format conversion from JAX to NVIDIA FasterTransformer runs on the GKE cluster as a Kubernetes Job on the GPU node pool. The source code for the conversion script is located here.
- Create Docker repository in Google Artifact Registry to manage images
# Configure parameters
export DOCKER_ARTIFACT_REPO=llm-inference # <-- Change to your repo name
# Enable API
gcloud services enable artifactregistry.googleapis.com
# Create repository
gcloud artifacts repositories create ${DOCKER_ARTIFACT_REPO} \
--repository-format=docker \
--location=${REGION} \
--description="Triton Docker repository"
# Authenticate to repository
gcloud auth configure-docker ${REGION}-docker.pkg.dev --quiet
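Optionally, confirm that the repository was created:
gcloud artifacts repositories describe ${DOCKER_ARTIFACT_REPO} --location=${REGION}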
- Build container image
# Configure container image name
export JAX_TO_FT_IMAGE_NAME="jax-to-fastertransformer"
export JAX_TO_FT_IMAGE_URI=${REGION}"-docker.pkg.dev/"${PROJECT_ID}"/"${DOCKER_ARTIFACT_REPO}"/"${JAX_TO_FT_IMAGE_NAME}
# Change directory
cd ~/fastertransformer-converter
# Run Cloud Build job to build the container image
export FILE_LOCATION="./converter"
gcloud builds submit \
--region ${REGION} \
--config converter/cloudbuild.yaml \
--substitutions _IMAGE_URI=${JAX_TO_FT_IMAGE_URI},_FILE_LOCATION=${FILE_LOCATION} \
--timeout "2h" \
--machine-type=e2-highcpu-32 \
--quiet
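Once the Cloud Build job finishes, you can confirm that the converter image landed in the repository:
# List images in the Artifact Registry repository
gcloud artifacts docker images list ${REGION}-docker.pkg.dev/${PROJECT_ID}/${DOCKER_ARTIFACT_REPO}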
- Copy UL2 checkpoints to your Google Cloud Storage bucket
gcloud storage cp -r gs://se-checkpoints/ul2-xsum/ gs://${GCS_BUCKET_NAME}/models/
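You can verify the copy by listing the checkpoint files in your bucket:
gcloud storage ls gs://${GCS_BUCKET_NAME}/models/ul2-xsum/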
- Run the conversion job, starting with configuring the job parameters
cd ~/fastertransformer-converter/env-setup/kustomize
cat << EOF > ~/fastertransformer-converter/converter/configs.env
ksa=triton-sa
converter_image_uri=$JAX_TO_FT_IMAGE_URI
accelerator_count=1
model_name=ul2
gcs_jax_ckpt=gs://${GCS_BUCKET_NAME}/models/ul2-xsum/
gcs_ft_ckpt=gs://${GCS_BUCKET_NAME}/triton_model_repository/ul2-ft/
EOF
- Render the configuration to review the generated manifests
kubectl kustomize ./
- Run the job
kubectl apply -k ./
- Monitor the job by tailing its logs (replace the placeholder with the Job name generated by the kustomize configuration)
kubectl logs -f job/<conversion-job-name>
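When the job completes, the converted FasterTransformer checkpoint should appear at the output path set in configs.env; you can confirm it with:
gcloud storage ls gs://${GCS_BUCKET_NAME}/triton_model_repository/ul2-ft/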
- Pull NVIDIA NeMo Inference container with NVIDIA Triton and FasterTransformer
- Sign in to NGC and select the organization as ea-participants
- Get an API key from NGC to authorize the Docker client to access the NGC Private Registry
- Authorize Docker
docker login nvcr.io
Username: $oauthtoken
Password: <your NGC API key>
- Push the container to your Docker repository in Cloud Artifact Registry
export TRITON_FT_IMAGE_URI=${REGION}"-docker.pkg.dev/"${PROJECT_ID}"/"${DOCKER_ARTIFACT_REPO}"/bignlp-inference:22.08-py3"
docker pull nvcr.io/ea-bignlp/bignlp-inference:22.08-py3
docker tag nvcr.io/ea-bignlp/bignlp-inference:22.08-py3 ${TRITON_FT_IMAGE_URI}
docker push ${TRITON_FT_IMAGE_URI}
- Configure NVIDIA Triton Deployment parameters
cd ~/triton-on-gke-sandbox/env-setup/kustomize
cat << EOF > ~/triton-on-gke-sandbox/env-setup/kustomize/configs.env
model_repository=gs://${GCS_BUCKET_NAME}/triton_model_repository/ul2-ft/
ksa=${TRITON_SA_NAME}
EOF
- Update NVIDIA Triton container image
kustomize edit set image "nvcr.io/nvidia/tritonserver:22.01-py3="${TRITON_FT_IMAGE_URI}
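The kustomize edit set image command records the override in kustomization.yaml; you can inspect the file to confirm the Triton image now points at your Artifact Registry copy:
# Confirm the image override written by kustomize
grep -A 3 "images:" kustomization.yaml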
- Render the configuration to review the generated manifests
kubectl kustomize ./
- Deploy to the cluster
kubectl apply -k ./
- Run health check
To validate that NVIDIA Triton Inference Server has been deployed successfully, access the server's health check API. You will access the server through the Istio Ingress Gateway. Start by getting the external IP address of the istio-ingressgateway service.
kubectl get services -n $TRITON_NAMESPACE
# Invoke the health check API
ISTIO_GATEWAY_IP_ADDRESS=$(kubectl get services -n $TRITON_NAMESPACE \
-o=jsonpath='{.items[?(@.metadata.name=="istio-ingressgateway")].status.loadBalancer.ingress[0].ip}')
curl -v ${ISTIO_GATEWAY_IP_ADDRESS}/v2/health/ready
If the returned status is 200 OK, the server is up and accessible through the gateway.
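Triton also exposes the KServe v2 HTTP API through the same gateway, so you can list the models loaded from the repository and fetch the served model's metadata. The model name ul2 below is an assumption based on the model_name set in the converter's configs.env; adjust it to match your model repository:
# List models registered in the Triton model repository
curl -s -X POST ${ISTIO_GATEWAY_IP_ADDRESS}/v2/repository/index
# Fetch input/output metadata for the served model (model name is an assumption)
curl -s ${ISTIO_GATEWAY_IP_ADDRESS}/v2/models/ul2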
.
├── converter
├── evaluator
└── README.md
/converter: Source for converting JAX checkpoints to FasterTransformer checkpoints
/evaluator: Source for running model evaluation on a validation dataset with the model hosted on NVIDIA Triton
If you have any questions or find any problems with this repository, please report them through GitHub issues.
- XSum
@InProceedings{xsum-emnlp,
  author    = "Shashi Narayan and Shay B. Cohen and Mirella Lapata",
  title     = "Don't Give Me the Details, Just the Summary! {T}opic-Aware Convolutional Neural Networks for Extreme Summarization",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  year      = "2018",
  address   = "Brussels, Belgium",
}
- UL2
@article{tay2022unifying,
  title  = {Unifying Language Learning Paradigms},
  author = {Yi Tay* and Mostafa Dehghani* and Vinh Q. Tran and Xavier Garcia and Dara Bahri and Tal Schuster and Huaixiu Steven Zheng and Neil Houlsby and Donald Metzler},
  year   = {2022},
}