
Autoscaling with multiple metrics does not work #3638

Open
shazinahmed opened this issue Apr 26, 2024 · 3 comments

Comments


shazinahmed commented Apr 26, 2024

/kind bug

What steps did you take and what happened:
Tried the following:

  1. Tried creating memory-based autoscaling using the Knative annotations below. A CPU-based HPA was created instead, with averageUtilization set to 80.
     autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"
     autoscaling.knative.dev/metric: "memory"
     autoscaling.knative.dev/target: "75"
  2. Tried creating memory-based autoscaling using the KServe annotations below. A CPU-based HPA was created instead, with averageUtilization set to 80.
     serving.kserve.io/autoscalerClass: hpa
     serving.kserve.io/deploymentMode: RawDeployment
     serving.kserve.io/metric: memory
     serving.kserve.io/targetUtilizationPercentage: "70"
  3. Tried creating memory-based autoscaling by setting predictor.scaleMetric to memory, and a memory-based HPA was created. Yaay!

In scenarios 1 and 2, HPAs are created as expected if the metric is set to cpu.

Now, I want my HPA to be controlled by both memory and CPU. I tried setting predictor.scaleMetric to memory with a corresponding scaleTarget, and also set a CPU threshold using serving.kserve.io/metric: cpu, but only predictor.scaleMetric is respected.

What did you expect to happen:
I want HPA to have both memory and CPU based triggers.
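For comparison, this is what the desired result would look like as a hand-written autoscaling/v2 HPA with two resource triggers (a sketch; the utilization targets are illustrative):

```yaml
# Sketch of a plain Kubernetes autoscaling/v2 HPA with two resource
# metrics; the controller computes a desired replica count per metric
# and scales to the largest of them.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: custom-model-predictor-default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: custom-model-predictor-default
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # illustrative
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75   # illustrative
```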

What's the InferenceService yaml:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    serving.kserve.io/autoscalerClass: hpa
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/metric: cpu
    serving.kserve.io/targetUtilizationPercentage: "70"
  name: custom-model
  namespace: dev-xyz-multimodal-ner
spec:
  predictor:
    containers:
    - env:
      - name: AWS_BUCKET
        value: xyz-xyz-xyz-xyz-xyz-xyz
      image: <custom image>
      name: kserve-container
      resources:
        limits:
          cpu: "4"
          memory: 6Gi
        requests:
          cpu: "1"
          memory: 2Gi
    maxReplicas: 4
    nodeSelector:
      nodetype: modelserving
    scaleMetric: memory
    scaleTarget: 75
    tolerations:
    - effect: NoSchedule
      key: dedicated
      operator: Equal
      value: modelserving

Anything else you would like to add:
The HPA YAML the InferenceService generates:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  annotations:
    serving.kserve.io/autoscalerClass: hpa
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/metric: cpu
    serving.kserve.io/targetUtilizationPercentage: "80"
  name: custom-model-predictor-default
  namespace: dev-xyz-multimodal-ner
  ownerReferences:
  - apiVersion: serving.kserve.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: InferenceService
    name: custom-model
spec:
  behavior:
    scaleDown:
      policies:
      - periodSeconds: 15
        type: Percent
        value: 100
      selectPolicy: Max
    scaleUp:
      policies:
      - periodSeconds: 15
        type: Pods
        value: 4
      - periodSeconds: 15
        type: Percent
        value: 100
      selectPolicy: Max
      stabilizationWindowSeconds: 0
  maxReplicas: 4
  metrics:
  - resource:
      name: memory
      target:
        averageUtilization: 75
        type: Utilization
    type: Resource
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: custom-model-predictor-default

Environment:

  • Istio Version: N/A (Using nginx)
  • Knative Version: N/A (Using Kubernetes deployment)
  • KServe Version: 0.11.0
  • Kubeflow version:
  • Cloud Environment: AWS EKS
  • Minikube/Kind version:
  • Kubernetes version: v1.28.7-eks-b9c9ed7
  • OS (e.g. from /etc/os-release): Amazon Linux 2

yuzisun commented May 4, 2024

@shazinahmed Scaling based on multiple metrics is not supported. Can you elaborate on how you want to scale with both of these metrics?


shazinahmed commented May 31, 2024

@yuzisun Sorry, I missed this one. I want an HPA created with two triggers, one for CPU and one for memory, like we can have in a normal Kubernetes Deployment. This would let us scale on both CPU and memory, depending on which is over-utilized.


spolti commented Jun 3, 2024

Based on options 1 and 2: it seems the annotations are not used when the metrics are created; it defaults to CPU and 80% if predictor.scaleMetric/scaleTarget are not set:
https://github.com/kserve/kserve/blob/master/pkg/controller/v1beta1/inferenceservice/reconcilers/hpa/hpa_reconciler.go#L60

I also didn't find a doc link for HPA; we might be missing that part.

On the other hand, the Kubernetes API does support it:
https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#scaling-on-multiple-metrics

Maybe we could evaluate bringing this functionality to KServe when HPA is enabled.

wdyt?
