External scaler connection errors ignored, the HPA is missing metrics #5787

Open
vrok opened this issue May 7, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@vrok

vrok commented May 7, 2024

Report

When I install a Helm chart containing both an external scaler gRPC service and a ScaledObject, the resulting HPA has an empty list of metrics (K8s inserts the default 80% CPU utilization metric in that case). It then remains in that state even after the external scaler gRPC service has been initialized (I can manually force it to re-reconcile by editing the ScaledObject).

This is happening because Helm installs the external scaler service and the ScaledObject at the same time. The external scaler's gRPC server isn't available immediately (it takes ~1 sec for the pod to start), so KEDA reconciles the ScaledObject before the external scaler is reachable and ignores the resulting gRPC connection error.

Expected Behavior

In my opinion, it would be better if KEDA re-queued the reconciliation request in these situations. For example, Reconcile() in scaledobject_controller.go could return ctrl.Result{RequeueAfter: time.Minute} if a gRPC connection error was observed.

Actual Behavior

KEDA doesn't update the HPA even after the external scaler is available.

Steps to Reproduce the Problem

  1. Install a ScaledObject resource using an external scaler
  2. Install the external scaler's gRPC service (1 and 2 should happen at roughly the same time, e.g., by being installed as part of the same Helm chart)
  3. Notice that the HorizontalPodAutoscaler created by KEDA is missing the metric specified in the ScaledObject

Logs from KEDA operator

No response

KEDA Version

None

Kubernetes Version

None

Platform

Any

Scaler Details

No response

Anything else?

No response

@vrok vrok added the bug Something isn't working label May 7, 2024
@JorTurFer
Member

Hello,
What KEDA version are you using? This error shouldn't happen because KEDA tries to reconcile the ScaledObjects automatically. Do you see any errors in the KEDA operator logs?

@vrok
Author

vrok commented May 8, 2024

@JorTurFer I'm on 2.14.0 (but I tested the main branch yesterday and the problem occurred too).

This is the ScaledObject definition:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  labels:
    app.kubernetes.io/managed-by: Helm
    scaledobject.keda.sh/name: scaledobject-workers
  name: scaledobject-workers
  namespace: default
spec:
  scaleTargetRef:
    kind: Deployment
    name: scheduler
  triggers:
  - metadata:
      scalerAddress: scheduler-scaler.default.svc.cluster.local:8080
    type: external-push

And this is the HPA that gets created - notice that the list of metrics only contains a CPU-based metric (this is the default one inserted by K8s):

apiVersion: v1
items:
- apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    annotations:
      meta.helm.sh/release-name: scheduler
      meta.helm.sh/release-namespace: default
    creationTimestamp: "2024-05-08T14:49:33Z"
    labels:
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: keda-hpa-scaledobject-workers
      app.kubernetes.io/part-of: scaledobject-workers
      app.kubernetes.io/version: 2.14.0
      scaledobject.keda.sh/name: scaledobject-workers
    name: keda-hpa-scaledobject-workers
    namespace: default
    ownerReferences:
    - apiVersion: keda.sh/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: ScaledObject
      name: scaledobject-workers
      uid: 1c21176d-71bc-4de2-9740-9fe03f5f66d7
    resourceVersion: "2777064"
    uid: a272a347-f011-499f-92e5-fa08d650f985
  spec:
    maxReplicas: 100
    metrics:
    - resource:
        name: cpu
        target:
          averageUtilization: 80
          type: Utilization
      type: Resource
    minReplicas: 1
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: scheduler
  status:
    conditions:
    - lastTransitionTime: "2024-05-08T14:49:48Z"
      message: the HPA controller was able to get the target's current scale
      reason: SucceededGetScale
      status: "True"
      type: AbleToScale
    - lastTransitionTime: "2024-05-08T14:49:48Z"
      message: 'the HPA was unable to compute the replica count: failed to get cpu
        utilization: unable to get metrics for resource cpu: unable to fetch metrics
        from resource metrics API: the server could not find the requested resource
        (get pods.metrics.k8s.io)'
      reason: FailedGetResourceMetric
      status: "False"
      type: ScalingActive
    currentMetrics: null
    currentReplicas: 1
    desiredReplicas: 0
kind: List
metadata:
  resourceVersion: ""

I'm also attaching logs from the operator pod:

keda-operator-logs.txt

Now, for example, if I edit the ScaledObject (with kubectl edit scaledobject ...), KEDA's Reconcile() method in scaledobject_controller.go is re-run and updates the HPA resource with the expected changes. This seems to happen because the gRPC connection error is ignored while the gRPC service isn't available yet, and once it becomes available, KEDA doesn't retry the gRPC call.

@JorTurFer
Member

I'm going to try to reproduce this.
From your example, I understand that I can deploy the ScaledObject and then, a few seconds later, the external gRPC server, and that would be close to your use case, right? I want to find where we are hiding the connection error.

@vrok
Author

vrok commented May 20, 2024

@JorTurFer That's correct, the gRPC server with the external scaler should be down for some time after a ScaledObject is installed.


2 participants