
Reconsider Horizontal Pod Autoscaler docs for CephObjectStore #10001

Open
BlaineEXE opened this issue Apr 5, 2022 · 8 comments

Labels: docs, object (Object protocol - S3)

BlaineEXE (Member) commented Apr 5, 2022

I am concerned that the KEDA Horizontal Pod Autoscaler (HPA) example documented for CephObjectStore won't work in production. I believe that any time the CephObjectStore is reconciled, Rook will reset the replicas to 1, which could disrupt user workloads.

Originally posted by @BlaineEXE in #9202 (review)

We should investigate whether this is a concern and reassess. I believe the example is a good one, but we may need to design a Rook feature to assist with this use case. If we decide not to proceed with a Rook feature and there is an issue, we may want to revert the commits.

In any scenario, we should remove the KEDA CRDs from the current location in github-action-helper.sh since they are of no use to our CI.
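
For context, here is a minimal sketch of the kind of KEDA ScaledObject the documentation describes for scaling the RGW deployment. This is not a copy of deploy/examples/keda-rgw.yaml; the store name, deployment name, Prometheus address, and trigger query below are placeholder assumptions:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rgw-scale
  namespace: rook-ceph
spec:
  scaleTargetRef:
    kind: Deployment                  # KEDA drives the Deployment's replica count directly
    name: rook-ceph-rgw-my-store-a    # assumed RGW deployment for a store named "my-store"
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090  # assumed Prometheus endpoint
        query: sum(rate(ceph_rgw_req[1m]))                             # assumed RGW request-rate metric
        threshold: "75"
```

Because KEDA writes the replica count onto the Deployment while Rook also manages that Deployment, any reconcile that resets replicas to 1 would undo the autoscaler's decision until KEDA scales the Deployment back up.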

BlaineEXE added the docs and object (Object protocol - S3) labels on Apr 5, 2022
thotz (Contributor) commented Apr 6, 2022

> I am concerned that the KEDA Horizontal Pod Autoscaler (HPA) example documented for CephObjectStore won't work in production. I believe that any time the CephObjectStore is reconciled, Rook will reset the replicas to 1, which could disrupt user workloads.

I am not quite sure how reconciliation will impact HPA. By default, HPA can work with memory and CPU without any special configuration, so I highly doubt that reconciliation of resources can affect HPA.
Would restarting the Rook operator after HPA scaling validate this use case?

> Originally posted by @BlaineEXE in #9202 (review)

> We should investigate whether this is a concern and reassess. I believe the example is a good one, but we may need to design a Rook feature to assist with this use case. If we decide not to proceed with a Rook feature and there is an issue, we may want to revert the commits.

> In any scenario, we should remove the KEDA CRDs from the current location in github-action-helper.sh since they are of no use to our CI.

The CI validates all the YAMLs in deploy/examples, and since the KEDA CRDs were not installed, creation of keda-rgw.yaml was failing. Another way to work around it would be via skip-yaml, but @travisn suggested doing it by installing the CRDs; please refer to #9202 (comment).

thotz (Contributor) commented Apr 14, 2022

> I am concerned that the KEDA Horizontal Pod Autoscaler (HPA) example documented for CephObjectStore won't work in production. I believe that any time the CephObjectStore is reconciled, Rook will reset the replicas to 1, which could disrupt user workloads.

> I am not quite sure how reconciliation will impact HPA. By default, HPA can work with memory and CPU without any special configuration, so I highly doubt that reconciliation of resources can affect HPA. Would restarting the Rook operator after HPA scaling validate this use case?

From my Rook operator restart testing, the Rook operator does reset the replica count to 1 as you mentioned, and I expect HPA then quickly scales it back up to the previous replica value for the workload.

> Originally posted by @BlaineEXE in #9202 (review)
> We should investigate whether this is a concern and reassess. I believe the example is a good one, but we may need to design a Rook feature to assist with this use case. If we decide not to proceed with a Rook feature and there is an issue, we may want to revert the commits.
> In any scenario, we should remove the KEDA CRDs from the current location in github-action-helper.sh since they are of no use to our CI.

> The CI validates all the YAMLs in deploy/examples, and since the KEDA CRDs were not installed, creation of keda-rgw.yaml was failing. Another way to work around it would be via skip-yaml, but @travisn suggested doing it by installing the CRDs; please refer to #9202 (comment).

BlaineEXE (Member, Author) commented Apr 14, 2022

We should update our CephObjectStore reconciliation. I think a small design would make sense. Questions I have:

  • Should Rook continue to create one deployment per RGW for its own scaling, or should it update the replica count on a single deployment?
  • Should Rook set a replica count on RGW deployments only when they don't already exist, allowing the value to differ for any pre-existing deployments? Or is there a different strategy Rook could use to update deployments without disrupting replicas?
  • Possible other strategy: can KEDA adjust the CephObjectStore server count directly rather than modifying the deployment replica count? (See the sketch after this list.)
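
To illustrate the last bullet: KEDA can scale a custom resource directly, but only if the CRD exposes a scale subresource, which the CephObjectStore CRD does not appear to do today. A hypothetical sketch, assuming Rook added a scale subresource mapped to spec.gateway.instances and a store named "my-store":

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: objectstore-scale
  namespace: rook-ceph
spec:
  scaleTargetRef:
    apiVersion: ceph.rook.io/v1   # scale the Rook CR instead of the rgw Deployment
    kind: CephObjectStore
    name: my-store                # hypothetical store name
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090  # assumed Prometheus endpoint
        query: sum(rate(ceph_rgw_req[1m]))                             # assumed RGW request-rate metric
        threshold: "75"
```

With this approach Rook would remain the only writer of the Deployment's replica count, so a reconcile could not fight with the autoscaler.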

travisn (Member) commented Apr 14, 2022

@thotz KEDA updates the replica count on the rgw deployment, right? So all Rook will really notice during a reconcile is that the deployment replica count is incorrect? In that case, I'm thinking we should do the following to keep it simple to support KEDA:

  • When the object store is initialized, create the rgw deployment with replicas according to the setting in the CR (gateway.instances)
  • During future reconciles, generate the pod spec as we always do, but set the deployment replicas to the replicas of the existing rgw deployment.

Another consideration is a new property on the object store CR, for example allowAutoExpansion: true, which would explicitly allow this behavior. But this setting doesn't really seem necessary if we just document the behavior.
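
A minimal sketch of how the object store CR might look under this proposal, assuming a store named "my-store"; allowAutoExpansion is the hypothetical property discussed above and does not exist today:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: my-store
  namespace: rook-ceph
spec:
  gateway:
    port: 80
    instances: 2                # used when the rgw deployment is first created
    # allowAutoExpansion: true  # hypothetical opt-in: later reconciles would keep the live replica count
```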

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

BlaineEXE (Member, Author) commented:

Not a wontfix, but probably not able to prioritize until 1.11+

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

thotz added the keepalive label and removed the wontfix label on Aug 22, 2022
thotz reopened this on Aug 22, 2022
travisn removed the keepalive label on Mar 27, 2024