pg_autoscaler configuration is needed #14075

anthonyeleven · 2024-04-15T13:22:14Z

https://github.com/rook/rook/pull/13766/files

[rook@rook-ceph-tools-5ff8d58445-bxqz7 /]$ ceph mgr module ls
MODULE
balancer              on (always on)
crash                 on (always on)
devicehealth          on (always on)
orchestrator          on (always on)
pg_autoscaler         on (always on)
progress              on (always on)
rbd_support           on (always on)
status                on (always on)
telemetry             on (always on)
volumes               on (always on)
dashboard             on
iostat                on
nfs                   on
prometheus            on
restful               on
alerts                -
cephadm               -
diskprediction_local  -

While the module itself is always on, upstream does not force us to set it to be active. If they did that would likely be the last straw for those who clamor for a StableCeph fork. We very much need the means to configure it. Notably, getting it to do what it's supposed to requires prognostication, and it is very, very prone to undersizing pools at significant performance cost - especially since it is not media-aware.

Default to bulk mode to minimize impactful PG splitting later on
Allow setting osd_pool_default_pg_autoscale_mode on or off
Allow setting off, on, warn for each pool
https://docs.ceph.com/en/latest/rados/operations/placement-groups/

[rook@rook-ceph-tools-5ff8d58445-bxqz7 /]$ ceph config dump | grep autoscale
global                               advanced  osd_pool_default_pg_autoscale_mode            off
[rook@rook-ceph-tools-5ff8d58445-bxqz7 /]$

[rook@rook-ceph-tools-5ff8d58445-bxqz7 /]$ ceph osd dump | grep pool
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 1296 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'rbd-nvme-ssd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 352770 lfor 0/156660/158816 flags hashpspool,selfmanaged_snaps stripe_width 0 compression_algorithm lz4 compression_mode aggressive application rbd
pool 12 'ceph-objectstore.rgw.control' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 64674 flags hashpspool stripe_width 0 application rook-ceph-rgw
pool 13 'ceph-objectstore.rgw.meta' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 64679 flags hashpspool stripe_width 0 application rook-ceph-rgw
pool 14 'ceph-objectstore.rgw.log' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 64684 flags hashpspool stripe_width 0 application rook-ceph-rgw
pool 15 'ceph-objectstore.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode off last_change 153680 lfor 0/0/153678 flags hashpspool stripe_width 0 application rook-ceph-rgw
pool 16 'ceph-objectstore.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 64695 flags hashpspool stripe_width 0 application rook-ceph-rgw
pool 17 'ceph-objectstore.rgw.otp' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 64709 flags hashpspool stripe_width 0 application rook-ceph-rgw
pool 18 '.rgw.root' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 64705 flags hashpspool stripe_width 0 application rook-ceph-rgw
pool 19 'ceph-objectstore.rgw.buckets.data' erasure profile ceph-objectstore.rgw.buckets.data_ecprofile size 6 min_size 5 crush_rule 10 object_hash rjenkins pg_num 8192 pgp_num 8192 autoscale_mode off last_change 352804 lfor 0/156300/165341 flags hashpspool,ec_overwrites stripe_width 16384 application rook-ceph-rgw
pool 21 'ceph-objectstore.rgw.buckets.data.hdd' erasure profile ceph-objectstore.rgw.buckets.data_ecprofile_hdd size 6 min_size 5 crush_rule 11 object_hash rjenkins pg_num 8192 pgp_num 8192 autoscale_mode off last_change 167193 lfor 0/0/164453 flags hashpspool,ec_overwrites stripe_width 16384 application rook-ceph-rgw
pool 22 'rbd-sata-hdd' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 352784 lfor 0/0/344532 flags hashpspool,selfmanaged_snaps stripe_width 0 compression_algorithm lz4 compression_mode aggressive application rbd
[rook@rook-ceph-tools-5ff8d58445-bxqz7 /]$

@travisn

Is this a bug report or feature request?

Feature Request

Deviation from expected behavior:

Expected behavior:

How to reproduce it (minimal and precise):

File(s) to submit:

Cluster CR (custom resource), typically called cluster.yaml, if necessary

Logs to submit:

Operator's logs, if necessary
Crashing pod(s) logs, if necessary

To get logs, use kubectl -n <namespace> logs <pod name>
When pasting logs, always surround them with backticks or use the insert code button from the Github UI.
Read GitHub documentation if you need help.

Cluster Status to submit:

Output of kubectl commands, if necessary

To get the health of the cluster, use kubectl rook-ceph health
To get the status of the cluster, use kubectl rook-ceph ceph status
For more details, see the Rook kubectl Plugin

Environment:

OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Cloud provider or hardware configuration:
Rook version (use rook version inside of a Rook Pod):
Storage backend version (e.g. for ceph do ceph -v):
Kubernetes version (use kubectl version):
Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):

The text was updated successfully, but these errors were encountered:

travisn · 2024-04-15T21:10:03Z

The pool-specific settings can be specified today in the parameters of the CephBlockPool spec. Any setting that can be applied to a pool in the toolbox with a command such as ceph osd pool set <pool> <key> <value> can be specified. For example:

spec:
  parameters:
    pg_autoscale_mode: "on"
    bulk: "true"

Setting bulk: true seems to cause the autoscaler to immediately jump to 256 PGs, which seems rather high to consider for the default.

I believe this covers all cases you are suggesting, except the global setting osd_pool_default_pg_autoscale_mode. Since each pool could configure to enable/disable the autoscaler, perhaps this isn't necessary?

anthonyeleven · 2024-04-15T22:14:04Z

Re osd_pool_default_pg_autoscale_mode, there are pools that are auto-created that there aren't Rook-specific places to set it I don't think.

256PGs is not excessive unless there are fewer than 3 OSDs fwiw, modulo the number of coresident pools.

parth-gr · 2024-04-16T07:07:43Z

So do you wan to turn off the autoscale on the auto-created pools?

I think probably you can do it from the rook toolbox.

Or if there is any specific auto created pool that is concerning we can have a env variable setting for it

anthonyeleven · 2024-04-16T13:14:17Z

If one has to do common things from a shell by hand instead of having IaC, why have Rook at all?

parth-gr · 2024-04-16T13:33:07Z

So having such a configuration is as per design to make rook work with more productivity, I don't see any reason to change what we have in internal design.
If it's a common problem I said we can have a common env variable to configure it.

Btw for .mgr pool you can specify here https://github.com/rook/rook/blob/master/deploy/examples/cluster-test.yaml#L57 similar for .rgw pool

anthonyeleven · 2024-04-16T13:36:21Z

I don't see any reason to change what we have in internal design.
I didn't ask for that. I asked for a feature whose lack degrades performance.

What about rgw.buckets.non-ec ? rgw.buckets.index? rgw.otp? .rgw.root?

parth-gr · 2024-04-16T13:40:09Z

Okay I agree we need to have a setting,

Do you prefer to have individual spec field to set for each pre-defined pool, or just a common spec to change the pgs of all predefined pools?

anthonyeleven · 2024-04-16T13:45:27Z

Ideally, to mirror the behavior available in non-Rook Ceph, the ability set both the global config default value and per-pool pg_num for any pool Rook/Ceph will deploy.

travisn · 2024-04-16T15:13:13Z

All of the rgw metadata pools are created with the settings from the CephObjectStore CR under the metadataPool settings. So I expect all of the pools being created can be controlled with CRs today.

anthonyeleven · 2024-04-16T15:14:19Z

We should be sure that's documented.

travisn · 2024-04-16T17:50:21Z

We should be sure that's documented.

Are you suggesting to add the names of all the metadata pools in this comment, or perhaps in this doc? I was hoping the "metadata pools" could be self-explanatory to most users and that level of detail wouldn't be needed in the docs, but it certainly could be added if helpful.

anthonyeleven · 2024-04-17T14:24:49Z

The pool names can vary by release -- we've seen that with RGW -- but we do have the case with Rook that a user might not predict them in advance, so for RGW at least, just documenting like "all of the rgw metadata pools are created with the settings from the CephObjectStore CR under the metadataPool settings. So I expect all of the pools being created can be controlled with CRs today." In this context the index pool is separate I hope? Since it typically warrants individual planning.

travisn · 2024-04-17T19:10:53Z

In this context the index pool is separate I hope? Since it typically warrants individual planning.

You're referring to the .rgw.root pool? This one does get some special treatment, but if you have multiple object stores, they should have the same metadataPool settings or else the .rgw.root pool would attempt to apply the different settings. If there are multiple object stores, at least with v1.14 now this problem can be remedied with the shared pools for object stores where .rgw.root can be explicitly configured.

BlaineEXE · 2024-04-17T19:29:55Z

I believe the features you're requesting are able to be configured in Rook today.

Allow setting osd_pool_default_pg_autoscale_mode on or off

Ideally, to mirror the behavior available in non-Rook Ceph, the ability set both the global config default value and per-pool pg_num for any pool Rook/Ceph will deploy.

This is possible via Rook config options here (https://rook.io/docs/rook/latest-release/CRDs/Cluster/ceph-cluster-crd/#ceph-config), or by using the rook-config-override configmap.

Allow setting off, on, warn for each pool

Default to bulk mode to minimize impactful PG splitting later on

Travis did a much better job of explaining these points here: #14075 (comment)

As added notes, Ceph docs for these pool-focused params are here: https://docs.ceph.com/en/latest/rados/operations/pools/

And Rook docs about using parameters are here:
documented here: https://rook.io/docs/rook/latest-release/CRDs/Block-Storage/ceph-block-pool-crd/#pool-settings

These configs should allow modifying pre-existing pools without the need to use the toolbox CLI, which as you mentioned is a non-ideal workflow in the context that Rook is supposed to be a desired-state system.

And it is a good point that something that might help other Rook users is to add some documentation sections that show users how to configure pools with advanced features like these using the parameters section.

What about rgw.buckets.non-ec ? rgw.buckets.index? rgw.otp? .rgw.root?

I think the best "answer" for these is the object store shared pools feature that was added recently, mentioned here: #14075 (comment)

We very much need the means to configure it. Notably, getting it to do what it's supposed to requires prognostication

For other pools like .mgr, you're also right that it requires oracular foresight to figure out how to make Ceph do things right before runtime. Unfortunately, that seems to be what it takes to do advanced stuff with Ceph. I'm not sure how much more Rook can do to help the situation without taking on obscene code burden.

anthonyeleven · 2024-04-17T19:50:28Z

No, e.g. ceph-objectstore.rgw.buckets.index Insufficient PGs in this pool significantly bottleneck RGW operations -- this is sadly quite common.

…

On Apr 17, 2024, at 15:11, Travis Nielsen ***@***.***> wrote: In this context the index pool is separate I hope? Since it typically warrants individual planning. You're referring to the .rgw.root pool? This one does get some special treatment, but if you have multiple object stores, they should have the same metadataPool settings or else the .rgw.root pool would attempt to apply the different settings. If there are multiple object stores, at least with v1.14 now this problem can be remedied with the shared pools for object stores <https://rook.io/docs/rook/v1.14/Storage-Configuration/Object-Storage-RGW/object-storage/#create-local-object-stores-with-shared-pools> where .rgw.root can be explicitly configured. — Reply to this email directly, view it on GitHub <#14075 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADTVONEQQ6S6SMBQ742BP5TY53CNHAVCNFSM6AAAAABGHLOHGWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRSGAZDENRQGU>. You are receiving this because you authored the thread.

BlaineEXE · 2024-04-17T19:54:03Z

No, e.g. ceph-objectstore.rgw.buckets.index

Insufficient PGs in this pool significantly bottleneck RGW operations -- this is sadly quite common.

This is an area I'm curious to hear more about in the context of shared pools. Currently, I think we assume shared pools are broken into 2 categories:

metadata (must be replicated, cannot be erasure coded)
data (can be replica/ec)

But I wondered when we were implementing if there would be any users who need additional breakdowns. Perhaps this could also make sense:

index (replica with count Y, fairly few pgs)
metadata (replica with count X, more pgs) - all non-index metadata
data (whatever you want)

Is this part of what you're expressing, @anthonyeleven ?

anthonyeleven · 2024-04-18T01:22:55Z

No, e.g. ceph-objectstore.rgw.buckets.index Insufficient PGs in this pool significantly bottleneck RGW operations -- this is sadly quite common. This is an area I'm curious to hear more about in the context of shared pools. Currently, I think we assume shared pools are broken into 2 categories: metadata (must be replicated, cannot be erasure coded) data (can be replica/ec)

And hopefully one day we can configure multiple (probably not more than a handful of) data (bucket) pools within a single objectstore, which would probably want to share a single index pool. I guess a shared pool makes sense if someone has like dozens of bucket pools, especially given that Rook creates a CRUSH rule for each and ever pool. I can't see that I personally would ever need to do that.

But I wondered when we were implementing if there would be any users who need additional breakdowns. Perhaps this could also make sense: metadata (replica with count X) index (replica with count Y) data (whatever you want) Is this part of what you're expressing, @anthonyeleven <https://github.com/anthonyeleven> ?

The rgw.index pool stores RGW S3 / Swift bucket indexes. With smaller objects and/or buckets with a lot of objects in them, this is often an RGW service's bottleneck. To work well, the index pool needs: * to be on SSDs * preferably NVMe of course * with a decent number of PGs, since both the OSD and the PG code have serializations that limit performance. On SATA SSDs I'd aim for a PG ratio of 200-250, for NVMe SSDs 300 easily. The pg_autoscaler unless forced will only do a fraction of these numbers. * to be across a decent number of OSDs. 3 isn't a decent number. 12 is maybe a start. As a cluster grows so should the index pool, so OSD nodes that have 1-2 SSDs in them for the index pool scale well, and we use deviceclasses to segregate the OSDs if they aren't all TLC * The SSDs don't have to be big, this is all omap data

…

— Reply to this email directly, view it on GitHub <#14075 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADTVONGWD4KRL7B3BU3ZW5LY53HPDAVCNFSM6AAAAABGHLOHGWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRSGEYDIMRQGE>. You are receiving this because you were mentioned.

travisn · 2024-04-18T16:51:19Z

And hopefully one day we can configure multiple (probably not more than a handful of) data (bucket) pools within a single objectstore, which would probably want to share a single index pool.

I guess a shared pool makes sense if someone has like dozens of bucket pools, especially given that Rook creates a CRUSH rule for each and ever pool. I can't see that I personally would ever need to do that.

What is the scenario to have a single object store with multiple data pools? Perhaps creating separate object stores that have their own data pools meets the same requirements?

The rgw.index pool stores RGW S3 / Swift bucket indexes. With smaller objects and/or buckets with a lot of objects in them, this is often an RGW service's bottleneck. To work well, the index pool needs:

Thanks for the context on the index pool. Sounds like we need a separate option for its configuration from other metadata pools.

anthonyeleven · 2024-04-18T17:35:14Z

On Apr 18, 2024, at 12:51, Travis Nielsen ***@***.***> wrote: And hopefully one day we can configure multiple (probably not more than a handful of) data (bucket) pools within a single objectstore, which would probably want to share a single index pool. I guess a shared pool makes sense if someone has like dozens of bucket pools, especially given that Rook creates a CRUSH rule for each and ever pool. I can't see that I personally would ever need to do that. What is the scenario to have a single object store with multiple data pools? Perhaps creating separate object stores that have their own data pools meets the same requirements?

It doesn't -- we discussed this in the other issue thread. Multiple Rook objectstores require separate RGW instances and thus endpoints to use different types of storage. By having multiple bucket pools within a single Rook objectstore, and thus available to a common set of RGWs, clients or Lua scripts or RGW per-user config can direct uploads to different S3 storage classes and thus pools/media according to needs. For example, one might have: * default bucket pool on TLC SSDs with 3x replication * bucket pool for large, read-mostly objects on QLC SSDs with replication or EC * bucket pool for large, performance-tolerant objects on HDDs with EC If a given RGW endpoint can't access all of the storageclasses, that limits our ability to steer uploads via Lua, and forces clients who may have mixed use cases to juggle multiple endpoints.

The rgw.index pool stores RGW S3 / Swift bucket indexes. With smaller objects and/or buckets with a lot of objects in them, this is often an RGW service's bottleneck. To work well, the index pool needs: Thanks for the context on the index pool. Sounds like we need a separate option for its configuration from other metadata pools.

Absolutely. And maybe RGW logs too. The other minor RGW pools don't get much data.

…

— Reply to this email directly, view it on GitHub <#14075 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADTVONFVY4MAYTFSENEP4Z3Y572Z5AVCNFSM6AAAAABGHLOHGWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRUGUZTONJRGI>. You are receiving this because you were mentioned.

travisn · 2024-04-18T19:17:15Z

Thanks for the background on separate data pools for the same store. Looks like we need to consider the storageclasses capability of rgw to allow the placement targets to different data pools.

travisn · 2024-04-18T19:19:34Z

This issue has delved into various different issues. @anthonyeleven would you mind opening new issues for the separate topics? We can keep this issue focused on the small doc clarification for pg management of rgw pools.

anthonyeleven · 2024-04-18T19:19:46Z

#12824 Multiple RGW pools and storageclasses · Issue #12824 · rook/rook github.com

…

On Apr 18, 2024, at 15:17, Travis Nielsen ***@***.***> wrote: Thanks for the background on separate data pools for the same store. Looks like we need to consider the storageclasses <https://docs.ceph.com/en/latest/radosgw/placement/#storage-classes> capability of rgw to allow the placement targets to different data pools. — Reply to this email directly, view it on GitHub <#14075 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADTVONCYJGE25AZDQNKFI7LY6AL5DAVCNFSM6AAAAABGHLOHGWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRVGAYTQMRWGY>. You are receiving this because you were mentioned.

anthonyeleven · 2024-04-18T19:21:56Z

#12824 Multiple RGW pools and storageclasses · Issue #12824 · rook/rook github.com already covers the storageclasses.

…

On Apr 18, 2024, at 15:19, Travis Nielsen ***@***.***> wrote: This issue has delved into various different issues. @anthonyeleven <https://github.com/anthonyeleven> would you mind opening new issues for the separate topics? We can keep this issue focused on the small doc clarification for pg management of rgw pools. — Reply to this email directly, view it on GitHub <#14075 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADTVONE3RO5PSBT3QAXTWH3Y6AMFZAVCNFSM6AAAAABGHLOHGWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRVGAZDKNRSGQ>. You are receiving this because you were mentioned.

travisn · 2024-04-18T19:24:21Z

already covers the storageclasses.

Thanks, I missed the connection there!

BlaineEXE · 2024-04-18T20:15:01Z

The rgw.index pool stores RGW S3 / Swift bucket indexes. With smaller objects and/or buckets with a lot of objects in them, this is often an RGW service's bottleneck. To work well, the index pool needs:

to be on SSDs

preferably NVMe of course

with a decent number of PGs, since both the OSD and the PG code have serializations that limit performance. On SATA SSDs I'd aim for a PG ratio of 200-250, for NVMe SSDs 300 easily. The pg_autoscaler unless forced will only do a fraction of these numbers.

to be across a decent number of OSDs. 3 isn't a decent number. 12 is maybe a start. As a cluster grows so should the index pool, so OSD nodes that have 1-2 SSDs in them for the index pool scale well, and we use deviceclasses to segregate the OSDs if they aren't all TLC

The SSDs don't have to be big, this is all omap data

Based on my interpretation of these requirements, I don't think they explicitly suggest that the index pool can't also provide storage for other metadata (in a shared pools case). In the interest of simplicity, I think it would make sense for users to configure the "metadata" pool with solid state and many PGs to give the index best performance, and then other metadata can also reap those benefits.

The only reason I can imagine right now that someone might want to separate index from "other metadata" is to save money by buying as few metadata NVMe drives as possible. But I also can't imagine that the other metadata includes a significant percentage of data that it makes much difference.

If that all is correct, then I think what we have today with shared pools can meet these needs. If not, then we can consider splitting index and non-index metadata pools (similar to OSD md and db).

And this is obviously separate from needing to develop and implement support for multiple RGW "s3 storage classes".

anthonyeleven · 2024-04-18T21:28:01Z

Pools can share OSDs, so it's not like anyone would save anything. Trying to lump the minor RGW pools into the index pool would be a bad idea. I dunno what RADOS object names are used, but Ceph for sure will not be expecting that. ***@***.*** ~]# kubectl exec $(kubectl get pods -n rook-ceph | grep rook-ceph-tools | cut -f1 -d\ ) -i -t -n rook-ceph -- /bin/bash ***@***.*** /]$ ceph osd dump | grep pool pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 1296 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr pool 2 'rbd-nvme-ssd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 384295 lfor 0/156660/158816 flags hashpspool,selfmanaged_snaps stripe_width 0 compression_algorithm lz4 compression_mode aggressive application rbd pool 12 'ceph-objectstore.rgw.control' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 64674 flags hashpspool stripe_width 0 application rook-ceph-rgw pool 13 'ceph-objectstore.rgw.meta' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 64679 flags hashpspool stripe_width 0 application rook-ceph-rgw pool 14 'ceph-objectstore.rgw.log' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 64684 flags hashpspool stripe_width 0 application rook-ceph-rgw pool 15 'ceph-objectstore.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode off last_change 153680 lfor 0/0/153678 flags hashpspool stripe_width 0 application rook-ceph-rgw pool 16 'ceph-objectstore.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 64695 flags hashpspool stripe_width 0 application rook-ceph-rgw pool 17 'ceph-objectstore.rgw.otp' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 64709 flags hashpspool stripe_width 0 application rook-ceph-rgw pool 18 '.rgw.root' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 64705 flags hashpspool stripe_width 0 application rook-ceph-rgw pool 19 'ceph-objectstore.rgw.buckets.data' erasure profile ceph-objectstore.rgw.buckets.data_ecprofile size 6 min_size 5 crush_rule 10 object_hash rjenkins pg_num 8192 pgp_num 8192 autoscale_mode off last_change 384299 lfor 0/156300/165341 flags hashpspool,ec_overwrites stripe_width 16384 application rook-ceph-rgw pool 21 'ceph-objectstore.rgw.buckets.data.hdd' erasure profile ceph-objectstore.rgw.buckets.data_ecprofile_hdd size 6 min_size 5 crush_rule 11 object_hash rjenkins pg_num 8192 pgp_num 8192 autoscale_mode off last_change 167193 lfor 0/0/164453 flags hashpspool,ec_overwrites stripe_width 16384 application rook-ceph-rgw pool 22 'rbd-sata-hdd' replicated size 3 min_size 2 crush_rule 12 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 384298 lfor 0/0/344532 flags hashpspool,selfmanaged_snaps stripe_width 0 compression_algorithm lz4 compression_mode aggressive application rbd```

…

On Apr 18, 2024, at 16:15, Blaine Gardner ***@***.***> wrote: The rgw.index pool stores RGW S3 / Swift bucket indexes. With smaller objects and/or buckets with a lot of objects in them, this is often an RGW service's bottleneck. To work well, the index pool needs: to be on SSDs preferably NVMe of course with a decent number of PGs, since both the OSD and the PG code have serializations that limit performance. On SATA SSDs I'd aim for a PG ratio of 200-250, for NVMe SSDs 300 easily. The pg_autoscaler unless forced will only do a fraction of these numbers. to be across a decent number of OSDs. 3 isn't a decent number. 12 is maybe a start. As a cluster grows so should the index pool, so OSD nodes that have 1-2 SSDs in them for the index pool scale well, and we use deviceclasses to segregate the OSDs if they aren't all TLC The SSDs don't have to be big, this is all omap data Based on my interpretation of these requirements, I don't think they explicitly suggest that the index pool can't also provide storage for other metadata (in a shared pools case). In the interest of simplicity, I think it would make sense for users to configure the "metadata" pool with solid state and many PGs to give the index best performance, and then other metadata can also reap those benefits. The only reason I can imagine right now that someone might want to separate index from "other metadata" is to save money by buying as few metadata NVMe drives as possible. But I also can't imagine that the other metadata includes a significant percentage of data that it makes much difference. If that all is correct, then I think what we have today with shared pools can meet these needs. If not, then we can consider splitting index and non-index metadata pools (similar to OSD md and db). And this is obviously separate from needing to develop and implement support for multiple RGW "s3 storage classes". — Reply to this email directly, view it on GitHub <#14075 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADTVONGIPVE27GCVTADKRDTY6ASVXAVCNFSM6AAAAABGHLOHGWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRVGIYTGMJSGI>. You are receiving this because you were mentioned.

BlaineEXE · 2024-04-18T21:46:31Z

Trying to lump the minor RGW pools into the index pool would be a bad idea. I dunno what RADOS object names are used, but Ceph for sure will not be expecting that.

The shared pool feature does not "lump pools together" as it seems you are thinking. With shared pools, objects are separated by namespaces (instead of pools) to avoid name collisions. My assertion is that with shared pools, there is no need to separate index and "minor" pools because I can't find evidence of substantial benefit to doing so when both index and minor metadata can share the "index-optimized" pool easily.

anthonyeleven · 2024-04-18T22:01:44Z

I don’t know how one would direct Ceph to do so. On Apr 18, 2024, at 5:46 PM, Blaine Gardner ***@***.***> wrote: Trying to lump the minor RGW pools into the index pool would be a bad idea. I dunno what RADOS object names are used, but Ceph for sure will not be expecting that. The shared pool feature does not "lump pools together" as it seems you are thinking. With shared pools, objects are separated by namespaces (instead of pools) to avoid name collisions. My assertion is that with shared pools, there is no need to separate index and "minor" pools because I can't find evidence of substantial benefit to doing so when both index and minor metadata can share the "index-optimized" pool easily. —Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>

BlaineEXE · 2024-04-18T22:15:22Z

I don’t know how one would direct Ceph to do so.

That is the feature that is provided by shared pools: https://rook.io/docs/rook/v1.14/Storage-Configuration/Object-Storage-RGW/object-storage/#create-local-object-stores-with-shared-pools

anthonyeleven added the bug label Apr 15, 2024

parth-gr added the feature label Apr 16, 2024

parth-gr added this to To do in v1.14 via automation Apr 16, 2024

BlaineEXE removed the bug label Apr 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pg_autoscaler configuration is needed #14075

pg_autoscaler configuration is needed #14075

anthonyeleven commented Apr 15, 2024 •

edited by BlaineEXE

travisn commented Apr 15, 2024 •

edited

anthonyeleven commented Apr 15, 2024

parth-gr commented Apr 16, 2024 •

edited

anthonyeleven commented Apr 16, 2024 •

edited

parth-gr commented Apr 16, 2024

anthonyeleven commented Apr 16, 2024

parth-gr commented Apr 16, 2024 •

edited

anthonyeleven commented Apr 16, 2024

travisn commented Apr 16, 2024

anthonyeleven commented Apr 16, 2024

travisn commented Apr 16, 2024

anthonyeleven commented Apr 17, 2024

travisn commented Apr 17, 2024 •

edited

BlaineEXE commented Apr 17, 2024 •

edited

anthonyeleven commented Apr 17, 2024 via email

BlaineEXE commented Apr 17, 2024 •

edited

anthonyeleven commented Apr 18, 2024 via email

travisn commented Apr 18, 2024

anthonyeleven commented Apr 18, 2024 via email

travisn commented Apr 18, 2024

travisn commented Apr 18, 2024 •

edited

anthonyeleven commented Apr 18, 2024 via email

anthonyeleven commented Apr 18, 2024 via email

travisn commented Apr 18, 2024

BlaineEXE commented Apr 18, 2024

anthonyeleven commented Apr 18, 2024 via email

BlaineEXE commented Apr 18, 2024

anthonyeleven commented Apr 18, 2024 via email

BlaineEXE commented Apr 18, 2024

pg_autoscaler configuration is needed #14075

pg_autoscaler configuration is needed #14075

Comments

anthonyeleven commented Apr 15, 2024 • edited by BlaineEXE

travisn commented Apr 15, 2024 • edited

anthonyeleven commented Apr 15, 2024

parth-gr commented Apr 16, 2024 • edited

anthonyeleven commented Apr 16, 2024 • edited

parth-gr commented Apr 16, 2024

anthonyeleven commented Apr 16, 2024

parth-gr commented Apr 16, 2024 • edited

anthonyeleven commented Apr 16, 2024

travisn commented Apr 16, 2024

anthonyeleven commented Apr 16, 2024

travisn commented Apr 16, 2024

anthonyeleven commented Apr 17, 2024

travisn commented Apr 17, 2024 • edited

BlaineEXE commented Apr 17, 2024 • edited

anthonyeleven commented Apr 17, 2024 via email

BlaineEXE commented Apr 17, 2024 • edited

anthonyeleven commented Apr 18, 2024 via email

travisn commented Apr 18, 2024

anthonyeleven commented Apr 18, 2024 via email

travisn commented Apr 18, 2024

travisn commented Apr 18, 2024 • edited

anthonyeleven commented Apr 18, 2024 via email

anthonyeleven commented Apr 18, 2024 via email

travisn commented Apr 18, 2024

BlaineEXE commented Apr 18, 2024

anthonyeleven commented Apr 18, 2024 via email

BlaineEXE commented Apr 18, 2024

anthonyeleven commented Apr 18, 2024 via email

BlaineEXE commented Apr 18, 2024

anthonyeleven commented Apr 15, 2024 •

edited by BlaineEXE

travisn commented Apr 15, 2024 •

edited

parth-gr commented Apr 16, 2024 •

edited

anthonyeleven commented Apr 16, 2024 •

edited

parth-gr commented Apr 16, 2024 •

edited

travisn commented Apr 17, 2024 •

edited

BlaineEXE commented Apr 17, 2024 •

edited

BlaineEXE commented Apr 17, 2024 •

edited

travisn commented Apr 18, 2024 •

edited