Creating COSI user causes the object store reconcile to fail several times before finally succeeding #13904

travisn opened this issue Mar 8, 2024 · 4 comments · May be fixed by #14020

Member

travisn commented Mar 8, 2024

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:
The COSI user is created with each object store creation. After the object store creation is completed, the controller attempts to create the COSI user and fails the reconcile. After repeated reconciles, the COSI user is finally created on the fifth reconcile, about 45 seconds in total. This timing is very consistent in my minikube environment every time I create a test object store.

Here are the failure logs that cause the reconcile to restart, then finally succeed. See attached full operator.log

2024-03-08 23:34:27.907685 E | ceph-object-controller: failed to reconcile CephObjectStore "rook-ceph/my-store". failed to create object store deployments: failed to get COSI user "cosi": Get "http://rook-ceph-rgw-my-store.rook-ceph.svc:80/admin/user?format=json&uid=cosi": dial tcp 10.108.29.221:80: connect: connection refused
2024-03-08 23:34:40.336048 E | ceph-object-controller: failed to reconcile CephObjectStore "rook-ceph/my-store". failed to create object store deployments: failed to get COSI user "cosi": Get "http://rook-ceph-rgw-my-store.rook-ceph.svc:80/admin/user?format=json&uid=cosi": dial tcp 10.108.29.221:80: connect: connection refused
2024-03-08 23:34:52.416211 E | ceph-object-controller: failed to reconcile CephObjectStore "rook-ceph/my-store". failed to create object store deployments: failed to get COSI user "cosi": Get "http://rook-ceph-rgw-my-store.rook-ceph.svc:80/admin/user?format=json&uid=cosi": dial tcp 10.108.29.221:80: connect: connection refused
2024-03-08 23:35:03.410629 E | ceph-object-controller: failed to reconcile CephObjectStore "rook-ceph/my-store". failed to create object store deployments: failed to get COSI user "cosi": Get "http://rook-ceph-rgw-my-store.rook-ceph.svc:80/admin/user?format=json&uid=cosi": dial tcp 10.108.29.221:80: connect: connection refused
2024-03-08 23:35:14.378067 I | ceph-object-controller: creating COSI user "cosi"
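
To put numbers on the retry cadence, here is a small sketch that computes the gaps between the timestamps in the log lines above. The helper name `elapsed` and the layout constant are made up for illustration; only the timestamps come from the logs. By this arithmetic the failures are roughly 11 to 12.5 seconds apart, and about 46.5 seconds pass between the first logged failure and the successful creation.

```go
package main

import (
	"fmt"
	"time"
)

// logLayout matches the operator log timestamp format shown above
// (date, time, microseconds). This constant is illustrative, not from Rook.
const logLayout = "2006-01-02 15:04:05.000000"

// elapsed returns the duration between two operator log timestamps.
func elapsed(start, end string) (time.Duration, error) {
	t0, err := time.Parse(logLayout, start)
	if err != nil {
		return 0, fmt.Errorf("parsing start timestamp: %w", err)
	}
	t1, err := time.Parse(logLayout, end)
	if err != nil {
		return 0, fmt.Errorf("parsing end timestamp: %w", err)
	}
	return t1.Sub(t0), nil
}

func main() {
	// Timestamps copied from the four failures and the final success.
	stamps := []string{
		"2024-03-08 23:34:27.907685",
		"2024-03-08 23:34:40.336048",
		"2024-03-08 23:34:52.416211",
		"2024-03-08 23:35:03.410629",
		"2024-03-08 23:35:14.378067",
	}
	for i := 1; i < len(stamps); i++ {
		d, err := elapsed(stamps[i-1], stamps[i])
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("reconcile %d -> %d: %v\n", i, i+1, d)
	}
	total, err := elapsed(stamps[0], stamps[len(stamps)-1])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("first failure to success:", total)
}
```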

Expected behavior:
In the common case, the COSI user should be created successfully without causing so many retries. If there is a known reason the user cannot be created for some time after the object store is created, let's add a check for that condition in the reconcile instead of filling the logs with failed reconciles.

How to reproduce it (minimal and precise):

  1. Create a ceph cluster (cluster-test.yaml)
  2. Create an object store (object-test.yaml)
  3. Check the operator logs for the creation of the COSI user
Contributor

thotz commented Mar 13, 2024

Even when the rgw pod is up and running, it may take a few seconds to be ready, hence this request is failing IMO. Maybe something like this needs to be added before creating the COSI user.
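
One way such a readiness wait could look, as a rough sketch only and not the check referred to above or Rook's actual code: poll the RGW endpoint until it accepts a TCP connection, and only then attempt the admin API call. The names `waitUntilReady`, `tcpProbe`, and `errNotReady` are hypothetical, and the attempt count and interval are arbitrary. The address is the one from the log lines in this issue.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// errNotReady is a hypothetical sentinel meaning the endpoint is not
// accepting connections yet.
var errNotReady = errors.New("endpoint not ready")

// waitUntilReady calls probe up to maxAttempts times, sleeping for interval
// between failed attempts. It returns the number of attempts made and nil on
// success, or the last probe error once the attempts are exhausted.
func waitUntilReady(maxAttempts int, interval time.Duration, probe func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = probe(); lastErr == nil {
			return attempt, nil
		}
		if attempt < maxAttempts {
			time.Sleep(interval)
		}
	}
	return maxAttempts, fmt.Errorf("not ready after %d attempts: %w", maxAttempts, lastErr)
}

// tcpProbe returns a probe that succeeds once addr accepts a TCP connection.
func tcpProbe(addr string, timeout time.Duration) func() error {
	return func() error {
		conn, err := net.DialTimeout("tcp", addr, timeout)
		if err != nil {
			return fmt.Errorf("%w: %v", errNotReady, err)
		}
		return conn.Close()
	}
}

func main() {
	// Service address taken from the logs in this issue; adjust for your store.
	addr := "rook-ceph-rgw-my-store.rook-ceph.svc:80"
	attempts, err := waitUntilReady(30, 2*time.Second, tcpProbe(addr, time.Second))
	if err != nil {
		fmt.Println("RGW not reachable:", err)
		return
	}
	fmt.Printf("RGW reachable after %d attempt(s)\n", attempts)
}
```

A successful TCP connect only shows that something is listening on the port. It does not prove the admin API is ready to serve requests, so a real check might need to go further than this.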

Member Author

travisn commented Mar 13, 2024

> Even when the rgw pod is up and running, it may take a few seconds to be ready, hence this request is failing IMO. Maybe something like this needs to be added before creating the COSI user.

45s is a "long" time to wait since the rgw pod is already up. We need to understand why it takes so long and whether anything can be improved here. In the normal install flow we want to avoid reconcile failures if possible. I believe other users can be created immediately, without waiting this long, but I will check to be sure...

thotz added a commit to thotz/rook that referenced this issue Apr 3, 2024
The object store controller creates the COSI user before the object store
is ready, which creates unnecessary error logs saying that the COSI user
failed to be created.

Fixes: rook#13904

Signed-off-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>
thotz linked a pull request Apr 3, 2024 that will close this issue
thotz added a commit to thotz/rook that referenced this issue Apr 5, 2024
The object store controller creates the COSI user before the object store
is ready. It takes some time for the RGW server to come up and be ready
to receive requests via the REST API, so creating the COSI user fails
until RGW is ready. Other users, such as adminops and dashboard, are
created with the `radosgw-admin` command and never fail, so use the
same approach for the COSI user.

Fixes: rook#13904

Signed-off-by: Jiffin Tony Thottan <thottanjiffin@gmail.com>
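
For illustration, the `radosgw-admin` approach from the commit message above might look roughly like the sketch below. It is not the code from the linked pull request. The helper names (`ensureUser`, `execRunner`, `errCommandFailed`) and the display name are hypothetical, and the cluster-specific flags a real operator would pass (config, keyring, realm and zone) are omitted. It relies only on the `user info` and `user create` subcommands with `--uid` and `--display-name`.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// errCommandFailed is a hypothetical sentinel wrapped around any CLI failure.
var errCommandFailed = errors.New("radosgw-admin command failed")

// runner executes radosgw-admin with the given arguments.
type runner func(args []string) error

// userInfoArgs builds the arguments for `radosgw-admin user info`.
func userInfoArgs(uid string) []string {
	return []string{"user", "info", "--uid=" + uid}
}

// userCreateArgs builds the arguments for `radosgw-admin user create`.
func userCreateArgs(uid, displayName string) []string {
	return []string{"user", "create", "--uid=" + uid, "--display-name=" + displayName}
}

// ensureUser creates the user only if the lookup fails. Treating every lookup
// error as "user not found" is a simplification made for this sketch.
func ensureUser(run runner, uid, displayName string) (bool, error) {
	if err := run(userInfoArgs(uid)); err == nil {
		return false, nil // user already exists
	}
	if err := run(userCreateArgs(uid, displayName)); err != nil {
		return false, fmt.Errorf("creating user %q: %w", uid, err)
	}
	return true, nil
}

// execRunner shells out to a radosgw-admin binary found on PATH.
func execRunner(args []string) error {
	out, err := exec.Command("radosgw-admin", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%w: %v: %s", errCommandFailed, err, out)
	}
	return nil
}

func main() {
	// "COSI user" is an assumed display name, chosen only for this example.
	created, err := ensureUser(execRunner, "cosi", "COSI user")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("created:", created)
}
```

Because the CLI talks to the cluster directly rather than through the RGW HTTP endpoint, this path does not depend on the RGW service accepting connections, which matches the reasoning in the commit message.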

huv95 commented May 3, 2024

I am running a Ceph object store named s3 on HDD drives.
To improve speed, I just added some SSD drives to the cluster and created a new object store called s3-ssd.
However, it is not created successfully. I get the error: failed to create object store deployments: failed to create COSI user "cosi".
Do you have any advice?

Member Author

travisn commented May 3, 2024

@huv95 Does the object store reconcile never succeed? Or do you see that message continuously, even after restarting the operator pod? If it is continuous, there must be some other error in configuring the object store. For example, do you have at least three OSDs on different nodes? What does ceph status show? Can you share more of the operator log?
