New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rgw: update period if period does not exist #8828
rgw: update period if period does not exist #8828
Conversation
} | ||
t.Logf("returning default success for command: %s %v", command, args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what commands are the default success returned? There are others besides radosgw-admin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the output...
...
/Users/blaine/go/src/github.com/rook/rook/pkg/operator/ceph/object/controller_test.go:445: returning default success for command: ceph [config get mon. rgw_rados_pool_pg_num_min --connect-timeout=15 --cluster=rook-ceph --conf=rook-ceph/rook-ceph.config --name=client.admin --keyring=rook-ceph/client.admin.keyring --format json]
/Users/blaine/go/src/github.com/rook/rook/pkg/operator/ceph/object/controller_test.go:445: returning default success for command: ceph [osd pool get .rgw.root all --connect-timeout=15 --cluster=rook-ceph --conf=rook-ceph/rook-ceph.config --name=client.admin --keyring=rook-ceph/client.admin.keyring --format json]
/Users/blaine/go/src/github.com/rook/rook/pkg/operator/ceph/object/controller_test.go:445: returning default success for command: ceph [osd pool get my-store.rgw.log all --connect-timeout=15 --cluster=rook-ceph --conf=rook-ceph/rook-ceph.config --name=client.admin --keyring=rook-ceph/client.admin.keyring --format json]
2021-09-24 16:23:46.315080 I | cephclient: setting pool property "pg_num_min" to "" on pool ".rgw.root"
2021-09-24 16:23:46.315086 I | cephclient: setting pool property "pg_num_min" to "" on pool "my-store.rgw.log"
/Users/blaine/go/src/github.com/rook/rook/pkg/operator/ceph/object/controller_test.go:445: returning default success for command: ceph [osd pool set .rgw.root pg_num_min --connect-timeout=15 --cluster=rook-ceph --conf=rook-ceph/rook-ceph.config --name=client.admin --keyring=rook-ceph/client.admin.keyring --format json]
...
/Users/blaine/go/src/github.com/rook/rook/pkg/operator/ceph/object/controller_test.go:467: returning default success for command: radosgw-admin [period update --commit --rgw-realm=my-store --rgw-zonegroup=my-store --rgw-zone=my-store --cluster=rook-ceph --conf=rook-ceph/rook-ceph.config --name=client.admin --keyring=rook-ceph/client.admin.keyring]
...
/Users/blaine/go/src/github.com/rook/rook/pkg/operator/ceph/object/controller_test.go:445: returning default success for command: ceph [config set client.rgw.my.store.a rgw_log_nonexistent_bucket true --connect-timeout=15 --cluster=rook-ceph --conf=rook-ceph/rook-ceph.config --name=client.admin --keyring=rook-ceph/client.admin.keyring --format json]
2021-09-24 16:23:46.315652 I | op-config: successfully set "client.rgw.my.store.a"="rgw_log_nonexistent_bucket"="true" option to the mon configuration database
2021-09-24 16:23:46.315656 I | op-config: setting "client.rgw.my.store.a"="rgw_log_object_name_utc"="true" option to the mon configuration database
...
/Users/blaine/go/src/github.com/rook/rook/pkg/operator/ceph/object/controller_test.go:445: returning default success for command: ceph [dashboard get-rgw-api-access-key --connect-timeout=15 --cluster=rook-ceph --conf=rook-ceph/rook-ceph.config --name=client.admin --keyring=rook-ceph/client.admin.keyring --format json]
...
Also, to be clear, I have only added the log message and not changed the behavior in any other way. In general, I think it's probably a bad practice to return a default success for any commands in these tests.
} | ||
logger.Debugf("updated period for realm %v", objContext.Realm) | ||
// the period will help notify other zones of changes if there are multi-zones | ||
_, err = runAdminCommand(objContext, false, "period", "update", "--commit") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought there were side effects of updating the period with every reconcile when nothing changed. But that was a long time ago, so maybe something changed. Did the rgw team recommend this change? Or at least would be good to review the expectations around updating the period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. It would be good to get clarity on this. @thotz ?
From what I observed, the command didn't fail, but that doesn't mean there couldn't be a side effect. TBH, I think Rook should be doing this, and if there is a side effect, that should be a Ceph bug to fix. Even with cephadm, Ceph is getting into the idempotent setup game.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not aware of side effects wrt period update
command. May @alimaredia or @cbodley have a more clear idea.
Even if there is no side effect IMHO it is better to check the via period get
to see whether zone/zonegroups are synced before performing the period commit
than performing everytime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
each period commit
would generate a new period 'epoch', and all radosgws in that realm would have to pause incoming requests to re-initialize themselves with that new period configuration. that should be avoided if nothing actually changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbodley is there a straightforward way to know whether we should commit a new period or not? It's going to be tricky to inspect each zone, group, realm and decide to update or not.
Also, looking ahead, is it a reasonable feature request for the RGW to make the commands idempotent and not update the RGWs if there are no changes needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbodley is there a straightforward way to know whether we should commit a new period or not? It's going to be tricky to inspect each zone, group, realm and decide to update or not.
so rook doesn't know whether or not it changed anything? you could presumably run period update
without --commit to generate the staging period, then compare the json output of period get
(the current period) with period get --staging
(the updated period) to decide whether or not to period commit
Also, looking ahead, is it a reasonable feature request for the RGW to make the commands idempotent and not update the RGWs if there are no changes needed?
the command is doing what you told it to do, so i'm not crazy about the idea. but if someone wants to work on that, i'm willing to help with review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the tip about the staging period. For now, I reverted back to our old method of seeing if we needed updates with an addition. It seems that some of our code around multisite configuration always updates realms/zonegroups/zones/periods, and we will have to modify those in the future. The staging
feature should help us do that more easily.
} | ||
logger.Debugf("updated period for realm %v", objContext.Realm) | ||
// the period will help notify other zones of changes if there are multi-zones | ||
_, err = runAdminCommand(objContext, false, "period", "update", "--commit") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not aware of side effects wrt period update
command. May @alimaredia or @cbodley have a more clear idea.
Even if there is no side effect IMHO it is better to check the via period get
to see whether zone/zonegroups are synced before performing the period commit
than performing everytime
As we discussed in huddle, if there is no side effect, I think it is better to update the period every time. This just reduces the number of API calls to Ceph, and it means there is one less command that can become broken if, for example, the output of the command fails. We've had issues with I haven't been able to get in contact with @cbodley or @alimaredia directly, so for now I will proceed with the check-and-set method proposed with the intent to change it to an idempotent |
Actually, I want to backtrack that comment. @alimaredia wrote the code for multisite, and he created two places in the code where it calls |
c995104
to
7e1c5ee
Compare
a308b3b
to
572d5d9
Compare
572d5d9
to
a3e64ae
Compare
|
||
// Mock an executed command with the desired return values. | ||
// STDERR is returned *before* STDOUT. | ||
// | ||
// This will return an error if the given exit code is nonzero. The error return is the primary | ||
// benefit of using this method. | ||
// | ||
// In order for this to work in a `*_test.go` file, you MUST import TestMockExecHelperProcess | ||
// exactly as shown below: | ||
// import exectest "github.com/rook/rook/pkg/util/exec/test" | ||
// // import TestMockExecHelperProcess | ||
// func TestMockExecHelperProcess(t *testing.T) { | ||
// exectest.TestMockExecHelperProcess(t) | ||
// } | ||
// Inspired by: https://github.com/golang/go/blob/master/src/os/exec/exec_test.go | ||
func MockExecCommandReturns(t *testing.T, stdout, stderr string, retcode int) error { | ||
cmd := exec.Command(os.Args[0], "-test.run=TestMockExecHelperProcess") //nolint:gosec //Rook controls the input to the exec arguments | ||
cmd.Env = append(os.Environ(), | ||
"GO_WANT_HELPER_PROCESS=1", | ||
fmt.Sprintf("GO_HELPER_PROCESS_STDOUT=%s", stdout), | ||
fmt.Sprintf("GO_HELPER_PROCESS_STDERR=%s", stderr), | ||
fmt.Sprintf("GO_HELPER_PROCESS_RETCODE=%d", retcode), | ||
) | ||
err := cmd.Run() | ||
return err | ||
} | ||
|
||
// TestHelperProcess isn't a real test. It's used as a helper process for MockExecCommandReturns to | ||
// simulate output from a command. Notably, this can return a realistic os/exec error. | ||
// Inspired by: https://github.com/golang/go/blob/master/src/os/exec/exec_test.go | ||
func TestMockExecHelperProcess(t *testing.T) { | ||
if os.Getenv("GO_WANT_HELPER_PROCESS") != "1" { | ||
return | ||
} | ||
|
||
// test should set these in its environment to control the output of the test commands | ||
fmt.Fprint(os.Stderr, os.Getenv("GO_HELPER_PROCESS_STDERR")) // return stderr before stdout | ||
fmt.Fprint(os.Stdout, os.Getenv("GO_HELPER_PROCESS_STDOUT")) | ||
rc, err := strconv.Atoi(os.Getenv("GO_HELPER_PROCESS_RETCODE")) | ||
if err != nil { | ||
panic(err) | ||
} | ||
os.Exit(rc) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to specifically get an error with a parsable return code from the new unit tests, so I move some code from exec/exec_test.go
that I used for the same purpose to exec/test/mockexec.go
to allow for reuse.
Rook should update the RGW object store's period if the period doesn't yet exist. This protects us from the case where the 'radosgw-admin period update --commit` command fails and the CephObjectStore controller reconciles again. Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
a3e64ae
to
7b92936
Compare
// ENOENT means “No such file or directory” | ||
if err == nil && code == int(syscall.ENOENT) { | ||
// period does not exist and so needs to be created | ||
updatePeriod = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the period doesn't exist, won't updatePeriod
already be set previously since the zone, zonegroup, and realm were also created? Or you're guarding against if the operator dies in the middle of the reconcile?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... that's a good question. I assume get
will fail if it wasn't set, but I can't remember if that's what I found out from my testing or if that's an assumption... let me check on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified that if period update --commit
hasn't been called, then period get
will fail. Also verified in a manual test that the period is updated on the first reconcile and not on subsequent reconciles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIR in the https://bugzilla.redhat.com/show_bug.cgi?id=2002220 setup, the radosgw-admin period get
was working fine with containing only realm data but radowgw-admin period get --staging
was failing with ENOENT
period init failed: (2) No such file or directory
2021-09-28T05:49:50.508+0000 7fd753369380 0 failed reading obj info from .rgw.root:periods.3f5441b6-b2ea-43fe-a64e-8cb78e5fe8ae:staging: (2) No such file or directory
So IMO it is better to add --staging
for the radosgw-admin period get
command from this experience
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation avoids using --staging
at all, but that feedback was useful for adding to this issue: #8879
code, err := exec.ExtractExitCode(err) | ||
// ENOENT means “No such file or directory” | ||
if err == nil && code == int(syscall.ENOENT) { | ||
// period does not exist and so needs to be created |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have info or debug log here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't really needed since the command for the create below is debug logged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More logging is always better at least as an indication for humans. Anyway, I'm not gonna hold for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add it in a follow-up
Addressed, or tracked in a follow-up issue
rgw: update period if period does not exist (backport #8828)
Rook should update the RGW object store's period if the period doesn't
yet exist. This protects us from the case where the
'radosgw-admin period update --commit` command fails and the
CephObjectStore controller reconciles again.
Signed-off-by: Blaine Gardner blaine.gardner@redhat.com
Description of your changes:
Which issue is resolved by this Pull Request:
Resolves #
Checklist:
make codegen
) has been run to update object specifications, if necessary.