Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rgw: update period if period does not exist #8828

Conversation

BlaineEXE
Copy link
Member

@BlaineEXE BlaineEXE commented Sep 24, 2021

Rook should update the RGW object store's period if the period doesn't
yet exist. This protects us from the case where the
'radosgw-admin period update --commit` command fails and the
CephObjectStore controller reconciles again.

Signed-off-by: Blaine Gardner blaine.gardner@redhat.com

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

}
t.Logf("returning default success for command: %s %v", command, args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what commands are the default success returned? There are others besides radosgw-admin?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the output...

...
    /Users/blaine/go/src/github.com/rook/rook/pkg/operator/ceph/object/controller_test.go:445: returning default success for command: ceph [config get mon. rgw_rados_pool_pg_num_min --connect-timeout=15 --cluster=rook-ceph --conf=rook-ceph/rook-ceph.config --name=client.admin --keyring=rook-ceph/client.admin.keyring --format json]
    /Users/blaine/go/src/github.com/rook/rook/pkg/operator/ceph/object/controller_test.go:445: returning default success for command: ceph [osd pool get .rgw.root all --connect-timeout=15 --cluster=rook-ceph --conf=rook-ceph/rook-ceph.config --name=client.admin --keyring=rook-ceph/client.admin.keyring --format json]
    /Users/blaine/go/src/github.com/rook/rook/pkg/operator/ceph/object/controller_test.go:445: returning default success for command: ceph [osd pool get my-store.rgw.log all --connect-timeout=15 --cluster=rook-ceph --conf=rook-ceph/rook-ceph.config --name=client.admin --keyring=rook-ceph/client.admin.keyring --format json]
2021-09-24 16:23:46.315080 I | cephclient: setting pool property "pg_num_min" to "" on pool ".rgw.root"
2021-09-24 16:23:46.315086 I | cephclient: setting pool property "pg_num_min" to "" on pool "my-store.rgw.log"
    /Users/blaine/go/src/github.com/rook/rook/pkg/operator/ceph/object/controller_test.go:445: returning default success for command: ceph [osd pool set .rgw.root pg_num_min  --connect-timeout=15 --cluster=rook-ceph --conf=rook-ceph/rook-ceph.config --name=client.admin --keyring=rook-ceph/client.admin.keyring --format json]
...
    /Users/blaine/go/src/github.com/rook/rook/pkg/operator/ceph/object/controller_test.go:467: returning default success for command: radosgw-admin [period update --commit --rgw-realm=my-store --rgw-zonegroup=my-store --rgw-zone=my-store --cluster=rook-ceph --conf=rook-ceph/rook-ceph.config --name=client.admin --keyring=rook-ceph/client.admin.keyring]
...
    /Users/blaine/go/src/github.com/rook/rook/pkg/operator/ceph/object/controller_test.go:445: returning default success for command: ceph [config set client.rgw.my.store.a rgw_log_nonexistent_bucket true --connect-timeout=15 --cluster=rook-ceph --conf=rook-ceph/rook-ceph.config --name=client.admin --keyring=rook-ceph/client.admin.keyring --format json]
2021-09-24 16:23:46.315652 I | op-config: successfully set "client.rgw.my.store.a"="rgw_log_nonexistent_bucket"="true" option to the mon configuration database
2021-09-24 16:23:46.315656 I | op-config: setting "client.rgw.my.store.a"="rgw_log_object_name_utc"="true" option to the mon configuration database
...
    /Users/blaine/go/src/github.com/rook/rook/pkg/operator/ceph/object/controller_test.go:445: returning default success for command: ceph [dashboard get-rgw-api-access-key --connect-timeout=15 --cluster=rook-ceph --conf=rook-ceph/rook-ceph.config --name=client.admin --keyring=rook-ceph/client.admin.keyring --format json]
...

Also, to be clear, I have only added the log message and not changed the behavior in any other way. In general, I think it's probably a bad practice to return a default success for any commands in these tests.

}
logger.Debugf("updated period for realm %v", objContext.Realm)
// the period will help notify other zones of changes if there are multi-zones
_, err = runAdminCommand(objContext, false, "period", "update", "--commit")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought there were side effects of updating the period with every reconcile when nothing changed. But that was a long time ago, so maybe something changed. Did the rgw team recommend this change? Or at least would be good to review the expectations around updating the period.

Copy link
Member Author

@BlaineEXE BlaineEXE Sep 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. It would be good to get clarity on this. @thotz ?

From what I observed, the command didn't fail, but that doesn't mean there couldn't be a side effect. TBH, I think Rook should be doing this, and if there is a side effect, that should be a Ceph bug to fix. Even with cephadm, Ceph is getting into the idempotent setup game.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not aware of side effects wrt period update command. May @alimaredia or @cbodley have a more clear idea.
Even if there is no side effect IMHO it is better to check the via period get to see whether zone/zonegroups are synced before performing the period commit than performing everytime

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

each period commit would generate a new period 'epoch', and all radosgws in that realm would have to pause incoming requests to re-initialize themselves with that new period configuration. that should be avoided if nothing actually changed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbodley is there a straightforward way to know whether we should commit a new period or not? It's going to be tricky to inspect each zone, group, realm and decide to update or not.

Also, looking ahead, is it a reasonable feature request for the RGW to make the commands idempotent and not update the RGWs if there are no changes needed?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbodley is there a straightforward way to know whether we should commit a new period or not? It's going to be tricky to inspect each zone, group, realm and decide to update or not.

so rook doesn't know whether or not it changed anything? you could presumably run period update without --commit to generate the staging period, then compare the json output of period get (the current period) with period get --staging (the updated period) to decide whether or not to period commit

Also, looking ahead, is it a reasonable feature request for the RGW to make the commands idempotent and not update the RGWs if there are no changes needed?

the command is doing what you told it to do, so i'm not crazy about the idea. but if someone wants to work on that, i'm willing to help with review

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the tip about the staging period. For now, I reverted back to our old method of seeing if we needed updates with an addition. It seems that some of our code around multisite configuration always updates realms/zonegroups/zones/periods, and we will have to modify those in the future. The staging feature should help us do that more easily.

thotz
thotz previously requested changes Sep 27, 2021
}
logger.Debugf("updated period for realm %v", objContext.Realm)
// the period will help notify other zones of changes if there are multi-zones
_, err = runAdminCommand(objContext, false, "period", "update", "--commit")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not aware of side effects wrt period update command. May @alimaredia or @cbodley have a more clear idea.
Even if there is no side effect IMHO it is better to check the via period get to see whether zone/zonegroups are synced before performing the period commit than performing everytime

@BlaineEXE
Copy link
Member Author

Even if there is no side effect IMHO it is better to check the via period get to see whether zone/zonegroups are synced before performing the period commit than performing everytime

As we discussed in huddle, if there is no side effect, I think it is better to update the period every time. This just reduces the number of API calls to Ceph, and it means there is one less command that can become broken if, for example, the output of the command fails. We've had issues with radosgw-admin calls before.

I haven't been able to get in contact with @cbodley or @alimaredia directly, so for now I will proceed with the check-and-set method proposed with the intent to change it to an idempotent set in the future as long as there are no potentially harmful side effects.

@BlaineEXE
Copy link
Member Author

Even if there is no side effect IMHO it is better to check the via period get to see whether zone/zonegroups are synced before performing the period commit than performing everytime

As we discussed in huddle, if there is no side effect, I think it is better to update the period every time. This just reduces the number of API calls to Ceph, and it means there is one less command that can become broken if, for example, the output of the command fails. We've had issues with radosgw-admin calls before.

I haven't been able to get in contact with @cbodley or @alimaredia directly, so for now I will proceed with the check-and-set method proposed with the intent to change it to an idempotent set in the future as long as there are no potentially harmful side effects.

Actually, I want to backtrack that comment. @alimaredia wrote the code for multisite, and he created two places in the code where it calls radosgw-admin period update --commit without any checking beforehand (here and here). Given that, I think we should consider it safe for now to proceed forward with just setting the period each time. If we find out later that there are side-effects, then we will need to change all 3 places.

@BlaineEXE BlaineEXE force-pushed the rgw-period-command-should-be-run-on-every-reconcile branch from c995104 to 7e1c5ee Compare September 28, 2021 16:51
@BlaineEXE BlaineEXE force-pushed the rgw-period-command-should-be-run-on-every-reconcile branch 2 times, most recently from a308b3b to 572d5d9 Compare September 28, 2021 20:08
@BlaineEXE BlaineEXE changed the title rgw: always update period, even if realm exists rgw: update period if period does not exist Sep 28, 2021
@BlaineEXE BlaineEXE force-pushed the rgw-period-command-should-be-run-on-every-reconcile branch from 572d5d9 to a3e64ae Compare September 28, 2021 20:13
Comment on lines +83 to +126

// Mock an executed command with the desired return values.
// STDERR is returned *before* STDOUT.
//
// This will return an error if the given exit code is nonzero. The error return is the primary
// benefit of using this method.
//
// In order for this to work in a `*_test.go` file, you MUST import TestMockExecHelperProcess
// exactly as shown below:
// import exectest "github.com/rook/rook/pkg/util/exec/test"
// // import TestMockExecHelperProcess
// func TestMockExecHelperProcess(t *testing.T) {
// exectest.TestMockExecHelperProcess(t)
// }
// Inspired by: https://github.com/golang/go/blob/master/src/os/exec/exec_test.go
func MockExecCommandReturns(t *testing.T, stdout, stderr string, retcode int) error {
cmd := exec.Command(os.Args[0], "-test.run=TestMockExecHelperProcess") //nolint:gosec //Rook controls the input to the exec arguments
cmd.Env = append(os.Environ(),
"GO_WANT_HELPER_PROCESS=1",
fmt.Sprintf("GO_HELPER_PROCESS_STDOUT=%s", stdout),
fmt.Sprintf("GO_HELPER_PROCESS_STDERR=%s", stderr),
fmt.Sprintf("GO_HELPER_PROCESS_RETCODE=%d", retcode),
)
err := cmd.Run()
return err
}

// TestHelperProcess isn't a real test. It's used as a helper process for MockExecCommandReturns to
// simulate output from a command. Notably, this can return a realistic os/exec error.
// Inspired by: https://github.com/golang/go/blob/master/src/os/exec/exec_test.go
func TestMockExecHelperProcess(t *testing.T) {
if os.Getenv("GO_WANT_HELPER_PROCESS") != "1" {
return
}

// test should set these in its environment to control the output of the test commands
fmt.Fprint(os.Stderr, os.Getenv("GO_HELPER_PROCESS_STDERR")) // return stderr before stdout
fmt.Fprint(os.Stdout, os.Getenv("GO_HELPER_PROCESS_STDOUT"))
rc, err := strconv.Atoi(os.Getenv("GO_HELPER_PROCESS_RETCODE"))
if err != nil {
panic(err)
}
os.Exit(rc)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to specifically get an error with a parsable return code from the new unit tests, so I move some code from exec/exec_test.go that I used for the same purpose to exec/test/mockexec.go to allow for reuse.

Rook should update the RGW object store's period if the period doesn't
yet exist. This protects us from the case where the
'radosgw-admin period update --commit` command fails and the
CephObjectStore controller reconciles again.

Signed-off-by: Blaine Gardner <blaine.gardner@redhat.com>
@BlaineEXE BlaineEXE force-pushed the rgw-period-command-should-be-run-on-every-reconcile branch from a3e64ae to 7b92936 Compare September 28, 2021 20:17
// ENOENT means “No such file or directory”
if err == nil && code == int(syscall.ENOENT) {
// period does not exist and so needs to be created
updatePeriod = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the period doesn't exist, won't updatePeriod already be set previously since the zone, zonegroup, and realm were also created? Or you're guarding against if the operator dies in the middle of the reconcile?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... that's a good question. I assume get will fail if it wasn't set, but I can't remember if that's what I found out from my testing or if that's an assumption... let me check on that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that if period update --commit hasn't been called, then period get will fail. Also verified in a manual test that the period is updated on the first reconcile and not on subsequent reconciles.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIR in the https://bugzilla.redhat.com/show_bug.cgi?id=2002220 setup, the radosgw-admin period get was working fine with containing only realm data but radowgw-admin period get --staging was failing with ENOENT

period init failed: (2) No such file or directory
2021-09-28T05:49:50.508+0000 7fd753369380  0 failed reading obj info from .rgw.root:periods.3f5441b6-b2ea-43fe-a64e-8cb78e5fe8ae:staging: (2) No such file or directory

So IMO it is better to add --staging for the radosgw-admin period get command from this experience

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation avoids using --staging at all, but that feedback was useful for adding to this issue: #8879

code, err := exec.ExtractExitCode(err)
// ENOENT means “No such file or directory”
if err == nil && code == int(syscall.ENOENT) {
// period does not exist and so needs to be created
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have info or debug log here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't really needed since the command for the create below is debug logged

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More logging is always better at least as an indication for humans. Anyway, I'm not gonna hold for this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add it in a follow-up

@travisn travisn dismissed thotz’s stale review September 29, 2021 19:07

Addressed, or tracked in a follow-up issue

@BlaineEXE BlaineEXE merged commit 2826bc9 into rook:master Sep 29, 2021
@BlaineEXE BlaineEXE deleted the rgw-period-command-should-be-run-on-every-reconcile branch September 29, 2021 19:08
mergify bot added a commit that referenced this pull request Sep 29, 2021
rgw: update period if period does not exist (backport #8828)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants