Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delete model #17208

Merged
merged 9 commits into from
May 22, 2024
Merged

Delete model #17208

merged 9 commits into from
May 22, 2024

Conversation

SimonRichardson
Copy link
Member

@SimonRichardson SimonRichardson commented Apr 15, 2024

As we can't drop a database when a migration fails, we have to
delete the contents of the database. This includes all the schema
that was added to the database. The rationale for deleting the schema
is that, if you migrate src:A to dst:A, but that fails, and then you
perform an upgrade to dst via a patch release, there isn't a guarantee
that dst:A schema is completely updated. So removing everything leaves
us with a blank slate.

There have been thoughts around aliasing tables instead of a clean
slate, but this is fraught with danger if you accidentally write to
the wrong DB (i.e. the alias one).

In addition to this, we also can't clear the DB in one query, as we
get a segfault. The code batches up the statements, so we're not running
a query for every statement, but should be enough to prevent a
segfault.

The way that we drop everything from the db needs to happen in this
order to prevent constraint violations. In addition, we ensure that
the foreign key param is set to false to also prevent any issues
in ordering of deletions.

We currently do not delete the model during a destroy-model
command, as that completely locks up the model. That will
have to be tackled later.

Checklist

  • Code style: imports ordered, good names, simple structure, etc
  • Comments saying why design decisions were made
  • Go unit tests, with comments saying what you're testing

QA steps

Apply patch

We need to apply this patch to ensure that the migration will fail:

diff --git a/domain/modelconfig/modelmigration/import.go b/domain/modelconfig/modelmigration/import.go
index 1cea5efeb4..13667a3481 100644
--- a/domain/modelconfig/modelmigration/import.go
+++ b/domain/modelconfig/modelmigration/import.go
@@ -5,6 +5,7 @@ package modelmigration

 import (
        "context"
+       "fmt"

        "github.com/juju/description/v5"
        "github.com/juju/errors"
@@ -61,6 +62,8 @@ func (i *importOperation) Setup(scope modelmigration.Scope) error {
 func (i *importOperation) Execute(ctx context.Context, model description.Model) error {
        attrs := model.Config()

+       return fmt.Errorf("BOOM")
+
        // If we don't have any model config, then there is something seriously
        // wrong. In this case, we should return an error.
        if len(attrs) == 0 {

Create the src:

$ juju bootstrap lxd src
$ juju add-model default

Create the dst:

$ juju bootstrap lxd dst
$ juju switch src
$ juju migrate src:default dst

This should fail (check the logs).
Using the juju models command, find the default model-uuid, we'll need it later.

$ make repl-install

Login to dqlite

Ensure to replace <model-uuid> with the valid one.

$ juju ssh -m dst:controller 0
$ sudo snap install yq
# Note we need to pipe indirection because of snap confinement.
$ sudo cat /var/lib/juju/agents/machine-0/agent.conf | yq '.controllercert' | xargs -I% echo % > dqlite.cert
$ sudo cat /var/lib/juju/agents/machine-0/agent.conf | yq '.controllerkey' | xargs -I% echo % > dqlite.key
$ sudo dqlite -s file:///var/lib/juju/dqlite/cluster.yaml -c ./dqlite.cert -k ./dqlite.key <model-uuid>
> SELECT * FROM sqlite_master WHERE name NOT LIKE 'sqlite_%';

The result should be empty.

Revert patch

Revert the patch and upgrade the src and dst controllers.

$ juju migrate src:default dst

Should succeed.

Links

Jira card: JUJU-5877

@SimonRichardson
Copy link
Member Author

/build

apiserver/facades/client/modelmanager/modelmanager.go Outdated Show resolved Hide resolved
// other models.
modelUUID := coremodel.UUID(stModel.UUID())

// TODO (stickupkid): We need to delete the model info when
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment doesn't make sense. In any case, could we not in this instance delete and recreate the table? I don't think anything has RI to it...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because we need to implement the life cycle for a model. We're currently going from Alive -> Dead in the model manager, instead of Alive -> Dying -> Dead

@@ -221,12 +224,20 @@ func NewWorker(cfg WorkerConfig) (*dbWorker, error) {
// that case we do want to cause the dbaccessor to go down. This
// will then bring up a new dqlite app.
IsFatal: func(err error) bool {
// If a database is dead we should not kill the worker.
if errors.Is(err, database.ErrDBDead) {
Copy link
Member

@manadart manadart Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this says that if we kill the worker with ErrDBDead, the worker is not restarted and it is not fatal to us, the parent, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

internal/worker/dbaccessor/worker.go Outdated Show resolved Hide resolved
core/context/sourceable.go Outdated Show resolved Hide resolved
internal/worker/dbaccessor/tracker.go Outdated Show resolved Hide resolved
@SimonRichardson
Copy link
Member Author

The intermittent test failures are fixed by #17341

@SimonRichardson
Copy link
Member Author

/build

Copy link
Member

@manadart manadart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good as a progression. All QA is solid.

We do however have to be very careful regarding the proper progression of model life-cycle, and cleanup/teardown when we come to it.

A note in the risk register wouldn't hurt - we don't ever want this path recruited by accident or prematurely.

@@ -49,6 +50,7 @@ type sharedServerContext struct {
logger corelogger.Logger
charmhubHTTPClient facade.HTTPClient
dbGetter changestream.WatchableDBGetter
dbDeleter database.DBDeleter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like a comment here regarding the fact that it ultimately only ends up in a migration context, and is under no circumstances to have expanded availability.

@@ -85,6 +88,7 @@ type ModelInfoService interface {
// CreateModel is responsible for creating a new read only model
// that is being imported.
CreateModel(context.Context, uuid.UUID) error
DeleteModel(context.Context) error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment this. Say exactly what it does, which DB(s) it acts on.

The model migration service needs to be able to delete models if it
fails to correctly import. Removal of the model is required so that
another attempt can be made.

This mostly involves threading the model deleter through the apiserver
to pass to the model migartion coordinator.
As we can't drop a database, when a migration fails, we have to
delete the contents of the database. This includes all the schema
that was added to the database. The rationale for deleting the schema
is that, if you migrate src:A to dst:A, but that fails and then you
perform a upgrade to dst via a patch release, there isn't a guarantee
that dst:A schema is completely updated. So removing everything leaves
us with a blank state.

There have been thoughts around aliasing tables instead of a clean
slate, but this is fraught with danger if you accidently write to
the wrong DB (i.e. the alias one).

In addition to this, we also can't clear the DB in one query, as we
get a segfault. The code batches up the statements, so we're not running
a query for every statement, but should be enough to prevent a
segfault.

The way that we drop everything from the db needs to happen in this
order to prevent constraint violations. In addition we ensure that
the foreign key param is set to _false_ to also prevent any issues
in ordering of deletions.
As the read-only model has some immutable triggers around it to prevent
deletion and updates, we need to destroy it before deleting the DB.
The worker needs to be stopped upon deletion. We want to ensure it's
dead. We don't want to allow the worker to spring back up. So having
a sentinel error for when we want to delete a model propergate through
the stack allows others to know when to give up.
Until we fully implement tearing down of a model, we can't currently
remove a model until the very last moment. This code just makes it
optional, so that we can state that during a model migration we do
want to delete the model so we can retry.

In addition, I've also fixed the dbaccessor to correctly terminate
transactions based on if the tomb is dying. See context sourceable.
To prevent all trackers from polling the db at the same causing
problems, introduce a jitter to give more random access patterns
to the requests.
We need to remove the model info (read-only model), but we can't
do it until everything has been removed from the database. Attempting
to remove it early causes everything to lock up.
The comments weren't quite correct, improve them to better explain
i.e. why we're tying the tomb errors to the context.
@SimonRichardson
Copy link
Member Author

/merge

@jujubot jujubot merged commit 154487a into juju:main May 22, 2024
15 of 17 checks passed
@SimonRichardson SimonRichardson deleted the delete-model branch May 22, 2024 15:56

// If the db should not be deleted then we can return early.
if !options.DeleteDB() {
s.logger.Infof("skipping model deletion, model database will still be present")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need logging in here? Surely this would be better from the callers side and also updating the comment of the function to explain expected default behaviour.

My reason for asking is I am really not a fan of logging in the services layer. The services layer has nice well defined behaviour that is documented and strong errors. Logging can get pushed up a layer with all of that done I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this only exists because we can't delete a model outside of model migration in its current state. The key part we're missing is going from alive -> dying -> dead. We should delete these options when models are correctly progressing through the life states.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants