Delete model #17208

SimonRichardson · 2024-04-15T20:15:34Z

As we can't drop a database when a migration fails, we have to
delete the contents of the database. This includes all the schema
that was added to the database. The rationale for deleting the schema
is that, if you migrate src:A to dst:A, but that fails, and then you
perform an upgrade to dst via a patch release, there isn't a guarantee
that dst:A schema is completely updated. So removing everything leaves
us with a blank slate.

There have been thoughts around aliasing tables instead of a clean
slate, but this is fraught with danger if you accidentally write to
the wrong DB (i.e. the alias one).

In addition to this, we also can't clear the DB in one query, as we
get a segfault. Instead, the code batches up the statements: we're not
running a query for every statement, but each batch should be small
enough to prevent a segfault.

Dropping everything from the DB needs to happen in the order the code
uses, to prevent constraint violations. In addition, we ensure that
the foreign key param is set to false to also prevent any issues
in ordering of deletions.
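
As a rough illustration of the batching and ordering described above, here is a minimal Go sketch. It is not the code from this PR: schemaObject, dropRank, dropStatements and batches are hypothetical names, and the drop order (triggers, views, indexes, then tables) is an assumption. In SQLite, foreign key enforcement is normally turned off with PRAGMA foreign_keys = OFF; how the PR sets it is not shown here.

```go
package main

import (
	"sort"
	"strings"
)

// schemaObject is a hypothetical stand-in for one row of sqlite_master.
type schemaObject struct {
	Kind string // "trigger", "view", "index" or "table"
	Name string
}

// dropRank orders object kinds so that dependents go before the tables they
// hang off. The exact order is an assumption made for this sketch.
var dropRank = map[string]int{"trigger": 0, "view": 1, "index": 2, "table": 3}

// quoteIdent wraps name in double quotes, doubling any embedded quote.
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// dropStatements returns one DROP statement per object, sorted by dropRank.
func dropStatements(objs []schemaObject) []string {
	sorted := append([]schemaObject(nil), objs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return dropRank[sorted[i].Kind] < dropRank[sorted[j].Kind]
	})
	stmts := make([]string, 0, len(sorted))
	for _, o := range sorted {
		stmts = append(stmts, "DROP "+strings.ToUpper(o.Kind)+" IF EXISTS "+quoteIdent(o.Name)+";")
	}
	return stmts
}

// batches joins stmts into chunks of at most size statements each, so a
// caller runs a handful of medium-sized queries instead of one huge query
// or one query per statement.
func batches(stmts []string, size int) []string {
	if size < 1 {
		size = 1
	}
	var out []string
	for start := 0; start < len(stmts); start += size {
		end := start + size
		if end > len(stmts) {
			end = len(stmts)
		}
		out = append(out, strings.Join(stmts[start:end], "\n"))
	}
	return out
}
```

A caller would execute each element of batches(dropStatements(objs), n) in turn; the right value of n is something to find by experiment.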

We currently do not delete the model during a destroy-model
command, as that completely locks up the model. That will
have to be tackled later.

Checklist

Code style: imports ordered, good names, simple structure, etc
Comments saying why design decisions were made
Go unit tests, with comments saying what you're testing

QA steps

Apply patch

We need to apply this patch to ensure that the migration will fail:

diff --git a/domain/modelconfig/modelmigration/import.go b/domain/modelconfig/modelmigration/import.go
index 1cea5efeb4..13667a3481 100644
--- a/domain/modelconfig/modelmigration/import.go
+++ b/domain/modelconfig/modelmigration/import.go
@@ -5,6 +5,7 @@ package modelmigration

 import (
        "context"
+       "fmt"

        "github.com/juju/description/v5"
        "github.com/juju/errors"
@@ -61,6 +62,8 @@ func (i *importOperation) Setup(scope modelmigration.Scope) error {
 func (i *importOperation) Execute(ctx context.Context, model description.Model) error {
        attrs := model.Config()

+       return fmt.Errorf("BOOM")
+
        // If we don't have any model config, then there is something seriously
        // wrong. In this case, we should return an error.
        if len(attrs) == 0 {

Create the src:

$ juju bootstrap lxd src
$ juju add-model default

Create the dst:

$ juju bootstrap lxd dst
$ juju switch src
$ juju migrate src:default dst

This should fail (check the logs).
Use the juju models command to find the model UUID of default; we'll need it later.

$ make repl-install

Login to dqlite

Make sure to replace <model-uuid> with the valid one.

$ juju ssh -m dst:controller 0
$ sudo snap install yq
# Note: the pipe indirection is needed because of snap confinement.
$ sudo cat /var/lib/juju/agents/machine-0/agent.conf | yq '.controllercert' | xargs -I% echo % > dqlite.cert
$ sudo cat /var/lib/juju/agents/machine-0/agent.conf | yq '.controllerkey' | xargs -I% echo % > dqlite.key
$ sudo dqlite -s file:///var/lib/juju/dqlite/cluster.yaml -c ./dqlite.cert -k ./dqlite.key <model-uuid>
> SELECT * FROM sqlite_master WHERE name NOT LIKE 'sqlite_%';

The result should be empty.

Revert patch

Revert the patch and upgrade the src and dst controllers.

$ juju migrate src:default dst

This should succeed.

Links

Jira card: JUJU-5877

domain/model/modelmigration/import.go

SimonRichardson · 2024-04-16T07:44:08Z

/build

apiserver/facades/client/modelmanager/modelmanager.go

manadart · 2024-04-25T18:05:58Z

apiserver/facades/client/modelmanager/modelmanager.go

+			// other models.
+			modelUUID := coremodel.UUID(stModel.UUID())
+
+			// TODO (stickupkid): We need to delete the model info when


This comment doesn't make sense. In any case, could we not in this instance delete and recreate the table? I don't think anything has RI to it...

This is because we need to implement the life cycle for a model. We're currently going from Alive -> Dead in the model manager, instead of Alive -> Dying -> Dead
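
To make the intended progression concrete, here is a small Go sketch of a life value that may only move one step at a time. The Life type and the advance function are illustrative assumptions, not the types used in the Juju code base.

```go
package main

import "fmt"

// Life is a hypothetical stand-in for a model's life value.
type Life string

const (
	Alive Life = "alive"
	Dying Life = "dying"
	Dead  Life = "dead"
)

// nextLife lists the only forward step allowed from each state.
var nextLife = map[Life]Life{Alive: Dying, Dying: Dead}

// advance returns next if it is the single permitted step after current.
// Skipping a state (Alive -> Dead) or leaving Dead is rejected.
func advance(current, next Life) (Life, error) {
	want, ok := nextLife[current]
	if !ok || want != next {
		return current, fmt.Errorf("invalid life transition %q -> %q", current, next)
	}
	return next, nil
}
```

With a check like this in place, the Alive -> Dead shortcut mentioned above would be reported as an error instead of being applied silently.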

manadart · 2024-04-26T09:42:00Z

internal/worker/dbaccessor/worker.go

@@ -221,12 +224,20 @@ func NewWorker(cfg WorkerConfig) (*dbWorker, error) {
 			// that case we do want to cause the dbaccessor to go down. This
 			// will then bring up a new dqlite app.
 			IsFatal: func(err error) bool {
+				// If a database is dead we should not kill the worker.
+				if errors.Is(err, database.ErrDBDead) {


So this says that if we kill the worker with ErrDBDead, the worker is not restarted and it is not fatal to us, the parent, right?

internal/worker/dbaccessor/worker.go

core/context/sourceable.go

internal/worker/dbaccessor/tracker.go

SimonRichardson · 2024-05-06T10:15:54Z

The intermittent test failures are fixed by #17341

SimonRichardson · 2024-05-07T14:43:39Z

/build

manadart

This is good as a progression. All QA is solid.

We do however have to be very careful regarding the proper progression of model life-cycle, and cleanup/teardown when we come to it.

A note in the risk register wouldn't hurt - we don't ever want this path recruited by accident or prematurely.

manadart · 2024-05-22T09:52:38Z

apiserver/shared.go

@@ -49,6 +50,7 @@ type sharedServerContext struct {
 	logger               corelogger.Logger
 	charmhubHTTPClient   facade.HTTPClient
 	dbGetter             changestream.WatchableDBGetter
+	dbDeleter            database.DBDeleter


I'd like a comment here regarding the fact that it ultimately only ends up in a migration context, and is under no circumstances to have expanded availability.

manadart · 2024-05-22T09:53:10Z

apiserver/facades/client/modelmanager/services.go

@@ -85,6 +88,7 @@ type ModelInfoService interface {
 	// CreateModel is responsible for creating a new read only model
 	// that is being imported.
 	CreateModel(context.Context, uuid.UUID) error
+	DeleteModel(context.Context) error


Comment this. Say exactly what it does, which DB(s) it acts on.

The model migration service needs to be able to delete a model if it fails to import correctly. Removing the model is required so that another attempt can be made. This mostly involves threading the model deleter through the apiserver to pass it to the model migration coordinator.

As we can't drop a database when a migration fails, we have to delete the contents of the database. This includes all the schema that was added to the database. The rationale for deleting the schema is that, if you migrate src:A to dst:A, but that fails, and then you perform an upgrade to dst via a patch release, there isn't a guarantee that dst:A schema is completely updated. So removing everything leaves us with a blank slate.

There have been thoughts around aliasing tables instead of a clean slate, but this is fraught with danger if you accidentally write to the wrong DB (i.e. the alias one).

In addition to this, we also can't clear the DB in one query, as we get a segfault. Instead, the code batches up the statements: we're not running a query for every statement, but each batch should be small enough to prevent a segfault.

Dropping everything from the DB needs to happen in the order the code uses, to prevent constraint violations. In addition, we ensure that the foreign key param is set to _false_ to also prevent any issues in ordering of deletions.

The worker needs to be stopped upon deletion. We want to ensure it's dead, and we don't want to allow the worker to spring back up. Having a sentinel error that propagates through the stack when we want to delete a model lets others know when to give up.
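
The sentinel-error idea can be sketched in Go as follows. This is only an illustration under assumptions: errDBDead stands in for database.ErrDBDead, and stopTracker, errConnLost and shouldRestart are hypothetical names that do not come from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errDBDead is a stand-in for database.ErrDBDead in this sketch.
var errDBDead = errors.New("database is dead")

// errConnLost is a hypothetical transient failure, used for contrast.
var errConnLost = errors.New("connection lost")

// stopTracker simulates a lower layer reporting that the database for the
// given namespace was deleted. Wrapping with %w keeps the sentinel visible
// to errors.Is further up the stack.
func stopTracker(namespace string) error {
	return fmt.Errorf("tracker for %q stopped: %w", namespace, errDBDead)
}

// shouldRestart reports whether a worker that exited with err should be
// started again. A dead database means the caller should give up.
func shouldRestart(err error) bool {
	return err != nil && !errors.Is(err, errDBDead)
}
```

Because the check uses errors.Is, intermediate layers can add context to the error without hiding the fact that the database is gone.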

Until we fully implement tearing down of a model, we can't remove a model until the very last moment. This code just makes deletion optional, so that we can state that during a model migration we do want to delete the model so we can retry. In addition, I've also fixed the dbaccessor to correctly terminate transactions based on whether the tomb is dying. See context sourceable.
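
One way to picture tying a transaction's context to a dying signal is the sketch below. It uses only the standard library (context.WithCancelCause needs Go 1.20 or later) and is not the sourceable context from this PR; runTxn, waitForCancel and errDying are hypothetical names, and a plain channel stands in for the tomb's dying signal.

```go
package main

import (
	"context"
	"errors"
)

// errDying is a hypothetical cause reported when the worker is shutting down.
var errDying = errors.New("worker is dying")

// runTxn runs fn with a context that is cancelled as soon as dying is
// closed, and returns whatever fn returns.
func runTxn(dying <-chan struct{}, fn func(context.Context) error) error {
	ctx, cancel := context.WithCancelCause(context.Background())
	// Release the context (and the goroutine below) once fn has returned.
	defer cancel(nil)

	go func() {
		select {
		case <-dying:
			cancel(errDying)
		case <-ctx.Done():
			// fn finished first; nothing left to do.
		}
	}()
	return fn(ctx)
}

// waitForCancel stands in for a long-running query that honours its
// context and reports why it was interrupted.
func waitForCancel(ctx context.Context) error {
	<-ctx.Done()
	return context.Cause(ctx)
}
```

Calling runTxn(dying, waitForCancel) blocks until dying is closed and then returns errDying, which is the behaviour a transaction should show when its worker is being torn down.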

SimonRichardson · 2024-05-22T15:42:49Z

/merge

tlm · 2024-05-29T02:04:04Z

domain/model/service/service.go

+
+	// If the db should not be deleted then we can return early.
+	if !options.DeleteDB() {
+		s.logger.Infof("skipping model deletion, model database will still be present")


Do we really need logging in here? Surely this would be better done from the caller's side, along with updating the function's comment to explain the expected default behaviour.

My reason for asking is I am really not a fan of logging in the services layer. The services layer has nice well defined behaviour that is documented and strong errors. Logging can get pushed up a layer with all of that done I think.

So this only exists because we can't delete a model outside of model migration in its current state. The key part we're missing is going from alive -> dying -> dead. We should delete these options when models are correctly progressing through the life states.
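
For readers unfamiliar with the pattern, the optional deletion discussed in this thread can be sketched with functional options. Only the DeleteDB() accessor appears in the diff above; deleteOptions, DeleteOption, WithDeleteDB and newDeleteOptions are assumed names for illustration.

```go
package main

// deleteOptions holds the settings for a hypothetical DeleteModel call.
type deleteOptions struct {
	deleteDB bool
}

// DeleteDB reports whether the model database should be removed as well.
func (o deleteOptions) DeleteDB() bool {
	return o.deleteDB
}

// DeleteOption mutates deleteOptions; callers pass zero or more of them.
type DeleteOption func(*deleteOptions)

// WithDeleteDB asks for the model database to be deleted. In this sketch
// only the migration path would pass it.
func WithDeleteDB() DeleteOption {
	return func(o *deleteOptions) {
		o.deleteDB = true
	}
}

// newDeleteOptions applies opts on top of the defaults, where the database
// is left in place.
func newDeleteOptions(opts ...DeleteOption) deleteOptions {
	var o deleteOptions
	for _, opt := range opts {
		opt(&o)
	}
	return o
}
```

The default keeps the database, so a caller has to opt in explicitly, and the option can be deleted in one place once models progress through the life states properly.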

SimonRichardson added the 4.0 label Apr 15, 2024

SimonRichardson self-assigned this Apr 15, 2024

SimonRichardson requested a review from manadart as a code owner April 15, 2024 20:15

SimonRichardson force-pushed the delete-model branch 2 times, most recently from 38d6323 to 4eb1e2c Compare April 17, 2024 08:24

hpidcock added the has merge conflicts label Apr 18, 2024

SimonRichardson force-pushed the delete-model branch from 9591666 to 77f18d4 Compare April 19, 2024 09:09

SimonRichardson removed the has merge conflicts label Apr 19, 2024

hpidcock added the has merge conflicts label Apr 24, 2024

SimonRichardson force-pushed the delete-model branch from 77f18d4 to 3aaae75 Compare April 24, 2024 10:20

SimonRichardson removed the has merge conflicts label Apr 24, 2024

SimonRichardson force-pushed the delete-model branch from 3aaae75 to 1b729d5 Compare April 26, 2024 08:09

manadart reviewed Apr 26, 2024

SimonRichardson force-pushed the delete-model branch 4 times, most recently from a903456 to 4ec3cba Compare April 26, 2024 15:00

hpidcock added the has merge conflicts label Apr 30, 2024

SimonRichardson force-pushed the delete-model branch from 4ec3cba to 97fe63c Compare May 6, 2024 09:01

SimonRichardson removed the has merge conflicts label May 6, 2024

SimonRichardson force-pushed the delete-model branch from 97fe63c to f16f741 Compare May 6, 2024 14:00

hpidcock added the has merge conflicts label May 9, 2024

SimonRichardson force-pushed the delete-model branch from f16f741 to ec166c0 Compare May 9, 2024 15:39

SimonRichardson removed the has merge conflicts label May 9, 2024

SimonRichardson requested a review from manadart May 9, 2024 15:39

hpidcock added the has merge conflicts label May 13, 2024

SimonRichardson force-pushed the delete-model branch 2 times, most recently from 2597e30 to d31391f Compare May 15, 2024 13:01

SimonRichardson removed the has merge conflicts label May 15, 2024

manadart approved these changes May 22, 2024

SimonRichardson added 8 commits May 22, 2024 15:14

Ensure we remove the read-only model correctly

6dfe649

As the read-only model has some immutable triggers around it to prevent deletion and updates, we need to destroy it before deleting the DB.

Jitter the dbaccessor tracker

c22b6a8

To prevent all trackers from polling the db at the same time and causing problems, introduce a jitter to give more random access patterns to the requests.
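
A jittered interval of this kind might look like the following Go sketch. The jitter function and its factor parameter are assumptions for illustration; the commit's actual implementation is not shown on this page.

```go
package main

import (
	"math/rand"
	"time"
)

// jitter returns a duration picked uniformly from roughly
// [base*(1-factor), base*(1+factor)], so that trackers started together
// drift apart instead of polling in lockstep.
func jitter(base time.Duration, factor float64) time.Duration {
	if base <= 0 || factor <= 0 {
		return base
	}
	if factor > 1 {
		factor = 1
	}
	delta := float64(base) * factor
	// rand.Float64 is in [0, 1), so offset lies in [-delta, delta).
	offset := (rand.Float64()*2 - 1) * delta
	return base + time.Duration(offset)
}
```

A tracker would then wait jitter(interval, 0.1) between polls instead of a fixed interval; the factor of 0.1 is just an example value.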

Prevent the removal of the model info

b670bc5

We need to remove the model info (read-only model), but we can't do it until everything has been removed from the database. Attempting to remove it early causes everything to lock up.

Improve the comments explaining the rationale

0f08fc3

The comments weren't quite correct, improve them to better explain i.e. why we're tying the tomb errors to the context.

SimonRichardson force-pushed the delete-model branch from d31391f to cb4ccb4 Compare May 22, 2024 14:55

Add comments for improved readability

cc890c0

SimonRichardson force-pushed the delete-model branch from cb4ccb4 to cc890c0 Compare May 22, 2024 15:21

jujubot merged commit 154487a into juju:main May 22, 2024
15 of 17 checks passed

SimonRichardson deleted the delete-model branch May 22, 2024 15:56

tlm reviewed May 29, 2024
