Add status change event #3903

evidolob · 2023-11-01T13:18:29Z

Solution/Idea

PR adds event API and new SSE(server sent event) channel status_change.

Proposed changes

PR adds event API implementation in pkg/crc/api/events/events.go
Modify sync client implementation to trigger event when status is changed.
Fixes in existing log events, rename constants related to SSE

Testing

launch demon
connect to status_change channel with curl -GET -i --unix-socket <path to socket>/crc-http.sock "http://localhost/events?stream=status_change"
start crc with curl -GET -i --unix-socket ~/.crc/crc-http.sock "http://localhost/api/start"
Ensure that new events appeared, like:

id:                                                                                                                            │
data: {"status":{"CrcStatus":"Starting","OpenshiftStatus":"Starting","OpenshiftVersion":"","PodmanVersion":"","DiskUse":0,"DiskSize":0,"RAMUse":0,"RAMSize":0,"Preset":"microshift"}}
event: status_change

openshift-ci · 2023-11-01T13:18:40Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from evidolob. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

evidolob · 2023-11-20T07:06:03Z

/retest

anjannath · 2023-12-06T07:01:13Z

connect to status_change channel with curl -GET -i --unix-socket /crc-http.sock "http://localhost/events?stream=status_change"

start crc with crc start

the events are shown only when starting via the api curl --unix-socket ~/.crc/crc-http.sock "http://localhost/api/start" running the crc start command the events are not shown, is this expected?

pkg/events/emitter.go

pkg/crc/api/events/log_stream.go

evidolob · 2023-12-06T07:45:03Z

the events are shown only when starting via the api curl --unix-socket ~/.crc/crc-http.sock "http://localhost/api/start" running the crc start command the events are not shown, is this expected?

Yes, that is expected, as start, stop is performed on CLI side, not on daemon, at least for now.

anjannath · 2023-12-06T13:06:15Z

Tested locally and functionality is working as expected, LGTM. it'd be nice to have more detailed commit messages explaining the changes in each commit

cfergeau

Need to spend more time on the last commit, but here are some initial comments.

cfergeau · 2023-12-06T13:46:54Z

cmd/crc/cmd/status.go

@@ -81,7 +81,7 @@ func runWatchStatus(writer io.Writer, client *daemonclient.Client, cacheDir stri

 	isPullInit := false

-	err := client.SSEClient.Status(func(loadResult *types.ClusterLoadResult) {
+	err := client.SSEClient.ClusterLoad(func(loadResult *types.ClusterLoadResult) {


Not obvious why the renaming is needed? The commit log should explain this so that we all have the same understanding of the need for it.

It is not actually a rename, but rather a 'prepare'. The way this will work is to enable the status to be sent when requested (starts listening).

Yevhen can explain, as I asked him the same on one of our meetings.

Yes, I rename that to avoid confusion with name Status vs StatusChange

pkg/events/emitter.go

pkg/crc/machine/status.go

pkg/crc/machine/state/state.go

pkg/crc/api/events/log_stream.go

pkg/crc/api/events/event_server.go

pkg/crc/api/events/status_change_stream.go

cfergeau

Another round of comments, still need to spend more time on the pkg/crc/machine/sync.go changes.

pkg/crc/api/events/cluster_load_stream.go

cfergeau · 2023-12-07T16:44:41Z

pkg/crc/api/events/log_stream.go

@@ -23,7 +25,7 @@ func newSSEStreamHook(server *sse.Server) *streamHook {
 		&logrus.JSONFormatter{
 			TimestampFormat:   "",
 			DisableTimestamp:  false,
-			DisableHTMLEscape: false,
+			DisableHTMLEscape: true,


Can it cause parsing issues at the other hand if HTML is not escaped?

Yes, especially new line characters

I don't fully understand what this solves, we remove the new line characters with TrimRight, so DisableHTMLEscape: true must be solving a different problem?

True, delete this change

pkg/crc/api/events/status_change_stream.go

cfergeau · 2023-12-08T15:42:33Z

pkg/events/events.go

+}
+
+var (
+	StatusChanged = NewEvent[StatusChangedEvent]()


Not sure about the global variable here, but we can refine this later.

pkg/events/emitter.go

pkg/crc/api/events/status_change_stream.go

cfergeau · 2023-12-11T16:49:31Z

pkg/crc/machine/sync.go

+		events.StatusChanged.Fire(events.StatusChangedEvent{State: startResult.Status})
+	} else {
+		events.StatusChanged.Fire(events.StatusChangedEvent{State: state.Error, Error: err})
+	}


(commenting here, but that applies to most of the file).
Duplicating the state logic in this file ("after calling Start(), the VM is in the Start state, unless there was an error, in which case it's in the Error state") will likely cause additional maintainance burden, as we have to make sure to keep in sync the state logic in this file, and the state logic in other parts of crc - the status events may be reporting a different state from Status().

Can we emit a StatusChangedEvent to tell other code the status changed, but not specify the current status in the event payload? This way the receiver of the event can check Status() if they need the status, and we don't have 2 source for the status to keep in sync?

Regarding errors, it seems if Start() (or other commands) return an error, you consider that the cluster is in an error state? This makes sense for some commands, but I don't think this matches what is reported by Status() at the moment? Would be nice to fix the reported Status() when there are start failures for examples, but doing it here would be papering over the real issue in my opinion.
And in some other cases, errors are less bad, for example in the power off case, maybe the command failed early, and the cluster is still nicely running?

Duplicating the state logic in this file ("after calling Start(), the VM is in the Start state, unless there was an error, in which case it's in the Error state") will likely cause additional maintainance burden, as we have to make sure to keep in sync the state logic in this file, and the state logic in other parts of crc - the status events may be reporting a different state from Status().

Not sure to understand how to resolve you concern, the problem is that we don't have a way to track state change(at least I wasn’t able to find it, it would be nice if you point me that place if if it exist)

Can we emit a StatusChangedEvent to tell other code the status changed, but not specify the current status in the event payload? This way the receiver of the event can check Status() if they need the status, and we don't have 2 source for the status to keep in sync?

I first do that, but I face a race condition with updating state, for example:
User call start, we fire state change event, before actuality call Start, and if listener is fast enough, it could get old state (basically we could lost starting state in transition from stoped to running states)

Possible solution to that, may be moving state change event firing from Synchronized client in to client implementation of machine.Client interface. And fire events only after actually state is changed.

Regarding errors, it seems if Start() (or other commands) return an error, you consider that the cluster is in an error state? This makes sense for some commands, but I don't think this matches what is reported by Status() at the moment? Would be nice to fix the reported Status() when there are start failures for examples, but doing it here would be papering over the real issue in my opinion.
And in some other cases, errors are less bad, for example in the power off case, maybe the command failed early, and the cluster is still nicely running?

It is matched, as we send with SSE status reported by Status() in addition we also send error message, which come from status change event.

My main concern is #3903 (comment) , and I'm under the impression that if you have a terminal showing the stream of events, and another running crc status every second, then the state between the 2 will not always match, in particular sometimes the event stream will contain State: state.Error while crc status will not report an error?
Guess I need to run both side by side and compare myself to understand better how this works.

(I understand the current code base is missing a centralized "setState" facility, and that this PR has to find a way of doing this, just trying to understand the limitations in the PR)

My main concern is #3903 (comment) , and I'm under the impression that if you have a terminal showing the stream of events, and another running crc status every second, then the state between the 2 will not always match, in particular sometimes the event stream will contain State: state.Error while crc status will not report an error?
Guess I need to run both side by side and compare myself to understand better how this works.

No, output should be exactly the same, to get that I add
https://github.com/crc-org/crc/pull/3903/files#diff-e5710ade03da511d4845f380cd63319bf7329e960c1afbbba7fbed4060fcf8edR37
and
https://github.com/crc-org/crc/pull/3903/files#diff-e5710ade03da511d4845f380cd63319bf7329e960c1afbbba7fbed4060fcf8edR44-R50

Otherwise we need to move event triggering deeper in crc code, and fire event after execution of pice of code which change crc status output.

And it is a question for all, should we integrate events in more depths of CRC core(which leads to spreading events over our codebase) or keep as is( in relative hi level Client interface implementation )?

@gbraad @cfergeau @praveenkumar WDYT?

And it is a question for all, should we integrate events in more depths of CRC core(which leads to spreading events over our codebase) or keep as is( in relative hi level Client interface implementation )?

I'd say we can go with this for now, more details about the design choices, the implementation limitations, ...

One thing we can explore over time some centralized tracking of the cluster state, a StateTracker type, we'd add some calls to StateTracker.SetState(...) in the appropriate places, the StateTracker would fire events on status changes, it could have a CanChangeState(state) helper if needed to know if calling Stop will result in an error or not, ...
As long as all crc commands do not go through the daemon, it could do some polling to reconcile the internal state with changes through external crc calls, ...

pkg/events/emitter.go

praveenkumar · 2024-01-24T10:23:05Z

/need-rebase

cfergeau · 2024-01-29T16:42:40Z

I've rebased it at https://github.com/cfergeau/crc/tree/stus-change-event and added a fixup commit to fix compliation of the first commit in the series.

cfergeau · 2024-01-30T16:51:58Z

I'm not sure step 3. is correct in the testing instructions:

start crc with crc start

Don't we need to start it with the REST API for now as otherwise start won't get through the daemon?

 curl -GET -i --unix-socket ~/.crc/crc-http.sock  "http://localhost/api/start"

I was wondering if curl -GET -i --unix-socket ~/.crc/crc-http.sock "http://localhost/api/start" should also send a status event to help with syncing state if the state change behind the daemon's back?

evidolob · 2024-01-31T07:10:42Z

Don't we need to start it with the REST API for now as otherwise start won't get through the daemon?

Yes, you are correct, need to start with rest API.

 curl -GET -i --unix-socket ~/.crc/crc-http.sock  "http://localhost/api/start"
I was wondering if curl -GET -i --unix-socket ~/.crc/crc-http.sock "http://localhost/api/start" should also send a status event to help with syncing state if the state change behind the daemon's back?

Daemon send response only after start is finished(or error incase of any error), but calling api/start will trigger events on events?stream=status_change event channel.

cfergeau · 2024-01-31T16:39:08Z

Few more things I noticed during testing, if VM exists, and then I call delete, I get

id: 
data: {"status":{"CrcStatus":"Error","OpenshiftStatus":"","OpenshiftVersion":"","PodmanVersion":"","DiskUse":0,"DiskSize":0,"RAMUse":0,"RAMSize":0,"Preset":""},"error":"Cannot load 'crc' virtual machine: no such libmachine vm: crc"}
event: status_change

which does not seem correct as the VM existed before being deleted.
After that, if I call again delete I don't get any notifications. If I call stop, I get first stopping and then error

Probably something that can improve with the NoVM state.

This allows to avoid confusion with upcoming status change event stream. Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>

It allows us to have unified way to propagate state changes across an app Co-Authored-By: Christophe Fergeau <cfergeau@redhat.com> Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>

This would simplify podman-desktop ext status handling Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>

evidolob · 2024-02-01T10:26:45Z

Few more things I noticed during testing, if VM exists, and then I call delete, I get
id: 
data: {"status":{"CrcStatus":"Error","OpenshiftStatus":"","OpenshiftVersion":"","PodmanVersion":"","DiskUse":0,"DiskSize":0,"RAMUse":0,"RAMSize":0,"Preset":""},"error":"Cannot load 'crc' virtual machine: no such libmachine vm: crc"}
event: status_change
which does not seem correct as the VM existed before being deleted. After that, if I call again delete I don't get any notifications. If I call stop, I get first stopping and then error

That was actually bug, status was not report NoVm state, only error. (It was due to not implementing Is method for MissingHostError, so comparing error was always wrong)

Now it should work properly.

pkg/crc/api/events/status_change_stream.go

cfergeau · 2024-02-01T16:49:46Z

pkg/crc/api/events/status_change_stream.go

+	} else {
+		// event could be fired, before actual code, which change state is called
+		// so status could contain 'old' state, replace it with state received in event
+		status.CrcStatus = changedEvent.State // override with actual reported state


This is the part I find questionable in this PR. Instead of having a single source of truth for the status, we now have 2, some of the status comes from st.machineClient.Status(), but some of the status is hardcoded in machine/sync.go. If the 2 get out of sync, or if we want to improve status reporting in general, we'll have multiple places to change.

Spending more time on it, I understand more the problems this is trying to solve.
One example is when the VM is stopped. crc status will succeed and report that crc is stopped, however crc stop will return an error and say the machine is already stopped.

Without status.CrcStatus = changedEvent.State, when invoking api/stop, we would not be reporting an error in the stream, but instead we'd only say the machine is stopped. This would not match what was reported to the api/stop call. There are most likely other state transitions where this is needed.

The comment however seems to indicate sometimes Status could report a stale status, I haven't seen that in the little testing I did, but I'd be interested in understanding more when this happens? If we don't record this knowledge now when it's more or less fresh in our minds, it will be very difficult to find this back in 6 months from now if we need it :-/

ah, there's more details in an older discussion:

User call start, we fire state change event, before actuality call Start, and if listener is fast enough, it could get old state (basically we could lost starting state in transition from stoped to running states)

In my opinion, this is an indication that these events are not fired in the right place, and this is papered over with even more complexity :-/

t instead we'd only say the machine is stopped. This would not match what was reported to the api/stop

The API matches the usecase. Stop should in my opinion not return an error when already stopped; as you succeed in doing what is requested. What more information do we actually give with saying this failed as it was already stopped?!

If you believe this should be fixed or recorded, make these states clear for the convergence.

The part you quoted is a description of the behaviour in this PR. These are some notes I took during review/testing when I was trying to understand the design choices which led to the code being reviewed.

I try to replace this code, with firing event at more "proper" place, but face problem that starting state is "artificial", ie exist only in sync.go file, and VM doesn't have/reports starting state. Also stopping state is the same.
So I have no other ideas, than accept current PR and think on better state handling, like @cfergeau propose there #3903 (comment)

Or change state handling first, and then rebase this PR on top of it.

@gbraad @cfergeau WDYT?

I think we can go forward with the current code, but we really need more details in the commit log, so that it records the design choices you made when writing the code, so that we know the current limitations, why this overwrites the state, ...

gbraad · 2024-04-09T08:19:20Z

What is holding back this PR?

cfergeau · 2024-04-09T09:34:38Z

What is holding back this PR?

Some discussion around #3903 (comment) would have been useful

Apart from this, as said multiple times, the comit log must be a lot more detailed about how this works:

If we don't record this knowledge now when it's more or less fresh in our minds, it will be very difficult to find this back in 6 months from now if we need it :-/

I'd say we can go with this for now, more details about the design choices, the implementation limitations, ...
(this sentence is malformed though :-/ )

And #3903 (comment) is also still pending

Will propagated to client via SSE in ith own channel, to inform clients about CRC status change Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>

Having '\n' character in between message field is braking SSE specification and client could not parse correctly such events Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>

openshift-ci · 2024-04-09T15:14:41Z

@evidolob: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/security	`e7445d4`	link	false	`/test security`
ci/prow/integration-crc	`e7445d4`	link	true	`/test integration-crc`
ci/prow/e2e-crc	`e7445d4`	link	true	`/test e2e-crc`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

evidolob requested review from gbraad, praveenkumar, cfergeau and anjannath November 1, 2023 13:18

evidolob mentioned this pull request Nov 20, 2023

Use SSE to track CRC status change and logs crc-org/crc-extension#192

Open

anjannath reviewed Dec 6, 2023

View reviewed changes

pkg/events/emitter.go Show resolved Hide resolved

anjannath reviewed Dec 6, 2023

View reviewed changes

pkg/crc/api/events/log_stream.go Outdated Show resolved Hide resolved

evidolob force-pushed the stus-change-event branch from a067a0f to c863072 Compare December 6, 2023 08:41

cfergeau reviewed Dec 6, 2023

View reviewed changes

evidolob force-pushed the stus-change-event branch 2 times, most recently from a6a8fb4 to f63d585 Compare December 7, 2023 10:49

evidolob requested review from cfergeau and anjannath December 7, 2023 10:50

cfergeau reviewed Dec 8, 2023

View reviewed changes

evidolob force-pushed the stus-change-event branch 3 times, most recently from 45a11dd to 7184f64 Compare December 11, 2023 08:55

evidolob requested a review from cfergeau December 11, 2023 08:56

cfergeau reviewed Dec 11, 2023

View reviewed changes

cfergeau reviewed Dec 21, 2023

View reviewed changes

pkg/events/emitter.go Show resolved Hide resolved

evidolob force-pushed the stus-change-event branch from 7184f64 to a36a8f0 Compare January 8, 2024 15:10

openshift-merge-robot added the needs-rebase label Jan 24, 2024

evidolob force-pushed the stus-change-event branch from a36a8f0 to 1fde404 Compare February 1, 2024 07:33

evidolob and others added 2 commits February 1, 2024 09:44

Rename 'status' to 'cluster_load' event stream.

a04f52c

This allows to avoid confusion with upcoming status change event stream. Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>

Add events implementation

0b8ec64

It allows us to have unified way to propagate state changes across an app Co-Authored-By: Christophe Fergeau <cfergeau@redhat.com> Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>

evidolob force-pushed the stus-change-event branch from 1fde404 to 54df54b Compare February 1, 2024 07:45

openshift-merge-robot removed the needs-rebase label Feb 1, 2024

Add "NoVM" state

6d61cbf

This would simplify podman-desktop ext status handling Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>

evidolob force-pushed the stus-change-event branch from 54df54b to 8807ace Compare February 1, 2024 10:10

evidolob requested a review from cfergeau February 1, 2024 10:26

cfergeau reviewed Feb 2, 2024

View reviewed changes

evidolob added 2 commits April 9, 2024 16:16

Add status change event

e9f4b73

Will propagated to client via SSE in ith own channel, to inform clients about CRC status change Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>

Fix logging channel messages.

e7445d4

Having '\n' character in between message field is braking SSE specification and client could not parse correctly such events Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>

evidolob force-pushed the stus-change-event branch from 8807ace to e7445d4 Compare April 9, 2024 13:16

Add status change event #3903

Are you sure you want to change the base?

Add status change event #3903

Conversation

evidolob commented Nov 1, 2023 • edited

Solution/Idea

Proposed changes

Testing

openshift-ci bot commented Nov 1, 2023

evidolob commented Nov 20, 2023

anjannath commented Dec 6, 2023

evidolob commented Dec 6, 2023

anjannath commented Dec 6, 2023

cfergeau left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gbraad Dec 7, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cfergeau left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

praveenkumar commented Jan 24, 2024

cfergeau commented Jan 29, 2024

cfergeau commented Jan 30, 2024

evidolob commented Jan 31, 2024

cfergeau commented Jan 31, 2024 • edited

evidolob commented Feb 1, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gbraad commented Apr 9, 2024

cfergeau commented Apr 9, 2024

openshift-ci bot commented Apr 9, 2024

evidolob commented Nov 1, 2023 •

edited

gbraad Dec 7, 2023 •

edited

cfergeau commented Jan 31, 2024 •

edited