Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add status change event #3903

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

evidolob
Copy link
Contributor

@evidolob evidolob commented Nov 1, 2023

Solution/Idea

PR adds event API and new SSE(server sent event) channel status_change.

Proposed changes

PR adds event API implementation in pkg/crc/api/events/events.go
Modify sync client implementation to trigger event when status is changed.
Fixes in existing log events, rename constants related to SSE

Testing

  1. launch demon
  2. connect to status_change channel with curl -GET -i --unix-socket <path to socket>/crc-http.sock "http://localhost/events?stream=status_change"
  3. start crc with curl -GET -i --unix-socket ~/.crc/crc-http.sock "http://localhost/api/start"
  4. Ensure that new events appeared, like:
id:                                                                                                                            │
data: {"status":{"CrcStatus":"Starting","OpenshiftStatus":"Starting","OpenshiftVersion":"","PodmanVersion":"","DiskUse":0,"DiskSize":0,"RAMUse":0,"RAMSize":0,"Preset":"microshift"}}
event: status_change

Copy link

openshift-ci bot commented Nov 1, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from evidolob. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@evidolob
Copy link
Contributor Author

/retest

@anjannath
Copy link
Member

  1. connect to status_change channel with curl -GET -i --unix-socket /crc-http.sock "http://localhost/events?stream=status_change"
  2. start crc with crc start

the events are shown only when starting via the api curl --unix-socket ~/.crc/crc-http.sock "http://localhost/api/start" running the crc start command the events are not shown, is this expected?

@evidolob
Copy link
Contributor Author

evidolob commented Dec 6, 2023

the events are shown only when starting via the api curl --unix-socket ~/.crc/crc-http.sock "http://localhost/api/start" running the crc start command the events are not shown, is this expected?

Yes, that is expected, as start, stop is performed on CLI side, not on daemon, at least for now.

@anjannath
Copy link
Member

Tested locally and functionality is working as expected, LGTM. it'd be nice to have more detailed commit messages explaining the changes in each commit

Copy link
Contributor

@cfergeau cfergeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to spend more time on the last commit, but here are some initial comments.

@@ -81,7 +81,7 @@ func runWatchStatus(writer io.Writer, client *daemonclient.Client, cacheDir stri

isPullInit := false

err := client.SSEClient.Status(func(loadResult *types.ClusterLoadResult) {
err := client.SSEClient.ClusterLoad(func(loadResult *types.ClusterLoadResult) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not obvious why the renaming is needed? The commit log should explain this so that we all have the same understanding of the need for it.

Copy link
Contributor

@gbraad gbraad Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not actually a rename, but rather a 'prepare'. The way this will work is to enable the status to be sent when requested (starts listening).

Yevhen can explain, as I asked him the same on one of our meetings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I rename that to avoid confusion with name Status vs StatusChange

pkg/events/emitter.go Outdated Show resolved Hide resolved
pkg/events/emitter.go Show resolved Hide resolved
pkg/events/emitter.go Outdated Show resolved Hide resolved
pkg/events/emitter.go Show resolved Hide resolved
pkg/crc/machine/status.go Outdated Show resolved Hide resolved
pkg/crc/machine/state/state.go Outdated Show resolved Hide resolved
pkg/crc/api/events/log_stream.go Outdated Show resolved Hide resolved
pkg/crc/api/events/event_server.go Show resolved Hide resolved
pkg/crc/api/events/status_change_stream.go Show resolved Hide resolved
@evidolob evidolob force-pushed the stus-change-event branch 2 times, most recently from a6a8fb4 to f63d585 Compare December 7, 2023 10:49
Copy link
Contributor

@cfergeau cfergeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another round of comments, still need to spend more time on the pkg/crc/machine/sync.go changes.

pkg/crc/api/events/cluster_load_stream.go Outdated Show resolved Hide resolved
@@ -23,7 +25,7 @@ func newSSEStreamHook(server *sse.Server) *streamHook {
&logrus.JSONFormatter{
TimestampFormat: "",
DisableTimestamp: false,
DisableHTMLEscape: false,
DisableHTMLEscape: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it cause parsing issues at the other hand if HTML is not escaped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, especially new line characters

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand what this solves, we remove the new line characters with TrimRight, so DisableHTMLEscape: true must be solving a different problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, delete this change

pkg/crc/api/events/status_change_stream.go Outdated Show resolved Hide resolved
pkg/crc/api/events/status_change_stream.go Outdated Show resolved Hide resolved
}

var (
StatusChanged = NewEvent[StatusChangedEvent]()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the global variable here, but we can refine this later.

pkg/events/emitter.go Outdated Show resolved Hide resolved
pkg/events/emitter.go Show resolved Hide resolved
pkg/crc/api/events/status_change_stream.go Outdated Show resolved Hide resolved
@evidolob evidolob force-pushed the stus-change-event branch 3 times, most recently from 45a11dd to 7184f64 Compare December 11, 2023 08:55
events.StatusChanged.Fire(events.StatusChangedEvent{State: startResult.Status})
} else {
events.StatusChanged.Fire(events.StatusChangedEvent{State: state.Error, Error: err})
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(commenting here, but that applies to most of the file).
Duplicating the state logic in this file ("after calling Start(), the VM is in the Start state, unless there was an error, in which case it's in the Error state") will likely cause additional maintainance burden, as we have to make sure to keep in sync the state logic in this file, and the state logic in other parts of crc - the status events may be reporting a different state from Status().

Can we emit a StatusChangedEvent to tell other code the status changed, but not specify the current status in the event payload? This way the receiver of the event can check Status() if they need the status, and we don't have 2 source for the status to keep in sync?

Regarding errors, it seems if Start() (or other commands) return an error, you consider that the cluster is in an error state? This makes sense for some commands, but I don't think this matches what is reported by Status() at the moment? Would be nice to fix the reported Status() when there are start failures for examples, but doing it here would be papering over the real issue in my opinion.
And in some other cases, errors are less bad, for example in the power off case, maybe the command failed early, and the cluster is still nicely running?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicating the state logic in this file ("after calling Start(), the VM is in the Start state, unless there was an error, in which case it's in the Error state") will likely cause additional maintainance burden, as we have to make sure to keep in sync the state logic in this file, and the state logic in other parts of crc - the status events may be reporting a different state from Status().

Not sure to understand how to resolve you concern, the problem is that we don't have a way to track state change(at least I wasn’t able to find it, it would be nice if you point me that place if if it exist)

Can we emit a StatusChangedEvent to tell other code the status changed, but not specify the current status in the event payload? This way the receiver of the event can check Status() if they need the status, and we don't have 2 source for the status to keep in sync?

I first do that, but I face a race condition with updating state, for example:
User call start, we fire state change event, before actuality call Start, and if listener is fast enough, it could get old state (basically we could lost starting state in transition from stoped to running states)

Possible solution to that, may be moving state change event firing from Synchronized client in to client implementation of machine.Client interface. And fire events only after actually state is changed.

Regarding errors, it seems if Start() (or other commands) return an error, you consider that the cluster is in an error state? This makes sense for some commands, but I don't think this matches what is reported by Status() at the moment? Would be nice to fix the reported Status() when there are start failures for examples, but doing it here would be papering over the real issue in my opinion.
And in some other cases, errors are less bad, for example in the power off case, maybe the command failed early, and the cluster is still nicely running?

It is matched, as we send with SSE status reported by Status() in addition we also send error message, which come from status change event.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern is #3903 (comment) , and I'm under the impression that if you have a terminal showing the stream of events, and another running crc status every second, then the state between the 2 will not always match, in particular sometimes the event stream will contain State: state.Error while crc status will not report an error?
Guess I need to run both side by side and compare myself to understand better how this works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I understand the current code base is missing a centralized "setState" facility, and that this PR has to find a way of doing this, just trying to understand the limitations in the PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern is #3903 (comment) , and I'm under the impression that if you have a terminal showing the stream of events, and another running crc status every second, then the state between the 2 will not always match, in particular sometimes the event stream will contain State: state.Error while crc status will not report an error?
Guess I need to run both side by side and compare myself to understand better how this works.

No, output should be exactly the same, to get that I add
https://github.com/crc-org/crc/pull/3903/files#diff-e5710ade03da511d4845f380cd63319bf7329e960c1afbbba7fbed4060fcf8edR37
and
https://github.com/crc-org/crc/pull/3903/files#diff-e5710ade03da511d4845f380cd63319bf7329e960c1afbbba7fbed4060fcf8edR44-R50

Otherwise we need to move event triggering deeper in crc code, and fire event after execution of pice of code which change crc status output.

And it is a question for all, should we integrate events in more depths of CRC core(which leads to spreading events over our codebase) or keep as is( in relative hi level Client interface implementation )?

@gbraad @cfergeau @praveenkumar WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And it is a question for all, should we integrate events in more depths of CRC core(which leads to spreading events over our codebase) or keep as is( in relative hi level Client interface implementation )?

I'd say we can go with this for now, more details about the design choices, the implementation limitations, ...

One thing we can explore over time some centralized tracking of the cluster state, a StateTracker type, we'd add some calls to StateTracker.SetState(...) in the appropriate places, the StateTracker would fire events on status changes, it could have a CanChangeState(state) helper if needed to know if calling Stop will result in an error or not, ...
As long as all crc commands do not go through the daemon, it could do some polling to reconcile the internal state with changes through external crc calls, ...

@praveenkumar
Copy link
Member

/need-rebase

@cfergeau
Copy link
Contributor

I've rebased it at https://github.com/cfergeau/crc/tree/stus-change-event and added a fixup commit to fix compliation of the first commit in the series.

@cfergeau
Copy link
Contributor

I'm not sure step 3. is correct in the testing instructions:

  1. start crc with crc start

Don't we need to start it with the REST API for now as otherwise start won't get through the daemon?

 curl -GET -i --unix-socket ~/.crc/crc-http.sock  "http://localhost/api/start"

I was wondering if curl -GET -i --unix-socket ~/.crc/crc-http.sock "http://localhost/api/start" should also send a status event to help with syncing state if the state change behind the daemon's back?

@evidolob
Copy link
Contributor Author

Don't we need to start it with the REST API for now as otherwise start won't get through the daemon?

Yes, you are correct, need to start with rest API.

 curl -GET -i --unix-socket ~/.crc/crc-http.sock  "http://localhost/api/start"

I was wondering if curl -GET -i --unix-socket ~/.crc/crc-http.sock "http://localhost/api/start" should also send a status event to help with syncing state if the state change behind the daemon's back?

Daemon send response only after start is finished(or error incase of any error), but calling api/start will trigger events on events?stream=status_change event channel.

@cfergeau
Copy link
Contributor

cfergeau commented Jan 31, 2024

Few more things I noticed during testing, if VM exists, and then I call delete, I get

id: 
data: {"status":{"CrcStatus":"Error","OpenshiftStatus":"","OpenshiftVersion":"","PodmanVersion":"","DiskUse":0,"DiskSize":0,"RAMUse":0,"RAMSize":0,"Preset":""},"error":"Cannot load 'crc' virtual machine: no such libmachine vm: crc"}
event: status_change

which does not seem correct as the VM existed before being deleted.
After that, if I call again delete I don't get any notifications. If I call stop, I get first stopping and then error

Probably something that can improve with the NoVM state.

evidolob and others added 2 commits February 1, 2024 09:44
This allows to avoid confusion with upcoming status change event stream.

Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>
It allows us to have unified way to propagate state changes across an app

Co-Authored-By: Christophe Fergeau <cfergeau@redhat.com>
Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>
This would simplify podman-desktop ext status handling

Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>
@evidolob
Copy link
Contributor Author

evidolob commented Feb 1, 2024

Few more things I noticed during testing, if VM exists, and then I call delete, I get

id: 
data: {"status":{"CrcStatus":"Error","OpenshiftStatus":"","OpenshiftVersion":"","PodmanVersion":"","DiskUse":0,"DiskSize":0,"RAMUse":0,"RAMSize":0,"Preset":""},"error":"Cannot load 'crc' virtual machine: no such libmachine vm: crc"}
event: status_change

which does not seem correct as the VM existed before being deleted. After that, if I call again delete I don't get any notifications. If I call stop, I get first stopping and then error

That was actually bug, status was not report NoVm state, only error. (It was due to not implementing Is method for MissingHostError, so comparing error was always wrong)

Now it should work properly.

pkg/crc/api/events/status_change_stream.go Outdated Show resolved Hide resolved
} else {
// event could be fired, before actual code, which change state is called
// so status could contain 'old' state, replace it with state received in event
status.CrcStatus = changedEvent.State // override with actual reported state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the part I find questionable in this PR. Instead of having a single source of truth for the status, we now have 2, some of the status comes from st.machineClient.Status(), but some of the status is hardcoded in machine/sync.go. If the 2 get out of sync, or if we want to improve status reporting in general, we'll have multiple places to change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spending more time on it, I understand more the problems this is trying to solve.
One example is when the VM is stopped. crc status will succeed and report that crc is stopped, however crc stop will return an error and say the machine is already stopped.

Without status.CrcStatus = changedEvent.State, when invoking api/stop, we would not be reporting an error in the stream, but instead we'd only say the machine is stopped. This would not match what was reported to the api/stop call. There are most likely other state transitions where this is needed.

The comment however seems to indicate sometimes Status could report a stale status, I haven't seen that in the little testing I did, but I'd be interested in understanding more when this happens? If we don't record this knowledge now when it's more or less fresh in our minds, it will be very difficult to find this back in 6 months from now if we need it :-/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, there's more details in an older discussion:

User call start, we fire state change event, before actuality call Start, and if listener is fast enough, it could get old state (basically we could lost starting state in transition from stoped to running states)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, this is an indication that these events are not fired in the right place, and this is papered over with even more complexity :-/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t instead we'd only say the machine is stopped. This would not match what was reported to the api/stop

The API matches the usecase. Stop should in my opinion not return an error when already stopped; as you succeed in doing what is requested. What more information do we actually give with saying this failed as it was already stopped?!

If you believe this should be fixed or recorded, make these states clear for the convergence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The part you quoted is a description of the behaviour in this PR. These are some notes I took during review/testing when I was trying to understand the design choices which led to the code being reviewed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try to replace this code, with firing event at more "proper" place, but face problem that starting state is "artificial", ie exist only in sync.go file, and VM doesn't have/reports starting state. Also stopping state is the same.
So I have no other ideas, than accept current PR and think on better state handling, like @cfergeau propose there #3903 (comment)

Or change state handling first, and then rebase this PR on top of it.

@gbraad @cfergeau WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can go forward with the current code, but we really need more details in the commit log, so that it records the design choices you made when writing the code, so that we know the current limitations, why this overwrites the state, ...

@gbraad
Copy link
Contributor

gbraad commented Apr 9, 2024

What is holding back this PR?

@cfergeau
Copy link
Contributor

cfergeau commented Apr 9, 2024

What is holding back this PR?

Some discussion around #3903 (comment) would have been useful

Apart from this, as said multiple times, the comit log must be a lot more detailed about how this works:

If we don't record this knowledge now when it's more or less fresh in our minds, it will be very difficult to find this back in 6 months from now if we need it :-/

I'd say we can go with this for now, more details about the design choices, the implementation limitations, ...
(this sentence is malformed though :-/ )

And #3903 (comment) is also still pending

Will propagated to client via SSE in ith own channel, to inform clients about CRC status change

Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>
Having '\n' character in between message field is braking SSE specification and client could not parse correctly such events

Signed-off-by: Yevhen Vydolob <yvydolob@redhat.com>
Copy link

openshift-ci bot commented Apr 9, 2024

@evidolob: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/security e7445d4 link false /test security
ci/prow/integration-crc e7445d4 link true /test integration-crc
ci/prow/e2e-crc e7445d4 link true /test e2e-crc

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Ready for review
Development

Successfully merging this pull request may close these issues.

None yet

6 participants