adding exception handling for room start tasks #30

Closed
wants to merge 2 commits into from

Conversation

jzhang20133
Collaborator

We have occasionally observed that the task group in websocket_server becomes inactive. When that happens, users are stuck: they can no longer access files and have to restart their notebook servers to recover. The _task_group instance variable in websocket_server becomes inactive when one of its room-start tasks, or a task in one of its child task groups, fails with an exception.

One place where an exception can occur is the _broadcast_update method, which runs as a key task in a child task group of _task_group in websocket_server. This PR handles and logs that exception so the parent task does not crash, while still surfacing the problem so that we can handle it better in the future.

Resolving open source issues:
jupyterlab/jupyter-collaboration#290
jupyterlab/jupyter-collaboration#245

welcome bot commented Apr 23, 2024

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

Collaborator

@davidbrochart left a comment

Thanks @jzhang20133.
I suggest going a step further: stop serving a client if it fails, and stop writing to the store if it fails.
I'll merge #24 first, I don't think it will affect this PR too much, but you may have to rebase.

        self.log.debug("Writing Y update to YStore")
        self._task_group.start_soon(self.ystore.write, update)
except Exception as e:
    self.log.error("Error broadcast updates for room:", exc_info=e)
Collaborator

Suggested change
- self.log.error("Error broadcast updates for room:", exc_info=e)
+ self.log.error("Error broadcasting YRoom updates", exc_info=e)

Comment on lines 133 to 145
for client in self.clients:
    self.log.debug("Sending Y update to client with endpoint: %s", client.path)
    message = create_update_message(update)
    self._task_group.start_soon(client.send, message)
Collaborator

Should we remove a client if we cannot send them updates? Otherwise it will likely fail on every update.

Suggested change
- for client in self.clients:
-     self.log.debug("Sending Y update to client with endpoint: %s", client.path)
-     message = create_update_message(update)
-     self._task_group.start_soon(client.send, message)
+ for client in self.clients:
+     try:
+         self.log.debug("Sending Y update to client with endpoint: %s", client.path)
+         message = create_update_message(update)
+         self._task_group.start_soon(client.send, message)
+     except Exception as e:
+         self.log.error("Error sending Y update to client with endpoint: %s", client.path, exc_info=e)
+         self.clients.remove(client)

Collaborator Author

The send method in the YDocWebsocketHandler class already has a try/except that captures all exceptions and logs a debug message: https://github.com/jupyterlab/jupyter-collaboration/blob/adcde3229c986ffe498b8e6a18857195e1605ace/projects/jupyter-server-ydoc/jupyter_server_ydoc/handlers.py#L222 So it is unlikely that this client.send flow is what raises the exception.

Collaborator Author

Also, if a single send fails for a temporary reason, we should probably not remove the client. But for some specific exceptions, like WebSocketClosedError, we should remove it.

Member

> Also, if a single send fails for a temporary reason, we should probably not remove the client.

Yeah, I think we need some tolerance here. Perhaps we could have a time-based tolerance before removing the client: if it has not responded for some fixed time, say 1 minute, remove it. For tornado-based servers, we should likely match the websocket ping timeout configured by jupyter_server.

> But for some specific exceptions, like WebSocketClosedError, we should remove it.

As @davidbrochart already mentioned, this package is agnostic to the server framework, so we can't use this error type. However, if we add a time-based tolerance for exceptions, we don't have to be so specific about the exception type here.
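To make the time-based tolerance idea concrete, here is a minimal sketch under stated assumptions. `SendFailureTracker` and its methods are hypothetical names, not part of pycrdt-websocket or jupyter_server, and the 60-second default only mirrors the "say 1 minute" figure above. The tracker remembers when a client first started failing and reports that it should be removed only once failures have persisted for the whole tolerance window, without looking at the exception type.

```python
import time
from typing import Callable, Dict, Hashable


class SendFailureTracker:
    """Hypothetical helper: decide when a failing client should be dropped."""

    def __init__(self, tolerance: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._tolerance = tolerance
        self._clock = clock
        # Maps a client to the time of its first failure in the current streak.
        self._first_failure: Dict[Hashable, float] = {}

    def record_success(self, client: Hashable) -> None:
        # A successful send ends the failure streak for this client.
        self._first_failure.pop(client, None)

    def record_failure(self, client: Hashable) -> bool:
        # Return True once the client has been failing for the full tolerance.
        now = self._clock()
        first = self._first_failure.setdefault(client, now)
        return now - first >= self._tolerance
```

A caller would invoke `record_failure(client)` in its exception path and remove the client only when the call returns True, so a single transient failure is tolerated.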

Collaborator Author

I think I will remove the WebSocketClosedError handling for now.

Comment on lines 137 to 156
if self.ystore:
    self.log.debug("Writing Y update to YStore")
    self._task_group.start_soon(self.ystore.write, update)
Collaborator

Should we stop writing to YStore if it fails?

Suggested change
- if self.ystore:
-     self.log.debug("Writing Y update to YStore")
-     self._task_group.start_soon(self.ystore.write, update)
+ if self.ystore:
+     try:
+         self._task_group.start_soon(self.ystore.write, update)
+         self.log.debug("Writing Y update to YStore")
+     except Exception as e:
+         self.ystore = None
+         self.log.error("Error writing Y update to YStore", exc_info=e)

Collaborator Author

If an update write fails because of a temporary issue and the next write succeeds, would the one missing Y update cause any problem for future updates, or would it recover automatically? I have deliberately raised exceptions in the ystore update flow, but I did not notice any obvious issue in the UI.

@@ -229,7 +232,7 @@ async def serve(self, websocket: Websocket):
                 )
                 tg.start_soon(client.send, message)
     except Exception as e:
-        self.log.debug("Error serving endpoint: %s", websocket.path, exc_info=e)
+        self.log.error("Error serving endpoint: %s", websocket.path, exc_info=e)
Collaborator

Should we remove a client if we cannot serve them?

Suggested change
- self.log.error("Error serving endpoint: %s", websocket.path, exc_info=e)
+ self.log.error("Error serving endpoint: %s", websocket.path, exc_info=e)
+ self.clients.remove(websocket)

@jzhang20133
Collaborator Author

@davidbrochart and @Zsailer would you like to review this PR again?

@davidbrochart
Collaborator

@jzhang20133 in the future could you create a new branch from which you send a PR? I see your branch is main.

@davidbrochart
Collaborator

I just rebased.

@@ -16,6 +16,7 @@
 from anyio.abc import TaskGroup, TaskStatus
 from anyio.streams.memory import MemoryObjectReceiveStream, MemoryObjectSendStream
 from pycrdt import Doc, Subscription
+from tornado.websocket import WebSocketClosedError
Collaborator

pycrdt-websocket is server-agnostic, so Tornado cannot be used here.

Member

👍

@Zsailer
Member

Zsailer commented Apr 26, 2024

> @jzhang20133 in the future could you create a new branch from which you send a PR? I see your branch is main.

Just to explain the reasoning here, @jzhang20133. Branching helps maintainers, so they can help you rebase or push changes directly to your branch without contaminating their local main branch's history. Otherwise, it adds work for reviewers who might be reviewing multiple PRs in parallel.

@davidbrochart
Collaborator

I'd like to align this PR with #31, so that we don't directly handle exceptions in pycrdt-websocket, but through an optional external exception handler. That would give more control to users who want to handle exceptions and not just ignore them (although they can if they use the exception_logger). What do you think?
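As a rough illustration of the external-handler approach, the sketch below assumes a handler is a callable that receives the exception and a logger and returns True when it considers the exception handled. The `Room` class, its `broadcast` method, and `failing_send` are hypothetical, and this `exception_logger` is only a guess at what such a handler might look like, not the library's actual implementation.

```python
import logging
from typing import Callable, Optional

# Assumed handler shape: return True if the exception was dealt with.
ExceptionHandler = Callable[[Exception, logging.Logger], bool]


def exception_logger(exception: Exception, log: logging.Logger) -> bool:
    # Log the exception and report it as handled, so it is not re-raised.
    log.error("Unhandled exception in room task", exc_info=exception)
    return True


class Room:
    """Hypothetical room that delegates error policy to an optional handler."""

    def __init__(self, exception_handler: Optional[ExceptionHandler] = None,
                 log: Optional[logging.Logger] = None) -> None:
        self.exception_handler = exception_handler
        self.log = log or logging.getLogger("room")

    def _handle_exception(self, exception: Exception) -> None:
        handled = False
        if self.exception_handler is not None:
            handled = self.exception_handler(exception, self.log)
        if not handled:
            # No handler, or the handler declined: keep the default behavior.
            raise exception

    def broadcast(self, send: Callable[[bytes], None], update: bytes) -> None:
        try:
            send(update)
        except Exception as e:
            self._handle_exception(e)


def failing_send(update: bytes) -> None:
    # Hypothetical transport that always fails.
    raise ConnectionError("client went away")
```

With this shape, `Room(exception_handler=exception_logger)` logs and continues, while `Room()` keeps raising, so users who want stricter handling can supply their own callable.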

@jzhang20133
Collaborator Author

Sounds good, I will address those. @davidbrochart @Zsailer

3 participants