New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Manage kernel message queueing better to prevent out-of-order execution #9571
Conversation
Fixes jupyterlab#9566 Followup on jupyterlab#8562 Changes solution in jupyterlab#9484 If we restarted a kernel, then quickly evaluated a lot of cells, we were often seeing the cells evaluated out of order. This came because the initial evaluations would be queued (because we had the kernel restarting sentinel in place), but later evaluations would happen synchronously, even if there were still messages queued. The logic is now changed to (almost) always queue a message if there are already queued messages waiting to be sent to preserve the message order. One exception to this is the kernel info request when we are restarting. We redo the logic in jupyterlab#9484 to encode the exception in the _sendMessage function (rather than hacking around the conditions for queueing a message). This brings the exception closer to the logic it is working around, so it is a bit cleaner. Also, we realize that the sendMessage `queue` parameter is really signifying when we are sending pending messages. As such, we always try to send those messages if we can. Finally, we saw that there was a race condition between sending messages after a restart and when the websocket was reconnected, leading to some stalled initial message replies. We delete the logic that sends pending messages on shutdown_reply, since those pending messages will be more correctly sent when the websocket reconnects anyway. We also don’t worry about setting the kernel session there since the calling function does that logic.
Thanks for making a pull request to JupyterLab! To try out this branch on binder, follow this link: |
From #9566 (comment):
@davidbrochart - Yeah, possibly there is another issue with stalling. @minrk did put in a fallback that should hopefully help in those cases, but there are probably lots of reasons why the messages could be stuck even if the kernel is idle. I think this does solve the out-of-order problem. I think there still may be issues with the logic of #8562 - would like to look at that logic again at some point. |
@davidbrochart - did you review the code as well? |
The usage test failure looks unrelated:
|
The linkcheck failure also looks unrelated - a bunch of links have a
|
Link check failure addressed in #9572 |
I don't think I can make a valuable review yet, I've never looked at jlab's code 😄 |
@jtpio - just checking, was your comment an indication that you reviewed the PR as a whole? |
@jasongrout, the failing test will be fixed by jupyter-server/jupyter_server#379. I'll make a patch release in a few minutes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thank you!
I'll restart CI once there is a new server release. Once the usage test passes, I'll merge and cut a release. |
jupyter_server v.1.2.1 released. |
Good to go, thanks all! |
References
Fixes #9566
Followup on #8562
Changes solution in #9484
Code changes
If we restarted a kernel, then quickly evaluated a lot of cells, we were often seeing the cells evaluated out of order. This came because the initial evaluations would be queued (because we had the kernel restarting sentinel in place), but later evaluations would happen synchronously, even if there were still messages queued. The logic is now changed to (almost) always queue a message if there are already queued messages waiting to be sent to preserve the message order.
One exception to this is the kernel info request when we are restarting. We redo the logic in #9484 to encode the exception in the _sendMessage function (rather than hacking around the conditions for queueing a message). This brings the exception closer to the logic it is working around, so it is a bit cleaner.
Also, we realize that the sendMessage
queue
parameter is really signifying when we are sending pending messages. As such, we always try to send those messages if we can.Finally, we saw that there was a race condition between sending messages after a restart and when the websocket was reconnected, leading to some stalled initial message replies. We delete the logic that sends pending messages on shutdown_reply, since those pending messages will be more correctly sent when the websocket reconnects anyway. We also don’t worry about setting the kernel session there since the calling function does that logic.
User-facing changes
Backwards-incompatible changes