Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Manage kernel message queueing better to prevent out-of-order execution #9571

Merged
merged 5 commits into from Jan 8, 2021

Conversation

jasongrout
Copy link
Contributor

References

Fixes #9566
Followup on #8562
Changes solution in #9484

Code changes

If we restarted a kernel, then quickly evaluated a lot of cells, we were often seeing the cells evaluated out of order. This came because the initial evaluations would be queued (because we had the kernel restarting sentinel in place), but later evaluations would happen synchronously, even if there were still messages queued. The logic is now changed to (almost) always queue a message if there are already queued messages waiting to be sent to preserve the message order.

One exception to this is the kernel info request when we are restarting. We redo the logic in #9484 to encode the exception in the _sendMessage function (rather than hacking around the conditions for queueing a message). This brings the exception closer to the logic it is working around, so it is a bit cleaner.

Also, we realize that the sendMessage queue parameter is really signifying when we are sending pending messages. As such, we always try to send those messages if we can.

Finally, we saw that there was a race condition between sending messages after a restart and when the websocket was reconnected, leading to some stalled initial message replies. We delete the logic that sends pending messages on shutdown_reply, since those pending messages will be more correctly sent when the websocket reconnects anyway. We also don’t worry about setting the kernel session there since the calling function does that logic.

User-facing changes

Backwards-incompatible changes

Fixes jupyterlab#9566
Followup on jupyterlab#8562
Changes solution in jupyterlab#9484

If we restarted a kernel, then quickly evaluated a lot of cells, we were often seeing the cells evaluated out of order. This came because the initial evaluations would be queued (because we had the kernel restarting sentinel in place), but later evaluations would happen synchronously, even if there were still messages queued. The logic is now changed to (almost) always queue a message if there are already queued messages waiting to be sent to preserve the message order.

One exception to this is the kernel info request when we are restarting. We redo the logic in jupyterlab#9484 to encode the exception in the _sendMessage function (rather than hacking around the conditions for queueing a message). This brings the exception closer to the logic it is working around, so it is a bit cleaner.

Also, we realize that the sendMessage `queue` parameter is really signifying when we are sending pending messages. As such, we always try to send those messages if we can.

Finally, we saw that there was a race condition between sending messages after a restart and when the websocket was reconnected, leading to some stalled initial message replies. We delete the logic that sends pending messages on shutdown_reply, since those pending messages will be more correctly sent when the websocket reconnects anyway. We also don’t worry about setting the kernel session there since the calling function does that logic.
@jasongrout jasongrout added this to the 3.0 milestone Jan 7, 2021
@jupyterlab-dev-mode
Copy link

Thanks for making a pull request to JupyterLab!

To try out this branch on binder, follow this link: Binder

@jasongrout
Copy link
Contributor Author

jasongrout commented Jan 7, 2021

From #9566 (comment):

I tried the generated binder, and I could see once that all the cells were marked with * but not executed (and the kernel was idle), but couldn't reproduce while recording a GIF. Maybe you solved the out-of-order bug and this is another one?

@davidbrochart - Yeah, possibly there is another issue with stalling. @minrk did put in a fallback that should hopefully help in those cases, but there are probably lots of reasons why the messages could be stuck even if the kernel is idle.

I think this does solve the out-of-order problem.

I think there still may be issues with the logic of #8562 - would like to look at that logic again at some point.

@jasongrout
Copy link
Contributor Author

@davidbrochart - did you review the code as well?

@jasongrout
Copy link
Contributor Author

The usage test failure looks unrelated:

[I 2021-01-08 00:45:50.061 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Traceback (most recent call last):
  File "./test_install/bin/jupyter-labhub", line 8, in <module>
    sys.exit(main())
  File "/home/runner/work/jupyterlab/jupyterlab/test_install/lib/python3.8/site-packages/jupyterlab/labhubapp.py", line 531, in main
    return OverrideSingleUserNotebookApp.launch_instance(argv)
  File "/home/runner/work/jupyterlab/jupyterlab/test_install/lib/python3.8/site-packages/jupyter_server/extension/application.py", line 500, in launch_instance
    serverapp.start()
  File "/home/runner/work/jupyterlab/jupyterlab/test_install/lib/python3.8/site-packages/jupyter_server/serverapp.py", line 1966, in start
    self.start_app()
  File "/home/runner/work/jupyterlab/jupyterlab/test_install/lib/python3.8/site-packages/jupyter_server/serverapp.py", line 1931, in start_app
    self._handle_browser_opening()
  File "/home/runner/work/jupyterlab/jupyterlab/test_install/lib/python3.8/site-packages/jupyter_server/serverapp.py", line 978, in _handle_browser_opening
    if self.starter_app:
  File "/home/runner/work/jupyterlab/jupyterlab/test_install/lib/python3.8/site-packages/jupyter_server/serverapp.py", line 956, in starter_app
    return self.extension_manager.extension_points.get(name, None).app
AttributeError: 'NoneType' object has no attribute 'app'
+ kill 10128
./scripts/ci_script.sh: line 318: kill: (10128) - No such process
Error: Process completed with exit code 1.

@jasongrout
Copy link
Contributor Author

jasongrout commented Jan 8, 2021

The linkcheck failure also looks unrelated - a bunch of links have a // in them, apparently:

FAILED docs/build/html/user/urls.html::/home/runner/work/jupyterlab/jupyterlab/docs/build/html/user/urls.html <a href=https://github.com/jupyterlab/jupyterlab//master/docs/source/user/urls.rst>

@jasongrout
Copy link
Contributor Author

The linkcheck failure also looks unrelated - a bunch of links have a // in them, apparently:

Link check failure addressed in #9572

@davidbrochart
Copy link
Contributor

@davidbrochart - did you review the code as well?

I don't think I can make a valuable review yet, I've never looked at jlab's code 😄

@jasongrout
Copy link
Contributor Author

@jtpio - just checking, was your comment an indication that you reviewed the PR as a whole?

@Zsailer
Copy link
Member

Zsailer commented Jan 8, 2021

@jasongrout, the failing test will be fixed by jupyter-server/jupyter_server#379. I'll make a patch release in a few minutes.

Copy link
Member

@blink1073 blink1073 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thank you!

@blink1073
Copy link
Member

blink1073 commented Jan 8, 2021

I'll restart CI once there is a new server release. Once the usage test passes, I'll merge and cut a release.

@Zsailer
Copy link
Member

Zsailer commented Jan 8, 2021

jupyter_server v.1.2.1 released.

@blink1073
Copy link
Member

Good to go, thanks all!

@blink1073 blink1073 merged commit b62371e into jupyterlab:master Jan 8, 2021
@github-actions github-actions bot added the status:resolved-locked Closed issues are locked after 30 days inactivity. Please open a new issue for related discussion. label Jul 8, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pkg:csvviewer pkg:services status:resolved-locked Closed issues are locked after 30 days inactivity. Please open a new issue for related discussion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Out-of-order execution
5 participants