Manage kernel message queueing better to prevent out-of-order execution #9571

jasongrout · 2021-01-07T22:36:28Z

References

Fixes #9566
Followup on #8562
Changes solution in #9484

Code changes

If we restarted a kernel and then quickly evaluated a lot of cells, the cells were often evaluated out of order. This happened because the initial evaluations were queued (the kernel restarting sentinel was in place), while later evaluations were sent synchronously, even if there were still messages queued. To preserve message order, the logic now (almost) always queues a message if there are already queued messages waiting to be sent.
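A minimal sketch of the ordering rule described above, assuming a simplified, synchronous sender. The `OrderedSender` class and its members are hypothetical names for illustration and are not the actual kernel connection code in `packages/services`:

```typescript
// Hypothetical, simplified model of the queueing rule (not JupyterLab's code).
type SketchMessage = { id: number };

class OrderedSender {
  private pending: SketchMessage[] = [];
  // Ids of messages that were actually put on the wire, in order.
  readonly sent: number[] = [];
  connected = false;

  send(msg: SketchMessage): void {
    // Queue if we cannot send right now OR if older messages are still
    // waiting, so a new message never overtakes a queued one.
    if (!this.connected || this.pending.length > 0) {
      this.pending.push(msg);
      return;
    }
    this.sent.push(msg.id);
  }

  // Drain the queue in FIFO order while the connection is usable.
  flush(): void {
    while (this.connected && this.pending.length > 0) {
      const next = this.pending.shift()!;
      this.sent.push(next.id);
    }
  }
}

const sender = new OrderedSender();
sender.send({ id: 1 }); // not connected yet: queued
sender.connected = true;
sender.send({ id: 2 }); // connected, but message 1 is still queued: queued too
sender.flush(); // sends 1, then 2
sender.send({ id: 3 }); // queue is empty: sent directly
```

Without the `this.pending.length > 0` check, message 2 would have been sent ahead of message 1, which is the kind of reordering this PR addresses.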

One exception to this is the kernel info request when we are restarting. We redo the logic in #9484 to encode the exception in the _sendMessage function (rather than hacking around the conditions for queueing a message). This brings the exception closer to the logic it is working around, so it is a bit cleaner.
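As a rough illustration of keeping the exception next to the send logic, here is a sketch under stated assumptions. `RestartAwareSender`, `SketchMsg`, and the `restarting` flag are hypothetical stand-ins; the real `_sendMessage` may differ in its conditions and error handling:

```typescript
// Hypothetical sketch; names and conditions are illustrative assumptions.
type SketchMsgType = 'kernel_info_request' | 'execute_request';

interface SketchMsg {
  id: number;
  msgType: SketchMsgType;
}

class RestartAwareSender {
  restarting = false;
  connected = true;
  readonly pending: SketchMsg[] = [];
  // Ids of messages sent immediately, in order.
  readonly wire: number[] = [];

  sendMessage(msg: SketchMsg): void {
    // The one exception: while restarting, a kernel info request bypasses
    // the queue and is sent right away (or fails if we are not connected).
    if (this.restarting && msg.msgType === 'kernel_info_request') {
      if (!this.connected) {
        throw new Error('Cannot send kernel info request: not connected');
      }
      this.wire.push(msg.id);
      return;
    }
    // Everything else is queued while restarting, while disconnected,
    // or while older messages are still waiting.
    if (this.restarting || !this.connected || this.pending.length > 0) {
      this.pending.push(msg);
      return;
    }
    this.wire.push(msg.id);
  }
}

const restartSender = new RestartAwareSender();
restartSender.restarting = true;
restartSender.sendMessage({ id: 1, msgType: 'execute_request' });
restartSender.sendMessage({ id: 2, msgType: 'kernel_info_request' });
```

In this sketch the execute request stays queued while the kernel info request goes out immediately, and the special case lives inside the send function rather than in the caller's queueing conditions.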

Also, we realized that the sendMessage `queue` parameter really signifies whether we are sending pending messages. As such, we always try to send those messages if we can.

Finally, we saw that there was a race condition between sending messages after a restart and when the websocket was reconnected, leading to some stalled initial message replies. We delete the logic that sends pending messages on shutdown_reply, since those pending messages will be more correctly sent when the websocket reconnects anyway. We also don’t worry about setting the kernel session there since the calling function does that logic.
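The change in where pending messages get flushed can be sketched as follows. This is a simplified model with hypothetical names (`ReconnectFlusher`, `onShutdownReply`, `onSocketOpen`), not the actual handlers in the services package:

```typescript
// Hypothetical sketch of flushing on reconnect rather than on shutdown_reply.
type SketchConnStatus = 'connecting' | 'connected';

class ReconnectFlusher {
  private status: SketchConnStatus = 'connecting';
  private queue: string[] = [];
  // Messages delivered to the (imaginary) websocket, in order.
  readonly delivered: string[] = [];

  enqueue(code: string): void {
    this.queue.push(code);
  }

  // Handling shutdown_reply no longer sends pending messages: the websocket
  // may not be reconnected yet, which is where the race came from.
  onShutdownReply(): void {
    // Intentionally empty in this sketch.
  }

  // Pending messages are sent once the websocket is connected again.
  onSocketOpen(): void {
    this.status = 'connected';
    this.sendPending();
  }

  private sendPending(): void {
    while (this.status === 'connected' && this.queue.length > 0) {
      this.delivered.push(this.queue.shift()!);
    }
  }
}

const flusher = new ReconnectFlusher();
flusher.enqueue('a = 1');
flusher.enqueue('print(a)');
flusher.onShutdownReply();
const deliveredAfterShutdown = flusher.delivered.length; // nothing sent yet
flusher.onSocketOpen(); // both queued messages are sent, in order
```

Having a single place that drains the queue (the reconnect path) avoids two code paths competing to send the same pending messages.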

User-facing changes

Backwards-incompatible changes

jupyterlab-dev-mode · 2021-01-07T22:36:44Z

Thanks for making a pull request to JupyterLab!

To try out this branch on binder, follow this link:

jasongrout · 2021-01-07T23:13:14Z

From #9566 (comment):

> I tried the generated binder, and I could see once that all the cells were marked with * but not executed (and the kernel was idle), but couldn't reproduce while recording a GIF. Maybe you solved the out-of-order bug and this is another one?

@davidbrochart - Yeah, possibly there is another issue with stalling. @minrk did put in a fallback that should hopefully help in those cases, but there are probably lots of reasons why the messages could be stuck even if the kernel is idle.

I think this does solve the out-of-order problem.

I think there still may be issues with the logic of #8562 - would like to look at that logic again at some point.

jasongrout · 2021-01-08T00:27:26Z

@davidbrochart - did you review the code as well?

jasongrout · 2021-01-08T07:39:44Z

The usage test failure looks unrelated:

[I 2021-01-08 00:45:50.061 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Traceback (most recent call last):
  File "./test_install/bin/jupyter-labhub", line 8, in <module>
    sys.exit(main())
  File "/home/runner/work/jupyterlab/jupyterlab/test_install/lib/python3.8/site-packages/jupyterlab/labhubapp.py", line 531, in main
    return OverrideSingleUserNotebookApp.launch_instance(argv)
  File "/home/runner/work/jupyterlab/jupyterlab/test_install/lib/python3.8/site-packages/jupyter_server/extension/application.py", line 500, in launch_instance
    serverapp.start()
  File "/home/runner/work/jupyterlab/jupyterlab/test_install/lib/python3.8/site-packages/jupyter_server/serverapp.py", line 1966, in start
    self.start_app()
  File "/home/runner/work/jupyterlab/jupyterlab/test_install/lib/python3.8/site-packages/jupyter_server/serverapp.py", line 1931, in start_app
    self._handle_browser_opening()
  File "/home/runner/work/jupyterlab/jupyterlab/test_install/lib/python3.8/site-packages/jupyter_server/serverapp.py", line 978, in _handle_browser_opening
    if self.starter_app:
  File "/home/runner/work/jupyterlab/jupyterlab/test_install/lib/python3.8/site-packages/jupyter_server/serverapp.py", line 956, in starter_app
    return self.extension_manager.extension_points.get(name, None).app
AttributeError: 'NoneType' object has no attribute 'app'
+ kill 10128
./scripts/ci_script.sh: line 318: kill: (10128) - No such process
Error: Process completed with exit code 1.

jasongrout · 2021-01-08T07:40:55Z

The linkcheck failure also looks unrelated - a bunch of links have a // in them, apparently:

FAILED docs/build/html/user/urls.html::/home/runner/work/jupyterlab/jupyterlab/docs/build/html/user/urls.html <a href=https://github.com/jupyterlab/jupyterlab//master/docs/source/user/urls.rst>

jasongrout · 2021-01-08T08:38:05Z

> The linkcheck failure also looks unrelated - a bunch of links have a // in them, apparently:

Link check failure addressed in #9572

davidbrochart · 2021-01-08T08:46:34Z

> @davidbrochart - did you review the code as well?

I don't think I can make a valuable review yet, I've never looked at jlab's code 😄

packages/services/src/kernel/messages.ts

jasongrout · 2021-01-08T11:27:10Z

@jtpio - just checking, was your comment an indication that you reviewed the PR as a whole?

Zsailer · 2021-01-08T17:51:51Z

@jasongrout, the failing test will be fixed by jupyter-server/jupyter_server#379. I'll make a patch release in a few minutes.

blink1073

Looks good, thank you!

blink1073 · 2021-01-08T18:39:21Z

I'll restart CI once there is a new server release. Once the usage test passes, I'll merge and cut a release.

Zsailer · 2021-01-08T18:53:37Z

jupyter_server v.1.2.1 released.

blink1073 · 2021-01-08T19:37:48Z

Good to go, thanks all!

jasongrout added 3 commits January 7, 2021 14:24

Add isExecuteRequestMsg typecheck function.

b1409f1

Remove debugging messages.

56faad1

jasongrout added this to the 3.0 milestone Jan 7, 2021

github-actions bot added the pkg:services label Jan 7, 2021

jasongrout mentioned this pull request Jan 7, 2021

Out-of-order execution #9566

Closed

lint

d63d9f9

github-actions bot added the pkg:csvviewer label Jan 8, 2021

jtpio reviewed Jan 8, 2021

packages/services/src/kernel/messages.ts (outdated, resolved)

Remove unused type guard function.

b5ae089

blink1073 approved these changes Jan 8, 2021

blink1073 merged commit b62371e into jupyterlab:master Jan 8, 2021

github-actions bot added the status:resolved-locked label Jul 8, 2021

github-actions bot locked as resolved and limited conversation to collaborators Jul 8, 2021
