Out-of-order execution #9566

davidbrochart opened this issue Jan 6, 2021 · 22 comments

Manually executing cells while the kernel is restarting leads to out-of-order cell execution:
  1. Open this binder:
  2. Restart the kernel
  3. Quickly execute every cell by hand while the kernel is restarting
  4. After execution has finished, you should see some cells still marked with *, and/or some cells executed out of order, potentially leading to execution errors.

Expected behavior

Cells should be executed in order.


  • Operating System and version: Linux Ubuntu 20.04 LTS
  • Browser and version: Google Chrome 87.0.4280.88 (Official Build) (64-bit)
  • JupyterLab version: 3.0.1
Troubleshoot Output
jovyan@jupyter-jupyter-2dwidgets-2dipyleaflet-2duyxyynfv:~$ jupyter troubleshoot



3.7.8 | packaged by conda-forge | (default, Jul 23 2020, 03:54:19)
[GCC 7.5.0]


which -a jupyter:

pip list:
Browser Output
243.d9d134ad2ffdfb9205b3.js:2 Deprecated include of L.Mixin.Events: this property will be removed in future releases, please inherit from L.Evented instead. Error
    at Function.C.extend (
    at Object.665 (
    at j (
    at Object.w. (
    at j (
    at Object.8197 (
    at j (
    at Module.4795 (
(anonymous) @ 243.d9d134ad2ffdfb9205b3.js:2
jlab_core.3099b1be2bd042406f2c.js:2 Ignored setting registry preload errors. Array(1)
(anonymous) @ jlab_core.3099b1be2bd042406f2c.js:2
jlab_core.3099b1be2bd042406f2c.js:2 Ignored setting registry preload errors. Array(1)
(anonymous) @ jlab_core.3099b1be2bd042406f2c.js:2
jlab_core.3099b1be2bd042406f2c.js:2 Connection lost, reconnecting in 0 seconds.
_reconnect @ jlab_core.3099b1be2bd042406f2c.js:2
static/components/MathJax/MathJax.js?config=TeX-AMS-MML_HTMLorMML-full,Safe&delayStartupUntil=configured:1 Uncaught SyntaxError: Unexpected token '<'
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught ReferenceError: MathJax is not defined
    at i._onLoad (jlab_core.3099b1be2bd042406f2c.js:2)
    at HTMLScriptElement. (jlab_core.3099b1be2bd042406f2c.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Connection lost, reconnecting in 0 seconds.
_reconnect @ jlab_core.3099b1be2bd042406f2c.js:2
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Ee.t.send (272.2a8425db7209008188fc.js:1)
    at Ie.t.send (272.2a8425db7209008188fc.js:1)
    at n.o (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Ee.t.send (272.2a8425db7209008188fc.js:1)
    at Ie.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._propagateEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Ee.t.send (272.2a8425db7209008188fc.js:1)
    at Ie.t.send (272.2a8425db7209008188fc.js:1)
    at n.o (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Ee.t.send (272.2a8425db7209008188fc.js:1)
    at Ie.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._propagateEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Ee.t.send (272.2a8425db7209008188fc.js:1)
    at Ie.t.send (272.2a8425db7209008188fc.js:1)
    at n.o (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Connection lost, reconnecting in 0 seconds.
_reconnect @ jlab_core.3099b1be2bd042406f2c.js:2
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Ee.t.send (272.2a8425db7209008188fc.js:1)
    at Ie.t.send (272.2a8425db7209008188fc.js:1)
    at n.o (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Ee.t.send (272.2a8425db7209008188fc.js:1)
    at Ie.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._propagateEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send
    at l.send (jlab_core.3099b1be2bd042406f2c.js:2)
    at e.send (272.2a8425db7209008188fc.js:1)
    at Yt.t.send (272.2a8425db7209008188fc.js:1)
    at Xt.t.send (272.2a8425db7209008188fc.js:1)
    at n. (138.cfc773cf0b77045b9fbb.js:1)
    at (243.d9d134ad2ffdfb9205b3.js:2)
    at n._fireDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at n._handleDOMEvent (243.d9d134ad2ffdfb9205b3.js:2)
    at HTMLDivElement.r (243.d9d134ad2ffdfb9205b3.js:2)
jlab_core.3099b1be2bd042406f2c.js:2 Connection lost, reconnecting in 0 seconds.
_reconnect @ jlab_core.3099b1be2bd042406f2c.js:2
jlab_core.3099b1be2bd042406f2c.js:2 Connection lost, reconnecting in 0 seconds.
_reconnect @ jlab_core.3099b1be2bd042406f2c.js:2
fperez commented Jan 6, 2021

Does it happen if using "run all" instead? I think the execution queue is populated differently, but still worth checking...

fperez commented Jan 6, 2021

BTW - for me it's been impossible to reproduce it even as reported above, but that can be variations in the binder node I happened to get. My kernel restarts were super fast, and execution (whether manual or "run all") came very quickly and always in-order. So it may be a bug that's a little tricky to trigger/replicate.

Contributor Author

Does it happen if using "run all" instead? I think the execution queue is populated differently, but still worth checking...

No, it happens only when executing the cells manually.

Contributor Author

The browser output shows a bunch of jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send errors, maybe it is related?

fperez commented Jan 6, 2021

Sounds possible, though I don't know that code at all, unfortunately... Worth perhaps instrumenting that particular path to add more info about what kind of message it was trying to send.

This is definitely a pretty serious bug, as it will lead to potentially very hard to understand behavior if a user doesn't realize it happened (and depending on their code, it could still run, just produce odd results).

jtpio commented Jan 6, 2021

The browser output shows a bunch of jlab_core.3099b1be2bd042406f2c.js:2 Uncaught Error: Cannot send errors, maybe it is related?

Looks like the error might come from this?

throw new Error('Cannot send');

Contributor Author

If it's only possible to reproduce depending on location (I opened the binder from France), I'd be happy to help with debugging.

Looks like the error might come from this?

So it looks like it may be tied to comm messages? So perhaps widgets are a good way to reproduce it?

sackh commented Jan 7, 2021

I am not sure if this is related, but I have opened another issue in which I see Uncaught Error: Cannot send after restarting the kernel: jupyter-widgets/ipyleaflet#774


jasongrout commented Jan 7, 2021

I think the root problem here isn't with comm messages. I can occasionally reproduce the issue on a local install of all current packages in a fresh conda env with a notebook with simple inputs. Here is one result of restarting the kernel and immediately evaluating the cells in this notebook from top to bottom:

Screen Shot 2021-01-07 at 12 15 24 AM


    // Send if the ws allows it, otherwise buffer the message.
    if (
      this.connectionStatus === 'connected' &&
      this._kernelSession !== RESTARTING_KERNEL_SESSION
    ) {
      // (send branch: body not shown in the quoted excerpt)
    } else if (queue) {
      // (queue branch: body not shown in the quoted excerpt)
    } else {
      throw new Error('Could not send message');
    }

(Notice that there might be circumstances where messages are sent even if there are already pending messages that should actually go first.)

jtpio commented Jan 7, 2021

I think I found a place in the logic where some messages get queued, then others are sent (jumping the queue)

Maybe that explains the example above? With:

  • The kernel is restarting, so the execute request messages from the first cells are being queued
  • Once the kernel has restarted, messages from the cells below go through and are sent
  • The queue is processed and the previously queued messages are sent

Yes, that's my hypothesis. And probably the queued messages are scheduled for sending, but async, so the direct sending cuts in line?

We have to be careful, because we do want kernel info requests to jump the queue.
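
The hypothesised sequence above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not JupyterLab code: `ToyConnection`, `finishRestart`, `flushPending`, and `simulateRace` are hypothetical names, and the asynchronous flush is modelled as an explicit call that happens after a later direct send.

```typescript
// Toy model of the hypothesised race; NOT the real JupyterLab kernel connection.
class ToyConnection {
  restarting = true;
  readonly pending: string[] = [];
  readonly sent: string[] = [];

  // Mirrors the shape of the quoted branch: send directly when allowed,
  // otherwise buffer. Note that it never checks whether `pending` is empty.
  send(msg: string): void {
    if (!this.restarting) {
      this.sent.push(msg);
    } else {
      this.pending.push(msg);
    }
  }

  // The restart sentinel is cleared, but buffered messages are not flushed yet.
  finishRestart(): void {
    this.restarting = false;
  }

  // Stands in for the flush that (by hypothesis) only runs later.
  flushPending(): void {
    while (this.pending.length > 0) {
      this.sent.push(this.pending.shift() as string);
    }
  }
}

function simulateRace(): string[] {
  const conn = new ToyConnection();
  conn.send('cell 1'); // buffered: kernel is restarting
  conn.send('cell 2'); // buffered: kernel is restarting
  conn.finishRestart();
  conn.send('cell 3'); // sent directly, ahead of the buffered messages
  conn.flushPending(); // buffered messages finally go out
  return conn.sent;
}
```

In this model `simulateRace()` returns the messages in the order `cell 3`, `cell 1`, `cell 2`, which is the kind of reordering reported in this issue.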


Relevant issues:

jasongrout added a commit to jasongrout/jupyterlab that referenced this issue Jan 7, 2021
Fixes jupyterlab#9566
Followup on jupyterlab#8562
Changes solution in jupyterlab#9484

If we restarted a kernel, then quickly evaluated a lot of cells, we were often seeing the cells evaluated out of order. This came because the initial evaluations would be queued (because we had the kernel restarting sentinel in place), but later evaluations would happen synchronously, even if there were still messages queued. The logic is now changed to (almost) always queue a message if there are already queued messages waiting to be sent to preserve the message order.

One exception to this is the kernel info request when we are restarting. We redo the logic in jupyterlab#9484 to encode the exception in the _sendMessage function (rather than hacking around the conditions for queueing a message). This brings the exception closer to the logic it is working around, so it is a bit cleaner.

Also, we realize that the sendMessage `queue` parameter is really signifying when we are sending pending messages. As such, we always try to send those messages if we can.

Finally, we saw that there was a race condition between sending messages after a restart and when the websocket was reconnected, leading to some stalled initial message replies. We delete the logic that sends pending messages on shutdown_reply, since those pending messages will be more correctly sent when the websocket reconnects anyway. We also don’t worry about setting the kernel session there since the calling function does that logic.
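
The ordering rule described in this commit message can be sketched roughly as follows. This is an illustrative toy, not the code in the commit: `OrderedConnection`, `ToyMsg`, `finishRestart`, and `flushPending` are hypothetical names, and the websocket is replaced by a plain `sent` array.

```typescript
// Toy sketch of "queue behind pending messages"; names are hypothetical.
type ToyMsg = { type: string; label: string };

class OrderedConnection {
  restarting = true;
  connected = true;
  readonly pending: ToyMsg[] = [];
  readonly sent: string[] = [];

  // `queue === false` marks a message that comes from the pending flush itself.
  send(msg: ToyMsg, queue = true): void {
    // Exception: while restarting, a kernel_info_request jumps the queue.
    if (this.restarting && msg.type === 'kernel_info_request' && this.connected) {
      this.sent.push(msg.label);
      return;
    }
    const canSend = this.connected && !this.restarting;
    if (canSend && (!queue || this.pending.length === 0)) {
      this.sent.push(msg.label);
    } else if (queue) {
      // Either we cannot send yet, or older messages are still waiting.
      this.pending.push(msg);
    } else {
      throw new Error('Could not send message');
    }
  }

  finishRestart(): void {
    this.restarting = false;
  }

  flushPending(): void {
    while (this.connected && !this.restarting && this.pending.length > 0) {
      this.send(this.pending.shift() as ToyMsg, false);
    }
  }
}

function simulateOrdered(): string[] {
  const conn = new OrderedConnection();
  conn.send({ type: 'execute_request', label: 'cell 1' });
  conn.send({ type: 'execute_request', label: 'cell 2' });
  conn.send({ type: 'kernel_info_request', label: 'kernel_info' });
  conn.finishRestart();
  conn.send({ type: 'execute_request', label: 'cell 3' }); // queued behind cell 1 and cell 2
  conn.flushPending();
  return conn.sent;
}
```

In this toy, `simulateOrdered()` yields `kernel_info`, `cell 1`, `cell 2`, `cell 3`: only the kernel info request jumps the queue, and the execute requests keep their submission order.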

Contributor Author

I tried the generated binder, and I could see once that all the cells were marked with * but not executed (and the kernel was idle), but couldn't reproduce while recording a GIF. Maybe you solved the out-of-order bug and this is another one?


I tried the generated binder, and I could see once that all the cells were marked with * but not executed (and the kernel was idle), but couldn't reproduce while recording a GIF. Maybe you solved the out-of-order bug and this is another one?

Let's take the review conversation over to that PR. I'll copy your comment over there.

echarles commented Jan 8, 2021

Hi, I have just read the various issues around this and wonder if the following can help

  1. I had opened jupyter-server/jupyter_server#247 (Push restarting status in case of Kernel Restart) while developing #8562 (Slow Terminating Kernels). If I remember correctly, I was not seeing any restarting status being pushed by the server. Pushing that status would help JupyterLab confirm that messages can be sent to the new kernel.
  2. I have seen a mix of queueing and direct use of send. What if all messages were queued, and the queue were the single place from which messages are sent? If we want to send a message directly, we just queue it and force a purge of the queue (assuming the connection status is connected)?

Thanks for looking at this @echarles.

Can we distinguish between what might be a good idea in the future, and what is needed now to fix this rather serious bug without making things worse? To me, it sounds like your (1) item may be something to address in the future, and (2) may be a suggestion for a change on this PR? Am I reading things how you intended?

That does seem elegant to always have a queue, and always work through that queue. Except, of course, for our queue-jumping kernel_info messages that we need right now to clear the restarting status sentinel. But I think the logic would look very similar to how it does now? Right now, if we send a message, basically the _sendMessage function deals with whether to put it on a queue, or to send immediately - that is transparent to the user. Even if we enqueued all messages (except those special kernel_info_request ones), somewhere we would need logic to flush the queue. I'd hate to put that on to the user to do as an extra step, so we'd have something somewhere keeping track of the queue state, and when a new message is added, determining whether we should flush the queue or not - which is essentially what _sendMessage is doing right now anyway. In other words, we'll need the same logic as we have now, and the api would also essentially look the same, especially to the user. Am I missing something?
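
To make the comparison concrete, the "always enqueue, then flush" idea discussed above might look something like the sketch below. It is only an illustration: `QueueOnlySender`, `setReady`, and `flush` are hypothetical names, the special kernel_info_request case is deliberately left out, and nothing here is taken from the JupyterLab source.

```typescript
// Toy sketch of a sender where the queue is the only path to the wire.
class QueueOnlySender {
  private readonly queue: string[] = [];
  private ready = false;
  readonly sent: string[] = [];

  // Every message goes onto the queue first; the caller never flushes manually.
  send(msg: string): void {
    this.queue.push(msg);
    this.flush();
  }

  // Called when the connection becomes usable (or stops being usable).
  setReady(ready: boolean): void {
    this.ready = ready;
    this.flush();
  }

  // Single place that decides whether queued messages may go out.
  private flush(): void {
    if (!this.ready) {
      return;
    }
    while (this.queue.length > 0) {
      this.sent.push(this.queue.shift() as string);
    }
  }
}
```

The decision of when to flush still lives inside the sender, in `send` and `setReady`, so the caller-facing API is the same single `send` call either way.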

echarles commented Jan 8, 2021

Can we distinguish between what might be a good idea in the future, and what is needed now to fix this rather serious bug without making things worse? To me, it sounds like your (1) item may be something to address in the future, and (2) may be a suggestion for a change on this PR? Am I reading things how you intended?

Yes, (1) is for longer term, (2) can be discussed in the scope of the needed hot fix.

.... Am I missing something?

I don't know, but I think I am missing something, as I have not spent enough time looking at all the conditions and the conversation history. I just had a feeling that a managed queue could help solve the issues. When I look at

    _sendMessage(message: Terminal.IMessage, queue = true): void {
      if (this._isDisposed || !message.content) {
        // (body not shown in the quoted excerpt)
      }
      if (this.connectionStatus === 'connected' && this._ws) {
        const msg = [message.type, ...message.content];
        // (send call not shown in the quoted excerpt)
      } else if (queue) {
        // (queue branch not shown in the quoted excerpt)
      } else {
        throw new Error(`Could not send message: ${JSON.stringify(message)}`);
      }
    }

I see that if the status is connected, the message is sent, whether the queue is empty or not. I would rather call _sendPending before sending the message.

I see that if the status is connected, the message is sent, whether the queue is empty or not. I would rather call _sendPending before sending the message.

Well, you have to be a bit more careful - _sendPending calls this function as the lowest-level sending function IIRC, so you have to guard against the queue flush sends getting cycled back onto the queue itself.

The other thing is that it wasn't clear this was a problem in kernels before the restart status logic was introduced. The connection status change triggered a queue flush, which IIRC was synchronous, which wouldn't let other messages jump the queue. If that is the case, it may not be a problem here either.


