Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Eventloop scheduling improvements for stop_on_error_timeout and schedule_next #1212

Merged
merged 9 commits into from
Feb 25, 2024

Conversation

jdranczewski
Copy link
Contributor

Fixes #1202 by introducing a timed exit to the Tk (on Linux) and Qt (on Windows and Linux) eventloops so schedule_stop_aborting can fire after stop_on_error_timeout in the kernel's main io_loop.

This only affects the eventloops specified, as the others periodically hand control back to the kernel, but Tk (on Linux) and Qt normally wait for a ZMQ socket event to go back to the kernel. The stop_on_error_timeout does not fire a socket event, so the timer does not expire and future events are aborted when they should not.

The proposed solution introduces an optional _schedule_exit method to eventloops that need it, and makes sure it is called if an eventloop exit is needed for the stop_on_error_timeout to fire.

See issue ipython#1202 - this implements the base logic, eventloops need to implement `_schedule_exit` themselves.
See issue ipython#1202, allows for returning to the kernel io_loop on a timeout
The QTimer was introduced in d4755f6 and then changed is issue ipython#990 to close a memory leak.
This change makes the code more concise by removing not instancing a QTimer object but calling the static singleShot method instead
See issue ipython#1202 for _schedule_exit reasoning.
Replaced `stream` as an argument with a direct reference to `kernel.shell_stream`, simplifying code.
The reason for `stream` being an argument was support for multiple `kernel.shell_streams`, which has been deprecated since, see e719892
The default QTimer is coarse, so can fire within +-5% of the time specified.
We _need_ to fire after the delay specified so that we exit _after_ schedule_stop_aborting is scheduled.
The QTimer thus needs to be a PreciseTimer.
For small values of stop_on_error_timeout I found the QTimer _still_ fired up to 5ms too early, so added 10ms of delay offset.
singleShot signature is inconsistent between PySide and PyQt, so we store a timer object, similar to ipython#990
Copy link
Member

@blink1073 blink1073 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thank you!

@blink1073
Copy link
Member

@ccordoba12 did you want to look at this as well?

@ccordoba12
Copy link
Member

Yes, thanks for the ping @blink1073! I'll run our test suite against this PR and report back any issues in a couple of days.

If the qt loop is disabled through `%gui` and then enabled again, _qt_notifier and _qt_timer stay as references to dead Qt objects, causing an error
@ccordoba12
Copy link
Member

@blink1073, just so you know, @jdranczewski is helping us to try to solve an additional issue in Spyder (referenced above), so this could take a bit longer than expected.

@blink1073
Copy link
Member

Understood, thank you both!

@jdranczewski
Copy link
Contributor Author

Hi @blink1073, while investigating spyder-ide/spyder#21299 with @ccordoba12 I found that it's not caused by #1202 but likely a slightly different issue with how ipykernel schedules the next eventloop run.

schedule_next in kernelbase.py puts advance_eventloop on an io_loop call_later, which means it will be called soon, but usually after the msg_queue is dealt with. I've found that if the ZMQ socket event for a dispatch_shell arrives before we enter the eventloop, there is sometimes a race condition between the advance_eventloop call and process_one running the call to dispatch shell.

In a situation like this, schedule_dispatch is called, and the dispatch_shell call is put on the msg_queue. Normally advance_eventloop is supposed to guard against entering the eventloop if there are any unprocessed messages: https://github.com/ipython/ipykernel/blob/v6.29.2/ipykernel/kernelbase.py#L480-L484 Unfortunately, since there is already a getter registered for the queue in process_one, the event is immediately consumed, the queue size is reported as zero, and the eventloop is entered before the dispatch_shell can fire. This results in the kernel apparently hanging, as it has entered the eventloop and it won't process the shell request until the next ZMQ event is received.

I have not been able to meaningfully reproduce this in Jupyter, so this race condition may be fairly rare, but it's possible to reproduce in Spyder by spamming the run button until the request arrives at just the right time to trigger this situation, and for some users it apparently arises naturally. A fix I could suggest is putting the call to advance the eventloop on the message queue, so it's for sure only called once any dispatch events are processed: jdranczewski@30da44a What do you think of doing it this way? Do you think it should be a separate PR (as it solves a different problem than stop_on_error_timeout not working right), or would it fall under the umbrella of this one as 'eventloop scheduling'?

@blink1073
Copy link
Member

I'd say if it is easier for you to combine them in this PR, that is fine.

This ensures that the msg_queue is truly empty before entering the eventloop, fixing a possible race condition where process_one has consumed a dispatch but not executed it yet.
See spyder-ide/spyder#21299 and ipython#1212
@jdranczewski
Copy link
Contributor Author

Ok, in that case I think it will be easier to just continue with this PR.

@ccordoba12 I've pushed the fix to this branch if you would like to run the Spyder test suite against it.

I'm not sure why the linting test started failing, I have not touched the line it's now failing at...

@jdranczewski jdranczewski changed the title Schedule eventloop exits for stop_on_error_timeout Eventloop scheduling improvements for stop_on_error_timeout and schedule_next Feb 24, 2024
@blink1073
Copy link
Member

The lint failure is unrelated, from a typings change in ipython.

@ccordoba12
Copy link
Member

Thanks for the update @jdranczewski! I opened spyder-ide/spyder#21834 to test your work on our side.

@ccordoba12
Copy link
Member

@blink1073, good news! Our tests are passing without issues, so this should be ready to be merged.

Also, we'd appreciate if you could release a new IPykernel version with this fix so we can depend on it in our next Spyder version. Thanks!

@blink1073
Copy link
Member

Excellent! I'll make a bug release tomorrow.

@blink1073 blink1073 merged commit de2221c into ipython:main Feb 25, 2024
31 of 33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Qt eventloop interferes with asyncio.call_later in _abort_queues
3 participants