Send the shell reply through the ZMQStream instead of raw on its socket #1529
BoykoNeov wants to merge 3 commits into
The ZMQ_FD edge-drain fix is now an upstream PR (ipython/ipykernel#1529, filed under BoykoNeov, 3 files +21/-0, validated 0/20 vs 6/20). The doc now points to the PR and the applicable patch; steel-sim's retry mitigation is unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The failing checks here are pre-existing flakiness on the Windows runners, not this patch — each is on a code path these three files don't touch:
Everything that actually exercises the change is green: all CPython Windows builds (which run
@BoykoNeov not sure if it is relevant, but in our case (microsoft/vscode-jupyter#17228) the OS is Ubuntu. I hope the fix will work across multiple OSes.
I've confirmed the flaky tests passed the second time around.
A regression test is important. Is there a simpler solution of the
On Windows, ipykernel 7 intermittently drops an execute_request on the shell channel: the kernel goes idle and never replies, and the client times out waiting for execute_reply (historically ~30% of headless notebook runs in our measurements; which cell hangs wanders run to run).

Root cause: the shell ROUTER socket is dual-use on the shell-channel thread. A ZMQStream reads execute_requests off it, while replies are sent back over the SAME socket out-of-band via a raw send_multipart in SubshellManager._send_on_shell_channel. That out-of-band send drains the socket's edge-triggered ZMQ_FD read edge (a documented libzmq corollary: after zmq_send the socket may become readable without a new edge). Because the send is not ZMQStream-mediated, the stream is never re-armed and a request that arrived concurrently strands unread on a registered-but-non-readable fd. The strand is terminal: no later arrival re-edges it.

Fix: send the reply through shell_stream.send_multipart rather than raw on shell_socket. This keeps the ZMQStream the sole user of the socket: the send is serviced by the stream's own _handle_events, which recvs any concurrently-queued request first and then re-arms POLLIN via _rebuild_io_state, so the request cannot strand. It removes the root cause rather than re-arming after it, and needs no reach into the private _handle_events.

The shell_stream (built in kernelapp.init_kernel) is threaded through ShellChannelThread into SubshellManager so the reply path can reach it, and falls back to the raw socket if no stream was threaded in.

Verified by the regression test added next, which reproduces the strand precondition deterministically and asserts the queued request is still delivered once the reply is routed through the stream.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Deterministically reproduces the strand precondition (a request queued on the dual-use shell ROUTER with its edge-triggered ZMQ_FD read edge already consumed), then performs the out-of-band reply send through the real SubshellManager._send_on_shell_channel path and asserts the queued request is still delivered to on_recv.

Behavioural, not implementation-coupled: it passes whether the reply send re-arms the stream explicitly or is routed through it, and fails (times out) against a raw-send body. Deterministic (no timing race). Verified on Windows; the edge-trigger behaviour it relies on is general.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
c53bfe6 to
78782b4
Compare
for more information, see https://pre-commit.ci
Thanks @ianthomas23, and good that the re-run came back green. Agreed, and I think routing through
The test is deterministic rather than statistical: it sets up the strand by hand (queue a request on the ROUTER, consume the
@practicusai: the mechanism is generic libzmq edge-triggering rather than Windows-specific, so this should help the Linux case too.
Dear @BoykoNeov, we're seeing some issues with flaky tests on Windows that seem to be somewhat related to this subject. Would you give us some help here, please? 🙏
Summary
On Windows, ipykernel 7 intermittently drops an execute_request on the shell channel: the kernel goes idle (0% CPU) and never replies, and the client times out waiting for execute_reply. In a headless notebook smoke-test we measured this at ~30% of runs; which cell hangs wanders run to run (it is content-innocent), and only a kernel restart recovers. This looks like the same user-visible hang tracked downstream in microsoft/vscode-jupyter#17228.

Forcing WindowsSelectorEventLoopPolicy does not help: the kernel already runs a selector loop.

Root cause
The shell ROUTER socket is dual-use on the shell-channel thread: a ZMQStream reads execute_requests off it (init_kernel builds ZMQStream(self.shell_socket, …), delivered via on_recv), while replies are sent back over the same socket by a raw send_multipart that bypasses the stream, in SubshellManager._send_on_shell_channel.

That out-of-band send drains the socket's edge-triggered ZMQ_FD read edge: the documented libzmq corollary that after zmq_send a socket may become readable without producing a new read event. Because the send is not ZMQStream-mediated, the read side is never re-armed, so an execute_request that arrived concurrently strands unread on a registered-but-non-readable fd. The strand is terminal: while a backlog is already pending there is no 0 → nonzero EVENTS transition, so no later arrival re-edges it, and the kernel sits idle.

Minimal raw-pyzmq reproduction of just the send-drains-read-edge step (no Jupyter), on a ROUTER (the shell socket type):
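Independently of pyzmq, the stranding logic can be illustrated with a dependency-free toy model. This is only a sketch of the reasoning in this section, not libzmq or ipykernel code: ToySocket, ToyStream, simulate and the three mode names are hypothetical, and the assumption that a send clears the pending read edge is hard-coded rather than observed.

```python
from collections import deque


class ToySocket:
    """Toy stand-in for a socket whose fd is edge-triggered (not libzmq)."""

    def __init__(self):
        self.inbox = deque()
        self.sent = []
        self.fd_signalled = False  # what an event loop watching the fd sees

    def deliver(self, msg):
        # A peer's message raises the edge only on the empty -> non-empty
        # transition; arrivals behind a backlog raise nothing new.
        if not self.inbox:
            self.fd_signalled = True
        self.inbox.append(msg)

    def raw_send(self, msg):
        # Modelled assumption: a send consumes the pending edge as a side effect.
        self.sent.append(msg)
        self.fd_signalled = False

    def recv(self):
        self.fd_signalled = False
        return self.inbox.popleft()


class ToyStream:
    """Toy stand-in for a stream that owns the read side of the socket."""

    def __init__(self, sock, on_recv):
        self.sock = sock
        self.on_recv = on_recv
        self.outbox = deque()

    def loop_tick(self):
        # The event loop only wakes the stream when the fd shows an edge.
        if self.sock.fd_signalled:
            self.handle_events()

    def handle_events(self):
        # Drain pending reads first, then flush queued sends.
        while self.sock.inbox:
            self.on_recv(self.sock.recv())
        while self.outbox:
            self.sock.raw_send(self.outbox.popleft())

    def send(self, msg):
        # Stream-mediated send: queue it and let handle_events service it.
        self.outbox.append(msg)
        self.handle_events()


def simulate(mode):
    """Return the requests the handler received under the given send mode."""
    delivered = []
    sock = ToySocket()
    stream = ToyStream(sock, delivered.append)
    sock.deliver("execute_request")  # arrives just before the reply is sent
    if mode == "raw":
        sock.raw_send("execute_reply")  # bypasses the stream
    elif mode == "raw_then_rearm":
        sock.raw_send("execute_reply")
        stream.handle_events()  # explicit re-check after the raw send
    elif mode == "via_stream":
        stream.send("execute_reply")
    else:
        raise ValueError(f"unknown mode: {mode}")
    stream.loop_tick()
    sock.deliver("later_request")  # queued behind any existing backlog
    stream.loop_tick()
    return delivered


if __name__ == "__main__":
    for mode in ("raw", "raw_then_rearm", "via_stream"):
        print(mode, simulate(mode))
```

In this model the raw mode delivers nothing, because the first request loses its edge and the later one arrives behind a backlog, while both mediated modes deliver both requests. It says nothing about how often the race occurs on a real kernel.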
Fix
After each out-of-band reply send, schedule the shell ZMQStream's read handler on the shell-channel loop, the same edge-trap reschedule ZMQStream._update_handler already runs internally:

The shell_stream (built in init_kernel) is threaded through ShellChannelThread into SubshellManager so the reply path can reach it; the re-arm is guarded by stream is not None and stream.socket is self._shell_socket. No polling and no new dependency: it moves the existing edge-trap reschedule to the one un-mediated operation that drains the edge.

Validation
Windows, Python 3.13 and 3.14, pyzmq 27.1.0 / libzmq 4.3.5; three arms × 20 real notebook runs, same machine/session:

send_multipart
add_callback(lambda: None) (no re-arm)
add_callback(lambda: stream._handle_events(stream.socket, 0))

P(0/20 | p=0.30) ≈ 0.70^20 ≈ 8e-4. The sham arm isolates the re-arm itself from the added wake-ups (it stays at the control rate). With the diff applied the threaded reference was live on every send (551 re-arms, 0 None/mismatch). Validated against 7.2.0 / 7.3.0, where _send_on_shell_channel is byte-identical to main.

A public-API alternative
stream.flush(zmq.POLLIN), a public ZMQStream method ipykernel already calls in kernelbase.py, may be preferable to reaching into _handle_events. flush is a synchronous drain loop whereas _handle_events(socket, 0) is the edge-trap reschedule (related but not identical). I only measured _handle_events; happy to switch to flush if you'd rather stay on the public API.

Notes
main: _send_on_shell_channel is a bare send_multipart.
ZMQ_FD edge-trigger bug class), getsockopt(zmq.EVENTS) drains signaler, can cause zmq.asyncio recv to miss wakeups zeromq/pyzmq#2173 (same signaler-drain family, different layer).
CellTimeoutError signature: works today and is orthogonal to this fix.

🤖 Generated with Claude Code
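
The figure quoted under Validation, P(0/20 | p=0.30) ≈ 0.70^20 ≈ 8e-4, can be rechecked with a few lines of arithmetic. This is a minimal sketch that assumes the 20 runs are independent and each hangs at the control rate of 0.30; the helper name prob_no_failures is hypothetical.

```python
def prob_no_failures(runs: int, failure_rate: float) -> float:
    """Chance of seeing zero failures in `runs` independent trials."""
    return (1.0 - failure_rate) ** runs


# 20 notebook runs at the ~30% control hang rate quoted in the description.
p_zero = prob_no_failures(20, 0.30)

if __name__ == "__main__":
    print(f"P(0/20 | p=0.30) = {p_zero:.1e}")
```

Under those assumptions, a clean 0/20 is unlikely to be chance if the underlying hang rate were unchanged.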