
Send the shell reply through the ZMQStream instead of raw on its socket #1529

Open
BoykoNeov wants to merge 3 commits into ipython:main from BoykoNeov:fix/shell-zmqstream-rearm-after-reply-send


@BoykoNeov


Summary

On Windows, ipykernel 7 intermittently drops an execute_request on the shell channel: the kernel goes idle (0% CPU) and never replies, and the client times out waiting for execute_reply. In a headless notebook smoke-test we measured this at ~30% of runs; the cell that hangs varies from run to run (the hang does not depend on cell content), and only a kernel restart recovers. This looks like the same user-visible hang tracked downstream in microsoft/vscode-jupyter#17228.

Forcing WindowsSelectorEventLoopPolicy does not help — the kernel already runs a selector loop.

Root cause

The shell ROUTER socket is dual-use on the shell-channel thread:

  • a ZMQStream reads execute_requests off it (init_kernel builds ZMQStream(self.shell_socket, …), delivered via on_recv), while
  • replies go back over the same socket out-of-band through a raw send_multipart that bypasses the stream, in SubshellManager._send_on_shell_channel:
def _send_on_shell_channel(self, msg) -> None:
    assert current_thread().name == SHELL_CHANNEL_THREAD_NAME
    self._shell_socket.send_multipart(msg)   # raw send on the ZMQStream's own socket

That out-of-band send drains the socket's edge-triggered ZMQ_FD read edge — the documented libzmq corollary that after zmq_send a socket may become readable without producing a new read event. Because the send is not ZMQStream-mediated, the read side is never re-armed, so an execute_request that arrived concurrently strands unread on a registered-but-non-readable fd. The strand is terminal: while a backlog is already pending there is no 0 → nonzero EVENTS transition, so no later arrival re-edges it, and the kernel sits idle.

Minimal raw-pyzmq reproduction of just the send-drains-read-edge step (no Jupyter), on a ROUTER (the shell socket type):

import select, time, zmq
def readable(fd): return bool(select.select([fd], [], [], 0)[0])

ctx = zmq.Context()
a = ctx.socket(zmq.DEALER); a.setsockopt(zmq.IDENTITY, b"A")
b = ctx.socket(zmq.ROUTER)
b.bind("tcp://127.0.0.1:5704"); a.connect("tcp://127.0.0.1:5704"); time.sleep(0.3)
bfd = b.getsockopt(zmq.FD)

a.send(b"hello"); time.sleep(0.2)                       # warmup: learn route 'A', clear setup edges
assert b.recv_multipart() == [b"A", b"hello"]
b.getsockopt(zmq.EVENTS)
assert not readable(bfd) and not (b.getsockopt(zmq.EVENTS) & zmq.POLLIN)

a.send(b"req1"); time.sleep(0.2)                        # a new request arrives, UNREAD
assert readable(bfd)                                    # read edge is set

b.send_multipart([b"A", b"reply"]); time.sleep(0.05)    # OUT-OF-BAND send on the same ROUTER
assert not readable(bfd)                                # <-- the send DRAINED the read edge
assert b.getsockopt(zmq.EVENTS) & zmq.POLLIN            # ...while POLLIN stays set...
assert b.recv_multipart(zmq.NOBLOCK) == [b"A", b"req1"] # ...and the request was still queued
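
The reproduction above covers the send-drains-read-edge step; the claim that the strand is terminal can be illustrated with a toy model. This is a pure-Python sketch under stated assumptions, not libzmq's implementation: the class ToyEdgeSocket and all of its fields are hypothetical, and it only encodes the two rules described in the root cause (a peer signals the fd only on an empty-to-non-empty transition, and any socket operation consumes the pending signal).

```python
from collections import deque


class ToyEdgeSocket:
    """Toy model of an edge-triggered notification fd (NOT libzmq itself)."""

    def __init__(self):
        self.queue = deque()       # messages delivered but not yet received
        self.fd_readable = False   # what select() on the notification fd would say
        self.reader_asleep = True  # peers signal the fd only while this is True

    def deliver(self, msg):
        """A peer queues a message; it signals only if the reader saw 'empty'."""
        self.queue.append(msg)
        if self.reader_asleep:
            self.fd_readable = True
            self.reader_asleep = False

    def _process_commands(self):
        # Assumption of the model: every socket operation eats the signal.
        self.fd_readable = False

    def _has_input(self):
        if self.queue:
            return True
        self.reader_asleep = True  # observed empty: the next delivery re-edges
        return False

    def send(self, msg):
        self._process_commands()   # side effect: the read edge is drained

    def events_pollin(self):
        self._process_commands()
        return self._has_input()

    def recv(self):
        self._process_commands()
        return self.queue.popleft() if self._has_input() else None


# Strand: a request arrives, then a raw send consumes its notification.
strand = ToyEdgeSocket()
strand.deliver("req1")
signalled_on_arrival = strand.fd_readable
strand.send("reply")
strand.deliver("req2")  # backlog already pending, so no new edge is raised
stranded = (not strand.fd_readable) and len(strand.queue) == 2

# Remedy: after the send, consult the event state rather than the fd, and drain.
fixed = ToyEdgeSocket()
fixed.deliver("req1")
fixed.send("reply")
handled = []
while fixed.events_pollin():
    handled.append(fixed.recv())
fixed.deliver("req2")
rearmed = fixed.fd_readable  # the fd signals again once the queue was seen empty
```

In the model, a loop that only waits on the fd never wakes in the first scenario, no matter how many further requests arrive, while checking the event state after the send both delivers the queued request and restores edge notification.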

Fix

After each out-of-band reply send, schedule the shell ZMQStream's read handler on the shell-channel loop — the same edge-trap reschedule ZMQStream._update_handler already runs internally:

self._shell_channel_io_loop.add_callback(
    lambda: stream._handle_events(stream.socket, 0)
)

The shell_stream (built in init_kernel) is threaded through ShellChannelThread into SubshellManager so the reply path can reach it; the re-arm is guarded by stream is not None and stream.socket is self._shell_socket. No polling and no new dependency — it moves the existing edge-trap reschedule to the one un-mediated operation that drains the edge.
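
A minimal sketch of the guarded re-arm described above, assuming duck-typed stand-ins rather than the real pyzmq and tornado objects. The function name send_then_rearm and the classes FakeSocket, FakeStream and RecordingLoop are hypothetical and exist only to make the control flow runnable; only the add_callback body and the guard condition come from the description.

```python
class FakeSocket:
    """Hypothetical stand-in for the shell ROUTER socket."""

    def __init__(self):
        self.sent = []

    def send_multipart(self, msg):
        self.sent.append(msg)


class FakeStream:
    """Hypothetical stand-in for the shell ZMQStream."""

    def __init__(self, socket):
        self.socket = socket
        self.handled = []

    def _handle_events(self, socket, events):
        self.handled.append(events)


class RecordingLoop:
    """Hypothetical stand-in for the shell-channel IOLoop."""

    def __init__(self):
        self.callbacks = []

    def add_callback(self, fn):
        self.callbacks.append(fn)

    def run_pending(self):
        pending, self.callbacks = self.callbacks, []
        for fn in pending:
            fn()


def send_then_rearm(shell_socket, stream, io_loop, msg):
    """Raw send, then schedule the stream's read handler if the guard holds."""
    shell_socket.send_multipart(msg)
    # Guard from the description: only re-arm a stream that wraps this socket.
    if stream is not None and stream.socket is shell_socket:
        io_loop.add_callback(lambda: stream._handle_events(stream.socket, 0))


sock = FakeSocket()
stream = FakeStream(sock)
loop = RecordingLoop()
send_then_rearm(sock, stream, loop, [b"A", b"reply"])
loop.run_pending()

# Guard failures: no stream threaded in, or a stream over a different socket.
loop_no_stream = RecordingLoop()
send_then_rearm(sock, None, loop_no_stream, [b"A", b"reply"])
loop_mismatch = RecordingLoop()
send_then_rearm(sock, FakeStream(FakeSocket()), loop_mismatch, [b"A", b"reply"])
```

The re-arm is deferred through the loop rather than called inline, so the read handler runs on the loop's next turn instead of re-entering in the middle of the reply path.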

Validation

Windows, Python 3.13 and 3.14, pyzmq 27.1.0 / libzmq 4.3.5; three arms × 20 real notebook runs, same machine/session:

| Arm | After the reply send_multipart | Wedged |
|---|---|---|
| control | none | 6/20 (30%) |
| sham | identical scheduling overhead, add_callback(lambda: None) (no re-arm) | 5/20 |
| fix | add_callback(lambda: stream._handle_events(stream.socket, 0)) | 0/20 |

P(0/20 | p=0.30) ≈ 0.70^20 ≈ 8e-4. The sham arm isolates the re-arm itself from the added wake-ups (it stays at the control rate). With the diff applied the threaded reference was live on every send (551 re-arms, 0 None/mismatch). Validated against 7.2.0 / 7.3.0, where _send_on_shell_channel is byte-identical to main.
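
The probability above can be checked with a few lines of arithmetic. This sketch assumes independent runs with a fixed per-run wedge rate; the helper name prob_zero_wedges is hypothetical, and the second call (using the sham arm's 5/20 rate) is an added sensitivity check, not a figure from the measurements.

```python
def prob_zero_wedges(runs: int, wedge_rate: float) -> float:
    """Chance of seeing no wedge in `runs` independent runs at `wedge_rate`."""
    return (1.0 - wedge_rate) ** runs


# Control-arm rate, 6/20 = 0.30: the figure quoted in the text.
p_at_control_rate = prob_zero_wedges(20, 0.30)

# Sensitivity check at the sham arm's observed rate, 5/20 = 0.25.
p_at_sham_rate = prob_zero_wedges(20, 0.25)

print(f"{p_at_control_rate:.1e}")  # 8.0e-04
```

Even at the lower sham-arm rate, 20 clean runs in a row would be unlikely if the fix had no effect.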

A public-API alternative

stream.flush(zmq.POLLIN) — a public ZMQStream method ipykernel already calls in kernelbase.py — may be preferable to reaching into _handle_events. flush is a synchronous drain loop whereas _handle_events(socket, 0) is the edge-trap reschedule (related but not identical). I only measured _handle_events; happy to switch to flush if you'd rather stay on the public API.

Notes

🤖 Generated with Claude Code

BoykoNeov added a commit to BoykoNeov/steel-sim that referenced this pull request Jun 14, 2026
The ZMQ_FD edge-drain fix is now an upstream PR (ipython/ipykernel#1529,
filed under BoykoNeov, 3 files +21/-0, validated 0/20 vs 6/20). The doc now
points to the PR and the applicable patch; steel-sim's retry mitigation is
unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@BoykoNeov

Author

The failing checks here are pre-existing flakiness on the Windows runners, not this patch — each is on a code path these three files don't touch:

  • build (windows-latest, qt5, 3.10): tests/inprocess/test_kernel.py::test_pylab timed out while matplotlib built its font cache on a cold runner. That's the in-process kernel (no ZMQ socket, no SubshellManager, no kernelapp), so the shell-channel change can't reach it.
  • build (windows-latest, qt5, pypy-3.11): RuntimeError: Kernel died before replying to kernel_info, a kernel startup crash that occurs before any execute_request, i.e. before the re-arm in _send_on_shell_channel ever runs. The same pypy-3.11 job passes under qt6 and fails under qt5, and main fails this same job identically (same error; 141 passed / 42 skipped, attributed to test_async_interrupt), so it isn't introduced here.
  • enforce-label — needs a maintainer label.

Everything that actually exercises the change is green: all CPython Windows builds (which run tests/test_subshells.py under the patch), the downstream-project tests, and lint. Happy to rebase or re-run if that's useful.

@practicusai


@BoykoNeov not sure it is relevant, but in our case of microsoft/vscode-jupyter#17228, the OS is Ubuntu. I hope the fix will work for multiple OSes.

@ianthomas23

Collaborator

I've confirmed the flaky tests passed the second time around.

  • Happy to add a regression test or open a companion tracking issue if you'd prefer.

A regression test is important.

Is there a simpler solution in which _send_on_shell_channel calls send_multipart on the shell_stream rather than on the shell_socket?

BoykoNeov and others added 2 commits June 16, 2026 14:46
On Windows, ipykernel 7 intermittently drops an execute_request on the
shell channel: the kernel goes idle and never replies, and the client
times out waiting for execute_reply (historically ~30% of headless
notebook runs in our measurements; which cell hangs wanders run to run).

Root cause: the shell ROUTER socket is dual-use on the shell-channel
thread. A ZMQStream reads execute_requests off it, while replies are sent
back over the SAME socket out-of-band via a raw send_multipart in
SubshellManager._send_on_shell_channel. That out-of-band send drains the
socket's edge-triggered ZMQ_FD read edge (a documented libzmq corollary:
after zmq_send the socket may become readable without a new edge). Because
the send is not ZMQStream-mediated, the stream is never re-armed and a
request that arrived concurrently strands unread on a registered-but-
non-readable fd. The strand is terminal: no later arrival re-edges it.

Fix: send the reply through shell_stream.send_multipart rather than raw on
shell_socket. This keeps the ZMQStream the sole user of the socket: the
send is serviced by the stream's own _handle_events, which recvs any
concurrently-queued request first and then re-arms POLLIN via
_rebuild_io_state, so the request cannot strand. It removes the root cause
rather than re-arming after it, and needs no reach into the private
_handle_events. The shell_stream (built in kernelapp.init_kernel) is
threaded through ShellChannelThread into SubshellManager so the reply path
can reach it, and falls back to the raw socket if no stream was threaded in.

Verified by the regression test added next, which reproduces the strand
precondition deterministically and asserts the queued request is still
delivered once the reply is routed through the stream.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Deterministically reproduces the strand precondition (a request queued on the
dual-use shell ROUTER with its edge-triggered ZMQ_FD read edge already consumed)
then performs the out-of-band reply send through the real
SubshellManager._send_on_shell_channel path and asserts the queued request is
still delivered to on_recv. Behavioural, not implementation-coupled: it passes
whether the reply send re-arms the stream explicitly or is routed through it,
and fails (times out) against a raw-send body. Deterministic (no timing race);
verified on Windows, though the edge-trigger behaviour it relies on is general.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
BoykoNeov force-pushed the fix/shell-zmqstream-rearm-after-reply-send branch from c53bfe6 to 78782b4 June 16, 2026 11:53
@BoykoNeov

Author

Thanks @ianthomas23, and good to hear the re-run came back green.

Agreed, and I think routing through shell_stream is the better fix. The raw shell_socket.send_multipart is the one op on the dual-use ROUTER that the stream never sees, so it drains the edge-triggered ZMQ_FD read edge and nothing re-arms it. Sending through shell_stream.send_multipart keeps the stream the sole user of the socket — its own _handle_events recvs any concurrently-queued request first, then re-arms POLLIN via _rebuild_io_state. So it kills the root cause instead of re-arming after it, and drops the poke into the private _handle_events. _send_on_shell_channel already runs on the shell-channel thread that owns the stream's loop, so no thread-affinity worries. I've pushed that — the PR is now the stream-send fix plus a regression test.
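
A minimal sketch of the stream-send shape described above, including the raw-socket fallback mentioned in the commit message. It assumes duck-typed objects that expose send_multipart; the function name send_on_shell_channel and the Recorder class are hypothetical stand-ins so the example runs without pyzmq, and the real method also asserts that it is on the shell-channel thread.

```python
class Recorder:
    """Hypothetical stand-in for either a ZMQStream or a raw zmq socket."""

    def __init__(self, name):
        self.name = name
        self.sent = []

    def send_multipart(self, msg):
        self.sent.append(msg)


def send_on_shell_channel(shell_socket, shell_stream, msg):
    """Prefer the stream so it stays the sole user of the socket."""
    if shell_stream is not None:
        # The stream queues the message and sends it from its own event handler.
        shell_stream.send_multipart(msg)
    else:
        # Fallback when no stream was threaded in: raw send on the socket.
        shell_socket.send_multipart(msg)


raw = Recorder("shell_socket")
stream = Recorder("shell_stream")
reply = [b"A", b"reply"]

send_on_shell_channel(raw, stream, reply)  # routed through the stream
send_on_shell_channel(raw, None, reply)    # fallback path
```

Keeping the fallback means callers that construct the manager without a stream keep the previous behaviour, at the cost of the raw path still being exposed to the edge-drain problem.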

The test is deterministic rather than statistical: it sets up the strand by hand (queue a request on the ROUTER, consume the ZMQ_FD read edge so it can't re-fire), then does the out-of-band reply send and asserts the queued request still lands in on_recv. It's behavioural, so it passes whether the reply re-arms the stream or routes through it, and times out on a raw-send body. Verified on Windows; the edge-trigger behaviour it leans on is general, so CI should cover the other platforms.

@practicusai — the mechanism is generic libzmq edge-triggering rather than Windows-specific, so this should help the Linux case too.

BoykoNeov changed the title from "Re-arm the shell ZMQStream read after the out-of-band reply send (fix intermittent dropped execute_request on Windows)" to "Send the shell reply through the ZMQStream instead of raw on its socket" Jun 16, 2026
@ZupoLlask


Dear @BoykoNeov,

We're seeing some issues with flaky tests on Windows that seem to be related to this subject.

Would you give us some help here, please? 🙏
