Send the shell reply through the ZMQStream instead of raw on its socket by BoykoNeov · Pull Request #1529 · ipython/ipykernel

BoykoNeov · 2026-06-14T14:52:03Z

Summary

On Windows, ipykernel 7 intermittently drops an execute_request on the shell channel: the kernel goes idle (0% CPU) and never replies, and the client times out waiting for execute_reply. In a headless notebook smoke test we measured this in ~30% of runs; the cell that hangs varies from run to run (the hang does not depend on cell content), and only a kernel restart recovers the session. This looks like the same user-visible hang tracked downstream in microsoft/vscode-jupyter#17228.

Forcing WindowsSelectorEventLoopPolicy does not help — the kernel already runs a selector loop.

Root cause

The shell ROUTER socket is dual-use on the shell-channel thread:

a ZMQStream reads execute_requests off it (init_kernel builds ZMQStream(self.shell_socket, …), delivered via on_recv), while
replies go back over the same socket out-of-band through a raw send_multipart that bypasses the stream, in SubshellManager._send_on_shell_channel:

def _send_on_shell_channel(self, msg) -> None:
    assert current_thread().name == SHELL_CHANNEL_THREAD_NAME
    self._shell_socket.send_multipart(msg)   # raw send on the ZMQStream's own socket

That out-of-band send drains the socket's edge-triggered ZMQ_FD read edge — the documented libzmq corollary that after zmq_send a socket may become readable without producing a new read event. Because the send is not ZMQStream-mediated, the read side is never re-armed, so an execute_request that arrived concurrently strands unread on a registered-but-non-readable fd. The strand is terminal: while a backlog is already pending there is no 0 → nonzero EVENTS transition, so no later arrival re-edges it, and the kernel sits idle.

Minimal raw-pyzmq reproduction of just the send-drains-read-edge step (no Jupyter), on a ROUTER (the shell socket type):

import select, time, zmq
def readable(fd): return bool(select.select([fd], [], [], 0)[0])

ctx = zmq.Context()
a = ctx.socket(zmq.DEALER); a.setsockopt(zmq.IDENTITY, b"A")
b = ctx.socket(zmq.ROUTER)
b.bind("tcp://127.0.0.1:5704"); a.connect("tcp://127.0.0.1:5704"); time.sleep(0.3)
bfd = b.getsockopt(zmq.FD)

a.send(b"hello"); time.sleep(0.2)                       # warmup: learn route 'A', clear setup edges
assert b.recv_multipart() == [b"A", b"hello"]
b.getsockopt(zmq.EVENTS)
assert not readable(bfd) and not (b.getsockopt(zmq.EVENTS) & zmq.POLLIN)

a.send(b"req1"); time.sleep(0.2)                        # a new request arrives, UNREAD
assert readable(bfd)                                    # read edge is set

b.send_multipart([b"A", b"reply"]); time.sleep(0.05)    # OUT-OF-BAND send on the same ROUTER
assert not readable(bfd)                                # <-- the send DRAINED the read edge
assert b.getsockopt(zmq.EVENTS) & zmq.POLLIN            # ...while POLLIN stays set...
assert b.recv_multipart(zmq.NOBLOCK) == [b"A", b"req1"] # ...and the request was still queued
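
For readers without pyzmq at hand, the same sequence can be followed in a dependency-free toy model. This is only a sketch of the mechanism as described above, not libzmq: `ToySocket`, `fd_readable`, `has_pollin` and both reader functions are hypothetical names, and the rule that only an empty-to-non-empty transition raises the wakeup flag is an assumption taken from the root-cause description.

```python
from collections import deque


class ToySocket:
    """Toy stand-in for a socket with an edge-style wakeup flag (NOT libzmq).

    `fd_readable` loosely plays the role of the ZMQ_FD signal and
    `has_pollin()` the role of getsockopt(EVENTS) & POLLIN.
    """

    def __init__(self):
        self.inbox = deque()
        self.fd_readable = False

    def deliver(self, msg):
        # Assumption from the description: only an empty -> non-empty
        # transition raises the wakeup flag.
        if not self.inbox:
            self.fd_readable = True
        self.inbox.append(msg)

    def has_pollin(self):
        return bool(self.inbox)

    def send(self, msg):
        # Modelled side effect: a send consumes the pending wakeup.
        self.fd_readable = False

    def recv(self):
        self.fd_readable = False
        return self.inbox.popleft()


def edge_driven_reader(sock, handled):
    """Reads only when the wakeup flag is set, like an fd-driven loop."""
    if sock.fd_readable:
        while sock.has_pollin():
            handled.append(sock.recv())


def state_checking_reader(sock, handled):
    """Reads whenever messages are queued, ignoring the wakeup flag."""
    while sock.has_pollin():
        handled.append(sock.recv())


def run(reader_after_send):
    sock, handled = ToySocket(), []
    sock.deliver(b"req1")   # request arrives, wakeup flag raised
    sock.send(b"reply")     # out-of-band send consumes the flag
    sock.deliver(b"req2")   # backlog already pending: no new edge
    reader_after_send(sock, handled)
    return handled


stranded = run(edge_driven_reader)
recovered = run(state_checking_reader)
print(stranded)   # []
print(recovered)  # [b'req1', b'req2']
```

In this model the edge-driven reader never sees either request once the send has consumed the flag, while a reader that checks queue state after the send drains both.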

Fix

After each out-of-band reply send, schedule the shell ZMQStream's read handler on the shell-channel loop — the same edge-trap reschedule ZMQStream._update_handler already runs internally:

self._shell_channel_io_loop.add_callback(
    lambda: stream._handle_events(stream.socket, 0)
)

The shell_stream (built in init_kernel) is threaded through ShellChannelThread into SubshellManager so the reply path can reach it; the re-arm is guarded by stream is not None and stream.socket is self._shell_socket. No polling and no new dependency — it moves the existing edge-trap reschedule to the one un-mediated operation that drains the edge.
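
A minimal sketch of the guard described above, using stand-in objects so it runs without pyzmq or tornado. `FakeLoop`, `FakeStream`, `FakeSocket` and `send_then_rearm` are hypothetical names for illustration; only `add_callback`, `send_multipart` and `_handle_events(stream.socket, 0)` mirror calls named in this PR.

```python
class FakeLoop:
    """Stand-in for the shell-channel IOLoop: records scheduled callbacks."""

    def __init__(self):
        self.callbacks = []

    def add_callback(self, fn):
        self.callbacks.append(fn)

    def run_pending(self):
        pending, self.callbacks = self.callbacks, []
        for fn in pending:
            fn()


class FakeSocket:
    """Stand-in for the shell ROUTER socket: records raw sends."""

    def __init__(self):
        self.sent = []

    def send_multipart(self, msg):
        self.sent.append(msg)


class FakeStream:
    """Stand-in for the shell ZMQStream: counts read-handler invocations."""

    def __init__(self, socket):
        self.socket = socket
        self.handled = 0

    def _handle_events(self, fd, events):
        self.handled += 1


def send_then_rearm(loop, shell_socket, stream, msg):
    """Raw send, then schedule the read handler only if the guard holds."""
    shell_socket.send_multipart(msg)
    if stream is not None and stream.socket is shell_socket:
        # Scheduled on the loop rather than called inline.
        loop.add_callback(lambda: stream._handle_events(stream.socket, 0))
        return True
    return False


loop, sock = FakeLoop(), FakeSocket()
stream = FakeStream(sock)

rearmed = send_then_rearm(loop, sock, stream, [b"A", b"reply"])
loop.run_pending()

no_stream = send_then_rearm(loop, sock, None, [b"A", b"reply"])
mismatch = send_then_rearm(loop, sock, FakeStream(FakeSocket()), [b"A", b"reply"])
```

The reply is sent in all three cases; the re-arm is scheduled only when the stream wraps the same socket object that was just written to.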

Validation

Windows, Python 3.13 and 3.14, pyzmq 27.1.0 / libzmq 4.3.5; three arms × 20 real notebook runs, same machine/session:

| Arm | After the reply `send_multipart` | Wedged |
|---|---|---|
| control | none | 6/20 (30%) |
| sham | identical scheduling overhead, `add_callback(lambda: None)` (no re-arm) | 5/20 |
| fix | `add_callback(lambda: stream._handle_events(stream.socket, 0))` | 0/20 |

P(0/20 | p=0.30) ≈ 0.70^20 ≈ 8e-4. The sham arm isolates the re-arm itself from the added wake-ups (it stays at the control rate). With the diff applied the threaded reference was live on every send (551 re-arms, 0 None/mismatch). Validated against 7.2.0 / 7.3.0, where _send_on_shell_channel is byte-identical to main.
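
The probability quoted above can be checked with the standard library. This is a small sketch assuming the 20 runs are independent and each wedges with the control rate of 0.30; `binom_pmf` is a helper name introduced here for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k wedged runs out of n independent runs."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Chance of seeing 0 wedged runs in 20 if the true rate were still 30%.
p_zero = binom_pmf(0, 20, 0.30)
print(f"{p_zero:.2e}")  # 7.98e-04
```

With k = 0 the expression reduces to 0.70^20, which is the figure given in the validation text.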

A public-API alternative

stream.flush(zmq.POLLIN) — a public ZMQStream method ipykernel already calls in kernelbase.py — may be preferable to reaching into _handle_events. flush is a synchronous drain loop whereas _handle_events(socket, 0) is the edge-trap reschedule (related but not identical). I only measured _handle_events; happy to switch to flush if you'd rather stay on the public API.

Notes

Still unfixed on main: _send_on_shell_channel is a bare send_multipart.
Related: "ipykernel 7 causes notebook execution to hang" (microsoft/vscode-jupyter#17228, downstream symptom); "ipykernel 7.0.0 release with subshells" (#1438, 7.0.0 problem hub); "ensure qt zmq streams starts in a clean state" (#307, the 4.8.2 fix for the same ZMQ_FD edge-trigger bug class); "getsockopt(zmq.EVENTS) drains signaler, can cause zmq.asyncio recv to miss wakeups" (zeromq/pyzmq#2173, same signaler-drain family, different layer).
A user-side mitigation — retry the kernel subprocess on the CellTimeoutError signature — works today and is orthogonal to this fix.
Happy to add a regression test or open a companion tracking issue if you'd prefer.

🤖 Generated with Claude Code

The ZMQ_FD edge-drain fix is now an upstream PR (ipython/ipykernel#1529, filed under BoykoNeov, 3 files +21/-0, validated 0/20 vs 6/20). The doc now points to the PR and the applicable patch; steel-sim's retry mitigation is unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

BoykoNeov · 2026-06-15T13:20:25Z

The failing checks here are pre-existing flakiness on the Windows runners, not this patch — each is on a code path these three files don't touch:

build (windows-latest, qt5, 3.10) — tests/inprocess/test_kernel.py::test_pylab timed out while matplotlib built its font cache on a cold runner. That's the in-process kernel (no ZMQ socket, no SubshellManager, no kernelapp), so the shell-channel change can't reach it.
build (windows-latest, qt5, pypy-3.11) — RuntimeError: Kernel died before replying to kernel_info, a kernel startup crash that occurs before any execute_request, i.e. before the re-arm in _send_on_shell_channel ever runs. The same pypy-3.11 job passes under qt6 and fails under qt5, and main fails this same job identically (same error; 141 passed / 42 skipped, attributed to test_async_interrupt), so it isn't introduced here.
enforce-label — needs a maintainer label.

Everything that actually exercises the change is green: all CPython Windows builds (which run tests/test_subshells.py under the patch), the downstream-project tests, and lint. Happy to rebase or re-run if that's useful.

practicusai · 2026-06-15T18:23:26Z

@BoykoNeov not sure if it is relevant, but in our case (microsoft/vscode-jupyter#17228) the OS is Ubuntu. I hope the fix will work across multiple OSes.

ianthomas23 · 2026-06-16T09:45:30Z

I've confirmed the flaky tests passed the second time around.

> Happy to add a regression test or open a companion tracking issue if you'd prefer.

A regression test is important.

Is there a simpler solution in which _send_on_shell_channel calls send_multipart on the shell_stream rather than on the shell_socket?

On Windows, ipykernel 7 intermittently drops an execute_request on the shell channel: the kernel goes idle and never replies, and the client times out waiting for execute_reply (historically ~30% of headless notebook runs in our measurements; which cell hangs wanders run to run).

Root cause: the shell ROUTER socket is dual-use on the shell-channel thread. A ZMQStream reads execute_requests off it, while replies are sent back over the SAME socket out-of-band via a raw send_multipart in SubshellManager._send_on_shell_channel. That out-of-band send drains the socket's edge-triggered ZMQ_FD read edge (a documented libzmq corollary: after zmq_send the socket may become readable without a new edge). Because the send is not ZMQStream-mediated, the stream is never re-armed and a request that arrived concurrently strands unread on a registered-but-non-readable fd. The strand is terminal: no later arrival re-edges it.

Fix: send the reply through shell_stream.send_multipart rather than raw on shell_socket. This keeps the ZMQStream the sole user of the socket: the send is serviced by the stream's own _handle_events, which recvs any concurrently-queued request first and then re-arms POLLIN via _rebuild_io_state, so the request cannot strand. It removes the root cause rather than re-arming after it, and needs no reach into the private _handle_events. The shell_stream (built in kernelapp.init_kernel) is threaded through ShellChannelThread into SubshellManager so the reply path can reach it, and falls back to the raw socket if no stream was threaded in.

Verified by the regression test added next, which reproduces the strand precondition deterministically and asserts the queued request is still delivered once the reply is routed through the stream.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Deterministically reproduces the strand precondition (a request queued on the dual-use shell ROUTER with its edge-triggered ZMQ_FD read edge already consumed), then performs the out-of-band reply send through the real SubshellManager._send_on_shell_channel path and asserts the queued request is still delivered to on_recv.

Behavioural, not implementation-coupled: it passes whether the reply send re-arms the stream explicitly or is routed through it, and fails (times out) against a raw-send body. Deterministic (no timing race). Verified on Windows; the edge-trigger behaviour it relies on is general.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

BoykoNeov · 2026-06-16T11:57:07Z

Thanks @ianthomas23 — and good the re-run came back green.

Agreed, and I think routing through shell_stream is the better fix. The raw shell_socket.send_multipart is the one op on the dual-use ROUTER that the stream never sees, so it drains the edge-triggered ZMQ_FD read edge and nothing re-arms it. Sending through shell_stream.send_multipart keeps the stream the sole user of the socket — its own _handle_events recvs any concurrently-queued request first, then re-arms POLLIN via _rebuild_io_state. So it kills the root cause instead of re-arming after it, and drops the poke into the private _handle_events. _send_on_shell_channel already runs on the shell-channel thread that owns the stream's loop, so no thread-affinity worries. I've pushed that — the PR is now the stream-send fix plus a regression test.
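
A minimal sketch of the routing described above, with the fallback to the raw socket mentioned in the commit message. It uses stand-in classes so it runs without pyzmq; `ReplySender`, `FakeSocket` and `FakeStream` are hypothetical names, and the real change lives in SubshellManager._send_on_shell_channel, whose exact body is not reproduced here.

```python
class FakeSocket:
    """Stand-in for the shell ROUTER socket: records raw sends."""

    def __init__(self):
        self.raw_sent = []

    def send_multipart(self, msg):
        self.raw_sent.append(msg)


class FakeStream:
    """Stand-in for the shell ZMQStream: records stream-mediated sends."""

    def __init__(self, socket):
        self.socket = socket
        self.stream_sent = []

    def send_multipart(self, msg):
        self.stream_sent.append(msg)


class ReplySender:
    """Illustrative reply path: prefer the stream, fall back to the socket."""

    def __init__(self, shell_socket, shell_stream=None):
        self._shell_socket = shell_socket
        self._shell_stream = shell_stream

    def send_on_shell_channel(self, msg):
        if self._shell_stream is not None:
            # Stream-mediated send: the stream stays the only user of the socket.
            self._shell_stream.send_multipart(msg)
        else:
            # No stream was threaded in: keep the previous raw-send behaviour.
            self._shell_socket.send_multipart(msg)


sock = FakeSocket()
stream = FakeStream(sock)

ReplySender(sock, stream).send_on_shell_channel([b"A", b"reply"])
routed_via_stream = (len(stream.stream_sent), len(sock.raw_sent))

ReplySender(sock).send_on_shell_channel([b"A", b"reply"])
fallback_raw = (len(stream.stream_sent), len(sock.raw_sent))
```

With a stream present the raw socket is never written to directly; the raw path is exercised only when no stream was supplied.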

The test is deterministic rather than statistical: it sets up the strand by hand (queue a request on the ROUTER, consume the ZMQ_FD read edge so it can't re-fire), then does the out-of-band reply send and asserts the queued request still lands in on_recv. It's behavioural, so it passes whether the reply re-arms the stream or routes through it, and times out on a raw-send body. Verified on Windows; the edge-trigger behaviour it leans on is general, so CI should cover the other platforms.

@practicusai — the mechanism is generic libzmq edge-triggering rather than Windows-specific, so this should help the Linux case too.

ZupoLlask · 2026-06-17T10:23:15Z

Dear @BoykoNeov,

We're seeing some issues with flaky tests on Windows that seem to be somewhat related with this subject.

Would you give us some help here, please? 🙏

ianthomas23 added the bug label Jun 15, 2026

BoykoNeov and others added 2 commits June 16, 2026 14:46

BoykoNeov force-pushed the fix/shell-zmqstream-rearm-after-reply-send branch from c53bfe6 to 78782b4 Compare June 16, 2026 11:53

[pre-commit.ci] auto fixes from pre-commit.com hooks

1c2eef1

for more information, see https://pre-commit.ci

BoykoNeov changed the title ~~Re-arm the shell ZMQStream read after the out-of-band reply send (fix intermittent dropped execute_request on Windows)~~ Send the shell reply through the ZMQStream instead of raw on its socket Jun 16, 2026

BoykoNeov wants to merge 3 commits into ipython:main from BoykoNeov:fix/shell-zmqstream-rearm-after-reply-send
