feat(examples): coordination-vs-compute eval suite + supervisor self-improvement analyst #415

Merged
drewstone merged 11 commits into main from feat/coordination-eval-suite on Jun 30, 2026
@drewstone


What

A cost-aware ablation suite under examples/ablation-suite/ comparing three agent topologies at matched budget on contamination-proof coding tasks, plus the supervisor's real-time self-improvement analyst. All examples/ — no src/ changes.

Topologies (the lever under test):

  • continuous (refine) — one worker, carries its conversation across shots.
  • ralph (ralph-strategy.ts) — one persistent file, FRESH context each round (re-read + continue).
  • supervisor (supervise() via selfImprovingSupervisor) — a driver brain spawning serial workers, with a failuresAnalyst feeding the driver each worker's still-failing tests.

Evals: seed-derived, contamination-proof oscillation tasks (long-coding-env*.ts — shared conventions → fix-one-break-another) + a real-library reconstruction (verkit-env.ts, renamed python-semver, 257 real tests).

Observability (counting-surface.ts): every arm reports resolve, pass-fraction score, tokens in/out, LLM calls, refine-shots, $, latency, and per-tool call counts.

Findings (receipts)

  • Supervisor > single agent, significant: +20.8pp (95% CI [8,38], n=24, deepseek-flash) — Pareto-dominant (won every seed continuous won + 5 more, lost none). Mechanism: fresh worker contexts (no transcript bloat) + a driver that accumulates conventions across workers.
  • Compute-confound refuted: continuous at 2× budget still 0% — it can't convert extra compute, so the win isn't brute force.
  • Ralph loses (−8.3pp): the win is the driver's accumulated memory, not fresh respawn.
  • Model-relative boundary: at frontier (gemini-2.5-pro) the same eval is solved solo (100% all arms) → coordination adds 0pp at 3× cost. Coordination is a weak-model amplifier; its value lives in the worker's middle band, which a stronger model shrinks.
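The headline delta can be checked by hand. At n=24, 46% and 25% correspond to 11/24 and 6/24 resolved seeds, and 5/24 is about 20.8pp, which matches "5 more" seeds won. The sketch below is a minimal paired bootstrap over seeds under that reading. The helper name `pairedBootstrapDelta`, the PRNG, and the per-seed outcome vectors are assumptions for illustration; it is not the runner's own statistics code, and its interval need not reproduce the reported [8,38] exactly.

```typescript
// Deterministic PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface PairedDelta {
  delta: number; // mean per-seed difference (arm A minus arm B)
  lo: number;    // 2.5th percentile of the bootstrap means
  hi: number;    // 97.5th percentile of the bootstrap means
}

// Hypothetical helper: resample seeds with replacement, keeping each seed's
// (A, B) outcomes paired, then take percentiles of the resampled mean difference.
function pairedBootstrapDelta(
  a: boolean[],
  b: boolean[],
  resamples = 10000,
  seed = 1,
): PairedDelta {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("arms must be paired and non-empty");
  }
  const n = a.length;
  const diffs = a.map((won, i) => (won ? 1 : 0) - (b[i] ? 1 : 0));
  const delta = diffs.reduce((s, d) => s + d, 0) / n;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let k = 0; k < n; k++) sum += diffs[Math.floor(rand() * n)] ?? 0;
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const at = (q: number): number =>
    means[Math.min(resamples - 1, Math.floor(q * resamples))] ?? delta;
  return { delta, lo: at(0.025), hi: at(0.975) };
}

// Assumed per-seed outcomes consistent with the reported counts: supervisor
// resolves 11/24, continuous 6/24, and every continuous win is a supervisor win.
const supervisorWins = Array.from({ length: 24 }, (_, i) => i < 11);
const continuousWins = Array.from({ length: 24 }, (_, i) => i < 6);
const supervisorVsContinuous = pairedBootstrapDelta(supervisorWins, continuousWins);
// supervisorVsContinuous.delta is 5/24, about +20.8pp
```

Because the resampling keeps each seed's two outcomes together, seeds that both arms win or both lose contribute zero to every resample, so the interval is driven only by the seeds where the arms disagree.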

Architecture

selfImprovingSupervisor is NOT a separate primitive — it is supervise() configured. Self-improvement = the authored failuresAnalyst (surfaces the worker's failing tests so the driver targets persistently-hard cases), fed to supervise()'s existing analyzeOnSettle knob. Two timescales, one entrypoint each: within-run (the analyst) and across-run (improve()/selfImprove wrapping it — gepa-driver-prompt.ts).

tsc + biome clean; failing-tests-reach-driver verified by a $0 check.

Adds the three-topology comparison to the ablation board:
- run_tests feedback + iterate-until-pass prompt in hard-coding-env
- ralph strategy: one persistent file, fresh context per round (re-read + continue)
- persistent-surface: artifact carries across a supervisor's workers
- serial workers (maxLiveWorkers:1) so a shared file is not raced
- long-coding-env: 54-function oscillation eval (9 shared seed-derived
  conventions) that forces context-overflow, so a single continuous context
  degrades while fresh respawns / a driver that accumulates do not
… significant)

long-coding-env-lite: 28 tiny functions, 196 tests, 9 shared seed-derived
conventions (10 dual-coupled) — difficulty from cases not file size, so the
slow supervisor arm is tractable for a powered run.

Confirms the three-way (continuous vs ralph vs supervisor) at n=24:
supervisor 46% vs continuous 25% vs ralph 17%; supervisor - continuous
= +20.8pp, 95% CI [8,38], Pareto-dominant (won every seed continuous won
plus 5 more, lost none). Ralph loses, so the win is the driver's accumulated
memory across fresh workers, not respawn. ~3-4x cost (a capability win).
Every arm now reports the complete column set — resolve (k/n), tokens in/out,
LLM calls (completions, incl. the supervisor's conserved-pool iterations),
refine-shots, total $, $/task, latency/task — plus optional per-tool call
counts via a countingSurface wrapper (ArmResult.toolCalls). selfImprovingSupervisor
now returns completions (spentTotal.iterations) so the supervisor arm is no longer
a 0 in the calls column. printAutopsy widened to show all of it.
…-call counting

verkit-env: reconstruct a renamed python-semver (1 class + 16 fns, 257 real
pytest tests, ~276 impl lines) from its tests — the real-code analog of the
synthetic oscillation eval, for the drown-vs-accumulate regime. Calibrated
stub 0/257 -> real 257/257. counting-surface tallies per-tool calls; wired
into runAblation so every arm reports tool counts. Contamination: module/class
renamed to defeat verbatim recall; spec-covered surface still partly known
(documented in fixtures/verkit/PROVENANCE.md).
Adds scoreMean (mean pass-fraction [0,1]) and perTaskScore (per-task gradient)
to ArmResult + the autopsy — the metric that matters when a task is hard enough
that binary all-tests-pass resolve is mostly 0 (e.g. a real 257-test library).
printAutopsy gains a score column.
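To make the difference between the two metrics concrete, here is a minimal sketch of binary resolve versus mean pass-fraction. The `TaskOutcome` shape and the function bodies are assumptions for illustration; only the names `scoreMean` and `perTaskScore` come from the commit message, and the real fields on `ArmResult` may be computed differently.

```typescript
interface TaskOutcome {
  passed: number; // tests passing at the end of the run
  total: number;  // fixed denominator of tests for the task
}

// Binary metric: fraction of tasks where every test passes.
function resolveRate(tasks: TaskOutcome[]): number {
  if (tasks.length === 0) return 0;
  const resolved = tasks.filter((t) => t.total > 0 && t.passed === t.total).length;
  return resolved / tasks.length;
}

// Gradient metric: each task's pass fraction in [0, 1].
function perTaskScore(tasks: TaskOutcome[]): number[] {
  return tasks.map((t) => (t.total > 0 ? t.passed / t.total : 0));
}

// Mean pass fraction across tasks.
function scoreMean(tasks: TaskOutcome[]): number {
  const scores = perTaskScore(tasks);
  return scores.length === 0 ? 0 : scores.reduce((s, x) => s + x, 0) / scores.length;
}

// Two hypothetical attempts at a 257-test task, neither fully passing.
const armTasks: TaskOutcome[] = [
  { passed: 200, total: 257 },
  { passed: 120, total: 257 },
];
// resolveRate(armTasks) is 0, while scoreMean(armTasks) is (200/257 + 120/257) / 2
```

With a task this large, resolve stays at 0 for both attempts even though one is clearly further along; the pass fraction is what separates them.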
…arate thing

Collapse the 'self-improving' concept onto supervise()'s existing analyst knob, and
make the lens actually rich:
- failuresAnalyst (replaces score-only progressAnalyst): surfaces the worker's actual
  still-FAILING tests so the driver targets the persistently-hard cases across workers
  — real within-run self-improvement, not just 'didn't resolve'.
- surface-worker captures the worker's last run_tests -> SurfaceWorkerOut.failing via a
  seam-local capture wrapper (no re-open / refcount hazard).
- driver prompt targets + accumulates the failing tests across spawns.
- docstring: ONE entrypoint (supervise); two timescales (within-run = this analyst,
  across-run = improve()/selfImprove); the analyst is the only knob that defines
  what self-improvement means.
Verified: failing tests reach the driver finding ($0 check); flash smoke clean.
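One way to picture what "targets the persistently-hard cases across workers" could mean is a tally of how many settled workers left each test failing. The sketch below is an assumption-laden illustration: `WorkerReport`, `persistentlyFailing`, and `failuresFinding` are hypothetical names, and the actual `failuresAnalyst` and its wiring into `analyzeOnSettle` are not shown in this conversation.

```typescript
interface WorkerReport {
  workerId: string;
  failing: string[]; // test names still failing when the worker settled
}

// Count, per test name, how many settled workers left it failing,
// and return names ordered from most to least persistent.
function persistentlyFailing(
  history: WorkerReport[],
): Array<{ test: string; workers: number }> {
  const counts = new Map<string, number>();
  for (const report of history) {
    // De-duplicate within one worker so a repeated name counts once.
    const unique = report.failing.filter((name, i, all) => all.indexOf(name) === i);
    for (const name of unique) counts.set(name, (counts.get(name) ?? 0) + 1);
  }
  const ranked: Array<{ test: string; workers: number }> = [];
  counts.forEach((workers, test) => ranked.push({ test, workers }));
  return ranked.sort((x, y) => y.workers - x.workers || x.test.localeCompare(y.test));
}

// Render a finding string that a driver prompt could consume.
function failuresFinding(history: WorkerReport[]): string {
  const ranked = persistentlyFailing(history);
  if (ranked.length === 0) return "No failing tests reported by settled workers.";
  return (
    "Still failing across workers: " +
    ranked.map((r) => `${r.test} (${r.workers}/${history.length})`).join(", ")
  );
}
```

A score-only analyst would report the same "didn't resolve" for every worker; a ranked list like this gives the driver something specific to hand to the next spawn.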

@tangletools tangletools left a comment


✅ Auto-approved drewstone PR — 9f1b5b8f

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-06-30T00:03:20Z

@tangletools tangletools left a comment


⚪ Value Audit — audit-incomplete

Verdict: audit-incomplete
Concerns: 2 (2 low)
Heuristic: 0.1s
Duplication: 0.0s
Interrogation: 8.4s (2 bridge agents)
Total: 8.5s

💰 Value — error

value agent produced no parseable value-audit JSON.

  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge error: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content

🎯 Usefulness — error

usefulness agent produced no parseable value-audit JSON.

  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge error: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/ablation-suite/ablation.ts

  • console.log(

🟡 Cruft: magic number added examples/ablation-suite/ablation.ts

  •    pad(`${(r.latencyMs / 1000 / Math.max(1, r.n)).toFixed(0)}s`, 9) +
    

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260630T000516Z

@tangletools


✅ No Blockers — 9f1b5b8f

Review health 100/100 · Reviewer score 66/100 · Confidence 70/100 · 5 findings (2 medium, 3 low)

deepseek: Correctness 66 · Security 66 · Testing 66 · Architecture 66

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 2/2 planned shots over 27 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Missing --color=no in 3 of 4 pytest invocation surfaces — examples/ablation-suite/hard-coding-env.ts

verkit-env.ts:96 includes --color=no in pytest args with the comment 'Force plain output: a forced-color env would wrap FAILED/test names in ANSI codes and break the summary parsing below.' The other 3 surfaces (hard-coding-env.ts:303-311 and :329, long-coding-env.ts:1151, long-coding-env-lite.ts:944) do NOT pass --color=no. If FORCE_COLOR=true is set in CI or the environment, ANSI escape sequences will break the passed/failed/test-name regex parsing across all 3 surfaces. Fix: add --color=no to pytest args in hardCodingEnv.pytestPassed, hardCodingEnv.runTestsReport, longCodingEnv.pytestPassed, longCodingEnv.runTestsReport, and both lite equivalents.
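The failure mode is easy to reproduce in isolation. The sketch below shows why a color-wrapped summary line slips past the parsing regex, and a defensive `stripAnsi` step as a complement to passing `--color=no`. The `stripAnsi` helper and the sample line are assumptions for illustration; the fix the review proposes is the flag itself.

```typescript
// Matches SGR color sequences such as "\x1b[31m" and "\x1b[0m".
const ANSI_SGR = /\x1b\[[0-9;]*m/g;

// Hypothetical helper: drop color codes before parsing.
function stripAnsi(text: string): string {
  return text.replace(ANSI_SGR, "");
}

// Same shape as the failing-test regex quoted in the review, without the /g flag.
const FAILED_LINE = /FAILED \S*::(\S+)/;

// Illustrative colored summary line (assumed shape, not captured from a real run).
const coloredLine = "\x1b[31mFAILED\x1b[0m tests/test_calc.py::test_add - assert 3 == 4";

// null: "FAILED" is followed by an escape sequence rather than a space.
const rawMatch = FAILED_LINE.exec(coloredLine);
// Matches once the color codes are removed; group 1 is "test_add".
const cleanMatch = FAILED_LINE.exec(stripAnsi(coloredLine));
```

Stripping codes guards against environments that force color regardless of flags, but it does not replace `--color=no`, which keeps the captured output clean for the agent as well as for the parser.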

🟠 MEDIUM hard-coding-env.ts runTestsReport regex matches FAILED but not ERROR — examples/ablation-suite/hard-coding-env.ts

Line 338: failing regex is /FAILED \S*::(\S+)/g. The other 3 surfaces (long-coding-env.ts:1183, long-coding-env-lite.ts:976, verkit-env.ts:132) all use (?:FAILED|ERROR) to capture both failure and error test names. When a test errors (unexpected exception during execution), its name is silently omitted from the failing list shown to the agent. The agent then has incomplete feedback about what is broken. Fix: change regex to /(?:FAILED|ERROR) \S*::(\S+)/g matching the other surfaces.
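A small sketch of the two regexes side by side makes the gap visible. The function name `failingTests` and the sample summary lines are assumptions for illustration; the two patterns are the ones quoted in the finding.

```typescript
// Collect test names from pytest short-summary lines.
// includeErrors switches between the FAILED-only and the FAILED|ERROR pattern.
function failingTests(output: string, includeErrors: boolean): string[] {
  const pattern = includeErrors
    ? /(?:FAILED|ERROR) \S*::(\S+)/g
    : /FAILED \S*::(\S+)/g;
  const names: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(output)) !== null) {
    const name = m[1];
    if (name !== undefined) names.push(name);
  }
  return names;
}

// Illustrative summary lines (assumed shape).
const summary = [
  "FAILED tests/test_calc.py::test_add - assert 3 == 4",
  "ERROR tests/test_calc.py::test_div - ZeroDivisionError: division by zero",
].join("\n");
// failingTests(summary, false) returns ["test_add"]
// failingTests(summary, true)  returns ["test_add", "test_div"]
```

With the FAILED-only pattern the erroring test disappears from the list, so the agent sees one broken test where there are two.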

🟡 LOW stub.py uses bare open() without context managers — examples/ablation-suite/fixtures/verkit/stub.py

Line 21: tree = ast.parse(open(src).read()) and line 37: open(dst, 'w').write(header + ast.unparse(tree) + '\n'). Neither uses a context manager (with statement), so file descriptors are closed by GC rather than deterministically. This is a build-time script (not hot-path), so practical impact is negligible, but it's a Python anti-pattern. Fix: use with open(src) as f: tree = ast.parse(f.read()) and with open(dst, 'w') as f: f.write(...).

🟡 LOW hard-coding-env.ts runTestsReport missing collection-error detection — examples/ablation-suite/hard-coding-env.ts

Lines 324-341: when pytest fails to collect any tests (syntax error in calc.py), the function returns just '0/0 tests passed.' with no failing-list. Contrast with long-coding-env.ts:1184-1185 et al. which detect passed===0 && failing.length===0 and return a COLLECTION/SYNTAX ERROR message telling the agent the file didn't import cleanly. The agent has no way to distinguish 'everything is broken' from an import-level crash. Fix: add the passed===0 && failing.length===0 guard matching the other 3 surfaces.
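The guard described in the finding can be sketched as a small report formatter. The function name `runTestsSummary` and the exact wording of the error message are assumptions; the `passed===0 && failing.length===0` condition and the `N/M tests passed.` and `FAILING:` formats come from the review text.

```typescript
// Build the one-line report shown to the agent after a test run.
function runTestsSummary(passed: number, total: number, failing: string[]): string {
  // Nothing passed and nothing was reported as failing: the tests most likely
  // never ran, for example because the file under test did not import cleanly.
  if (passed === 0 && failing.length === 0) {
    return (
      "COLLECTION/SYNTAX ERROR: no test passed and none was reported as failing; " +
      "the file under test probably did not import cleanly."
    );
  }
  const head = `${passed}/${total} tests passed.`;
  return failing.length > 0 ? `${head} FAILING: ${failing.join(", ")}` : head;
}
```

Without the guard, an import-level crash and a run with no information look identical to the agent, which is the ambiguity the finding calls out.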

🟡 LOW long-coding-env-lite.ts comment claims referenceLib is exported but it is not — examples/ablation-suite/long-coding-env-lite.ts

Line 6 says 'SAME exported names (longCodingEnv, longCodingTasks, referenceLib)' but only longCodingEnv (line 986) and longCodingTasks (line 1089) are exported. referenceLib is a private function ([line 1115](https://github.com/tangle-network/agent-runtime/blob/9f1b5b8f82fa72b17210924d7d979297a75d9147/examples/ablation-suite/long-coding-env-lite.ts#


tangletools · 2026-06-30T00:09:26Z · trace

tangletools previously approved these changes Jun 30, 2026

@tangletools tangletools left a comment


✅ Approved — 5 non-blocking findings — 9f1b5b8f

Full multi-shot audit completed 2/2 planned shots over 27 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-30T00:09:26Z · immutable trace

@tangletools


Premise check withheld merge — 9f1b5b8f

Classifier flagged this PR as a premise claim (numeric pp/% delta + eval terminology). Confidence: medium.

Recommend re-running the underlying eval with pairedEvalueSequence before merging.

  • Cited claim: +20.8pp
  • PR body excerpt: feat(examples): coordination-vs-compute eval suite + supervisor self-improvement analyst

Run:

pnpm eval:evolve --reps 5 --skip-mutation

Classifier rationale: Body cites 3 numeric claim(s) (+20.8pp, 8.3pp, 0pp) and eval-related terms appear in pr_body. PR is asserting a measurable result that repair-pr cannot polish away — re-run the underlying evaluation before merging.


tangletools premise check · #415

- surface-worker captureFailures: guard the regex group (m[1] is possibly
  undefined under strict) — fixes the CI examples typecheck failure.
- add --color=no to the 6 pytest invocations missing it (hard/long/long-lite)
  so captured run_tests output carries no ANSI codes (review finding).

@tangletools tangletools left a comment


✅ Auto-approved drewstone PR — 3715e00c

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-06-30T00:16:21Z

@tangletools tangletools left a comment


🟡 Value Audit — sound-with-nits

Verdict: sound-with-nits
Concerns: 6 (1 medium-concern, 2 low, 3 weak-concern)
Heuristic: 0.1s
Duplication: 0.0s
Interrogation: 222.0s (2 bridge agents)
Total: 222.1s

💰 Value — sound-with-nits

A contamination-proof ablation harness that empirically validates supervisor > single-agent (+20.8pp, n=24, CI excludes 0) by composing existing primitives (supervise/runAgentic/defineStrategy); ships with three small in-grain surface wrappers and a real-library fixture, plus three weak nits (partia

  • What it does: Adds three things, all under examples/ablation-suite/: (1) an ablation runner (ablation.ts) that compares topologies (single=refine, fanout=sample, fanout-refine=sampleThenRefine, ralph=fresh-respawn) PLUS driver-brain arms (driverSteer=selfImprovingSupervisor via supervise(); optimize='gepa'=GEPA-tune the driver prompt on a disjoint train slice, freeze, then drive) at matched budget, with paired-
  • Goals it achieves: Stated in the code: (a) prove the supervisor topology earns its cost on weak models — findings show +20.8pp over single at matched budget, CI [8,38], n=24, Pareto-dominant; (b) refute the compute-confound (continuous at 2x budget still 0% — it can't convert extra compute, so the win isn't brute force); (c) localize the win to the driver's accumulated memory, not fresh respawn (ralph loses -8.3pp);
  • Assessment: Sound and in the grain of the codebase. The change composes existing primitives — supervise() from src/runtime/supervise/, defineStrategy from src/runtime/strategy.ts:789, runAgentic — rather than reinventing them. self-improving-supervisor.ts:2 explicitly frames itself as 'NOT a separate primitive: it is supervise() configured.' The new wrappers (persistentSurface, countingSurface, ralph-strategy
  • Better / existing approach: Searched src/runtime/ for existing equivalents. Found: (1) src/runtime/run-benchmark.ts:132 runBenchmark already does paired-bootstrap + Pareto + perStrategy/perTask tables for Strategy arms — but it cannot express the driverSteer (supervise()-based) or optimize:'gepa' arms, which are the actual point of this ablation, so the unified custom runner is justified; the topology-only arms (single/fanou
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound-with-nits

A well-wired ablation suite over the existing AgenticSurface/Strategy/supervise() seams; one wrapper file (persistentSurface) is genuinely dead and its absence means a stated supervisor mechanism is not actually firing.

  • Integration: Reachable and wired into the substrate's documented seams. ablation.ts is the entrypoint (main() at line 358, runnable via pnpm tsx); it wraps the env ONCE in countingSurface (ablation.ts:155) and threads counter into both runAgentic (line 247) and selfImprovingSupervisor (line 213). ralph is registered in topologyStrategy (ablation.ts:74) and reachable via the topology:'ralph' k
  • Fit with existing patterns: Excellent fit. The wrappers (countingSurface, persistentSurface) compose transparently over AgenticSurface (strategy.ts:76-83), exactly mirroring the test fixture's own shotCountingSurface (tests/loops/strategy-evolution.test.ts:25). ralph is authored with defineStrategy — the sanctioned compact-strategy path (strategy.ts:789). selfImprovingSupervisor is correctly NOT a new primitive
  • Real-world viability: Robust on the happy path and documented error paths. Each env ships a $0 calibrate() self-check proving stub→0 and real→257 (verkit-env.ts:293-342; hard-coding-env.ts:603-625). The ablation runner catches per-task throws and counts them unresolved (ablation.ts:268-277) so one network failure cannot lose the whole arm. pytest invocations use --continue-on-collection-errors and a fixed denominat
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/ablation-suite/ablation.ts

  • console.log(

🟡 Cruft: magic number added examples/ablation-suite/ablation.ts

  •    pad(`${(r.latencyMs / 1000 / Math.max(1, r.n)).toFixed(0)}s`, 9) +
    

💰 Value Audit

🟡 long-coding-env.ts and -lite.ts duplicate ~1500 lines of generator machinery [duplication] ``

long-coding-env.ts (1383 lines) and long-coding-env-lite.ts (1189 lines) define pick/pyLit/tline/Conv/conventions IDENTICALLY (verified: both define function conventions(seed) at line ~108-112, both define pyLit at line 70-75, both copy pick verbatim). The lite version's header says it's 'SAME surface EXACTLY' with 'half the functions.' A shared env-generator module (conventions + pick + pyLit + tline + the function-table parameterization) parameterized by a function-count knob would cut ~1200 l

🟡 captureFailures regex-couples the seam to three env files' output format [duplication] ``

surface-worker.ts:71-79 parses 'FAILING: name1, name2' out of the run_tests output via regex (/FAILING:\s*(.+)/i). That string is emitted independently by runTestsReport in hard-coding-env.ts:340, long-coding-env.ts, long-coding-env-lite.ts, and verkit-env.ts:141. If any env changes its report wording, captureFailures silently returns [] and the failuresAnalyst loses its signal (no error, just degraded steering). Two cleaner options: (a) surface.call could return a structured {report, failing?:

🟡 Topology-only arm loop partially reinvents src/runtime/run-benchmark.ts [duplication] ``

ablation.ts:246-267 inlines a per-task Strategy-driver loop (runAgentic per task, accumulate ti/to/usd/ms/shots/perTask) that is functionally the same as runBenchmark's cell loop at src/runtime/run-benchmark.ts:138-187. runBenchmark already provides pairedBootstrap, paretoFrontier, perStrategy summaries, and error-resilient per-task tracking. The ablation's driverSteer/optimize arms cannot use runBenchmark (supervise() topology is outside its Strategy seam), which justifies the custom runner — b

🎯 Usefulness Audit

🟠 persistentSurface is dead surface — defined, exported, zero call sites [integration] ``

grep across the whole repo for persistentSurface( returns only its own definition (persistent-surface.ts:17) and a docstring composition example (counting-surface.ts:9). No file imports it. ablation.ts:155 wraps the env in countingSurface(opts.environment) only — never in persistentSurface. Each supervisor worker therefore opens its own workspace (surface-worker.ts:117 → refine → depthStrategy → surface.open at strategy.ts:587), so the 'build-on-progress across fresh-context workers' mecha



value-audit · 20260630T002148Z

@tangletools


✅ No Blockers — 3715e00c

Review health 100/100 · Reviewer score 86/100 · Confidence 70/100 · 3 findings (3 low)

deepseek: Correctness 86 · Security 86 · Testing 86 · Architecture 86

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 2/2 planned shots over 27 changed files. Global verifier still owns final merge decision.

🟡 LOW printAutopsy NaN display when holdoutN=0 — examples/ablation-suite/ablation.ts

At line 324, Math.round(r.resolve * r.n) evaluates to NaN when r.n === 0 (because resolved/0 = NaN, then NaN * 0 = NaN, Math.round(NaN) = NaN). The format string at line 335 then displays NaN% NaN/0. Only triggered by holdoutN=0 which is pathological, but the guard Math.max(1, r.n) used on lines 342-343 for latency/task and $/task should be applied to the resol
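The arithmetic in this finding, and one possible guard, can be sketched as follows. `ArmRow` and `formatResolve` are hypothetical names; the sketch assumes the column is rendered as a percentage followed by `k/n`, as the `NaN% NaN/0` output in the finding suggests.

```typescript
interface ArmRow {
  resolve: number; // resolved / n; NaN when n is 0
  n: number;       // number of held-out tasks in the arm
}

// Render the resolve column, falling back to 0 when the arm ran no tasks.
function formatResolve(r: ArmRow): string {
  const rate = r.n > 0 && isFinite(r.resolve) ? r.resolve : 0;
  const k = Math.round(rate * r.n);
  return `${(rate * 100).toFixed(0)}% ${k}/${r.n}`;
}

// Unguarded, the empty arm propagates NaN: Math.round((0 / 0) * 0) is NaN.
const unguarded = Math.round((0 / 0) * 0);
```

Here `formatResolve({ resolve: 0 / 0, n: 0 })` yields `0% 0/0`, while a normal arm such as `{ resolve: 11 / 24, n: 24 }` yields `46% 11/24`.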

🟡 LOW persistentSurface.close silently skips base.close for untracked handles — examples/ablation-suite/persistent-surface.ts

At lines 42-53, close() iterates the handles Map looking for an entry whose resolved handle ID matches. If no entry matches (e.g., close called with a handle from a different wrapper layer or directly from the base surface), the loop exits without calling base.close() and without decrementing any entry. This is a silent no-op that leaks workspace resources (tmpdir + files) in that unexpected path. It's correct under normal use (open+close always pair through the same persistentSurface), but the silent no-op masks misuse. Fix: after the loop, if no entry was found, fall through to base.close(handle) directly.
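The proposed fall-through can be sketched with a simplified, synchronous surface. Everything here is an assumption for illustration: the `WorkspaceSurface` interface, the `persistentSketch` name, and keying shared workspaces by task; the real `AgenticSurface` is likely asynchronous and richer.

```typescript
// Simplified stand-in for the surface seam (assumed shape).
interface WorkspaceSurface {
  open(task: string): string;  // returns a workspace handle
  close(handle: string): void; // releases the workspace
}

interface SharedEntry {
  task: string;
  handle: string;
  refs: number; // how many opens are still outstanding
}

// Share one workspace per task across opens; release it on the last close.
function persistentSketch(base: WorkspaceSurface): WorkspaceSurface {
  const entries: SharedEntry[] = [];
  return {
    open(task: string): string {
      for (const e of entries) {
        if (e.task === task) {
          e.refs += 1;
          return e.handle;
        }
      }
      const handle = base.open(task);
      entries.push({ task, handle, refs: 1 });
      return handle;
    },
    close(handle: string): void {
      for (let i = 0; i < entries.length; i++) {
        const e = entries[i];
        if (e === undefined || e.handle !== handle) continue;
        e.refs -= 1;
        if (e.refs === 0) {
          entries.splice(i, 1);
          base.close(handle);
        }
        return;
      }
      // Untracked handle: fall through to the base instead of silently
      // doing nothing, so the workspace is still released.
      base.close(handle);
    },
  };
}
```

The last line of `close` is the whole fix: tracked handles keep their reference counting, and anything the wrapper did not open is passed straight to the base surface.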

🟡 LOW captureFailures regex capture includes COLLECTION-BLOCKED suffix from verkit-env — examples/ablation-suite/surface-worker.ts

At line 72, the regex /FAILING:\s*(.+)/i captures everything after FAILING: including the | COLLECTION-BLOCKED: ... suffix that verkit-env.ts:runTestsReport (line 144-146) appends when both failing tests AND collection-blocked files exist. The captured group then includes the COLLECTION-BLOCKED text, producing malformed test names after split+trim. This only fires in the mixed failing+blocked state (worker has partial implementation but some test files still block at collection), which is transi
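One way to keep the suffix out of the captured names is to stop the capture at the `|` separator. The sketch below assumes the report shape described in the finding (`FAILING: a, b | COLLECTION-BLOCKED: ...`); the function name `captureFailing` and the sample report are hypothetical, and the real `captureFailures` may be structured differently.

```typescript
// Extract failing test names from a run_tests report line.
// The character class stops at "|" so a trailing COLLECTION-BLOCKED
// section is not folded into the last test name.
function captureFailing(report: string): string[] {
  const m = /FAILING:\s*([^|\n]+)/i.exec(report);
  const group = m?.[1];
  if (!group) return [];
  return group
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}

// Illustrative mixed report (assumed shape).
const mixedReport =
  "3/10 tests passed. FAILING: test_a, test_b | COLLECTION-BLOCKED: tests/test_c.py";
// captureFailing(mixedReport) returns ["test_a", "test_b"]
```

With the original `(.+)` capture, the second entry for the same input would be `test_b | COLLECTION-BLOCKED: tests/test_c.py`, which is the malformed name the finding describes.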


tangletools · 2026-06-30T00:22:46Z · trace

tangletools previously approved these changes Jun 30, 2026

@tangletools tangletools left a comment


✅ Approved — 3 non-blocking findings — 3715e00c

Full multi-shot audit completed 2/2 planned shots over 27 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-30T00:22:46Z · immutable trace

@tangletools


Premise check withheld merge — 3715e00c

Classifier flagged this PR as a premise claim (numeric pp/% delta + eval terminology). Confidence: medium.

Recommend re-running the underlying eval with pairedEvalueSequence before merging.

  • Cited claim: +20.8pp
  • PR body excerpt: feat(examples): coordination-vs-compute eval suite + supervisor self-improvement analyst

Run:

pnpm eval:evolve --reps 5 --skip-mutation

Classifier rationale: Body cites 3 numeric claim(s) (+20.8pp, 8.3pp, 0pp) and eval-related terms appear in pr_body. PR is asserting a measurable result that repair-pr cannot polish away — re-run the underlying evaluation before merging.


tangletools premise check · #415

…ines)

Lead with what it is and the three things you do with it (chat turn / supervise /
improve), each with a tiny example, then four plain-language bullets on how it
works and links to the deep docs. Drops the dense internal vocabulary from the
front page (it lives in docs/).

@tangletools tangletools left a comment


✅ Auto-approved drewstone PR — 4f43a6bb

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-06-30T00:24:45Z

@tangletools tangletools left a comment


🟡 Value Audit — sound-with-nits

Verdict: sound-with-nits
Concerns: 7 (2 low, 5 weak-concern)
Heuristic: 0.1s
Duplication: 0.0s
Interrogation: 399.7s (2 bridge agents)
Total: 399.8s

💰 Value — sound-with-nits

A rigorous, cost-aware ablation of three agent topologies with statistical receipts and a contamination-proof eval substrate — all in examples/, built on existing substrate primitives; the bundled README rewrite is slightly off-title but genuinely improves the front door.

  • What it does: Adds a cost-aware ablation runner that compares three agent topologies at matched budget — continuous (refine), ralph (fresh-context-per-round, state in the file), and supervisor (a driver brain steering serial workers with a failures analyst) — over seed-derived, contamination-proof oscillation coding tasks and a real-library reconstruction (python-semver, renamed to verkit, 257 tests). Every
  • Goals it achieves: (1) Prove, with paired-bootstrap CI, that the supervisor's coordination beats a single continuous worker at matched budget on weak/mid models — and refute the compute-confound (continuous at 2x budget still 0%). (2) Localize the win: ralph loses, so fresh-respawn alone is not the mechanism; the driver's accumulated memory is. (3) Show the model-relative boundary (at frontier, coordination adds 0pp
  • Assessment: Coherent and in the grain of the codebase. It composes existing substrate primitives exclusively — supervise(), refine, defineStrategy, the AgenticSurface seam — and adds no src/ surface (verified: git diff --stat -- 'src/*' is empty). The new pieces are minimal and seam-aligned: ralph is a 43-line defineStrategy body in the same shape as adaptiveRefine/sampleThenRefine; `count
  • Better / existing approach: Looked for an existing equivalent before flagging none. Searched src/ for persistentSurface|countingSurface|surface-zoo|withSurface|observingSurface|tracingSurface — no hits. The closest existing pattern is captureFailures inside examples/ablation-suite/surface-worker.ts:54, which intercepts surface.call to remember the last run_tests output for the analyst; it is intentionally kept loca
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound-with-nits

A calibrated, end-to-end-runnable ablation suite correctly wired into the substrate's real extension points (supervise, defineStrategy, AgenticSurface, selfImprove); ships ahead-of-first-caller utilities but nothing dead.

  • Integration: Tight. Every new module reaches a real substrate export: ralph uses defineStrategy (src/runtime/strategy.ts:789), selfImprovingSupervisor configures supervise() (src/runtime/supervise/supervise.ts:116) with a BYO-executor seam (surface-worker.ts:175-179 → resolved first by ExecutorRegistry.resolve at src/runtime/supervise/runtime.ts:1235), countingSurface/persistentSurface correctly im
  • Fit with existing patterns: Good, fits the codebase grain. selfImprovingSupervisor is explicitly NOT a new primitive — it's supervise() configured (self-improving-supervisor.ts:1-17), matching the repo's 'one entrypoint, authored content over code paths' rule. ralph is a defineStrategy exactly like refine/sample. The wrapper pattern in countingSurface/persistentSurface mirrors the existing surface-zoo style.
  • Real-world viability: Mostly holds up. Error path handled (one task throwing is counted unresolved, arm continues — ablation.ts:268-277). countingSurface.call tallies every call including errored ones (honest count). Per-arm counter.resetToolCounts() works because the outer arm loop and inner task loop are both sequential for...of with await (ablation.ts:157, 205) — no race today. Two real-world caveats: (1) th
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/ablation-suite/ablation.ts

  • console.log(

🟡 Cruft: magic number added examples/ablation-suite/ablation.ts

  •    pad(`${(r.latencyMs / 1000 / Math.max(1, r.n)).toFixed(0)}s`, 9) +
    

💰 Value Audit

🟡 README rewrite is orthogonal to the PR title [proportion] ``

The final commit 4f43a6b docs: rewrite README — simple, plain-language front door (371 → ~85 lines) is a real improvement (matches the AGENTS.md 'speak plainly' guidance, accurate entry points) but is unrelated to the ablation suite the PR title advertises. One-line FYI to split if desired; not a blocker — the rewrite is correct and the old README was bloated.

🟡 Three surface.call-intercepting wrappers now coexist [maintenance] ``

captureFailures (examples/ablation-suite/surface-worker.ts:54, local), countingSurface (counting-surface.ts:28), and persistentSurface (persistent-surface.ts:17) all wrap AgenticSurface to add one orthogonal concern (last-output / per-name-counts / shared-workspace). Not duplication — each has a distinct purpose and the comments explicitly design for composition (countingSurface(persistentSurface(base))). Flag only as a pattern to watch: if a fourth appears, a generic `observingSurface

🎯 Usefulness Audit

🟡 runAblation is purely sequential; runBenchmark has bounded concurrency [ergonomics] ``

ablation.ts:157 and :205 are for...of await loops with no concurrency, while src/runtime/run-benchmark.ts:111-127 ships a bounded pool() (default concurrency 3). For a 5-arm × n=24 × multi-worker-per-task run (each task spawns a driver + several LLM-tool-loop workers), wall-time is the bottleneck. Not blocking (n=24 was run), but a small available improvement: lift runBenchmark's pool pattern for the task loop, leaving the arm loop serial (arms share the counter).

🟡 Shared countingSurface instance would race under future arm-level parallelism [robustness]

ablation.ts:155 creates one counter reused across arms with resetToolCounts() between them (ablation.ts:204). Safe only because arms run serially. If a future change parallelizes arms (e.g. to fix the wall-time above), per-tool counts silently bleed across arms. Note for the reviewer: if concurrency is added, give each arm its own countingSurface(opts.environment) wrapper.
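
A minimal sketch of the per-arm wrapper suggestion, assuming a much smaller surface shape than the real AgenticSurface. `ToolSurface`, `makeCountingSurface`, and `runArmsInParallel` are hypothetical names for illustration and do not mirror the actual counting-surface.ts API.

```typescript
// Hypothetical minimal surface shape; the real AgenticSurface has more members.
interface ToolSurface {
  call(name: string, args: unknown): Promise<string>;
}

// Wrap a surface so every call is tallied by tool name, including calls that throw.
function makeCountingSurface(base: ToolSurface): ToolSurface & { counts: Map<string, number> } {
  const counts = new Map<string, number>();
  return {
    counts,
    async call(name: string, args: unknown): Promise<string> {
      // Count before delegating, so errored calls are still included.
      counts.set(name, (counts.get(name) ?? 0) + 1);
      return base.call(name, args);
    },
  };
}

// Two arms run concurrently; each owns its wrapper, so counts cannot bleed across arms.
async function runArmsInParallel(): Promise<Record<string, number>[]> {
  const base: ToolSurface = { call: async (name) => `ok:${name}` };
  const arms = [["read", "write", "run_tests"], ["run_tests"]];
  return Promise.all(
    arms.map(async (toolCalls) => {
      const surface = makeCountingSurface(base); // one wrapper per arm
      for (const tool of toolCalls) await surface.call(tool, {});
      const out: Record<string, number> = {};
      surface.counts.forEach((value, key) => {
        out[key] = value;
      });
      return out;
    }),
  );
}
```

With one wrapper per arm there is nothing to reset between arms, which removes the ordering dependency the finding describes.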

🟡 persistentSurface exported but no caller in this PR [integration]

persistent-surface.ts is imported nowhere (grep finds only a doc reference in counting-surface.ts:9). It's documented as composable with countingSurface and is the structural lever the ralph-vs-supervisor contrast relies on (fresh respawn over an accumulating file), so it's ahead-of-first-caller rather than dead. Fine to ship; flag so a human knows the next eval run is its first consumer.


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

| Pass | What it asks |
| --- | --- |
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260630T003310Z

@tangletools

❌ Needs Work — 4f43a6bb

Review health 100/100 · Reviewer score 54/100 · Confidence 75/100 · 4 findings (1 high, 1 medium, 2 low)

deepseek: Correctness 54 · Security 54 · Testing 54 · Architecture 54

Reviewer score is advisory once the run is complete and the verdict has no blockers.

Full multi-shot audit completed 3/3 planned shots over 28 changed files. Global verifier still owns final merge decision.

Blocking

🔴 HIGH bump_build has duplicated selection logic overwriting the increment result — examples/ablation-suite/fixtures/verkit/verkit.py

Lines 352-374: the build-string selection block (lines 352-359) is repeated identically at lines 363-370 AFTER the _increment_string call at line 362. The second block ([lines

Other

🟠 MEDIUM captureFailures regex produces malformed test names from verkit-env's combined report format — examples/ablation-suite/surface-worker.ts

Line 72: const body = /FAILING:\s*(.+)/i.exec(lastReport)?.[1] captures everything after 'FAILING: ', including any 'COLLECTION-BLOCKED' suffix when both sections are present (verkit-env.ts runTestsReport joins them with ' | '). Verified: the captured string becomes 'test1, test2, test3 | COLLECTION-BLOCKED ...', so the last comma-split entry is 'test3 | COLLECTION-BLOCKED (implement...)', which is passed to the analyst as a test name. long-coding-env's '(+N more)' suffix is captured the same way. Impact: the analyst receives some malformed 'test names', which the driver may misinterpret; this is data-quality noise, not a crash. Fix: extract only the
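
One way the stricter extraction could look, as a sketch only. The report shape is assumed from the finding above (a 'FAILING:' list, an optional '(+N more)' marker, and sections joined with ' | ') and has not been checked against the repo; `parseFailingTests` is a hypothetical name.

```typescript
// Assumed report shape (taken from the review finding, not verified against the repo):
//   "FAILING: a, b, c (+2 more) | COLLECTION-BLOCKED (...)"
function parseFailingTests(report: string): string[] {
  // Capture only up to the next "|" section separator, or to end of string.
  const body = /FAILING:\s*([^|]*)/i.exec(report)?.[1] ?? "";
  return body
    .replace(/\(\+\d+\s+more\)/gi, "") // drop the "(+N more)" truncation marker
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}

// Example inputs are synthetic, modelled on the strings quoted in the finding.
const combined = parseFailingTests("FAILING: test1, test2, test3 | COLLECTION-BLOCKED (implement x)");
const truncated = parseFailingTests("FAILING: test_a, test_b (+4 more)");
const none = parseFailingTests("all tests passed");
```

A caveat of this sketch: a test name that itself contains '|' would be cut short, so splitting the report on the exact ' | ' separator first may be the safer variant.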

🟡 LOW perTaskScore vector computed but unused in significance testing — examples/ablation-suite/ablation.ts

Line 294: perTaskScore is populated per task and stored in ArmResult.perTaskScore, but printAutopsy bootstraps only perTask (binary resolve) for significance; scoreMean is displayed in the table column yet never enters the paired-bootstrap CI. When tasks are so hard that binary resolve is near 0, the perTaskScore vector is the more informative basis for comparing arms. Not a bug, since the data is exported for external analysis, but worth noting.
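
For readers doing that external analysis, a paired percentile bootstrap over the continuous score vectors might look like the sketch below. It is an assumption-laden illustration: `pairedBootstrapCI` and `seededRandom` are hypothetical names, the repo's own bootstrap may differ, and the sample vectors are made up.

```typescript
// Small seeded PRNG (mulberry32-style) so the sketch is reproducible.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Paired bootstrap: resample per-task score differences with replacement and
// report the mean difference with a 95% percentile interval.
function pairedBootstrapCI(
  a: readonly number[],
  b: readonly number[],
  iterations = 2000,
  seed = 1,
): { meanDiff: number; lo: number; hi: number } {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("need equal-length, non-empty score vectors");
  }
  const diffs = a.map((x, i) => x - b[i]); // pairing: same task index in both arms
  const rand = seededRandom(seed);
  const means: number[] = [];
  for (let it = 0; it < iterations; it++) {
    let sum = 0;
    for (let k = 0; k < diffs.length; k++) {
      sum += diffs[Math.floor(rand() * diffs.length)];
    }
    means.push(sum / diffs.length);
  }
  means.sort((x, y) => x - y);
  const at = (q: number) => means[Math.min(means.length - 1, Math.floor(q * means.length))];
  const meanDiff = diffs.reduce((s, x) => s + x, 0) / diffs.length;
  return { meanDiff, lo: at(0.025), hi: at(0.975) };
}

// Synthetic pass-fraction scores for two arms on the same three tasks.
const ci = pairedBootstrapCI([0.5, 0.6, 0.7], [0.3, 0.4, 0.5]);
```

Because the resampling is over per-task differences, the interval reflects the pairing by seed/task rather than treating the two arms as independent samples.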

🟡 LOW long-coding-env.ts and long-coding-env-lite.ts share ~95% boilerplate as separate files — examples/ablation-suite/long-coding-env-lite.ts

Both files (1383 and 1189 lines) contain near-identical scaffolding: pick/pyLit/tline functions, AgenticSurface open/tools/call/score/close, runTestsReport, calibrate(), and the exported task supplier shape. The lite version is explicitly a 'lighter cut' per its header, but ~900 lines of surface boilerplate are duplicated. This is not a bug but a maintenance risk: a fix to the surface pattern (e.g., error handling, timeout) must be made in two places. Consider extracting shared surface scaffolding into a base builder function that accepts a function-builder list and a name.


tangletools · 2026-06-30T00:33:20Z · trace

@tangletools tangletools left a comment

❌ 1 Blocking Finding — 4f43a6bb

Full multi-shot audit completed 3/3 planned shots over 28 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-30T00:33:20Z · immutable trace

@tangletools

Premise check withheld merge — 4f43a6bb

Classifier flagged this PR as a premise claim (numeric pp/% delta + eval terminology). Confidence: high.

Recommend re-running the underlying eval with pairedEvalueSequence before merging.

  • Cited claim: +20.8pp
  • PR body excerpt: feat(examples): coordination-vs-compute eval suite + supervisor self-improvement analyst

Run:

pnpm eval:evolve --reps 5 --skip-mutation

Classifier rationale: Body cites 3 numeric claim(s) (+20.8pp, 8.3pp, 0pp) and eval-related terms appear in pr_body, review_findings. PR is asserting a measurable result that repair-pr cannot polish away — re-run the underlying evaluation before merging.


tangletools premise check · #415

A cross-harness × cross-model run of the WebCode benchmark (exa.ai/blog/webcode —
web-search-grounded coding tasks, post-Aug-2025 APIs, hidden-test graded) expressed
as the matrix axes + one runProfileMatrix call. The harness is the sandbox backend
(claude-code/codex/opencode/gemini), the model is the LLM, each cell runs in its own
sandbox with in-box web search and is scored on the hidden tests — no LLM judge.
The runtime does the per-cell sandbox + grading; this file is just the axes + the call.
tangletools previously approved these changes Jun 30, 2026

@tangletools tangletools left a comment

✅ Auto-approved drewstone PR — 360bd452

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-06-30T00:33:57Z

Benchmarked against Vercel AI SDK + Cloudflare Agents. Two gaps closed:
- README now showcases the 27 examples grouped by task (Cloudflare's pattern) —
  we had a strong examples/ dir but linked it as an afterthought.
- docs/README.md leads with a task-first 'Start here' path instead of opening on
  the insider architecture TOC.

@tangletools tangletools left a comment

✅ Auto-approved drewstone PR — 0831dfb6

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-06-30T00:37:11Z

@tangletools tangletools left a comment

⚪ Value Audit — audit-incomplete

Verdict audit-incomplete
Concerns 2 (2 low)
Heuristic 0.1s
Duplication 0.0s
Interrogation 6.4s (2 bridge agents)
Total 6.5s

💰 Value — error

value agent produced no parseable value-audit JSON.

  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge error: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content

🎯 Usefulness — error

usefulness agent produced no parseable value-audit JSON.

  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge error: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/ablation-suite/ablation.ts

  • console.log(

🟡 Cruft: magic number added examples/ablation-suite/ablation.ts

  •    pad(`${(r.latencyMs / 1000 / Math.max(1, r.n)).toFixed(0)}s`, 9) +
    

value-audit · 20260630T003903Z

@tangletools

⚠️ Review Incomplete — 0831dfb6

At least one required reviewer lane failed closed. No approval or request-changes review was published. This is a reviewer run failure, not a PR quality score.

Trigger a fresh review on the current PR head.

tangletools · 2026-06-30T00:46:53Z

tangletools previously approved these changes Jun 30, 2026

@tangletools tangletools left a comment

✅ Auto-approved drewstone PR — 0831dfb6

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-06-30T00:47:10Z

@tangletools tangletools left a comment

⚪ Value Audit — audit-incomplete

Verdict audit-incomplete
Concerns 2 (2 low)
Heuristic 0.1s
Duplication 0.0s
Interrogation 7.4s (2 bridge agents)
Total 7.5s

💰 Value — error

value agent produced no parseable value-audit JSON.

  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge error: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content

🎯 Usefulness — error

usefulness agent produced no parseable value-audit JSON.

  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge error: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/ablation-suite/ablation.ts

  • console.log(

🟡 Cruft: magic number added examples/ablation-suite/ablation.ts

  •    pad(`${(r.latencyMs / 1000 / Math.max(1, r.n)).toFixed(0)}s`, 9) +
    

value-audit · 20260630T004741Z

- webcode-matrix + ablation-suite now ship a runnable main/runner + a README each,
  so the only step left is to run them (creds + harnesses).
- 6 deep docs (architecture, learning-flywheel, eval-substrate, execution-model,
  concepts, glossary) gain a plain-language 'In plain terms' intro for cold readers.
- move the stale in-flight simplification-plan tracker under docs/research/.

@tangletools tangletools left a comment

✅ Auto-approved drewstone PR — 6e17daac

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-06-30T00:54:50Z

@tangletools tangletools left a comment

⚪ Value Audit — audit-incomplete

Verdict audit-incomplete
Concerns 2 (2 low)
Heuristic 0.1s
Duplication 0.0s
Interrogation 15.5s (2 bridge agents)
Total 15.6s

💰 Value — error

value agent produced no parseable value-audit JSON.

  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge error: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content

🎯 Usefulness — error

usefulness agent produced no parseable value-audit JSON.

  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge error: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/ablation-suite/ablation.ts

  • console.log(

🟡 Cruft: magic number added examples/ablation-suite/ablation.ts

  •    pad(`${(r.latencyMs / 1000 / Math.max(1, r.n)).toFixed(0)}s`, 9) +
    

value-audit · 20260630T005714Z

@drewstone drewstone merged commit 7e2f6d2 into main Jun 30, 2026
1 check passed
2 participants