feat(examples): coordination-vs-compute eval suite + supervisor self-improvement analyst#415
Adds the three-topology comparison to the ablation board:
- run_tests feedback + iterate-until-pass prompt in hard-coding-env
- ralph strategy: one persistent file, fresh context per round (re-read + continue)
- persistent-surface: artifact carries across a supervisor's workers
- serial workers (maxLiveWorkers:1) so a shared file is not raced
- long-coding-env: 54-function oscillation eval (9 shared seed-derived conventions) that forces context-overflow, so a single continuous context degrades while fresh respawns / a driver that accumulates do not
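A minimal sketch of the ralph idea described above (one persistent file, fresh context each round). The `FreshWorker` type, `ralphLoop`, and the stub worker are hypothetical names for illustration only; this is not the contents of ralph-strategy.ts, and the worker is modelled as a synchronous function purely to keep the sketch self-contained.

```typescript
// Hypothetical shapes for illustration; not the repo's ralph-strategy.ts.
type RoundResult = { file: string; passed: boolean };

// A worker sees ONLY the current file contents, never earlier conversation turns.
type FreshWorker = (fileContents: string, round: number) => RoundResult;

function ralphLoop(initialFile: string, worker: FreshWorker, maxRounds: number) {
  let file = initialFile; // the single persistent artifact
  for (let round = 1; round <= maxRounds; round++) {
    // Fresh context: the only state carried into the round is the file itself.
    const result = worker(file, round);
    file = result.file;
    if (result.passed) return { file, rounds: round, passed: true };
  }
  return { file, rounds: maxRounds, passed: false };
}

// Stub worker standing in for an LLM: resolves one TODO per round.
const stubWorker: FreshWorker = (contents) => {
  const next = contents.replace("TODO", "done"); // replaces the first match only
  return { file: next, passed: !next.includes("TODO") };
};

const ralphOutcome = ralphLoop("TODO TODO TODO", stubWorker, 5);
console.log(ralphOutcome);
```

The point of the shape is that nothing but `file` survives between rounds, which is what distinguishes this arm from a continuous context.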
… significant) long-coding-env-lite: 28 tiny functions, 196 tests, 9 shared seed-derived conventions (10 dual-coupled) — difficulty from cases not file size, so the slow supervisor arm is tractable for a powered run. Confirms the three-way (continuous vs ralph vs supervisor) at n=24: supervisor 46% vs continuous 25% vs ralph 17%; supervisor - continuous = +20.8pp, 95% CI [8,38], Pareto-dominant (won every seed continuous won plus 5 more, lost none). Ralph loses, so the win is the driver's accumulated memory across fresh workers, not respawn. ~3-4x cost (a capability win).
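For readers checking the arithmetic: 11/24 ≈ 46%, 6/24 = 25%, and the paired difference is 5/24 ≈ +20.8pp. The sketch below shows one way a paired-bootstrap percentile interval over per-seed outcomes could be computed. The per-seed vectors are illustrative stand-ins built only from the reported counts (the real per-seed results are not reproduced here), and `pairedBootstrapCI` is a hypothetical helper, not the repo's `pairedBootstrap`. The interval it prints depends on the resampling seed and is not claimed to reproduce the reported [8,38] exactly.

```typescript
// Deterministic PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap over PAIRED per-seed differences (arm a minus arm b).
function pairedBootstrapCI(a: number[], b: number[], resamples = 10000, seed = 1) {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("paired arms need equal, non-empty length");
  }
  const n = a.length;
  const diffs = a.map((x, i) => x - (b[i] as number));
  const point = diffs.reduce((s, d) => s + d, 0) / n;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    // Resample seeds with replacement, keeping each seed's pair together.
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)] as number;
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor(0.025 * (resamples - 1))] as number;
  const hi = means[Math.ceil(0.975 * (resamples - 1))] as number;
  return { point, lo, hi };
}

// Illustrative outcomes consistent with the reported counts only:
// continuous resolves 6 of 24 seeds; supervisor resolves those 6 plus 5 more.
const continuousArm: number[] = Array.from({ length: 24 }, (_, i) => (i < 6 ? 1 : 0));
const supervisorArm: number[] = Array.from({ length: 24 }, (_, i) => (i < 11 ? 1 : 0));

const ci = pairedBootstrapCI(supervisorArm, continuousArm);
console.log(`diff=${(ci.point * 100).toFixed(1)}pp`, ci);
```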
Every arm now reports the complete column set — resolve (k/n), tokens in/out, LLM calls (completions, incl. the supervisor's conserved-pool iterations), refine-shots, total $, $/task, latency/task — plus optional per-tool call counts via a countingSurface wrapper (ArmResult.toolCalls). selfImprovingSupervisor now returns completions (spentTotal.iterations) so the supervisor arm is no longer a 0 in the calls column. printAutopsy widened to show all of it.
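A rough sketch of what a per-tool counting wrapper can look like. The `ToolSurface` interface here is a deliberately tiny hypothetical stand-in (the real `AgenticSurface` seam is not shown in this thread and has more members), and `countingSurfaceSketch` is not the code in counting-surface.ts. The `write_file` tool name is also made up; `run_tests` and `resetToolCounts` are names taken from the PR text.

```typescript
// Hypothetical minimal surface; the real AgenticSurface seam is richer.
interface ToolSurface {
  call(tool: string, args: unknown): string;
}

function countingSurfaceSketch(base: ToolSurface) {
  const counts = new Map<string, number>();
  return {
    call(tool: string, args: unknown): string {
      // Count before delegating so calls that throw are still tallied.
      counts.set(tool, (counts.get(tool) ?? 0) + 1);
      return base.call(tool, args);
    },
    toolCalls(): Record<string, number> {
      const out: Record<string, number> = {};
      counts.forEach((value, key) => {
        out[key] = value;
      });
      return out;
    },
    resetToolCounts(): void {
      counts.clear();
    },
  };
}

// Fake environment: knows run_tests, throws on anything else.
const fakeEnv: ToolSurface = {
  call(tool) {
    if (tool === "run_tests") return "3 passed";
    throw new Error(`unknown tool: ${tool}`);
  },
};

const counted = countingSurfaceSketch(fakeEnv);
counted.call("run_tests", {});
counted.call("run_tests", {});
try {
  counted.call("write_file", {});
} catch (_err) {
  // errored calls still count
}
const tally = counted.toolCalls();
console.log(tally);
```

Because the counter is shared mutable state, a wrapper like this is only safe to reuse across arms when the arms run one after another and the counts are reset in between.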
…-call counting verkit-env: reconstruct a renamed python-semver (1 class + 16 fns, 257 real pytest tests, ~276 impl lines) from its tests — the real-code analog of the synthetic oscillation eval, for the drown-vs-accumulate regime. Calibrated stub 0/257 -> real 257/257. counting-surface tallies per-tool calls; wired into runAblation so every arm reports tool counts. Contamination: module/class renamed to defeat verbatim recall; spec-covered surface still partly known (documented in fixtures/verkit/PROVENANCE.md).
Adds scoreMean (mean pass-fraction [0,1]) and perTaskScore (per-task gradient) to ArmResult + the autopsy — the metric that matters when a task is hard enough that binary all-tests-pass resolve is mostly 0 (e.g. a real 257-test library). printAutopsy gains a score column.
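To make the distinction between binary resolve and the pass-fraction score concrete, here is a small sketch. The `TaskOutcome` shape and `summarizeArm` are hypothetical; only the names `scoreMean` and `perTaskScore` come from the PR text, and the numbers are made up for the example.

```typescript
// Hypothetical per-task record; not the real ArmResult type.
interface TaskOutcome {
  passed: number; // tests passed on this task
  total: number; // fixed test denominator for this task
}

function summarizeArm(tasks: TaskOutcome[]) {
  // Per-task gradient: fraction of tests passed, in [0, 1].
  const perTaskScore = tasks.map((t) => (t.total === 0 ? 0 : t.passed / t.total));
  // Binary resolve: a task counts only if every test passes.
  const resolved = tasks.filter((t) => t.total > 0 && t.passed === t.total).length;
  const scoreMean =
    perTaskScore.length === 0
      ? 0
      : perTaskScore.reduce((s, x) => s + x, 0) / perTaskScore.length;
  return { resolved, n: tasks.length, scoreMean, perTaskScore };
}

// Made-up outcomes: one fully solved, one half solved, one untouched.
const armSummary = summarizeArm([
  { passed: 10, total: 10 },
  { passed: 5, total: 10 },
  { passed: 0, total: 10 },
]);
console.log(`resolve ${armSummary.resolved}/${armSummary.n}, scoreMean ${armSummary.scoreMean}`);
```

Here resolve reports 1/3 while the mean score is 0.5, which is the extra signal the score column is meant to expose on hard tasks.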
…arate thing Collapse the 'self-improving' concept onto supervise()'s existing analyst knob, and make the lens actually rich:
- failuresAnalyst (replaces score-only progressAnalyst): surfaces the worker's actual still-FAILING tests so the driver targets the persistently-hard cases across workers. This is real within-run self-improvement, not just 'didn't resolve'.
- surface-worker captures the worker's last run_tests -> SurfaceWorkerOut.failing via a seam-local capture wrapper (no re-open / refcount hazard).
- driver prompt targets + accumulates the failing tests across spawns.
- docstring: ONE entrypoint (supervise); two timescales (within-run = this analyst, across-run = improve()/selfImprove); the analyst is the only knob that defines what self-improvement means.
Verified: failing tests reach the driver finding ($0 check); flash smoke clean.
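One way to picture "accumulates the failing tests across spawns" is a tally of how many workers each test survived. The sketch below is an assumption about the idea, not the `failuresAnalyst` implementation; `persistentFailures`, the `minWorkers` threshold, and the test names are all hypothetical.

```typescript
// Hypothetical: each inner array is the still-failing test names one worker reported.
function persistentFailures(failingPerWorker: string[][], minWorkers = 2): string[] {
  const seen = new Map<string, number>();
  for (const failing of failingPerWorker) {
    // De-duplicate within a worker so one worker cannot count a test twice.
    const unique = failing.filter((name, i) => failing.indexOf(name) === i);
    for (const name of unique) seen.set(name, (seen.get(name) ?? 0) + 1);
  }
  const ranked: Array<[string, number]> = [];
  seen.forEach((count, name) => {
    if (count >= minWorkers) ranked.push([name, count]);
  });
  // Most persistent first; ties broken alphabetically for a stable order.
  ranked.sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]));
  return ranked.map(([name]) => name);
}

const hardCases = persistentFailures([
  ["test_parse", "test_bump", "test_cmp"], // worker 1
  ["test_bump", "test_cmp"], // worker 2
  ["test_cmp"], // worker 3
]);
console.log(hardCases);
```

A driver prompt could then name the tests in `hardCases` explicitly when it spawns the next worker.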
tangletools
✅ Auto-approved drewstone PR — 9f1b5b8f
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-06-30T00:03:20Z
tangletools
⚪ Value Audit — audit-incomplete
| Verdict | audit-incomplete |
| Concerns | 2 (2 low) |
| Heuristic | 0.1s |
| Duplication | 0.0s |
| Interrogation | 8.4s (2 bridge agents) |
| Total | 8.5s |
💰 Value — error
value agent produced no parseable value-audit JSON.
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge error: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content
🎯 Usefulness — error
usefulness agent produced no parseable value-audit JSON.
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge error: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/ablation-suite/ablation.ts
- console.log(
🟡 Cruft: magic number added examples/ablation-suite/ablation.ts
pad(`${(r.latencyMs / 1000 / Math.max(1, r.n)).toFixed(0)}s`, 9) +
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
tangletools
✅ Approved — 5 non-blocking findings — 9f1b5b8f
Full multi-shot audit completed 2/2 planned shots over 27 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-30T00:09:26Z · immutable trace
Premise check withheld merge —
- surface-worker captureFailures: guard the regex group (m[1] is possibly undefined under strict), which fixes the CI examples typecheck failure.
- add --color=no to the 6 pytest invocations missing it (hard/long/long-lite) so captured run_tests output carries no ANSI codes (review finding).
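A sketch of the guarded parse, assuming the `FAILING: name1, name2` report line and the `/FAILING:\s*(.+)/i` pattern quoted in the review. `parseFailing` is a hypothetical name, not the repo's `captureFailures`. The ANSI-stripping step is an extra defensive assumption added for the example; the PR's actual fix is `--color=no` on the pytest invocations.

```typescript
// Matches SGR colour sequences such as "\u001b[31m" (assumption: colour codes only).
const ANSI_ESCAPES = /\u001b\[[0-9;]*m/g;

function parseFailing(runTestsOutput: string): string[] {
  const clean = runTestsOutput.replace(ANSI_ESCAPES, "");
  const m = /FAILING:\s*(.+)/i.exec(clean);
  // Under strict index checks the capture group may be undefined, so guard it.
  const group = m?.[1];
  if (group === undefined) return [];
  return group
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

const plainReport = parseFailing("12 passed, 2 failed\nFAILING: test_a, test_b\n");
const colouredReport = parseFailing("\u001b[31mFAILING: test_a\u001b[0m");
const noReport = parseFailing("14 passed");
console.log(plainReport, colouredReport, noReport);
```

Note the failure mode the review calls out: if an env changes its report wording, a parser like this returns an empty list rather than raising, so the analyst silently loses its signal.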
tangletools
✅ Auto-approved drewstone PR — 3715e00c
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-06-30T00:16:21Z
tangletools
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 6 (1 medium-concern, 2 low, 3 weak-concern) |
| Heuristic | 0.1s |
| Duplication | 0.0s |
| Interrogation | 222.0s (2 bridge agents) |
| Total | 222.1s |
💰 Value — sound-with-nits
A contamination-proof ablation harness that empirically validates supervisor > single-agent (+20.8pp, n=24, CI excludes 0) by composing existing primitives (supervise/runAgentic/defineStrategy); ships with three small in-grain surface wrappers and a real-library fixture, plus three weak nits (partia
- What it does: Adds three things, all under examples/ablation-suite/: (1) an ablation runner (ablation.ts) that compares topologies (single=refine, fanout=sample, fanout-refine=sampleThenRefine, ralph=fresh-respawn) PLUS driver-brain arms (driverSteer=selfImprovingSupervisor via supervise(); optimize='gepa'=GEPA-tune the driver prompt on a disjoint train slice, freeze, then drive) at matched budget, with paired-
- Goals it achieves: Stated in the code: (a) prove the supervisor topology earns its cost on weak models — findings show +20.8pp over single at matched budget, CI [8,38], n=24, Pareto-dominant; (b) refute the compute-confound (continuous at 2x budget still 0% — it can't convert extra compute, so the win isn't brute force); (c) localize the win to the driver's accumulated memory, not fresh respawn (ralph loses -8.3pp);
- Assessment: Sound and in the grain of the codebase. The change composes existing primitives — supervise() from src/runtime/supervise/, defineStrategy from src/runtime/strategy.ts:789, runAgentic — rather than reinventing them. self-improving-supervisor.ts:2 explicitly frames itself as 'NOT a separate primitive: it is supervise() configured.' The new wrappers (persistentSurface, countingSurface, ralph-strategy
- Better / existing approach: Searched src/runtime/ for existing equivalents. Found: (1) src/runtime/run-benchmark.ts:132 runBenchmark already does paired-bootstrap + Pareto + perStrategy/perTask tables for Strategy arms — but it cannot express the driverSteer (supervise()-based) or optimize:'gepa' arms, which are the actual point of this ablation, so the unified custom runner is justified; the topology-only arms (single/fanou
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🎯 Usefulness — sound-with-nits
A well-wired ablation suite over the existing AgenticSurface/Strategy/supervise() seams; one wrapper file (persistentSurface) is genuinely dead and its absence means a stated supervisor mechanism is not actually firing.
- Integration: Reachable and wired into the substrate's documented seams. `ablation.ts` is the entrypoint (main() at line 358, runnable via `pnpm tsx`); it wraps the env ONCE in `countingSurface` (ablation.ts:155) and threads `counter` into both `runAgentic` (line 247) and `selfImprovingSupervisor` (line 213). `ralph` is registered in `topologyStrategy` (ablation.ts:74) and reachable via the `topology:'ralph'` k
- Fit with existing patterns: Excellent fit. The wrappers (`countingSurface`, `persistentSurface`) compose transparently over `AgenticSurface` (strategy.ts:76-83), exactly mirroring the test fixture's own `shotCountingSurface` (tests/loops/strategy-evolution.test.ts:25). `ralph` is authored with `defineStrategy`, the sanctioned compact-strategy path (strategy.ts:789). `selfImprovingSupervisor` is correctly NOT a new primitive
- Real-world viability: Robust on the happy path and documented error paths. Each env ships a $0 `calibrate()` self-check proving stub→0 and real→257 (verkit-env.ts:293-342; hard-coding-env.ts:603-625). The ablation runner catches per-task throws and counts them unresolved (ablation.ts:268-277) so one network failure cannot lose the whole arm. pytest invocations use `--continue-on-collection-errors` and a fixed denominat
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/ablation-suite/ablation.ts
- console.log(
🟡 Cruft: magic number added examples/ablation-suite/ablation.ts
pad(`${(r.latencyMs / 1000 / Math.max(1, r.n)).toFixed(0)}s`, 9) +
💰 Value Audit
🟡 long-coding-env.ts and -lite.ts duplicate ~1500 lines of generator machinery [duplication]
long-coding-env.ts (1383 lines) and long-coding-env-lite.ts (1189 lines) define pick/pyLit/tline/Conv/conventions IDENTICALLY (verified: both define function conventions(seed) at line ~108-112, both define pyLit at line 70-75, both copy pick verbatim). The lite version's header says it's 'SAME surface EXACTLY' with 'half the functions.' A shared env-generator module (conventions + pick + pyLit + tline + the function-table parameterization) parameterized by a function-count knob would cut ~1200 l
🟡 captureFailures regex-couples the seam to three env files' output format [duplication]
surface-worker.ts:71-79 parses 'FAILING: name1, name2' out of the run_tests output via regex (/FAILING:\s*(.+)/i). That string is emitted independently by runTestsReport in hard-coding-env.ts:340, long-coding-env.ts, long-coding-env-lite.ts, and verkit-env.ts:141. If any env changes its report wording, captureFailures silently returns [] and the failuresAnalyst loses its signal (no error, just degraded steering). Two cleaner options: (a) surface.call could return a structured {report, failing?:
🟡 Topology-only arm loop partially reinvents src/runtime/run-benchmark.ts [duplication]
ablation.ts:246-267 inlines a per-task Strategy-driver loop (runAgentic per task, accumulate ti/to/usd/ms/shots/perTask) that is functionally the same as runBenchmark's cell loop at src/runtime/run-benchmark.ts:138-187. runBenchmark already provides pairedBootstrap, paretoFrontier, perStrategy summaries, and error-resilient per-task tracking. The ablation's driverSteer/optimize arms cannot use runBenchmark (supervise() topology is outside its Strategy seam), which justifies the custom runner — b
🎯 Usefulness Audit
🟠 persistentSurface is dead surface: defined, exported, zero call sites [integration]
grep across the whole repo for `persistentSurface(` returns only its own definition (persistent-surface.ts:17) and a docstring composition example (counting-surface.ts:9). No file imports it. ablation.ts:155 wraps the env in `countingSurface(opts.environment)` only, never in `persistentSurface`. Each supervisor worker therefore opens its own workspace (surface-worker.ts:117 → refine → depthStrategy → surface.open at strategy.ts:587), so the 'build-on-progress across fresh-context workers' mecha
✅ No Blockers —
tangletools
✅ Approved — 3 non-blocking findings — 3715e00c
Full multi-shot audit completed 2/2 planned shots over 27 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-30T00:22:46Z · immutable trace
Premise check withheld merge —
…ines) Lead with what it is and the three things you do with it (chat turn / supervise / improve), each with a tiny example, then four plain-language bullets on how it works and links to the deep docs. Drops the dense internal vocabulary from the front page (it lives in docs/).
tangletools
✅ Auto-approved drewstone PR — 4f43a6bb
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-06-30T00:24:45Z
tangletools
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 7 (2 low, 5 weak-concern) |
| Heuristic | 0.1s |
| Duplication | 0.0s |
| Interrogation | 399.7s (2 bridge agents) |
| Total | 399.8s |
💰 Value — sound-with-nits
A rigorous, cost-aware ablation of three agent topologies with statistical receipts and a contamination-proof eval substrate — all in examples/, built on existing substrate primitives; the bundled README rewrite is slightly off-title but genuinely improves the front door.
- What it does: Adds a cost-aware ablation runner that compares three agent topologies at matched budget: continuous (`refine`), ralph (fresh-context-per-round, state in the file), and supervisor (a driver brain steering serial workers with a failures analyst), over seed-derived, contamination-proof oscillation coding tasks and a real-library reconstruction (python-semver, renamed to `verkit`, 257 tests). Every
- Goals it achieves: (1) Prove, with paired-bootstrap CI, that the supervisor's coordination beats a single continuous worker at matched budget on weak/mid models, and refute the compute-confound (continuous at 2x budget still 0%). (2) Localize the win: ralph loses, so fresh-respawn alone is not the mechanism; the driver's accumulated memory is. (3) Show the model-relative boundary (at frontier, coordination adds 0pp
- Assessment: Coherent and in the grain of the codebase. It composes existing substrate primitives exclusively (`supervise()`, `refine`, `defineStrategy`, the `AgenticSurface` seam) and adds no `src/` surface (verified: `git diff --stat -- 'src/*'` is empty). The new pieces are minimal and seam-aligned: `ralph` is a 43-line `defineStrategy` body in the same shape as `adaptiveRefine`/`sampleThenRefine`; `count
- Better / existing approach: Looked for an existing equivalent before flagging none. Searched src/ for `persistentSurface|countingSurface|surface-zoo|withSurface|observingSurface|tracingSurface`: no hits. The closest existing pattern is `captureFailures` inside `examples/ablation-suite/surface-worker.ts:54`, which intercepts `surface.call` to remember the last `run_tests` output for the analyst; it is intentionally kept loca
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🎯 Usefulness — sound-with-nits
A calibrated, end-to-end-runnable ablation suite correctly wired into the substrate's real extension points (supervise, defineStrategy, AgenticSurface, selfImprove); ships ahead-of-first-caller utilities but nothing dead.
- Integration: Tight. Every new module reaches a real substrate export: `ralph` uses `defineStrategy` (src/runtime/strategy.ts:789), `selfImprovingSupervisor` configures `supervise()` (src/runtime/supervise/supervise.ts:116) with a BYO-executor seam (surface-worker.ts:175-179 → resolved first by ExecutorRegistry.resolve at src/runtime/supervise/runtime.ts:1235), `countingSurface`/`persistentSurface` correctly im
- Fit with existing patterns: Good, fits the codebase grain. `selfImprovingSupervisor` is explicitly NOT a new primitive; it's `supervise()` configured (self-improving-supervisor.ts:1-17), matching the repo's 'one entrypoint, authored content over code paths' rule. `ralph` is a `defineStrategy` exactly like `refine`/`sample`. The wrapper pattern in `countingSurface`/`persistentSurface` mirrors the existing surface-zoo style.
- Real-world viability: Mostly holds up. Error path handled (one task throwing is counted unresolved, arm continues; ablation.ts:268-277). `countingSurface.call` tallies every call including errored ones (honest count). Per-arm `counter.resetToolCounts()` works because the outer arm loop and inner task loop are both sequential `for...of` with `await` (ablation.ts:157, 205), so no race today. Two real-world caveats: (1) th
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/ablation-suite/ablation.ts
- console.log(
🟡 Cruft: magic number added examples/ablation-suite/ablation.ts
pad(`${(r.latencyMs / 1000 / Math.max(1, r.n)).toFixed(0)}s`, 9) +
💰 Value Audit
🟡 README rewrite is orthogonal to the PR title [proportion]
The final commit `4f43a6b docs: rewrite README — simple, plain-language front door (371 → ~85 lines)` is a real improvement (matches the AGENTS.md 'speak plainly' guidance, accurate entry points) but is unrelated to the ablation suite the PR title advertises. One-line FYI to split if desired; not a blocker: the rewrite is correct and the old README was bloated.
🟡 Three surface.call-intercepting wrappers now coexist [maintenance]
`captureFailures` (examples/ablation-suite/surface-worker.ts:54, local), `countingSurface` (counting-surface.ts:28), and `persistentSurface` (persistent-surface.ts:17) all wrap `AgenticSurface` to add one orthogonal concern (last-output / per-name-counts / shared-workspace). Not duplication: each has a distinct purpose and the comments explicitly design for composition (`countingSurface(persistentSurface(base))`). Flag only as a pattern to watch: if a fourth appears, a generic `observingSurface
🎯 Usefulness Audit
🟡 runAblation is purely sequential; runBenchmark has bounded concurrency [ergonomics]
ablation.ts:157 and :205 are `for...of await` loops with no concurrency, while src/runtime/run-benchmark.ts:111-127 ships a bounded `pool()` (default concurrency 3). For a 5-arm × n=24 × multi-worker-per-task run (each task spawns a driver + several LLM-tool-loop workers), wall-time is the bottleneck. Not blocking (n=24 was run), but a small available improvement: lift `runBenchmark`'s pool pattern for the task loop, leaving the arm loop serial (arms share the counter).
🟡 Shared countingSurface instance would race under future arm-level parallelism [robustness]
ablation.ts:155 creates one `counter` reused across arms with `resetToolCounts()` between them (ablation.ts:204). Safe only because arms run serially. If a future change parallelizes arms (e.g. to fix the wall-time above), per-tool counts silently bleed across arms. Note for the reviewer: if concurrency is added, give each arm its own `countingSurface(opts.environment)` wrapper.
🟡 persistentSurface exported but no caller in this PR [integration]
persistent-surface.ts is imported nowhere (grep finds only a doc reference in counting-surface.ts:9). It's documented as composable with countingSurface and is the structural lever the ralph-vs-supervisor contrast relies on (fresh respawn over an accumulating file), so it's ahead-of-first-caller rather than dead. Fine to ship; flag so a human knows the next eval run is its first consumer.
❌ Needs Work —
tangletools
❌ 1 Blocking Finding — 4f43a6bb
Full multi-shot audit completed 3/3 planned shots over 28 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-30T00:33:20Z · immutable trace
Premise check withheld merge —
A cross-harness × cross-model run of the WebCode benchmark (exa.ai/blog/webcode — web-search-grounded coding tasks, post-Aug-2025 APIs, hidden-test graded) expressed as the matrix axes + one runProfileMatrix call. The harness is the sandbox backend (claude-code/codex/opencode/gemini), the model is the LLM, each cell runs in its own sandbox with in-box web search and is scored on the hidden tests — no LLM judge. The runtime does the per-cell sandbox + grading; this file is just the axes + the call.
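As a rough illustration of "the matrix axes", the sketch below expands a harness axis and a model axis into one cell per pair. It does not call or imitate `runProfileMatrix`, whose signature is not shown in this thread; `expandMatrix` and `Cell` are hypothetical, the harness names are the ones listed above, and the model names are placeholders.

```typescript
// Harness names come from the commit text; model names are placeholders.
const harnesses = ["claude-code", "codex", "opencode", "gemini"] as const;
const models = ["model-a", "model-b"] as const;

interface Cell {
  harness: string;
  model: string;
}

// Cartesian product of the two axes: one cell (one sandboxed run) per pair.
function expandMatrix(hs: readonly string[], ms: readonly string[]): Cell[] {
  const cells: Cell[] = [];
  for (const harness of hs) {
    for (const model of ms) cells.push({ harness, model });
  }
  return cells;
}

const matrixCells = expandMatrix(harnesses, models);
console.log(`${matrixCells.length} cells`, matrixCells[0]);
```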
tangletools
✅ Auto-approved drewstone PR — 360bd452
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-06-30T00:33:57Z
Benchmarked against Vercel AI SDK + Cloudflare Agents. Two gaps closed:
- README now showcases the 27 examples grouped by task (Cloudflare's pattern); we had a strong examples/ dir but linked it as an afterthought.
- docs/README.md leads with a task-first 'Start here' path instead of opening on the insider architecture TOC.
tangletools
✅ Auto-approved drewstone PR — 0831dfb6
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-06-30T00:37:11Z
tangletools
⚪ Value Audit — audit-incomplete
| Verdict | audit-incomplete |
| Concerns | 2 (2 low) |
| Heuristic | 0.1s |
| Duplication | 0.0s |
| Interrogation | 6.4s (2 bridge agents) |
| Total | 6.5s |
💰 Value — error
value agent produced no parseable value-audit JSON.
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge error: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content
🎯 Usefulness — error
usefulness agent produced no parseable value-audit JSON.
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge error: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/ablation-suite/ablation.ts
- console.log(
🟡 Cruft: magic number added examples/ablation-suite/ablation.ts
pad(`${(r.latencyMs / 1000 / Math.max(1, r.n)).toFixed(0)}s`, 9) +
tangletools
✅ Auto-approved drewstone PR — 0831dfb6
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-06-30T00:47:10Z
tangletools
⚪ Value Audit — audit-incomplete
| Verdict | audit-incomplete |
| Concerns | 2 (2 low) |
| Heuristic | 0.1s |
| Duplication | 0.0s |
| Interrogation | 7.4s (2 bridge agents) |
| Total | 7.5s |
💰 Value — error
value agent produced no parseable value-audit JSON.
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge error: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content
🎯 Usefulness — error
usefulness agent produced no parseable value-audit JSON.
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge error: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/ablation-suite/ablation.ts
- console.log(
🟡 Cruft: magic number added examples/ablation-suite/ablation.ts
pad(`${(r.latencyMs / 1000 / Math.max(1, r.n)).toFixed(0)}s`, 9) +
- webcode-matrix + ablation-suite now ship a runnable main/runner + a README each, so the only step left is to run them (creds + harnesses).
- 6 deep docs (architecture, learning-flywheel, eval-substrate, execution-model, concepts, glossary) gain a plain-language 'In plain terms' intro for cold readers.
- move the stale in-flight simplification-plan tracker under docs/research/.
tangletools
✅ Auto-approved drewstone PR — 6e17daac
This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: drewstone_author · 2026-06-30T00:54:50Z
tangletools
⚪ Value Audit — audit-incomplete
| Verdict | audit-incomplete |
| Concerns | 2 (2 low) |
| Heuristic | 0.1s |
| Duplication | 0.0s |
| Interrogation | 15.5s (2 bridge agents) |
| Total | 15.6s |
💰 Value — error
value agent produced no parseable value-audit JSON.
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge error: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content
🎯 Usefulness — error
usefulness agent produced no parseable value-audit JSON.
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge error: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content; opencode/deepseek/deepseek-v4-pro: bridge stream ended without value-audit content
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/ablation-suite/ablation.ts
- console.log(
🟡 Cruft: magic number added examples/ablation-suite/ablation.ts
pad(`${(r.latencyMs / 1000 / Math.max(1, r.n)).toFixed(0)}s`, 9) +
What
A cost-aware ablation suite under `examples/ablation-suite/` comparing three agent topologies at matched budget on contamination-proof coding tasks, plus the supervisor's real-time self-improvement analyst. All `examples/`; no `src/` changes.
Topologies (the lever under test):
- Continuous (`refine`): one worker, carries its conversation across shots.
- Ralph (`ralph-strategy.ts`): one persistent file, FRESH context each round (re-read + continue).
- Supervisor (`supervise()` via `selfImprovingSupervisor`): a driver brain spawning serial workers, with a `failuresAnalyst` feeding the driver each worker's still-failing tests.
Evals: seed-derived, contamination-proof oscillation tasks (`long-coding-env*.ts`: shared conventions → fix-one-break-another) + a real-library reconstruction (`verkit-env.ts`, renamed python-semver, 257 real tests).
Observability (`counting-surface.ts`): every arm reports resolve, pass-fraction score, tokens in/out, LLM calls, refine-shots, $, latency, and per-tool call counts.
Findings (receipts)
Architecture
`selfImprovingSupervisor` is NOT a separate primitive: it is `supervise()` configured. Self-improvement = the authored `failuresAnalyst` (surfaces the worker's failing tests so the driver targets persistently-hard cases), fed to `supervise()`'s existing `analyzeOnSettle` knob. Two timescales, one entrypoint each: within-run (the analyst) and across-run (`improve()`/`selfImprove` wrapping it, in `gepa-driver-prompt.ts`).
tsc + biome clean; failing-tests-reach-driver verified by a $0 check.