fix(bench): isolate Terminal-Bench venv #414

Merged
drewstone merged 1 commit into main from fix/terminal-bench-isolated-venv
Jun 29, 2026

@drewstone

Contributor

Summary

  • moves Terminal-Bench imports, dataset loading, and tb run execution into an isolated venv (TERMINAL_BENCH_VENV / .venv-terminal-bench)
  • mirrors the existing COMMIT0_VENV pattern so AppWorld can keep its shared Pydantic 1 env while Terminal-Bench and LiteLLM use Pydantic 2
  • adds an offline adapter test asserting that a missing isolated venv path fails loudly
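
For readers without the diff open, the venv lookup and fail-loud behavior described above could look roughly like the sketch below. This is not the code from this PR: the helper name resolveTerminalBenchPython, the precedence of TERMINAL_BENCH_VENV over a default .venv-terminal-bench directory, and the POSIX bin/python layout are all assumptions made for illustration.

```typescript
import { existsSync } from "node:fs";
import { join, resolve } from "node:path";

// Hypothetical helper, not the PR's implementation.
// Assumption: TERMINAL_BENCH_VENV, when set, overrides a default
// ".venv-terminal-bench" directory under the bench root.
function resolveTerminalBenchPython(
  benchRoot: string,
  env: Record<string, string | undefined> = process.env,
  exists: (path: string) => boolean = existsSync,
): string {
  const venvDir = resolve(
    env.TERMINAL_BENCH_VENV || join(benchRoot, ".venv-terminal-bench"),
  );
  // Assumption: POSIX venv layout (Windows venvs use Scripts\python.exe).
  const python = join(venvDir, "bin", "python");
  if (!exists(python)) {
    // Fail loudly instead of silently falling back to the shared env.
    throw new Error(
      `Terminal-Bench venv not found: expected ${python}. ` +
        "Create the venv or set TERMINAL_BENCH_VENV.",
    );
  }
  return python;
}
```

Injecting env and exists keeps the check testable offline: a test can pass a stub that always returns false and assert that the error names the expected path, with no real venv or network access.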

Why

Agent-lab R139 reached the Terminal-Bench adapter but failed preflight before scoring: the shared bench/.venv had Pydantic 1.10.26 installed for AppWorld, while Terminal-Bench and LiteLLM require Pydantic >=2.10. Isolating the venv unblocks the public shell-task readiness lane without any lab-side special case.
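
To make the conflict concrete, here is a minimal sketch of a minimum-version check of the kind a preflight might run. The function names are hypothetical, it handles only plain dotted numeric versions (no pre-release tags), and it is not taken from the adapter.

```typescript
// Hypothetical helpers for illustration only.
function parseVersion(version: string): number[] {
  // Non-numeric segments are treated as 0 in this simplified sketch.
  return version.split(".").map((part) => Number.parseInt(part, 10) || 0);
}

function satisfiesMinimum(installed: string, minimum: string): boolean {
  const a = parseVersion(installed);
  const b = parseVersion(minimum);
  const length = Math.max(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    // Compare numerically so that 2.9 sorts below 2.10.
    if (x !== y) return x > y;
  }
  return true; // versions are equal
}

console.log(satisfiesMinimum("1.10.26", "2.10")); // false: the shared AppWorld env
console.log(satisfiesMinimum("2.10.0", "2.10")); // true
```

No single environment can hold both Pydantic 1.10.26 and a version satisfying >=2.10, which is why a separate venv is needed and a version pin in the shared one would not work.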

Refs #413

Verification

  • pnpm install --frozen-lockfile
  • pnpm --dir bench install --frozen-lockfile
  • pnpm --dir bench exec tsx --test src/benchmarks/terminal-bench.test.mts
  • pnpm --dir bench exec tsx --test src/benchmarks/terminal-bench.test.mts src/benchmarks/commit0.test.mts src/benchmarks/appworld.test.mts
  • pnpm --dir bench exec tsc --noEmit --pretty false
  • pnpm typecheck
  • pnpm test
  • pnpm build

@tangletools tangletools left a comment

Contributor

✅ Auto-approved drewstone PR — e647c27f

This PR was opened by the trusted drewstone account.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: drewstone_author · 2026-06-29T23:02:23Z

@drewstone drewstone merged commit 6ec6939 into main Jun 29, 2026
1 check passed
2 participants