fix: independent temporal-leakage verifier for search_web by ethancjackson · Pull Request #161 · VectorInstitute/agentic-forecasting

ethancjackson · 2026-07-01T10:57:42Z

Summary

search_web (the shared news/context-retrieval tool in agent_factory.py, used by all 5 forecasting domains) was leaking post-cutoff information into historical backtests despite an explicit natural-language cutoff instruction — confirmed directly in a Langfuse trace (a5d33fccd8c99929c3784f774dda34e6, cutoff 2026-06-01, result citing WTI/Brent prices "by mid-June 2026"). The existing ContextRetrievalConfig docstring already called this out as "soft (LLM-judgment-based)... not a hard guarantee." A prior attempt at a hard date-filtered search API (Tavily) also leaked, which matches published research: date-restricted search is insufficient because underlying page content/metadata gets updated after original publish dates.

Clickup Ticket(s): N/A

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📝 Documentation update
🔧 Refactoring (no functional changes)
⚡ Performance improvement
🧪 Test improvements
🔒 Security fix

Changes Made

Added an independent temporal-leakage verifier to _build_search_tool in aieng-forecasting/aieng/forecasting/methods/agentic/agent_factory.py: after every grounded search_web call with a cutoff date, a separate LLM call (verifier_model, defaults to ADVANCED_MODEL — distinct from search_model — so it doesn't share the same knowledge-attribution blind spot) extracts discrete factual claims and judges each against the cutoff, strips violations, and reports a confidence score.
Retry loop: on a non-clean/low-confidence verdict, search_web retries (same query, with the previously flagged claims injected as explicit negative feedback) up to verifier_max_attempts (default 3) before returning an explicit [SEARCH_VERIFICATION_FAILED] ... sentinel — never silently returning unverified content.
New ContextRetrievalConfig fields: verifier_model, verifier_max_attempts (3), verifier_confidence_threshold (8/10 — kept tunable rather than hardcoded since LLM self-reported confidence isn't well-calibrated).
implementations/energy_oil_forecasting/analyst_agent/agent.py: threaded the three verifier knobs through all four news-enabled builders; hardened the search sub-agent's own instruction (reason step-by-step about claim recency from content, not source metadata; never supplement from background knowledge) as cheap defense-in-depth; root analyst instruction now explicitly handles the [SEARCH_VERIFICATION_FAILED] sentinel by proceeding on price history alone instead of guessing.
8 new unit tests in test_agent_factory.py covering immediate-accept, retry-then-accept (with feedback-injection verified), exhaustion→sentinel, both skip conditions (no cutoff / enforce_cutoff=False), default-model selection, configurable threshold, and verifier parse-failure handling.
Verifier calls get their own Langfuse span for free (same litellm.acompletion callback path as the existing search call) — no extra instrumentation needed.
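
For reviewers who want the control flow at a glance, the verify-and-retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in agent_factory.py: the `Verdict` type, the `search` and `verify` callables, and the sentinel's trailing message are hypothetical stand-ins, and the acceptance rule (confidence at or above the threshold) is an assumption about what counts as a clean verdict.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

SENTINEL_PREFIX = "[SEARCH_VERIFICATION_FAILED]"


@dataclass
class Verdict:
    """Hypothetical verifier output: filtered text, flagged claims, 0-10 confidence."""
    filtered_text: str
    flagged_claims: List[str] = field(default_factory=list)
    confidence: int = 0


# Hypothetical signatures: search(query, cutoff, negative_feedback) -> text,
# verify(text, cutoff) -> Verdict, or None when the verifier output cannot be parsed.
SearchFn = Callable[[str, Optional[str], List[str]], str]
VerifyFn = Callable[[str, str], Optional[Verdict]]


def search_with_verification(
    query: str,
    cutoff: Optional[str],
    search: SearchFn,
    verify: VerifyFn,
    max_attempts: int = 3,
    confidence_threshold: int = 8,
    enforce_cutoff: bool = True,
) -> str:
    # Skip conditions: without a cutoff (or with enforcement off) nothing is verified.
    if cutoff is None or not enforce_cutoff:
        return search(query, cutoff, [])

    feedback: List[str] = []
    for _ in range(max_attempts):
        raw = search(query, cutoff, feedback)
        verdict = verify(raw, cutoff)
        # Assumed acceptance rule: a parsed verdict with high enough confidence.
        if verdict is not None and verdict.confidence >= confidence_threshold:
            return verdict.filtered_text
        if verdict is not None:
            # Carry flagged claims into the next attempt as negative feedback.
            feedback = feedback + [c for c in verdict.flagged_claims if c not in feedback]

    # Never return unverified content; the message after the prefix is illustrative.
    return f"{SENTINEL_PREFIX} no result passed verification after {max_attempts} attempts"
```

Passing the search and verifier as callables keeps the loop testable with stubs, which mirrors how the unit tests listed above exercise immediate-accept, retry-then-accept, and exhaustion without live LLM calls.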

Testing

Tests pass locally (uv run pytest tests/)
Type checking passes (uv run mypy <src_dir>)
Linting passes (uv run ruff check <src_dir>)
Manual testing performed (describe below)

Manual testing details:

uv run pytest aieng-forecasting/tests/aieng/forecasting/methods/agentic -q — 89 passed (including the 8 new leakage-verifier tests), no regressions.
uv run ruff check clean on all three changed source files.
Verified against live production traces post-fix: confirmed the verifier now runs as its own span on every search_web call for both gemini-3.1-flash-lite-preview and gemini-3.5-flash news-agent variants (previously 1 child LLM call per search_web, now 2: search + verify). Directly observed it flagging and stripping borderline post-cutoff claims (e.g. "Brent trading in a range of approximately $65–$70 per barrel in September 2025" against a 2025-09-01 cutoff) and accepting the filtered result with confidence: 9. Sampled 200 search_web calls from the last 3 days across both models: 0 occurrences of the [SEARCH_VERIFICATION_FAILED] sentinel, no latency evidence of retries in that sample (so the retry-exhaustion path is implemented and tested but not yet observed in production).

Related Issues

N/A — follow-up to #160, which merged before this leakage issue was caught.

Deployment Notes

NB04 (04_systematic_backtest_eval.ipynb) is committed here mid-flight: it's actively re-running with the guard in place after clearing only the two news-agent predictors' cached results (all other predictors, e.g. LightGBM, stayed cached and untouched). Only the completed gemini-3.1-flash-lite-preview 2025-backtest prediction file is included in this commit; the gemini-3.5-flash backtest and both models' 2026-eval prediction files will follow in a follow-up commit once that run finishes — hence this is opened as a draft PR.

Checklist

Code follows the project's style guidelines
Self-review of code completed
Documentation updated (if applicable)
No sensitive information (API keys, credentials) exposed

Backtest traces showed the WTI news agent's search_web tool leaking post-cutoff information despite the existing natural-language cutoff instruction. The same-model self-restraint it relied on isn't a hard guarantee, and a real trace confirmed it (Langfuse trace a5d33fccd8c99929c3784f774dda34e6: cutoff 2026-06-01, result stated WTI/Brent prices "by mid-June 2026").

search_web now runs an independent verifier call (different model by default) after every grounded search when a cutoff applies: it extracts discrete claims, judges each on content rather than source timestamps, strips violations, and retries (same query + flagged claims as negative feedback) up to 3 times before returning an explicit [SEARCH_VERIFICATION_FAILED] sentinel rather than ever silently returning unverified content.

This lives in the shared agent_factory.py used by all 5 forecasting domains, so the fix applies everywhere search_web is used.

Verified against live traces post-fix: the verifier is correctly wired as its own span, and is actively catching and stripping borderline post-cutoff claims (confirmed on both gemini-3.1-flash-lite-preview and gemini-3.5-flash news-agent variants).

NB04 and its committed prediction-cache artifacts are re-running with the guard in place; the 2025 backtest result for the lite-preview news agent is included here, with the remaining variants to follow in a follow-up commit once that run completes.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
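
The claim-level step in this commit message (extract claims, strip violations, report confidence) could be sketched as below. The JSON shape of the verifier reply (`claims`, `text`, `post_cutoff`, `confidence`) and the function name `parse_verdict` are assumptions made for illustration; the PR does not show the actual schema. Returning `None` on malformed output is likewise an assumed way of handling a verifier parse failure.

```python
import json
from typing import Optional


def parse_verdict(raw_reply: str, confidence_threshold: int = 8) -> Optional[dict]:
    """Split a (hypothetical) JSON verifier reply into kept and flagged claims."""
    try:
        data = json.loads(raw_reply)
        confidence = int(data["confidence"])
        claims = data["claims"]
        kept = [c["text"] for c in claims if not c["post_cutoff"]]
        flagged = [c["text"] for c in claims if c["post_cutoff"]]
    except (KeyError, TypeError, ValueError):
        # json.JSONDecodeError is a ValueError subclass, so malformed JSON lands here too.
        return None
    return {
        "kept": kept,
        "flagged": flagged,
        "confidence": confidence,
        # Threshold is tunable because self-reported confidence is not well calibrated.
        "accepted": confidence >= confidence_threshold,
    }
```

A caller would treat a `None` result or `accepted == False` as a failed attempt and pass the `flagged` claims into the next search as negative feedback.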

The independent verifier itself (agent_factory.py) already protects every domain unconditionally via shared defaults, but two complementary pieces were energy_oil-analyst-agent-only:

- The recency-reasoning / no-background-knowledge prompt hardening on each domain's context-retrieval sub-agent instruction.
- Root-agent handling of the [SEARCH_VERIFICATION_FAILED] sentinel (for starter-agent templates, this lives in the research-playbook skill doc that's loaded before search_web is called).

Propagated the same short additions to boc_rate_decisions (analyst_agent + starter_agent), sp500_forecasting/starter_agent, food_price_forecasting/starter_agent, and energy_oil_forecasting's own starter_agent/adaptive_agent (which each carry their own duplicate copy of the instruction text). getting_started/concierge_agent has context retrieval disabled, so it's unaffected.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

Local ad-hoc ruff-format checks (repo's newer uv-managed ruff 0.15.19) missed a one-line collapse that the CI-pinned pre-commit hook version catches. Verified by running the actual pre-commit hook suite locally before pushing.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

ethancjackson and others added 3 commits July 1, 2026 06:55

ethancjackson marked this pull request as ready for review July 1, 2026 11:28

ethancjackson merged commit e516a8d into main Jul 1, 2026
2 checks passed

ethancjackson deleted the search-leakage-verifier branch July 1, 2026 11:28
