fix: independent temporal-leakage verifier for search_web #161
Merged
Backtest traces showed the WTI news agent's search_web tool leaking post-cutoff information despite the existing natural-language cutoff instruction. The same-model self-restraint it relied on isn't a hard guarantee, and a real trace confirmed it (Langfuse trace a5d33fccd8c99929c3784f774dda34e6: cutoff 2026-06-01, result stated WTI/Brent prices "by mid-June 2026").

search_web now runs an independent verifier call (different model by default) after every grounded search when a cutoff applies: it extracts discrete claims, judges each on content rather than source timestamps, strips violations, and retries (same query + flagged claims as negative feedback) up to 3 times before returning an explicit [SEARCH_VERIFICATION_FAILED] sentinel rather than ever silently returning unverified content. This lives in the shared agent_factory.py used by all 5 forecasting domains, so the fix applies everywhere search_web is used.

Verified against live traces post-fix: the verifier is correctly wired as its own span, and is actively catching and stripping borderline post-cutoff claims (confirmed on both gemini-3.1-flash-lite-preview and gemini-3.5-flash news-agent variants).

NB04 and its committed prediction-cache artifacts are re-running with the guard in place; the 2025 backtest result for the lite-preview news agent is included here, with the remaining variants to follow in a follow-up commit once that run completes.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
The independent verifier itself (agent_factory.py) already protects every domain unconditionally via shared defaults, but two complementary pieces were energy_oil-analyst-agent-only:

- The recency-reasoning / no-background-knowledge prompt hardening on each domain's context-retrieval sub-agent instruction.
- Root-agent handling of the [SEARCH_VERIFICATION_FAILED] sentinel (for starter-agent templates, this lives in the research-playbook skill doc that's loaded before search_web is called).

Propagated the same short additions to boc_rate_decisions (analyst_agent + starter_agent), sp500_forecasting/starter_agent, food_price_forecasting/starter_agent, and energy_oil_forecasting's own starter_agent/adaptive_agent (which each carry their own duplicate copy of the instruction text). getting_started/concierge_agent has context retrieval disabled, so it's unaffected.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Local ad-hoc ruff-format checks (repo's newer uv-managed ruff 0.15.19) missed a one-line collapse that the CI-pinned pre-commit hook version catches. Verified by running the actual pre-commit hook suite locally before pushing.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Summary
search_web (the shared news/context-retrieval tool in agent_factory.py, used by all 5 forecasting domains) was leaking post-cutoff information into historical backtests despite an explicit natural-language cutoff instruction. This was confirmed directly in a Langfuse trace (a5d33fccd8c99929c3784f774dda34e6, cutoff 2026-06-01, result citing WTI/Brent prices "by mid-June 2026"). The existing ContextRetrievalConfig docstring already called this out as "soft (LLM-judgment-based)... not a hard guarantee." A prior attempt at a hard date-filtered search API (Tavily) also leaked, which matches published research: date-restricted search is insufficient because underlying page content/metadata gets updated after original publish dates.

Clickup Ticket(s): N/A
Type of Change
Changes Made
- _build_search_tool in aieng-forecasting/aieng/forecasting/methods/agentic/agent_factory.py: after every grounded search_web call with a cutoff date, a separate LLM call (verifier_model, which defaults to ADVANCED_MODEL and is distinct from search_model, so it doesn't share the same knowledge-attribution blind spot) extracts discrete factual claims, judges each against the cutoff, strips violations, and reports a confidence score. search_web retries (same query, with the previously flagged claims injected as explicit negative feedback) up to verifier_max_attempts (default 3) before returning an explicit [SEARCH_VERIFICATION_FAILED] ... sentinel, never silently returning unverified content.
- ContextRetrievalConfig fields: verifier_model, verifier_max_attempts (3), verifier_confidence_threshold (8/10; kept tunable rather than hardcoded since LLM self-reported confidence isn't well-calibrated).
- implementations/energy_oil_forecasting/analyst_agent/agent.py: threaded the three verifier knobs through all four news-enabled builders; hardened the search sub-agent's own instruction (reason step-by-step about claim recency from content, not source metadata; never supplement from background knowledge) as cheap defense-in-depth; root analyst instruction now explicitly handles the [SEARCH_VERIFICATION_FAILED] sentinel by proceeding on price history alone instead of guessing.
- Tests in test_agent_factory.py covering immediate-accept, retry-then-accept (with feedback-injection verified), exhaustion→sentinel, both skip conditions (no cutoff / enforce_cutoff=False), default-model selection, configurable threshold, and verifier parse-failure handling.
- Verifier call traced as its own span (same litellm.acompletion callback path as the existing search call); no extra instrumentation needed.

Testing
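For reviewers reproducing the retry and sentinel tests, the control flow described under Changes Made can be sketched roughly as below. This is a minimal illustration, not the code in agent_factory.py: VerifierSettings, Verdict, search_with_verification, and the injected search/verify callables are hypothetical names, the real calls are LLM calls rather than plain functions, and both the `>=` threshold comparison and the treatment of a verifier parse failure as a failed attempt are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

SENTINEL_PREFIX = "[SEARCH_VERIFICATION_FAILED]"


@dataclass
class VerifierSettings:
    """Hypothetical stand-in for the verifier knobs on ContextRetrievalConfig."""
    verifier_max_attempts: int = 3
    verifier_confidence_threshold: int = 8  # on a 0-10 scale


@dataclass
class Verdict:
    """Hypothetical verifier output for one search result."""
    filtered_text: str     # result with post-cutoff claims stripped
    violations: List[str]  # claims judged to postdate the cutoff
    confidence: int        # verifier's self-reported confidence, 0-10


# search(query, negative_feedback) -> raw text; verify(text, cutoff) -> Verdict or None
SearchFn = Callable[[str, List[str]], str]
VerifyFn = Callable[[str, str], Optional[Verdict]]


def search_with_verification(
    query: str,
    cutoff: Optional[str],
    search: SearchFn,
    verify: VerifyFn,
    settings: Optional[VerifierSettings] = None,
    enforce_cutoff: bool = True,
) -> str:
    settings = settings or VerifierSettings()

    # Skip conditions: no cutoff, or cutoff enforcement disabled.
    if cutoff is None or not enforce_cutoff:
        return search(query, [])

    flagged: List[str] = []
    for _ in range(settings.verifier_max_attempts):
        # Same query each time; previously flagged claims go in as negative feedback.
        raw = search(query, list(flagged))
        verdict = verify(raw, cutoff)
        if verdict is None:
            # Assumption: an unparseable verifier reply counts as a failed attempt.
            continue
        if verdict.confidence >= settings.verifier_confidence_threshold:
            return verdict.filtered_text
        for claim in verdict.violations:
            if claim not in flagged:
                flagged.append(claim)

    # Never fall back to unverified content; return an explicit sentinel instead.
    return (
        f"{SENTINEL_PREFIX} no result passed cutoff verification "
        f"after {settings.verifier_max_attempts} attempts"
    )
```

Passing search and verify in as callables keeps the loop testable with fakes, which is how the retry-then-accept and exhaustion cases can be exercised without network access.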
- uv run pytest tests/
- uv run mypy <src_dir>
- uv run ruff check src_dir/

Manual testing details:
- uv run pytest aieng-forecasting/tests/aieng/forecasting/methods/agentic -q: 89 passed (including the 8 new leakage-verifier tests), no regressions.
- uv run ruff check: clean on all three changed source files.
- Live traces post-fix: the verifier appears as its own span on the search_web call for both gemini-3.1-flash-lite-preview and gemini-3.5-flash news-agent variants (previously 1 child LLM call per search_web, now 2: search + verify). Directly observed it flagging and stripping borderline post-cutoff claims (e.g. "Brent trading in a range of approximately $65–$70 per barrel in September 2025" against a 2025-09-01 cutoff) and accepting the filtered result with confidence: 9.
- Sampled 200 search_web calls from the last 3 days across both models: 0 occurrences of the [SEARCH_VERIFICATION_FAILED] sentinel and no latency evidence of retries in that sample, so the retry-exhaustion path is implemented and tested but not yet observed in production.

Related Issues
N/A — follow-up to #160, which merged before this leakage issue was caught.
Deployment Notes
NB04 (04_systematic_backtest_eval.ipynb) is committed here mid-flight: it's actively re-running with the guard in place after clearing only the two news-agent predictors' cached results (all other predictors, e.g. LightGBM, stayed cached and untouched). Only the completed gemini-3.1-flash-lite-preview 2025-backtest prediction file is included in this commit; the gemini-3.5-flash backtest and both models' 2026-eval prediction files will follow in a follow-up commit once that run finishes, hence this is opened as a draft PR.

Checklist