Lgbm experiment by ethancjackson · Pull Request #160 · VectorInstitute/agentic-forecasting

ethancjackson · 2026-07-01T09:36:25Z

Summary

Expands the WTI systematic-backtest notebook (NB04) into a full head-to-head
competition — LightGBM ± a leak-safe covariate panel, and LLM-process/agent
methods run on both project models — then extends the 2026 protected-eval
window so it resolves through the most recent available data and runs the
complete 2025 backtest + 2026 eval end to end.

Clickup Ticket(s): N/A

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📝 Documentation update
🔧 Refactoring (no functional changes)
⚡ Performance improvement
🧪 Test improvements
🔒 Security fix

Changes Made

Promoted the SP500 leak-safe covariate builders into aieng/forecasting/data/features.py as a single source of truth; refactored sp500_forecasting/data.py to consume them (behaviour-preserving). Added build_wti_multivariate_service, an all-yfinance covariate panel for WTI (Brent, natural gas, gasoline, gold, USD index, USL/USO futures-curve contango proxy, VIX) with graceful-skip on unavailable tickers.
Reworked NB04 into a predictor registry with per-line enabled toggles; added LightGBM and LightGBM + cov, and ran LLMP-Sampled, LLMP-Sampled + cov, LLMP-Grid, and the News Agent across both gemini-3.1-flash-lite-preview and gemini-3.5-flash.
Fixed AgentPredictor.predictor_id to fold in the model name when the proxy wraps it in a LiteLlm (previously the two agent models collided in the cache), and fixed score_backtest_results to score realised outcomes against the latest available data instead of spec.end.
Replaced the hard-coded "Core Takeaways" prose in NB04 with a narrative generated live from eval results (analysis.eval_narrative_md), plus new post-eval diagnostics: CRPS heatmap, leaderboard interval chart, per-origin forecast-vs-reality chart, and agent-rationale rendering with Langfuse trace links.
Fixed .env loading to use override=False everywhere so injected workspace credentials win over the repo-root .env; trimmed .env.example to optional personal keys now that bootcamp secrets live in the shell.
Extended energy_oil_eval.yaml's protected-eval window from 8 weekly origins (through 2026-03-23) to 18 (through 2026-06-01) — the latest origin whose longest (21-business-day) horizon still fully resolves against available data as of today (2026-06-30). 06_protected_eval.ipynb shares this spec and was updated to match.
Ran the full cold 2025 backtest (51 origins) + extended 2026 eval (18 origins) across all 12 registered predictors and committed the resulting notebook outputs and prediction caches.
Fixed a ruff A004 (import shadows builtin) lint error on the IPython.display import in NB04, consistent with the existing # noqa: A004 convention used elsewhere in the repo's notebooks.
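The cache collision behind the AgentPredictor.predictor_id fix can be illustrated with a minimal sketch. It is not the repository's implementation: `WrappedModel` is a hypothetical stand-in for a wrapper object such as LiteLlm, and `model_name`, `predictor_id`, the `model` attribute lookup, and the hashing scheme are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class WrappedModel:
    """Hypothetical stand-in for a proxy wrapper that holds the real model name."""

    model: str


def model_name(model: object) -> str:
    """Return the underlying model name for a plain string or a wrapper object."""
    if isinstance(model, str):
        return model
    # Assumption: the wrapper exposes the name on a `model` attribute.
    name = getattr(model, "model", None)
    if not isinstance(name, str):
        raise TypeError(f"cannot determine model name from {type(model).__name__}")
    return name


def predictor_id(predictor_name: str, model: object) -> str:
    """Build a cache key that folds in the model name, wrapped or not."""
    key = f"{predictor_name}|{model_name(model)}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{predictor_name}-{digest[:8]}"


# Two wrapped models now map to two distinct cache keys.
lite = predictor_id("news_agent", WrappedModel("gemini-3.1-flash-lite-preview"))
flash = predictor_id("news_agent", WrappedModel("gemini-3.5-flash"))
```

If the ID were derived from the wrapper's type alone, both agents would share one cache entry, which is the failure mode the fix describes.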

Testing

Tests pass locally (uv run pytest tests/)
Type checking passes (uv run mypy <src_dir>)
Linting passes (uv run ruff check src_dir/)
Manual testing performed (describe below)

Manual testing details:

uv run pytest aieng-forecasting/tests -q — 373 passed, 7 skipped.
uv run pytest implementations/tests/energy_oil_forecasting -q — 11 passed.
make lint (ruff format + ruff check + mypy on aieng + nbqa-ruff) — all hooks passing.
Refreshed the WTI + covariate yfinance caches (Brent, nat gas, gasoline, gold, USD index, USL/USO curve proxy, VIX) through 2026-06-30 — several were stale at 2026-05-04 and would have silently starved the +cov predictors for the new eval origins.
Executed 04_systematic_backtest_eval.ipynb end to end in full mode (SMOKE_TEST = False) — all 12 predictors completed both the 2025 backtest and the extended 2026 eval; verified the last eval origin's 21-business-day horizon resolves exactly on 2026-06-30 (today), confirming the window is maximally extended without leaving any origin unresolved.
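The window check described above can be reproduced with a short standard-library sketch. It assumes weekly origins starting on 2026-02-02 (inferred from 18 weekly origins ending 2026-06-01) and counts plain Monday-to-Friday business days with no market-holiday calendar; `add_business_days` and `resolvable_origins` are hypothetical helper names, not functions from the repository.

```python
from datetime import date, timedelta


def add_business_days(start: date, n: int) -> date:
    """Step forward n weekdays (Mon-Fri); market holidays are ignored."""
    current = start
    remaining = n
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def resolvable_origins(
    first_origin: date,
    data_through: date,
    horizon_bdays: int = 21,
    step_days: int = 7,
) -> list[date]:
    """Weekly origins whose longest horizon resolves on or before data_through."""
    origins = []
    origin = first_origin
    while add_business_days(origin, horizon_bdays) <= data_through:
        origins.append(origin)
        origin += timedelta(days=step_days)
    return origins


origins = resolvable_origins(date(2026, 2, 2), date(2026, 6, 30))
print(len(origins), origins[-1], add_business_days(origins[-1], 21))
```

Under these assumptions the next weekly origin (2026-06-08) would resolve after 2026-06-30, so it is excluded.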

Screenshots/Recordings

N/A — see the new diagnostic visualizations (CRPS heatmap, leaderboard interval chart, per-origin forecast charts) rendered inline in 04_systematic_backtest_eval.ipynb Sections 7–11.

Related Issues

N/A

Deployment Notes

New prediction-cache artifacts under implementations/energy_oil_forecasting/data/predictions/energy_oil_backtest/ and .../energy_oil_eval/ (intentionally tracked, not gitignored) are still untracked locally as of this PR draft — git add them before pushing so reviewers/CI can load the completed run without re-executing the expensive LLM/agent calls.
No migration steps. The data/yfinance/ price cache is gitignored and regenerated locally via uv run python scripts/fetch_wti.py; no action needed in CI.

Checklist

Code follows the project's style guidelines
Self-review of code completed
Documentation updated (if applicable)
No sensitive information (API keys, credentials) exposed

Expand the WTI systematic-backtest notebook to put a fuller slate of methods forward, with easy per-predictor on/off toggles.

- Promote SP500's leak-safe covariate builders into the package (aieng/forecasting/data/features.py) as a single source of truth; refactor sp500_forecasting/data.py to consume them (behaviour-preserving).
- Add build_wti_multivariate_service: an all-yfinance covariate panel for WTI (Brent, natural gas, gasoline, gold, USD index, USL/USO futures-curve contango proxy, VIX), graceful-skip on unavailable tickers.
- Rework NB04 into a predictor registry with enabled flags; add LightGBM and LightGBM+cov, and run LLMP-Sampled, LLMP-Sampled+cov, LLMP-Grid, and the News Agent across both gemini-3.1-flash-lite-preview and gemini-3.5-flash.
- Fix AgentPredictor.predictor_id to fold in the model name when the proxy wraps it in a LiteLlm (previously the two agent models collided in the cache).
- Fix score_backtest_results to score realised outcomes against the latest available data instead of spec.end, so MAE/coverage populate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
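One plausible reading of the score_backtest_results fix is sketched below: forecasts whose target dates fall after the spec's end date are only scored if the cutoff is the last date with realised data. This is an illustration under that assumption, not the repository's code; `absolute_errors` is a hypothetical helper and the prices are made-up placeholder values, not real WTI data.

```python
from datetime import date


def absolute_errors(
    forecasts: list[tuple[date, float]],
    realised: dict[date, float],
    cutoff: date,
) -> list[float]:
    """Absolute errors for forecasts that target a date on or before cutoff."""
    return [
        abs(point - realised[target])
        for target, point in forecasts
        if target <= cutoff and target in realised
    ]


# Placeholder values for illustration only.
realised = {date(2026, 6, 1): 70.0, date(2026, 6, 30): 72.0}
forecasts = [(date(2026, 6, 30), 71.0)]  # issued at the last origin, 21 business days ahead

spec_end = date(2026, 6, 1)
latest_available = max(realised)

errors_old = absolute_errors(forecasts, realised, cutoff=spec_end)
errors_new = absolute_errors(forecasts, realised, cutoff=latest_available)
```

With the spec end as cutoff the forecast is never scored, so any metric averaged over `errors_old` would be empty; with the latest available date it contributes one error.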

Set override=False on every load_dotenv call so injected workspace credentials win over repo-root .env, and limit .env.example to optional personal keys (FRED) now that bootcamp secrets live in the shell.

Co-authored-by: Cursor <cursoragent@cursor.com>
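The precedence that override=False gives can be sketched with the standard library alone. The helper below only mimics the semantics (values already present in the environment are kept); it is not python-dotenv itself, and `apply_env_file` and the key names are hypothetical.

```python
from collections.abc import MutableMapping


def apply_env_file(
    file_values: dict[str, str],
    environ: MutableMapping[str, str],
    override: bool = False,
) -> None:
    """Copy .env-style values into environ; existing keys win unless override is True."""
    for key, value in file_values.items():
        if override or key not in environ:
            environ[key] = value


# Hypothetical keys: one injected by the workspace, one personal and optional.
dotenv_values = {"SHARED_API_KEY": "from-repo-dotenv", "PERSONAL_KEY": "from-repo-dotenv"}

env_keep = {"SHARED_API_KEY": "from-workspace"}
apply_env_file(dotenv_values, env_keep, override=False)

env_clobber = {"SHARED_API_KEY": "from-workspace"}
apply_env_file(dotenv_values, env_clobber, override=True)
```

In `env_keep` the workspace credential survives and only the missing personal key is filled in; in `env_clobber` the repo-root value replaces it, which is the behaviour the change avoids.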

…zations

Replace the hard-coded "Core Takeaways" prose (which presupposed the experiment outcome) with a narrative generated live from the eval results, and add interpretable post-eval visualizations so the leaderboard is legible rather than a single opaque number.

analysis.py:
- predictions_to_frame: tidy one-row-per (predictor x origin x horizon) frame (point, 80% interval, realised price, CRPS, coverage)
- per_horizon_crps / leaderboard_with_uncertainty (mean +/- SE)
- extract_agent_rationales, build_price_frame, predictor_family
- eval_narrative_md: takeaways computed from the run (winner vs noise floor, decisive horizon, best family, calibration, small-sample caveat)

viz.py:
- make_crps_heatmap (predictor x horizon: where the ranking is decided)
- make_leaderboard_interval_chart (real edge vs noise)
- make_eval_forecast_chart (median + 80% band vs reality, per origin)
- render_rationales_html (agent reasoning + Langfuse trace links)

04 notebook: new Sections 7-11 (diagnostics, behaviour, agent reasoning, computed takeaways, de-presupposed conceptual closer). All cells recompute from eval_results so switching SMOKE_TEST off and rerunning the full suite just works.

Also snapshots incidental working-tree state: smoke-run prediction caches, regenerated curriculum baselines, and other WIP notebooks.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
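The "mean +/- SE" leaderboard idea can be sketched as follows. This is a minimal illustration and not the code of leaderboard_with_uncertainty: the function name `leaderboard_with_se`, the predictor labels, and the scores are all hypothetical toy values. It assumes lower CRPS is better and uses the sample standard deviation divided by the square root of the number of scores as the standard error.

```python
import math
from statistics import mean, stdev


def leaderboard_with_se(scores: dict[str, list[float]]) -> list[tuple[str, float, float]]:
    """Rank predictors by mean score (lower is better) with a standard error each."""
    rows = []
    for name, values in scores.items():
        # The standard error needs at least two scores; otherwise report NaN.
        se = stdev(values) / math.sqrt(len(values)) if len(values) > 1 else float("nan")
        rows.append((name, mean(values), se))
    return sorted(rows, key=lambda row: row[1])


# Toy per-origin scores for illustration only.
toy_scores = {
    "model_a": [1.0, 2.0, 3.0],
    "model_b": [2.0, 2.0, 5.0],
}
board = leaderboard_with_se(toy_scores)
for name, avg, se in board:
    print(f"{name}: {avg:.2f} +/- {se:.2f}")
```

When the gap between two means is smaller than their standard errors, the ranking between them is within noise, which is the distinction the interval chart is meant to make visible.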

energy_oil_eval.yaml was capped at 2026-03-23 (8 weekly origins). With today at 2026-06-30, extend end to 2026-06-01 (18 origins) -- the latest origin whose longest horizon (21 business days) still fully resolves against available data (resolves exactly on 2026-06-30). Both NB04 and NB06 share this spec.

- energy_oil_eval.yaml: end 2026-03-23 -> 2026-06-01; update header comment and description (8 -> 18 origins).
- energy_oil_eval_smoke.yaml: fix stale origin-count comment (8 -> 18).
- 04_systematic_backtest_eval.ipynb / 06_protected_eval.ipynb: update markdown describing the eval window (18 origins, Feb-Jun 2026).

Also refreshed the WTI + covariate yfinance caches (Brent, nat gas, gasoline, gold, USD index, USL/USO curve proxy, VIX) through 2026-06-30 -- several were stale at 2026-05-04 and would have silently starved the +cov predictors for the new origins. Cache lives under data/yfinance/ (gitignored), not part of this commit.

WIP checkpoint: the full 2025 backtest + extended 2026 eval (12 predictors x 69 origins, no prior prediction cache) is running now, so the notebook cell outputs below reflect an in-progress run. A follow-up commit will land the completed run's outputs.

Co-authored-by: Cursor <cursoragent@cursor.com>

ethancjackson and others added 7 commits June 30, 2026 11:05

Merge remote-tracking branch 'origin/main' into lgbm-experiment

1ff9648

commit energy notebook 04 after full backtest run

8471521

commit energy notebook 04 after full backtest run plus lint

8402d0c

ethancjackson merged commit d068737 into main Jul 1, 2026
2 checks passed

ethancjackson deleted the lgbm-experiment branch July 1, 2026 10:51

ethancjackson mentioned this pull request Jul 1, 2026

fix: independent temporal-leakage verifier for search_web #161

Merged

16 tasks

ethancjackson merged 7 commits into main from lgbm-experiment
