Lgbm experiment #160
Merged
Expand the WTI systematic-backtest notebook to put a fuller slate of methods forward, with easy per-predictor on/off toggles.
- Promote SP500's leak-safe covariate builders into the package (aieng/forecasting/data/features.py) as a single source of truth; refactor sp500_forecasting/data.py to consume them (behaviour-preserving).
- Add build_wti_multivariate_service: an all-yfinance covariate panel for WTI (Brent, natural gas, gasoline, gold, USD index, USL/USO futures-curve contango proxy, VIX), with graceful-skip on unavailable tickers.
- Rework NB04 into a predictor registry with enabled flags; add LightGBM and LightGBM+cov, and run LLMP-Sampled, LLMP-Sampled+cov, LLMP-Grid, and the News Agent across both gemini-3.1-flash-lite-preview and gemini-3.5-flash.
- Fix AgentPredictor.predictor_id to fold in the model name when the proxy wraps it in a LiteLlm (previously the two agent models collided in the cache).
- Fix score_backtest_results to score realised outcomes against the latest available data instead of spec.end, so MAE/coverage populate.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
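The leak-safe, graceful-skip panel idea can be sketched as follows. This is a minimal sketch, not the code in aieng/forecasting/data/features.py: it assumes that "leak-safe" means each covariate is lagged by one observation, so the value used on a given row was already known before that row's date, and that "graceful-skip" means a ticker that raises or returns no data is reported and dropped instead of failing the whole panel. build_covariate_panel, lagged_covariates, fake_fetch, and the "BRENT"/"UNAVAILABLE" symbols are hypothetical placeholders; fake_fetch stands in for a yfinance download so the example runs offline.

```python
from typing import Callable, Mapping, Optional

import pandas as pd


def lagged_covariates(panel: pd.DataFrame, lag: int = 1) -> pd.DataFrame:
    # Shift each column down by `lag` rows: the value seen on row t is the
    # observation from row t - lag, so no same-day information leaks in.
    return panel.sort_index().shift(lag)


def build_covariate_panel(
    tickers: Mapping[str, str],
    fetch: Callable[[str], Optional[pd.Series]],
    lag: int = 1,
) -> pd.DataFrame:
    """Fetch each ticker, skipping any that fail or come back empty."""
    columns: dict[str, pd.Series] = {}
    for name, ticker in tickers.items():
        try:
            series = fetch(ticker)
        except Exception as exc:  # graceful skip: report and continue
            print(f"skipping {name} ({ticker}): {exc}")
            continue
        if series is None or series.dropna().empty:
            print(f"skipping {name} ({ticker}): no data")
            continue
        columns[name] = series
    if not columns:
        return pd.DataFrame()
    # Dict keys become column names; indexes are outer-joined on date.
    return lagged_covariates(pd.concat(columns, axis=1), lag=lag)


# Offline usage example: fake_fetch stands in for a yfinance download.
dates = pd.bdate_range("2026-01-05", periods=5)


def fake_fetch(ticker: str) -> Optional[pd.Series]:
    if ticker == "UNAVAILABLE":
        raise ValueError("ticker not found")
    return pd.Series([70.0, 71.0, 72.0, 71.5, 73.0], index=dates)


panel = build_covariate_panel({"brent": "BRENT", "ghost": "UNAVAILABLE"}, fake_fetch)
```

Passing fetch in as a callable keeps the panel logic testable without network access; the unavailable ticker simply disappears from the panel.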
Set override=False on every load_dotenv call so injected workspace credentials win over repo-root .env, and limit .env.example to optional personal keys (FRED) now that bootcamp secrets live in the shell. Co-authored-by: Cursor <cursoragent@cursor.com>
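The effect of override=False can be illustrated without python-dotenv itself. In python-dotenv, load_dotenv(override=False), which is also the library's default, leaves any variable already present in os.environ untouched, so a credential injected by the workspace shell takes precedence over the same key in .env. The sketch below mimics that rule with a toy parser and a plain dict standing in for os.environ; parse_dotenv, apply_dotenv, and the variable names GEMINI_API_KEY and FRED_API_KEY are hypothetical.

```python
from collections.abc import Mapping, MutableMapping


def parse_dotenv(text: str) -> dict[str, str]:
    """Toy KEY=VALUE parser (no quotes, export, or interpolation)."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def apply_dotenv(
    values: Mapping[str, str],
    environ: MutableMapping[str, str],
    override: bool = False,
) -> None:
    """Copy .env values into environ; with override=False, existing keys win."""
    for key, value in values.items():
        if override or key not in environ:
            environ[key] = value


DOTENV_TEXT = """
# repo-root .env
GEMINI_API_KEY=stale-value-from-repo-env
FRED_API_KEY=my-personal-fred-key
"""

# Stand-in for os.environ after the workspace shell injected its credential.
shell_env = {"GEMINI_API_KEY": "injected-by-workspace"}
apply_dotenv(parse_dotenv(DOTENV_TEXT), shell_env, override=False)

# For contrast: override=True lets the stale .env value clobber the injected one.
forced_env = {"GEMINI_API_KEY": "injected-by-workspace"}
apply_dotenv(parse_dotenv(DOTENV_TEXT), forced_env, override=True)
```

Only keys missing from the environment (here FRED_API_KEY) are filled in from the file, which is consistent with trimming .env.example down to optional personal keys.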
…zations

Replace the hard-coded "Core Takeaways" prose (which presupposed the experiment outcome) with a narrative generated live from the eval results, and add interpretable post-eval visualizations so the leaderboard is legible rather than a single opaque number.

analysis.py:
- predictions_to_frame: tidy one-row-per-(predictor x origin x horizon) frame (point, 80% interval, realised price, CRPS, coverage)
- per_horizon_crps / leaderboard_with_uncertainty (mean +/- SE)
- extract_agent_rationales, build_price_frame, predictor_family
- eval_narrative_md: takeaways computed from the run (winner vs noise floor, decisive horizon, best family, calibration, small-sample caveat)

viz.py:
- make_crps_heatmap (predictor x horizon: where the ranking is decided)
- make_leaderboard_interval_chart (real edge vs noise)
- make_eval_forecast_chart (median + 80% band vs reality, per origin)
- render_rationales_html (agent reasoning + Langfuse trace links)

04 notebook: new Sections 7-11 (diagnostics, behaviour, agent reasoning, computed takeaways, de-presupposed conceptual closer). All cells recompute from eval_results, so switching SMOKE_TEST off and rerunning the full suite just works.

Also snapshots incidental working-tree state: smoke-run prediction caches, regenerated curriculum baselines, and other WIP notebooks.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
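The scoring quantities behind the leaderboard can be sketched as below, assuming each forecast is represented by Monte Carlo samples. This is a hypothetical stand-in, not the implementation of per_horizon_crps or leaderboard_with_uncertainty: crps_from_samples uses the energy form CRPS = E|X - y| - 0.5 E|X - X'| over the empirical sample distribution, interval_covered checks whether the realised price falls inside the central 80% interval, and leaderboard_mean_se reports mean CRPS with its standard error so a gap between predictors can be compared against the noise floor. The predictor names "a" and "b" and their scores are toy data.

```python
import numpy as np
import pandas as pd


def crps_from_samples(samples: np.ndarray, observed: float) -> float:
    """CRPS of the empirical sample distribution: E|X - y| - 0.5 * E|X - X'|."""
    x = np.asarray(samples, dtype=float)
    spread_to_truth = np.mean(np.abs(x - observed))
    # Pairwise term over all sample pairs (O(n^2) memory; fine for small n).
    internal_spread = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return float(spread_to_truth - internal_spread)


def interval_covered(samples: np.ndarray, observed: float, level: float = 0.8) -> bool:
    """True if observed lies inside the central `level` interval of the samples."""
    tail = (1.0 - level) / 2.0
    low, high = np.quantile(samples, [tail, 1.0 - tail])
    return bool(low <= observed <= high)


def leaderboard_mean_se(scores: pd.DataFrame) -> pd.DataFrame:
    """Mean CRPS per predictor with its standard error, lowest (best) first."""
    crps = scores.groupby("predictor")["crps"]
    n = crps.count()
    board = pd.DataFrame(
        {"mean_crps": crps.mean(), "se": crps.std(ddof=1) / np.sqrt(n), "n": n}
    )
    return board.sort_values("mean_crps")


# Toy per-(origin, horizon) scores for two hypothetical predictors.
scores = pd.DataFrame(
    {"predictor": ["a", "a", "b", "b"], "crps": [1.0, 3.0, 2.5, 2.5]}
)
board = leaderboard_mean_se(scores)
```

When the gap between two predictors' mean CRPS is small relative to their standard errors, the ordering is more likely noise than a real edge, which is the comparison a mean +/- SE leaderboard makes visible.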
energy_oil_eval.yaml was capped at 2026-03-23 (8 weekly origins). With today at 2026-06-30, extend end to 2026-06-01 (18 origins), the latest origin whose longest horizon (21 business days) still fully resolves against available data (it resolves exactly on 2026-06-30). Both NB04 and NB06 share this spec.
- energy_oil_eval.yaml: end 2026-03-23 -> 2026-06-01; update header comment and description (8 -> 18 origins).
- energy_oil_eval_smoke.yaml: fix stale origin-count comment (8 -> 18).
- 04_systematic_backtest_eval.ipynb / 06_protected_eval.ipynb: update markdown describing the eval window (18 origins, Feb-Jun 2026).

Also refreshed the WTI + covariate yfinance caches (Brent, nat gas, gasoline, gold, USD index, USL/USO curve proxy, VIX) through 2026-06-30; several were stale at 2026-05-04 and would have silently starved the +cov predictors for the new origins. The cache lives under data/yfinance/ (gitignored) and is not part of this commit.

WIP checkpoint: the full 2025 backtest + extended 2026 eval (12 predictors x 69 origins, no prior prediction cache) is running now, so the notebook cell outputs below reflect an in-progress run. A follow-up commit will land the completed run's outputs.

Co-authored-by: Cursor <cursoragent@cursor.com>
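The window arithmetic above can be checked with a small stdlib sketch. It assumes horizons count plain Monday-to-Friday business days with no exchange holidays, and that the weekly origins start on Monday 2026-02-02; that start date is inferred (both "8 origins ending 2026-03-23" and "18 origins ending 2026-06-01" imply it), not read from the spec. add_business_days, is_resolved, and weekly_origins are hypothetical helpers. is_resolved also mirrors the score_backtest_results fix: an (origin, horizon) pair is scorable once its target date is on or before the latest available data, whereas a cutoff at spec.end would leave every horizon of the final origin unscored, since each of its targets lands after spec.end.

```python
from datetime import date, timedelta


def add_business_days(start: date, n: int) -> date:
    """Step forward n weekdays (Mon-Fri); exchange holidays are ignored."""
    current = start
    remaining = n
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def is_resolved(origin: date, horizon_bdays: int, last_available: date) -> bool:
    """Scorable once the target date is covered by the latest available data."""
    return add_business_days(origin, horizon_bdays) <= last_available


def weekly_origins(start: date, end: date) -> list[date]:
    """Every 7th day from start through end, inclusive."""
    origins = []
    current = start
    while current <= end:
        origins.append(current)
        current += timedelta(weeks=1)
    return origins


LAST_AVAILABLE = date(2026, 6, 30)  # "today" in the commit message
FIRST_ORIGIN = date(2026, 2, 2)     # assumed: inferred, not stated in the spec
LONGEST_HORIZON = 21                # business days

# Candidate Monday origins through late June; keep the latest fully resolvable one.
candidates = weekly_origins(FIRST_ORIGIN, date(2026, 6, 29))
last_origin = max(o for o in candidates if is_resolved(o, LONGEST_HORIZON, LAST_AVAILABLE))
eval_window = weekly_origins(FIRST_ORIGIN, last_origin)
```

Under these assumptions the next weekly origin, 2026-06-08, would need data through 2026-07-07 for its 21-day horizon, which is why 2026-06-01 is the last origin that fits.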
Summary
Expands the WTI systematic-backtest notebook (NB04) into a full head-to-head
competition — LightGBM ± a leak-safe covariate panel, and LLM-process/agent
methods run on both project models — then extends the 2026 protected-eval
window so it resolves through the most recent available data and runs the
complete 2025 backtest + 2026 eval end to end.
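The registry pattern with per-predictor enabled toggles, together with the predictor_id fix, can be sketched like this. PredictorSpec and REGISTRY are hypothetical names rather than the notebook's actual API, and the list is an illustrative subset, not the full set of 12 predictors. The point is that predictor_id includes the model name, so the same method run on gemini-3.1-flash-lite-preview and gemini-3.5-flash gets two distinct cache keys instead of colliding.

```python
from dataclasses import dataclass
from typing import Optional

MODELS = ["gemini-3.1-flash-lite-preview", "gemini-3.5-flash"]


@dataclass(frozen=True)
class PredictorSpec:
    family: str                   # method name, e.g. "LightGBM" or "News Agent"
    model: Optional[str] = None   # LLM backing the method; None for LightGBM
    use_covariates: bool = False  # True for the "+ cov" variants
    enabled: bool = True          # per-predictor on/off toggle

    @property
    def predictor_id(self) -> str:
        # Fold the model name into the id so runs on different LLMs never
        # share a prediction-cache key.
        name = self.family + (" + cov" if self.use_covariates else "")
        return f"{name} [{self.model}]" if self.model else name


REGISTRY = [
    PredictorSpec("LightGBM"),
    PredictorSpec("LightGBM", use_covariates=True),
    *(PredictorSpec("LLMP-Sampled", m) for m in MODELS),
    *(PredictorSpec("LLMP-Sampled", m, use_covariates=True) for m in MODELS),
    *(PredictorSpec("LLMP-Grid", m) for m in MODELS),
    # Flip enabled to skip an expensive method without deleting its entry.
    *(PredictorSpec("News Agent", m, enabled=False) for m in MODELS),
]

active = [spec for spec in REGISTRY if spec.enabled]
ids = [spec.predictor_id for spec in REGISTRY]
assert len(ids) == len(set(ids)), "predictor ids must be unique cache keys"
```

Keeping disabled entries in the registry, rather than commenting them out, means re-enabling a method is a one-flag change and its cache key stays stable.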
Clickup Ticket(s): N/A
Type of Change
Changes Made
- Promoted SP500's leak-safe covariate builders into aieng/forecasting/data/features.py as a single source of truth; refactored sp500_forecasting/data.py to consume them (behaviour-preserving). Added build_wti_multivariate_service, an all-yfinance covariate panel for WTI (Brent, natural gas, gasoline, gold, USD index, USL/USO futures-curve contango proxy, VIX) with graceful-skip on unavailable tickers.
- Reworked NB04 into a predictor registry with per-predictor enabled toggles; added LightGBM and LightGBM + cov, and ran LLMP-Sampled, LLMP-Sampled + cov, LLMP-Grid, and the News Agent across both gemini-3.1-flash-lite-preview and gemini-3.5-flash.
- Fixed AgentPredictor.predictor_id to fold in the model name when the proxy wraps it in a LiteLlm (previously the two agent models collided in the cache), and fixed score_backtest_results to score realised outcomes against the latest available data instead of spec.end.
- Replaced the hard-coded "Core Takeaways" prose with a narrative generated live from the eval results (analysis.eval_narrative_md), plus new post-eval diagnostics: CRPS heatmap, leaderboard interval chart, per-origin forecast-vs-reality chart, and agent-rationale rendering with Langfuse trace links.
- Switched .env loading to use override=False everywhere so injected workspace credentials win over the repo-root .env; trimmed .env.example to optional personal keys now that bootcamp secrets live in the shell.
- Extended energy_oil_eval.yaml's protected-eval window from 8 weekly origins (through 2026-03-23) to 18 (through 2026-06-01), the latest origin whose longest (21-business-day) horizon still fully resolves against available data as of today (2026-06-30). 06_protected_eval.ipynb shares this spec and was updated to match.
- Resolved the A004 (import shadows builtin) lint error on the IPython.display import in NB04, consistent with the existing # noqa: A004 convention used elsewhere in the repo's notebooks.
Testing
- uv run pytest tests/
- uv run mypy <src_dir>
- uv run ruff check src_dir/

Manual testing details:
- uv run pytest aieng-forecasting/tests -q: 373 passed, 7 skipped.
- uv run pytest implementations/tests/energy_oil_forecasting -q: 11 passed.
- make lint (ruff format + ruff check + mypy on aieng + nbqa-ruff): all hooks passing.
- Refreshed the WTI + covariate yfinance caches through 2026-06-30; several were stale at 2026-05-04 and would have silently starved the +cov predictors for the new eval origins.
- Ran 04_systematic_backtest_eval.ipynb end to end in full mode (SMOKE_TEST = False): all 12 predictors completed both the 2025 backtest and the extended 2026 eval. Verified that the last eval origin's 21-business-day horizon resolves exactly on 2026-06-30 (today), confirming the window is maximally extended without leaving any origin unresolved.
Screenshots/Recordings
N/A. See the new diagnostic visualizations (CRPS heatmap, leaderboard interval chart, per-origin forecast charts) rendered inline in 04_systematic_backtest_eval.ipynb Sections 7–11.
Related Issues
N/A
Deployment Notes
- implementations/energy_oil_forecasting/data/predictions/energy_oil_backtest/ and .../energy_oil_eval/ (intentionally tracked, not gitignored) are still untracked locally as of this PR draft; git add them before pushing so reviewers/CI can load the completed run without re-executing the expensive LLM/agent calls.
- The data/yfinance/ price cache is gitignored and regenerated locally via uv run python scripts/fetch_wti.py; no action needed in CI.
Checklist