Lgbm experiment #160

Merged
ethancjackson merged 7 commits into main from lgbm-experiment
Jul 1, 2026

Conversation

@ethancjackson

Collaborator

Summary

Expands the WTI systematic-backtest notebook (NB04) into a full head-to-head
competition — LightGBM ± a leak-safe covariate panel, and LLM-process/agent
methods run on both project models — then extends the 2026 protected-eval
window so it resolves through the most recent available data and runs the
complete 2025 backtest + 2026 eval end to end.

Clickup Ticket(s): N/A

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🔧 Refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test improvements
  • 🔒 Security fix

Changes Made

  • Promoted the SP500 leak-safe covariate builders into aieng/forecasting/data/features.py as a single source of truth; refactored sp500_forecasting/data.py to consume them (behaviour-preserving). Added build_wti_multivariate_service, an all-yfinance covariate panel for WTI (Brent, natural gas, gasoline, gold, USD index, USL/USO futures-curve contango proxy, VIX) with graceful-skip on unavailable tickers.
  • Reworked NB04 into a predictor registry with per-line enabled toggles; added LightGBM and LightGBM + cov, and ran LLMP-Sampled, LLMP-Sampled + cov, LLMP-Grid, and the News Agent across both gemini-3.1-flash-lite-preview and gemini-3.5-flash.
  • Fixed AgentPredictor.predictor_id to fold in the model name when the proxy wraps it in a LiteLlm (previously the two agent models collided in the cache), and fixed score_backtest_results to score realised outcomes against the latest available data instead of spec.end.
  • Replaced the hard-coded "Core Takeaways" prose in NB04 with a narrative generated live from eval results (analysis.eval_narrative_md), plus new post-eval diagnostics: CRPS heatmap, leaderboard interval chart, per-origin forecast-vs-reality chart, and agent-rationale rendering with Langfuse trace links.
  • Fixed .env loading to use override=False everywhere so injected workspace credentials win over the repo-root .env; trimmed .env.example to optional personal keys now that bootcamp secrets live in the shell.
  • Extended energy_oil_eval.yaml's protected-eval window from 8 weekly origins (through 2026-03-23) to 18 (through 2026-06-01) — the latest origin whose longest (21-business-day) horizon still fully resolves against available data as of today (2026-06-30). 06_protected_eval.ipynb shares this spec and was updated to match.
  • Ran the full cold 2025 backtest (51 origins) + extended 2026 eval (18 origins) across all 12 registered predictors and committed the resulting notebook outputs and prediction caches.
  • Fixed a ruff A004 (import shadows builtin) lint error on the IPython.display import in NB04, consistent with the existing # noqa: A004 convention used elsewhere in the repo's notebooks.
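The predictor_id fix in the list above is a cache-key problem: two agents that differ only in their model must not share an ID. The following is a minimal sketch of that idea, not the repository's implementation. `WrappedModel`, `model_name`, and the `.model` attribute are hypothetical stand-ins for whatever wrapper object the proxy actually returns.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class WrappedModel:
    """Hypothetical stand-in for a wrapper object that holds a model name."""

    model: str


def model_name(model: Any) -> str:
    """Return a stable name whether `model` is a plain string or a wrapper."""
    if isinstance(model, str):
        return model
    # Assumption: the wrapper exposes the underlying name as `.model`.
    # Fall back to the wrapper's class name if it does not.
    return str(getattr(model, "model", type(model).__name__))


def predictor_id(base: str, model: Any) -> str:
    """Build a cache key that differs for each underlying model."""
    return f"{base}:{model_name(model)}"


lite = predictor_id("news_agent", WrappedModel("gemini-3.1-flash-lite-preview"))
flash = predictor_id("news_agent", WrappedModel("gemini-3.5-flash"))
print(lite)
print(flash)
```

If the ID were derived from the wrapper's type alone, both agents would map to the same key and the second run would silently read the first run's cached predictions.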

Testing

  • Tests pass locally (uv run pytest tests/)
  • Type checking passes (uv run mypy <src_dir>)
  • Linting passes (uv run ruff check src_dir/)
  • Manual testing performed (describe below)

Manual testing details:

  • uv run pytest aieng-forecasting/tests -q — 373 passed, 7 skipped.
  • uv run pytest implementations/tests/energy_oil_forecasting -q — 11 passed.
  • make lint (ruff format + ruff check + mypy on aieng + nbqa-ruff) — all hooks passing.
  • Refreshed the WTI + covariate yfinance caches (Brent, nat gas, gasoline, gold, USD index, USL/USO curve proxy, VIX) through 2026-06-30 — several were stale at 2026-05-04 and would have silently starved the +cov predictors for the new eval origins.
  • Executed 04_systematic_backtest_eval.ipynb end to end in full mode (SMOKE_TEST = False) — all 12 predictors completed both the 2025 backtest and the extended 2026 eval; verified the last eval origin's 21-business-day horizon resolves exactly on 2026-06-30 (today), confirming the window is maximally extended without leaving any origin unresolved.
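The window check described above can be reproduced with a short standalone calculation. This sketch counts weekdays only and ignores market holidays, which is an assumption; a holiday-aware calendar may give a different resolution date. The first origin (2026-02-02) is also an assumption, inferred from the statement that the earlier window held 8 weekly origins through 2026-03-23.

```python
from datetime import date, timedelta


def add_business_days(start: date, n: int) -> date:
    """Step forward n weekdays (Mon-Fri). Market holidays are ignored."""
    current = start
    remaining = n
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


def weekly_origins(first: date, as_of: date, horizon_bdays: int) -> list[date]:
    """Weekly origins whose longest horizon resolves on or before `as_of`."""
    origins = []
    origin = first
    while add_business_days(origin, horizon_bdays) <= as_of:
        origins.append(origin)
        origin += timedelta(weeks=1)
    return origins


# Assumed first origin; see the note above.
origins = weekly_origins(date(2026, 2, 2), as_of=date(2026, 6, 30), horizon_bdays=21)
last_resolution = add_business_days(origins[-1], 21)
print(len(origins), origins[-1], last_resolution)
# -> 18 2026-06-01 2026-06-30
```

Under these assumptions the next weekly origin (2026-06-08) would not resolve until after 2026-06-30, so 2026-06-01 is the last one that can be scored in full.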

Screenshots/Recordings

N/A — see the new diagnostic visualizations (CRPS heatmap, leaderboard interval chart, per-origin forecast charts) rendered inline in 04_systematic_backtest_eval.ipynb Sections 7–11.

Related Issues

N/A

Deployment Notes

  • New prediction-cache artifacts under implementations/energy_oil_forecasting/data/predictions/energy_oil_backtest/ and .../energy_oil_eval/ (intentionally tracked, not gitignored) are still untracked locally as of this PR draft — git add them before pushing so reviewers/CI can load the completed run without re-executing the expensive LLM/agent calls.
  • No migration steps. The data/yfinance/ price cache is gitignored and regenerated locally via uv run python scripts/fetch_wti.py; no action needed in CI.

Checklist

  • Code follows the project's style guidelines
  • Self-review of code completed
  • Documentation updated (if applicable)
  • No sensitive information (API keys, credentials) exposed

ethancjackson and others added 7 commits June 30, 2026 11:05
Expand the WTI systematic-backtest notebook to put a fuller slate of methods
forward, with easy per-predictor on/off toggles.

- Promote SP500's leak-safe covariate builders into the package
  (aieng/forecasting/data/features.py) as a single source of truth; refactor
  sp500_forecasting/data.py to consume them (behaviour-preserving).
- Add build_wti_multivariate_service: an all-yfinance covariate panel for WTI
  (Brent, natural gas, gasoline, gold, USD index, USL/USO futures-curve contango
  proxy, VIX), graceful-skip on unavailable tickers.
- Rework NB04 into a predictor registry with enabled flags; add LightGBM and
  LightGBM+cov, and run LLMP-Sampled, LLMP-Sampled+cov, LLMP-Grid, and the News
  Agent across both gemini-3.1-flash-lite-preview and gemini-3.5-flash.
- Fix AgentPredictor.predictor_id to fold in the model name when the proxy wraps
  it in a LiteLlm (previously the two agent models collided in the cache).
- Fix score_backtest_results to score realised outcomes against the latest
  available data instead of spec.end, so MAE/coverage populate.
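
For illustration, a predictor registry with enabled flags could look roughly like the sketch below. `PredictorEntry`, `REGISTRY`, and `active_predictors` are hypothetical names, and the string placeholders stand in for real predictor objects; NB04's actual registry may be shaped differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PredictorEntry:
    """One registry line: a display name, a lazy factory, and an on/off flag."""

    name: str
    build: Callable[[], object]
    enabled: bool = True


# Placeholder factories; a real registry would construct predictor objects.
REGISTRY = [
    PredictorEntry("LightGBM", build=lambda: "lgbm"),
    PredictorEntry("LightGBM + cov", build=lambda: "lgbm+cov"),
    PredictorEntry("News Agent", build=lambda: "agent", enabled=False),
]


def active_predictors(registry: list[PredictorEntry]) -> dict[str, object]:
    """Instantiate only the enabled entries, preserving registry order."""
    return {entry.name: entry.build() for entry in registry if entry.enabled}


active = active_predictors(REGISTRY)
print(list(active))
# -> ['LightGBM', 'LightGBM + cov']
```

Keeping construction behind a factory means a disabled entry costs nothing, which matters when the entry would otherwise set up an LLM client.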

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Set override=False on every load_dotenv call so injected workspace
credentials win over repo-root .env, and limit .env.example to optional
personal keys (FRED) now that bootcamp secrets live in the shell.
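
The precedence rule this relies on: with override=False, python-dotenv's load_dotenv leaves variables that are already set in the process environment untouched. The sketch below mimics that rule with the standard library only, so it runs without the package installed. `load_env` and the `DEMO_*` variable names are hypothetical.

```python
import os


def load_env(values: dict[str, str], override: bool = False) -> None:
    """Apply key/value pairs to os.environ, mimicking the override switch."""
    for key, value in values.items():
        # With override=False, a variable already present is left alone.
        if override or key not in os.environ:
            os.environ[key] = value


# Simulate a credential injected by the workspace before the notebook starts.
os.environ["DEMO_API_KEY"] = "injected-by-workspace"
os.environ.pop("DEMO_FRED_KEY", None)

# Simulate the contents of a repo-root .env file.
load_env({"DEMO_API_KEY": "from-repo-dotenv", "DEMO_FRED_KEY": "personal"})

print(os.environ["DEMO_API_KEY"])   # the injected value survives
print(os.environ["DEMO_FRED_KEY"])  # unset keys are still filled from the file
```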

Co-authored-by: Cursor <cursoragent@cursor.com>
…zations

Replace the hard-coded "Core Takeaways" prose (which presupposed the experiment
outcome) with a narrative generated live from the eval results, and add
interpretable post-eval visualizations so the leaderboard is legible rather than
a single opaque number.

analysis.py:
- predictions_to_frame: tidy one-row-per (predictor x origin x horizon) frame
  (point, 80% interval, realised price, CRPS, coverage)
- per_horizon_crps / leaderboard_with_uncertainty (mean +/- SE)
- extract_agent_rationales, build_price_frame, predictor_family
- eval_narrative_md: takeaways computed from the run (winner vs noise floor,
  decisive horizon, best family, calibration, small-sample caveat)
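
As a reference point for the CRPS and mean +/- SE columns, here is a minimal standard-library sketch using the common sample-based estimator CRPS = E|X - y| - 0.5 * E|X - X'|. It is not the code in analysis.py, which may use a different estimator, and the function names are hypothetical.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev


def crps_from_samples(samples: list[float], realised: float) -> float:
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    accuracy = mean(abs(x - realised) for x in samples)
    # Mean absolute difference over all ordered pairs of samples.
    spread = mean(abs(a - b) for a, b in product(samples, repeat=2))
    return accuracy - 0.5 * spread


def mean_and_se(scores: list[float]) -> tuple[float, float]:
    """Mean score and its standard error (sample stdev / sqrt(n))."""
    return mean(scores), stdev(scores) / sqrt(len(scores))


# Two forecast samples straddling a realised price of 72.0.
print(crps_from_samples([70.0, 74.0], 72.0))
# -> 1.0
```

The standard error here treats per-origin scores as independent. With weekly origins and horizons of up to 21 business days the forecast windows overlap, so the interval is probably optimistic.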

viz.py:
- make_crps_heatmap (predictor x horizon — where the ranking is decided)
- make_leaderboard_interval_chart (real edge vs noise)
- make_eval_forecast_chart (median + 80% band vs reality, per origin)
- render_rationales_html (agent reasoning + Langfuse trace links)

04 notebook: new Sections 7-11 (diagnostics, behaviour, agent reasoning,
computed takeaways, de-presupposed conceptual closer). All cells recompute from
eval_results so switching SMOKE_TEST off and rerunning the full suite just works.

Also snapshots incidental working-tree state: smoke-run prediction caches,
regenerated curriculum baselines, and other WIP notebooks.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
energy_oil_eval.yaml was capped at 2026-03-23 (8 weekly origins). With
today at 2026-06-30, extend end to 2026-06-01 (18 origins) -- the latest
origin whose longest horizon (21 business days) still fully resolves
against available data (resolves exactly on 2026-06-30). Both NB04 and
NB06 share this spec.

- energy_oil_eval.yaml: end 2026-03-23 -> 2026-06-01; update header
  comment and description (8 -> 18 origins).
- energy_oil_eval_smoke.yaml: fix stale origin-count comment (8 -> 18).
- 04_systematic_backtest_eval.ipynb / 06_protected_eval.ipynb: update
  markdown describing the eval window (18 origins, Feb-Jun 2026).

Also refreshed the WTI + covariate yfinance caches (Brent, nat gas,
gasoline, gold, USD index, USL/USO curve proxy, VIX) through 2026-06-30
-- several were stale at 2026-05-04 and would have silently starved the
+cov predictors for the new origins. Cache lives under data/yfinance/
(gitignored), not part of this commit.

WIP checkpoint: the full 2025 backtest + extended 2026 eval (12
predictors x 69 origins, no prior prediction cache) is running now, so
the notebook cell outputs below reflect an in-progress run. A follow-up
commit will land the completed run's outputs.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ethancjackson ethancjackson merged commit d068737 into main Jul 1, 2026
2 checks passed
@ethancjackson ethancjackson deleted the lgbm-experiment branch July 1, 2026 10:51