feat(interop): balance interop launch - composition-drift tutorial, contract tests, isolated interop CI job #613
…ontract tests, isolated CI job

Tutorial 26 (docs/tutorials/26_composition_drift_calibration.ipynb): the failure-mode companion to balance's balance_diff_diff_brfss tutorial. BRFSS-style smoking-ban DGP with exact population parallel trends (true ATT -3.0pp); treatment-correlated non-response drift biases design-weight CS to -4.13pp with clean pre-trends; per-wave national raking fails (-4.43pp); state-year raking with balance recovers -3.18pp. Native seam vs balance.interop.diff_diff adapter parity, estimator sweep, cross-package diagnostics.

tests/test_balance_interop_contract.py pins the diff-diff surface balance[did] consumes (no balance import). tests/test_t26_*.py drift-guards the quoted numbers (importorskip balance; runs in the new isolated interop-notebooks job in notebooks.yml - py3.12, balance>=0.21, main notebooks job env unchanged, weekly cron = integration smoke).

Docs handoff closes survey-roadmap Phase 8g: prep.rst calibration section, llms.txt/llms-full/llms-practitioner pointers, references (Deville & Sarndal 1992; Sarig et al. 2023), README line, doc-deps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01M3tqQVEHBRsQdNmmukBuNa
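The contrast this commit draws between a national rake and a state-year rake can be illustrated with a toy sketch. It is not the tutorial's DGP and does not use balance or diff-diff: the function names, the two-state sample, and the 50/50 age target are all hypothetical, and with a single margin the rake reduces to plain post-stratification. It only shows how a pooled margin can be satisfied while composition inside each arm stays off target.

```python
from collections import defaultdict


def poststratify(rows, target_share, by_state):
    """Weight rows so the weighted age shares match target_share.

    by_state=False calibrates the pooled (national) sample;
    by_state=True calibrates within every state separately.
    """
    counts = defaultdict(float)
    totals = defaultdict(float)
    for state, age in rows:
        cell = state if by_state else "national"
        counts[(cell, age)] += 1.0
        totals[cell] += 1.0
    weights = []
    for state, age in rows:
        cell = state if by_state else "national"
        sample_share = counts[(cell, age)] / totals[cell]
        weights.append(target_share[age] / sample_share)
    return weights


def weighted_share(rows, weights, state, age):
    """Weighted share of one age group inside one state."""
    num = sum(w for (s, a), w in zip(rows, weights) if s == state and a == age)
    den = sum(w for (s, a), w in zip(rows, weights) if s == state)
    return num / den


# Hypothetical drifted sample: the treated state under-represents "young",
# the control state over-represents it, so the pooled share is still 50%.
rows = (
    [("treated", "young")] * 30 + [("treated", "old")] * 70
    + [("control", "young")] * 70 + [("control", "old")] * 30
)
target = {"young": 0.5, "old": 0.5}

national = poststratify(rows, target, by_state=False)
per_state = poststratify(rows, target, by_state=True)

treated_young_national = weighted_share(rows, national, "treated", "young")
treated_young_per_state = weighted_share(rows, per_state, "treated", "young")
```

In this toy the national weights are all 1.0 because the pooled margin already matches the target, so the treated state's young share stays at its drifted value; only the within-state calibration moves it to the target.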
Overall Assessment
✅ Looks good. No unmitigated P0/P1 findings.
Executive Summary
Methodology
Code Quality
No findings.
Performance
No findings. The expensive balance raking path is isolated to the new interop notebook job and a skipped-unless-balance drift test.
Maintainability
No findings. The public-surface contract tests are intentionally narrow and document why additive
Tech Debt
No findings. The calibration gap is now documented in
Security
No findings in changed files. The new workflow runs under
Documentation/Tests
No blocking findings. The added drift guard materially reduces stale-prose risk for the tutorial's quoted numbers. Local test execution was not possible here because
…view round 1)

The DGP's mean-zero PSU-year shocks (deliberate - they create the design effects the diagnostics section reads) mean realized finite-population parallel trends hold in expectation, not exactly. Reword the covers list, honesty-contract bullet, DGP docstring (notebook + drift-test copy kept in sync), and the scenario-A readout; systematic bias claims now scoped to composition explicitly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01M3tqQVEHBRsQdNmmukBuNa
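The "in expectation, not exactly" distinction in this commit can be seen in a small simulation. This is a minimal sketch with made-up parameters (a common trend of -0.5, 40 PSUs per arm, unit-variance Gaussian shocks), not the tutorial's DGP: both arms share the same systematic trend, so any gap in realized trends comes purely from the mean-zero shocks.

```python
import random
import statistics


def realized_trend_gap(rng, n_psu=40, shock_sd=1.0, common_trend=-0.5):
    """Treated-minus-control gap in realized mean outcome change.

    Both arms share common_trend; each PSU adds its own mean-zero shock,
    so the gap is zero in expectation but not in any single realization.
    """
    treated = [common_trend + rng.gauss(0.0, shock_sd) for _ in range(n_psu)]
    control = [common_trend + rng.gauss(0.0, shock_sd) for _ in range(n_psu)]
    return statistics.fmean(treated) - statistics.fmean(control)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
gaps = [realized_trend_gap(rng) for _ in range(2000)]

single_draw = gaps[0]                  # one realized finite population
average_gap = statistics.fmean(gaps)   # approximates the expectation
```

A single draw generally shows a nonzero gap, while the average across many draws sits close to zero, which is the sense in which parallel trends hold in expectation rather than exactly.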
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
✅ Looks good. No unmitigated P0/P1 findings.
Executive Summary
Methodology
Code Quality
No findings.
Performance
No findings. The expensive balance raking path is isolated to the new
Maintainability
No findings. The new contract tests are intentionally narrow and pin the interop surface without importing
Tech Debt
No findings. The calibration handoff is now documented in
Security
No findings in the changed diff. The new workflow uses read-only contents permissions and does not introduce secrets.
Documentation/Tests
…mmary churn (review round 2)

Reword the two remaining 'exact population parallel trends' claims (CHANGELOG, docs/tutorials/README) plus the drift-test module docstring to the in-expectation framing; restore 22 docs/api/_autosummary/*.rst stubs to origin/main (regenerated locally by a sphinx run and swept in by git add -A - build artifacts, not part of this PR).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01M3tqQVEHBRsQdNmmukBuNa
…cts, not in this PR's scope)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01M3tqQVEHBRsQdNmmukBuNa
🔁 AI review rerun (requested by @igerber)
Head SHA:
Overall Assessment
✅ Looks good. No unmitigated P0/P1 findings.
Executive Summary
Methodology
No findings. Severity: N/A
Code Quality
No findings. Severity: N/A
Performance
No findings. Severity: N/A
Maintainability
No findings. Severity: N/A
Tech Debt
No findings. Severity: N/A
Security
No findings. Severity: N/A
Documentation/Tests
Summary
- Tutorial 26 (`docs/tutorials/26_composition_drift_calibration.ipynb`) - the failure-mode companion to Meta balance's `balance_diff_diff_brfss` tutorial (their `balance.interop.diff_diff` adapter shipped in balance 0.21, facebookresearch/balance PR #465; `pip install "balance[did]"` pins `diff-diff>=3.3,<4`). A BRFSS-style smoking-ban DGP with population parallel trends true by construction (planted ATT -3.0pp; realized -2.98pp under a rarely-binding probability floor, computed from clipped potential outcomes and quoted as the truth line). Treatment-correlated non-response drift biases the design-weight Callaway-Sant'Anna ATT to -4.13pp with clean pre-trends and a fake growing dynamic profile; a per-wave national rake fails (-4.43pp - margins satisfied in aggregate while arm-level composition is untouched); per-state-year raking (BRFSS's own granularity, population-count totals) recovers -3.18pp. Shows the seam both ways (native `SurveyDesign` + `aggregate_survey` vs `bd.to_panel_for_did` / `bd.fit_did`, with an exact-parity assert), a 3-estimator x 2-weighting sweep, `survey_metadata` + `as_balance_diagnostic` diagnostics, and the Sant'Anna & Xu (2023) caveat with a `drift_start_offset=-2` reader exercise.
- `tests/test_balance_interop_contract.py` (imports no balance): pins the diff-diff surface `balance[did]` consumes - `aggregate_survey` forwarded-params superset + `(panel, SurveyDesign)` return schema, the `SurveyDesign` 15-field dataclass contract (+ TSL/replicate constructions), 17 estimator names + short aliases (CS/DiD/BJS/HAD) with `survey_design=` accepted by every promised `fit()`, the `_balance_adjustment` setattr provenance side-channel, the CS pweight-only guard, and the `SurveyMetadata.design_effect` / `effective_n` / `sum_weights` attribute names read by `as_balance_diagnostic`.
- `tests/test_t26_composition_drift_calibration_drift.py`: re-derives every tutorial-quoted number (planted/realized truth, all four ATTs, CI-excludes-truth, composition shares, native/adapter parity); `pytest.importorskip("balance", minversion="0.21")`; DGP duplicated per the t23 inline-DGP convention plus a t25-style notebook sync guard pinning constants AND load-bearing logic lines.
- `interop-notebooks` CI job (`.github/workflows/notebooks.yml`): python 3.12, installs `balance>=0.21`, same `ready-for-ci` job gate and SHA-pinned actions; executes tutorial 26 via nbmake then runs the drift guard (its only CI home). The workflow's existing weekly cron now doubles as a cross-package integration smoke of diff-diff HEAD against latest PyPI balance. The main notebooks job env is byte-for-byte unchanged (26 added to its `--ignore` list with rationale); the drift-test path was added to the workflow triggers; `EXPECTED_JOBS` registry updated in `tests/test_openai_review.py`. balance never becomes a package dependency or extra.
- Docs handoff: calibration section in `docs/api/prep.rst`; calibration bullets/pointers in `llms.txt`, `llms-full.txt`, `llms-practitioner.txt`; Deville & Sarndal (1992) + Sarig, Galili & Eilat (2023) in `docs/references.rst`; toctree + `docs/tutorials/README.md` entries; `docs/doc-deps.yaml` registration; one README Survey Support line; CHANGELOG.

Methodology references (required if estimator / math changes)
- `diff_diff/` source untouched except `guides/*.txt`. Tutorial methodology: Callaway & Sant'Anna (2021) staggered DiD on survey-aggregated panels; Deville & Sarndal (1992) calibration; Deming & Stephan (1940) raking; Sant'Anna & Xu (2023) compositional changes.

Validation
- `tests/test_balance_interop_contract.py` (29 tests, runs in the main suite), `tests/test_t26_composition_drift_calibration_drift.py` (8 tests; skips cleanly without balance, runs in the interop-notebooks job), `tests/test_openai_review.py` (job registry +1; full file green with `-m ''`, 250 passed)
- `sphinx -W` builds clean with the new toctree entry

Security / privacy
🤖 Generated with Claude Code
https://claude.ai/code/session_01M3tqQVEHBRsQdNmmukBuNa
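
The contract-pinning approach described in the Summary (pinning a dataclass's field names and a `survey_design=` keyword without importing the consumer) can be sketched roughly as follows. `ToyDesign`, `toy_fit`, and the three pinned field names are hypothetical stand-ins, not diff-diff's `SurveyDesign` or any real estimator, and the subset check shown here is only one possible choice of strictness.

```python
import dataclasses
import inspect
from typing import Optional


@dataclasses.dataclass
class ToyDesign:
    """Hypothetical stand-in for a survey-design dataclass."""
    weights: str
    strata: Optional[str] = None
    psu: Optional[str] = None


def toy_fit(data, survey_design=None):
    """Hypothetical stand-in for an estimator's fit() entry point."""
    return {"n_rows": len(data), "survey_design": survey_design}


# Field names a downstream consumer is assumed to rely on (illustrative only).
PINNED_FIELDS = ("weights", "strata", "psu")


def missing_fields(cls, pinned):
    """Return pinned names absent from the dataclass; extra fields are allowed."""
    present = {f.name for f in dataclasses.fields(cls)}
    return [name for name in pinned if name not in present]


def accepts_keyword(func, keyword):
    """True if func's signature exposes the given parameter name."""
    return keyword in inspect.signature(func).parameters


missing = missing_fields(ToyDesign, PINNED_FIELDS)
has_design_kwarg = accepts_keyword(toy_fit, "survey_design")
```

Checks of this shape fail when a pinned field or keyword is renamed or removed, while leaving room for additive changes; a stricter variant would compare the full field tuple for equality.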