Skip to content

feat(interop): balance interop launch - composition-drift tutorial, contract tests, isolated interop CI job#613

Merged
igerber merged 4 commits into
mainfrom
balance-integration
Jul 4, 2026
Merged

feat(interop): balance interop launch - composition-drift tutorial, contract tests, isolated interop CI job#613
igerber merged 4 commits into
mainfrom
balance-integration

Conversation

@igerber

@igerber igerber commented Jul 4, 2026

Copy link
Copy Markdown
Owner

Summary

  • Tutorial 26 (docs/tutorials/26_composition_drift_calibration.ipynb) - the failure-mode companion to Meta balance's balance_diff_diff_brfss tutorial (their balance.interop.diff_diff adapter shipped in balance 0.21, facebookresearch/balance PR Lift Gate 6: cluster-aware CR2 Bell-McCaffrey contrast DOF for MultiPeriodDiD avg_att #465; pip install "balance[did]" pins diff-diff>=3.3,<4). A BRFSS-style smoking-ban DGP with population parallel trends true by construction (planted ATT -3.0pp; realized -2.98pp under a rarely-binding probability floor, computed from clipped potential outcomes and quoted as the truth line). Treatment-correlated non-response drift biases the design-weight Callaway-Sant'Anna ATT to -4.13pp with clean pre-trends and a fake growing dynamic profile; a per-wave national rake fails (-4.43pp - margins satisfied in aggregate while arm-level composition is untouched); per-state-year raking (BRFSS's own granularity, population-count totals) recovers -3.18pp. Shows the seam both ways (native SurveyDesign + aggregate_survey vs bd.to_panel_for_did/bd.fit_did, with an exact-parity assert), a 3-estimator x 2-weighting sweep, survey_metadata + as_balance_diagnostic diagnostics, and the Sant'Anna & Xu (2023) caveat with a drift_start_offset=-2 reader exercise.
  • Interop contract tests (tests/test_balance_interop_contract.py, imports no balance): pins the diff-diff surface balance[did] consumes - aggregate_survey forwarded-params superset + (panel, SurveyDesign) return schema, the SurveyDesign 15-field dataclass contract (+ TSL/replicate constructions), 17 estimator names + short aliases (CS/DiD/BJS/HAD) with survey_design= accepted by every promised fit(), the _balance_adjustment setattr provenance side-channel, the CS pweight-only guard, and the SurveyMetadata.design_effect/effective_n/sum_weights attribute names read by as_balance_diagnostic.
  • Drift guard (tests/test_t26_composition_drift_calibration_drift.py): re-derives every tutorial-quoted number (planted/realized truth, all four ATTs, CI-excludes-truth, composition shares, native/adapter parity); pytest.importorskip("balance", minversion="0.21"); DGP duplicated per the t23 inline-DGP convention plus a t25-style notebook sync guard pinning constants AND load-bearing logic lines.
  • Isolated interop-notebooks CI job (.github/workflows/notebooks.yml): python 3.12, installs balance>=0.21, same ready-for-ci job gate and SHA-pinned actions; executes tutorial 26 via nbmake then runs the drift guard (its only CI home). The workflow's existing weekly cron now doubles as a cross-package integration smoke of diff-diff HEAD against latest PyPI balance. The main notebooks job env is byte-for-byte unchanged (26 added to its --ignore list with rationale); the drift-test path was added to the workflow triggers; EXPECTED_JOBS registry updated in tests/test_openai_review.py. balance never becomes a package dependency or extra.
  • Docs handoff closing survey-roadmap Phase 8g ("post-stratification / calibration: not yet documented"): new "Weight calibration with balance" section in docs/api/prep.rst; calibration bullets/pointers in llms.txt, llms-full.txt, llms-practitioner.txt; Deville & Sarndal (1992) + Sarig, Galili & Eilat (2023) in docs/references.rst; toctree + docs/tutorials/README.md entries; docs/doc-deps.yaml registration; one README Survey Support line; CHANGELOG.

Methodology references (required if estimator / math changes)

  • Method name(s): N/A - no estimator/math code changes (diff_diff/ source untouched except guides/*.txt). Tutorial methodology: Callaway & Sant'Anna (2021) staggered DiD on survey-aggregated panels; Deville & Sarndal (1992) calibration; Deming & Stephan (1940) raking; Sant'Anna & Xu (2023) compositional changes.
  • Paper / source link(s): https://arxiv.org/abs/2307.06024 (balance); https://arxiv.org/abs/2304.13925 (Sant'Anna & Xu); https://doi.org/10.1080/01621459.1992.10475217 (Deville & Sarndal)
  • Any intentional deviations from the source (and why): None

Validation

  • Tests added/updated: tests/test_balance_interop_contract.py (29 tests, runs in the main suite), tests/test_t26_composition_drift_calibration_drift.py (8 tests; skips cleanly without balance, runs in the interop-notebooks job), tests/test_openai_review.py (job registry +1; full file green with -m '', 250 passed)
  • Backtest / simulation / notebook evidence (if applicable): tutorial executes end-to-end in ~13s via nbmake (python 3.12 venv, balance 0.21.0, numpy 2.5 / pandas 3.0; groupby code kept pandas-version-portable); every prose-quoted number verified against executed outputs; native-vs-adapter ATT parity exact; sphinx -W builds clean with the new toctree entry

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

https://claude.ai/code/session_01M3tqQVEHBRsQdNmmukBuNa

…ontract tests, isolated CI job

Tutorial 26 (docs/tutorials/26_composition_drift_calibration.ipynb): the
failure-mode companion to balance's balance_diff_diff_brfss tutorial.
BRFSS-style smoking-ban DGP with exact population parallel trends (true
ATT -3.0pp); treatment-correlated non-response drift biases design-weight
CS to -4.13pp with clean pre-trends; per-wave national raking fails
(-4.43pp); state-year raking with balance recovers -3.18pp. Native seam
vs balance.interop.diff_diff adapter parity, estimator sweep, cross-
package diagnostics.

tests/test_balance_interop_contract.py pins the diff-diff surface
balance[did] consumes (no balance import). tests/test_t26_*.py drift-
guards the quoted numbers (importorskip balance; runs in the new
isolated interop-notebooks job in notebooks.yml - py3.12, balance>=0.21,
main notebooks job env unchanged, weekly cron = integration smoke).

Docs handoff closes survey-roadmap Phase 8g: prep.rst calibration
section, llms.txt/llms-full/llms-practitioner pointers, references
(Deville & Sarndal 1992; Sarig et al. 2023), README line, doc-deps.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01M3tqQVEHBRsQdNmmukBuNa
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown

Overall Assessment

✅ Looks good. No unmitigated P0/P1 findings.

Executive Summary

  • No estimator implementation, variance, weighting, or default behavior changes were introduced in diff_diff/ source.
  • The new balance interop contract tests align with the existing registry surface for survey support and result aliases.
  • The interop notebook CI isolation is reasonable: balance stays out of package requirements and the main notebook job.
  • I found one P2 documentation/methodology accuracy issue in the tutorial prose: the DGP claims exact finite-population parallel trends while adding time-varying state/PSU shocks.
  • I could not run tests locally because the review environment lacks numpy; review is static against the diff and local source.

Methodology

  • Severity: P2
  • Location: docs/tutorials/26_composition_drift_calibration.ipynb:L27, docs/tutorials/26_composition_drift_calibration.ipynb:L35
  • Impact: The tutorial repeatedly says population parallel trends are “exact” and that all estimator bias comes from sample composition. But the DGP adds psu_shock[state, psu_idx, year_idx], which is time-varying at the state/year level. That makes exact realized finite-population parallel trends no longer guaranteed; they hold in expectation up to random shock noise.
  • Concrete fix: Either remove/make time-invariant the PSU shock in the truth-generating outcome, or soften the prose/docstring to “parallel trends hold in expectation; seed-specific random state-year shocks are small noise” and avoid saying all bias is composition-only.

Code Quality

No findings.

Performance

No findings. The expensive balance raking path is isolated to the new interop notebook job and a skipped-unless-balance drift test.

Maintainability

No findings. The public-surface contract tests are intentionally narrow and document why additive SurveyDesign field changes require conscious sync with balance.

Tech Debt

No findings. The calibration gap is now documented in docs/survey-roadmap.md and supported by tutorial/API docs.

Security

No findings in changed files. The new workflow runs under pull_request with read-only contents permissions and does not expose secrets.

Documentation/Tests

No blocking findings. The added drift guard materially reduces stale-prose risk for the tutorial’s quoted numbers. Local test execution was not possible here because numpy is not installed in the review environment.

…view round 1)

The DGP's mean-zero PSU-year shocks (deliberate - they create the design
effects the diagnostics section reads) mean realized finite-population
parallel trends hold in expectation, not exactly. Reword the covers list,
honesty-contract bullet, DGP docstring (notebook + drift-test copy kept
in sync), and the scenario-A readout; systematic bias claims now scoped
to composition explicitly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01M3tqQVEHBRsQdNmmukBuNa
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 9dfed2c9da184e955266dab611e19a313a427cd2


Overall Assessment

✅ Looks good. No unmitigated P0/P1 findings.

Executive Summary

  • No estimator implementation, variance, weighting, identification, or default-behavior changes were introduced in diff_diff/ source.
  • The prior notebook methodology wording issue is largely fixed: Tutorial 26 now correctly says parallel trends hold in expectation because PSU-year shocks add seed-specific noise.
  • One residual P2 documentation accuracy issue remains in changed release/tutorial index prose that still says “exact population parallel trends.”
  • The balance interop contract tests align with the current SurveyDesign, aggregate_survey, estimator alias, and survey_metadata surfaces.
  • I could not run tests locally because this review environment lacks numpy; review is static.

Methodology

  • Severity: P2
  • Location: CHANGELOG.md:L28, docs/tutorials/README.md:L140, supporting DGP code at tests/test_t26_composition_drift_calibration_drift.py:L103-L125
  • Impact: These changed docs still claim “exact population parallel trends,” but the DGP includes time-varying PSU-year shocks. The notebook/test docstrings now correctly frame this as parallel trends in expectation, with seed-specific noise but no systematic drift. The remaining “exact” wording overstates finite-population identification in release/tutorial summary text.
  • Concrete fix: Replace “exact population parallel trends” with “population parallel trends hold in expectation” or “no systematic arm-specific trends; PSU-year shocks add seed-specific noise” in both changed docs.

Code Quality

No findings.

Performance

No findings. The expensive balance raking path is isolated to the new interop-notebooks job and a balance-gated drift test.

Maintainability

No findings. The new contract tests are intentionally narrow and pin the interop surface without importing balance.

Tech Debt

No findings. The calibration handoff is now documented in docs/survey-roadmap.md, docs/api/prep.rst, and Tutorial 26.

Security

No findings in the changed diff. The new workflow uses read-only contents permissions and does not introduce secrets.

Documentation/Tests

  • Severity: P3
  • Location: docs/api/_autosummary/diff_diff.SyntheticControl.rst:L10-L18, docs/api/_autosummary/diff_diff.TwoStageDiD.rst:L24-L33
  • Impact: The regenerated autosummary pages drop some public estimator attributes that still exist as instance attributes or constructor parameters, e.g. SyntheticControl configuration fields and TwoStageDiD.vcov_type. This is a minor API-doc completeness regression, not a behavior issue.
  • Concrete fix: Restore the omitted autosummary attribute rows or document intentionally that estimator configuration is covered only through constructor/get_params docs.

igerber and others added 2 commits July 4, 2026 10:18
…mmary churn (review round 2)

Reword the two remaining 'exact population parallel trends' claims
(CHANGELOG, docs/tutorials/README) plus the drift-test module docstring
to the in-expectation framing; restore 22 docs/api/_autosummary/*.rst
stubs to origin/main (regenerated locally by a sphinx run and swept in
by git add -A - build artifacts, not part of this PR).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01M3tqQVEHBRsQdNmmukBuNa
…cts, not in this PR's scope)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01M3tqQVEHBRsQdNmmukBuNa
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: a3f732db6c05aa6286a8efc5f51ee86bd9ed858b


Overall Assessment

✅ Looks good. No unmitigated P0/P1 findings.

Executive Summary

  • No estimator, weighting, variance/SE, or default-behavior code paths in diff_diff/ were changed.
  • The prior methodology wording issue is resolved: changed prose now says parallel trends hold in expectation, not “exact population parallel trends.”
  • The tutorial’s survey/calibration framing aligns with REGISTRY.md: SurveyDesign consumes pre-calibrated weights, and aggregate_survey returns pweight second-stage survey designs by default.
  • The balance interop contract tests pin the relevant public surface without importing balance.
  • One P3 test-maintainability gap remains in the notebook sync guard.
  • I could not run tests locally because this environment lacks numpy; review is static.

Methodology

No findings.

Severity: N/A
Impact: The affected methods are documentation/test coverage around aggregate_survey, SurveyDesign, and Callaway-Sant’Anna survey fits. No estimator math changed, and the docs match the Methodology Registry’s survey aggregation and upstream-calibration contract.
Concrete fix: N/A

Code Quality

No findings.

Severity: N/A
Impact: New tests are narrow and contract-oriented; no production code quality regression found.
Concrete fix: N/A

Performance

No findings.

Severity: N/A
Impact: The expensive balance/raking path is isolated to the new interop-notebooks job and skipped elsewhere via pytest.importorskip.
Concrete fix: N/A

Maintainability

No findings.

Severity: N/A
Impact: The interop surface is explicitly pinned, including aggregate_survey, SurveyDesign, estimator aliases, result mutability, and SurveyMetadata attributes.
Concrete fix: N/A

Tech Debt

No findings.

Severity: N/A
Impact: Calibration remains documented as upstream, consistent with the survey roadmap update.
Concrete fix: N/A

Security

No findings.

Severity: N/A
Impact: The new workflow uses read-only contents permissions and does not introduce secrets in the changed files.
Concrete fix: N/A

Documentation/Tests

  • Severity: P3
  • Location: tests/test_t26_composition_drift_calibration_drift.py:L50-L82, tests/test_t26_composition_drift_calibration_drift.py:L433-L451
  • Impact: The sync guard says the duplicated DGP constants must stay in sync with the notebook, but it only checks a subset of constants and load-bearing lines. A future notebook-only edit to an unchecked constant, such as STATE_RE_SD_PP, PSU_SHOCK_SD_PP, P_CLIP_PP, or response-shift arrays, could leave the duplicated drift-test DGP stale while nbmake still passes. This is informational because the current quoted numbers are still guarded by the balance-gated drift test.
  • Concrete fix: Either add all DGP constants to the needle list or compare a normalized/hashable copy of the full DGP code cell against the test fixture.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Jul 4, 2026
@igerber igerber merged commit 0faa01b into main Jul 4, 2026
35 of 36 checks passed
@igerber igerber deleted the balance-integration branch July 4, 2026 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant