Merged
65 changes: 65 additions & 0 deletions .github/workflows/notebooks.yml
@@ -8,6 +8,8 @@ on:
- 'diff_diff/**'
- 'pyproject.toml'
- '.github/workflows/notebooks.yml'
# the interop drift guard runs only in this workflow (balance installed)
- 'tests/test_t26_composition_drift_calibration_drift.py'
pull_request:
branches: [main]
types: [opened, synchronize, reopened, labeled, unlabeled]
@@ -16,6 +18,8 @@ on:
- 'diff_diff/**'
- 'pyproject.toml'
- '.github/workflows/notebooks.yml'
# the interop drift guard runs only in this workflow (balance installed)
- 'tests/test_t26_composition_drift_calibration_drift.py'
schedule:
# Weekly Sunday 6am UTC — smoke test that notebooks still execute cleanly
- cron: '0 6 * * 0'
@@ -58,11 +62,16 @@ jobs:
--nbmake-timeout=600 \
--ignore=docs/tutorials/06_power_analysis.ipynb \
--ignore=docs/tutorials/10_trop.ipynb \
--ignore=docs/tutorials/26_composition_drift_calibration.ipynb \
-v \
--tb=short
# Excluded notebooks (too slow for pure-Python CI without Rust backend):
# 06_power_analysis — SyntheticDiD simulate_power Monte Carlo (>600s)
# 10_trop — LOOCV grid search (>600s)
# Excluded notebooks (external interop dependency):
# 26_composition_drift_calibration — requires the balance package;
# runs in the isolated interop-notebooks job below so this job's
# minimal env keeps enforcing that tutorials add no dependencies

- name: Upload failed notebook outputs
if: failure()
@@ -71,3 +80,59 @@ jobs:
name: failed-notebook-outputs
path: docs/tutorials/*.ipynb
retention-days: 7

interop-notebooks:
name: Execute balance-interop notebook
# Same ready-for-ci label gate as execute-notebooks (keep in sync).
if: >-
github.event_name != 'pull_request'
|| (contains(github.event.pull_request.labels.*.name, 'ready-for-ci')
&& (github.event.action != 'labeled' && github.event.action != 'unlabeled'
|| github.event.label.name == 'ready-for-ci'))
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0 # v7

- name: Set up Python
uses: actions/setup-python@ece7cb06caefa5fff74198d8649806c4678c61a1 # v6
with:
# 3.12+: balance's legacy numpy<2 / scipy<1.14 / scikit-learn<1.4
# pins apply only to python < 3.12
python-version: '3.12'

- name: Install dependencies
# balance is a tutorial-only dependency: it is installed ONLY in this
# isolated job, never in package requirements or the main notebooks
# job. The weekly cron on this workflow doubles as a cross-package
# integration smoke of diff-diff HEAD against latest PyPI balance.
run: |
pip install numpy pandas scipy matplotlib nbmake pytest ipykernel "balance>=0.21"
# Add repo root to Python path so Jupyter kernels can import diff_diff
# (pip install -e . requires the Rust/maturin toolchain; .pth avoids that)
python -c "import site; print(site.getsitepackages()[0])" | xargs -I{} sh -c 'echo "$PWD" > {}/diff_diff_dev.pth'

- name: Execute interop notebook
env:
DIFF_DIFF_BACKEND: python
run: |
pytest --nbmake docs/tutorials/26_composition_drift_calibration.ipynb \
--nbmake-timeout=600 \
-v \
--tb=short

- name: Run interop drift guard
# balance is present only in this job, so this is the drift test's
# CI home (it importorskips balance everywhere else).
env:
DIFF_DIFF_BACKEND: python
run: |
pytest tests/test_t26_composition_drift_calibration_drift.py -v --tb=short

- name: Upload failed notebook outputs
if: failure()
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7
with:
name: failed-interop-notebook-outputs
path: docs/tutorials/26_composition_drift_calibration.ipynb
retention-days: 7
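
The `if:` gate on the `interop-notebooks` job is easier to audit as a plain predicate. The sketch below restates it in Python, assuming GitHub's documented operator precedence (`&&` binds tighter than `||`). The function and argument names are illustrative only and are not part of the workflow.

```python
GATE = "ready-for-ci"


def should_run(event_name, labels, action=None, label_name=None):
    """Illustrative restatement of the job's `if:` expression.

    Simplification: GitHub's contains() compares strings case-insensitively,
    while Python's `in` is case-sensitive.
    """
    if event_name != "pull_request":
        # push and schedule events are never gated
        return True
    has_gate = GATE in labels
    # A label event only counts when the label being toggled is the gate
    # itself; other PR actions (opened, synchronize, reopened) pass through.
    is_label_event = action in ("labeled", "unlabeled")
    relevant = (not is_label_event) or label_name == GATE
    return has_gate and relevant


if __name__ == "__main__":
    print(should_run("schedule", []))
    print(should_run("pull_request", [GATE], "synchronize"))
    print(should_run("pull_request", [GATE, "docs"], "labeled", "docs"))
```

Under this reading, adding an unrelated label to a PR that already carries `ready-for-ci` does not re-trigger the job, which appears to be the reason for the inner `labeled`/`unlabeled` check.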
31 changes: 31 additions & 0 deletions CHANGELOG.md
@@ -20,6 +20,37 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [3.6.2] - 2026-07-03

### Added
- **balance interop launch: composition-drift tutorial + `interop-notebooks` CI job.** Meta's
`balance` package (>= 0.21) ships a one-way adapter `balance.interop.diff_diff`
(facebookresearch/balance PR #465) whose `balance[did]` extra pins `diff-diff>=3.3,<4`.
`docs/tutorials/26_composition_drift_calibration.ipynb` is the diff-diff-side companion to
balance's `balance_diff_diff_brfss` tutorial, telling the failure-mode half of the story: a
BRFSS-style smoking-ban DGP with no systematic arm-specific trends (parallel trends hold in
expectation; planted ATT -3.0pp, realized -2.98pp under a rarely-binding probability floor) where
treatment-correlated non-response drift biases the design-weight Callaway-Sant'Anna ATT to
~-4.1pp with *clean pre-trends*; a per-wave national rake fails (~-4.4pp - margins satisfied in
aggregate while arm-level composition is untouched); per-state raking with balance (BRFSS's own
granularity, population-count totals) recovers ~-3.2pp. Also covers the seam both ways (native
`SurveyDesign` + `aggregate_survey` vs `bd.to_panel_for_did`/`bd.fit_did`, exact-parity assert),
a 3-estimator x 2-weighting sweep, and `as_balance_diagnostic` cross-package diagnostics.
`tests/test_t26_composition_drift_calibration_drift.py` re-derives every quoted number
(auto-skips without balance). balance stays out of package requirements: the tutorial runs in a
new isolated `interop-notebooks` job in `notebooks.yml` (python 3.12, installs
`balance>=0.21`, also the drift guard's CI home; the workflow's weekly cron doubles as a
cross-package integration smoke against latest PyPI balance), and the main notebooks job env is
unchanged.
- **`balance.interop.diff_diff` contract tests.** `tests/test_balance_interop_contract.py` pins
the diff-diff surface Meta's balance adapter consumes, importing no balance code:
`aggregate_survey` forwarded-params superset + `(panel, SurveyDesign)` return schema, the
`SurveyDesign` 15-field dataclass contract (plus TSL / replicate constructions), estimator and
short-alias resolution (`CS`/`DiD`/`BJS`/`HAD`) with `survey_design=` accepted by all 17
promised `fit()` signatures, the `_balance_adjustment` setattr provenance side-channel
(guards against future `__slots__`), the CallawaySantAnna pweight-only guard, and the
`SurveyMetadata.design_effect`/`effective_n`/`sum_weights` attribute names read by
`as_balance_diagnostic`. Docs handoff closing the survey-roadmap Phase 8g gap: "Weight
calibration with balance" section in `docs/api/prep.rst`, calibration pointers in
`llms.txt`/`llms-full.txt`/`llms-practitioner.txt` and `README.md` Survey Support, and
Deville & Särndal (1992) + Sarig, Galili & Eilat (2023) in `docs/references.rst`.
- **`SyntheticControl` ADH-2015 §4 tail diagnostics** (two opt-in `SyntheticControlResults`
methods, closing the last two ADH-2015 §4 checklist items). `regression_weights()` reports the
implied donor weights `W^reg = X0a'(X0a X0a')^{-1} X1a` of the regression counterfactual
1 change: 1 addition & 0 deletions README.md
@@ -137,6 +137,7 @@ Most estimators accept an optional `survey_design` parameter (or `survey=` / `we
- **Variance methods**: Taylor Series Linearization (TSL via Binder 1983), replicate weights (BRR / Fay / JK1 / JKn / SDR), survey-aware bootstrap
- **Diagnostics**: DEFF per coefficient, effective n, subpopulation analysis, weight trimming, CV on estimates
- **Repeated cross-sections**: `CallawaySantAnna(panel=False)` for BRFSS, ACS, CPS
- **Weight calibration / raking**: upstream by design - pair with Meta's [balance](https://import-balance.org/) package, whose `balance.interop.diff_diff` adapter hands raked samples straight to diff-diff; see the [composition-drift tutorial](https://diff-diff.readthedocs.io/en/stable/tutorials/26_composition_drift_calibration.html)

No other Python or R DiD package offers design-based variance estimation for modern heterogeneity-robust estimators.

16 changes: 16 additions & 0 deletions diff_diff/guides/llms-full.txt
@@ -1188,6 +1188,22 @@ read-throughs for compatibility with external adapters that
the canonical names; assume the flat aliases are present on every
staggered class unless explicitly noted otherwise.

**balance interop.** Meta's `balance` package (>= 0.21) ships the
one-way adapter `balance.interop.diff_diff` (`pip install "balance[did]"`,
pins `diff-diff>=3.3,<4`): `to_survey_design(sample)` builds a
`SurveyDesign` from a balance `Sample`'s active weight column plus the
convention columns `stratum`/`psu`/`fpc`; `to_panel_for_did(sample, by=,
outcomes=)` wraps `diff_diff.aggregate_survey` to collapse respondent
microdata into a unit-period panel plus second-stage design;
`fit_did(sample, estimator=, ...)` resolves any exported estimator by
name or short alias (`CS`/`DiD`/`BJS`/`HAD`) and forwards
`survey_design=`, attaching the source Sample to the result as
`_balance_adjustment` for provenance; `as_balance_diagnostic(sample,
res)` joins balance's ASMD/Kish-ESS with `res.survey_metadata`'s
DEFF/effective-n into one flat dict. The diff-diff surface it consumes
is pinned by `tests/test_balance_interop_contract.py`; the workflow is
demonstrated in Tutorial 26 (composition drift & calibration).

### DiDResults

Returned by `DifferenceInDifferences.fit()` and `TwoWayFixedEffects.fit()`.
5 changes: 5 additions & 0 deletions diff_diff/guides/llms-practitioner.txt
@@ -48,6 +48,11 @@ Key questions to answer:
units.
- Is there treatment effect heterogeneity you should preserve rather than
average over?
- If the data are a survey: are the weights calibrated (raked) at the
granularity of your comparison units? Non-response drift that correlates
with treatment timing does NOT difference out of a DiD; calibrate upstream
with Meta's balance package first — see Tutorial 26:
docs/tutorials/26_composition_drift_calibration.ipynb.
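
The claim in the last question (drift that correlates with treatment timing does not difference out) can be checked with a few lines of arithmetic. The sketch below uses toy numbers chosen purely for illustration, not the Tutorial 26 DGP: two age cells, equal arm sizes, a true ATT of -3pp, and single-margin post-stratification standing in for raking. All names are hypothetical.

```python
# Toy numbers for illustration only (NOT the tutorial's DGP).
POP_YOUNG = 0.5  # true population share of young people, every arm and wave

# Smoking rate by (arm, wave) and age cell; the ban lowers both cells by 3pp.
rate = {
    ("control", "pre"): {"young": 0.30, "old": 0.10},
    ("control", "post"): {"young": 0.30, "old": 0.10},
    ("treated", "pre"): {"young": 0.30, "old": 0.10},
    ("treated", "post"): {"young": 0.27, "old": 0.07},
}
# Share of young respondents in the realized sample; only treated/post drifts.
sample_young = {
    ("control", "pre"): 0.5, ("control", "post"): 0.5,
    ("treated", "pre"): 0.5, ("treated", "post"): 0.4,
}


def cell_mean(key, w_young, w_old):
    """Weighted mean outcome for one arm-wave given per-age-cell weights."""
    s = sample_young[key]
    num = s * w_young * rate[key]["young"] + (1 - s) * w_old * rate[key]["old"]
    return num / (s * w_young + (1 - s) * w_old)


def did(weights):
    """2x2 DiD on weighted means; `weights` maps key -> (w_young, w_old)."""
    m = {k: cell_mean(k, *weights[k]) for k in rate}
    return (m["treated", "post"] - m["treated", "pre"]) - (
        m["control", "post"] - m["control", "pre"]
    )


# 1) No adjustment.
naive = {k: (1.0, 1.0) for k in rate}

# 2) Pooled ("national") adjustment per wave: one weight per age cell,
#    shared by both arms (equal arm sizes assumed).
pooled = {}
for wave in ("pre", "post"):
    share = (sample_young["control", wave] + sample_young["treated", wave]) / 2
    for arm in ("control", "treated"):
        pooled[arm, wave] = (POP_YOUNG / share, (1 - POP_YOUNG) / (1 - share))

# 3) Adjustment at the granularity of the comparison units (per arm-wave).
per_arm = {
    k: (POP_YOUNG / s, (1 - POP_YOUNG) / (1 - s)) for k, s in sample_young.items()
}

if __name__ == "__main__":
    for name, w in [("naive", naive), ("pooled", pooled), ("per-arm", per_arm)]:
        print(f"{name:8s} DiD = {100 * did(w):+.2f} pp")
```

In this toy setup the unadjusted and pooled-adjusted estimates both land near -5pp, while the per-arm adjustment recovers the planted -3pp: the pooled weights fix the margin in aggregate but leave each arm's composition off, which mirrors the raking-granularity point made in the tutorial.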

```python
# After estimation, the target parameter is available as:
2 changes: 2 additions & 0 deletions diff_diff/guides/llms.txt
@@ -101,6 +101,7 @@ Full practitioner guide: call `diff_diff.get_llm_guide("practitioner")`
- [16 Survey DiD](https://diff-diff.readthedocs.io/en/stable/tutorials/16_survey_did.html): Survey-weighted DiD — SurveyDesign, strata/PSU/FPC, replicate weights, subpopulation analysis, DEFF diagnostics
- [16 Wooldridge ETWFE](https://diff-diff.readthedocs.io/en/stable/tutorials/16_wooldridge_etwfe.html): Wooldridge (2023, 2025) ETWFE — saturated OLS, logit/Poisson (ASF-based ATT), aggregation types
- [22 HAD Survey-Weighted Workflow](https://diff-diff.readthedocs.io/en/stable/tutorials/22_had_survey_design.html): HeterogeneousAdoptionDiD + did_had_pretest_workflow under SurveyDesign(strata, psu, weights, fpc) — BRFSS-shape panel, modest SE inflation explanation, Phase 4.5 C0 QUG-deferred verdict
- [26 Composition Drift & Calibration](https://diff-diff.readthedocs.io/en/stable/tutorials/26_composition_drift_calibration.html): When differential non-response biases the DiD itself — per-state raking with Meta's balance package, `balance.interop.diff_diff` adapter, raking-granularity lesson (requires `pip install balance`)

## Survey Support

@@ -110,6 +111,7 @@ Most estimators accept an optional `survey_design` parameter (`SyntheticControl`
- **Variance methods**: Taylor Series Linearization (TSL), replicate weights (BRR/Fay/JK1/JKn/SDR), survey-aware bootstrap
- **Diagnostics**: DEFF per coefficient, effective n, subpopulation analysis, weight trimming, CV on estimates
- **Repeated cross-sections**: `CallawaySantAnna(panel=False)` for BRFSS, ACS, CPS
- **Weight calibration / raking**: upstream by design — pair with Meta's [balance](https://import-balance.org/) package (>= 0.21), whose `balance.interop.diff_diff` adapter (`to_survey_design` / `to_panel_for_did` / `fit_did` / `as_balance_diagnostic`, `pip install "balance[did]"`) hands raked samples straight to diff-diff estimators; see [Tutorial 26](https://diff-diff.readthedocs.io/en/stable/tutorials/26_composition_drift_calibration.html) for when calibration is essential for the causal estimand itself
- **Compatibility matrix**: [Survey Design Support](https://diff-diff.readthedocs.io/en/stable/choosing_estimator.html#survey-design-support)

No R or Python package offers design-based variance estimation for modern heterogeneity-robust DiD estimators. R's `did`, `fixest`, `synthdid`, and `didimputation` accept flat weight vectors only.
35 changes: 35 additions & 0 deletions docs/api/prep.rst
@@ -366,6 +366,41 @@ Example
# treatment="treated", time="post", survey_design=stage2,
# )

Weight calibration with balance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``SurveyDesign`` expects **pre-calibrated** weights: post-stratification,
raking, and calibration are deliberately out of scope for diff-diff and
remain upstream. Meta's `balance <https://import-balance.org/>`_ package
(>= 0.21) is the recommended companion - it rakes survey samples to
population margins and ships a dedicated adapter,
``balance.interop.diff_diff``, that hands the calibrated sample straight
to diff-diff (``to_survey_design`` / ``to_panel_for_did`` / ``fit_did`` /
``as_balance_diagnostic``; installable via ``pip install "balance[did]"``).

The handoff needs no adapter if you prefer the native seam - calibrated
weights are just a column::

    design = SurveyDesign(weights="raked_wt", strata="strat", psu="psu")
    panel, stage2 = aggregate_survey(
        microdata, by=["state", "year"], outcomes="smoking_rate",
        survey_design=design,
    )
    result = CallawaySantAnna().fit(
        panel, outcome="smoking_rate_mean", unit="state", time="year",
        first_treat="g", survey_design=stage2,
    )

**When calibration matters for the causal estimand** (not just
descriptives): non-response drift that is differential by treatment arm
and time does *not* difference out of a DiD. See the
:doc:`composition-drift tutorial <../tutorials/26_composition_drift_calibration>`
for a worked BRFSS-style failure mode - including why raking granularity
must match the comparison units (state-level raking, as BRFSS itself
does, not a pooled national rake) - and the companion
`balance tutorial <https://github.com/facebookresearch/balance/blob/main/tutorials/balance_diff_diff_brfss.ipynb>`_
for the robust case (common drift) and descriptive-estimand repair.

Data Validation
---------------

6 changes: 6 additions & 0 deletions docs/doc-deps.yaml
@@ -846,6 +846,9 @@ sources:
type: roadmap
- path: docs/tutorials/16_survey_did.ipynb
type: tutorial
- path: docs/tutorials/26_composition_drift_calibration.ipynb
type: tutorial
note: "balance interop: calibration handoff + composition-drift failure mode"
- path: README.md
section: "Survey Support"
type: user_guide
@@ -953,6 +956,9 @@ sources:
docs:
- path: docs/api/prep.rst
type: api_reference
- path: docs/tutorials/26_composition_drift_calibration.ipynb
type: tutorial
note: "aggregate_survey is the seam balance.interop.diff_diff wraps"
- path: docs/practitioner_getting_started.rst
type: user_guide
- path: docs/practitioner_decision_tree.rst
1 change: 1 addition & 0 deletions docs/index.rst
@@ -84,6 +84,7 @@ Quick Links
tutorials/21_had_pretest_workflow
tutorials/22_had_survey_design
tutorials/23_spillover_tva
tutorials/26_composition_drift_calibration

.. toctree::
:maxdepth: 1
8 changes: 8 additions & 0 deletions docs/references.rst
@@ -106,6 +106,14 @@ Survey-Design Inference (Taylor-Series Linearization)

The "when to weight" framework distinguishing precision, endogenous-sampling, and population-effect motivations for survey weights; cited in REGISTRY.md ``## Survey Data Support`` -> "Weighted Estimation".

- **Deville, J.-C. & Särndal, C.-E. (1992).** "Calibration Estimators in Survey Sampling." *Journal of the American Statistical Association*, 87(418), 376-382. https://doi.org/10.1080/01621459.1992.10475217

The calibration/raking framework underlying post-stratified survey weights. diff-diff deliberately keeps calibration upstream (``SurveyDesign`` expects pre-calibrated weights); the ``docs/api/prep.rst`` "Weight calibration with balance" section and Tutorial 26 document the handoff.

- **Sarig, T., Galili, T. & Eilat, R. (2023).** "balance - a Python package for balancing biased data samples." *arXiv:2307.06024* (stat.CO). https://arxiv.org/abs/2307.06024

Meta's ``balance`` package, the recommended upstream calibration companion. Its ``balance.interop.diff_diff`` adapter (balance >= 0.21) hands raked samples to diff-diff's survey-aware estimators; Tutorial 26 (``docs/tutorials/26_composition_drift_calibration.ipynb``) demonstrates the workflow, and ``tests/test_balance_interop_contract.py`` pins the consumed surface.

Placebo Tests and DiD Diagnostics
---------------------------------

18 changes: 11 additions & 7 deletions docs/survey-roadmap.md
@@ -109,13 +109,17 @@ Files: `benchmarks/R/benchmark_realdata_*.R`, `tests/test_survey_real_data.py`,

- **Multi-stage design**: not yet documented. Single-stage (strata + PSU)
is sufficient per Lumley (2004) Section 2.2.
- **Post-stratification / calibration**: not yet documented. `SurveyDesign`
expects pre-calibrated weights. `samplics` is the most complete Python
option (post-stratification, raking, GREG) but is in read-only mode —
active development has moved to `svy`, which is not yet publicly
released. `weightipy` is actively maintained for raking. Weight
calibration is out of scope for diff-diff today, though building this
capability is a future possibility.
- **Post-stratification / calibration**: DOCUMENTED (2026-07). `SurveyDesign`
expects pre-calibrated weights; calibration stays upstream by design. The
recommended companion is Meta's `balance` package (>= 0.21), which ships a
dedicated `balance.interop.diff_diff` adapter (`pip install "balance[did]"`).
The handoff is documented in `docs/api/prep.rst` ("Weight calibration with
balance"), demonstrated end-to-end in
`docs/tutorials/26_composition_drift_calibration.ipynb` (including when
calibration is essential for the causal estimand, not just descriptives),
and the consumed diff-diff surface is pinned by
`tests/test_balance_interop_contract.py`. `samplics` (read-only; successor
`svy` not yet released) and `weightipy` remain alternatives.

### Phase 10: Survey Completeness (v2.9.0–v3.0)
