igerber · Jul 4, 2026
@@ -8,6 +8,8 @@ on:
       - 'diff_diff/**'
       - 'pyproject.toml'
       - '.github/workflows/notebooks.yml'
+      # the interop drift guard runs only in this workflow (balance installed)
+      - 'tests/test_t26_composition_drift_calibration_drift.py'
   pull_request:
     branches: [main]
     types: [opened, synchronize, reopened, labeled, unlabeled]
@@ -16,6 +18,8 @@ on:
       - 'diff_diff/**'
       - 'pyproject.toml'
       - '.github/workflows/notebooks.yml'
+      # the interop drift guard runs only in this workflow (balance installed)
+      - 'tests/test_t26_composition_drift_calibration_drift.py'
   schedule:
     # Weekly Sunday 6am UTC — smoke test that notebooks still execute cleanly
     - cron: '0 6 * * 0'
@@ -58,11 +62,16 @@ jobs:
             --nbmake-timeout=600 \
             --ignore=docs/tutorials/06_power_analysis.ipynb \
             --ignore=docs/tutorials/10_trop.ipynb \
+            --ignore=docs/tutorials/26_composition_drift_calibration.ipynb \
             -v \
             --tb=short
           # Excluded notebooks (too slow for pure-Python CI without Rust backend):
           #   06_power_analysis — SyntheticDiD simulate_power Monte Carlo (>600s)
           #   10_trop — LOOCV grid search (>600s)
+          # Excluded notebooks (external interop dependency):
+          #   26_composition_drift_calibration — requires the balance package;
+          #   runs in the isolated interop-notebooks job below so this job's
+          #   minimal env keeps enforcing that tutorials add no dependencies
 
       - name: Upload failed notebook outputs
         if: failure()
@@ -71,3 +80,59 @@ jobs:
           name: failed-notebook-outputs
           path: docs/tutorials/*.ipynb
           retention-days: 7
+
+  interop-notebooks:
+    name: Execute balance-interop notebook
+    # Same ready-for-ci label gate as execute-notebooks (keep in sync).
+    if: >-
+      github.event_name != 'pull_request'
+      || (contains(github.event.pull_request.labels.*.name, 'ready-for-ci')
+      && (github.event.action != 'labeled' && github.event.action != 'unlabeled'
+      || github.event.label.name == 'ready-for-ci'))
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@9c091bb21b7c1c1d1991bb908d89e4e9dddfe3e0 # v7
+
+      - name: Set up Python
+        uses: actions/setup-python@ece7cb06caefa5fff74198d8649806c4678c61a1 # v6
+        with:
+          # Use Python 3.12+: balance's legacy numpy<2 / scipy<1.14 /
+          # scikit-learn<1.4 pins apply only on Python < 3.12
+          python-version: '3.12'
+
+      - name: Install dependencies
+        # balance is a tutorial-only dependency: it is installed ONLY in this
+        # isolated job, never in package requirements or the main notebooks
+        # job. The weekly cron on this workflow doubles as a cross-package
+        # integration smoke of diff-diff HEAD against latest PyPI balance.
+        run: |
+          pip install numpy pandas scipy matplotlib nbmake pytest ipykernel "balance>=0.21"
+          # Add repo root to Python path so Jupyter kernels can import diff_diff
+          # (pip install -e . requires the Rust/maturin toolchain; .pth avoids that)
+          python -c "import site; print(site.getsitepackages()[0])" | xargs -I{} sh -c 'echo "$PWD" > {}/diff_diff_dev.pth'
+
+      - name: Execute interop notebook
+        env:
+          DIFF_DIFF_BACKEND: python
+        run: |
+          pytest --nbmake docs/tutorials/26_composition_drift_calibration.ipynb \
+            --nbmake-timeout=600 \
+            -v \
+            --tb=short
+
+      - name: Run interop drift guard
+        # balance is installed only in this job, so this is the drift test's
+        # CI home (pytest.importorskip skips it wherever balance is missing).
+        env:
+          DIFF_DIFF_BACKEND: python
+        run: |
+          pytest tests/test_t26_composition_drift_calibration_drift.py -v --tb=short
+
+      - name: Upload failed notebook outputs
+        if: failure()
+        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7
+        with:
+          name: failed-interop-notebook-outputs
+          path: docs/tutorials/26_composition_drift_calibration.ipynb
+          retention-days: 7
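
The `if:` gate on `interop-notebooks` packs several cases into one expression. As a reading aid, here is a minimal Python sketch of the same boolean logic, assuming GitHub's expression precedence (`&&` binds tighter than `||`). The function and argument names are hypothetical and not part of the workflow, and the sketch uses exact string matching, whereas GitHub's string comparisons are case-insensitive.

```python
def should_run(event_name, labels, action=None, event_label=None):
    """Illustrative mirror of the interop-notebooks `if:` gate.

    event_name  -> github.event_name
    labels      -> github.event.pull_request.labels.*.name
    action      -> github.event.action
    event_label -> github.event.label.name
    """
    if event_name != "pull_request":
        return True  # non-PR events (e.g. the weekly cron) always run
    if "ready-for-ci" not in labels:
        return False  # PRs without the label never run
    if action not in ("labeled", "unlabeled"):
        return True  # opened / synchronize / reopened on a labeled PR
    # labeled/unlabeled events count only when ready-for-ci itself changed
    return event_label == "ready-for-ci"


print(should_run("schedule", []))
print(should_run("pull_request", ["ready-for-ci"], "labeled", "docs"))
```

Read this way, adding an unrelated label to a PR that already carries `ready-for-ci` does not re-trigger the job, while adding `ready-for-ci` itself does.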
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -20,6 +20,37 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [3.6.2] - 2026-07-03
 
 ### Added
+- **balance interop launch: composition-drift tutorial + `interop-notebooks` CI job.** Meta's
+  `balance` package (>= 0.21) ships a one-way adapter `balance.interop.diff_diff`
+  (facebookresearch/balance PR #465) whose `balance[did]` extra pins `diff-diff>=3.3,<4`.
+  `docs/tutorials/26_composition_drift_calibration.ipynb` is the diff-diff-side companion to
+  balance's `balance_diff_diff_brfss` tutorial and tells the failure-mode half of the story. In a
+  BRFSS-style smoking-ban DGP with no systematic arm-specific trends (parallel trends hold in
+  expectation; planted ATT -3.0pp, realized -2.98pp under a rarely-binding probability floor),
+  treatment-correlated non-response drift biases the design-weight Callaway-Sant'Anna ATT to
+  ~-4.1pp with *clean pre-trends*. A per-wave national rake fails (~-4.4pp: margins are met in
+  aggregate while arm-level composition is untouched), whereas per-state raking with balance
+  (BRFSS's own granularity, population-count totals) recovers ~-3.2pp. Also covers the seam both
+  ways (native `SurveyDesign` + `aggregate_survey` vs `bd.to_panel_for_did`/`bd.fit_did`, exact-parity
+  assert), a 3-estimator x 2-weighting sweep, and `as_balance_diagnostic` cross-package diagnostics.
+  `tests/test_t26_composition_drift_calibration_drift.py` re-derives every quoted number
+  (auto-skips without balance). balance stays out of package requirements: the tutorial runs in a
+  new isolated `interop-notebooks` job in `notebooks.yml` (python 3.12, installs
+  `balance>=0.21`, also the drift guard's CI home; the workflow's weekly cron doubles as a
+  cross-package integration smoke against latest PyPI balance), and the main notebooks job env is
+  unchanged.
+- **`balance.interop.diff_diff` contract tests.** `tests/test_balance_interop_contract.py` pins
+  the diff-diff surface Meta's balance adapter consumes, importing no balance code:
+  `aggregate_survey` forwarded-params superset + `(panel, SurveyDesign)` return schema, the
+  `SurveyDesign` 15-field dataclass contract (plus TSL / replicate constructions), estimator and
+  short-alias resolution (`CS`/`DiD`/`BJS`/`HAD`) with `survey_design=` accepted by all 17
+  promised `fit()` signatures, the `_balance_adjustment` setattr provenance side-channel
+  (guards against future `__slots__`), the CallawaySantAnna pweight-only guard, and the
+  `SurveyMetadata.design_effect`/`effective_n`/`sum_weights` attribute names read by
+  `as_balance_diagnostic`. Docs handoff closing the survey-roadmap Phase 8g gap: "Weight
+  calibration with balance" section in `docs/api/prep.rst`, calibration pointers in
+  `llms.txt`/`llms-full.txt`/`llms-practitioner.txt` and `README.md` Survey Support, and
+  Deville & Särndal (1992) + Sarig, Galili & Eilat (2023) in `docs/references.rst`.
 - **`SyntheticControl` ADH-2015 §4 tail diagnostics** (two opt-in `SyntheticControlResults`
   methods, closing the last two ADH-2015 §4 checklist items). `regression_weights()` reports the
   implied donor weights `W^reg = X0a'(X0a X0a')^{-1} X1a` of the regression counterfactual

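The contract-test entry above says the `_balance_adjustment` setattr side-channel is guarded against a future `__slots__`. A minimal sketch of why that guard matters, using hypothetical stand-in classes rather than diff-diff's real results objects: a class that declares `__slots__` has no instance `__dict__`, so attaching an undeclared attribute raises `AttributeError`.

```python
class OpenResults:
    """Hypothetical results object with a normal instance __dict__."""

    def __init__(self, att):
        self.att = att


class SlottedResults:
    """Hypothetical variant that declares __slots__ and so has no __dict__."""

    __slots__ = ("att",)

    def __init__(self, att):
        self.att = att


def attach_provenance(result, sample):
    """Attach the source sample as an external adapter might (assumed pattern)."""
    setattr(result, "_balance_adjustment", sample)
    return result


def accepts_side_channel(result_cls):
    """Contract-style check: can a private attribute be attached after the fact?"""
    try:
        attach_provenance(result_cls(att=-0.03), sample={"name": "raked"})
    except AttributeError:
        return False
    return True


print(accepts_side_channel(OpenResults))     # True
print(accepts_side_channel(SlottedResults))  # False
```

A check of this shape fails as soon as a results class gains `__slots__` without a matching slot, which is the regression the changelog entry describes.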
diff --git a/README.md b/README.md
@@ -137,6 +137,7 @@ Most estimators accept an optional `survey_design` parameter (or `survey=` / `we
 - **Variance methods**: Taylor Series Linearization (TSL via Binder 1983), replicate weights (BRR / Fay / JK1 / JKn / SDR), survey-aware bootstrap
 - **Diagnostics**: DEFF per coefficient, effective n, subpopulation analysis, weight trimming, CV on estimates
 - **Repeated cross-sections**: `CallawaySantAnna(panel=False)` for BRFSS, ACS, CPS
+- **Weight calibration / raking**: upstream by design. Pair with Meta's [balance](https://import-balance.org/) package (>= 0.21), whose `balance.interop.diff_diff` adapter hands raked samples straight to diff-diff; see the [composition-drift tutorial](https://diff-diff.readthedocs.io/en/stable/tutorials/26_composition_drift_calibration.html)
 
 No other Python or R DiD package offers design-based variance estimation for modern heterogeneity-robust estimators.
 

diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt
@@ -1188,6 +1188,22 @@ read-throughs for compatibility with external adapters that
 the canonical names; assume the flat aliases are present on every
 staggered class unless explicitly noted otherwise.
 
+**balance interop.** Meta's `balance` package (>= 0.21) ships the
+one-way adapter `balance.interop.diff_diff` (`pip install "balance[did]"`,
+pins `diff-diff>=3.3,<4`): `to_survey_design(sample)` builds a
+`SurveyDesign` from a balance `Sample`'s active weight column plus the
+convention columns `stratum`/`psu`/`fpc`; `to_panel_for_did(sample, by=,
+outcomes=)` wraps `diff_diff.aggregate_survey` to collapse respondent
+microdata into a unit-period panel plus second-stage design;
+`fit_did(sample, estimator=, ...)` resolves any exported estimator by
+name or short alias (`CS`/`DiD`/`BJS`/`HAD`) and forwards
+`survey_design=`, attaching the source Sample to the result as
+`_balance_adjustment` for provenance; `as_balance_diagnostic(sample,
+res)` joins balance's ASMD/Kish-ESS with `res.survey_metadata`'s
+DEFF/effective-n into one flat dict. The diff-diff surface it consumes
+is pinned by `tests/test_balance_interop_contract.py`; the workflow is
+demonstrated in Tutorial 26 (composition drift & calibration).
+
 ### DiDResults
 
 Returned by `DifferenceInDifferences.fit()` and `TwoWayFixedEffects.fit()`.

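The `as_balance_diagnostic` entry above mentions Kish-ESS next to the design-based DEFF / effective n. For orientation, here is a sketch of the textbook Kish formulas in plain Python. It is not the code of either package, and the design-based DEFF in `survey_metadata` also reflects strata and PSUs, so the two numbers will generally differ.

```python
def kish_effective_sample_size(weights):
    """Kish's approximation: ESS = (sum w)^2 / sum(w^2)."""
    w = [float(x) for x in weights]
    total = sum(w)
    if not w or min(w) < 0 or total == 0:
        raise ValueError("weights must be non-negative with a positive sum")
    return total ** 2 / sum(x * x for x in w)


def kish_design_effect(weights):
    """Kish's weighting design effect: n / ESS."""
    w = list(weights)
    return len(w) / kish_effective_sample_size(w)


equal = [1.0] * 100
skewed = [1.0] * 90 + [10.0] * 10  # ten respondents carry most of the weight

print(round(kish_effective_sample_size(equal), 2))
print(round(kish_effective_sample_size(skewed), 2))
print(round(kish_design_effect(skewed), 2))
```

Equal weights give an ESS equal to the sample size; the more unequal the weights, the smaller the ESS, which is the cost side of any calibration step.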
diff --git a/diff_diff/guides/llms-practitioner.txt b/diff_diff/guides/llms-practitioner.txt
@@ -48,6 +48,11 @@ Key questions to answer:
   units.
 - Is there treatment effect heterogeneity you should preserve rather than
   average over?
+- If the data are a survey: are the weights calibrated (raked) at the
+  granularity of your comparison units? Non-response drift that correlates
+  with treatment timing does NOT difference out of a DiD; calibrate upstream
+  with Meta's balance package first — see Tutorial 26:
+  docs/tutorials/26_composition_drift_calibration.ipynb.
 
 ```python
 # After estimation, the target parameter is available as:

diff --git a/diff_diff/guides/llms.txt b/diff_diff/guides/llms.txt
@@ -101,6 +101,7 @@ Full practitioner guide: call `diff_diff.get_llm_guide("practitioner")`
 - [16 Survey DiD](https://diff-diff.readthedocs.io/en/stable/tutorials/16_survey_did.html): Survey-weighted DiD — SurveyDesign, strata/PSU/FPC, replicate weights, subpopulation analysis, DEFF diagnostics
 - [16 Wooldridge ETWFE](https://diff-diff.readthedocs.io/en/stable/tutorials/16_wooldridge_etwfe.html): Wooldridge (2023, 2025) ETWFE — saturated OLS, logit/Poisson (ASF-based ATT), aggregation types
 - [22 HAD Survey-Weighted Workflow](https://diff-diff.readthedocs.io/en/stable/tutorials/22_had_survey_design.html): HeterogeneousAdoptionDiD + did_had_pretest_workflow under SurveyDesign(strata, psu, weights, fpc) — BRFSS-shape panel, modest SE inflation explanation, Phase 4.5 C0 QUG-deferred verdict
+- [26 Composition Drift & Calibration](https://diff-diff.readthedocs.io/en/stable/tutorials/26_composition_drift_calibration.html): When differential non-response biases the DiD itself — per-state raking with Meta's balance package, `balance.interop.diff_diff` adapter, raking-granularity lesson (requires `pip install "balance>=0.21"`)
 
 ## Survey Support
 
@@ -110,6 +111,7 @@ Most estimators accept an optional `survey_design` parameter (`SyntheticControl`
 - **Variance methods**: Taylor Series Linearization (TSL), replicate weights (BRR/Fay/JK1/JKn/SDR), survey-aware bootstrap
 - **Diagnostics**: DEFF per coefficient, effective n, subpopulation analysis, weight trimming, CV on estimates
 - **Repeated cross-sections**: `CallawaySantAnna(panel=False)` for BRFSS, ACS, CPS
+- **Weight calibration / raking**: upstream by design — pair with Meta's [balance](https://import-balance.org/) package (>= 0.21), whose `balance.interop.diff_diff` adapter (`to_survey_design` / `to_panel_for_did` / `fit_did` / `as_balance_diagnostic`, `pip install "balance[did]"`) hands raked samples straight to diff-diff estimators; see [Tutorial 26](https://diff-diff.readthedocs.io/en/stable/tutorials/26_composition_drift_calibration.html) for when calibration is essential for the causal estimand itself
 - **Compatibility matrix**: [Survey Design Support](https://diff-diff.readthedocs.io/en/stable/choosing_estimator.html#survey-design-support)
 
 No R or Python package offers design-based variance estimation for modern heterogeneity-robust DiD estimators. R's `did`, `fixest`, `synthdid`, and `didimputation` accept flat weight vectors only.

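The "Weight calibration / raking" bullet above treats raking as an upstream step. For readers new to the term, here is a minimal sketch of raking as iterative proportional fitting in plain Python. It illustrates the general technique only and is not balance's implementation or API. The record layout, the `rake` helper, and the target totals are hypothetical, and the sketch assumes every level named in `targets` appears in the data.

```python
def rake(rows, targets, weight_key="w", max_iter=100, tol=1e-10):
    """Iterative proportional fitting over a list of dict records.

    rows    : list of dicts with categorical fields and a base weight
    targets : {variable: {level: target weighted total}}
    Returns a new list of weights aligned with rows.
    """
    weights = [float(r[weight_key]) for r in rows]
    for _ in range(max_iter):
        max_shift = 0.0
        for var, margin in targets.items():
            # Current weighted total for each level of this variable.
            totals = {level: 0.0 for level in margin}
            for r, w in zip(rows, weights):
                totals[r[var]] += w
            # Rescale each record so this variable's margin hits its target.
            for i, r in enumerate(rows):
                factor = margin[r[var]] / totals[r[var]]
                weights[i] *= factor
                max_shift = max(max_shift, abs(factor - 1.0))
        if max_shift < tol:
            break  # every adjustment factor is ~1: all margins are matched
    return weights


def weighted_margin(rows, weights, var):
    """Weighted total per level of one variable."""
    out = {}
    for r, w in zip(rows, weights):
        out[r[var]] = out.get(r[var], 0.0) + w
    return out


rows = [
    {"age": "young", "sex": "f", "w": 1.0},
    {"age": "young", "sex": "m", "w": 1.0},
    {"age": "old", "sex": "f", "w": 1.0},
    {"age": "old", "sex": "m", "w": 1.0},
    {"age": "old", "sex": "m", "w": 1.0},
]
targets = {
    "age": {"young": 50.0, "old": 50.0},
    "sex": {"f": 52.0, "m": 48.0},
}

raked = rake(rows, targets)
print(weighted_margin(rows, raked, "age"))
print(weighted_margin(rows, raked, "sex"))
```

The raked weights would then enter diff-diff as an ordinary weight column, in line with the "pre-calibrated weights" expectation stated for `SurveyDesign`.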
diff --git a/docs/api/prep.rst b/docs/api/prep.rst
@@ -366,6 +366,41 @@ Example
    #     treatment="treated", time="post", survey_design=stage2,
    # )
 
+Weight calibration with balance
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``SurveyDesign`` expects **pre-calibrated** weights: post-stratification,
+raking, and calibration are deliberately out of scope for diff-diff and
+remain upstream. Meta's `balance <https://import-balance.org/>`_ package
+(>= 0.21) is the recommended companion - it rakes survey samples to
+population margins and ships a dedicated adapter,
+``balance.interop.diff_diff``, that hands the calibrated sample straight
+to diff-diff (``to_survey_design`` / ``to_panel_for_did`` / ``fit_did`` /
+``as_balance_diagnostic``; installable via ``pip install "balance[did]"``).
+
+If you prefer the native seam, the handoff needs no adapter: calibrated
+weights are just a column of the microdata::
+
+   design = SurveyDesign(weights="raked_wt", strata="strat", psu="psu")
+   panel, stage2 = aggregate_survey(
+       microdata, by=["state", "year"], outcomes="smoking_rate",
+       survey_design=design,
+   )
+   result = CallawaySantAnna().fit(
+       panel, outcome="smoking_rate_mean", unit="state", time="year",
+       first_treat="g", survey_design=stage2,
+   )
+
+**When calibration matters for the causal estimand** (not just
+descriptives): non-response drift that is differential by treatment arm
+and time does *not* difference out of a DiD. See the
+:doc:`composition-drift tutorial <../tutorials/26_composition_drift_calibration>`
+for a worked BRFSS-style failure mode - including why raking granularity
+must match the comparison units (state-level raking, as BRFSS itself
+does, not a pooled national rake) - and the companion
+`balance tutorial <https://github.com/facebookresearch/balance/blob/main/tutorials/balance_diff_diff_brfss.ipynb>`_
+for the robust case (common drift) and descriptive-estimand repair.
+
 Data Validation
 ---------------
 

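The "When calibration matters for the causal estimand" paragraph above can be made concrete with a toy 2x2 calculation. This is a sketch with made-up numbers, not the tutorial's DGP. It assumes two age strata with fixed smoking rates, a planted effect of -3pp, equal-sized arms, and respondent composition that drifts only in the treated arm after treatment. With a single stratifying variable, raking reduces to post-stratification, which is what the two weighting helpers do.

```python
# Smoking rate by age stratum: identical in both arms and constant over time.
BASE_RATE = {"young": 0.30, "old": 0.10}
TRUE_ATT = -0.03                        # planted effect (treated arm, post period)
POP_SHARE = {"young": 0.5, "old": 0.5}  # true population composition everywhere

# Respondent composition per (arm, period); only treated/post drifts.
SAMPLE_SHARE = {
    ("treated", "pre"): {"young": 0.5, "old": 0.5},
    ("treated", "post"): {"young": 0.4, "old": 0.6},
    ("control", "pre"): {"young": 0.5, "old": 0.5},
    ("control", "post"): {"young": 0.5, "old": 0.5},
}


def cell_mean(arm, period, weights):
    """Weighted respondent mean of the outcome in one (arm, period) cell."""
    effect = TRUE_ATT if (arm, period) == ("treated", "post") else 0.0
    shares = SAMPLE_SHARE[(arm, period)]
    num = sum(shares[s] * weights[s] * (BASE_RATE[s] + effect) for s in shares)
    den = sum(shares[s] * weights[s] for s in shares)
    return num / den


def did(weights_for):
    """2x2 DiD; weights_for(arm, period) returns per-stratum weights."""
    m = {c: cell_mean(c[0], c[1], weights_for(c[0], c[1])) for c in SAMPLE_SHARE}
    treated_change = m[("treated", "post")] - m[("treated", "pre")]
    control_change = m[("control", "post")] - m[("control", "pre")]
    return treated_change - control_change


def no_weights(arm, period):
    return {s: 1.0 for s in POP_SHARE}


def pooled_rake(arm, period):
    """Post-stratify on the pooled sample (both arms together) per period."""
    pooled = {
        s: 0.5 * (SAMPLE_SHARE[("treated", period)][s]
                  + SAMPLE_SHARE[("control", period)][s])
        for s in POP_SHARE
    }
    return {s: POP_SHARE[s] / pooled[s] for s in POP_SHARE}


def per_arm_rake(arm, period):
    """Post-stratify within each arm, matching the comparison units."""
    return {s: POP_SHARE[s] / SAMPLE_SHARE[(arm, period)][s] for s in POP_SHARE}


for name, scheme in [("unweighted", no_weights),
                     ("pooled rake", pooled_rake),
                     ("per-arm rake", per_arm_rake)]:
    print(f"{name:>12}: {did(scheme):+.4f}")
```

In this toy setup the unweighted DiD is off by about 2pp even though pre-period means are identical across arms. The pooled rake leaves that bias essentially unchanged, because it fixes the aggregate margin while the treated arm's composition stays distorted. Only the per-arm weights recover the planted -3pp.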
diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml
@@ -846,6 +846,9 @@ sources:
         type: roadmap
       - path: docs/tutorials/16_survey_did.ipynb
         type: tutorial
+      - path: docs/tutorials/26_composition_drift_calibration.ipynb
+        type: tutorial
+        note: "balance interop: calibration handoff + composition-drift failure mode"
       - path: README.md
         section: "Survey Support"
         type: user_guide
@@ -953,6 +956,9 @@ sources:
     docs:
       - path: docs/api/prep.rst
         type: api_reference
+      - path: docs/tutorials/26_composition_drift_calibration.ipynb
+        type: tutorial
+        note: "aggregate_survey is the seam balance.interop.diff_diff wraps"
       - path: docs/practitioner_getting_started.rst
         type: user_guide
       - path: docs/practitioner_decision_tree.rst

diff --git a/docs/index.rst b/docs/index.rst
@@ -84,6 +84,7 @@ Quick Links
    tutorials/21_had_pretest_workflow
    tutorials/22_had_survey_design
    tutorials/23_spillover_tva
+   tutorials/26_composition_drift_calibration
 
 .. toctree::
    :maxdepth: 1

diff --git a/docs/references.rst b/docs/references.rst
@@ -106,6 +106,14 @@ Survey-Design Inference (Taylor-Series Linearization)
 
   The "when to weight" framework distinguishing precision, endogenous-sampling, and population-effect motivations for survey weights; cited in REGISTRY.md ``## Survey Data Support`` -> "Weighted Estimation".
 
+- **Deville, J.-C. & Särndal, C.-E. (1992).** "Calibration Estimators in Survey Sampling." *Journal of the American Statistical Association*, 87(418), 376-382. https://doi.org/10.1080/01621459.1992.10475217
+
+  The calibration/raking framework underlying post-stratified survey weights. diff-diff deliberately keeps calibration upstream (``SurveyDesign`` expects pre-calibrated weights); the ``docs/api/prep.rst`` "Weight calibration with balance" section and Tutorial 26 document the handoff.
+
+- **Sarig, T., Galili, T. & Eilat, R. (2023).** "balance - a Python package for balancing biased data samples." *arXiv:2307.06024* (stat.CO). https://arxiv.org/abs/2307.06024
+
+  Meta's ``balance`` package, the recommended upstream calibration companion. Its ``balance.interop.diff_diff`` adapter (balance >= 0.21) hands raked samples to diff-diff's survey-aware estimators; Tutorial 26 (``docs/tutorials/26_composition_drift_calibration.ipynb``) demonstrates the workflow, and ``tests/test_balance_interop_contract.py`` pins the consumed surface.
+
 Placebo Tests and DiD Diagnostics
 ---------------------------------
 

diff --git a/docs/survey-roadmap.md b/docs/survey-roadmap.md
@@ -109,13 +109,17 @@ Files: `benchmarks/R/benchmark_realdata_*.R`, `tests/test_survey_real_data.py`,
 
 - **Multi-stage design**: not yet documented. Single-stage (strata + PSU)
   is sufficient per Lumley (2004) Section 2.2.
-- **Post-stratification / calibration**: not yet documented. `SurveyDesign`
-  expects pre-calibrated weights. `samplics` is the most complete Python
-  option (post-stratification, raking, GREG) but is in read-only mode —
-  active development has moved to `svy`, which is not yet publicly
-  released. `weightipy` is actively maintained for raking. Weight
-  calibration is out of scope for diff-diff today, though building this
-  capability is a future possibility.
+- **Post-stratification / calibration**: DOCUMENTED (2026-07). `SurveyDesign`
+  expects pre-calibrated weights; calibration stays upstream by design. The
+  recommended companion is Meta's `balance` package (>= 0.21), which ships a
+  dedicated `balance.interop.diff_diff` adapter (`pip install "balance[did]"`).
+  The handoff is documented in `docs/api/prep.rst` ("Weight calibration with
+  balance"), demonstrated end-to-end in
+  `docs/tutorials/26_composition_drift_calibration.ipynb` (including when
+  calibration is essential for the causal estimand, not just descriptives),
+  and the consumed diff-diff surface is pinned by
+  `tests/test_balance_interop_contract.py`. `samplics` (read-only; successor
+  `svy` not yet released) and `weightipy` remain alternatives.
 
 ### Phase 10: Survey Completeness (v2.9.0–v3.0)