Skip to content

Fix TWFE within-transformation bug and add methodology review#139

Merged
igerber merged 5 commits intomainfrom
twfe-method-review
Feb 9, 2026
Merged

Fix TWFE within-transformation bug and add methodology review#139
igerber merged 5 commits intomainfrom
twfe-method-review

Conversation

@igerber
Copy link
Owner

@igerber igerber commented Feb 8, 2026

Summary

  • Bug fix: The TwoWayFixedEffects interaction term D_i × Post_t was not being within-transformed before regression, producing incorrect ATT (~1/3 of true value) for multi-period panels with binary post indicator. Fixed by applying within-transformation to the interaction term alongside outcome and covariates, matching R's fixest::feols() behavior.
  • Add comprehensive TWFE methodology review test suite (tests/test_methodology_twfe.py) with 33 tests across 6 phases
  • Add dedicated TWFE benchmark scripts (Python + R) and benchmark runner integration
  • Update METHODOLOGY_REVIEW.md: TWFE status from "Not Started" to "Complete"
  • Update docs/methodology/REGISTRY.md: check requirements boxes, add FWL theorem note
  • Fix test_twfe_with_absorbed_covariate to use correct TWFE specification after the bug fix

Methodology references (required if estimator / math changes)

  • Method name(s): Two-Way Fixed Effects (TWFE) with within-transformation
  • Paper / source link(s): Wooldridge (2010), Econometric Analysis of Cross Section and Panel Data, 2nd ed., Ch. 10
  • Any intentional deviations from the source (and why): None — the fix aligns Python with R's fixest::feols() behavior. By the Frisch-Waugh-Lovell theorem, all regressors must be projected out of the fixed effects space alongside the dependent variable.

Validation

  • Tests added/updated:
    • tests/test_methodology_twfe.py (33 new tests)
    • tests/test_estimators.py (updated test_twfe_with_absorbed_covariate)
  • R comparison: ATT matches fixest::feols(y ~ treated:post | unit + post, cluster=~unit) within 0.1%, SE within 1%
  • All 335 core tests pass (0 failures), full suite 1128 passed

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

The interaction term D_i × Post_t was not being within-transformed before
regression, causing incorrect ATT estimates (~1/3 of true value) for
multi-period panels with binary post indicator. Fixed by applying the same
within-transformation to the interaction term, matching R's fixest::feols().

Also adds comprehensive methodology review: 33 tests across 6 phases
(algebra verification, R comparison, edge cases, SE verification, wild
bootstrap, params/results), TWFE benchmarks, and documentation updates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link

github-actions bot commented Feb 8, 2026

Overall Assessment
⚠️ Needs changes

Executive Summary

  • TwoWayFixedEffects estimator change aligns with the Methodology Registry’s within-transformation requirement (FWL); no methodological mismatch found.
  • Two test validity gaps could give false confidence in inference parity (HC1 comparison uses different FE; R p-value uses non-fixest df).
  • Minor performance overhead from a second within-transformation pass per fit.

Methodology

  • No issues found.

Code Quality

  • No issues found.

Performance

  • P3 — Impact: TWFE now does two full within-transform passes (two groupbys), adding avoidable overhead on large panels. Fix: include _treatment_post in the same within-transform call (or extend _within_transform to accept extra variables) so outcome/covariates/interaction are demeaned in one pass. (diff_diff/twfe.py:L99-L109)

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • P2 — Impact: HC1 comparison claims “same demeaned regression,” but it demeans by period while TWFE demeans by post, so the comparison is apples-to-oranges and can mask SE regressions. Fix: demean with time="post" and use n_times = data["post"].nunique() to mirror TWFE. (tests/test_methodology_twfe.py:L752-L777)
  • P2 — Impact: R comparison p-value is computed with df = nobs - k instead of fixest’s own pvalue() (or Python’s df-adjustment), so the p-value parity test can drift across versions/cluster settings. Fix: use pvalue(result)[att_idx] or coeftable(result) from fixest, or explicitly compute with the same df logic as Python if that’s the intended comparator. (tests/test_methodology_twfe.py:L317-L356)

…ixest coeftable

- Merge two within-transform passes into one call (P3 perf fix)
- Fix HC1 SE test to demean by unit+post (not period) matching TWFE
- Rewrite SE test as cluster-vs-HC1 difference check with manual cluster verification
- Use fixest coeftable() for R p-value instead of manual pt() computation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Feb 8, 2026

/ai-review

@github-actions
Copy link

github-actions bot commented Feb 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 8782073ae473ee02a85b1d5949d389703d266eb5


Overall assessment: ⚠️ Needs changes

Executive summary

  • The TWFE interaction demeaning fix aligns with FWL and the registry update.
  • The staggered‑treatment warning is now marked as “verified,” but it only triggers when time is multi‑period—contradicting the documented requirement that time be binary post.
  • The new staggered‑warning test uses an unsupported time="period" configuration, so it doesn’t validate the supported API behavior.

Methodology

  • P1 — Missing assumption check / inconsistent warning behavior
    Impact: Users who follow the documented requirement to pass a binary time (post indicator) will not get a staggered‑treatment warning, even if adoption is staggered. This undermines the registry checklist claiming the warning is satisfied and can lead to unflagged biased TWFE estimates.
    Concrete fix: Either (a) enforce time as binary and add a separate period/first_treat parameter (or accept time‑varying treatment=D_it) for staggered detection, or (b) update the registry checklist and warning text to clarify it only applies when time is the actual period column.
    Refs: diff_diff/twfe.py:L285-L317, docs/methodology/REGISTRY.md:L234-L260

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • P2 — Test covers unsupported configuration
    Impact: test_staggered_treatment_warning validates the warning using time="period" even though the registry now states only binary time is supported. This gives false confidence about the warning behavior in the supported API.
    Concrete fix: Add a validation test that either (a) enforces binary time and asserts a clear error/warning, or (b) explicitly documents and tests a supported multi‑period path if you decide to support it.
    Refs: tests/test_methodology_twfe.py:L486-L517, docs/methodology/REGISTRY.md:L249-L258

The staggered treatment warning in _check_staggered_treatment only fires
when `time` has >2 unique values. With binary `time="post"`, all treated
units appear at time=1, making staggering undetectable. This updates docs,
tests, and the registry to honestly reflect this limitation rather than
claiming it as verified.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Feb 8, 2026

/ai-review

@github-actions
Copy link

github-actions bot commented Feb 8, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 7f140cf94b120e35e5d8cd01c96ef3f42e79b93b


Overall assessment: ⚠️ Needs changes

Executive summary:

  • P1: time is still accepted as multi‑period without a guard; after the fix it now multiplies raw time, which can silently estimate treated × period instead of treated × post when users pass actual periods.
  • The within‑transformation fix aligns TWFE with FWL and the updated Registry; tests add strong coverage and R parity checks.
  • P3: one new edge‑case test claims FE collinearity but actually checks covariate collinearity, leaving the FE‑collinearity case untested.

Methodology

  • Severity: P1 — Impact: TWFE assumes a binary post indicator, but fit() still accepts multi‑period time; with the new raw interaction, users who pass actual periods will estimate treated × period (not D_i × Post_t), violating the documented identification and producing biased ATT. — Fix: validate time is binary (or add a separate post parameter); otherwise raise a clear ValueError or emit a warning that multi‑period time is unsupported. (diff_diff/twfe.py:99-109)

Code Quality

  • No issues found.

Performance

  • No issues found.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • Severity: P3 — Impact: test_treatment_collinear_with_fe_raises_error claims to test FE collinearity but instead creates a duplicate covariate (bad_cov) and only exercises rank‑deficient covariate handling; the stated FE‑collinearity case remains unverified. — Fix: either change the data to induce true FE collinearity without the duplicate covariate, or rename/adjust the test/docstring to reflect covariate collinearity. (tests/test_methodology_twfe.py:630-657)

- Add UserWarning in TWFE fit() when time column has >2 unique values,
  advising users to create a binary post indicator
- Add test_multiperiod_time_warning and test_binary_time_no_multiperiod_warning
- Update staggered test to also assert multi-period warning fires
- Rename test_treatment_collinear_with_fe_raises_error to
  test_covariate_collinear_with_interaction_raises_error (matches actual behavior)
- Update REGISTRY.md and METHODOLOGY_REVIEW.md to reflect changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Feb 9, 2026

/ai-review

@github-actions
Copy link

github-actions bot commented Feb 9, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 30068582ed36fb7129639fd2b4d298700049904d


Overall Assessment

  • ⚠️ Needs changes

Executive Summary

  • TWFE interaction is now correctly within-transformed and aligns with FWL/registry.
  • Missing guard for non‑0/1 binary time can silently scale ATT in common 2‑period datasets coded as years.
  • New R‑based methodology tests are comprehensive but may slow CI when R is available.

Methodology

  • Severity: P1 | Impact: time with exactly two unique values but not {0,1} (e.g., {2020,2021}) bypasses the warning and yields a scaled D×Post, violating the registry’s binary 0/1 assumption and biasing ATT. | Fix: validate time values are {0,1}/{False,True}; otherwise warn+normalize or raise. diff_diff/twfe.py:L96-L117, docs/methodology/REGISTRY.md:L249-L251

Code Quality

  • No issues found in modified code paths.

Performance

  • No production runtime issues identified.

Maintainability

  • No issues found.

Tech Debt

  • No issues found.

Security

  • No issues found.

Documentation/Tests

  • Severity: P3 | Impact: R comparison tests spawn Rscript per test, which can add minutes and intermittent failures in CI environments with R installed. | Fix: cache R results in a session-scoped fixture or consolidate into a single parametrized test run; optionally mark as slow. tests/test_methodology_twfe.py:L317-L371
  • Tests not run (review only).

…fixtures

- Add elif block in TWFE.fit() warning when time has 2 unique values
  not in {0,1} or {False,True} (ATT is correct, warning is advisory)
- Add 3 tests: non-binary time warning, boolean time no-warning, and
  ATT/SE/p-value invariance to time encoding ({0,1} vs {2020,2021})
- Extract _run_r_feols_twfe to module level and cache R results in
  session-scoped fixtures, reducing Rscript invocations from 6 to 2
- Update REGISTRY.md and METHODOLOGY_REVIEW.md edge case documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@igerber
Copy link
Owner Author

igerber commented Feb 9, 2026

/ai-review

@github-actions
Copy link

github-actions bot commented Feb 9, 2026

🔁 AI review rerun (requested by @igerber)

Head SHA: 98841406ada1efabaec4c06dadec13a6c6739919


Overall assessment: ✅ Looks good

Executive summary

  • TwoWayFixedEffects (TWFE) now demeans the treatment×post interaction alongside outcome/covariates, matching FWL and the Methodology Registry. (diff_diff/twfe.py:L124-L143, docs/methodology/REGISTRY.md:L224-L237)
  • New time-value warnings and staggered-treatment limitation docs/tests align with registry expectations; no P0/P1 methodology mismatches found. (diff_diff/twfe.py:L96-L119, docs/methodology/REGISTRY.md:L249-L257)
  • Comprehensive TWFE methodology tests and benchmarks added; no estimator-logic regressions observed in the diff.

Methodology

  • No findings.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • P3 | Impact: R/fixest parity tests are skipped when R/fixest aren’t available, so CI without R won’t execute the new methodology checks. | Fix: add/ensure a CI job with R + fixest (or enforce DIFF_DIFF_R not set to skip) so tests/test_methodology_twfe.py runs end-to-end. (tests/test_methodology_twfe.py:L38-L70)

@igerber igerber merged commit d82e0fb into main Feb 9, 2026
7 checks passed
@igerber igerber deleted the twfe-method-review branch February 9, 2026 01:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant