igerber · igerber · Feb 9, 2026 · Feb 8, 2026 · Feb 8, 2026 · Feb 8, 2026
diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md
@@ -153,14 +153,69 @@ Each estimator in diff-diff should be periodically reviewed to ensure:
 | Module | `twfe.py` |
 | Primary Reference | Wooldridge (2010), Ch. 10 |
 | R Reference | `fixest::feols()` |
-| Status | Not Started |
-| Last Review | - |
+| Status | **Complete** |
+| Last Review | 2026-02-08 |
+
+**Verified Components:**
+- [x] Within-transformation algebra: `y_it - ȳ_i - ȳ_t + ȳ` matches hand calculation (rtol=1e-12)
+- [x] ATT matches manual demeaned OLS (rtol=1e-10)
+- [x] ATT matches `DifferenceInDifferences` on 2-period data (rtol=1e-10)
+- [x] Covariates are also within-transformed (sum to zero within unit/time groups)
+- [x] R comparison: ATT matches `fixest::feols(y ~ treated:post | unit + post, cluster=~unit)` (rtol<0.1%)
+- [x] R comparison: Cluster-robust SE match (rtol<1%)
+- [x] R comparison: P-value match (atol<0.01)
+- [x] R comparison: CI bounds match (rtol<1%)
+- [x] R comparison: ATT and SE match with covariate (same tolerances)
+- [x] Edge case: Staggered treatment triggers `UserWarning`
+- [x] Edge case: Auto-clusters at unit level (SE matches explicit `cluster="unit"`)
+- [x] Edge case: DF adjustment for absorbed FE matches manual `solve_ols()` with `df_adjustment`
+- [x] Edge case: Covariate collinear with interaction raises `ValueError` ("cannot be identified")
+- [x] Edge case: Covariate collinearity warns but ATT remains finite
+- [x] Edge case: `rank_deficient_action="error"` raises `ValueError`
+- [x] Edge case: `rank_deficient_action="silent"` emits no warnings
+- [x] Edge case: Unbalanced panel produces valid results (finite ATT, positive SE)
+- [x] Edge case: Missing unit column raises `ValueError`
+- [x] Integration: `decompose()` returns `BaconDecompositionResults`
+- [x] SE: Cluster-robust SE >= HC1 SE
+- [x] SE: VCoV positive semi-definite
+- [x] Wild bootstrap: Valid inference (finite SE, p-value in [0,1])
+- [x] Wild bootstrap: All weight types (rademacher, mammen, webb) produce valid inference
+- [x] Wild bootstrap: `inference="wild_bootstrap"` routes correctly
+- [x] Params: `get_params()` returns all inherited parameters
+- [x] Params: `set_params()` modifies attributes
+- [x] Results: `summary()` contains "ATT"
+- [x] Results: `to_dict()` contains att, se, t_stat, p_value, n_obs
+- [x] Results: residuals + fitted = demeaned outcome (not raw)
+- [x] Edge case: Multi-period time emits UserWarning advising binary post indicator
+- [x] Edge case: Non-{0,1} binary time emits UserWarning (ATT still correct)
+- [x] Edge case: ATT invariant to time encoding ({0,1} vs {2020,2021} produces identical results)
+
+**Key Implementation Detail:**
+The interaction term `D_i × Post_t` must be within-transformed (demeaned) alongside the outcome,
+consistent with the Frisch-Waugh-Lovell (FWL) theorem: all regressors and the outcome must be
+projected out of the fixed effects space. R's `fixest::feols()` does this automatically when
+variables appear to the left of the `|` separator.
 
 **Corrections Made:**
-- (None yet)
+- **Bug fix: interaction term must be within-transformed** (found during review). The previous
+  implementation used raw (un-demeaned) `D_i × Post_t` in the demeaned regression. This gave
+  correct results only for 2-period panels where `post == period`. For multi-period panels
+  (e.g., 4 periods with binary `post`), the raw interaction had incorrect correlation with
+  demeaned Y, producing ATT approximately 1/3 of the true value. Fixed by applying the same
+  within-transformation to the interaction term before regression. This matches R's
+  `fixest::feols()` behavior. (`twfe.py` lines 99-113)
 
 **Outstanding Concerns:**
-- (None yet)
+- **Multi-period `time` parameter**: Multi-period time values (e.g., 1,2,3,4) produce
+  `treated × period_number` instead of `treated × post_indicator`, which is not the standard
+  D_it treatment indicator. A `UserWarning` is emitted when `time` has >2 unique values.
+  For binary time with non-{0,1} values (e.g., {2020, 2021}), the ATT is mathematically
+  correct (the within-transformation absorbs the scaling), but a warning recommends 0/1
+  encoding for clarity. Users with multi-period data should create a binary `post` column.
+- **Staggered treatment warning**: The warning only fires when `time` has >2 unique values
+  (i.e., actual period numbers). With binary `time="post"`, all treated units appear to start
+  treatment at `time=1`, making staggering undetectable. Users with staggered designs should
+  use `decompose()` or `CallawaySantAnna` directly for proper diagnostics.
 
 ---
 

diff --git a/benchmarks/R/benchmark_twfe.R b/benchmarks/R/benchmark_twfe.R
@@ -0,0 +1,132 @@
+#!/usr/bin/env Rscript
+# Benchmark: Two-Way Fixed Effects (R `fixest` package with absorbed FE)
+#
+# This uses fixest::feols() with absorbed unit + post FE and unit-level clustering,
+# matching the Python TwoWayFixedEffects estimator's approach.
+#
+# Usage:
+#   Rscript benchmark_twfe.R --data path/to/data.csv --output path/to/results.json
+
+library(fixest)
+library(jsonlite)
+library(data.table)
+
+# Parse command line arguments
+args <- commandArgs(trailingOnly = TRUE)
+
+parse_args <- function(args) {
+  result <- list(
+    data = NULL,
+    output = NULL
+  )
+
+  i <- 1
+  while (i <= length(args)) {
+    if (args[i] == "--data") {
+      result$data <- args[i + 1]
+      i <- i + 2
+    } else if (args[i] == "--output") {
+      result$output <- args[i + 1]
+      i <- i + 2
+    } else {
+      i <- i + 1
+    }
+  }
+
+  if (is.null(result$data) || is.null(result$output)) {
+    stop("Usage: Rscript benchmark_twfe.R --data <path> --output <path>")
+  }
+
+  return(result)
+}
+
+config <- parse_args(args)
+
+# Load data
+message(sprintf("Loading data from: %s", config$data))
+data <- fread(config$data)
+
+# Run benchmark
+message("Running TWFE estimation with absorbed FE...")
+start_time <- Sys.time()
+
+# TWFE with absorbed unit + post fixed effects, clustered at unit level
+# This matches Python's TwoWayFixedEffects:
+#   - Within-transformation removes unit and time (post) FE
+#   - Cluster-robust SE at unit level (automatic)
+model <- feols(
+  outcome ~ treated:post | unit + post,
+  data = data,
+  cluster = ~unit
+)
+
+estimation_time <- as.numeric(difftime(Sys.time(), start_time, units = "secs"))
+
+# Extract results
+coef_name <- "treated:post"
+coefs <- coef(model)
+ses <- se(model)
+pvals <- pvalue(model)
+ci <- confint(model)
+
+# Find the treatment effect coefficient
+if (coef_name %in% names(coefs)) {
+  att <- coefs[coef_name]
+  att_se <- ses[coef_name]
+  att_pval <- pvals[coef_name]
+  att_ci <- ci[coef_name, ]
+} else {
+  # Try alternative name formats
+  idx <- grep("treated.*post|post.*treated", names(coefs))
+  if (length(idx) > 0) {
+    att <- coefs[idx[1]]
+    att_se <- ses[idx[1]]
+    att_pval <- pvals[idx[1]]
+    att_ci <- ci[idx[1], ]
+    coef_name <- names(coefs)[idx[1]]
+  } else {
+    stop("Could not find treatment effect coefficient")
+  }
+}
+
+# Format output
+results <- list(
+  estimator = "fixest::feols (absorbed FE)",
+  cluster = "unit",
+
+  # Treatment effect
+  att = unname(att),
+  se = unname(att_se),
+  pvalue = unname(att_pval),
+  ci_lower = unname(att_ci[1]),
+  ci_upper = unname(att_ci[2]),
+  coef_name = coef_name,
+
+  # Model statistics
+  model_stats = list(
+    r_squared = summary(model)$r2,
+    adj_r_squared = summary(model)$adj.r2,
+    n_obs = model$nobs
+  ),
+
+  # Timing
+  timing = list(
+    estimation_seconds = estimation_time,
+    total_seconds = estimation_time
+  ),
+
+  # Metadata
+  metadata = list(
+    r_version = R.version.string,
+    fixest_version = as.character(packageVersion("fixest")),
+    n_units = length(unique(data$unit)),
+    n_periods = length(unique(data$post)),
+    n_obs = nrow(data)
+  )
+)
+
+# Write output
+message(sprintf("Writing results to: %s", config$output))
+write_json(results, config$output, auto_unbox = TRUE, pretty = TRUE, digits = 15)
+
+message(sprintf("Completed in %.3f seconds", estimation_time))
diff --git a/benchmarks/python/benchmark_twfe.py b/benchmarks/python/benchmark_twfe.py
@@ -0,0 +1,129 @@
+#!/usr/bin/env python3
+"""
+Benchmark: TwoWayFixedEffects (diff-diff TwoWayFixedEffects class).
+
+This benchmarks the actual TwoWayFixedEffects class with within-transformation,
+as opposed to benchmark_basic.py which uses DifferenceInDifferences with formula.
+
+Usage:
+    python benchmark_twfe.py --data path/to/data.csv --output path/to/results.json
+"""
+
+import argparse
+import json
+import os
+import sys
+from pathlib import Path
+
+# IMPORTANT: Parse --backend and set environment variable BEFORE importing diff_diff
+def _get_backend_from_args():
+    """Parse --backend argument without importing diff_diff."""
+    parser = argparse.ArgumentParser(add_help=False)
+    parser.add_argument("--backend", default="auto", choices=["auto", "python", "rust"])
+    args, _ = parser.parse_known_args()
+    return args.backend
+
+_requested_backend = _get_backend_from_args()
+if _requested_backend in ("python", "rust"):
+    os.environ["DIFF_DIFF_BACKEND"] = _requested_backend
+
+# NOW import diff_diff and other dependencies (will see the env var)
+import pandas as pd
+
+# Add parent to path for imports
+sys.path.insert(0, str(Path(__file__).parent.parent.parent))
+
+from diff_diff import TwoWayFixedEffects, HAS_RUST_BACKEND
+from benchmarks.python.utils import Timer
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="Benchmark TwoWayFixedEffects estimator")
+    parser.add_argument("--data", required=True, help="Path to input CSV data")
+    parser.add_argument("--output", required=True, help="Path to output JSON results")
+    parser.add_argument(
+        "--backend", default="auto", choices=["auto", "python", "rust"],
+        help="Backend to use: auto (default), python (pure Python), rust (Rust backend)"
+    )
+    return parser.parse_args()
+
+
+def get_actual_backend() -> str:
+    """Return the actual backend being used based on HAS_RUST_BACKEND."""
+    return "rust" if HAS_RUST_BACKEND else "python"
+
+
+def main():
+    args = parse_args()
+
+    actual_backend = get_actual_backend()
+    print(f"Using backend: {actual_backend}")
+
+    # Load data
+    print(f"Loading data from: {args.data}")
+    data = pd.read_csv(args.data)
+
+    # Run benchmark using TwoWayFixedEffects (within-transformation approach)
+    print("Running TWFE estimation...")
+
+    twfe = TwoWayFixedEffects(robust=True)  # auto-clusters at unit level
+
+    with Timer() as timer:
+        results = twfe.fit(
+            data,
+            outcome="outcome",
+            treatment="treated",
+            time="post",
+            unit="unit",
+        )
+
+    att = results.att
+    se = results.se
+    pvalue = results.p_value
+    ci = results.conf_int
+
+    total_time = timer.elapsed
+
+    # Build output
+    output = {
+        "estimator": "diff_diff.TwoWayFixedEffects",
+        "backend": actual_backend,
+        "cluster": "unit",
+        # Treatment effect
+        "att": float(att),
+        "se": float(se),
+        "pvalue": float(pvalue),
+        "ci_lower": float(ci[0]),
+        "ci_upper": float(ci[1]),
+        # Model statistics
+        "model_stats": {
+            "n_obs": len(data),
+            "n_units": len(data["unit"].unique()),
+            "n_periods": len(data["post"].unique()),
+        },
+        # Timing
+        "timing": {
+            "estimation_seconds": total_time,
+            "total_seconds": total_time,
+        },
+        # Metadata
+        "metadata": {
+            "n_units": len(data["unit"].unique()),
+            "n_periods": len(data["post"].unique()),
+            "n_obs": len(data),
+        },
+    }
+
+    # Write output
+    print(f"Writing results to: {args.output}")
+    output_path = Path(args.output)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_path, "w") as f:
+        json.dump(output, f, indent=2)
+
+    print(f"Completed in {total_time:.3f} seconds")
+    return output
+
+
+if __name__ == "__main__":
+    main()