---
title: "Fuzzy difference-in-differences with Rfuzzydid"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Fuzzy difference-in-differences with Rfuzzydid}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r srr-methods-standards, include = FALSE, eval = FALSE}
#' @srrstats {G1.3} This vignette defines the package terminology used in the
#' fuzzy DID estimators.
#' @srrstats {RE1.4} This vignette documents the identifying assumptions and
#' consequences of violations for the regression-style formula interface.
#' @noRd
NULL
```

# Purpose

This article is a practical guide to the estimators implemented in
`Rfuzzydid`. It follows de Chaisemartin and D'Haultfoeuille (2018) and the
Stata companion by de Chaisemartin, D'Haultfoeuille, and Guyonvarch (2019).
Those papers are the primary references for the formal identification results;
this vignette keeps the focus on what to run and how to read the output.

A fuzzy DID design has the same comparison structure as a standard DID design,
but the treatment group does not necessarily move from untreated to fully
treated. Instead, the treatment rate increases more in one group than in the
comparison group. The first stage is the DID in treatment:

\[
\Delta_D =
E[D\mid G=1,T=1] - E[D\mid G=1,T=0]
- E[D\mid G=0,T=1] + E[D\mid G=0,T=0].
\]

The estimators in `Rfuzzydid` scale outcome changes by this first stage. In the
binary-treatment, two-group, two-period case, the target is a local average
treatment effect for switchers: units whose treatment status changes because
their group-period cell receives the stronger treatment-intensity change.

# Choosing an estimator

`fuzzydid()` exposes the main estimators from the papers through logical
options. You can request more than one estimator in the same call.

| Option | Estimator | Use when |
|---|---|---|
| `did = TRUE` | Wald-DID | You want the familiar DID Wald ratio as a benchmark, and the stronger treatment-effect stability restrictions are credible. |
| `tc = TRUE` | Wald-TC | You want an alternative LATE estimator that uses common-trend restrictions within baseline-treatment subgroups and does not require stable treatment effects. |
| `cic = TRUE` | Wald-CIC | You want the fuzzy extension of changes-in-changes, using distributional restrictions rather than a mean-only common-trends restriction. |
| `lqte = TRUE` | LQTE | You want local quantile treatment effects; this uses the same style of restrictions as Wald-CIC. |

The Wald-DID ratio is easy to interpret, but the papers show that it needs
extra restrictions on treatment effects to identify the LATE in fuzzy designs.
Wald-TC and Wald-CIC are useful alternatives when those restrictions are not
plausible. They rely on different identifying assumptions, so disagreement
between estimators is substantively informative rather than just a numerical
detail.

The Stata companion recommends using equality tests and placebo-style numerator
checks when the design allows them. In `Rfuzzydid`, use `eqtest = TRUE` when at
least two LATE estimators are requested. Use `numerator = TRUE` for reduced-form
placebo checks in periods where the treatment first stage should be zero.

# Basic R workflow

The formula is `outcome ~ treatment + covariates`. The `group` and `time`
arguments name the group and period variables.

Internally, `fuzzydid()` parses the formula into one numeric outcome, one
numeric treatment, and optional covariates. Numeric covariates are treated as
continuous predictors. Factor, character, and logical covariates are treated as
qualitative predictors and expanded to indicator columns with the first sorted
level omitted. With `sieves = TRUE`, continuous covariates are expanded to
polynomial basis terms; when `sieveorder = NULL`, the order is selected by a
deterministic five-fold cross-validation rule.

Rows with `NA` or `NaN` in analysis variables are removed by complete-case
filtering. `Inf` and `-Inf` are rejected because they are not meaningful support
points for the empirical means and quantile maps. Use `tagobs = TRUE` to return
the retained-row mask.

```{r basic-workflow}
library(Rfuzzydid)

make_cell <- function(g, t, n, p_d) {
  d <- rbinom(n, size = 1, prob = p_d)
  y <- 1 + 0.4 * g + 0.3 * t + 1.5 * d + rnorm(n, sd = 0.2)
  data.frame(y = y, d = d, g = g, t = t)
}

set.seed(4)
df <- rbind(
  make_cell(g = 0, t = 0, n = 80, p_d = 0.20),
  make_cell(g = 0, t = 1, n = 80, p_d = 0.30),
  make_cell(g = 1, t = 0, n = 80, p_d = 0.25),
  make_cell(g = 1, t = 1, n = 80, p_d = 0.70)
)

fit <- fuzzydid(
  data = df,
  formula = y ~ d,
  group = "g",
  time = "t",
  did = TRUE,
  tc = TRUE,
  cic = TRUE,
  eqtest = TRUE,
  breps = 50,
  seed = 1
)

summary(fit)
```

The `late` component contains point estimates, bootstrap standard errors, and
percentile bootstrap confidence intervals. The `matrices` component mirrors the
matrix-style results returned by the Stata command.

```{r extract-results}
fit$late
fit$eqtest
```

# Reading the main options

## Treatment categories

For ordered or multi-valued treatments, `newcateg` groups treatment values into
ordered bins before estimating Wald-TC or Wald-CIC:

```{r newcateg-example, eval = FALSE}
fuzzydid(
  data = df,
  formula = wage ~ schooling,
  group = "g",
  time = "t",
  tc = TRUE,
  cic = TRUE,
  newcateg = c(5, 8, 11, 14, 1000)
)
```

This follows the Stata article's treatment of applications where the treatment
has many support points.

## Partial identification

When the comparison group does not have a stable treatment rate, the Wald-TC
point estimand may be unavailable. The papers derive bounds for that setting,
and `partial = TRUE` requests the corresponding TC bounds:

```{r partial-example, eval = FALSE}
fuzzydid(
  data = df,
  formula = y ~ d,
  group = "g",
  time = "t",
  tc = TRUE,
  partial = TRUE,
  breps = 50,
  seed = 1
)
```

## Covariates

When the identifying restrictions are more credible after conditioning on
covariates, include those variables on the right-hand side of the formula.
`modelx` controls the parametric first-step models used for Wald-DID and
Wald-TC. With binary treatment, supply two entries: one for outcome conditional
expectations and one for treatment conditional expectations.

```{r covariate-example, eval = FALSE}
fuzzydid(
  data = df,
  formula = y ~ d + x1 + x2,
  group = "g",
  time = "t",
  did = TRUE,
  tc = TRUE,
  modelx = c("ols", "logit")
)
```

For continuous controls, `sieves = TRUE` requests nonparametric sieve
adjustment. If `sieveorder = NULL`, `Rfuzzydid` selects the order by deterministic
five-fold cross-validation.

## Inference

By default, `fuzzydid()` computes bootstrap standard errors and confidence
intervals. Use `breps` to set the number of replications and `seed` for
reproducibility. Use `cluster` for one-way clustered bootstrap inference.

`nose = TRUE` skips bootstrap inference and returns point estimates only. This
is useful while checking data construction, but final empirical results should
usually report uncertainty.

The returned object is a causal estimand summary, not a predictive model.
Accordingly, it supports estimate extractors such as `coef()`, `confint()`,
`vcov()`, `nobs()`, `formula()`, `generics::tidy()`, and `generics::glance()`,
but it does not define observation-level fitted values, residuals, or
predictions.

Runtime is approximately linear in the number of bootstrap replications. Within
each replication, the no-covariate estimators are dominated by group-period
subsetting and empirical distribution calculations. Covariate-adjusted
estimators additionally fit nuisance OLS/logit/probit models, and sieve
estimation grows with the number of generated basis terms.

# Translating from Stata

The core Stata command

```stata
fuzzydid y g t d, did tc cic breps(50)
```

maps to:

```{r stata-map, eval = FALSE}
fuzzydid(
  data = df,
  formula = y ~ d,
  group = "g",
  time = "t",
  did = TRUE,
  tc = TRUE,
  cic = TRUE,
  breps = 50
)
```

For multiple periods, the Stata command uses a forward group variable. In
`Rfuzzydid`, pass the backward group through `group` and the forward group
through `group_forward`:

```{r group-forward-example, eval = FALSE}
fuzzydid(
  data = panel_df,
  formula = y ~ d,
  group = "G_t",
  group_forward = "G_tplus1",
  time = "t",
  did = TRUE,
  tc = TRUE
)
```

See `vignette("stata-parity", package = "Rfuzzydid")` for direct R/Stata parity
examples, and `vignette("paper-replication", package = "Rfuzzydid")` for an
INPRES replication exercise.

# Practical checklist

Before treating an estimate as a causal effect, check the design:

- The treatment first stage should be in the expected direction and clearly
  away from zero.
- The group and time variables should encode the comparison used by the
  identifying argument, not merely convenient labels in the raw data.
- If the control group's treatment rate is stable, Wald-TC and Wald-CIC are
  natural alternatives to Wald-DID.
- If several estimators are reported, large disagreements should be interpreted
  as evidence that at least one identifying restriction may be inappropriate.
- If pre-treatment periods exist, reduced-form numerator checks can be used as
  placebo tests.

# References

de Chaisemartin, C. and D'Haultfoeuille, X. (2018). *Fuzzy
Differences-in-Differences*. *Review of Economic Studies*, 85(2): 999-1028.
doi:10.1093/restud/rdx049.

de Chaisemartin, C., D'Haultfoeuille, X., and Guyonvarch, Y. (2019). *Fuzzy
Differences-in-Differences with Stata*. *Stata Journal*, 19(2): 435-458.
doi:10.1177/1536867X19854019.