Estimands: ATE, ATT, LATE etc.
==============================

Every causal estimator targets a specific *estimand*. Choosing the wrong estimator for your question gives a valid answer to the wrong
question. formative works with four estimands: ATE, ATT, LATE, and LATE at the cutoff. Each method targets exactly one of them.

.. list-table::
   :header-rows: 1
   :widths: 20 30 50

   * - Estimand
     - Full name
     - Methods
   * - **ATE**
     - Average Treatment Effect
     - :class:`~formative.OLSObservational`, :class:`~formative.RCT`
   * - **ATT**
     - Average Treatment Effect on the Treated
     - :class:`~formative.PropensityScoreMatching`, :class:`~formative.DiD`
   * - **LATE**
     - Local Average Treatment Effect
     - :class:`~formative.IV2SLS`
   * - **LATE at the cutoff**
     - Local Average Treatment Effect at the cutoff
     - :class:`~formative.RDD`

----

ATE: Average Treatment Effect
--------------------------------

**What it answers:** If everyone in the population were treated, compared with no one being
treated, what would the average change in outcome be?

.. math::

   \text{ATE} = \mathbb{E}[Y(1) - Y(0)]

:math:`Y(1)` is the potential outcome under treatment and :math:`Y(0)` under control, for
the same unit. ATE averages this difference across *all units*, treated and untreated alike.

**When it makes sense:** When you want a policy-relevant effect for the whole population,
e.g., "what would happen if we rolled this programme out to everyone?". This is usually the
question people want answered, but in formative only the OLS and RCT estimators target it.

RCTs identify the ATE because treatment is randomised, so the treatment and control groups
are balanced in expectation on both observed and unobserved characteristics. If the ATE is
1.5, that estimate applies to the control group too, because there is no reason to expect
control units to differ systematically from treated units.

OLS can also recover the ATE, but only under the strong assumption that every confounder is
observed. Treatment is not randomised, but if you adjust for all variables that affect both
treatment and outcome, the remaining comparison between treated and untreated units is unconfounded.

**Methods that estimate ATE:**

- **OLS Observational** — estimates ATE by adjusting for confounders identified via the
  backdoor criterion. Requires all relevant confounders to be observed.
- **RCT** — randomisation makes treated and control groups exchangeable, so the simple
  difference in means is an unbiased ATE estimate with no adjustment required.
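
The contrast between the two routes to the ATE can be seen in a small simulation. This is a
minimal sketch using NumPy only, not formative's API; the data-generating process, the
variable names, and the true effect of 1.5 are all assumptions chosen for illustration.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)
   n = 200_000

   # Hypothetical data-generating process (an assumption for illustration).
   x = rng.normal(size=n)                     # observed confounder
   y0 = 1.0 + 2.0 * x + rng.normal(size=n)    # potential outcome without treatment
   y1 = y0 + 1.5                              # constant effect, so the true ATE is 1.5

   # RCT: treatment assigned by coin flip, independent of x.
   d_rct = rng.integers(0, 2, size=n)
   y_rct = np.where(d_rct == 1, y1, y0)
   ate_rct = y_rct[d_rct == 1].mean() - y_rct[d_rct == 0].mean()

   # Observational data: units with high x are more likely to be treated.
   d_obs = (x + rng.normal(size=n) > 0).astype(int)
   y_obs = np.where(d_obs == 1, y1, y0)
   naive = y_obs[d_obs == 1].mean() - y_obs[d_obs == 0].mean()

   # OLS of y on [1, d, x]: the coefficient on d is the adjusted estimate.
   design = np.column_stack([np.ones(n), d_obs, x])
   coef, *_ = np.linalg.lstsq(design, y_obs, rcond=None)
   ate_ols = coef[1]

   print(f"RCT difference in means:        {ate_rct:.2f}")
   print(f"Naive observational difference: {naive:.2f}")
   print(f"OLS adjusted for x:             {ate_ols:.2f}")

In this setup the naive observational difference sits well above 1.5 because treated units
have higher ``x``, while the randomised comparison and the adjusted regression both land
close to the true effect.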

----

ATT: Average Treatment Effect on the Treated
-----------------------------------------------

**What it answers:** Among units that actually received treatment, how much did treatment
change their outcome?

.. math::

   \text{ATT} = \mathbb{E}[Y(1) - Y(0) \mid \text{treated}]

ATT conditions on the treated group. It asks what the treated units would have experienced
had they *not* been treated. This is a counterfactual that is never directly observed.

**When it makes sense:** When you care specifically about the effect for those who
self-selected into treatment, or when the treated group is the policy-relevant population,
e.g., "did the training programme benefit the workers who enrolled?". This is a narrower
question than ATE, and the answer may not generalise to the broader population. In practice
an ATT estimate is often reported as if it were the ATE, but the two coincide only when the
average effect is the same for treated and untreated units.

Matching targets the ATT rather than the ATE. For each treated unit, it finds one or more
untreated units that look similar on observables. This constructs the missing counterfactual:
what the treated unit would have experienced without treatment. To recover the ATE, you would
have to do the same in reverse and find a treated match for every untreated unit. That is a
much harder problem, because the treatment group is usually much smaller than the control
group, so good treated matches are scarce.
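
A minimal sketch of that counterfactual construction, assuming NumPy only and a single
observed covariate ``x``. In this hypothetical setup the propensity score is a monotone
function of ``x``, so matching on ``x`` directly stands in for matching on the score. The
data-generating process and all names are assumptions, and this is not formative's
implementation.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(1)
   n = 50_000

   # Hypothetical data: the effect grows with x, and high-x units select into treatment.
   x = rng.normal(size=n)
   d = x + rng.normal(size=n) > 0
   y0 = 2.0 * x + rng.normal(size=n)
   y1 = y0 + 1.0 + x
   y = np.where(d, y1, y0)

   # Ground truth, available only because both potential outcomes are simulated.
   true_ate = (y1 - y0).mean()
   true_att = (y1 - y0)[d].mean()

   # Sort controls by x so each treated unit's nearest control can be found quickly.
   order = np.argsort(x[~d])
   xc, yc = x[~d][order], y[~d][order]
   xt, yt = x[d], y[d]

   # Candidate neighbours: the control just below and just above each treated x.
   pos = np.searchsorted(xc, xt)
   left = np.clip(pos - 1, 0, xc.size - 1)
   right = np.clip(pos, 0, xc.size - 1)
   nearest = np.where(np.abs(xt - xc[left]) <= np.abs(xc[right] - xt), left, right)

   # 1-nearest-neighbour matching with replacement: average treated-minus-match gap.
   att_hat = (yt - yc[nearest]).mean()

   print(f"true ATE {true_ate:.2f}, true ATT {true_att:.2f}, matched ATT {att_hat:.2f}")

Because the simulated effect grows with ``x`` and high-``x`` units are more likely to be
treated, the true ATT exceeds the true ATE, and the matched estimate tracks the ATT rather
than the ATE.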

The same restriction applies to DiD, but for a different reason. DiD constructs the
counterfactual for the treated group by using the control group's time trend.
The parallel trends assumption says the two groups would have trended together,
but it says nothing about what the treatment effect would be for the control group
if they had been treated. For DiD to recover the ATE, you would have to assume that the
treatment effect is the same for both groups, which is a strong assumption. For example, say
you launched an app in a few countries first. You probably picked those countries because you
expected the app to be more beneficial there than elsewhere, so the ATT is likely greater
than the ATE.

**Methods that estimate ATT:**

- **Propensity Score Matching** — each treated unit is matched to the most similar control
  unit by propensity score, and the ATT is the average outcome difference across matched
  pairs.
- **DiD** — compares how outcomes changed over time for the treated group versus a control
  group. Under parallel trends, the common time trend cancels out, leaving the ATT as the
  interaction coefficient on group × time.
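
The two-group, two-period case can be sketched with NumPy alone. The simulated data below,
including the group gap of 3.0, the common trend of 0.5, and the ATT of 2.0, are
hypothetical. The sketch only illustrates that the double difference of cell means equals
the interaction coefficient; it is not formative's implementation.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(2)
   n = 50_000  # observations per group-period cell (hypothetical)

   group = np.repeat([0, 0, 1, 1], n)  # 1 = treated group
   post = np.repeat([0, 1, 0, 1], n)   # 1 = period after treatment starts
   att = 2.0

   # Parallel trends hold by construction: both groups share the 0.5 time trend.
   y = 1.0 + 3.0 * group + 0.5 * post + att * group * post + rng.normal(size=4 * n)

   def cell_mean(g, t):
       """Mean outcome for group g in period t."""
       return y[(group == g) & (post == t)].mean()

   # Change over time in the treated group minus change over time in the control group.
   did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

   # Same quantity as the coefficient on group * post in a saturated regression.
   design = np.column_stack([np.ones_like(y), group, post, group * post])
   coef, *_ = np.linalg.lstsq(design, y, rcond=None)
   did_ols = coef[3]

   print(f"DiD from cell means: {did_means:.3f}")
   print(f"DiD from regression: {did_ols:.3f}")

The group gap and the common trend both cancel in the double difference, which is why
neither needs to be zero for the estimate to recover the ATT.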

**ATE vs ATT:** When treatment is randomly assigned (as in an RCT), ATE = ATT, because the
treated group is a random draw from the population. In observational settings they can
differ — people who select into treatment often do so because the treatment is particularly
beneficial for them (positive selection), making ATT > ATE.
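
The link between the two estimands can be written as a weighted average. Here :math:`p`
is the share of treated units and ATU is the average treatment effect on the untreated
(notation introduced for this remark only, with :math:`0 < p < 1` assumed):

.. math::

   \text{ATE} = p \cdot \text{ATT} + (1 - p) \cdot \text{ATU}, \qquad \text{ATU} = \mathbb{E}[Y(1) - Y(0) \mid \text{untreated}]

So ATE = ATT exactly when ATT = ATU. Randomisation guarantees this in expectation, whereas
positive selection gives ATT > ATU and therefore ATT > ATE.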

----

LATE: Local Average Treatment Effect
---------------------------------------

**What it answers:** Among units whose treatment status was *moved* by the instrument, what
was the effect of treatment?

.. math::

   \text{LATE} = \mathbb{E}[Y(1) - Y(0) \mid \text{complier}]

IV estimation with an instrument :math:`Z` isolates only the variation in treatment caused
by :math:`Z`. Only "compliers" — units who take treatment when
:math:`Z = 1` and not when :math:`Z = 0` — contribute to the estimate.

**When it makes sense:** When a clean instrument is available and you are willing to
interpret the result as the effect for compliers. If compliers are representative of the
broader population, LATE ≈ ATE. If not, the LATE may be very different from the ATE.

That is why IV is said to recover LATE, not ATE. In order to get ATE, you would need to know
the treatment effect for never-takers and always-takers too, which is not possible without
additional assumptions.

**Methods that estimate LATE:**

- **IV / 2SLS** — the Wald estimator (reduced form divided by first stage) identifies the
  LATE under the standard IV assumptions: relevance, exclusion restriction, independence,
  and monotonicity.
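
The Wald ratio can be sketched with NumPy alone. The simulation below assumes a randomly
assigned binary instrument and three latent compliance types with hypothetical shares and
effects (compliers 1.0, always-takers 3.0, never-takers 2.0). It illustrates the estimand
and is not formative's implementation.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(3)
   n = 400_000

   # Latent compliance types (hypothetical shares); never observed in real data.
   types = rng.choice(["complier", "always", "never"], size=n, p=[0.5, 0.2, 0.3])
   z = rng.integers(0, 2, size=n)  # randomly assigned binary instrument

   # Always-takers take treatment regardless of z, never-takers never do,
   # and compliers follow the instrument. There are no defiers (monotonicity).
   d = np.where(types == "always", 1, np.where(types == "never", 0, z))

   # Individual treatment effects differ by type.
   effect = np.select(
       [types == "complier", types == "always", types == "never"],
       [1.0, 3.0, 2.0],
   )
   y = rng.normal(size=n) + effect * d  # z affects y only through d (exclusion)

   reduced_form = y[z == 1].mean() - y[z == 0].mean()  # effect of z on y
   first_stage = d[z == 1].mean() - d[z == 0].mean()   # effect of z on d
   late_hat = reduced_form / first_stage

   ate = effect.mean()  # known only because the effects are simulated

   print(f"Wald estimate {late_hat:.2f}, complier effect 1.00, ATE {ate:.2f}")

The ratio lands near the complier effect rather than the population ATE, because
always-takers and never-takers do not change their treatment status with ``z`` and so drop
out of both the numerator and the denominator.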

**LATE vs ATE vs ATT:** LATE is narrower than both ATE and ATT. It applies only to the complier
subpopulation, which is typically latent (you cannot directly observe who the compliers are).
Whether the LATE generalises depends on how similar compliers are to the rest of the
population. This cannot be answered from the data alone.

----

LATE at the cutoff: Local Average Treatment Effect at the cutoff
-----------------------------------------------------------------

**What it answers:** Among units just at the threshold of the running variable, what is
the effect of crossing from one side to the other?

.. math::

   \text{LATE at cutoff} = \mathbb{E}[Y(1) - Y(0) \mid X = c] = \lim_{x \downarrow c} \mathbb{E}[Y \mid X = x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X = x]

More intuitively: as the running variable :math:`X` approaches the cutoff :math:`c` from
either side, what is the jump in expected outcome?

RDD estimates this by fitting separate linear regressions on each side of the cutoff and
measuring the discontinuous jump at :math:`c`. Treatment is assigned deterministically by the
rule :math:`X \ge c`, so as long as units cannot precisely control which side they land on,
those just above and just below the cutoff are effectively comparable: they differ only in
whether they cleared the threshold. This local exchangeability is what makes identification
possible without randomisation.

**When it makes sense:** When treatment is assigned by a threshold rule on a continuous running
variable — test score cutoffs, income eligibility limits, age thresholds — and you want to
know the causal effect for units near that threshold.

**The key limitation:** The estimate is *local*. It applies only to units near the cutoff,
not to units far from it. Whether the effect generalises to the rest of the distribution depends
on how much treatment effects vary with the running variable, which cannot be tested from the
data alone. If the running variable is a test score and the cutoff is the 60th percentile,
the LATE at the cutoff says nothing about the effect for units at the 30th or 90th percentile.

**LATE at the cutoff vs LATE (IV):** Both are "local" effects, but in different senses.
IV's LATE is local to compliers — a latent subpopulation defined by their response to the
instrument. RDD's LATE at the cutoff is local to a *region* of the running variable — units
near the threshold. If the bandwidth is wide, more units are included but the local
exchangeability assumption becomes harder to justify. If the bandwidth is narrow, the
assumption is more defensible but the estimate has higher variance.

**Methods that estimate LATE at the cutoff:**

- **RDD** — fits a local linear regression on both sides of the cutoff. The coefficient on
  the treatment indicator gives the jump in outcome at the threshold, controlling for the
  slope of the running variable separately on each side.
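
A minimal sketch of such a local linear fit, assuming NumPy only, a sharp design, a uniform
kernel, and a hand-picked bandwidth of 0.25. All of these, along with the simulated data and
the true jump of 2.0, are assumptions for illustration and do not describe formative's
internals.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(4)
   n = 100_000
   cutoff = 0.0
   bandwidth = 0.25  # hypothetical choice; wider means more data but a less local fit

   # Hypothetical data: a smooth outcome curve plus a jump of 2.0 at the cutoff.
   x = rng.uniform(-1.0, 1.0, size=n)
   d = (x >= cutoff).astype(float)  # sharp design: treatment determined by x alone
   y = 1.0 + 0.8 * x + 0.5 * x**2 + 2.0 * d + rng.normal(scale=0.5, size=n)

   # Keep only units within the bandwidth and centre the running variable at the cutoff.
   near = np.abs(x - cutoff) <= bandwidth
   xc = x[near] - cutoff
   dn = d[near]

   # Columns: intercept, treatment indicator, slope below the cutoff,
   # and the change in slope above the cutoff.
   design = np.column_stack([np.ones(xc.size), dn, xc, dn * xc])
   coef, *_ = np.linalg.lstsq(design, y[near], rcond=None)
   rdd_hat = coef[1]  # jump in the fitted outcome at the cutoff

   print(f"Estimated jump at the cutoff: {rdd_hat:.2f}")

Centring the running variable at the cutoff is what makes the coefficient on the treatment
indicator equal to the gap between the two fitted lines at :math:`c`.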