---
title: "spefa: spatial stochastic frontier with fixed effects and endogeneity"
author: "Massimo Giannini (University of Rome Tor Vergata, Department of Enterprise Engineering)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{spefa: user guide}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = FALSE)
```

## 1. What this package estimates

`spefa` implements the maximum-likelihood estimator developed in

> Giannini, M. (2025). *A spatial stochastic frontier model with fixed effects
> and endogenous environmental variables.* **Spatial Economic Analysis**,
> 20(3), 420–441. doi:10.1080/17421772.2024.2414962

The model is a stochastic production (or cost) frontier on panel data that
**simultaneously** accounts for three features that the earlier literature
treated separately:

* **individual fixed effects** \(\alpha_i\), removed by first differencing
  (following Wang and Ho, 2010), which avoids the incidental-parameters
  problem and sidesteps the singularity of the within transformation in
  spatial panels;
* **endogenous regressors and environmental variables**, corrected through a
  Gaussian control function obtained from a Cholesky factorisation of the
  error covariance (following Kutlu, Tran and Tsionas, 2020);
* **spatial spillovers**, through a spatial-autoregressive (SAR) term
  \(\lambda W y\).

For unit \(i\) and period \(t\) the frontier is

$$ y_{it} = \alpha_i + \lambda \sum_j w_{ij} y_{jt} + x_{it}'\beta
            - u_{it} + v_{it}, \qquad u_{it} = h_{it}\,u_i^{*}, $$

with \(h_{it} = \exp(z_{it}'\phi) > 0\) a multiplicative scaling and a
time-invariant \(u_i^{*} \sim N^{+}(\mu,\sigma_u^2)\) (half-normal when
\(\mu = 0\)). Endogenous variables \(q_j\) obey a reduced form
\(q_{j,it} = z_{j,it}'\delta_j + \varepsilon_{j,it}\), and endogeneity is the
correlation \(\rho_j = \mathrm{corr}(\varepsilon_j, v)\). The full derivation
of the likelihood is in Giannini (2025); the package implements it in closed
form and maximises it numerically.

The implementation depends only on base R (the `stats` package).

## 2. Installation

```{r}
install.packages("spefa_0.1.0.tar.gz", repos = NULL, type = "source")
library(spefa)
```

## 3. The data you supply

The estimator needs a **balanced** panel and a spatial weight matrix:

* a `data.frame` in long format with a unit identifier and a time identifier;
* the outcome `y` and the frontier regressors `X` (these may include
  endogenous variables);
* for each endogenous variable, one or more **instruments**;
* the variables that enter the inefficiency **scaling** \(h_{it}\);
* an \(N \times N\) weight matrix `W`, with rows and columns ordered by the
  **sorted unit id**, and (typically) row-normalised.

## 4. The `spefa()` function and its options

```{r}
spefa(formula, data, index, W,
      endogenous = NULL, scaling = NULL, mu = FALSE,
      control = list(maxit = 500, reltol = 1e-9))
```

| Argument | Meaning |
|---|---|
| `formula` | Frontier equation, e.g. `y ~ x1 + q1`. The intercept is removed by first differencing, so it is irrelevant. Endogenous regressors are listed here like any other regressor. |
| `data` | A balanced panel `data.frame`. |
| `index` | Length-2 character vector `c(unit, time)`, e.g. `c("id","time")`. |
| `W` | The \(N\times N\) spatial weight matrix (sorted-id order). |
| `endogenous` | Named list mapping each endogenous variable to its instruments, e.g. `list(q1 = ~ z1, q2 = ~ z2 + z3)`. Use `NULL` for no endogeneity (the model is then a plain spatial SF with fixed effects). An endogenous variable may appear in the frontier, in the scaling, or both. |
| `scaling` | One-sided formula of the variables entering \(h_{it}=\exp(z_{it}'\phi)\), e.g. `~ x2 + q2`. There is **no intercept** (it is absorbed into \(\sigma_u\)). Needs at least one time-varying term. |
| `mu` | `FALSE` (default) gives a half-normal inefficiency (\(\mu=0\)); `TRUE` estimates the truncation point \(\mu\) (truncated-normal). |
| `control` | List passed to `stats::optim` (`maxit`, `reltol`). |

## 5. What is returned, and the available methods

`spefa()` returns an object of class `"spefa"`. The estimated parameter vector
contains the frontier slopes \(\beta\), the instrument slopes \(\delta_j\), the
scaling coefficients \(\phi\), the reduced-form variances
\(\sigma^2_{\varepsilon_j}\), the endogeneity correlations \(\rho_j\),
\(\sigma^2_v\), \(\sigma^2_u\), (optionally \(\mu\)) and the spatial parameter
\(\lambda\).

| Method | Returns |
|---|---|
| `summary(fit)` | Coefficient table with standard errors, \(z\)-statistics and \(p\)-values; log-likelihood, AIC, BIC; convergence code; \(\lambda\) and the endogeneity correlations \(\rho\). |
| `coef(fit)`, `vcov(fit)` | Point estimates and the (delta-method) covariance matrix. |
| `logLik(fit)`, `AIC(fit)`, `BIC(fit)`, `nobs(fit)` | Standard fit statistics. |
| `efficiency(fit, spatial = TRUE)` | Conditional-mean (Jondrow et al., 1982) inefficiency, optionally rescaled by the spatial multiplier \((I-\lambda W)^{-1}\). Returns the \(N\times T\) inefficiency matrix `u`, the efficiency matrix `eff = exp(-u)`, and the per-unit `Eu_i`. |
| `impacts(fit)` | Direct, indirect and total marginal effects (LeSage and Pace, 2009) of each frontier regressor through \((I-\lambda W)^{-1}\). |

Standard errors come from the Hessian at the optimum (computed on the
unconstrained scale) propagated to the natural parameters by the delta method.

## 6. A worked example

A small demo panel (`spefademo`, \(N = 40\), \(T = 12\)) ships with the
package and is used here. The data-generating
values are \(\beta=(0.5,0.5)\), \(\delta=(1,1)\), \(\phi=(0.3,0.3)\),
\(\sigma^2_\varepsilon=0.09\), \(\sigma^2_v=0.04\), \(\rho=(0.5,0.5)\),
\(\sigma^2_u=0.09\), \(\lambda=0.5\); `q1` and `q2` are endogenous, `q1` enters
the frontier and `q2` the scaling.

```{r}
library(spefa)
data(spefademo)

fit <- spefa(y ~ x1 + q1, data = spefademo, index = c("id", "time"),
             W = spefaW, endogenous = list(q1 = ~ z1, q2 = ~ z2),
             scaling = ~ x2 + q2)

summary(fit)
```

Indicative output from a larger sample (\(N=150\), to show parameter recovery; the bundled \(N=40\) panel gives the same point estimates with wider standard errors):

```
               Estimate Std.Error  z value  Pr(>|z|)
x1              0.5002   0.0027    182.2    <2e-16 ***
q1              0.5007   0.0032    155.2    <2e-16 ***
delta.q1.z1     0.9991   0.0056    179.8    <2e-16 ***
delta.q2.z2     1.0029   0.0054    186.6    <2e-16 ***
x2              0.2995   0.0120     24.9    <2e-16 ***
q2              0.3051   0.0120     25.5    <2e-16 ***
rho.q1          0.4763   0.0134     35.5    <2e-16 ***
rho.q2          0.5095   0.0130     39.2    <2e-16 ***
sigma2_v        0.0391   0.0010     38.5    <2e-16 ***
sigma2_u        0.0879   0.0162      5.4    5e-08  ***
lambda          0.5027   0.0044    114.5    <2e-16 ***

Spatial parameter lambda = 0.5027
Endogeneity correlations rho (q1, q2): 0.476, 0.509
```

```{r}
impacts(fit)            # direct / indirect / total marginal effects
ef <- efficiency(fit)   # spatially-corrected technical efficiency
summary(as.vector(ef$eff))
```

## 7. Reading the results

* **Endogeneity.** A \(\rho_j\) statistically different from zero signals
  endogeneity of variable \(j\); the \(z\)-test in the coefficient table is a
  direct test of \(H_0:\rho_j=0\). When all \(\rho_j=0\) the control-function
  correction vanishes and the estimator collapses to the exogenous case.
* **Spatial dependence.** \(\lambda\) measures the strength of spillovers; the
  `impacts()` table splits each regressor's effect into a direct (own-unit) and
  an indirect (spillover) component, with total \(\approx \beta_k/(1-\lambda)\)
  under row-normalised `W`.
* **Efficiency.** `efficiency(fit)$eff` gives unit-by-period technical
  efficiency in \((0,1]\); with `spatial = TRUE` the scores incorporate
  neighbours' inefficiency through \((I-\lambda W)^{-1}\).

## 8. Practical notes and identification

* The panel must be **balanced**, and the rows/columns of `W` must follow the
  sorted unit id used in `data`.
* The inefficiency parameters \(\phi\) and \(\sigma_u\) are identified only
  when the **scaling variables vary over time** (otherwise first differencing
  removes the inefficiency) and when the inefficiency is non-negligible. If the
  inefficiency is very small, expect \(\phi\) and \(\sigma_u\) to be weakly
  identified.
* The endogeneity correction maintains the usual control-function assumptions:
  joint normality of the structural and reduced-form errors, and inefficiency
  independent of the regressors.
* The likelihood evaluates the reduced-form (instrument) density on the data
  and counts the Gaussian normalising constant once per unit; the
  truncated-normal terms use the log-CDF for numerical stability.

## References

Giannini, M. (2025). A spatial stochastic frontier model with fixed effects and
endogenous environmental variables. *Spatial Economic Analysis*, 20(3),
420–441.

Jondrow, J., Lovell, C. A. K., Materov, I. S., & Schmidt, P. (1982). On the
estimation of technical inefficiency in the stochastic frontier production
function model. *Journal of Econometrics*, 19(2–3), 233–238.

Kutlu, L., Tran, K. C., & Tsionas, M. G. (2020). A spatial stochastic frontier
model with endogenous frontier and environmental variables. *European Journal
of Operational Research*, 286(1), 389–399.

LeSage, J., & Pace, R. K. (2009). *Introduction to Spatial Econometrics*. CRC
Press.

Wang, H.-J., & Ho, C.-W. (2010). Estimating fixed-effect panel stochastic
frontier models by model transformation. *Journal of Econometrics*, 157(2),
286–296.