
Conduct the Jackknife-after-Bootstrap (JaB) algorithm for linear regression (Martin and Roberts 2010).

Usage

jab_lm(
  mod,
  stat = "rstudent",
  quant.lower = 0.05,
  quant.upper = 0.95,
  B = 3100,
  package.name = NULL,
  stat.args = NULL
)

Arguments

mod

An lm object. Output from stats::lm().

stat

A character string giving the name of the function used to calculate the desired influence/outlier statistic. The function must take an lm model object as its first argument and output a length \(n\) vector of the statistic of interest.
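For example, stats::rstudent() already fits this interface: it takes the fitted model as its first argument and returns one value per observation.

mod <- lm(sr ~ ., data = LifeCycleSavings)
gamma <- rstudent(mod)   # length-n vector of studentized residuals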

quant.lower

A numeric between 0 and 1 used as the lower cutoff in the JaB algorithm. Default is 0.05. Must be smaller than quant.upper.

quant.upper

A numeric between 0 and 1 used as the upper cutoff in the JaB algorithm. Default is 0.95. Must be larger than quant.lower.

B

Number of bootstrap samples. Default is 3100.

package.name

A character string of the name of the package that the stat function is in. If left as NULL, the function will be called as loaded in the user's environment.

stat.args

A named list of additional arguments the stat function may need beyond the mod object.
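As an illustration, a user-defined statistic that takes an extra argument could be paired with stat.args along these lines (my_stat and its scale argument are hypothetical names, not part of the package):

my_stat <- function(mod, scale = TRUE){
  d <- cooks.distance(mod)      # one value per observation
  if (scale) d <- d / max(d)    # hypothetical extra argument
  d
}
jab_lm(mod, stat = "my_stat", stat.args = list(scale = FALSE))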

Value

A data frame with 5 columns.

  • "row.ID": Row Number of the observation in the data set used for mod.

  • "lower": Lower quantile cutoff as determined by quant.lower.

  • "upper": Upper quantile cutoff as determined by quant.upper.

  • "orig": The original influence/outlier statistic calculated by stat.

  • "influential": Logical flagging if the observation is influential or not. TRUE if orig < lower or orig > upper. FALSE otherwise.

Details

The JaB algorithm, proposed by Martin and Roberts (2010) and further described by Beyaztas and Alin (2013), detects influential/outlier points in linear regression models. The algorithm is as follows:

  1. Let \(\gamma\) (stat) be the diagnostic statistic of interest. Fit the model (mod) and calculate \(\gamma_i\) for \(i=1,…,n\).

  2. Construct \(B\) bootstrap samples, with replacement, from the original data set.

  3. For \(i = 1,…,n\),

    • 3.1: Let \(B_{(i)}\) be the set of all bootstrap samples that did not contain data point \(i\).

    • 3.2: For each sample in \(B_{(i)}\), fit the regression model, then calculate the \(n\) values of \(\gamma_{i, (b)}\). Aggregate them into one vector \(\Gamma_i\).

    • 3.3: Calculate suitable quantiles of \(\Gamma_i\) (quant.lower and quant.upper). If \(\gamma_i\) falls outside this range, flag point \(i\) as influential.

The defaults for quant.lower and quant.upper are 0.05 and 0.95, respectively. This means that if \(\gamma_i\) is in the center 90% of \(\Gamma_i\), it will not be flagged as influential.
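For intuition, the steps above can be sketched in base R. This is only an illustration of the algorithm, not the internal implementation of jab_lm(); for simplicity, stat is taken here as a function rather than a character string.

jab_sketch <- function(mod, stat = rstudent, B = 3100,
                       quant.lower = 0.05, quant.upper = 0.95){
  dat <- model.frame(mod)
  n <- nrow(dat)
  gamma_orig <- stat(mod)                       # step 1: original statistics

  # step 2: B bootstrap samples of the rows, drawn with replacement
  idx <- replicate(B, sample.int(n, n, replace = TRUE), simplify = FALSE)
  boot_stats <- lapply(idx, function(ii) stat(update(mod, data = dat[ii, ])))

  flag <- logical(n)
  for (i in seq_len(n)) {
    keep <- vapply(idx, function(ii) !(i %in% ii), logical(1))  # step 3.1
    Gamma_i <- unlist(boot_stats[keep])                         # step 3.2
    cuts <- quantile(Gamma_i, c(quant.lower, quant.upper))      # step 3.3
    flag[i] <- gamma_orig[i] < cuts[1] || gamma_orig[i] > cuts[2]
  }
  flag
}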

Some influence statistics, such as the likelihood distance from Cook (1986), only take positive values, and large values imply influence. In these scenarios, it is appropriate to set quant.lower to 0 and quant.upper to some suitable quantile (say 0.90) so that point \(i\) is only flagged when \(\gamma_i\) is in the upper quantiles of \(\Gamma_i\).
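For instance, Cook's distance is also strictly positive, so a one-sided call along these lines flags only points whose statistic falls in the upper tail (stats::cooks.distance() already has the required interface):

result3 <- jab_lm(mod,
                  stat = "cooks.distance",
                  quant.lower = 0,
                  quant.upper = 0.90,
                  B = 3100)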

As described in Martin and Roberts (2010), to have approximately 1000 bootstrap samples in \(B_{(i)}\), we need \(B \approx 1000e \approx 3000\) bootstrap samples in total.
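A quick check of the expected count: a bootstrap sample of size \(n\) omits a given observation with probability roughly \(e^{-1}\), so \(B_{(i)}\) contains about \(B/e\) samples.

B <- 3100
B * exp(-1)   # expected number of samples in B_(i)
#> [1] 1140.426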

See vignette("jab-regression") for more details and examples.

References

Beyaztas U, Alin A (2013). “Jackknife-after-Bootstrap Method for Detection of Influential Observations in Linear Regression Models.” Communications in Statistics-Simulation and Computation, 42(6), 1256--1267.

Cook RD (1986). “Assessment of Local Influence.” Journal of the Royal Statistical Society. Series B (Methodological), 48(2), 133--169. ISSN 00359246, http://www.jstor.org/stable/2345711.

Martin MA, Roberts S (2010). “Jackknife-after-Bootstrap Regression Influence Diagnostics.” Journal of Nonparametric Statistics, 22(2), 257--269. doi:10.1080/10485250903287906.

Examples

library(stats)
data("LifeCycleSavings")

mod <- lm(sr ~ ., data = LifeCycleSavings)

# JaB with DFFITS
result1 <- jab_lm(mod,
                  stat = "dffits",
                  quant.lower = 0.025,
                  quant.upper = 0.975,
                  B = 3100)
result1[result1$influential, ]
#>        row.ID      lower     upper       orig influential
#> Japan      23 -0.5962788 0.6242944  0.8596508        TRUE
#> Zambia     46 -0.6127221 0.6599902  0.7482351        TRUE
#> Libya      49 -0.5972119 0.6806704 -1.1601334        TRUE


# define the likelihood distance as influence statistic (Cook, 1986)
infl_like <- function(mod){

  n <- length(mod$fitted.values)   # number of observations
  p <- length(mod$coefficients)    # number of estimated coefficients

  ti <- rstudent(mod)              # externally studentized residuals
  h <- hatvalues(mod)              # leverages (hat values)

  # two pieces of the likelihood distance expression
  p1 <- log( (n/(n-1)) * ((n-p-1) / (ti^2 + n-p-1)) )
  p2 <- ti^2 * (n-1) / (1-h) / (n-p-1)

  return(n*p1 + p2 - 1)
}

# JaB with Likelihood Distance
result2 <- jab_lm(mod,
                  stat = "infl_like",
                  quant.lower = 0.00,
                  quant.upper = 0.95,
                  B = 3100)
#> Error in jab_lm(mod, stat = "infl_like", quant.lower = 0, quant.upper = 0.95,     B = 3100): The function infl_like does not exist in your environment.
result2[result2$influential, ]
#> Error in eval(expr, envir, enclos): object 'result2' not found