Conducts the Jackknife-after-Bootstrap (JaB) algorithm for linear regression (Martin and Roberts 2010).
Usage
jab_lm(
  mod,
  stat = "rstudent",
  quant.lower = 0.05,
  quant.upper = 0.95,
  B = 3100,
  package.name = NULL,
  stat.args = NULL
)
Arguments
- mod
An lm object. Output from stats::lm().
- stat
A character string of the function name used to calculate the desired centrality statistic. The function must take an lm model object as its first argument and return a length \(n\) vector of the statistic of interest.
- quant.lower
A numeric between 0 and 1 used as the lower cutoff in the JaB algorithm. Default is 0.05. Must be smaller than quant.upper.
- quant.upper
A numeric between 0 and 1 used as the upper cutoff in the JaB algorithm. Default is 0.95. Must be larger than quant.lower.
- B
Number of bootstrap samples. Default is 3100.
- package.name
A character string of the name of the package that contains the stat function. If left as NULL, the function will be called as loaded in the user's environment.
- stat.args
A named list of additional arguments the stat function may need beyond the mod object.
Value
A data frame with 5 columns:
- "row.ID": Row number of the observation in the data set used for mod.
- "lower": Lower quantile cutoff as determined by quant.lower.
- "upper": Upper quantile cutoff as determined by quant.upper.
- "orig": The original influence/outlier statistic calculated by stat.
- "influential": Logical flag indicating whether the observation is influential. TRUE if orig < lower or orig > upper, FALSE otherwise.
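As a small illustration of how the "influential" column relates to the other columns, the sketch below rebuilds the flag by hand. The three rows are copied from the DFFITS output shown in the Examples section of this page; the object names res and flag are only illustrative.

```r
# Rows copied from the DFFITS example output on this page.
res <- data.frame(
  row.ID = c(23, 46, 49),
  lower  = c(-0.5962788, -0.6127221, -0.5972119),
  upper  = c(0.6242944, 0.6599902, 0.6806704),
  orig   = c(0.8596508, 0.7482351, -1.1601334),
  row.names = c("Japan", "Zambia", "Libya")
)

# A point is flagged when its original statistic falls outside [lower, upper].
flag <- res$orig < res$lower | res$orig > res$upper
flag
```

Japan and Zambia exceed their upper cutoffs and Libya falls below its lower cutoff, so all three are flagged.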
Details
The JaB algorithm, proposed by Martin and Roberts (2010) and further described by Beyaztas and Alin (2013), detects influential/outlier points in linear regression models. The algorithm is as follows:
1. Let \(\gamma\) (stat) be the diagnostic statistic of interest. Fit the model (mod) and calculate \(\gamma_i\) for \(i = 1, \ldots, n\).
2. Construct \(B\) bootstrap samples, with replacement, from the original data set.
3. For \(i = 1, \ldots, n\):
3.1: Let \(B_{(i)}\) be the set of all bootstrap samples that do not contain data point \(i\).
3.2: For each sample in \(B_{(i)}\), fit the regression model, then calculate the \(n\) values of \(\gamma_{i, (b)}\). Aggregate them into one vector \(\Gamma_i\).
3.3: Calculate suitable quantiles of \(\Gamma_i\) (quant.lower and quant.upper). If \(\gamma_i\) is outside of this range, flag point \(i\) as influential.
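The steps above can be sketched in base R as follows. This is a minimal illustration and not the package's implementation: the names jab_sketch and stat_fun are hypothetical, the statistic is passed as a function rather than a character string, and the sketch assumes the model formula uses untransformed columns so that the model can be refit on rows of model.frame(mod).

```r
# Hypothetical helper illustrating the JaB steps; not the jab_lm() source.
jab_sketch <- function(mod, stat_fun = rstudent, B = 500,
                       quant.lower = 0.05, quant.upper = 0.95) {
  dat  <- model.frame(mod)             # data actually used to fit mod
  n    <- nrow(dat)
  orig <- as.numeric(stat_fun(mod))    # step 1: statistic on the original fit

  # Step 2: B bootstrap samples (row indices drawn with replacement),
  # each refit and reduced to its n values of the statistic.
  idx <- replicate(B, sample.int(n, n, replace = TRUE), simplify = FALSE)
  boot_stats <- lapply(idx, function(ix) {
    fit <- lm(formula(mod), data = dat[ix, , drop = FALSE])
    as.numeric(stat_fun(fit))
  })

  # Step 3: for each i, pool the statistics from samples that exclude i
  # and take the requested quantiles of the pooled vector.
  lower <- upper <- numeric(n)
  for (i in seq_len(n)) {
    excludes_i <- !vapply(idx, function(ix) i %in% ix, logical(1))
    Gamma_i <- unlist(boot_stats[excludes_i])
    Gamma_i <- Gamma_i[is.finite(Gamma_i)]   # drop NaN/Inf from degenerate refits
    q <- quantile(Gamma_i, c(quant.lower, quant.upper), names = FALSE)
    lower[i] <- q[1]
    upper[i] <- q[2]
  }

  data.frame(row.ID = seq_len(n), lower = lower, upper = upper,
             orig = orig, influential = orig < lower | orig > upper)
}

set.seed(1)
sketch_res <- jab_sketch(lm(sr ~ ., data = LifeCycleSavings), B = 200)
head(sketch_res)
```

B is kept small here only so the sketch runs quickly; the cutoffs it produces will be noisier than those from the default B = 3100.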
The defaults for quant.lower and quant.upper are 0.05 and 0.95, respectively. This means that if \(\gamma_i\) is in the center 90% of \(\Gamma_i\), it will not be flagged as influential.
Some influence statistics, such as the likelihood distance from Cook (1986), take only positive values, and large values imply influence. In these scenarios, it is appropriate to set quant.lower to 0 and quant.upper to some suitable quantile (say 0.90) so that point \(i\) is only flagged when \(\gamma_i\) is in the upper quantiles of \(\Gamma_i\).
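As described in Martin and Roberts (2010), a bootstrap sample omits a given data point with probability of approximately \(e^{-1}\), so \(B_{(i)}\) contains about \(Be^{-1}\) samples. To have approximately 1000 bootstrap samples in \(B_{(i)}\), we therefore need \(B \approx 1000e^{1}\), or roughly 3000, bootstrap samples.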
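The arithmetic behind the default number of bootstrap samples can be written out as a short worked calculation. It relies only on the standard fact that a bootstrap sample of size \(n\) drawn with replacement misses a fixed point with probability \((1 - 1/n)^n\); the value \(n = 50\) is the sample size of the LifeCycleSavings data used in the Examples.

```latex
\begin{align*}
P(\text{point } i \notin \text{bootstrap sample})
  &= \left(1 - \tfrac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368
  \qquad (\text{for } n = 50:\ 0.98^{50} \approx 0.364), \\
E\,\lvert B_{(i)} \rvert &\approx B e^{-1}, \\
B e^{-1} = 1000 &\;\Longrightarrow\; B = 1000\,e \approx 2718, \\
B = 3100 &\;\Longrightarrow\; E\,\lvert B_{(i)} \rvert \approx 3100 \times 0.368 \approx 1140 .
\end{align*}
```

So the default B = 3100 leaves a little more than 1000 bootstrap samples, on average, for each point's reference distribution.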
See vignette("jab-regression") for more details and examples.
References
Beyaztas U, Alin A (2013).
“Jackknife-after-Bootstrap Method for Detection of Influential Observations in Linear Regression Models.”
Communications in Statistics-Simulation and Computation, 42(6), 1256--1267.
Cook RD (1986).
“Assessment of Local Influence.”
Journal of the Royal Statistical Society. Series B (Methodological), 48(2), 133--169.
ISSN 00359246, http://www.jstor.org/stable/2345711.
Martin MA, Roberts S (2010).
“Jackknife-after-Bootstrap Regression Influence Diagnostics.”
Journal of Nonparametric Statistics, 22(2), 257--269.
doi:10.1080/10485250903287906.
Examples
library(stats)
data("LifeCycleSavings")
mod <- lm(sr ~ ., data = LifeCycleSavings)
# JaB with DFFITS
result1 <- jab_lm(mod,
                  stat = "dffits",
                  quant.lower = 0.025,
                  quant.upper = 0.975,
                  B = 3100)
result1[result1$influential, ]
#> row.ID lower upper orig influential
#> Japan 23 -0.5962788 0.6242944 0.8596508 TRUE
#> Zambia 46 -0.6127221 0.6599902 0.7482351 TRUE
#> Libya 49 -0.5972119 0.6806704 -1.1601334 TRUE
# define the likelihood distance as influence statistic (Cook, 1986)
infl_like <- function(mod){
  n <- length(mod$fitted.values)
  p <- length(mod$coefficients)
  ti <- rstudent(mod)
  h <- hatvalues(mod)
  p1 <- log( (n/(n-1)) * ((n-p-1) / (ti^2 +n-p-1)) )
  p2 <- ti^2 * (n-1) / (1-h) / (n-p-1)
  return(n*p1 + p2 - 1)
}
# JaB with Likelihood Distance
result2 <- jab_lm(mod,
                  stat = "infl_like",
                  quant.lower = 0.00,
                  quant.upper = 0.95,
                  B = 3100)
#> Error in jab_lm(mod, stat = "infl_like", quant.lower = 0, quant.upper = 0.95, B = 3100): The function infl_like does not exist in your environment.
result2[result2$influential, ]
#> Error in eval(expr, envir, enclos): object 'result2' not found