Bootstrap Illustration
Overview
This report illustrates the bootstrap method, originally developed by Bradley Efron (1979a; 1979b; 1982).
Setting the Environment
Setting the Initial Parameters
n <- 1000
mean <- 0
sd <- 1
Theoretical Distribution
We start with a theoretical normal distribution with mean (\(\mu\)) 0 and standard deviation (\(\sigma\)) 1, representing the theoretical distribution of the population.
Definition 1 (Normal Distribution) The normal distribution has two parameters, usually denoted by \(\mu\) and \(\sigma^{2}\), which are its mean and variance. The pdf [probability density function] of the normal distribution with mean \(\mu\) and variance \(\sigma^{2}\) (usually denoted by \(\text{n}(\mu, \sigma^{2})\)) is given by: (Casella & Berger, 2002, p. 102)
\[ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty \tag{1}\]
normal_dist <- function(x, mean = 0, sd = 1) {
  checkmate::assert_numeric(x)
  checkmate::assert_number(mean)
  checkmate::assert_number(sd)
  # Equation 1: the normalizing constant is 1 / (sqrt(2 * pi) * sd),
  # with sd outside the square root.
  (1 / (sqrt(2 * pi) * sd)) * exp(-(x - mean)^2 / (2 * sd^2))
}
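As a quick sanity check on Equation 1, the sketch below evaluates the density by hand and compares it with R's built-in `stats::dnorm()`. The helper name `normal_pdf` and the parameter values (\(\mu = 1\), \(\sigma = 2\)) are illustrative assumptions, chosen so that the position of \(\sigma\) in the normalizing constant actually affects the result.

```r
# Hypothetical helper (not part of the report): Equation 1 written out directly.
normal_pdf <- function(x, mu = 0, sigma = 1) {
  stopifnot(is.numeric(x), sigma > 0)
  1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
}

grid <- seq(-5, 5, by = 0.5)

# Largest pointwise difference from the reference implementation.
max_abs_diff <- max(
  abs(normal_pdf(grid, mu = 1, sigma = 2) - stats::dnorm(grid, mean = 1, sd = 2))
)

# A density should integrate to (approximately) 1 over the real line.
area <- stats::integrate(normal_pdf, -Inf, Inf, mu = 1, sigma = 2)$value

max_abs_diff
area
```

With \(\sigma = 1\) a misplaced \(\sigma\) would go unnoticed, which is why the check uses a different value.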
dplyr::tibble(
x = seq(-5, 5, length.out = n),
y = normal_dist(x, mean, sd)
) |>
ggplot2::ggplot(ggplot2::aes(x, y)) +
ggpattern::geom_area_pattern(
pattern = "stripe",
pattern_color = "transparent",
pattern_fill = brandr::get_brand_color_tint(750, "black"),
pattern_spacing = 0.015,
color = brandr::get_brand_color("primary"),
fill = "transparent",
linewidth = 2
) +
ggplot2::labs(x = "Theoretical Normal", y = "Density")
Sample
A non-random sample is drawn from a normally distributed population, intentionally biased toward higher extreme values to illustrate the effects of sampling bias.
Definition 2 (Random Sample) The random variables \(X_{1}, \ldots, X_{n}\) are called a random sample of size \(n\) from the population \(f(x)\) if \(X_{1}, \ldots, X_{n}\) are mutually independent random variables and the marginal pdf [probability density function] or pmf of each \(X_{i}\) is the same function \(f(x)\). Alternatively, \(X_{1}, \ldots, X_{n}\) are called independent and identically distributed random variables with pdf or pmf \(f(x)\). This is commonly abbreviated to iid random variables. (Casella & Berger, 2002, p. 207)
Definition 3 (Random Sample) A random sample is a collection of random variables \(X_{1}, X_{2}, \ldots, X_{n}\), that have the same probability distribution and are mutually independent. (Dekking et al., 2005, p. 246)
Population Data
pop_data <- rnorm(n * 100, mean = mean, sd = sd)
pop_data |>
summarytools::descr() |>
as.data.frame() |>
tibble::rownames_to_column("name") |>
tibble::as_tibble() |>
dplyr::rename(value = pop_data)
dplyr::tibble(x = pop_data) |>
ggplot2::ggplot(ggplot2::aes(x, ggplot2::after_stat(density))) +
ggpattern::geom_histogram_pattern(
pattern = "stripe",
pattern_color = "transparent",
pattern_fill = brandr::get_brand_color_tint(750, "black"),
pattern_spacing = 0.015,
color = brandr::get_brand_color("gray"),
fill = "transparent",
linewidth = 0.5,
bins = 30
) +
ggplot2::geom_density(
color = brandr::get_brand_color("primary"),
linewidth = 2,
fill = NA
) +
ggplot2::xlim(-5, 5) +
ggplot2::labs(
x = "Population data",
y = "Density"
)
Bias Function
bias_fun <- function(x, shape_1 = 0.45, shape_2 = 0.5, max_rescale = 0.95) {
  checkmate::assert_numeric(x)
  checkmate::assert_number(shape_1)
  checkmate::assert_number(shape_2)
  checkmate::assert_number(max_rescale, lower = 0.01, upper = 1)

  # Map x onto [0, max_rescale]. With max_rescale < 1, the beta density
  # stays finite at the upper end.
  x <- scales::rescale(x, to = c(0, max_rescale))

  # Constant weight up to the midpoint, beta density above it.
  dplyr::if_else(
    x <= 0.5,
    dbeta(0.5, shape1 = shape_1, shape2 = shape_2),
    dbeta(x, shape1 = shape_1, shape2 = shape_2)
  )
}
dplyr::tibble(
x = seq(0, 1, length.out = n),
y = bias_fun(x)
) |>
ggplot2::ggplot(ggplot2::aes(x, y)) +
ggpattern::geom_area_pattern(
pattern = "stripe",
pattern_color = "transparent",
pattern_fill = brandr::get_brand_color_tint(750, "black"),
pattern_spacing = 0.015,
color = brandr::get_brand_color("primary"),
fill = "transparent",
linewidth = 2
) +
ggplot2::labs(x = "Quantiles", y = "Probability weight")
Sample Data
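The report does not show how the sample `data` was drawn. The sketch below is one possible construction, offered as an assumption rather than the author's method: it weights each population value by a beta-shaped function of its quantile rank and then samples without replacement using those weights. All names (`population`, `quantile_rank`, `weights`, `biased_sample`) are hypothetical, and the weighting is re-implemented in base R so the block runs on its own.

```r
set.seed(2025)

# Stand-in population (assumption: standard normal, as in the report).
population <- rnorm(100000)

# Quantile rank of each value, rescaled to [0, 0.95].
quantile_rank <- rank(population) / length(population)
rescaled <- (quantile_rank - min(quantile_rank)) /
  (max(quantile_rank) - min(quantile_rank)) * 0.95

# Constant weight for the lower ranks, increasing beta density for the upper ones.
weights <- ifelse(
  rescaled <= 0.5,
  dbeta(0.5, shape1 = 0.45, shape2 = 0.5),
  dbeta(rescaled, shape1 = 0.45, shape2 = 0.5)
)

# Weighted draw: high values are more likely to be selected.
biased_sample <- sample(population, size = 1000, replace = FALSE, prob = weights)

c(population = mean(population), biased_sample = mean(biased_sample))
```

Under this scheme the sample mean tends to sit above the population mean, which is the kind of sampling bias the section sets out to illustrate.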
data |>
summarytools::descr() |>
as.data.frame() |>
tibble::rownames_to_column("name") |>
tibble::as_tibble() |>
dplyr::rename(value = data)
dplyr::tibble(x = data) |>
ggplot2::ggplot(ggplot2::aes(x, ggplot2::after_stat(density))) +
ggpattern::geom_histogram_pattern(
pattern = "stripe",
pattern_color = "transparent",
pattern_fill = brandr::get_brand_color_tint(750, "black"),
pattern_spacing = 0.015,
color = brandr::get_brand_color("gray"),
fill = "transparent",
linewidth = 0.5,
bins = 30
) +
ggplot2::geom_density(
color = brandr::get_brand_color("primary"),
linewidth = 2,
fill = NA
) +
ggplot2::xlim(-5, 5) +
ggplot2::labs(
x = "Sample data",
y = "Density"
)
Bootstrap-Based t-Test
Finally, we apply the bootstrap method to estimate a confidence interval for the sample mean and conduct a t-test (Student, 1908), treating the sample mean as an estimate of the population mean.
We compare these bootstrap-based results to those from the traditional theory-based t-test, which relies on the assumption that the sample is drawn from a normally distributed population.
The bootstrap is based on a simple, yet powerful, idea (whose mathematics can get quite involved)¹. In statistics, we learn about the characteristics of the population by taking samples. As the sample represents the population, analogous characteristics of the sample should give us information about the population characteristics. The bootstrap helps us learn about the sample characteristics by taking resamples (that is, we retake samples from the original sample) and use this information to infer to the population. The bootstrap was developed by Efron in the late 1970s, with the original ideas appearing in Efron (1979a; 1979b) and the monograph by Efron (1982). See also Efron (1998) for more recent thoughts and developments. (Casella & Berger, 2002, p. 478)
In Example 1.2.20 we calculated all possible averages of four numbers selected from 2, 4, 9, 12, where we drew the numbers with replacement. This is the simplest form of the bootstrap, sometimes referred to as the nonparametric bootstrap. (Casella & Berger, 2002, p. 478)
This kind of sampling is called with replacement because the value chosen at any stage is “replaced” in the population and is available for choice again at the next stage. (Casella & Berger, 2002, p. 209)
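The quoted example is small enough to enumerate in full. The sketch below lists every ordered resample of size four drawn with replacement from 2, 4, 9, 12 and averages each one. Variable names are illustrative.

```r
values <- c(2, 4, 9, 12)

# Every ordered way to draw four numbers with replacement: 4^4 = 256 rows.
resamples <- expand.grid(rep(list(values), length(values)))

# One average per resample.
resample_means <- rowMeans(resamples)

nrow(resamples)
mean(resample_means)  # equals mean(values), up to floating-point error
range(resample_means)
```

With larger samples, full enumeration quickly becomes infeasible (n^n resamples), which is why the bootstrap is usually approximated by drawing a few thousand resamples at random.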
Theory-Based t-Test (Base R)
\[ \begin{cases} \text{H}_{0}: \mu = 0 \\ \text{H}_{a}: \mu \neq 0 \\ \end{cases} \]
data |>
stats::t.test(
alternative = "two.sided",
conf.level = 0.95,
mu = mean(pop_data)
)
#>
#> One Sample t-test
#>
#> data: data
#> t = 5.5329682, df = 999, p-value = 0.00000004021649
#> alternative hypothesis: true mean is not equal to 0.000324767383
#> 95 percent confidence interval:
#> 0.1242178757 0.2603959763
#> sample estimates:
#> mean of x
#> 0.192306926
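For reference, the statistic reported by a one-sample t-test is \(t = (\bar{x} - \mu_{0}) / (s / \sqrt{n})\), compared against a t distribution with \(n - 1\) degrees of freedom. The sketch below computes it by hand on simulated data (an assumption: these are not the report's sample values) and checks the result against `stats::t.test()`.

```r
set.seed(42)

# Simulated stand-in data, not the report's sample.
x <- rnorm(50, mean = 0.3, sd = 1)
mu_0 <- 0
n_obs <- length(x)

# t statistic and two-sided p-value computed from the formula.
t_stat <- (mean(x) - mu_0) / (sd(x) / sqrt(n_obs))
p_value <- 2 * stats::pt(abs(t_stat), df = n_obs - 1, lower.tail = FALSE)

# Reference result from base R.
reference <- stats::t.test(x, mu = mu_0, alternative = "two.sided")

all.equal(t_stat, unname(reference$statistic))
all.equal(p_value, reference$p.value)
```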
Theory-Based t-Test (infer)
dplyr::tibble(x = data) |>
infer::t_test(
response = x,
alternative = "two.sided",
mu = mean(pop_data),
conf.level = 0.95
) |>
dplyr::mutate(dplyr::across(dplyr::everything(), as.character)) |>
tidyr::pivot_longer(dplyr::everything())
Bootstrap Sample Mean CI (infer)
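The code in the following sections uses `null_dist` and `observed_statistic`, whose definitions are not shown in this report. The sketch below shows how objects of this kind are commonly built with the infer pipeline (`specify()`, `hypothesize()`, `generate()`, `calculate()`). The data, the number of replicates, and the `sketch_` names are assumptions made for illustration and may differ from what the report actually ran.

```r
set.seed(2025)

# Stand-in sample (assumption), not the report's `data`.
sketch_sample <- dplyr::tibble(x = rnorm(200, mean = 0.2, sd = 1))

# Observed statistic: the sample mean.
sketch_observed <- sketch_sample |>
  infer::specify(response = x) |>
  infer::calculate(stat = "mean")

# Bootstrap distribution of the mean under the point null mu = 0.
sketch_null_dist <- sketch_sample |>
  infer::specify(response = x) |>
  infer::hypothesize(null = "point", mu = 0) |>
  infer::generate(reps = 1000, type = "bootstrap") |>
  infer::calculate(stat = "mean")

sketch_observed
sketch_null_dist
```

Leaving out the `hypothesize()` step would give a bootstrap distribution centered on the sample mean instead of on the hypothesized value.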
null_dist |>
infer::get_confidence_interval(
level = 0.95,
point_estimate = observed_statistic
)
Bootstrap-Based t-Test (infer)
ci <- null_dist |>
infer::get_confidence_interval(
level = 0.95,
point_estimate = observed_statistic
)
ci
null_dist |>
infer::get_p_value(
obs_stat = observed_statistic,
direction = "two.sided"
)
#> Warning: Please be cautious in reporting a p-value of 0. This result is an
#> approximation based on the number of `reps` chosen in the `generate()` step.
#> ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
null_dist |>
infer::visualize(bins = 30) +
infer::shade_p_value(
obs_stat = observed_statistic,
direction = "two-sided",
color = brandr::get_brand_color("primary"),
fill = brandr::get_brand_color("light-orange")
) +
ggplot2::geom_vline(
xintercept = ci$lower_ci,
color = brandr::get_brand_color("gray"),
linewidth = 0.5,
linetype = "dashed"
) +
ggplot2::geom_vline(
xintercept = ci$upper_ci,
color = brandr::get_brand_color("gray"),
linewidth = 0.5,
linetype = "dashed"
) +
ggplot2::labs(
title = NULL,
x = "Null distribution of the hypothetical mean",
y = "Frequency"
)
Bootstrap Sample Mean CI (Independent)
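The vector `means` used below is not defined in this report. A minimal base R sketch of a percentile bootstrap for the mean, assuming a stand-in sample and hypothetical names (`sample_values`, `boot_means`), could look like this:

```r
set.seed(123)

# Stand-in sample (assumption), not the report's `data`.
sample_values <- rnorm(200, mean = 0.2, sd = 1)
n_reps <- 2000

# Resample with replacement and record the mean of each resample.
boot_means <- replicate(
  n_reps,
  mean(sample(sample_values, size = length(sample_values), replace = TRUE))
)

# Percentile interval: the 2.5% and 97.5% quantiles of the bootstrap means.
percentile_ci <- stats::quantile(boot_means, probs = c(0.025, 0.975))

mean(boot_means)
percentile_ci
```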
mean(means)
#> [1] 0.1919669725
quantile(means, 0.025)
#> 2.5%
#> 0.1267206149
quantile(means, 0.975)
#> 97.5%
#> 0.2608197494
Bootstrap-Based t-Test (Independent)
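The output below is centered near zero, which is what a bootstrap distribution built under the null hypothesis would look like. One common way to obtain such a distribution, sketched here as an assumption, is to shift the sample so its mean equals the hypothesized value before resampling. The data and names are hypothetical.

```r
set.seed(123)

# Stand-in sample (assumption), not the report's `data`.
sample_values <- rnorm(200, mean = 0.2, sd = 1)
mu_0 <- 0

# Shift the sample so that the null hypothesis holds exactly.
shifted <- sample_values - mean(sample_values) + mu_0

# Bootstrap the mean of the shifted sample.
null_means <- replicate(
  2000,
  mean(sample(shifted, size = length(shifted), replace = TRUE))
)

# Two-sided p-value: share of null means at least as far from mu_0
# as the observed mean.
observed_mean <- mean(sample_values)
p_value <- mean(abs(null_means - mu_0) >= abs(observed_mean - mu_0))

mean(null_means)
p_value
```

A p-value of exactly 0 from this procedure only means that no resample was as extreme as the observed mean, so it is better reported as smaller than 1 over the number of replicates.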
mean(means)
#> [1] -0.0007116986917
quantile(means, 0.025)
#> 2.5%
#> -0.06702770832
quantile(means, 0.975)
#> 97.5%
#> 0.06243307435
dplyr::tibble(x = means) |>
ggplot2::ggplot(ggplot2::aes(x)) +
ggpattern::geom_histogram_pattern(
pattern_color = "transparent",
pattern_fill = brandr::get_brand_color("white"),
color = brandr::get_brand_color("gray"),
fill = "transparent",
linewidth = 0.5,
bins = 30
) +
ggplot2::geom_vline(
xintercept = quantile(means, 0.975),
color = brandr::get_brand_color("gray"),
linewidth = 0.5,
linetype = "dashed"
) +
ggplot2::geom_vline(
xintercept = quantile(means, 0.025),
color = brandr::get_brand_color("gray"),
linewidth = 0.5,
linetype = "dashed"
) +
ggplot2::geom_vline(
xintercept = mean(data),
color = brandr::get_brand_color("primary"),
linewidth = 2,
linetype = "solid"
) +
ggplot2::labs(
x = "Null distribution of the hypothetical mean",
y = "Frequency"
)
License
The content is licensed under CC0 1.0 Universal, placing these materials in the public domain. You may freely copy, modify, distribute, and use this work, even for commercial purposes, without permission or attribution.
Other References
Books
Bootstrap for Dummies
References
Footnotes
1. See Lehmann (1999, Section 6.5) for a most readable introduction.