Appendix D — Sample Balance
D.1 Overview
This document outlines the procedure for balancing the sample used in the hypothesis test.
As with all analyses in this thesis, the process is fully reproducible and was conducted using the R programming language (R Core Team, n.d.) along with the Quarto publishing system (Allaire et al., n.d.).
D.2 Setting the Enviroment
D.3 Loading and Processing the Sample Data
You can visualize the data wrangling pipeline at the code repository in the _targets.R
file.
Please note that these values refer to the group of states in the Brazilian time zone UTC-3, not the whole country.
Code
data <-
targets::tar_read("filtered_data", store = here::here("_targets")) |>
dplyr::filter(state %in% get_brazil_state_by_utc(-3, "state")) |>
dplyr::select(country, region, state, sex, age, msf_sc) |>
tidyr::drop_na(state, sex, age) |>
dplyr::mutate(
age_group = dplyr::case_when(
age >= 18 & age < 20 ~ "18-19",
age >= 20 & age < 25 ~ "20-24",
age >= 25 & age < 30 ~ "25-29",
age >= 30 & age < 40 ~ "30-39",
age >= 40 & age < 50 ~ "40-49",
age >= 50 & age < 60 ~ "50-59",
age >= 60 & age < 65 ~ "60-64",
age >= 65 ~ "65+"
),
age_group = factor(
age_group,
levels = c(
"18-19", "20-24", "25-29", "30-39", "40-49", "50-59", "60-64",
"65+"
),
ordered = TRUE
)
) |>
dplyr::relocate(age_group, .after = age)
data
D.4 Getting the Population Data
The sample data was obtained in 2017 from October 15th to 21st by a broadcast of the online questionnaire on a popular Brazil’s Sunday TV show with national reach (Rede Globo, 2017). Here the Brazilian Institute of Geography and Statistics’s (IBGE) Continuous National Household Sample Survey (PNAD Contínua) its used for a estimate of the population distribution in this timeframe (Instituto Brasileiro de Geografia e Estatística, n.d.). The data can be accessed at the IBGE’s SIDRA platform (IBGE’s Table 6407).
The balance is made considering the distribution of the population related only to the variables state
, sex
and age
.
Please note that these values refer to the group of states in the Brazilian time zone UTC-3, not the whole country.
Code
prettycheck:::assert_internet()
ibge_6407_data <-
sidrar::get_sidra(
api = paste0(
"/t/6407/n1/all/n3/all/v/606/p/2017/c2/all/c58/1144,1145,",
"1152,2793,3299,3300,3301,3302,6798"
)
) |>
dplyr::as_tibble() |>
dplyr::select(
dplyr::all_of(
c(
"Valor", "Brasil e Unidade da Federação", "Ano", "Sexo",
"Grupo de idade"
)
)
) |>
dplyr::rename(
n = Valor,
state = `Brasil e Unidade da Federação`,
year = Ano,
sex = Sexo,
age_group = `Grupo de idade`
) |>
dplyr::filter(
state != "Brasil",
sex != "Total",
age_group != "60 anos ou mais"
) |>
dplyr::arrange(state, sex, age_group) |>
dplyr::mutate(
year = as.integer(year),
country = "Brazil",
region = get_brazil_region(state, "state"),
sex = dplyr::case_when(
sex == "Homens" ~ "Male",
sex == "Mulheres" ~ "Female"
),
sex = factor(sex, ordered = FALSE),
age_group = dplyr::case_when(
age_group == "18 a 19 anos" ~ "18-19",
age_group == "20 a 24 anos" ~ "20-24",
age_group == "25 a 29 anos" ~ "25-29",
age_group == "30 a 39 anos" ~ "30-39",
age_group == "40 a 49 anos" ~ "40-49",
age_group == "50 a 59 anos" ~ "50-59",
age_group == "60 a 64 anos" ~ "60-64",
age_group == "65 anos ou mais" ~ "65+"
),
age_group = factor(age_group, ordered = TRUE),
n = as.integer(n * 1000)
) |>
dplyr::relocate(year, country, region, state, sex, age_group, n) |>
dplyr::filter(state %in% get_brazil_state_by_utc(-3, "state")) |>
dplyr::mutate(
n_rel = n / sum(n),
n_per = (n / sum(n)) * 100
)
ibge_6407_data
Code
ibge_6407_data |> dplyr::glimpse()
#> Rows: 336
#> Columns: 9
#> $ year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201…
#> $ country <chr> "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil…
#> $ region <chr> "Northeast", "Northeast", "Northeast", "Northeast", "Nort…
#> $ state <chr> "Alagoas", "Alagoas", "Alagoas", "Alagoas", "Alagoas", "A…
#> $ sex <fct> Male, Male, Male, Male, Male, Male, Male, Male, Female, F…
#> $ age_group <ord> 18-19, 20-24, 25-29, 30-39, 40-49, 50-59, 60-64, 65+, 18-…
#> $ n <int> 54000, 140000, 131000, 236000, 201000, 140000, 57000, 127…
#> $ n_rel <dbl> 0.00037816714988, 0.00098043335154, 0.00091740549323, 0.0…
#> $ n_per <dbl> 0.037816714988, 0.098043335154, 0.091740549323, 0.1652730…
Code
ibge_6407_data |> summarise_data("age_group")
Code
summarise_data(ibge_6407_data, "sex")
Code
ibge_6407_data |> summarise_data("state")
Code
summarise_data(ibge_6407_data, "region")
Code
ibge_6407_data |> summarise_data("country")
D.5 Comparing Sample Data With the Population Data
Please note that these values refer to the group of states in the Brazilian time zone UTC-3, not the whole country.
Code
compare_sample(data, ibge_6407_data, "age_group")
Code
compare_sample(data, ibge_6407_data, "sex")
Code
compare_sample(data, ibge_6407_data, "state")
Code
compare_sample(data, ibge_6407_data, "region")
Code
compare_sample(data, ibge_6407_data, "country")
D.6 Adjusting Sample Data with the Population Data
The tables above indicate an overrepresentation of states like São Paulo and Rio de Janeiro, while states such as Amapá and Tocantins are underrepresented. To achieve a balanced sample, a cell weighting procedure was use (see Kalton & Flores-Cervantes, 2003) based on the characteristics of state
, sex
and age group
, using data from the 2017 PNAD as a reference.
\[ \text{Cell weighting}: \cfrac{\% \ \text{2017 PNAD}}{\% \ \text{Survey sample}} \]
Code
ibge_6407_data |>
dplyr::select(state, sex, age_group, n) |>
tidyr::pivot_wider(names_from = state, values_from = n) |>
janitor::adorn_totals(c("row", "col")) |>
janitor::adorn_percentages("col") |>
janitor::adorn_pct_formatting(digits = 3) |>
janitor::adorn_ns() |>
dplyr::as_tibble()
See https://www.tidyverse.org/blog/2022/05/case-weights/ to learn more.
Code
weights_data <-
ibge_6407_data |>
dplyr::left_join(
data |>
dplyr::summarise(
n = dplyr::n(),
.by = c("country", "region", "state", "sex", "age_group")
) |>
dplyr::mutate(
n_rel = n / sum(n),
n_per = (n / sum(n)) * 100
),
by = c("country", "region", "state", "sex", "age_group"),
suffix = c("_pnad", "_sample")
) |>
dplyr::relocate(year, .before = country) |>
dplyr::mutate(
cell_weight = parsnip::importance_weights(n_per_pnad / n_per_sample)
# inv_prob_weight = parsnip::importance_weights(1 / (n / sum(n)))
) |>
dplyr::select(
country, region, state, sex, age_group, cell_weight
) |>
dplyr::arrange(state, sex, age_group)
weights_data
Code
test_data <-
data |>
dplyr::mutate(
msf_sc =
msf_sc |>
lubritime:::link_to_timeline(
threshold = hms::parse_hms("12:00:00")
) |>
as.numeric()
) |>
dplyr::left_join(
weights_data,
by = c("country", "region", "state", "sex", "age_group")
)
D.7 Testing Fiting a Model with the Weights
model <-
parsnip::linear_reg() |>
parsnip::set_engine("lm") |>
parsnip::set_mode("regression")
workflow <-
workflows::workflow() |>
workflows::add_case_weights(cell_weight) |>
workflows::add_formula(msf_sc ~ sex + age) |>
workflows::add_model(model)
workflow
#> ══ Workflow ═════════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#>
#> ── Preprocessor ─────────────────────────────────────────────────────────────
#> msf_sc ~ sex + age
#>
#> ── Case Weights ─────────────────────────────────────────────────────────────
#> cell_weight
#>
#> ── Model ────────────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
fit <- workflow |> parsnip::fit(test_data)
Code
fit |>
broom::tidy() |>
janitor::adorn_rounding(5)
Code
fit |>
broom::glance() |>
tidyr::pivot_longer(cols = dplyr::everything()) |>
janitor::adorn_rounding(10)
Code
fit_engine <- fit |> parsnip::extract_fit_engine()
fit_engine |> summary()
#>
#> Call:
#> stats::lm(formula = ..y ~ ., data = data, weights = weights)
#>
#> Weighted Residuals:
#> Min 1Q Median 3Q Max
#> -48700.087 -2574.721 -181.577 2571.318 48926.513
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 19828.531597 69.775678 284.17541 < 2.22e-16 ***
#> sexMale 367.277867 39.336490 9.33682 < 2.22e-16 ***
#> age -123.263571 1.736254 -70.99398 < 2.22e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4568.365 on 65821 degrees of freedom
#> Multiple R-squared: 0.07261391, Adjusted R-squared: 0.07258573
#> F-statistic: 2576.877 on 2 and 65821 DF, p-value: < 2.2204e-16
fit_engine_2 <-
lm(
formula = msf_sc ~ sex + age,
weights = cell_weight,
data = test_data
)
fit_engine_2 |> summary()
#>
#> Call:
#> lm(formula = msf_sc ~ sex + age, data = test_data, weights = cell_weight)
#>
#> Weighted Residuals:
#> Min 1Q Median 3Q Max
#> -48700.087 -2574.721 -181.577 2571.318 48926.513
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 19828.531597 69.775678 284.17541 < 2.22e-16 ***
#> sexMale 367.277867 39.336490 9.33682 < 2.22e-16 ***
#> age -123.263571 1.736254 -70.99398 < 2.22e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4568.365 on 65821 degrees of freedom
#> Multiple R-squared: 0.07261391, Adjusted R-squared: 0.07258573
#> F-statistic: 2576.877 on 2 and 65821 DF, p-value: < 2.2204e-16
Code
report::report(fit_engine_2)
#> We fitted a linear model (estimated using OLS) to predict msf_sc with sex
#> and age (formula: msf_sc ~ sex + age). The model explains a statistically
#> significant and weak proportion of variance (R2 = 0.07, F(2, 65821) =
#> 2576.88, p < .001, adj. R2 = 0.07). The model's intercept, corresponding to
#> sex = Female and age = 0, is at 19828.53 (95% CI [19691.77, 19965.29],
#> t(65821) = 284.18, p < .001). Within this model:
#>
#> - The effect of sex [Male] is statistically significant and positive (beta =
#> 367.28, 95% CI [290.18, 444.38], t(65821) = 9.34, p < .001; Std. beta =
#> 0.07, 95% CI [0.06, 0.08])
#> - The effect of age is statistically significant and negative (beta =
#> -123.26, 95% CI [-126.67, -119.86], t(65821) = -70.99, p < .001; Std. beta =
#> -0.27, 95% CI [-0.27, -0.26])
#>
#> Standardized parameters were obtained by fitting the model on a standardized
#> version of the dataset. 95% Confidence Intervals (CIs) and p-values were
#> computed using a Wald t-distribution approximation.
lm(formula = msf_sc ~ sex + age, data = test_data) |> summary()
#>
#> Call:
#> lm(formula = msf_sc ~ sex + age, data = test_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15994.4448 -3559.2346 -366.0323 3310.4309 17395.6007
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 20028.288311 71.784441 279.00598 < 2.22e-16 ***
#> sexMale 512.589746 41.567515 12.33150 < 2.22e-16 ***
#> age -127.047687 2.120366 -59.91782 < 2.22e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5033.417 on 65821 degrees of freedom
#> Multiple R-squared: 0.05319379, Adjusted R-squared: 0.05316502
#> F-statistic: 1848.989 on 2 and 65821 DF, p-value: < 2.2204e-16