Appendix D — Sample Balance
D.1 Overview
This document outlines the procedure for balancing the sample used in the hypothesis test.
As with all analyses in this thesis, the process is fully reproducible and was conducted using the R programming language (R Core Team, n.d.) along with the Quarto publishing system (Allaire et al., n.d.).
D.2 Setting the Enviroment
D.3 Loading and Processing the Sample Data
Below, we load the data that was used for the hypothesis test before the balance procedure. You can visualize the data munging pipeline in the _targets.R
file.
Please note that these values refer to the group of states in the Brazilian time zone UTC-3, not the whole country.
Code
data <-
targets::tar_read("filtered_data", store = here::here("_targets")) |>
dplyr::filter(state %in% get_brazil_state_by_utc(-3, "state")) |>
dplyr::select(country, region, state, sex, age, msf_sc) |>
tidyr::drop_na(state, sex, age) |>
dplyr::mutate(
age_group = dplyr::case_when(
age >= 18 & age < 20 ~ "18-19",
age >= 20 & age < 25 ~ "20-24",
age >= 25 & age < 30 ~ "25-29",
age >= 30 & age < 40 ~ "30-39",
age >= 40 & age < 50 ~ "40-49",
age >= 50 & age < 60 ~ "50-59",
age >= 60 & age < 65 ~ "60-64",
age >= 65 ~ "65+"
),
age_group = factor(
age_group,
levels = c(
"18-19", "20-24", "25-29", "30-39", "40-49", "50-59", "60-64",
"65+"
),
ordered = TRUE
)
) |>
dplyr::relocate(age_group, .after = age)
data
D.4 Getting the Population Data
The sample data was obtained in 2017 from October 15th to 21st by a broadcast of the online questionnaire on a popular Brazil’s Sunday TV show with national reach (Rede Globo, 2017). Here we use Brazilian Institute of Geography and Statistics’s (IBGE) Continuous National Household Sample Survey (PNAD Contínua) for a estimate of the population distribution in this timeframe (Instituto Brasileiro de Geografia e Estatística, n.d.). The data can be accessed at the IBGE’s SIDRA platform (IBGE’s Table 6407).
The balance is made considering the distribution of the population related only to the variables state
, sex
and age
.
Please note that these values refer to the group of states in the Brazilian time zone UTC-3, not the whole country.
Code
prettycheck:::assert_internet()
ibge_6407_data <-
sidrar::get_sidra(
api = paste0(
"/t/6407/n1/all/n3/all/v/606/p/2017/c2/all/c58/1144,1145,",
"1152,2793,3299,3300,3301,3302,6798"
)
) |>
dplyr::as_tibble() |>
dplyr::select(
dplyr::all_of(
c(
"Valor", "Brasil e Unidade da Federação", "Ano", "Sexo",
"Grupo de idade"
)
)
) |>
dplyr::rename(
n = Valor,
state = `Brasil e Unidade da Federação`,
year = Ano,
sex = Sexo,
age_group = `Grupo de idade`
) |>
dplyr::filter(
state != "Brasil",
sex != "Total",
age_group != "60 anos ou mais"
) |>
dplyr::arrange(state, sex, age_group) |>
dplyr::mutate(
year = as.integer(year),
country = "Brazil",
region = get_brazil_region(state, "state"),
sex = dplyr::case_when(
sex == "Homens" ~ "Male",
sex == "Mulheres" ~ "Female"
),
sex = factor(sex, ordered = FALSE),
age_group = dplyr::case_when(
age_group == "18 a 19 anos" ~ "18-19",
age_group == "20 a 24 anos" ~ "20-24",
age_group == "25 a 29 anos" ~ "25-29",
age_group == "30 a 39 anos" ~ "30-39",
age_group == "40 a 49 anos" ~ "40-49",
age_group == "50 a 59 anos" ~ "50-59",
age_group == "60 a 64 anos" ~ "60-64",
age_group == "65 anos ou mais" ~ "65+"
),
age_group = factor(age_group, ordered = TRUE),
n = as.integer(n * 1000)
) |>
dplyr::relocate(year, country, region, state, sex, age_group, n) |>
dplyr::filter(state %in% get_brazil_state_by_utc(-3, "state")) |>
dplyr::mutate(
n_rel = n / sum(n),
n_per = (n / sum(n)) * 100
)
ibge_6407_data
Code
ibge_6407_data |> dplyr::glimpse()
#> Rows: 336
#> Columns: 9
#> $ year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201…
#> $ country <chr> "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil…
#> $ region <chr> "Northeast", "Northeast", "Northeast", "Northeast", "Nort…
#> $ state <chr> "Alagoas", "Alagoas", "Alagoas", "Alagoas", "Alagoas", "A…
#> $ sex <fct> Male, Male, Male, Male, Male, Male, Male, Male, Female, F…
#> $ age_group <ord> 18-19, 20-24, 25-29, 30-39, 40-49, 50-59, 60-64, 65+, 18-…
#> $ n <int> 54000, 140000, 131000, 236000, 201000, 140000, 57000, 127…
#> $ n_rel <dbl> 0.000378167, 0.000980433, 0.000917405, 0.001652731, 0.001…
#> $ n_per <dbl> 0.0378167, 0.0980433, 0.0917405, 0.1652731, 0.1407622, 0.…
Code
ibge_6407_data |> summarise_data("age_group")
Code
summarise_data(ibge_6407_data, "sex")
Code
ibge_6407_data |> summarise_data("state")
Code
summarise_data(ibge_6407_data, "region")
Code
ibge_6407_data |> summarise_data("country")
D.5 Comparing Sample Data With the Population Data
Please note that these values refer to the group of states in the Brazilian time zone UTC-3, not the whole country.
Code
compare_sample(data, ibge_6407_data, "age_group")
Code
compare_sample(data, ibge_6407_data, "sex")
Code
compare_sample(data, ibge_6407_data, "state")
Code
compare_sample(data, ibge_6407_data, "region")
Code
compare_sample(data, ibge_6407_data, "country")
D.6 Adjusting Sample Data with the Population Data
The tables above indicate an overrepresentation of states like São Paulo and Rio de Janeiro, while states such as Amapá and Tocantins are underrepresented. To achieve a balanced sample, we applied a cell weighting procedure (see Kalton & Flores-Cervantes, 2003) based on the characteristics of state
, sex
and age group
, using data from the 2017 PNAD as a reference.
\[ \text{Cell weighting}: \cfrac{\% \ \text{2017 PNAD}}{\% \ \text{Survey sample}} \]
Code
ibge_6407_data |>
dplyr::select(state, sex, age_group, n) |>
tidyr::pivot_wider(names_from = state, values_from = n) |>
janitor::adorn_totals(c("row", "col")) |>
janitor::adorn_percentages("col") |>
janitor::adorn_pct_formatting(digits = 3) |>
janitor::adorn_ns() |>
dplyr::as_tibble()
See https://www.tidyverse.org/blog/2022/05/case-weights/ to learn more.
Code
weights_data <-
ibge_6407_data |>
dplyr::left_join(
data |>
dplyr::summarise(
n = dplyr::n(),
.by = c("country", "region", "state", "sex", "age_group")
) |>
dplyr::mutate(
n_rel = n / sum(n),
n_per = (n / sum(n)) * 100
),
by = c("country", "region", "state", "sex", "age_group"),
suffix = c("_pnad", "_sample")
) |>
dplyr::relocate(year, .before = country) |>
dplyr::mutate(
cell_weight = parsnip::importance_weights(n_per_pnad / n_per_sample)
# inv_prob_weight = parsnip::importance_weights(1 / (n / sum(n)))
) |>
dplyr::select(
country, region, state, sex, age_group, cell_weight
) |>
dplyr::arrange(state, sex, age_group)
weights_data
Code
test_data <-
data |>
dplyr::mutate(
msf_sc =
msf_sc |>
lubritime:::link_to_timeline(
threshold = hms::parse_hms("12:00:00")
) |>
as.numeric()
) |>
dplyr::left_join(
weights_data,
by = c("country", "region", "state", "sex", "age_group")
)
D.7 Fiting a Model with Weights
model <-
parsnip::linear_reg() |>
parsnip::set_engine("lm") |>
parsnip::set_mode("regression")
workflow <-
workflows::workflow() |>
workflows::add_case_weights(cell_weight) |>
workflows::add_formula(msf_sc ~ sex + age) |>
workflows::add_model(model)
workflow
#> ══ Workflow ═════════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#>
#> ── Preprocessor ─────────────────────────────────────────────────────────────
#> msf_sc ~ sex + age
#>
#> ── Case Weights ─────────────────────────────────────────────────────────────
#> cell_weight
#>
#> ── Model ────────────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
fit <- workflow |> parsnip::fit(test_data)
Code
fit |>
broom::tidy() |>
janitor::adorn_rounding(5)
Code
fit |>
broom::glance() |>
tidyr::pivot_longer(cols = dplyr::everything()) |>
janitor::adorn_rounding(10)
Code
fit_engine <- fit |> parsnip::extract_fit_engine()
fit_engine |> summary()
#>
#> Call:
#> stats::lm(formula = ..y ~ ., data = data, weights = weights)
#>
#> Weighted Residuals:
#> Min 1Q Median 3Q Max
#> -48700 -2575 -182 2571 48927
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 106228.53 69.78 1522.43 <2e-16 ***
#> sexMale 367.28 39.34 9.34 <2e-16 ***
#> age -123.26 1.74 -70.99 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4570 on 65821 degrees of freedom
#> Multiple R-squared: 0.0726, Adjusted R-squared: 0.0726
#> F-statistic: 2.58e+03 on 2 and 65821 DF, p-value: <2e-16
fit_engine_2 <-
lm(
formula = msf_sc ~ sex + age,
weights = cell_weight,
data = test_data
)
fit_engine_2 |> summary()
#>
#> Call:
#> lm(formula = msf_sc ~ sex + age, data = test_data, weights = cell_weight)
#>
#> Weighted Residuals:
#> Min 1Q Median 3Q Max
#> -48700 -2575 -182 2571 48927
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 106228.53 69.78 1522.43 <2e-16 ***
#> sexMale 367.28 39.34 9.34 <2e-16 ***
#> age -123.26 1.74 -70.99 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4570 on 65821 degrees of freedom
#> Multiple R-squared: 0.0726, Adjusted R-squared: 0.0726
#> F-statistic: 2.58e+03 on 2 and 65821 DF, p-value: <2e-16
Code
report::report(fit_engine_2)
#> We fitted a linear model (estimated using OLS) to predict msf_sc with sex
#> and age (formula: msf_sc ~ sex + age). The model explains a statistically
#> significant and weak proportion of variance (R2 = 0.07, F(2, 65821) =
#> 2576.88, p < .001, adj. R2 = 0.07). The model's intercept, corresponding to
#> sex = Female and age = 0, is at 1.06e+05 (95% CI [1.06e+05, 1.06e+05],
#> t(65821) = 1522.43, p < .001). Within this model:
#>
#> - The effect of sex [Male] is statistically significant and positive (beta =
#> 367.28, 95% CI [290.18, 444.38], t(65821) = 9.34, p < .001; Std. beta =
#> 0.07, 95% CI [0.06, 0.08])
#> - The effect of age is statistically significant and negative (beta =
#> -123.26, 95% CI [-126.67, -119.86], t(65821) = -70.99, p < .001; Std. beta =
#> -0.27, 95% CI [-0.27, -0.26])
#>
#> Standardized parameters were obtained by fitting the model on a standardized
#> version of the dataset. 95% Confidence Intervals (CIs) and p-values were
#> computed using a Wald t-distribution approximation.
lm(formula = msf_sc ~ sex + age, data = test_data) |> summary()
#>
#> Call:
#> lm(formula = msf_sc ~ sex + age, data = test_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15994 -3559 -366 3310 17396
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 106428.29 71.78 1482.6 <2e-16 ***
#> sexMale 512.59 41.57 12.3 <2e-16 ***
#> age -127.05 2.12 -59.9 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5030 on 65821 degrees of freedom
#> Multiple R-squared: 0.0532, Adjusted R-squared: 0.0532
#> F-statistic: 1.85e+03 on 2 and 65821 DF, p-value: <2e-16