Appendix D — Sample Balance

D.1 Overview

This document outlines the procedure for balancing the sample used in the hypothesis test.

As with all analyses in this thesis, the process is fully reproducible and was conducted using the R programming language (R Core Team, n.d.) along with the Quarto publishing system (Allaire et al., n.d.).

D.2 Setting the Enviroment

Code
source(here::here("R", "get_brazil_region.R"))
source(here::here("R", "get_brazil_state.R"))
source(here::here("R", "utils-stats.R"))

D.3 Loading and Processing the Sample Data

You can visualize the data wrangling pipeline at the code repository in the _targets.R file.

Please note that these values refer to the group of states in the Brazilian time zone UTC-3, not the whole country.

Code
targets::tar_make(script = here::here("_targets.R"))
Code
data <- 
  targets::tar_read("filtered_data", store = here::here("_targets")) |>
  dplyr::filter(state %in% get_brazil_state_by_utc(-3, "state")) |>
  dplyr::select(country, region, state, sex, age, msf_sc) |>
  tidyr::drop_na(state, sex, age) |>
  dplyr::mutate(
    age_group = dplyr::case_when(
      age >= 18 & age < 20 ~ "18-19",
      age >= 20 & age < 25 ~ "20-24",
      age >= 25 & age < 30 ~ "25-29",
      age >= 30 & age < 40 ~ "30-39",
      age >= 40 & age < 50 ~ "40-49",
      age >= 50 & age < 60 ~ "50-59",
      age >= 60 & age < 65 ~ "60-64",
      age >= 65 ~ "65+"
    ),
    age_group = factor(
      age_group,
      levels = c(
        "18-19", "20-24", "25-29", "30-39", "40-49", "50-59", "60-64", 
        "65+"
      ),
      ordered = TRUE
    )
  ) |>
  dplyr::relocate(age_group, .after = age)

data

D.4 Getting the Population Data

The sample data was obtained in 2017 from October 15th to 21st by a broadcast of the online questionnaire on a popular Brazil’s Sunday TV show with national reach (Rede Globo, 2017). Here the Brazilian Institute of Geography and Statistics’s (IBGE) Continuous National Household Sample Survey (PNAD Contínua) its used for a estimate of the population distribution in this timeframe (Instituto Brasileiro de Geografia e Estatística, n.d.). The data can be accessed at the IBGE’s SIDRA platform (IBGE’s Table 6407).

The balance is made considering the distribution of the population related only to the variables state, sex and age.

Please note that these values refer to the group of states in the Brazilian time zone UTC-3, not the whole country.

Code
prettycheck:::assert_internet()

ibge_6407_data <- 
  sidrar::get_sidra(
    api = paste0(
      "/t/6407/n1/all/n3/all/v/606/p/2017/c2/all/c58/1144,1145,",
      "1152,2793,3299,3300,3301,3302,6798"
    )
  ) |>
  dplyr::as_tibble() |>
  dplyr::select(
    dplyr::all_of(
      c(
        "Valor", "Brasil e Unidade da Federação", "Ano", "Sexo", 
        "Grupo de idade"
      )
    )
  ) |>
  dplyr::rename(
    n = Valor,
    state = `Brasil e Unidade da Federação`,
    year = Ano,
    sex = Sexo,
    age_group = `Grupo de idade`
  ) |>
  dplyr::filter(
    state != "Brasil", 
    sex != "Total",
    age_group != "60 anos ou mais"
  ) |>
  dplyr::arrange(state, sex, age_group) |>
  dplyr::mutate(
    year = as.integer(year),
    country = "Brazil",
    region = get_brazil_region(state, "state"),
    sex = dplyr::case_when(
      sex == "Homens" ~ "Male",
      sex == "Mulheres" ~ "Female"
    ),
    sex = factor(sex, ordered = FALSE),
    age_group = dplyr::case_when(
      age_group == "18 a 19 anos" ~ "18-19",
      age_group == "20 a 24 anos" ~ "20-24",
      age_group == "25 a 29 anos" ~ "25-29",
      age_group == "30 a 39 anos" ~ "30-39",
      age_group == "40 a 49 anos" ~ "40-49",
      age_group == "50 a 59 anos" ~ "50-59",
      age_group == "60 a 64 anos" ~ "60-64",
      age_group == "65 anos ou mais" ~ "65+"
    ),
    age_group = factor(age_group, ordered = TRUE),
    n = as.integer(n * 1000)
  ) |>
  dplyr::relocate(year, country, region, state, sex, age_group, n) |>
  dplyr::filter(state %in% get_brazil_state_by_utc(-3, "state")) |>
  dplyr::mutate(
    n_rel = n / sum(n),
    n_per = (n / sum(n)) * 100
  )

ibge_6407_data
Code
ibge_6407_data |> dplyr::glimpse()
#> Rows: 336
#> Columns: 9
#> $ year      <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201…
#> $ country   <chr> "Brazil", "Brazil", "Brazil", "Brazil", "Brazil", "Brazil…
#> $ region    <chr> "Northeast", "Northeast", "Northeast", "Northeast", "Nort…
#> $ state     <chr> "Alagoas", "Alagoas", "Alagoas", "Alagoas", "Alagoas", "A…
#> $ sex       <fct> Male, Male, Male, Male, Male, Male, Male, Male, Female, F…
#> $ age_group <ord> 18-19, 20-24, 25-29, 30-39, 40-49, 50-59, 60-64, 65+, 18-…
#> $ n         <int> 54000, 140000, 131000, 236000, 201000, 140000, 57000, 127…
#> $ n_rel     <dbl> 0.00037816714988, 0.00098043335154, 0.00091740549323, 0.0…
#> $ n_per     <dbl> 0.037816714988, 0.098043335154, 0.091740549323, 0.1652730…
Code
ibge_6407_data |> summarise_data("age_group")
Code
summarise_data(ibge_6407_data, "sex")
Code
ibge_6407_data |> summarise_data("state")
Code
summarise_data(ibge_6407_data, "region")
Code
ibge_6407_data |> summarise_data("country")

D.5 Comparing Sample Data With the Population Data

Please note that these values refer to the group of states in the Brazilian time zone UTC-3, not the whole country.

Code
compare_sample(data, ibge_6407_data, "age_group")
Code
compare_sample(data, ibge_6407_data, "sex")
Code
compare_sample(data, ibge_6407_data, "state")
Code
compare_sample(data, ibge_6407_data, "region")
Code
compare_sample(data, ibge_6407_data, "country")

D.6 Adjusting Sample Data with the Population Data

The tables above indicate an overrepresentation of states like São Paulo and Rio de Janeiro, while states such as Amapá and Tocantins are underrepresented. To achieve a balanced sample, a cell weighting procedure was use (see Kalton & Flores-Cervantes, 2003) based on the characteristics of state, sex and age group, using data from the 2017 PNAD as a reference.

\[ \text{Cell weighting}: \cfrac{\% \ \text{2017 PNAD}}{\% \ \text{Survey sample}} \]

Code
ibge_6407_data |> 
  dplyr::select(state, sex, age_group, n) |>
  tidyr::pivot_wider(names_from = state, values_from = n) |>
  janitor::adorn_totals(c("row", "col")) |>
  janitor::adorn_percentages("col") |>
  janitor::adorn_pct_formatting(digits = 3) |>
  janitor::adorn_ns() |>
  dplyr::as_tibble()

See https://www.tidyverse.org/blog/2022/05/case-weights/ to learn more.

Code
weights_data <- 
  ibge_6407_data |>
  dplyr::left_join(
    data |> 
      dplyr::summarise(
        n = dplyr::n(), 
        .by = c("country", "region", "state", "sex", "age_group")
      ) |>
      dplyr::mutate(
        n_rel = n / sum(n),
        n_per = (n / sum(n)) * 100
      ),
    by = c("country", "region", "state", "sex", "age_group"),
    suffix = c("_pnad", "_sample")
  ) |>
  dplyr::relocate(year, .before = country) |>
  dplyr::mutate(
    cell_weight = parsnip::importance_weights(n_per_pnad / n_per_sample)
    # inv_prob_weight = parsnip::importance_weights(1 / (n / sum(n)))
  ) |>
  dplyr::select(
    country, region, state, sex, age_group, cell_weight
  ) |>
  dplyr::arrange(state, sex, age_group)

weights_data
Code
test_data <-
  data |>
  dplyr::mutate(
    msf_sc = 
      msf_sc |>
      lubritime:::link_to_timeline(
        threshold = hms::parse_hms("12:00:00")
      ) |>
      as.numeric()
  ) |>
  dplyr::left_join(
    weights_data,
    by = c("country", "region", "state", "sex", "age_group")
  )

D.7 Testing Fiting a Model with the Weights

model <- 
  parsnip::linear_reg() |> 
  parsnip::set_engine("lm") |>
  parsnip::set_mode("regression")
workflow <- 
  workflows::workflow() |>
  workflows::add_case_weights(cell_weight) |>
  workflows::add_formula(msf_sc ~ sex + age) |>
  workflows::add_model(model)

workflow
#> ══ Workflow ═════════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: linear_reg()
#> 
#> ── Preprocessor ─────────────────────────────────────────────────────────────
#> msf_sc ~ sex + age
#> 
#> ── Case Weights ─────────────────────────────────────────────────────────────
#> cell_weight
#> 
#> ── Model ────────────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: lm
fit <- workflow |> parsnip::fit(test_data)
Code
fit |>
  broom::tidy() |> 
  janitor::adorn_rounding(5)
Code
fit |> 
  broom::glance() |> 
  tidyr::pivot_longer(cols = dplyr::everything()) |>
  janitor::adorn_rounding(10)
Code
fit_engine <- fit |> parsnip::extract_fit_engine()

fit_engine |> summary()
#> 
#> Call:
#> stats::lm(formula = ..y ~ ., data = data, weights = weights)
#> 
#> Weighted Residuals:
#>        Min         1Q     Median         3Q        Max 
#> -48700.087  -2574.721   -181.577   2571.318  48926.513 
#> 
#> Coefficients:
#>                 Estimate   Std. Error   t value   Pr(>|t|)    
#> (Intercept) 19828.531597    69.775678 284.17541 < 2.22e-16 ***
#> sexMale       367.277867    39.336490   9.33682 < 2.22e-16 ***
#> age          -123.263571     1.736254 -70.99398 < 2.22e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4568.365 on 65821 degrees of freedom
#> Multiple R-squared:  0.07261391, Adjusted R-squared:  0.07258573 
#> F-statistic: 2576.877 on 2 and 65821 DF,  p-value: < 2.2204e-16
fit_engine_2 <- 
  lm(
    formula = msf_sc ~ sex + age, 
    weights = cell_weight,
    data = test_data
  )

fit_engine_2 |> summary()
#> 
#> Call:
#> lm(formula = msf_sc ~ sex + age, data = test_data, weights = cell_weight)
#> 
#> Weighted Residuals:
#>        Min         1Q     Median         3Q        Max 
#> -48700.087  -2574.721   -181.577   2571.318  48926.513 
#> 
#> Coefficients:
#>                 Estimate   Std. Error   t value   Pr(>|t|)    
#> (Intercept) 19828.531597    69.775678 284.17541 < 2.22e-16 ***
#> sexMale       367.277867    39.336490   9.33682 < 2.22e-16 ***
#> age          -123.263571     1.736254 -70.99398 < 2.22e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 4568.365 on 65821 degrees of freedom
#> Multiple R-squared:  0.07261391, Adjusted R-squared:  0.07258573 
#> F-statistic: 2576.877 on 2 and 65821 DF,  p-value: < 2.2204e-16
Code

report::report(fit_engine_2)
#> We fitted a linear model (estimated using OLS) to predict msf_sc with sex
#> and age (formula: msf_sc ~ sex + age). The model explains a statistically
#> significant and weak proportion of variance (R2 = 0.07, F(2, 65821) =
#> 2576.88, p < .001, adj. R2 = 0.07). The model's intercept, corresponding to
#> sex = Female and age = 0, is at 19828.53 (95% CI [19691.77, 19965.29],
#> t(65821) = 284.18, p < .001). Within this model:
#> 
#>   - The effect of sex [Male] is statistically significant and positive (beta =
#> 367.28, 95% CI [290.18, 444.38], t(65821) = 9.34, p < .001; Std. beta =
#> 0.07, 95% CI [0.06, 0.08])
#>   - The effect of age is statistically significant and negative (beta =
#> -123.26, 95% CI [-126.67, -119.86], t(65821) = -70.99, p < .001; Std. beta =
#> -0.27, 95% CI [-0.27, -0.26])
#> 
#> Standardized parameters were obtained by fitting the model on a standardized
#> version of the dataset. 95% Confidence Intervals (CIs) and p-values were
#> computed using a Wald t-distribution approximation.
lm(formula = msf_sc ~ sex + age, data = test_data) |> summary()
#> 
#> Call:
#> lm(formula = msf_sc ~ sex + age, data = test_data)
#> 
#> Residuals:
#>         Min          1Q      Median          3Q         Max 
#> -15994.4448  -3559.2346   -366.0323   3310.4309  17395.6007 
#> 
#> Coefficients:
#>                 Estimate   Std. Error   t value   Pr(>|t|)    
#> (Intercept) 20028.288311    71.784441 279.00598 < 2.22e-16 ***
#> sexMale       512.589746    41.567515  12.33150 < 2.22e-16 ***
#> age          -127.047687     2.120366 -59.91782 < 2.22e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 5033.417 on 65821 degrees of freedom
#> Multiple R-squared:  0.05319379, Adjusted R-squared:  0.05316502 
#> F-statistic: 1848.989 on 2 and 65821 DF,  p-value: < 2.2204e-16