library(brandr)
library(dplyr)
library(effectsize)
library(forcats)
library(fs)
library(ggplot2)
library(gt)
library(here)
library(htmltools)
library(httr2)
library(infer)
library(janitor)
library(labelled)
library(magrittr)
library(nanoparquet)
library(patchwork)
library(pwr)
library(pwrss)
library(readr)
library(readxl)
library(stringr)
library(summarytools)
library(tidyr)Differences Among Brazilian Children Aged 2 to 4 in Ultra-Processed Food Consumption in 2022 Between Municipalities in Clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R)
Overview
This report contains a data analysis exercise for the course An Introduction to the R Programming Language.
The analysis check for potential differences in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).
This exercise is for educational purposes only.
The data used in this exercise requires further cleaning and validation before it can be used in real-world applications. For the purposes of this analysis, the data is assumed to be valid, reliable, and to satisfy all assumptions underlying the statistical tests performed, even though this may not hold in practice.
Please note that the results of the statistical tests may not be valid due to these simplifications.
In real-world scenarios, always ensure that the assumptions of statistical tests are rigorously checked and validated before interpreting the results.
Problem
Ultra-processed foods (UPF) are industrial formulations typically high in added sugars, unhealthy fats, and salt, while being low in essential nutrients (Monteiro et al., 2018; Monteiro et al., 2019). Their consumption has been linked to various adverse health outcomes, including obesity (Louzada et al., 2015), cardiovascular diseases (Mendonça et al., 2017), and metabolic disorders (Lavigne-Robichaud et al., 2018).
Although the consumption of UPF has been increasing globally, there is limited research on how it varies across different regions and socio-economic contexts, particularly among children. The Revised Multidimensional Index for Sustainable Food Systems (MISFS-R) (Figure 1) provides a framework for assessing the sustainability of food systems at a subnational level in Brazil, incorporating local behaviors and practices (Carvalho et al., 2021; Norde et al., 2023). Understanding the relationship between MISFS-R clusters and UPF consumption can inform targeted interventions to promote healthier dietary patterns among children.
Question
This analysis seeks to address the following question:
Was there a meaningful difference in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R)?
Methods
Approach
This study employed Popper’s hypothetical-deductive method, also known as the method of conjecture and refutation (Popper, 1972/1979, p. 164), as its problem-solving approach. Procedurally, it applied an enhanced version of Null Hypothesis Significance Testing (NHST), grounded on the original ideas of Neyman-Pearson framework for data testing (Neyman & Pearson, 1928a, 1928b; Perezgonzalez, 2015).
Source of Data
The data used in this analysis were sourced from the Food and Nutrition Surveillance System (SISVAN) of the Brazilian Ministry of Health (MS) (Sistema de Vigilância Alimentar e Nutricional & Ministério da Saúde, n.d.).
Data Munging
The data munging follow the data science workflow outlined by Wickham et al. (2023). All processes were made using the Quarto publishing system (Allaire et al., n.d.), the R programming language (R Core Team, n.d.), and several R packages.
For data manipulation and workflow, priority was given to packages from the Tidyverse, Tidymodels, and rOpenSci frameworks, as well as other packages adhering to the tidy tools manifesto (Wickham, 2023).
Source: Reproduced from Wickham et al. (2023).
Data Analysis
The analysis employed a bilateral t-test for independent groups with a randomization-based empirical null distribution. Summary tables and visual inspections were conducted to explore patterns in the data.
Furthermore, an a priori power analysis and effect-size estimation were performed to evaluate the statistical robustness and practical significance of the findings.
Data Validation
Data from municipalities with fewer than 10 monitored children were excluded to ensure reliable estimates. Additionally, municipalities where the number of children consuming ultra-processed foods (UPF) exceeded the total number of monitored children were removed, as this indicates data inconsistencies.
This does not guarantee the validity of the data, but it helps to mitigate some potential issues.
Hypothesis Testing
The analysis tested whether the means of ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 differed meaningfully between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).
To ensure practical significance, a Minimum Effect Size (MES) criterion was applied, following the original Neyman-Pearson framework for hypothesis testing (Neyman & Pearson, 1928a, 1928b; Perezgonzalez, 2015). The MES was set at Cohen’s threshold for small effects (Cohen’s \(d\) = 0.2) (Cohen, 1988). A difference was considered meaningful only if its effect size was greater than or equal to the MES; otherwise, it was considered negligible.
The test was structured as follows:
Null Hypothesis (\(\text{H}_{0}\)): Ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 does not differ meaningfully between municipalities in MISFS-R clusters B and D, indicated by Cohen’s \(d\) effect-size statistic being smaller than 0.2 (negligible).
Alternative Hypothesis (\(\text{H}_{1}\)): Ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 differs meaningfully between municipalities in MISFS-R clusters B and D, indicated by Cohen’s \(d\) effect-size statistic being greater or equal than 0.2 (non-negligible).
Formally:
\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} \\ \text{H}_{1}: \mu_{A} \neq \mu_{B} \\ \end{cases} \]
\[ \begin{cases} \text{H}_{0}: \text{Cohen's} \ d < \text{MES} \\ \text{H}_{1}: \text{Cohen's} \ d \geq \text{MES} \\ \end{cases} \]
The hypothesis test is conditioned on a type-I error (\(\alpha\)) of 0.05 and a minimum statistical power (1 - \(\beta\)) of 0.8. This means the test should have at least an 80% probability of correctly rejecting the null hypothesis when it is false, thereby minimizing the risk of a type-II error (\(\beta\)).
Code Style
The Tidyverse tidy tools manifesto (Wickham et al., 2023), code style guide (Wickham, n.d.-a) and design principles (Wickham, n.d.-b) were followed to ensure consistency and enhance readability.
Reproducibility
The pipeline is fully reproducible and can be run again at any time. To ensure consistent results, the renv package (Ushey & Wickham, 2025) was used to manage and restore the R environment. See the README file in the code repository to learn how to run it.
Set the Environment
Load Packages
Set Data Directories
for (i in c(raw_data_dir, data_dir)) {
if (!dir.exists(i)) {
dir.create(i)
}
}Perform an a priori Power Analysis
pwr_analysis <- pwr.t.test(
d = 0.2,
sig.level = 0.05,
power = 0.8,
type = "two.sample",
alternative = "two.sided"
)pwr_analysis
#>
#> Two-sample t test power calculation
#>
#> n = 393.405696
#> d = 0.2
#> sig.level = 0.05
#> power = 0.8
#> alternative = two.sided
#>
#> NOTE: n is number in *each* groupCode
pwr_analysis <- pwrss.t.2means(
mu1 = 0.2, # Cohen's d for small effect sizes
mu2 = 0,
power = 0.8,
alpha = 0.05,
welch.df = TRUE,
alternative = "not equal"
)
#> +--------------------------------------------------+
#> | SAMPLE SIZE CALCULATION |
#> +--------------------------------------------------+
#>
#> Welch's T-Test (Independent Samples)
#>
#> ---------------------------------------------------
#> Hypotheses
#> ---------------------------------------------------
#> H0 (Null Claim) : d - null.d = 0
#> H1 (Alt. Claim) : d - null.d != 0
#>
#> ---------------------------------------------------
#> Results
#> ---------------------------------------------------
#> Sample Size = 394 and 394 <<
#> Type 1 Error (alpha) = 0.050
#> Type 2 Error (beta) = 0.199
#> Statistical Power = 0.801
power.t.test(
ncp = pwr_analysis$ncp,
df = pwr_analysis$df,
alpha = pwr_analysis$parms$alpha,
alternative = "two.sided",
plot = TRUE,
verbose = FALSE
)Download Data
request("https://sisaps.saude.gov.br") |>
req_url_path_append("sisvan") |>
req_url_path_append("public") |>
req_url_path_append("file") |>
req_url_path_append("relatorios") |>
req_url_path_append("consumo") |>
req_url_path_append("2oumais") |>
req_url_path_append("entre2a4anos") |>
req_url_path_append("CONS_ULTRA.xlsx") |>
req_perform(
path = here(raw_data_dir, "CONS_ULTRA.xlsx")
)Import Data
data |> glimpse()
#> Rows: 3,158
#> Columns: 6
#> $ UF <chr> "AC", "AC", "AC", "AC", "AC", "AC", "AC", "AC", "AC",…
#> $ `Código IBGE` <chr> "120001", "120005", "120010", "120013", "120020", "12…
#> $ Município <chr> "ACRELÂNDIA", "ASSIS BRASIL", "BRASILÉIA", "BUJARI", …
#> $ Total <chr> "7", "88", "141", "73", "230", "4", "148", "2", "23",…
#> $ `%` <chr> "0.875", "0.77192982456140347", "0.92156862745098034"…
#> $ ...6 <chr> "8", "114", "153", "82", "267", "5", "205", "2", "27"…Tidy Data
Rename Columns
data <-
data |>
clean_names() |>
rename(
municipality_code = codigo_ibge,
municipality = municipio,
federal_unit = uf,
n_upf = total,
n_upf_per = percent,
n_monitored = x6
)data |> glimpse()
#> Rows: 3,158
#> Columns: 6
#> $ federal_unit <chr> "AC", "AC", "AC", "AC", "AC", "AC", "AC", "AC", "…
#> $ municipality_code <chr> "120001", "120005", "120010", "120013", "120020",…
#> $ municipality <chr> "ACRELÂNDIA", "ASSIS BRASIL", "BRASILÉIA", "BUJAR…
#> $ n_upf <chr> "7", "88", "141", "73", "230", "4", "148", "2", "…
#> $ n_upf_per <chr> "0.875", "0.77192982456140347", "0.92156862745098…
#> $ n_monitored <chr> "8", "114", "153", "82", "267", "5", "205", "2", …Standardize Columns
data <-
data |>
mutate(
year = 2022 |> as.integer(),
municipality_code = municipality_code |> as.integer(),
federal_unit = federal_unit |> as.factor(),
n_upf = n_upf |> as.integer(),
n_monitored = n_monitored |> as.integer()
)data |> glimpse()
#> Rows: 3,158
#> Columns: 7
#> $ federal_unit <fct> AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, A…
#> $ municipality_code <int> 120001, 120005, 120010, 120013, 120020, 120025, 1…
#> $ municipality <chr> "ACRELÂNDIA", "ASSIS BRASIL", "BRASILÉIA", "BUJAR…
#> $ n_upf <int> 7, 88, 141, 73, 230, 4, 148, 2, 23, 162, 2, 168, …
#> $ n_upf_per <chr> "0.875", "0.77192982456140347", "0.92156862745098…
#> $ n_monitored <int> 8, 114, 153, 82, 267, 5, 205, 2, 27, 191, 2, 190,…
#> $ year <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…Relocate Columns
data |> glimpse()
#> Rows: 3,158
#> Columns: 7
#> $ year <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
#> $ municipality_code <int> 120001, 120005, 120010, 120013, 120020, 120025, 1…
#> $ municipality <chr> "ACRELÂNDIA", "ASSIS BRASIL", "BRASILÉIA", "BUJAR…
#> $ federal_unit <fct> AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, A…
#> $ n_upf <int> 7, 88, 141, 73, 230, 4, 148, 2, 23, 162, 2, 168, …
#> $ n_upf_per <chr> "0.875", "0.77192982456140347", "0.92156862745098…
#> $ n_monitored <int> 8, 114, 153, 82, 267, 5, 205, 2, 27, 191, 2, 190,…Validate Data
Filter Data
data <-
data |>
filter(
!(n_monitored < 10),
!(n_upf > n_monitored)
)data |> glimpse()
#> Rows: 2,419
#> Columns: 7
#> $ year <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
#> $ municipality_code <int> 120005, 120010, 120013, 120020, 120030, 120034, 1…
#> $ municipality <chr> "ASSIS BRASIL", "BRASILÉIA", "BUJARI", "CRUZEIRO …
#> $ federal_unit <fct> AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, A…
#> $ n_upf <int> 88, 141, 73, 230, 148, 23, 162, 168, 52, 70, 243,…
#> $ n_upf_per <chr> "0.77192982456140347", "0.92156862745098034", "0.…
#> $ n_monitored <int> 114, 153, 82, 267, 205, 27, 191, 190, 58, 74, 284…Transform Data
Recalculate UPF Consumption Percentage
data <-
data |>
mutate(
n_upf_per = n_upf |>
divide_by(n_monitored) |>
multiply_by(100)
)data |> glimpse()
#> Rows: 2,419
#> Columns: 7
#> $ year <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
#> $ municipality_code <int> 120005, 120010, 120013, 120020, 120030, 120034, 1…
#> $ municipality <chr> "ASSIS BRASIL", "BRASILÉIA", "BUJARI", "CRUZEIRO …
#> $ federal_unit <fct> AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, A…
#> $ n_upf <int> 88, 141, 73, 230, 148, 23, 162, 168, 52, 70, 243,…
#> $ n_upf_per <dbl> 77.19298246, 92.15686275, 89.02439024, 86.1423221…
#> $ n_monitored <int> 114, 153, 82, 267, 205, 27, 191, 190, 58, 74, 284…Define MISFS
data <-
data |>
mutate(
misfs = federal_unit |>
recode_values(
c("AC", "GO", "MS", "MT", "RO", "TO") ~ "A",
c("ES", "MG", "PR", "RJ", "RS", "SC", "SP") ~ "B",
c("AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE") ~ "C",
c("AM", "AP", "PA", "RR") ~ "D"
) |>
factor(levels = c("A", "B", "C", "D"))
) |>
relocate(misfs, .after = federal_unit)data |> glimpse()
#> Rows: 2,419
#> Columns: 8
#> $ year <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
#> $ municipality_code <int> 120005, 120010, 120013, 120020, 120030, 120034, 1…
#> $ municipality <chr> "ASSIS BRASIL", "BRASILÉIA", "BUJARI", "CRUZEIRO …
#> $ federal_unit <fct> AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, A…
#> $ misfs <fct> A, A, A, A, A, A, A, A, A, A, A, A, A, C, C, C, C…
#> $ n_upf <int> 88, 141, 73, 230, 148, 23, 162, 168, 52, 70, 243,…
#> $ n_upf_per <dbl> 77.19298246, 92.15686275, 89.02439024, 86.1423221…
#> $ n_monitored <int> 114, 153, 82, 267, 205, 27, 191, 190, 58, 74, 284…Filter MISFS
data |> glimpse()
#> Rows: 1,426
#> Columns: 8
#> $ year <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
#> $ municipality_code <int> 130002, 130006, 130008, 130010, 130014, 130030, 1…
#> $ municipality <chr> "ALVARÃES", "AMATURÁ", "ANAMÃ", "ANORI", "APUÍ", …
#> $ federal_unit <fct> AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, A…
#> $ misfs <fct> D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D…
#> $ n_upf <int> 30, 282, 215, 52, 165, 934, 14, 360, 52, 134, 281…
#> $ n_upf_per <dbl> 81.08108108, 82.69794721, 88.84297521, 91.2280701…
#> $ n_monitored <int> 37, 341, 242, 57, 172, 1148, 14, 438, 157, 160, 3…Create Data Dictionary
Prepare Metadata
metadata <-
data |>
`var_label<-`(
list(
year = "Year of the data collection",
municipality_code = "Municipality code",
municipality = "Municipality name",
federal_unit = "State abbreviation (Federal Unit)",
misfs = paste0(
"Revised Multidimensional Index for Sustainable Food Systems (MISFS-R)",
"cluster"
),
n_upf = "Number of children that consumed ultra-processed foods (UPF)",
n_upf_per = paste0(
"Percentage of children that consumed ultra-processed foods (UPF)"
),
n_monitored = "Number of monitored children"
)
) |>
generate_dictionary(details = "full") |>
convert_list_columns_to_character()Visualize Final Data
metadata |> glimpse()
#> Rows: 8
#> Columns: 14
#> $ pos <int> 1, 2, 3, 4, 5, 6, 7, 8
#> $ variable <chr> "year", "municipality_code", "municipality", "federal…
#> $ label <chr> "Year of the data collection", "Municipality code", "…
#> $ col_type <chr> "int", "int", "chr", "fct", "fct", "int", "dbl", "int"
#> $ missing <int> 0, 0, 0, 0, 0, 0, 0, 0
#> $ levels <chr> "", "", "", "AC; AL; AM; AP; BA; CE; DF; ES; GO; MA; …
#> $ value_labels <chr> "", "", "", "", "", "", "", ""
#> $ class <chr> "integer", "integer", "character", "factor", "factor"…
#> $ type <chr> "integer", "integer", "character", "integer", "intege…
#> $ na_values <chr> "", "", "", "", "", "", "", ""
#> $ na_range <chr> "", "", "", "", "", "", "", ""
#> $ n_na <int> 0, 0, 0, 0, 0, 0, 0, 0
#> $ unique_values <int> 1, 1426, 1419, 11, 2, 383, 854, 417
#> $ range <chr> "2022 - 2022", "130002 - 432360", "ABADIA DOS DOURADO…metadatadata |> glimpse()
#> Rows: 1,426
#> Columns: 8
#> $ year <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
#> $ municipality_code <int> 130002, 130006, 130008, 130010, 130014, 130030, 1…
#> $ municipality <chr> "ALVARÃES", "AMATURÁ", "ANAMÃ", "ANORI", "APUÍ", …
#> $ federal_unit <fct> AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, A…
#> $ misfs <fct> D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D…
#> $ n_upf <int> 30, 282, 215, 52, 165, 934, 14, 360, 52, 134, 281…
#> $ n_upf_per <dbl> 81.08108108, 82.69794721, 88.84297521, 91.2280701…
#> $ n_monitored <int> 37, 341, 242, 57, 172, 1148, 14, 438, 157, 160, 3…dataSave Data
Clean Old Data Files
if (dir.exists(data_dir)) {
dir_delete(data_dir)
dir_create(data_dir, recurse = TRUE)
}Write Data
valid_file_pattern <- "sisvan-upf-2022"data |>
write_parquet(
here(data_dir, paste0(valid_file_pattern, ".parquet"))
)Explore Data
Visualize a Random Sample of the Data
Code
data |> sample_n(100)Visualize Distributions
Code
data |>
descr(
var = n_upf_per,
style = "rmarkdown",
plain.ascii = FALSE,
headings = FALSE
)| n_upf_per | |
|---|---|
| Mean | 84.09 |
| Std.Dev | 14.21 |
| Min | 0.00 |
| Q1 | 80.00 |
| Median | 87.17 |
| Q3 | 92.59 |
| Max | 100.00 |
| MAD | 9.20 |
| IQR | 12.59 |
| CV | 0.17 |
| Skewness | -2.59 |
| SE.Skewness | 0.06 |
| Kurtosis | 9.90 |
| N.Valid | 1426.00 |
| N | 1426.00 |
| Pct.Valid | 100.00 |
Source: Created by the author based on data sourced from the Food and Nutrition Surveillance System (SISVAN) of the Brazilian Ministry of Health (MS) (Sistema de Vigilância Alimentar e Nutricional & Ministério da Saúde, n.d.).
Code
plot_hist <-
data |>
ggplot(aes(x = n_upf_per)) +
geom_histogram(
aes(y = after_stat(density)),
bins = 30,
color = "white"
) +
labs(x = "Value", y = "Density") +
geom_density(
color = "red",
linewidth = 1
) +
theme(legend.position = "none")
plot_qq <-
data |>
ggplot(aes(sample = n_upf_per)) +
stat_qq() +
stat_qq_line(
color = "red",
linewidth = 1
) +
labs(
x = "Theoretical Quantiles (Std. Normal)",
y = "Sample Quantiles"
) +
theme(legend.position = "none")
plot_hist + plot_qqSource: Created by the author based on data sourced from the Food and Nutrition Surveillance System (SISVAN) of the Brazilian Ministry of Health (MS) (Sistema de Vigilância Alimentar e Nutricional & Ministério da Saúde, n.d.).
Code
data |>
select(n_upf_per) |>
pivot_longer(everything()) |>
ggplot(aes(x = name, y = value)) +
geom_boxplot(
outlier.color = "red",
outlier.shape = 1,
width = 0.75
) +
geom_jitter(
width = 0.375,
alpha = 0.1,
color = get_brand_color("black"),
size = 0.5
) +
coord_flip() +
scale_fill_brand_d() +
labs(y = "Value") +
theme(
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank()
)Source: Created by the author based on data sourced from the Food and Nutrition Surveillance System (SISVAN) of the Brazilian Ministry of Health (MS) (Sistema de Vigilância Alimentar e Nutricional & Ministério da Saúde, n.d.).
Code
data |>
freq(
var = misfs,
style = "rmarkdown",
plain.ascii = FALSE,
headings = FALSE
)| Freq | % Valid | % Valid Cum. | % Total | % Total Cum. | |
|---|---|---|---|---|---|
| A | 0 | 0.00 | 0.00 | 0.00 | 0.00 |
| B | 1300 | 91.16 | 91.16 | 91.16 | 91.16 |
| C | 0 | 0.00 | 91.16 | 0.00 | 91.16 |
| D | 126 | 8.84 | 100.00 | 8.84 | 100.00 |
| <NA> | 0 | 0.00 | 100.00 | ||
| Total | 1426 | 100.00 | 100.00 | 100.00 | 100.00 |
Source: Created by the author based on data sourced from the Food and Nutrition Surveillance System (SISVAN) of the Brazilian Ministry of Health (MS) (Sistema de Vigilância Alimentar e Nutricional & Ministério da Saúde, n.d.).
Code
data |>
mutate(
misfs = misfs |> fct_rev()
) |>
ggplot(aes(y = misfs)) +
geom_bar(fill = get_brand_color("blue")) +
scale_y_discrete(drop = FALSE) +
labs(
x = "MISFS-R Cluster",
y = "Frequency"
)Source: Created by the author based on data sourced from the Food and Nutrition Surveillance System (SISVAN) of the Brazilian Ministry of Health (MS) (Sistema de Vigilância Alimentar e Nutricional & Ministério da Saúde, n.d.).
Visualize Combined Distributions
Code
data |>
ggplot(
aes(
x = misfs,
y = n_upf_per,
fill = misfs
)
) +
geom_boxplot(
outlier.color = "red"
) +
geom_jitter(
width = 0.375,
alpha = 0.1,
color = "black",
size = 0.5
) +
scale_fill_brand_d() +
labs(
x = "MISFS-R Cluster",
y = "UPF consumption (%)",
fill = NULL
)Source: Created by the author based on data sourced from the Food and Nutrition Surveillance System (SISVAN) of the Brazilian Ministry of Health (MS) (Sistema de Vigilância Alimentar e Nutricional & Ministério da Saúde, n.d.).
Code
data |>
ggplot(
aes(
x = n_upf_per,
fill = misfs
)
) +
geom_density(
alpha = 0.5,
position = "identity"
) +
scale_fill_brand_d() +
labs(
x = "UPF consumption (%)",
y = "Density",
fill = "MISFS"
)Source: Created by the author based on data sourced from the Food and Nutrition Surveillance System (SISVAN) of the Brazilian Ministry of Health (MS) (Sistema de Vigilância Alimentar e Nutricional & Ministério da Saúde, n.d.).
Code
data |>
ggplot(
aes(
sample = n_upf_per,
color = misfs
)
) +
stat_qq() +
stat_qq_line(color = "black") +
scale_color_brand_d() +
facet_wrap(vars(misfs)) +
labs(
x = "Theoretical Quantiles (Std. Normal)",
y = "Sample Quantiles",
color = "MISFS"
)Source: Created by the author based on data sourced from the Food and Nutrition Surveillance System (SISVAN) of the Brazilian Ministry of Health (MS) (Sistema de Vigilância Alimentar e Nutricional & Ministério da Saúde, n.d.).
Summarize Data
Code
data |>
group_by(misfs) |>
summarize(
n = n(),
mean = mean(n_upf_per) / 100,
sd = sd(n_upf_per) / 100,
median = median(n_upf_per) / 100,
iqr = IQR(n_upf_per) / 100
) |>
pivot_longer(
cols = -misfs,
names_to = "Statistic",
values_to = "Value"
) |>
pivot_wider(
names_from = misfs,
values_from = Value
) |>
mutate(
Statistic = Statistic |>
recode_values(
"n" ~ "Sample Size",
"mean" ~ "Mean",
"sd" ~ "Standard Deviation",
"median" ~ "Median",
"iqr" ~ "Interquartile Range"
)
) |>
gt() |>
tab_header(
title = md("**Children Aged 2 to 4 UPF Consumption**"),
subtitle = "Comparison between MISFS-R Clusters B and D in 2022"
) |>
cols_label(
Statistic = "",
B = "Cluster B",
D = "Cluster D"
) |>
fmt_percent(
columns = c(B, D),
rows = !Statistic %in% c("Sample Size"),
decimals = 2
) |>
tab_footnote(
footnote = paste0(
"Statistics calculated removing municipalities with fewer than 10 ",
"monitored children and those where the number of children consuming ",
"UPFs exceeded the total number of monitored children."
),
locations = cells_body(
columns = Statistic,
rows = !Statistic %in% c("Sample Size")
),
placement = "right"
) |>
opt_footnote_marks(marks = c("*", "+")) |>
# tab_source_note(
# source_note = "Source: SISVAN, Brazilian Ministry of Health (2022)"
# ) |>
tab_options(
table.width = pct(75),
table.border.top.style = "solid",
table.border.bottom.style = "solid",
heading.border.bottom.style = "solid",
column_labels.border.bottom.style = "solid",
row_group.border.top.style = "solid",
row_group.border.bottom.style = "solid",
footnotes.border.lr.style = "solid",
source_notes.border.bottom.style = "solid"
) |>
opt_css(
css = "
.gt_footnote {
text-align: left !important;
}
"
)| Children Aged 2 to 4 UPF Consumption | ||
| Comparison between MISFS-R Clusters B and D in 2022 | ||
| Cluster B | Cluster D | |
|---|---|---|
| Sample Size | 1300 | 126 |
| Mean* | 84.17% | 83.25% |
| Standard Deviation* | 14.38% | 12.29% |
| Median* | 87.32% | 85.18% |
| Interquartile Range* | 12.69% | 13.70% |
| * Statistics calculated removing municipalities with fewer than 10 monitored children and those where the number of children consuming UPFs exceeded the total number of monitored children. | ||
Source: Created by the author based on data sourced from the Food and Nutrition Surveillance System (SISVAN) of the Brazilian Ministry of Health (MS) (Sistema de Vigilância Alimentar e Nutricional & Ministério da Saúde, n.d.).
Assess Model Assumptions
✅ Independence of observations.
❌ Normality of the distribution of the response variable (n_upf_per) within each group (misfs is the explanatory/independent variable).
⏭️ Homogeneity of variances between groups (only if using Student’s t-test; Welch’s t-test and the permutation approach below do not require this assumption).
Model Data
Calculate Observed Statistic
observed_statistic <-
data |>
specify(n_upf_per ~ misfs) |>
hypothesize(null = "independence") |>
calculate(
stat = "t",
order = c("B", "D")
)observed_statisticGenerate the Null Distribution
null_dist <-
data |>
specify(n_upf_per ~ misfs) |>
hypothesize(null = "independence") |>
generate(
reps = 1000,
type = "permute"
) |>
calculate(
stat = "t",
order = c("B", "D")
)null_dist |>
visualize() +
shade_p_value(
obs_stat = observed_statistic,
direction = "two-sided"
) +
labs(
title = NULL,
x = "t-statistic",
y = "Frequency"
)Assess p-value
p_value <-
null_dist |>
get_p_value(
obs_stat = observed_statistic,
direction = "two-sided"
)p_valueAssess Effect Size
cohens_d(
x = misfs_b,
y = misfs_d,
mu = 0,
ci = 0.95,
alternative = "two.sided"
) |>
interpret_hedges_g(rules = "cohen1988")Conclusion
As noted at the beginning of this exercise, the data and analysis are purely educational and should not be used to draw real-world conclusions without further validation. The data contain numerous issues and do not meet the assumptions of the statistical tests used, which compromise the reliability of the results. Therefore, any conclusions should be interpreted with caution.
Assuming the data were valid and the t-test assumptions were met, the analysis found no statistically significant difference in means (\(t\) = 0.789, \(p\)-value = 0.368). The observed effect size was very small and did not exceed the Minimum Effect Size (MES) threshold (Cohen’s \(d\) = 0.065, 95% CI [-0.118, 0.248]), indicating any potential difference is likely negligible or effectively zero within the 95% confidence interval bounds.
The power analysis revealed that the study lacked sufficient power to detect a small effect size (Cohen’s \(d\) = 0.2) with the given sample size, which may contribute to the non-significant result. While we cannot reject the null hypothesis, we also cannot confidently conclude that there is no meaningful difference in ultra-processed food consumption between the two clusters.
Based on these findings, we conclude that the data are insufficient to support a meaningful difference in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 between municipalities in clusters B and D. This does not confirm the absence of a difference, only that the study lacked sufficient statistical power to detect one. Future research with larger sample sizes may help clarify this relationship.
Citation
To cite this work, please use the following format:
Vartanian, D. (2026). An introduction to the R programming language: Class exercise [Report]. Center for Metropolitan Studies, University of São Paulo.
A BibLaTeX entry for LaTeX users is:
@report{vartanian2026,
title = {An introduction to the R programming language: Class exercise},
author = {{Daniel Vartanian}},
year = {2026},
address = {São Paulo},
institution = {Center for Metropolitan Studies, University of São Paulo},
langid = {en}
License
The original data sources may be subject to their own licensing terms and conditions.
The code in this report is licensed under the GNU General Public License Version 3, while the report is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.
Copyright (C) 2026 Daniel Vartanian
The code in this report is free software: you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your option)
any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program. If not, see <https://www.gnu.org/licenses/>.










