Exploring potential differences in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 among B and D MISFS-R clusters
Overview
This report presents a data analysis exercise for the course An Introduction to the R Programming Language.
The analysis explores potential differences in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022, focusing on clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).
This exercise is for educational purposes only.
The data requires additional cleaning and validation to ensure its suitability for real-world applications. This analysis assumes the data is valid and reliable, which may not necessarily hold true.
Furthermore, the assumptions underlying the statistical tests were not explicitly verified. For simplicity, it was presumed that the data satisfies all necessary assumptions. Consequently, the validity of the statistical test results may be compromised.
Question
This analysis seeks to address the following question:
Was there a meaningful difference in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 across the clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R)?
MISFS is a multidimensional index designed to assess the sustainability of food systems at a subnational level in Brazil, incorporating local behaviors and practices. The MISFS-R is a revised version that introduces new indicators and a refined methodology (Figure 1). For more details, see Carvalho et al. (2021) and Norde et al. (2023).
Source: Reproduced from Norde et al. (2023).
Methods
Approach and Procedure Method
This study employed the hypothetical-deductive method, also known as the method of conjecture and refutation (Popper, 1972/1979, p. 164), as its problem-solving approach. Procedurally, it applied an enhanced version of Null Hypothesis Significance Testing (NHST), grounded on the original ideas of Neyman-Pearson framework for data testing (Neyman & Pearson, 1928a, 1928b; Perezgonzalez, 2015).
Source of Data/Information
The data used in this analysis were sourced from Brazil’s Food and Nutrition Surveillance System (SISVAN), regarding the dataset on ultra-processed food consumption (Sistema de Vigilância Alimentar e Nutricional, n.d.).
Only data from municipalities with 10 or more monitored children were considered in the analysis.
Data Wrangling
Data munging and analysis followed the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 2. All processes were made using the R programming language (R Core Team, n.d.), RStudio IDE (Posit Team, n.d.), and several R packages.
The tidyverse and rOpenSci peer-reviewed package ecosystem and other R packages adherents of the tidy tools manifesto (Wickham, 2023) were prioritized. All processes were made in order to provide result transparency and reproducibility.
Source: Reproduced from Wickham et al. (2023).
The Tidyverse code style guide and design principles were followed to ensure consistency and enhance readability.
All the analyses are 100% reproducible and can be run again at any time. See the README file in the code repository to learn how to run them.
Data Analysis
The analysis employed a bilateral t-test for independent groups using a randomization-based empirical null distribution. Visual inspections of the data were also conducted to explore and assess patterns. Furthermore, a power analysis and effect-size estimation were performed to evaluate the statistical robustness and practical significance of the findings.
Hypothesis Testing
We tested whether the means of ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 differed meaningfully across the clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).
To ensure practical significance, we applied a Minimum Effect Size (MES) criterion, following the original Neyman-Pearson framework for hypothesis testing (Neyman & Pearson, 1928a, 1928b; Perezgonzalez, 2015). The MES was set at Cohen’s threshold for small effects (Cohen’s \(d\) = 0.2) (Cohen, 1988). Thus, a difference was considered meaningfully only if its effect-size was greater or equal to the MES; otherwise, it was considered negligible.
The test was structured as follows:
Null Hypothesis (\(\text{H}_{0}\)): Ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 does not differ meaningfully across MISFS-R clusters B and D, indicated by Cohen’s \(d\) effect-size statistic being smaller than 0.2 (negligible).
Alternative Hypothesis (\(\text{H}_{a}\)): Ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 differs meaningfully across MISFS-R clusters B and D, indicated by Cohen’s \(d\) effect-size statistic being greater or equal than 0.2 (non-negligible).
Formally:
\[ \begin{cases} \text{H}_{0}: \text{Cohen's} \ d < \text{MES} \\ \text{H}_{a}: \text{Cohen's} \ d \geq \text{MES} \end{cases} \]
To ensure validity, this hypothesis test is conditioned on a Type I error (\(\alpha\)) of 0.05 and a minimum statistical power (1 - \(\beta\)) of 0.8. This means the test should have at least an 80% probability of correctly rejecting the null hypothesis when it is false, thereby minimizing the risk of a Type II error (\(\beta\)).
Setting the Environment
Importing and Tidying the Data
Source: SISVAN Food Consumption dataset for ultra-processed food consumption of children between 2 to 4 years old (Sistema de Vigilância Alimentar e Nutricional, n.d.).
Downloading the Data
if (!dir.exists(here::here("data"))) dir.create("data")file <- here("data", "sisvan-raw.xlsx")
paste0(
"https://sisaps.saude.gov.br/sisvan/public/file/relatorios/",
"consumo/2oumais/entre2a4anos/CONS_ULTRA.xlsx"
) |>
curl::curl_download(file)Reading the Data
Tidying the Data
data <-
data |>
clean_names() |>
rename(
state_abbrev = uf,
municipality_code = codigo_ibge,
municipality_name = municipio,
n_upf = total,
n_upf_rel = percent,
n_monitored = x6
) |>
mutate(
municipality_code = as.integer(municipality_code),
municipality_name = str_to_title(municipality_name),
n_upf = as.integer(n_upf),
n_monitored = as.integer(n_monitored),
year = as.integer(2022)
) |>
relocate(year)Validating the Data
Transforming the Data
data <-
data |>
mutate(
misfs = case_match(
state_abbrev,
c("AC", "GO", "MS", "MT", "RO", "TO") ~ "A",
c("ES", "MG", "PR", "RJ", "RS", "SC", "SP") ~ "B",
c("AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE") ~ "C",
c("AM", "AP", "PA", "RR") ~ "D",
) |>
factor(levels = c("A", "B", "C", "D"))
) |>
filter(misfs %in% c("B", "D")) |>
relocate(misfs, .before = state_abbrev)dataData Dictionary
-
year: Year of the data collection (type:integer). -
misfs: Revised Multidimensional Index for Sustainable Food Systems (MISFS-R) cluster (type:factor). -
state_abbrev: State abbreviation (Federal Unit) (type:character). -
municipality_code: Municipality code (type:integer). -
municipality_name: Municipality name (type:character). -
n_upf: Number of children that consumed ultra-processed foods (UPF) (type:integer). -
n_upf_rel: Percentage of children that consumed ultra-processed foods (UPF) (type:double). -
n_monitored: Number of monitored children (type:integer).
Saving the Valid Data
Checking Distributions
Code
data |>
freq(
var = misfs,
style = "rmarkdown",
plain.ascii = FALSE,
headings = FALSE
)misfs variable.
| Freq | % Valid | % Valid Cum. | % Total | % Total Cum. | |
|---|---|---|---|---|---|
| A | 0 | 0.00 | 0.00 | 0.00 | 0.00 |
| B | 1300 | 91.16 | 91.16 | 91.16 | 91.16 |
| C | 0 | 0.00 | 91.16 | 0.00 | 91.16 |
| D | 126 | 8.84 | 100.00 | 8.84 | 100.00 |
| <NA> | 0 | 0.00 | 100.00 | ||
| Total | 1426 | 100.00 | 100.00 | 100.00 | 100.00 |
Source: Created by the authors.
Code
Code
data |>
descr(
var = n_upf_rel,
style = "rmarkdown",
plain.ascii = FALSE,
headings = FALSE
)n_upf_rel variable.
| n_upf_rel | |
|---|---|
| Mean | 0.84 |
| Std.Dev | 0.14 |
| Min | 0.00 |
| Q1 | 0.80 |
| Median | 0.87 |
| Q3 | 0.93 |
| Max | 1.00 |
| MAD | 0.09 |
| IQR | 0.13 |
| CV | 0.17 |
| Skewness | -2.59 |
| SE.Skewness | 0.06 |
| Kurtosis | 9.90 |
| N.Valid | 1426.00 |
| N | 1426.00 |
| Pct.Valid | 100.00 |
Source: Created by the authors.
Code
plot_hist <-
data |>
ggplot(aes(x = n_upf_rel)) +
geom_histogram(
aes(y = after_stat(density)),
bins = 30,
color = "white"
) +
labs(x = "Value", y = "Density") +
geom_density(
color = get_brand_color("red"),
linewidth = 1
) +
theme(legend.position = "none")
plot_qq <-
data |>
ggplot(aes(sample = n_upf_rel)) +
stat_qq() +
stat_qq_line(
color = get_brand_color("red"),
linewidth = 1
) +
labs(
x = "Theoretical quantiles (Std. normal)",
y = "Sample quantiles"
) +
theme(legend.position = "none")
plot_hist + plot_qqn_upf_rel variable with a kernel density estimate, along with a quantile-quantile (Q-Q) plot between the variable and the theoretical quantiles of the normal distribution.
Code
data |>
select(n_upf_rel) |>
pivot_longer(everything()) |>
ggplot(aes(x = name, y = value)) +
geom_boxplot(
outlier.color = get_brand_color("dark-red"),
outlier.shape = 1,
width = 0.75
) +
geom_jitter(
width = 0.375,
alpha = 0.1,
color = get_brand_color("black"),
size = 0.5
) +
ggplot2::coord_flip() +
scale_fill_brand_d() +
labs(x = "Value") +
theme(
axis.title.y = ggplot2::element_blank(),
axis.text.y = ggplot2::element_blank(),
axis.ticks.y = ggplot2::element_blank()
)Checking Combined Distributions
Code
data |>
ggplot(aes(x = misfs, y = n_upf_rel, fill = misfs)) +
geom_boxplot(outlier.color = get_brand_color("dark-red")) +
geom_jitter(
width = 0.375,
alpha = 0.1,
color = get_brand_color("black"),
size = 0.5
) +
scale_fill_brand_d(alpha = 0.7) +
labs(x = "MISFS-R", y = "UPF consumption (%)", fill = "MISFS-R")Code
data |>
ggplot(aes(x = n_upf_rel, fill = misfs)) +
geom_density(alpha = 0.5, position = "identity") +
scale_fill_brand_d() +
labs(x = "UPF consumption (%)", y = "Density", fill = "MISFS")Modeling the Data
observed_statistic <-
data |>
specify(n_upf_rel ~ misfs) |>
hypothesize(null = "independence") |>
calculate(stat = "t", order = c("B", "D"))
#> Dropping unused factor levels c("A", "C") from the supplied explanatory
#> variable 'misfs'.
observed_statisticnull_dist <-
data |>
specify(n_upf_rel ~ misfs) |>
hypothesize(null = "independence") |>
generate(reps = 1000, type = "permute") |>
calculate(stat = "t", order = c("B", "D"))
#> Dropping unused factor levels c("A", "C") from the supplied explanatory
#> variable 'misfs'.
null_distCode
null_dist |>
visualize() +
shade_p_value(obs_stat = observed_statistic, direction = "two-sided") +
labs(title = NULL, x = "t-statistic", y = "Frequency")null_dist |>
get_p_value(obs_stat = observed_statistic, direction = "two-sided")Assessing the Effect Size
cohens_d(
x = misfs_b,
y = misfs_d,
mu = 0,
ci = 0.95,
alternative = "two.sided"
) |>
interpret_hedges_g(rules = "cohen1988")Assessing Power
The power analysis below was conducted considering the Minimal Effect Size (MES) threshold (Cohen’s \(d\) = 0.2), not the observed effect size.
pwr_analysis <- pwrss.t.2means(
mu1 = 0.2,
mu2 = 0,
paired = FALSE,
n2 = length(misfs_d),
kappa = length(misfs_b) / length(misfs_d),
power = NULL,
alpha = 0.05,
welch.df = TRUE,
alternative = "not equal"
)
#> Difference between Two means
#> (Independent Samples t Test)
#> H0: mu1 = mu2
#> HA: mu1 != mu2
#> ------------------------------
#> Statistical power = 0.567
#> n1 = 1300
#> n2 = 126
#> ------------------------------
#> Alternative = "not equal"
#> Degrees of freedom = 150.27
#> Non-centrality parameter = 2.144
#> Type I error rate = 0.05
#> Type II error rate = 0.433Code
power.t.test(
ncp = pwr_analysis$ncp,
df = pwr_analysis$df,
alpha = pwr_analysis$parms$alpha,
alternative = "not equal",
plot = TRUE,
verbose = FALSE
)Conclusion
Our analysis found no statistically significant difference in means (\(t\) = 0.789, \(p\)-value = 0.384). The observed effect size was very small and did not exceed the Minimal Effect Size (MES) threshold (Cohen’s \(d\) = 0.06, 95% CI [-0.12, 0.25]). This suggests that any potential difference is likely negligible or effectively zero within the bounds of the 95% confidence interval.
Considering the MES threshold (Cohen’s \(d\) = 0.2), the test exhibited limited statistical power (1 - \(\beta\) = 0.57), meaning there was a considerable probability (\(\beta\) = 0.43) of failing to detect a true difference if one exists.
Consequently, we conclude that the evidence is insufficient to support a meaningful difference in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 across clusters B and D. However, this does not confirm the absence of a difference—only that our study lacked sufficient evidence to detect one. Future research with a larger sample may help clarify this relationship.
License
The code in this report is licensed under the MIT License, while the document are available under the Creative Commons Attribution 4.0 International License.
How to Cite
To cite this work, please use the following format:
Vartanian, D., & Pereira, J. L. (2025). An introduction to the R programming language: Class exercise [Report]. Sustentarea Research and Extension Group at the University of São Paulo. https://danielvartan.github.io/r-course-exercise/
A BibTeX entry for LaTeX users is
@techreport{vartanian2025,
title = {An introduction to the R programming language: Class exercise},
author = {{Daniel Vartanian} and {Jaqueline Lopes Pereira}},
year = {2025},
address = {São Paulo},
institution = {Sustentarea Research and Extension Group at the University of São Paulo},
langid = {en},
url = {https://danielvartan.github.io/r-course-exercise/}
}









