Exploring potential differences in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 among B and D MISFS-R clusters

Author

Daniel Vartanian & Jaqueline Lopes Pereira

Published

2025-03-30

License: MIT License: CC BY 4.0

Overview

This report presents a data analysis exercise for the course An Introduction to the R Programming Language.

The analysis explores potential differences in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022, focusing on clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).

This exercise is for educational purposes only.

The data requires additional cleaning and validation to ensure its suitability for real-world applications. This analysis assumes the data is valid and reliable, which may not necessarily hold true.

Furthermore, the assumptions underlying the statistical tests were not explicitly verified. For simplicity, it was presumed that the data satisfies all necessary assumptions. Consequently, the validity of the statistical test results may be compromised.

Question

This analysis seeks to address the following question:

Was there a meaningful difference in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 across the clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R)?

MISFS is a multidimensional index designed to assess the sustainability of food systems at a subnational level in Brazil, incorporating local behaviors and practices. The MISFS-R is a revised version that introduces new indicators and a refined methodology (Figure 1). For more details, see Carvalho et al. (2021) and Norde et al. (2023).

Figure 1: Dendrogram for cluster analysis between Brazilian states considering all the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R) indicators and geographical location of each cluster.

Source: Reproduced from Norde et al. (2023).

Methods

Approach and Procedure Method

This study employed the hypothetical-deductive method, also known as the method of conjecture and refutation (Popper, 1972/1979, p. 164), as its problem-solving approach. Procedurally, it applied an enhanced version of Null Hypothesis Significance Testing (NHST), grounded on the original ideas of Neyman-Pearson framework for data testing (Neyman & Pearson, 1928a, 1928b; Perezgonzalez, 2015).

Source of Data/Information

The data used in this analysis were sourced from Brazil’s Food and Nutrition Surveillance System (SISVAN), regarding the dataset on ultra-processed food consumption (Sistema de Vigilância Alimentar e Nutricional, n.d.).

Only data from municipalities with 10 or more monitored children were considered in the analysis.

Data Wrangling

Data munging and analysis followed the data science workflow outlined by Wickham et al. (2023), as illustrated in Figure 2. All processes were made using the R programming language (R Core Team, n.d.), RStudio IDE (Posit Team, n.d.), and several R packages.

The tidyverse and rOpenSci peer-reviewed package ecosystem and other R packages adherents of the tidy tools manifesto (Wickham, 2023) were prioritized. All processes were made in order to provide result transparency and reproducibility.

Figure 2: Data science workflow created by Wickham, Çetinkaya-Runde, and Grolemund.

Source: Reproduced from Wickham et al. (2023).

The Tidyverse code style guide and design principles were followed to ensure consistency and enhance readability.

All the analyses are 100% reproducible and can be run again at any time. See the README file in the code repository to learn how to run them.

Data Analysis

The analysis employed a bilateral t-test for independent groups using a randomization-based empirical null distribution. Visual inspections of the data were also conducted to explore and assess patterns. Furthermore, a power analysis and effect-size estimation were performed to evaluate the statistical robustness and practical significance of the findings.

Hypothesis Testing

We tested whether the means of ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 differed meaningfully across the clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).

To ensure practical significance, we applied a Minimum Effect Size (MES) criterion, following the original Neyman-Pearson framework for hypothesis testing (Neyman & Pearson, 1928a, 1928b; Perezgonzalez, 2015). The MES was set at Cohen’s threshold for small effects (Cohen’s \(d\) = 0.2) (Cohen, 1988). Thus, a difference was considered meaningfully only if its effect-size was greater or equal to the MES; otherwise, it was considered negligible.

The test was structured as follows:

  • Null Hypothesis (\(\text{H}_{0}\)): Ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 does not differ meaningfully across MISFS-R clusters B and D, indicated by Cohen’s \(d\) effect-size statistic being smaller than 0.2 (negligible).

  • Alternative Hypothesis (\(\text{H}_{a}\)): Ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 differs meaningfully across MISFS-R clusters B and D, indicated by Cohen’s \(d\) effect-size statistic being greater or equal than 0.2 (non-negligible).

Formally:

\[ \begin{cases} \text{H}_{0}: \text{Cohen's} \ d < \text{MES} \\ \text{H}_{a}: \text{Cohen's} \ d \geq \text{MES} \end{cases} \]

To ensure validity, this hypothesis test is conditioned on a Type I error (\(\alpha\)) of 0.05 and a minimum statistical power (1 - \(\beta\)) of 0.8. This means the test should have at least an 80% probability of correctly rejecting the null hypothesis when it is false, thereby minimizing the risk of a Type II error (\(\beta\)).

Setting the Environment

Importing and Tidying the Data

Source: SISVAN Food Consumption dataset for ultra-processed food consumption of children between 2 to 4 years old (Sistema de Vigilância Alimentar e Nutricional, n.d.).

Downloading the Data

if (!dir.exists(here::here("data"))) dir.create("data")
file <- here("data", "sisvan-raw.xlsx")

paste0(
    "https://sisaps.saude.gov.br/sisvan/public/file/relatorios/",
    "consumo/2oumais/entre2a4anos/CONS_ULTRA.xlsx"
  ) |>
  curl::curl_download(file)

Reading the Data

data <-
  here("data", "sisvan-raw.xlsx") |>
  read_xlsx(
    sheet = "2022",
    skip = 1,
    col_types = "text"
  )

Tidying the Data

data <-
  data |>
  clean_names() |>
  rename(
    state_abbrev = uf,
    municipality_code = codigo_ibge,
    municipality_name = municipio,
    n_upf = total,
    n_upf_rel = percent,
    n_monitored = x6
  ) |>
  mutate(
    municipality_code = as.integer(municipality_code),
    municipality_name = str_to_title(municipality_name),
    n_upf = as.integer(n_upf),
    n_monitored = as.integer(n_monitored),
    year = as.integer(2022)
  ) |>
  relocate(year)

Validating the Data

data <-
  data |>
  filter(
    !n_upf > n_monitored,
    n_monitored >= 10,
    !state_abbrev == "DF"
  ) |>
  drop_na(n_upf, n_monitored) |>
  mutate(n_upf_rel = n_upf / n_monitored)

Transforming the Data

data <-
  data |>
  mutate(
    misfs = case_match(
      state_abbrev,
      c("AC", "GO", "MS", "MT", "RO", "TO") ~ "A",
      c("ES", "MG", "PR", "RJ", "RS", "SC", "SP") ~ "B",
      c("AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE") ~ "C",
      c("AM", "AP", "PA", "RR") ~ "D",
    ) |>
      factor(levels = c("A", "B", "C", "D"))
  ) |>
  filter(misfs %in% c("B", "D")) |>
  relocate(misfs, .before = state_abbrev)
data

Data Dictionary

  • year: Year of the data collection (type: integer).
  • misfs: Revised Multidimensional Index for Sustainable Food Systems (MISFS-R) cluster (type: factor).
  • state_abbrev: State abbreviation (Federal Unit) (type: character).
  • municipality_code: Municipality code (type: integer).
  • municipality_name: Municipality name (type: character).
  • n_upf: Number of children that consumed ultra-processed foods (UPF) (type: integer).
  • n_upf_rel: Percentage of children that consumed ultra-processed foods (UPF) (type: double).
  • n_monitored: Number of monitored children (type: integer).

Saving the Valid Data

data |> write_csv(here("data", "sisvan-valid.csv"))
data |> write_rds(here("data", "sisvan-valid.rds"))

Checking Distributions

Code
data |>
  freq(
    var = misfs,
    style = "rmarkdown",
    plain.ascii = FALSE,
    headings = FALSE
  )
Table 1: Frequencies of the misfs variable.
  Freq % Valid % Valid Cum. % Total % Total Cum.
A 0 0.00 0.00 0.00 0.00
B 1300 91.16 91.16 91.16 91.16
C 0 0.00 91.16 0.00 91.16
D 126 8.84 100.00 8.84 100.00
<NA> 0 0.00 100.00
Total 1426 100.00 100.00 100.00 100.00

Source: Created by the authors.

Code
levels <- data |> pull(misfs) |> levels()

data |>
  ggplot(aes(y = misfs)) +
  geom_bar() +
  labs(x = "MISFS-R Cluster", y = "Frequency") +
  scale_y_discrete(limits = rev(levels))
Figure 3: Bar plot of the misfs variable.

Source: Created by the authors.

Code
data |>
  descr(
    var = n_upf_rel,
    style = "rmarkdown",
    plain.ascii = FALSE,
    headings = FALSE
  )
Table 2: Statistics for the n_upf_rel variable.
  n_upf_rel
Mean 0.84
Std.Dev 0.14
Min 0.00
Q1 0.80
Median 0.87
Q3 0.93
Max 1.00
MAD 0.09
IQR 0.13
CV 0.17
Skewness -2.59
SE.Skewness 0.06
Kurtosis 9.90
N.Valid 1426.00
N 1426.00
Pct.Valid 100.00

Source: Created by the authors.

Code
plot_hist <-
  data |>
  ggplot(aes(x = n_upf_rel)) +
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 30,
    color = "white"
  ) +
  labs(x = "Value", y = "Density") +
  geom_density(
    color = get_brand_color("red"),
    linewidth = 1
  ) +
  theme(legend.position = "none")

plot_qq <-
  data |>
  ggplot(aes(sample = n_upf_rel)) +
  stat_qq() +
  stat_qq_line(
    color = get_brand_color("red"),
    linewidth = 1
  ) +
  labs(
    x = "Theoretical quantiles (Std. normal)",
    y = "Sample quantiles"
  ) +
  theme(legend.position = "none")

plot_hist + plot_qq
Figure 4: Histogram of the n_upf_rel variable with a kernel density estimate, along with a quantile-quantile (Q-Q) plot between the variable and the theoretical quantiles of the normal distribution.

Source: Created by the authors.

Code
data |>
  select(n_upf_rel) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = name, y = value)) +
  geom_boxplot(
    outlier.color = get_brand_color("dark-red"),
    outlier.shape = 1,
    width = 0.75
  ) +
  geom_jitter(
    width = 0.375,
    alpha = 0.1,
    color = get_brand_color("black"),
    size = 0.5
  ) +
  ggplot2::coord_flip() +
  scale_fill_brand_d() +
  labs(x = "Value") +
  theme(
    axis.title.y = ggplot2::element_blank(),
    axis.text.y = ggplot2::element_blank(),
    axis.ticks.y = ggplot2::element_blank()
  )
Figure 5: Boxplot of the n_upf_rel variable.

Source: Created by the authors.

Checking Combined Distributions

Code
data |>
  ggplot(aes(x = misfs, y = n_upf_rel, fill = misfs)) +
  geom_boxplot(outlier.color = get_brand_color("dark-red")) +
  geom_jitter(
    width = 0.375,
    alpha = 0.1,
    color = get_brand_color("black"),
    size = 0.5
  ) +
  scale_fill_brand_d(alpha = 0.7) +
  labs(x = "MISFS-R", y = "UPF consumption (%)", fill = "MISFS-R")
Figure 6: Boxplots of the n_upf_rel variable by MISFS-R.

Source: Created by the authors.

Code
data |>
  ggplot(aes(x = n_upf_rel, fill = misfs)) +
  geom_density(alpha = 0.5, position = "identity") +
  scale_fill_brand_d() +
  labs(x = "UPF consumption (%)", y = "Density", fill = "MISFS")
Figure 7: Density plots of the n_upf_rel variable by MISFS-R.

Source: Created by the authors.

Modeling the Data

observed_statistic <-
  data |>
  specify(n_upf_rel ~ misfs) |>
  hypothesize(null = "independence") |>
  calculate(stat = "t", order = c("B", "D"))
#> Dropping unused factor levels c("A", "C") from the supplied explanatory
#> variable 'misfs'.

observed_statistic
null_dist <-
  data |>
  specify(n_upf_rel ~ misfs) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "t", order = c("B", "D"))
#> Dropping unused factor levels c("A", "C") from the supplied explanatory
#> variable 'misfs'.

null_dist
Code
null_dist |>
  visualize() +
  shade_p_value(obs_stat = observed_statistic, direction = "two-sided") +
  labs(title = NULL, x = "t-statistic", y = "Frequency")
Figure 8: Simulation of the null distribution of the t-statistic along with the observed t-statistic.

Source: Created by the authors.

null_dist |>
  get_p_value(obs_stat = observed_statistic, direction = "two-sided")

Assessing the Effect Size

Code
misfs_b <-
  data |>
  filter(misfs == "B") |>
  pull(n_upf_rel)

misfs_d <-
  data |>
  filter(misfs == "D") |>
  pull(n_upf_rel)
cohens_d(
  x = misfs_b,
  y = misfs_d,
  mu = 0,
  ci = 0.95,
  alternative = "two.sided"
) |>
  interpret_hedges_g(rules = "cohen1988")

Assessing Power

The power analysis below was conducted considering the Minimal Effect Size (MES) threshold (Cohen’s \(d\) = 0.2), not the observed effect size.

pwr_analysis <- pwrss.t.2means(
  mu1 = 0.2,
  mu2 = 0,
  paired = FALSE,
  n2 = length(misfs_d),
  kappa = length(misfs_b) / length(misfs_d),
  power = NULL,
  alpha = 0.05,
  welch.df = TRUE,
  alternative = "not equal"
)
#>  Difference between Two means 
#>  (Independent Samples t Test) 
#>  H0: mu1 = mu2 
#>  HA: mu1 != mu2 
#>  ------------------------------ 
#>   Statistical power = 0.567 
#>   n1 = 1300 
#>   n2 = 126 
#>  ------------------------------ 
#>  Alternative = "not equal" 
#>  Degrees of freedom = 150.27 
#>  Non-centrality parameter = 2.144 
#>  Type I error rate = 0.05 
#>  Type II error rate = 0.433
Code
power.t.test(
  ncp = pwr_analysis$ncp,
  df = pwr_analysis$df,
  alpha = pwr_analysis$parms$alpha,
  alternative = "not equal",
  plot = TRUE,
  verbose = FALSE
)

Conclusion

Our analysis found no statistically significant difference in means (\(t\) = 0.789, \(p\)-value = 0.384). The observed effect size was very small and did not exceed the Minimal Effect Size (MES) threshold (Cohen’s \(d\) = 0.06, 95% CI [-0.12, 0.25]). This suggests that any potential difference is likely negligible or effectively zero within the bounds of the 95% confidence interval.

Considering the MES threshold (Cohen’s \(d\) = 0.2), the test exhibited limited statistical power (1 - \(\beta\) = 0.57), meaning there was a considerable probability (\(\beta\) = 0.43) of failing to detect a true difference if one exists.

Consequently, we conclude that the evidence is insufficient to support a meaningful difference in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 across clusters B and D. However, this does not confirm the absence of a difference—only that our study lacked sufficient evidence to detect one. Future research with a larger sample may help clarify this relationship.

License

License: MIT License: CC BY 4.0

The code in this report is licensed under the MIT License, while the document are available under the Creative Commons Attribution 4.0 International License.

How to Cite

To cite this work, please use the following format:

Vartanian, D., & Pereira, J. L. (2025). An introduction to the R programming language: Class exercise [Report]. Sustentarea Research and Extension Group at the University of São Paulo. https://danielvartan.github.io/r-course-exercise/

A BibTeX entry for LaTeX users is

@techreport{vartanian2025,
  title = {An introduction to the R programming language: Class exercise},
  author = {{Daniel Vartanian} and {Jaqueline Lopes Pereira}},
  year = {2025},
  address = {São Paulo},
  institution = {Sustentarea Research and Extension Group at the University of São Paulo},
  langid = {en},
  url = {https://danielvartan.github.io/r-course-exercise/}
}

References

Carvalho, A. M. de, Verly Jr, E., Marchioni, D. M., & Jones, A. D. (2021). Measuring sustainable food systems in Brazil: A framework and multidimensional index to evaluate socioeconomic, nutritional, and environmental aspects. World Development, 143, 105470. https://doi.org/10.1016/j.worlddev.2021.105470
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
Neyman, J., & Pearson, E. S. (1928a). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A(1/2), 175–240. https://doi.org/10.2307/2331945
Neyman, J., & Pearson, E. S. (1928b). On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika, 20A(3/4), 263–294. https://doi.org/10.2307/2332112
Norde, M. M., Porciuncula, L., Garrido, G., Nunes-Galbes, N. M., Sarti, F. M., Marchioni, D. M. L., & de Carvalho, A. M. (2023). Measuring food systems sustainability in heterogenous countries: The Brazilian multidimensional index updated version applicability. Sustainable Development, 31(1), 91–107. https://doi.org/10.1002/sd.2376
Perezgonzalez, J. D. (2015). Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Frontiers in Psychology, 6. https://doi.org/10.3389/fpsyg.2015.00223
Popper, K. R. (1979). Objective knowledge: An evolutionary approach. Oxford University Press. (Original work published 1972)
Posit Team. (n.d.). RStudio: Integrated development environment for R [Computer software]. Posit Software. http://www.posit.co
R Core Team. (n.d.). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org
Sistema de Vigilância Alimentar e Nutricional. (n.d.). Relatórios de acesso público [Data set]. Retrieved March 19, 2023, from https://sisaps.saude.gov.br/sisvan/relatoriopublico/
Wickham, H. (2023, February 23). The tidy tools manifesto. Tidyverse. https://tidyverse.tidyverse.org/articles/manifesto.html
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz