Differences Among Brazilian Children Aged 2 to 4 in Ultra-Processed Food Consumption in 2022 Between Municipalities in Clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R)

Daniel Vartanian

Author

Daniel Vartanian

Published

February 13, 2026

Overview

This report contains a data analysis exercise for the course An Introduction to the R Programming Language.

The analysis check for potential differences in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).

This exercise is for educational purposes only.

The data used in this exercise requires further cleaning and validation before it can be used in real-world applications. For the purposes of this analysis, the data is assumed to be valid, reliable, and to satisfy all assumptions underlying the statistical tests performed, even though this may not hold in practice.

Please note that the results of the statistical tests may not be valid due to these simplifications.

In real-world scenarios, always ensure that the assumptions of statistical tests are rigorously checked and validated before interpreting the results.

Problem

Ultra-processed foods (UPF) are industrial formulations typically high in added sugars, unhealthy fats, and salt, while being low in essential nutrients (Monteiro et al., 2018; Monteiro et al., 2019). Their consumption has been linked to various adverse health outcomes, including obesity (Louzada et al., 2015), cardiovascular diseases (Mendonça et al., 2017), and metabolic disorders (Lavigne-Robichaud et al., 2018).

Although the consumption of UPF has been increasing globally, there is limited research on how it varies across different regions and socio-economic contexts, particularly among children. The Revised Multidimensional Index for Sustainable Food Systems (MISFS-R) (Figure 1) provides a framework for assessing the sustainability of food systems at a subnational level in Brazil, incorporating local behaviors and practices (Carvalho et al., 2021; Norde et al., 2023). Understanding the relationship between MISFS-R clusters and UPF consumption can inform targeted interventions to promote healthier dietary patterns among children.

Figure 1: Dendrogram showing hierarchical cluster analysis of Brazilian states based on Revised Multidimensional Index for Sustainable Food Systems (MISFS-R) indicators and geographical location.

Question

This analysis seeks to address the following question:

Was there a meaningful difference in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R)?

Methods

Approach

This study employed Popper’s hypothetical-deductive method, also known as the method of conjecture and refutation (Popper, 1972/1979, p. 164), as its problem-solving approach. Procedurally, it applied an enhanced version of Null Hypothesis Significance Testing (NHST), grounded on the original ideas of Neyman-Pearson framework for data testing (Neyman & Pearson, 1928a, 1928b; Perezgonzalez, 2015).

Source of Data

The data used in this analysis were sourced from the Food and Nutrition Surveillance System (SISVAN) of the Brazilian Ministry of Health (MS) (Sistema de Vigilância Alimentar e Nutricional & Ministério da Saúde, n.d.).

Data Munging

The data munging follow the data science workflow outlined by Wickham et al. (2023). All processes were made using the Quarto publishing system (Allaire et al., n.d.), the R programming language (R Core Team, n.d.), and several R packages.

For data manipulation and workflow, priority was given to packages from the Tidyverse, Tidymodels, and rOpenSci frameworks, as well as other packages adhering to the tidy tools manifesto (Wickham, 2023).

Figure 2: Data science workflow proposed by Wickham, Çetinkaya-Runde, and Grolemund.

Data Analysis

The analysis employed a bilateral t-test for independent groups with a randomization-based empirical null distribution. Summary tables and visual inspections were conducted to explore patterns in the data.

Furthermore, an a priori power analysis and effect-size estimation were performed to evaluate the statistical robustness and practical significance of the findings.

Data Validation

Data from municipalities with fewer than 10 monitored children were excluded to ensure reliable estimates. Additionally, municipalities where the number of children consuming ultra-processed foods (UPF) exceeded the total number of monitored children were removed, as this indicates data inconsistencies.

This does not guarantee the validity of the data, but it helps to mitigate some potential issues.

Hypothesis Testing

The analysis tested whether the means of ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 differed meaningfully between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).

To ensure practical significance, a Minimum Effect Size (MES) criterion was applied, following the original Neyman-Pearson framework for hypothesis testing (Neyman & Pearson, 1928a, 1928b; Perezgonzalez, 2015). The MES was set at Cohen’s threshold for small effects (Cohen’s \(d\) = 0.2) (Cohen, 1988). A difference was considered meaningful only if its effect size was greater than or equal to the MES; otherwise, it was considered negligible.

The test was structured as follows:

Null Hypothesis (\(\text{H}_{0}\)): Ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 does not differ meaningfully between municipalities in MISFS-R clusters B and D, indicated by Cohen’s \(d\) effect-size statistic being smaller than 0.2 (negligible).
Alternative Hypothesis (\(\text{H}_{1}\)): Ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 differs meaningfully between municipalities in MISFS-R clusters B and D, indicated by Cohen’s \(d\) effect-size statistic being greater or equal than 0.2 (non-negligible).

Formally:

\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} \\ \text{H}_{1}: \mu_{A} \neq \mu_{B} \\ \end{cases} \]

\[ \begin{cases} \text{H}_{0}: \text{Cohen's} \ d < \text{MES} \\ \text{H}_{1}: \text{Cohen's} \ d \geq \text{MES} \\ \end{cases} \]

The hypothesis test is conditioned on a type-I error (\(\alpha\)) of 0.05 and a minimum statistical power (1 - \(\beta\)) of 0.8. This means the test should have at least an 80% probability of correctly rejecting the null hypothesis when it is false, thereby minimizing the risk of a type-II error (\(\beta\)).

Code Style

The Tidyverse tidy tools manifesto (Wickham et al., 2023), code style guide (Wickham, n.d.-a) and design principles (Wickham, n.d.-b) were followed to ensure consistency and enhance readability.

Reproducibility

The pipeline is fully reproducible and can be run again at any time. To ensure consistent results, the renv package (Ushey & Wickham, 2025) was used to manage and restore the R environment. See the README file in the code repository to learn how to run it.

Set the Environment

Load Packages

library(brandr)
library(dplyr)
library(effectsize)
library(forcats)
library(fs)
library(ggplot2)
library(gt)
library(here)
library(htmltools)
library(httr2)
library(infer)
library(janitor)
library(labelled)
library(magrittr)
library(nanoparquet)
library(patchwork)
library(pwr)
library(pwrss)
library(readr)
library(readxl)
library(stringr)
library(summarytools)
library(tidyr)

Set Data Directories

raw_data_dir <- here("data-raw")
data_dir <- here("data")

for (i in c(raw_data_dir, data_dir)) {
  if (!dir.exists(i)) {
    dir.create(i)
  }
}

Perform an a priori Power Analysis

pwr_analysis <- pwr.t.test(
  d = 0.2,
  sig.level = 0.05,
  power = 0.8,
  type = "two.sample",
  alternative = "two.sided"
)

pwr_analysis
#> 
#>      Two-sample t test power calculation 
#> 
#>               n = 393.405696
#>               d = 0.2
#>       sig.level = 0.05
#>           power = 0.8
#>     alternative = two.sided
#> 
#> NOTE: n is number in *each* group

Code

pwr_analysis <- pwrss.t.2means(
  mu1 = 0.2, # Cohen's d for small effect sizes
  mu2 = 0,
  power = 0.8,
  alpha = 0.05,
  welch.df = TRUE,
  alternative = "not equal"
)
#> +--------------------------------------------------+
#> |             SAMPLE SIZE CALCULATION              |
#> +--------------------------------------------------+
#> 
#> Welch's T-Test (Independent Samples)
#> 
#> ---------------------------------------------------
#> Hypotheses
#> ---------------------------------------------------
#>   H0 (Null Claim) : d - null.d = 0 
#>   H1 (Alt. Claim) : d - null.d != 0 
#> 
#> ---------------------------------------------------
#> Results
#> ---------------------------------------------------
#>   Sample Size            = 394 and 394  <<
#>   Type 1 Error (alpha)   = 0.050
#>   Type 2 Error (beta)    = 0.199
#>   Statistical Power      = 0.801

power.t.test(
  ncp = pwr_analysis$ncp,
  df = pwr_analysis$df,
  alpha = pwr_analysis$parms$alpha,
  alternative = "two.sided",
  plot = TRUE,
  verbose = FALSE
)

Download Data

request("https://sisaps.saude.gov.br") |>
  req_url_path_append("sisvan") |>
  req_url_path_append("public") |>
  req_url_path_append("file") |>
  req_url_path_append("relatorios") |>
  req_url_path_append("consumo") |>
  req_url_path_append("2oumais") |>
  req_url_path_append("entre2a4anos") |>
  req_url_path_append("CONS_ULTRA.xlsx") |>
  req_perform(
    path = here(raw_data_dir, "CONS_ULTRA.xlsx")
  )

Import Data

data <-
  here(raw_data_dir, "CONS_ULTRA.xlsx") |>
  read_xlsx(
    sheet = "2022",
    skip = 1,
    col_types = "text"
  )

data |> glimpse()
#> Rows: 3,158
#> Columns: 6
#> $ UF            <chr> "AC", "AC", "AC", "AC", "AC", "AC", "AC", "AC", "AC",…
#> $ `Código IBGE` <chr> "120001", "120005", "120010", "120013", "120020", "12…
#> $ Município     <chr> "ACRELÂNDIA", "ASSIS BRASIL", "BRASILÉIA", "BUJARI", …
#> $ Total         <chr> "7", "88", "141", "73", "230", "4", "148", "2", "23",…
#> $ `%`           <chr> "0.875", "0.77192982456140347", "0.92156862745098034"…
#> $ ...6          <chr> "8", "114", "153", "82", "267", "5", "205", "2", "27"…

Tidy Data

Rename Columns

data <-
  data |>
  clean_names() |>
  rename(
    municipality_code = codigo_ibge,
    municipality = municipio,
    federal_unit = uf,
    n_upf = total,
    n_upf_per = percent,
    n_monitored = x6
  )

data |> glimpse()
#> Rows: 3,158
#> Columns: 6
#> $ federal_unit      <chr> "AC", "AC", "AC", "AC", "AC", "AC", "AC", "AC", "…
#> $ municipality_code <chr> "120001", "120005", "120010", "120013", "120020",…
#> $ municipality      <chr> "ACRELÂNDIA", "ASSIS BRASIL", "BRASILÉIA", "BUJAR…
#> $ n_upf             <chr> "7", "88", "141", "73", "230", "4", "148", "2", "…
#> $ n_upf_per         <chr> "0.875", "0.77192982456140347", "0.92156862745098…
#> $ n_monitored       <chr> "8", "114", "153", "82", "267", "5", "205", "2", …

Standardize Columns

data <-
  data |>
  mutate(
    year = 2022 |> as.integer(),
    municipality_code = municipality_code |> as.integer(),
    federal_unit = federal_unit |> as.factor(),
    n_upf = n_upf |> as.integer(),
    n_monitored = n_monitored |> as.integer()
  )

data |> glimpse()
#> Rows: 3,158
#> Columns: 7
#> $ federal_unit      <fct> AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, A…
#> $ municipality_code <int> 120001, 120005, 120010, 120013, 120020, 120025, 1…
#> $ municipality      <chr> "ACRELÂNDIA", "ASSIS BRASIL", "BRASILÉIA", "BUJAR…
#> $ n_upf             <int> 7, 88, 141, 73, 230, 4, 148, 2, 23, 162, 2, 168, …
#> $ n_upf_per         <chr> "0.875", "0.77192982456140347", "0.92156862745098…
#> $ n_monitored       <int> 8, 114, 153, 82, 267, 5, 205, 2, 27, 191, 2, 190,…
#> $ year              <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…

Relocate Columns

data <-
  data |>
  relocate(year) |>
  relocate(federal_unit, .after = municipality)

data |> glimpse()
#> Rows: 3,158
#> Columns: 7
#> $ year              <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
#> $ municipality_code <int> 120001, 120005, 120010, 120013, 120020, 120025, 1…
#> $ municipality      <chr> "ACRELÂNDIA", "ASSIS BRASIL", "BRASILÉIA", "BUJAR…
#> $ federal_unit      <fct> AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, A…
#> $ n_upf             <int> 7, 88, 141, 73, 230, 4, 148, 2, 23, 162, 2, 168, …
#> $ n_upf_per         <chr> "0.875", "0.77192982456140347", "0.92156862745098…
#> $ n_monitored       <int> 8, 114, 153, 82, 267, 5, 205, 2, 27, 191, 2, 190,…

Validate Data

Filter Data

data <-
  data |>
  filter(
    !(n_monitored < 10),
    !(n_upf > n_monitored)
  )

data |> glimpse()
#> Rows: 2,419
#> Columns: 7
#> $ year              <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
#> $ municipality_code <int> 120005, 120010, 120013, 120020, 120030, 120034, 1…
#> $ municipality      <chr> "ASSIS BRASIL", "BRASILÉIA", "BUJARI", "CRUZEIRO …
#> $ federal_unit      <fct> AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, A…
#> $ n_upf             <int> 88, 141, 73, 230, 148, 23, 162, 168, 52, 70, 243,…
#> $ n_upf_per         <chr> "0.77192982456140347", "0.92156862745098034", "0.…
#> $ n_monitored       <int> 114, 153, 82, 267, 205, 27, 191, 190, 58, 74, 284…

Transform Data

Recalculate UPF Consumption Percentage

data <-
  data |>
  mutate(
    n_upf_per = n_upf |>
      divide_by(n_monitored) |>
      multiply_by(100)
  )

data |> glimpse()
#> Rows: 2,419
#> Columns: 7
#> $ year              <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
#> $ municipality_code <int> 120005, 120010, 120013, 120020, 120030, 120034, 1…
#> $ municipality      <chr> "ASSIS BRASIL", "BRASILÉIA", "BUJARI", "CRUZEIRO …
#> $ federal_unit      <fct> AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, A…
#> $ n_upf             <int> 88, 141, 73, 230, 148, 23, 162, 168, 52, 70, 243,…
#> $ n_upf_per         <dbl> 77.19298246, 92.15686275, 89.02439024, 86.1423221…
#> $ n_monitored       <int> 114, 153, 82, 267, 205, 27, 191, 190, 58, 74, 284…

Define MISFS

data <-
  data |>
  mutate(
    misfs = federal_unit |>
      recode_values(
        c("AC", "GO", "MS", "MT", "RO", "TO") ~ "A",
        c("ES", "MG", "PR", "RJ", "RS", "SC", "SP") ~ "B",
        c("AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE") ~ "C",
        c("AM", "AP", "PA", "RR") ~ "D"
      ) |>
      factor(levels = c("A", "B", "C", "D"))
  ) |>
  relocate(misfs, .after = federal_unit)

data |> glimpse()
#> Rows: 2,419
#> Columns: 8
#> $ year              <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
#> $ municipality_code <int> 120005, 120010, 120013, 120020, 120030, 120034, 1…
#> $ municipality      <chr> "ASSIS BRASIL", "BRASILÉIA", "BUJARI", "CRUZEIRO …
#> $ federal_unit      <fct> AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, AC, A…
#> $ misfs             <fct> A, A, A, A, A, A, A, A, A, A, A, A, A, C, C, C, C…
#> $ n_upf             <int> 88, 141, 73, 230, 148, 23, 162, 168, 52, 70, 243,…
#> $ n_upf_per         <dbl> 77.19298246, 92.15686275, 89.02439024, 86.1423221…
#> $ n_monitored       <int> 114, 153, 82, 267, 205, 27, 191, 190, 58, 74, 284…

Filter MISFS

data <-
  data |>
  filter(misfs %in% c("B", "D"))

data |> glimpse()
#> Rows: 1,426
#> Columns: 8
#> $ year              <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
#> $ municipality_code <int> 130002, 130006, 130008, 130010, 130014, 130030, 1…
#> $ municipality      <chr> "ALVARÃES", "AMATURÁ", "ANAMÃ", "ANORI", "APUÍ", …
#> $ federal_unit      <fct> AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, A…
#> $ misfs             <fct> D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D…
#> $ n_upf             <int> 30, 282, 215, 52, 165, 934, 14, 360, 52, 134, 281…
#> $ n_upf_per         <dbl> 81.08108108, 82.69794721, 88.84297521, 91.2280701…
#> $ n_monitored       <int> 37, 341, 242, 57, 172, 1148, 14, 438, 157, 160, 3…

Create Data Dictionary

Prepare Metadata

metadata <-
  data |>
  `var_label<-`(
    list(
      year = "Year of the data collection",
      municipality_code = "Municipality code",
      municipality = "Municipality name",
      federal_unit = "State abbreviation (Federal Unit)",
      misfs = paste0(
        "Revised Multidimensional Index for Sustainable Food Systems (MISFS-R)",
        "cluster"
      ),
      n_upf = "Number of children that consumed ultra-processed foods (UPF)",
      n_upf_per = paste0(
        "Percentage of children that consumed ultra-processed foods (UPF)"
      ),
      n_monitored = "Number of monitored children"
    )
  ) |>
  generate_dictionary(details = "full") |>
  convert_list_columns_to_character()

Visualize Final Data

metadata |> glimpse()
#> Rows: 8
#> Columns: 14
#> $ pos           <int> 1, 2, 3, 4, 5, 6, 7, 8
#> $ variable      <chr> "year", "municipality_code", "municipality", "federal…
#> $ label         <chr> "Year of the data collection", "Municipality code", "…
#> $ col_type      <chr> "int", "int", "chr", "fct", "fct", "int", "dbl", "int"
#> $ missing       <int> 0, 0, 0, 0, 0, 0, 0, 0
#> $ levels        <chr> "", "", "", "AC; AL; AM; AP; BA; CE; DF; ES; GO; MA; …
#> $ value_labels  <chr> "", "", "", "", "", "", "", ""
#> $ class         <chr> "integer", "integer", "character", "factor", "factor"…
#> $ type          <chr> "integer", "integer", "character", "integer", "intege…
#> $ na_values     <chr> "", "", "", "", "", "", "", ""
#> $ na_range      <chr> "", "", "", "", "", "", "", ""
#> $ n_na          <int> 0, 0, 0, 0, 0, 0, 0, 0
#> $ unique_values <int> 1, 1426, 1419, 11, 2, 383, 854, 417
#> $ range         <chr> "2022 - 2022", "130002 - 432360", "ABADIA DOS DOURADO…

metadata

data |> glimpse()
#> Rows: 1,426
#> Columns: 8
#> $ year              <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2…
#> $ municipality_code <int> 130002, 130006, 130008, 130010, 130014, 130030, 1…
#> $ municipality      <chr> "ALVARÃES", "AMATURÁ", "ANAMÃ", "ANORI", "APUÍ", …
#> $ federal_unit      <fct> AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, AM, A…
#> $ misfs             <fct> D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D…
#> $ n_upf             <int> 30, 282, 215, 52, 165, 934, 14, 360, 52, 134, 281…
#> $ n_upf_per         <dbl> 81.08108108, 82.69794721, 88.84297521, 91.2280701…
#> $ n_monitored       <int> 37, 341, 242, 57, 172, 1148, 14, 438, 157, 160, 3…

data

Save Data

Clean Old Data Files

if (dir.exists(data_dir)) {
  dir_delete(data_dir)
  dir_create(data_dir, recurse = TRUE)
}

Write Data

valid_file_pattern <- "sisvan-upf-2022"

data |>
  write_csv(
    here(data_dir, paste0(valid_file_pattern, ".csv"))
  )

data |>
  write_rds(
    here(data_dir, paste0(valid_file_pattern, ".rds"))
  )

data |>
  write_parquet(
    here(data_dir, paste0(valid_file_pattern, ".parquet"))
  )

Explore Data

Visualize a Random Sample of the Data

Code

data |> sample_n(100)

Visualize Distributions

Code

data |>
  descr(
    var = n_upf_per,
    style = "rmarkdown",
    plain.ascii = FALSE,
    headings = FALSE
  )

Table 1: Descriptive statistics representing the distribution of the percentage of children aged 2 to 4 in 2022 who consumed ultra-processed foods (UPF) between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).

	n_upf_per
Mean	84.09
Std.Dev	14.21
Min	0.00
Q1	80.00
Median	87.17
Q3	92.59
Max	100.00
MAD	9.20
IQR	12.59
CV	0.17
Skewness	-2.59
SE.Skewness	0.06
Kurtosis	9.90
N.Valid	1426.00
N	1426.00
Pct.Valid	100.00

Source: Created by the author based on data sourced from the Food and Nutrition Surveillance System (SISVAN) of the Brazilian Ministry of Health (MS) (Sistema de Vigilância Alimentar e Nutricional & Ministério da Saúde, n.d.).

Code

plot_hist <-
  data |>
  ggplot(aes(x = n_upf_per)) +
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 30,
    color = "white"
  ) +
  labs(x = "Value", y = "Density") +
  geom_density(
    color = "red",
    linewidth = 1
  ) +
  theme(legend.position = "none")

plot_qq <-
  data |>
  ggplot(aes(sample = n_upf_per)) +
  stat_qq() +
  stat_qq_line(
    color = "red",
    linewidth = 1
  ) +
  labs(
    x = "Theoretical Quantiles (Std. Normal)",
    y = "Sample Quantiles"
  ) +
  theme(legend.position = "none")

plot_hist + plot_qq

Figure 3: Histogram with density curve and Q-Q plot showing the distribution of the percentage of children aged 2 to 4 in 2022 who consumed ultra-processed foods (UPF) between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).

Code

data |>
  select(n_upf_per) |>
  pivot_longer(everything()) |>
  ggplot(aes(x = name, y = value)) +
  geom_boxplot(
    outlier.color = "red",
    outlier.shape = 1,
    width = 0.75
  ) +
  geom_jitter(
    width = 0.375,
    alpha = 0.1,
    color = get_brand_color("black"),
    size = 0.5
  ) +
  coord_flip() +
  scale_fill_brand_d() +
  labs(y = "Value") +
  theme(
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )

Figure 4: Boxplot with jitter points showing the distribution of the percentage of children aged 2 to 4 in 2022 who consumed ultra-processed foods (UPF) in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).

Code

data |>
  freq(
    var = misfs,
    style = "rmarkdown",
    plain.ascii = FALSE,
    headings = FALSE
  )

Table 2: Frequencies representing the distribution of observations for children aged 2 to 4 in 2022 who consumed ultra-processed foods (UPF) between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).

	Freq	% Valid	% Valid Cum.	% Total	% Total Cum.
A	0	0.00	0.00	0.00	0.00
B	1300	91.16	91.16	91.16	91.16
C	0	0.00	91.16	0.00	91.16
D	126	8.84	100.00	8.84	100.00
<NA>	0			0.00	100.00
Total	1426	100.00	100.00	100.00	100.00

Source: Created by the author based on data sourced from the Food and Nutrition Surveillance System (SISVAN) of the Brazilian Ministry of Health (MS) (Sistema de Vigilância Alimentar e Nutricional & Ministério da Saúde, n.d.).

Code

data |>
  mutate(
    misfs = misfs |> fct_rev()
  ) |>
  ggplot(aes(y = misfs)) +
  geom_bar(fill = get_brand_color("blue")) +
  scale_y_discrete(drop = FALSE) +
  labs(
    x = "MISFS-R Cluster",
    y = "Frequency"
  )

Figure 5: Bar plot showing the frequency of observations for children aged 2 to 4 in 2022 who consumed ultra-processed foods (UPF) between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).

Visualize Combined Distributions

Code

data |>
  ggplot(
    aes(
      x = misfs,
      y = n_upf_per,
      fill = misfs
    )
  ) +
  geom_boxplot(
    outlier.color = "red"
  ) +
  geom_jitter(
    width = 0.375,
    alpha = 0.1,
    color = "black",
    size = 0.5
  ) +
  scale_fill_brand_d() +
  labs(
    x = "MISFS-R Cluster",
    y = "UPF consumption (%)",
    fill = NULL
  )

Figure 6: Boxplots with jitter points showing the distribution of the percentage of children aged 2 to 4 in 2022 who consumed ultra-processed foods (UPF) between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).

Code

data |>
  ggplot(
    aes(
      x = n_upf_per,
      fill = misfs
    )
  ) +
  geom_density(
    alpha = 0.5,
    position = "identity"
  ) +
  scale_fill_brand_d() +
  labs(
    x = "UPF consumption (%)",
    y = "Density",
    fill = "MISFS"
  )

Figure 7: Density plots showing the distribution of the percentage of children aged 2 to 4 in 2022 who consumed ultra-processed foods (UPF) between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).

Code

data |>
  ggplot(
    aes(
      sample = n_upf_per,
      color = misfs
    )
  ) +
  stat_qq() +
  stat_qq_line(color = "black") +
  scale_color_brand_d() +
  facet_wrap(vars(misfs)) +
  labs(
    x = "Theoretical Quantiles (Std. Normal)",
    y = "Sample Quantiles",
    color = "MISFS"
  )

Figure 8: Q-Q plots showing the distribution of the percentage of children aged 2 to 4 in 2022 who consumed ultra-processed foods (UPF) between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).

Summarize Data

Code

data |>
  group_by(misfs) |>
  summarize(
    n = n(),
    mean = mean(n_upf_per) / 100,
    sd = sd(n_upf_per) / 100,
    median = median(n_upf_per) / 100,
    iqr = IQR(n_upf_per) / 100
  ) |>
  pivot_longer(
    cols = -misfs,
    names_to = "Statistic",
    values_to = "Value"
  ) |>
  pivot_wider(
    names_from = misfs,
    values_from = Value
  ) |>
  mutate(
    Statistic = Statistic |>
      recode_values(
        "n" ~ "Sample Size",
        "mean" ~ "Mean",
        "sd" ~ "Standard Deviation",
        "median" ~ "Median",
        "iqr" ~ "Interquartile Range"
      )
  ) |>
  gt() |>
  tab_header(
    title = md("**Children Aged 2 to 4 UPF Consumption**"),
    subtitle = "Comparison between MISFS-R Clusters B and D in 2022"
  ) |>
  cols_label(
    Statistic = "",
    B = "Cluster B",
    D = "Cluster D"
  ) |>
  fmt_percent(
    columns = c(B, D),
    rows = !Statistic %in% c("Sample Size"),
    decimals = 2
  ) |>
  tab_footnote(
    footnote = paste0(
      "Statistics calculated removing municipalities with fewer than 10 ",
      "monitored children and those where the number of children consuming ",
      "UPFs exceeded the total number of monitored children."
    ),
    locations = cells_body(
      columns = Statistic,
      rows = !Statistic %in% c("Sample Size")
    ),
    placement = "right"
  ) |>
  opt_footnote_marks(marks = c("*", "+")) |>
  # tab_source_note(
  #   source_note = "Source: SISVAN, Brazilian Ministry of Health (2022)"
  # ) |>
  tab_options(
    table.width = pct(75),
    table.border.top.style = "solid",
    table.border.bottom.style = "solid",
    heading.border.bottom.style = "solid",
    column_labels.border.bottom.style = "solid",
    row_group.border.top.style = "solid",
    row_group.border.bottom.style = "solid",
    footnotes.border.lr.style = "solid",
    source_notes.border.bottom.style = "solid"
  ) |>
  opt_css(
    css = "
    .gt_footnote {
      text-align: left !important;
    }
    "
  )

Table 3: Percentage of children aged 2 to 4 in 2022 who consumed ultra-processed foods (UPF) between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R).

	Cluster B	Cluster D
Children Aged 2 to 4 UPF Consumption
Comparison between MISFS-R Clusters B and D in 2022
Sample Size	1300	126
Mean^*	84.17%	83.25%
Standard Deviation^*	14.38%	12.29%
Median^*	87.32%	85.18%
Interquartile Range^*	12.69%	13.70%
^* Statistics calculated removing municipalities with fewer than 10 monitored children and those where the number of children consuming UPFs exceeded the total number of monitored children.

Source: Created by the author based on data sourced from the Food and Nutrition Surveillance System (SISVAN) of the Brazilian Ministry of Health (MS) (Sistema de Vigilância Alimentar e Nutricional & Ministério da Saúde, n.d.).

Assess Model Assumptions

See Howell (2013) and Student (1908) to learn more about the assumptions of the t-test.

✅ Independence of observations.

❌ Normality of the distribution of the response variable (n_upf_per) within each group (misfs is the explanatory/independent variable).

⏭️ Homogeneity of variances between groups (only if using Student’s t-test; Welch’s t-test and the permutation approach below do not require this assumption).

Model Data

Calculate Observed Statistic

observed_statistic <-
  data |>
  specify(n_upf_per ~ misfs) |>
  hypothesize(null = "independence") |>
  calculate(
    stat = "t",
    order = c("B", "D")
  )

observed_statistic

Generate the Null Distribution

null_dist <-
  data |>
  specify(n_upf_per ~ misfs) |>
  hypothesize(null = "independence") |>
  generate(
    reps = 1000,
    type = "permute"
  ) |>
  calculate(
    stat = "t",
    order = c("B", "D")
  )

null_dist |>
  visualize() +
  shade_p_value(
    obs_stat = observed_statistic,
    direction = "two-sided"
  ) +
  labs(
    title = NULL,
    x = "t-statistic",
    y = "Frequency"
  )

Assess p-value

p_value <-
  null_dist |>
  get_p_value(
    obs_stat = observed_statistic,
    direction = "two-sided"
  )

p_value

Assess Effect Size

misfs_b <-
  data |>
  filter(misfs == "B") |>
  pull(n_upf_per)

misfs_d <-
  data |>
  filter(misfs == "D") |>
  pull(n_upf_per)

cohens_d(
  x = misfs_b,
  y = misfs_d,
  mu = 0,
  ci = 0.95,
  alternative = "two.sided"
) |>
  interpret_hedges_g(rules = "cohen1988")

Conclusion

As noted at the beginning of this exercise, the data and analysis are purely educational and should not be used to draw real-world conclusions without further validation. The data contain numerous issues and do not meet the assumptions of the statistical tests used, which compromise the reliability of the results. Therefore, any conclusions should be interpreted with caution.

Assuming the data were valid and the t-test assumptions were met, the analysis found no statistically significant difference in means (\(t\) = 0.789, \(p\)-value = 0.368). The observed effect size was very small and did not exceed the Minimum Effect Size (MES) threshold (Cohen’s \(d\) = 0.065, 95% CI [-0.118, 0.248]), indicating any potential difference is likely negligible or effectively zero within the 95% confidence interval bounds.

The power analysis revealed that the study lacked sufficient power to detect a small effect size (Cohen’s \(d\) = 0.2) with the given sample size, which may contribute to the non-significant result. While we cannot reject the null hypothesis, we also cannot confidently conclude that there is no meaningful difference in ultra-processed food consumption between the two clusters.

Based on these findings, we conclude that the data are insufficient to support a meaningful difference in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 between municipalities in clusters B and D. This does not confirm the absence of a difference, only that the study lacked sufficient statistical power to detect one. Future research with larger sample sizes may help clarify this relationship.

Citation

To cite this work, please use the following format:

Vartanian, D. (2026). An introduction to the R programming language: Class exercise [Report]. Center for Metropolitan Studies, University of São Paulo.

A BibLaTeX entry for LaTeX users is:

@report{vartanian2026,
  title = {An introduction to the R programming language: Class exercise},
  author = {{Daniel Vartanian}},
  year = {2026},
  address = {São Paulo},
  institution = {Center for Metropolitan Studies, University of São Paulo},
  langid = {en}

License

The original data sources may be subject to their own licensing terms and conditions.

The code in this report is licensed under the GNU General Public License Version 3, while the report is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International.

Copyright (C) 2026 Daniel Vartanian

The code in this report is free software: you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your option)
any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program. If not, see <https://www.gnu.org/licenses/>.

Acknowledgments

This work was developed with support from the Center for Metropolitan Studies (CEM) based at the School of Philosophy, Letters and Human Sciences (FFLCH) of the University of São Paulo (USP) and at the Brazilian Center for Analysis and Planning (CEBRAP).

This work was financed, in part, by the São Paulo Research Foundation (FAPESP), Brazil. Process Number 2025/17879-2.

This work was developed with support from the Sustentarea Research and Extension Group at the University of São Paulo (USP).

This work was financed, in part, by the National Council for Scientific and Technological Development (CNPq), Brazil. Process Number 383443/2024-5.

References

Allaire, J. J., Teague, C., Xie, Y., & Dervieux, C. (n.d.). Quarto [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.5960048

Carvalho, A. M. de, Verly Jr, E., Marchioni, D. M., & Jones, A. D. (2021). Measuring sustainable food systems in Brazil: A framework and multidimensional index to evaluate socioeconomic, nutritional, and environmental aspects. World Development, 143, 105470. https://doi.org/10.1016/j.worlddev.2021.105470

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.

Howell, D. C. (2013). Statistical methods for psychology (8th ed.). Wadsworth Cengage Learning.

Lavigne-Robichaud, M., Moubarac, J.-C., Lantagne-Lopez, S., Johnson-Down, L., Batal, M., Sidi, E. A. L., & Lucas, M. (2018). Diet quality indices in relation to metabolic syndrome in an Indigenous Cree (Eeyouch) population in northern Québec, Canada. Public Health Nutrition, 21(1), 172–180. https://doi.org/10.1017/S136898001700115X

Louzada, M. L. da C., Baraldi, L. G., Steele, E. M., Martins, A. P. B., Canella, D. S., Moubarac, J.-C., Levy, R. B., Cannon, G., Afshin, A., Imamura, F., Mozaffarian, D., & Monteiro, C. A. (2015). Consumption of ultra-processed foods and obesity in Brazilian adolescents and adults. Preventive Medicine, 81, 9–15. https://doi.org/10.1016/j.ypmed.2015.07.018

Mendonça, R. de D., Lopes, A. C. S., Pimenta, A. M., Gea, A., Martinez-Gonzalez, M. A., & Bes-Rastrollo, M. (2017). Ultra-processed food consumption and the incidence of hypertension in a mediterranean cohort: The seguimiento universidad de navarra project. American Journal of Hypertension, 30(4), 358–366. https://doi.org/10.1093/ajh/hpw137

Monteiro, C. A., Cannon, G., Levy, R. B., Moubarac, J.-C., Louzada, M. L., Rauber, F., Khandpur, N., Cediel, G., Neri, D., Martinez-Steele, E., Baraldi, L. G., & Jaime, P. C. (2019). Ultra-processed foods: What they are and how to identify them. Public Health Nutrition, 22(5), 936–941. https://doi.org/10.1017/S1368980018003762

Monteiro, C. A., Cannon, G., Moubarac, J.-C., Levy, R. B., Louzada, M. L. C., & Jaime, P. C. (2018). The UN Decade of Nutrition, the NOVA food classification and the trouble with ultra-processing. Public Health Nutrition, 21(1), 5–17. https://doi.org/10.1017/S1368980017000234

Neyman, J., & Pearson, E. S. (1928a). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A(1/2), 175–240. https://doi.org/10.2307/2331945

Neyman, J., & Pearson, E. S. (1928b). On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika, 20A(3/4), 263–294. https://doi.org/10.2307/2332112

Norde, M. M., Porciuncula, L., Garrido, G., Nunes-Galbes, N. M., Sarti, F. M., Marchioni, D. M. L., & de Carvalho, A. M. (2023). Measuring food systems sustainability in heterogenous countries: The Brazilian multidimensional index updated version applicability. Sustainable Development, 31(1), 91–107. https://doi.org/10.1002/sd.2376

Perezgonzalez, J. D. (2015). Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Frontiers in Psychology, 6. https://doi.org/10.3389/fpsyg.2015.00223

Popper, K. R. (1979). Objective knowledge: An evolutionary approach. Oxford University Press. (Original work published in 1972)

R Core Team. (n.d.). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.r-project.org

Sistema de Vigilância Alimentar e Nutricional, & Ministério da Saúde. (n.d.). Consumo de alimentos ultraprocessados 2015-2022 [Consumption of ultra-processed foods 2015-2022] [Dataset]. https://sisaps.saude.gov.br/sisvan/relatoriopublico

Student. (1908). The probable error of a mean. Biometrika, 6(1). https://doi.org/10.2307/2331554

Ushey, K., & Wickham, H. (2025). renv: Project environments [Computer software]. https://doi.org/10.32614/CRAN.package.renv

Wickham, H. (n.d.-a). The tidyverse style guide. https://style.tidyverse.org

Wickham, H. (n.d.-b). Tidy design principles. https://design.tidyverse.org

Wickham, H. (2023). The tidy tools manifesto. Tidyverse. https://tidyverse.tidyverse.org/articles/manifesto.html

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz