Main Objective
Provide a solid introduction to R programming with a focus on Tidyverse principles and its ecosystem.
Daniel Vartanian
February 2, 2026
This course will introduce you to the R programming language.

Here are the main topics:
(Artwork by Allison Horst)
Future courses?
An Introduction to Git and GitHub.
An introduction to Quarto and Reproducible Workflows.
An introduction to Geospatial Data Science and the r-spatial framework.
An introduction to Shiny dashboards.

(Artwork by Allison Horst)
The course will take place in Laboratory 122 of the School of Philosophy, Letters and Human Sciences (FFLCH) at the University of São Paulo (USP).
The classes are distributed over 5 days, with a total of 30 hours.
Monday (02/02) 10:00-17:00
Tuesday (03/02) 10:00-17:00
Wednesday (04/02) 10:00-17:00
Thursday (05/02) 10:00-17:00
Friday (06/02) 10:00-17:00
Theory ➡️ Practice
🏋 In-class activities
☕ 15-minute break between sessions
🍝 1-hour for lunch
📓 Final project
😌 No formal evaluation
Mistakes will happen. Don’t let them discourage you!

(Artwork by Allison Horst)
Don’t be afraid to ask questions, even if they seem simple. Everyone has to start somewhere!
An old but good way to get help is to search or ask questions on forums like Stack Overflow.
Today, you can also use AI-powered tools like Claude to get instant help with your coding questions.
💾 Bring a flash drive/pen drive
👩💻 Try to use the same computer for all classes
💬 Always open this presentation in the browser
❌ Don’t miss a class
🙋♂️ Ask questions
🤝 Help each other
🙏 Be kind
🎉 Have fun!
Always have this presentation open in your browser.
We will use it to navigate through the course content.
Use it to access the course materials and exercises.
Tip: Want to create QR Codes like this? Check the qrcode R package.
This is a programming course.
As such, we must understand the basics of how a computer works.
It is essential to start from materiality; otherwise, things stop making sense.
If you understand R at its core, you can navigate any challenge you encounter.

(Ariadne’s thread. Artwork by Eroshka)
The science of information processes and their interactions with the world (Denning, 2005).
Computer science studies information processes both artificial and natural (Denning, 2005). It seeks to build a scientific foundation for such topics as computer design, computer programming, information processing, algorithmic solutions of problems, and the algorithmic process itself (Brookshear & Brylow, 2020).

(Photo by Mark Richards)

(Artwork by Pablo Picasso – Le Taureau (1945-46))
Stored-Program Concept (1945) (AKA “von Neumann Architecture”).
First proposed by J. P. Eckert and his team (1944-1945).
Input -> Storage -> Processing -> Output
(Reproduced from Brookshear & Brylow (2020))
| Decimal | Binary | Trinary |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 10 | 2 |
| 3 | 11 | 10 |
| Decimal | Binary | Trinary |
|---|---|---|
| 01 + | 01 + | 01 + |
| 01 + | 01 + | 01 + |
| 01 = | 01 = | 01 = |
| 03 | 11 | 10 |
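You can check these base conversions in R with the base function strtoi(), which parses an integer written in a given base:

```r
# 1 + 1 + 1 is 3 in decimal, "11" in binary, and "10" in trinary
strtoi("11", base = 2L)  # 3
strtoi("10", base = 3L)  # 3
```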
| x | y | x ∧ y | x ∨ y | ¬x | ¬y |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 |
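The same truth table can be reproduced with R's vectorized logical operators (`&`, `|`, `!`):

```r
x <- c(0, 0, 1, 1) == 1  # FALSE FALSE TRUE TRUE
y <- c(0, 1, 0, 1) == 1  # FALSE TRUE FALSE TRUE

x & y  # AND (x ∧ y): FALSE FALSE FALSE TRUE
x | y  # OR  (x ∨ y): FALSE TRUE  TRUE  TRUE
!x     # NOT (¬x):    TRUE  TRUE  FALSE FALSE
```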
Tip 1: See this video by Iberê Thenório (2022), on the Manual do Mundo YouTube channel, to learn more (pt-BR).
Tip 2: Learn more about number bases and their history in Boyer & Merzbach (1968/2011).
(Adapted from Brookshear & Brylow (2020))
(Video by George Zaidan and Sajan Saini, on TED-ED)
An algorithm is a set of steps that defines how a task is performed.
A representation of an algorithm is called a program.
The process of developing a program, encoding it in machine-compatible form, and inserting it into a machine is called programming, or sometimes coding.
Programs, and the algorithms they represent, are collectively referred to as software, in contrast to the machinery itself, which is known as hardware.

(Brookshear & Brylow (2020, p. 18) | Artwork by WinWin artlab)
The instructions which govern this operation must be given to the device in absolutely exhaustive detail. […] All these procedures require the use of some code to express the logical and the algebraical definition of the problem under consideration, as well as the necessary numerical material (von Neumann, 1993).

(Photo by Corbis Historical/Getty Images)
A programming paradigm is a specific approach to the programming process (Brookshear & Brylow, 2020).

(Adapted from Kahanwal (2013))

(Artwork by Calltutors)
R is an excellent language to get started with programming because it’s specialized. Unlike some other languages, it doesn’t have an overwhelming amount of features for you to learn.

(Artwork by Allison Horst)
In a general sense, Python is also good for learning how to program, but it is much easier to learn how to work with data in R. In academia, both programming languages are very important.

Programming in movies versus programming in real life:


R is a free and open-source programming language designed for data analysis and graphics (Ihaka & Gentleman, 1996).
It was developed by Ross Ihaka and Robert Gentleman in the statistics department at the University of Auckland (New Zealand) in 1991 and introduced to the scientific community in 1993 (Peng, 2022).

(Robert Gentleman (Left) and Ross Ihaka (Right). Photos by an Unknown Author.)


(Artworks by Allison Horst)
Installing R is very simple.
You just need to download the installer from the Comprehensive R Archive Network (CRAN).
Go to Download R for [YOUR OS] and then click on the base link (the base version of R).
Simply follow the instructions of the installer.
An IDE (Integrated Development Environment) is software that provides a set of tools to help programmers write code. It brings together things like a text editor, a console, a debugger, and a file manager.
Positron is a flavour of Visual Studio Code, the most widely used IDE for programming in any language. It supersedes RStudio, for many years the main IDE for R, and focuses on providing a better experience for programming in R and Python.
Note: Always use technical software in English.
(Reproduced from Posit)
A project consolidates all related files and resources in one place, ensuring that your work is reproducible and well-organized.
It is important to have a structured project directory. A best practice is to follow the structure made for R Packages (Marwick et al., 2018).
Every data project must be accompanied, at least, by a README file, explaining the project, and a LICENSE file, defining the terms of use of the data (See Open Source Licenses here).

(Reproduced from Marwick et al. (2018))
You can use the usethis R package to create these files.
First, load the package using:
Add a README file using:
Add a GNU General Public License version 3 file using:
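Putting those steps together (a sketch, assuming the usethis package is installed and you are inside an R project):

```r
library(usethis)

use_readme_md()     # Creates a README.md in the project root
use_gpl3_license()  # Creates a LICENSE file with the GPL v3 terms
```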
(Artwork by Allison Horst)
Markdown is a lightweight markup language with plain text formatting syntax. The great thing about Markdown is that you can forget about the formatting and focus on the content.
Quarto Notebooks are a type of Markdown document that allows you to mix code, text, and output in a single document. These slides are all written in Markdown.
Learn more about the Markdown syntax here.
Everything in R is an object.
Everything that happens in R is the result of a function call.
Scalars (0D): 1
Vectors (1D): c(1, 2, 3)
Matrices (2D): matrix(1:9, nrow = 3, ncol = 3)
Arrays (nD): array(1:27, dim = c(3, 3, 3))
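For instance, each structure can be created and inspected like this:

```r
v <- c(1, 2, 3)                       # Vector (1D)
m <- matrix(1:9, nrow = 3, ncol = 3)  # Matrix (2D), filled column-wise
a <- array(1:27, dim = c(3, 3, 3))    # Array (3D)

length(v)  # 3
dim(m)     # 3 3
dim(a)     # 3 3 3
m[2, 3]    # 8 (row 2, column 3)
```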

(Artwork by Kanishk Srivastava)
In R, most operations are vectorized, meaning they operate on all elements of a vector at once, without needing explicit loops.
This makes code more concise, readable, and less error-prone, while improving efficiency compared to non-vectorized languages.
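For example, converting a whole vector of temperatures to Fahrenheit takes a single expression:

```r
temperatures_c <- c(18, 22, 14, 27)

# Arithmetic applies to every element at once: no loop required
temperatures_c * 9 / 5 + 32  # 64.4 71.6 57.2 80.6

# Comparisons are vectorized too
temperatures_c > 20          # FALSE TRUE FALSE TRUE
```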

(Artwork by Kanishk Srivastava)
Date (e.g., 2023-01-01) (linear time)
POSIXct (e.g., 2023-01-01 00:00:00) (linear time)
Interval (e.g., 2023-01-01 00:00:00–2023-12-15 15:40:00) (linear time)
Duration (e.g., 2 months, 3 days) (linear time)
Period (e.g., 1 year, 2 months, 3 days) (linear(ish) time)
hms (e.g., 01:30:00) (circular time)
And much more…
Non-atomic types are objects that can be composed of different data types.
They can also be recursive objects, meaning they can contain themselves as elements.
list
data.frame (a special case of list)

Left Hand Side Operator Right Hand Side

(Artwork by Allison Horst)
R has strict rules regarding the names of objects.
Names can contain letters, numbers, dots (.), and underscores (_), but cannot start with a number or an underscore.
variable-name Bad (kebab-case)
variable.name Good, but not advisable
variable_name Good (snake_case), most used in R
1variable_name Bad
_variable_name Bad
.1variable_name Bad
variableName Good (camelCase)
VariableName Good (PascalCase)
TRUE Bad (reserved word)
(Reproduced from Hartnett (2021))
Atomic
`[`() 1-level extraction

Data Frames
x[i, ] Extract row i
x[, j] Extract column/variable j
x[i, j] Extract row i from column/variable j
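Using the built-in mtcars data frame as an example:

```r
x <- mtcars   # A built-in data frame

x[1, ]        # First row
x[, "mpg"]    # The mpg column, as a vector
x[1, "mpg"]   # mpg of the first row: 21
x[["mpg"]]    # The same column, via 1-level (list-style) extraction
```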
NA (Not Available)
Missing values must be explicitly declared in R. For that R uses the NA value.
NA comes in different flavors, for example: NA_integer_, NA_real_, NA_character_, NA_complex_.
If you use just NA, R will use the most appropriate type.
NaN (Not a Number)
Tip: See the naniar package.
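A quick illustration of how NA and NaN behave:

```r
x <- c(1, NA, 3)

is.na(x)               # FALSE TRUE FALSE
mean(x)                # NA: missing values propagate
mean(x, na.rm = TRUE)  # 2: remove NAs before computing

0 / 0                  # NaN: an undefined numerical operation
is.nan(0 / 0)          # TRUE
```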
Click here to learn the differences between the base R and magrittr pipes.
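In practice, the base pipe |> rewrites nested calls as a left-to-right chain:

```r
# Nested calls read inside-out...
sqrt(mean(c(1, 2, 3, 4)))

# ...while the pipe reads left to right
c(1, 2, 3, 4) |> mean() |> sqrt()  # 1.581139
```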
🚨 Only if necessary 🚨
Your code must speak for itself.
In data analysis code, only use comments to record important things. If you need comments to explain what your code is doing, consider rewriting your code to be clearer (Wickham, n.d.-a).
One of the most important things in programming is knowing how to find help. In R, you can use the help() function to get help on a function or package.
Since R has a more academic focus, documentation is usually plentiful and very well written.
Documentation websites: most R packages have a dedicated documentation website.
Write an R script that asks the user to select a traffic light color (red, yellow, or green). Based on the user’s selection, the script should print the appropriate action:
Expected output:
Answer:
# color <- menu(
# choices = c("red", "yellow", "green"),
# title = "Select a traffic light color:"
# )
color <- 2 # For demonstration purposes
if (color == 1) {
print("Stop!")
} else if (color == 2) {
print("Slow down!")
} else if (color == 3) {
print("Go!")
} else {
print("Invalid selection")
}
#> [1] "Slow down!"

🚨 Avoid loops 🚨 (if you can). Use functionals instead.
(for Loops)
Write an R script that goes through temperatures for 7 days and gives weather advice for each day.
Use a for loop to go through each temperature.
For each day, use if/else to give advice.
x: Cold - Wear a coat!
x: Nice - Perfect weather!
x: Hot - Stay hydrated!
Replace x with the day number (1-7).
Use seq_along() to iterate over the indices of the temperatures vector.
Expected output:
temperatures <- c(18, 22, 14, 27, 7, 21, 31)
#>
#> [Your code goes here!]
#>
#> [1] "Day 1: Nice - Perfect weather!"
#> [1] "Day 2: Nice - Perfect weather!"
#> [1] "Day 3: Cold - Wear a coat!"
#> [1] "Day 4: Hot - Stay hydrated!"
#> [1] "Day 5: Cold - Wear a coat!"
#> [1] "Day 6: Nice - Perfect weather!"
#> [1] "Day 7: Hot - Stay hydrated!"

Solution:
temperatures <- c(18, 22, 14, 27, 7, 21, 31)
for (i in seq_along(temperatures)) {
i_temperature <- temperatures[i]
if (i_temperature < 18) {
paste0("Day ", i, ": Cold - Wear a coat!") |> print()
} else if (i_temperature <= 23) {
paste0("Day ", i, ": Nice - Perfect weather!") |> print()
} else {
paste0("Day ", i, ": Hot - Stay hydrated!") |> print()
}
}
#> [1] "Day 1: Nice - Perfect weather!"
#> [1] "Day 2: Nice - Perfect weather!"
#> [1] "Day 3: Cold - Wear a coat!"
#> [1] "Day 4: Hot - Stay hydrated!"
#> [1] "Day 5: Cold - Wear a coat!"
#> [1] "Day 6: Nice - Perfect weather!"
#> [1] "Day 7: Hot - Stay hydrated!"

(while Loops)
Write an R script that keeps asking the user to guess a secret number until they get it right or run out of attempts.
Set a secret number.
Give the user 5 attempts to guess.
Use a while loop that continues as long as the user hasn’t guessed the number and still has attempts left.
Inside the loop:
Use sample() to set a secret number between 1 and 100.
Use readline() to ask for a guess and as.numeric() to convert it to a number.
Expected behavior:
secret <- sample(1:100, 1)
attempts <- 5
correct <- FALSE
#>
#> [Your code goes here!]
#>
#> [1] "You have 5 attempts left."
#> Enter your guess: 50
#> [1] "Too high! Try again."
#> [1] "You have 4 attempts left."
#> Enter your guess: 30
#> [1] "Too high! Try again."
#> [1] "You have 3 attempts left."
#> Enter your guess: 20
#> [1] "Too high! Try again."
#> [1] "You have 2 attempts left."
#> Enter your guess: 10
#> [1] "Correct! You won!"

Solution:
secret <- sample(1:100, 1)
attempts <- 5
correct <- FALSE
while (attempts > 0 && !correct) {
paste("You have", attempts, "attempts left.") |> print()
guess <-
readline(prompt = "Enter your guess: ") |>
as.numeric()
if (guess == secret) {
correct <- TRUE
} else if (guess < secret) {
print("Too low! Try again.")
} else {
print("Too high! Try again.")
}
attempts <- attempts - 1
}
if (correct) {
print("Correct! You won!")
} else {
print(paste("Game over! The secret number was", secret))
}

Write a function that checks if a password meets basic security requirements.
Create a function called validate_password that takes one parameter: password.
Check if the password is a character string and has at least 8 characters.
The function should return:
Test your function with different inputs to ensure it works correctly.
Use is.character() to check if the password is a character string.
Use nchar() to get the number of characters in the password.
Use stop() to throw an error if the password is invalid.
Expected behavior:
Solution:
validate_password <- function(password) {
if (!is.character(password)) {
stop("Password must be text")
} else if (nchar(password) < 8) {
stop("Password too short")
} else {
message("Valid password")
}
}
validate_password(12345678)
#> Error:
#> ! Password must be text
validate_password("hello")
#> Error:
#> ! Password too short
validate_password("hello123")
#> Valid password

A functional is a function that takes a function as an input and returns a vector as output (Wickham, 2019).
It makes the programmer’s work easier by applying a function to a vector without needing a loop. These functions are typically written in C (a lower-level programming language), making them very fast.
R has native functionals (e.g., lapply()), but prefer solutions from the purrr package from Tidyverse.
# Hypothetical helpers (the originals were not shown): both add 1 to each element
with_for_loop <- function(x) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) out[[i]] <- x[[i]] + 1
  out
}
with_map <- function(x) purrr::map(x, \(i) i + 1)
library(microbenchmark)
microbenchmark(
  with_for_loop(1:1000),
  with_map(1:1000),
  times = 100,
  check = "equal"
)
#> Unit: microseconds
#> expr min lq mean median uq max
#> with_for_loop(1:1000) 1054.845 1195.9870 1623.9336 1252.052 1679.917 5111.942
#> with_map(1:1000) 335.771 361.8655 433.7118 396.855 451.498 1626.814
#> neval cld
#> 100 a
#> 100 b

Use the map() function, from the purrr package, to adjust student grades.
Load the purrr library using library(purrr).
Use the following test scores from three different classes.
Expected output:
library(purrr)
test_scores <- list(
class_a = c(72, 85, 68, 90, 77),
class_b = c(65, 70, 62, 75, 68),
class_c = c(88, 92, 85, 95, 90)
)
#>
#> [Your code goes here!]
#>
#> $class_a
#> [1] 84.7 99.0 80.3 104.5 90.2
#>
#> $class_b
#> [1] 77.0 82.5 73.7 88.0 80.3
#>
#> $class_c
#> [1] 102.3 106.7 99.0 110.0 104.5

Solution:
library(purrr)
test_scores <- list(
class_a = c(72, 85, 68, 90, 77),
class_b = c(65, 70, 62, 75, 68),
class_c = c(88, 92, 85, 95, 90)
)
test_scores |>
map(\(x) x + 5) |>
map(\(x) x * 1.1)
#> $class_a
#> [1] 84.7 99.0 80.3 104.5 90.2
#>
#> $class_b
#> [1] 77.0 82.5 73.7 88.0 80.3
#>
#> $class_c
#> [1] 102.3 106.7 99.0 110.0 104.5

Environments are the data structures that power scoping (Wickham, 2019).
Scoping: The act of finding the value associated with a name.
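A small example of R's lexical scoping: when a name is not found in a function's own environment, R looks it up in the enclosing environment.

```r
x <- 10  # Defined in the global environment

f <- function() x + 1  # No local x: R finds the global one
f()  # 11

g <- function() {
  x <- 100  # A local x masks the global one
  x + 1
}
g()  # 101
x    # Still 10: g() did not change the global x
```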
Each program must do one thing well (McIlroy et al., 1978).
In R, the fundamental unit of shareable code is the package.
A package bundles together code, data, documentation, and tests, and is easy to share with others (Wickham & Bryan, 2023).
Note: Other programming languages refer to packages as libraries.
R is just the core. It needs packages to do anything useful.
A typical installation of R comes with a set of packages, like:
base: Basic R functions (e.g., sum())
datasets: Some datasets for testing and teaching (e.g., mtcars)
graphics: The basic graphics functions (e.g., plot())
grDevices: The graphics devices (e.g., pdf())
methods: The built-in object-oriented programming system
parallel: Support for parallel computation
stats: Basic statistical functions (e.g., t.test())
utils: Utility functions (e.g., install.packages())

Based on the atomic types, we can create other types of objects.
Using the right kind of object in R is very important, because these objects also include methods to deal with particular types of data.
For example, time can be expressed in different ways, like linear time (e.g., durations, periods, date-times, intervals) and circular time (e.g., time of day). We can also have time series, which are sequences of data points taken at successive points in time.
Date
(Days since 1970-01-01 (UNIX epoch))
POSIXct
(Seconds since 1970-01-01 (UNIX epoch))
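You can see these internal representations by converting the objects to numbers:

```r
# Date: days since the UNIX epoch (1970-01-01)
as.numeric(as.Date("1970-01-02"))  # 1
as.numeric(as.Date("2023-01-01"))  # 19358

# POSIXct: seconds since the UNIX epoch
as.numeric(as.POSIXct("1970-01-01 00:01:00", tz = "UTC"))  # 60
```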
Tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy (The Tidy Manifesto), grammar, and data structures.
It was created by Hadley Wickham, a prominent figure in the R community and author of several key references for this course. Hadley Wickham’s contributions have significantly shaped modern R programming.
The Tidyverse also has a meta-package that installs or loads the most important packages from the collection.
readr: Read flat files (e.g., txt, csv, tsv) into R.
stringr: A fresh approach to string manipulation in R.
lubridate: Make working with dates in R just that little bit easier.
dplyr: A grammar of data manipulation.
purrr: A functional programming toolkit for R.
forcats: Tools for working with categorical variables (factors).
ggplot2: An implementation of the Grammar of Graphics in R.

rOpenSci is a non-profit initiative founded in 2011 that brings together a community of researchers and developers committed to open science. They create and maintain high-quality R packages that provide access to data and tools from diverse sources, ensuring reliability and reproducibility in scientific research.
All packages go through a peer-review process, which ensures that they are well written and reliable.
Example: mctq R package peer-review.
You can do most of the things you need with the packages that come bundled with R. However, that is not the most efficient way to do things today.
In this course we are going to focus on the Tidyverse and rOpenSci packages, which are the most modern and efficient way to work with data in R.
If you want a deeper understanding of the R language, I encourage you to explore the base R solutions.
Always follow the Tidyverse style guide to write clean and efficient R code:
Variable and function names should use only lowercase letters, numbers, and _. Use underscores (_) (so-called snake case) to separate words within a name.
The tidyverse has four guiding principles:
Reuse existing data structures.
Compose simple functions with the pipe.
Embrace functional programming.
Design for humans.
Data science is the study of the generalizable extraction of knowledge from data (Dhar, 2023).
For some, data science is just statistics (Broman, 2013) (hype statistics). For others, it’s a new interdisciplinary field that synthesizes statistics, informatics, computing, communication, management, and sociology (Cao, 2017).

(Reproduced from Wickham et al. (2023))
Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning (Reis & Housley, 2022).

(Reproduced from Reis & Housley (2022))
You only analyze or visualize data when you already have clean, tidy, and validated data.
The process of getting data ready for analysis is called data munging. You can also see it called data wrangling (usually when dealing with machine learning) or, simply, data cleaning.
Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets (Lohr, 2014).
Ackoff’s DIKW pyramid (Ackoff, 1989)
Data < Information < Knowledge < Wisdom
Data versus the interpretation of the data.

Data is an abstraction. It’s a representation of the world around us. Without context, it has no meaning.

(Artwork by Pablo Picasso – Le Taureau (1945-46))
A value chain, roughly, consists of a sequence of activities that increase the value of a product step by step. […] One should realize that although the schema nicely organizes data analysis activities, in practice, the process is hardly linear (van der Loo & Jonge, 2018).

(Reproduced from van der Loo & Jonge (2018))
With raw data, we mean the data as it arrives at the desk of the analyst. The state of such data may of course vary enormously, depending on the data source (van der Loo & Jonge, 2018).
If the researcher has made any modifications to the raw data, it is not the raw form of the data (S. E. Ellis & Leek, 2018).

(Photo by Unknown Author)

(Artwork by Allison Horst)

(Artwork by Allison Horst)

(Reproduced from Wickham et al. (2023))
Data validation techniques are used to ensure that data is accurate, consistent, and reliable.
Examples of invalid data:
Tip: For details on identifying biologically implausible values (BIVs) in anthropometric data, see the World Health Organization (WHO) anthro R package.
Daily air quality measurements in New York (May to September 1973).
Data can be hierarchical, with multiple levels of organization.
Interest rate representing the adjusted average rate of overnight repurchase agreements backed by Brazil’s federal government securities held in the Special System for Settlement and Custody (SELIC). Reported in % per day.
library(httr2)
request("https://api.bcb.gov.br/") |>
req_url_path_append("dados") |>
req_url_path_append("serie") |>
req_url_path_append("bcdata.sgs.11") |>
req_url_path_append("dados") |>
req_url_query(
formato = "json",
dataInicial = "01/12/2024",
dataFinal = "03/12/2024"
) |>
req_user_agent("github.com/danielvartan/r-course") |>
req_perform() |>
resp_body_json()
#> [[1]]
#> [[1]]$data
#> [1] "02/12/2024"
#> [[1]]$valor
#> [1] "0.041957"
#> [[2]]
#> [[2]]$data
#> [1] "03/12/2024"
#> [[2]]$valor
#> [1] "0.041957"

Not all data is tabular; spatial data can be very large and complex.
Excel cannot handle spatial data, and GUI-based statistical software, when capable of handling spatial data, is often limited and struggles with performance issues.
Projected monthly average maximum temperatures (°C) for June 2021–2040, based on the global climate model (GCM) ACCESS-CM2.

(WorldClim 2.1 data. June mean temperature (°C) in South America (1970-2000))
Spreadsheet syndrome is a term used to describe the problems that arise from using spreadsheets to manage data (Klein et al., 1992).

(Artwork by 9Dots Management)
Developed by E. F. Codd of IBM in 1970, the relational model is based on mathematical set theory and represents data as independent relations. Each relation (table) is conceptually represented as a two-dimensional structure of intersecting rows and columns. The relations are related to each other through the sharing of common entity characteristics (values in columns) (Coronel & Morris, 2019).

Proper documentation is crucial for data sharing, reproducibility, and ensuring that others can understand and use your data effectively.
Here are some examples and guides to help you document your data:
Check S. E. Ellis & Leek (2018) for a detailed guide on data documentation.

(Adapted from S. E. Ellis & Leek (2018, Figure 1))
There are many open data formats available for researchers to use. Open can mean different things, but in this context, it means that the format is not proprietary and can be used by anyone. Here are some examples:
⚠️ Excel Files Are Not an Open Data Format! ⚠️
Data frames are a special type of list used for storing data tables. They are the most common way of storing data in R.
tibble Package

tibble is a modern take on data frames in R, offering improved usability and integration with the Tidyverse. It enhances readability while retaining essential data frame features.
Using R's native data.frame class:
mtcars
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Using tibble (Tidyverse way):
library(dplyr) # or library(tibble)
mtcars |> as_tibble()
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rows

It will depend on what kind of data you are working with.
Different from Excel or GUI-based statistical software, R can deal with any kind of data.
Examples of R functions to read data:
utils::read.csv(), readr::read_csv()
sf::st_read(), stars::read_stars(), terra::vect(), terra::rast()
readxl::read_excel() (Excel), haven::read_dta() (Stata), haven::read_sav() (SPSS), haven::read_sas() (SAS)

Created by the great Allison Horst, the author of these beautiful illustrations.
The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.
These data were collected from 2007–2009 by Dr. Kristen Gorman with the Palmer Station, part of the US Long Term Ecological Research Network (LTER).
We will use this package to get familiar with R.
Tip: See this video by Renan, Michele, and Mucuvinha (2026) on the Mundo Sem Fim YouTube channel to learn more about these penguins.


(Artwork by Allison Horst)
🕵 Know your data.
Store the raw data in a data-raw folder in the root of your project.
The data documentation can be accessed by running:
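Assuming the palmerpenguins package is installed, its dataset documentation can be opened with the ? help operator:

```r
library(palmerpenguins)

?penguins_raw  # Documentation for the raw data
?penguins      # Documentation for the curated data
```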
Before importing the data to R, let’s first take a look at the content of the data file in a text editor.
palmerpenguins-raw.csv
studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181,3750,MALE,NA,NA,Not enough blood for isotopes.
PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186,3800,FEMALE,8.94956,-24.69454,NA
PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18,195,3250,FEMALE,8.36821,-25.33302,NA
PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,NA,NA,NA,NA,NA,NA,NA,Adult not sampled.
PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193,3450,FEMALE,8.76651,-25.32426,NA
PAL0708,6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A2,Yes,2007-11-16,39.3,20.6,190,3650,MALE,8.66496,-25.29805,NA
PAL0708,7,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N4A1,No,2007-11-15,38.9,17.8,181,3625,FEMALE,9.18718,-25.21799,Nest never observed with full clutch.
PAL0708,8,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N4A2,No,2007-11-15,39.2,19.6,195,4675,MALE,9.4606,-24.89958,Nest never observed with full clutch.
PAL0708,9,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N5A1,Yes,2007-11-09,34.1,18.1,193,3475,NA,NA,NA,No blood sample obtained.
PAL0708,10,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N5A2,Yes,2007-11-09,42,20.2,190,4250,NA,9.13362,-25.09368,No blood sample obtained for sexing.
PAL0708,11,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N6A1,Yes,2007-11-09,37.8,17.1,186,3300,NA,8.63243,-25.21315,No blood sample obtained for sexing.
PAL0708,12,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N6A2,Yes,2007-11-09,37.8,17.3,180,3700,NA,NA,NA,No blood sample obtained.
PAL0708,13,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N7A1,Yes,2007-11-15,41.1,17.6,182,3200,FEMALE,NA,NA,Not enough blood for isotopes.
PAL0708,14,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N7A2,Yes,2007-11-15,38.6,21.2,191,3800,MALE,NA,NA,Not enough blood for isotopes.
PAL0708,15,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N8A1,Yes,2007-11-16,34.6,21.1,198,4400,MALE,8.55583,-25.22588,NA
PAL0708,16,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N8A2,Yes,2007-11-16,36.6,17.8,185,3700,FEMALE,NA,NA,Not enough blood for isotopes.
PAL0708,17,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N9A1,Yes,2007-11-12,38.7,19,195,3450,FEMALE,9.18528,-25.06691,NA

Pointing to files inside your project.
here Package

here is a package that helps you use relative paths in your R projects.
It makes file management much easier and OS-independent.
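For example, to point to the raw data file used in this course (the printed path is illustrative; it depends on where your project lives):

```r
library(here)

# Builds an absolute path from the project root, on any operating system
here("data-raw", "palmerpenguins-raw.csv")
#> e.g., "C:/Users/you/r-course/data-raw/palmerpenguins-raw.csv"
```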

(Artwork by Allison Horst)
readr Package

data |> glimpse()
#> Rows: 344
#> Columns: 17
#> $ studyName <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
#> $ `Sample Number` <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10…
#> $ Species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
#> $ Region <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
#> $ Island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
#> $ Stage <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
#> $ `Individual ID` <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", …
#> $ `Clutch Completion` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", …
#> $ `Date Egg` <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-…
#> $ `Culmen Length (mm)` <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.…
#> $ `Culmen Depth (mm)` <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8"…
#> $ `Flipper Length (mm)` <chr> "181", "186", "195", NA, "193", "190", "181", "1…
#> $ `Body Mass (g)` <chr> "3750", "3800", "3250", NA, "3450", "3650", "362…
#> $ Sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"…
#> $ `Delta 15 N (o/oo)` <chr> NA, "8.94956", "8.36821", NA, "8.76651", "8.6649…
#> $ `Delta 13 C (o/oo)` <chr> NA, "-24.69454", "-25.33302", NA, "-25.32426", "…
#> $ Comments <chr> "Not enough blood for isotopes.", NA, NA, "Adult…
janitor Package
janitor provides simple functions for examining and cleaning dirty data.

(Artwork by Allison Horst)
clean_names()
The clean_names() function transforms data frame column names into machine-readable formats.
This makes it easier to work with the data.
data |> names()
#> [1] "studyName" "Sample Number" "Species"
#> [4] "Region" "Island" "Stage"
#> [7] "Individual ID" "Clutch Completion" "Date Egg"
#> [10] "Culmen Length (mm)" "Culmen Depth (mm)" "Flipper Length (mm)"
#> [13] "Body Mass (g)" "Sex" "Delta 15 N (o/oo)"
#> [16] "Delta 13 C (o/oo)" "Comments"
Most tidyverse functions use data masking, allowing you to refer to data frame columns directly by their names (e.g., my_variable) instead of using more verbose syntax like data[["my_variable"]].
This is part of a concept called tidy evaluation.
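A small illustration of the difference, sketched with the penguins data frame:

```r
library(dplyr)
library(palmerpenguins)

# Base R: the column must be addressed through the data frame object.
mean(penguins[["body_mass_g"]], na.rm = TRUE)

# Data masking: inside verbs like summarize(), bare column names
# refer to columns of the piped data frame.
penguins |> summarize(mean_mass = mean(body_mass_g, na.rm = TRUE))
```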
tidyr Package
tidyr provides a set of functions that help you tidy your data.
data |> glimpse()
#> Rows: 344
#> Columns: 18
#> $ study_name <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
#> $ sample_number <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "…
#> $ species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ region <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
#> $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ life_stage <chr> "Adult", "Adult", "Adult", "Adult", "Adult", "Adult"…
#> $ egg_stage <chr> "1 Egg Stage", "1 Egg Stage", "1 Egg Stage", "1 Egg …
#> $ individual_id <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
#> $ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
#> $ date_egg <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ culmen_length_mm <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ culmen_depth_mm <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
#> $ delta_15_n_o_oo <chr> NA, "8.94956", "8.36821", NA, "8.76651", "8.66496", …
#> $ delta_13_c_o_oo <chr> NA, "-24.69454", "-25.33302", NA, "-25.32426", "-25.…
#> $ comments <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…
dplyr Package
dplyr is one of the most important packages in the Tidyverse. It provides a grammar for data manipulation.
mutate(): Create, modify, and delete columns.
transmute(): Create a new data frame containing only specified computations.
select(): Keep or drop columns using their names and types.
slice(): Subset rows using their positions.
filter(): Keep rows that match a condition.
arrange(): Order rows using column values.
rename(): Rename columns.
relocate(): Change column order.
summarize(): Summarize each group down to one row.
select()
The select() function is used to select columns from a data frame.
In our case, we are not interested in using all the variables in the raw data. We will select only the variables we need.
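The slides show only the result of the selection; a call consistent with that output could look like this (a sketch, not necessarily the author's exact code):

```r
library(dplyr)

# Keep only the variables needed for the analysis.
data <-
  data |>
  select(
    date_egg, species, island,
    culmen_length_mm, culmen_depth_mm,
    flipper_length_mm, body_mass_g, sex
  )
```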
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ date_egg <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ culmen_length_mm <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ culmen_depth_mm <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…rename()
(Artwork by Allison Horst)
rename()
Let’s rename culmen to bill for clarity. Likewise, we’ll change date_egg to year; we’ll extract the year value in a later step.
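A rename() call consistent with the output that follows could look like this (a sketch, not necessarily the author's exact code; new names go on the left of the `=`):

```r
library(dplyr)

data <-
  data |>
  rename(
    year = date_egg,
    bill_length_mm = culmen_length_mm,
    bill_depth_mm = culmen_depth_mm
  )
```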
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ year <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ bill_depth_mm <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
mutate()
(Artwork by Allison Horst)
mutate()
The mutate() function is used to create new columns or modify existing columns in a data frame.
Let’s assign proper classes to the variables. We also need to transform the year column, extracting just the year value.
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ year <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ bill_depth_mm <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
mutate()
For categorical variables, it is helpful to inspect their unique values.
If the vector contains a limited number of unique values, it is likely a good candidate to be converted into a factor for better data handling and analysis.
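For example, a quick way to inspect unique values (a sketch using dplyr's pull()):

```r
library(dplyr)

# Pull a column out as a vector and list its distinct values.
data |> pull(species) |> unique()
data |> pull(island) |> unique()
```

A small set of repeated values, as with species and island here, signals a good factor candidate.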
mutate()
data <-
  data |>
  mutate(
    year =
      year |>
      as.Date() |>
      year() |>
      as.integer(),
    species =
      case_match(
        species,
        "Adelie Penguin (Pygoscelis adeliae)" ~ "Adelie",
        "Chinstrap penguin (Pygoscelis antarctica)" ~ "Chinstrap",
        "Gentoo penguin (Pygoscelis papua)" ~ "Gentoo"
      ) |>
      as.factor(),
    island = as.factor(island),
    bill_length_mm = bill_length_mm |> as.numeric(),
    bill_depth_mm = bill_depth_mm |> as.numeric(),
    flipper_length_mm = flipper_length_mm |> as.integer(),
    body_mass_g = body_mass_g |> as.integer(),
    sex =
      sex |>
      str_to_lower() |>
      as.factor()
  )
mutate()
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <fct> male, female, female, NA, female, male, female, male…
relocate()
(Artwork by Allison Horst)
relocate()
Let’s organize our columns in a more logical order.
The year column is best placed next to the sex column.
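A relocate() call consistent with the output that follows could look like this (a sketch, not necessarily the author's exact code):

```r
library(dplyr)

# Move year, currently the first column, to sit right after sex.
data <-
  data |>
  relocate(year, .after = sex)
```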
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <fct> male, female, female, NA, female, male, female, male…
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <fct> male, female, female, NA, female, male, female, male…
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
penguins |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <fct> male, female, female, NA, female, male, female, male…
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Always save your data in non-binary formats like CSV to ensure that it can be read by other software.
rds stands for R Data Serialization. It is a binary data format that allows you to load an object into R preserving all its attributes.
Tip: Store your data in research repositories like The Open Science Framework (See the osfr package). If you are working with sensitive or human data, ensure it is encrypted before storing it in the cloud (See the lockr package).
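Saving in both formats can be sketched as follows (the file names are hypothetical examples):

```r
library(readr)

# Non-binary, software-agnostic format:
write_csv(data, "penguins.csv")

# Binary R format that preserves attributes (factor levels, column classes):
write_rds(data, "penguins.rds")
data <- read_rds("penguins.rds")
```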
You’ve learned the basic grammar for data manipulation using the Tidyverse, but many important tools remain to be covered. The following slides introduce some of them.
I strongly encourage you to take the time to explore the documentation website of each Tidyverse package to discover more functions that can help you in your data analysis workflow. I guarantee your future self will thank you!
Now that we’ve reconstructed the penguins dataset from the palmerpenguins package, we’ll load the package and use the built-in dataset for the following examples. There’s no need to import our cleaned data into R.
filter()
(Artwork by Allison Horst)
filter()
The filter() function is used to filter rows from a data frame based on specified conditions.
penguins |>
filter(
species == "Adelie",
island %in% c("Dream", "Torgersen"),
body_mass_g > 4000,
bill_length_mm < 40
)
#> # A tibble: 10 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.2 19.6 195 4675
#> 2 Adelie Torgersen 34.6 21.1 198 4400
#> 3 Adelie Dream 39.2 21.1 196 4150
#> 4 Adelie Dream 39.8 19.1 184 4650
#> 5 Adelie Dream 39.6 18.8 190 4600
#> 6 Adelie Torgersen 35.1 19.4 193 4200
#> 7 Adelie Dream 39.6 18.1 186 4450
#> 8 Adelie Dream 37.5 18.5 199 4475
#> 9 Adelie Dream 39.7 17.9 193 4250
#> 10 Adelie Dream 39.2 18.6 190 4250
#> # ℹ 2 more variables: sex <fct>, year <int>
Practice (filter)
The dplyr package provides a dataset called starwars that contains information about characters from the Star Wars universe. Filter the starwars dataset to include only characters who meet the following criteria:
Species is Droid;
Height is greater than 100;
Mass is less than 150.
Practice (filter)
starwars |> glimpse()
#> Rows: 87
#> Columns: 14
#> $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
#> $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
#> $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
#> $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
#> $ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
#> $ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
#> $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
#> $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
#> $ films <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
#> $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
#> $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
Practice (filter)
Solution:
library(dplyr)
starwars |>
filter(
species == "Droid",
height > 100,
mass < 150
)
#> # A tibble: 2 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 C-3PO 167 75 <NA> gold yellow 112 none masculine
#> 2 IG-88 200 140 none metal red 15 none masculine
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
Data analysis typically requires working with multiple data frames that need to be combined to answer your research questions.

(Reproduced from Wickham et al. (2023))
These combinations are called joins. There are different types of joins depending on how you want to combine the data frames.

(Reproduced from Wickham et al. (2023))
A left join retains all rows from x. Each row in x appears in the output, matching with y where possible or with NA values when no match exists.
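In dplyr, a left join can be sketched with the band_members and band_instruments datasets that ship with the package (join_by() requires dplyr 1.1 or later):

```r
library(dplyr)

# Keep every row of band_members; add instrument info where a name matches.
band_members |>
  left_join(band_instruments, by = join_by(name))
```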

(Reproduced from Wickham et al. (2023))
Perform a join to combine band_members with band_instruments2. These two datasets can be found in the dplyr package.
Use left_join() and join_by() for this exercise.
Solution:
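A possible solution (in band_instruments2, the artist name is stored in a column called artist, so the join keys differ between the two tables):

```r
library(dplyr)

# Keep all band members; match on name (x) against artist (y).
band_members |>
  left_join(band_instruments2, by = join_by(name == artist))
```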
One of the most common issues is data stored in wide format where column names actually represent values of a variable.

(Artwork by Allison Horst)
tidyr provides two functions to deal with this kind of issue: pivot_longer() and pivot_wider().
relig_income
#> # A tibble: 18 × 11
#> religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Agnostic 27 34 60 81 76 137 122
#> 2 Atheist 12 27 37 52 35 70 73
#> 3 Buddhist 27 21 30 34 33 58 62
#> 4 Catholic 418 617 732 670 638 1116 949
#> 5 Don’t k… 15 14 15 11 10 35 21
#> 6 Evangel… 575 869 1064 982 881 1486 949
#> 7 Hindu 1 9 7 9 11 34 47
#> 8 Histori… 228 244 236 238 197 223 131
#> 9 Jehovah… 20 27 24 24 21 30 15
#> 10 Jewish 19 19 25 25 30 95 69
#> 11 Mainlin… 289 495 619 655 651 1107 939
#> 12 Mormon 29 40 48 51 56 112 85
#> 13 Muslim 6 7 9 10 9 23 16
#> 14 Orthodox 13 17 23 32 32 47 38
#> 15 Other C… 9 7 11 13 13 14 18
#> 16 Other F… 20 33 40 46 49 63 46
#> 17 Other W… 5 2 3 4 2 7 3
#> 18 Unaffil… 217 299 374 365 341 528 407
#> # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#> # `Don't know/refused` <dbl>relig_income |>
pivot_longer(
cols = !religion,
names_to = "income",
values_to = "count"
)
#> # A tibble: 180 × 3
#> religion income count
#> <chr> <chr> <dbl>
#> 1 Agnostic <$10k 27
#> 2 Agnostic $10-20k 34
#> 3 Agnostic $20-30k 60
#> 4 Agnostic $30-40k 81
#> 5 Agnostic $40-50k 76
#> 6 Agnostic $50-75k 137
#> 7 Agnostic $75-100k 122
#> 8 Agnostic $100-150k 109
#> 9 Agnostic >150k 84
#> 10 Agnostic Don't know/refused 96
#> # ℹ 170 more rows
(Artwork by Allison Horst)
Perform a pivot to transform the table4a dataset, found in the tidyr package, from wide to long format using the pivot_longer() function.
Solution:
library(tidyr)
table4a |>
pivot_longer(
cols = !country,
names_to = "year",
values_to = "cases"
)
#> # A tibble: 6 × 3
#> country year cases
#> <chr> <chr> <dbl>
#> 1 Afghanistan 1999 745
#> 2 Afghanistan 2000 2666
#> 3 Brazil 1999 37737
#> 4 Brazil 2000 80488
#> 5 China 1999 212258
#> 6 China 2000 213766
glimpse()
penguins |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <fct> male, female, female, NA, female, male, female, male…
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
head() & tail()
penguins |> head(3)
#> # A tibble: 3 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750
#> 2 Adelie Torgersen 39.5 17.4 186 3800
#> 3 Adelie Torgersen 40.3 18 195 3250
#> # ℹ 2 more variables: sex <fct>, year <int>
penguins |> tail(3)
#> # A tibble: 3 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Chinstrap Dream 49.6 18.2 193 3775
#> 2 Chinstrap Dream 50.8 19 210 4100
#> 3 Chinstrap Dream 50.2 18.7 198 3775
#> # ℹ 2 more variables: sex <fct>, year <int>
slice_sample()
penguins |> slice_sample(n = 15)
#> # A tibble: 15 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Gentoo Biscoe 46.5 14.4 217 4900
#> 2 Adelie Dream 40.8 18.4 195 3900
#> 3 Gentoo Biscoe 50.5 15.9 225 5400
#> 4 Chinstrap Dream 50.8 19 210 4100
#> 5 Chinstrap Dream 51.3 18.2 197 3750
#> 6 Chinstrap Dream 45.4 18.7 188 3525
#> 7 Gentoo Biscoe 49.8 15.9 229 5950
#> 8 Adelie Dream 40.7 17 190 3725
#> 9 Chinstrap Dream 50.9 17.9 196 3675
#> 10 Gentoo Biscoe 47.8 15 215 5650
#> 11 Chinstrap Dream 43.5 18.1 202 3400
#> 12 Adelie Biscoe 38.6 17.2 199 3750
#> 13 Adelie Dream 37.3 17.8 191 3350
#> 14 Adelie Torgers… 39 17.1 191 3050
#> 15 Adelie Dream 42.3 21.2 191 4150
#> # ℹ 2 more variables: sex <fct>, year <int>
unique()
count()
penguins |>
count(species, island, .drop = FALSE)
#> # A tibble: 9 × 3
#> species island n
#> <fct> <fct> <int>
#> 1 Adelie Biscoe 44
#> 2 Adelie Dream 56
#> 3 Adelie Torgersen 52
#> 4 Chinstrap Biscoe 0
#> 5 Chinstrap Dream 68
#> 6 Chinstrap Torgersen 0
#> 7 Gentoo Biscoe 124
#> 8 Gentoo Dream 0
#> 9 Gentoo Torgersen 0
count()
penguins |>
count(species, island, .drop = FALSE) |>
mutate(
n_cum = cumsum(n),
n_per = n |>
divide_by(sum(n)) |>
multiply_by(100) |>
round(3),
n_per_cum = n_per |> cumsum()
)
#> # A tibble: 9 × 6
#> species island n n_cum n_per n_per_cum
#> <fct> <fct> <int> <int> <dbl> <dbl>
#> 1 Adelie Biscoe 44 44 12.8 12.8
#> 2 Adelie Dream 56 100 16.3 29.1
#> 3 Adelie Torgersen 52 152 15.1 44.2
#> 4 Chinstrap Biscoe 0 152 0 44.2
#> 5 Chinstrap Dream 68 220 19.8 64.0
#> 6 Chinstrap Torgersen 0 220 0 64.0
#> 7 Gentoo Biscoe 124 344 36.0 100
#> 8 Gentoo Dream 0 344 0 100
#> 9 Gentoo Torgersen 0 344 0 100
The R packages that come with a typical R installation provide a set of basic statistical functions.
For skewness and kurtosis, use the moments package.
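For example, a minimal sketch of the moments functions applied to the penguins data:

```r
library(moments)
library(palmerpenguins)

# Third and fourth standardized moments of bill length.
skewness(penguins$bill_length_mm, na.rm = TRUE)
kurtosis(penguins$bill_length_mm, na.rm = TRUE)
```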
summarize()
penguins |>
summarize(
n = n(),
mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
mean_bill_length_mm = mean(bill_length_mm, na.rm = TRUE),
mean_bill_depth_mm = mean(bill_depth_mm, na.rm = TRUE),
mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE),
.by = c("species", "sex")
)
#> # A tibble: 8 × 7
#> species sex n mean_body_mass_g mean_bill_length_mm mean_bill_depth_mm
#> <fct> <fct> <int> <dbl> <dbl> <dbl>
#> 1 Adelie male 73 4043. 40.4 19.1
#> 2 Adelie female 73 3369. 37.3 17.6
#> 3 Adelie <NA> 6 3540 37.8 18.3
#> 4 Gentoo female 58 4680. 45.6 14.2
#> 5 Gentoo male 61 5485. 49.5 15.7
#> 6 Gentoo <NA> 5 4588. 45.6 14.6
#> 7 Chinstrap female 34 3527. 46.6 17.6
#> 8 Chinstrap male 34 3939. 51.1 19.3
#> # ℹ 1 more variable: mean_flipper_length_mm <dbl>
Practice (summarize)
Create a comprehensive species profile for the Palmer Penguins dataset. For each penguin species, calculate:
The number of observations;
The number of distinct islands where the species is found;
The mean body mass, bill length, and flipper length.
Practice (summarize)
Use n() to count the number of observations.
Use n_distinct() to count the number of unique values.
Practice (summarize)
Solution:
penguins |>
summarize(
n = n(),
n_islands = n_distinct(island),
mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
mean_bill_length_mm = mean(bill_length_mm, na.rm = TRUE),
mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE),
.by = "species"
)
#> # A tibble: 3 × 6
#> species n n_islands mean_body_mass_g mean_bill_length_mm
#> <fct> <int> <int> <dbl> <dbl>
#> 1 Adelie 152 3 3701. 38.8
#> 2 Gentoo 124 1 5076. 47.5
#> 3 Chinstrap 68 1 3733. 48.8
#> # ℹ 1 more variable: mean_flipper_length_mm <dbl>
summarytools Package
The summarytools package makes it easy to create frequency and descriptive statistics tables for single variables with minimal code.
penguins |> freq(var = species, round.digits = 3)
#> Frequencies
#> penguins$species
#> Type: Factor
#>
#> Freq % Valid % Valid Cum. % Total % Total Cum.
#> --------------- ------ --------- -------------- --------- --------------
#> Adelie 152 44.186 44.186 44.186 44.186
#> Chinstrap 68 19.767 63.953 19.767 63.953
#> Gentoo 124 36.047 100.000 36.047 100.000
#> <NA> 0 0.000 100.000
#> Total 344 100.000 100.000 100.000 100.000
penguins |> descr(var = bill_depth_mm, round.digits = 3)
#> Descriptive Statistics
#> penguins$bill_depth_mm
#> N: 344
#>
#> bill_depth_mm
#> ----------------- ---------------
#> Mean 17.151
#> Std.Dev 1.975
#> Min 13.100
#> Q1 15.600
#> Median 17.300
#> Q3 18.700
#> Max 21.500
#> MAD 2.224
#> IQR 3.100
#> CV 0.115
#> Skewness -0.142
#> SE.Skewness 0.132
#> Kurtosis -0.923
#> N.Valid 342.000
#> N 344.000
#> Pct.Valid 99.419
summarize() Approach
If you need more control over the descriptive statistics, you will need to use summarize().
stats_summary <- function(x) {
penguins |>
summarize(
n = n(),
n_valid = .data[[x]] |>
magrittr::extract(!is.na(.data[[x]])) |>
length(),
n_valid_per = n_valid |> # == (n_valid / n) * 100
divide_by(n) |>
multiply_by(100),
mean = .data[[x]] |> mean(na.rm = TRUE),
var = .data[[x]] |> var(na.rm = TRUE),
sd = .data[[x]] |> sd(na.rm = TRUE),
min = .data[[x]] |> min(na.rm = TRUE),
quartile_1 = .data[[x]] |> quantile(prob = 0.25, na.rm = TRUE),
median = .data[[x]] |> median(na.rm = TRUE),
quartile_3 = .data[[x]] |> quantile(prob = 0.75, na.rm = TRUE),
max = .data[[x]] |> max(na.rm = TRUE),
iqr = .data[[x]] |> IQR(na.rm = TRUE),
range = .data[[x]] |> range(na.rm = TRUE) |> diff(),
skewness = .data[[x]] |> skewness(na.rm = TRUE),
kurtosis = .data[[x]] |> kurtosis(na.rm = TRUE),
.by = species
) |>
pivot_longer(
cols = -species,
names_to = "statistic",
values_to = "value"
) |>
pivot_wider(
names_from = species,
values_from = value
)
}
stats_summary("bill_length_mm")
#> # A tibble: 15 × 4
#> statistic Adelie Gentoo Chinstrap
#> <chr> <dbl> <dbl> <dbl>
#> 1 n 152 124 68
#> 2 n_valid 151 123 68
#> 3 n_valid_per 99.3 99.2 100
#> 4 mean 38.8 47.5 48.8
#> 5 var 7.09 9.50 11.2
#> 6 sd 2.66 3.08 3.34
#> 7 min 32.1 40.9 40.9
#> 8 quartile_1 36.8 45.3 46.3
#> 9 median 38.8 47.3 49.6
#> 10 quartile_3 40.8 49.6 51.1
#> 11 max 46 59.6 58
#> 12 iqr 4 4.25 4.73
#> 13 range 13.9 18.7 17.1
#> 14 skewness 0.160 0.643 -0.0886
#> 15 kurtosis 2.81 4.20 2.95
The gt package is the most powerful package for creating and customizing tables in R.
It provides a flexible grammar for table creation and styling.
gt Package
penguins_summary <-
penguins |>
filter(!is.na(sex)) |>
summarize(
n = n(),
mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
mean_bill_length_mm = mean(bill_length_mm, na.rm = TRUE),
mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE),
.by = c("species", "sex")
) |>
mutate(sex = sex |> str_to_title()) |>
arrange(species, sex)
gt Package
penguins_summary
#> # A tibble: 6 × 6
#> species sex n mean_body_mass_g mean_bill_length_mm
#> <fct> <chr> <int> <dbl> <dbl>
#> 1 Adelie Female 73 3369. 37.3
#> 2 Adelie Male 73 4043. 40.4
#> 3 Chinstrap Female 34 3527. 46.6
#> 4 Chinstrap Male 34 3939. 51.1
#> 5 Gentoo Female 58 4680. 45.6
#> 6 Gentoo Male 61 5485. 49.5
#> # ℹ 1 more variable: mean_flipper_length_mm <dbl>
| Sex | N | Body Mass (g) | Bill Length (mm) | Flipper Length (mm) |
|---|---|---|---|---|
| Adelie | ||||
| Female | 73 | 3368.836 | 37.25753 | 187.7945 |
| Male | 73 | 4043.493 | 40.39041 | 192.4110 |
| Chinstrap | ||||
| Female | 34 | 3527.206 | 46.57353 | 191.7353 |
| Male | 34 | 3938.971 | 51.09412 | 199.9118 |
| Gentoo | ||||
| Female | 58 | 4679.741 | 45.56379 | 212.7069 |
| Male | 61 | 5484.836 | 49.47377 | 221.5410 |
| Antarctic Penguin Characteristics | ||||
| Comparison between species and sexes | ||||
| Sex | N | Body Mass (g) | Bill Length (mm) | Flipper Length (mm) |
|---|---|---|---|---|
| Adelie | ||||
| Female | 73 | 3368.836 | 37.25753 | 187.7945 |
| Male | 73 | 4043.493 | 40.39041 | 192.4110 |
| Chinstrap | ||||
| Female | 34 | 3527.206 | 46.57353 | 191.7353 |
| Male | 34 | 3938.971 | 51.09412 | 199.9118 |
| Gentoo | ||||
| Female | 58 | 4679.741 | 45.56379 | 212.7069 |
| Male | 61 | 5484.836 | 49.47377 | 221.5410 |
| Antarctic Penguin Characteristics | ||||
| Comparison between species and sexes | ||||
| Sex | N | Body Mass (g) | Bill Length (mm) | Flipper Length (mm) |
|---|---|---|---|---|
| Adelie | ||||
| Female | 73 | 3,368.84 | 37.26 | 187.79 |
| Male | 73 | 4,043.49 | 40.39 | 192.41 |
| Chinstrap | ||||
| Female | 34 | 3,527.21 | 46.57 | 191.74 |
| Male | 34 | 3,938.97 | 51.09 | 199.91 |
| Gentoo | ||||
| Female | 58 | 4,679.74 | 45.56 | 212.71 |
| Male | 61 | 5,484.84 | 49.47 | 221.54 |
| Antarctic Penguin Characteristics | ||||
| Comparison between species and sexes | ||||
| Sex | N | Body Mass (g) | Bill Length (mm) | Flipper Length (mm) |
|---|---|---|---|---|
| Adelie | ||||
| Female | 73 | 3,368.84 | 37.26 | 187.79 |
| Male | 73 | 4,043.49 | 40.39 | 192.41 |
| Chinstrap | ||||
| Female | 34 | 3,527.21 | 46.57 | 191.74 |
| Male | 34 | 3,938.97 | 51.09 | 199.91 |
| Gentoo | ||||
| Female | 58 | 4,679.74 | 45.56 | 212.71 |
| Male | 61 | 5,484.84 | 49.47 | 221.54 |
| Antarctic Penguin Characteristics | ||||
| Comparison between species and sexes | ||||
| Sex | N | Body Mass (g)1 | Bill Length (mm)1 | Flipper Length (mm)1 |
|---|---|---|---|---|
| Adelie | ||||
| Female | 73 | 3,368.84 | 37.26 | 187.79 |
| Male | 73 | 4,043.49 | 40.39 | 192.41 |
| Chinstrap | ||||
| Female | 34 | 3,527.21 | 46.57 | 191.74 |
| Male | 34 | 3,938.97 | 51.09 | 199.91 |
| Gentoo | ||||
| Female | 58 | 4,679.74 | 45.56 | 212.71 |
| Male | 61 | 5,484.84 | 49.47 | 221.54 |
| 1 Averages calculated after removing missing values. | ||||
| Source: Palmer Station LTER / palmerpenguins package. | ||||
penguin_table <-
penguin_table |>
tab_style(
style = cell_fill(color = "#E8F4F8"),
locations = cells_row_groups()
) |>
tab_style(
style = cell_text(size = px(18)),
locations = cells_title(groups = "title")
) |>
tab_options(
table.border.top.style = "solid",
table.border.bottom.style = "solid",
heading.border.bottom.style = "solid",
column_labels.border.bottom.style = "solid",
row_group.border.top.style = "solid",
row_group.border.bottom.style = "solid"
)
penguin_table
| Antarctic Penguin Characteristics | ||||
| Comparison between species and sexes | ||||
| Sex | N | Body Mass (g)1 | Bill Length (mm)1 | Flipper Length (mm)1 |
|---|---|---|---|---|
| Adelie | ||||
| Female | 73 | 3,368.84 | 37.26 | 187.79 |
| Male | 73 | 4,043.49 | 40.39 | 192.41 |
| Chinstrap | ||||
| Female | 34 | 3,527.21 | 46.57 | 191.74 |
| Male | 34 | 3,938.97 | 51.09 | 199.91 |
| Gentoo | ||||
| Female | 58 | 4,679.74 | 45.56 | 212.71 |
| Male | 61 | 5,484.84 | 49.47 | 221.54 |
| 1 Averages calculated after removing missing values. | ||||
| Source: Palmer Station LTER / palmerpenguins package. | ||||
Practice (gt)
Using the starwars dataset from the dplyr package, create a summary table that displays the mean height and mass of characters grouped by their species.
Customize the table with appropriate titles, labels, and formatting to enhance its readability.
starwars |> glimpse()
#> Rows: 87
#> Columns: 14
#> $ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
#> $ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
#> $ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
#> $ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
#> $ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
#> $ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
#> $ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
#> $ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
#> $ films <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
#> $ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
#> $ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
Practice (gt)
Solution:
library(dplyr)
library(gt)
library(tidyr)
starwars_summary <-
starwars |>
drop_na(species, height, mass) |>
summarize(
mean_height = mean(height, na.rm = TRUE),
mean_mass = mean(mass, na.rm = TRUE),
.by = "species"
)
starwars_summary |>
gt() |>
cols_label(
species = "Species",
mean_height = "Mean Height (cm)",
mean_mass = "Mean Mass (kg)"
) |>
tab_header(
title = md("**Star Wars Character Characteristics**"),
subtitle = "Mean height and mass by species"
) |>
fmt_number(
columns = starts_with("mean"),
decimals = 2
) |>
tab_source_note(
source_note = "Source: Star Wars API / dplyr package."
)
| Star Wars Character Characteristics | ||
| Mean height and mass by species | ||
| Species | Mean Height (cm) | Mean Mass (kg) |
|---|---|---|
| Human | 180.25 | 81.31 |
| Droid | 140.00 | 69.75 |
| Wookiee | 231.00 | 124.00 |
| Rodian | 173.00 | 74.00 |
| Hutt | 175.00 | 1,358.00 |
| Yoda's species | 66.00 | 17.00 |
| Trandoshan | 190.00 | 113.00 |
| Mon Calamari | 180.00 | 83.00 |
| Ewok | 88.00 | 20.00 |
| Sullustan | 160.00 | 68.00 |
| Neimodian | 191.00 | 90.00 |
| Gungan | 210.00 | 74.00 |
| Dug | 112.00 | 40.00 |
| Zabrak | 175.00 | 80.00 |
| Twi'lek | 178.00 | 55.00 |
| Aleena | 79.00 | 15.00 |
| Vulptereen | 94.00 | 45.00 |
| Toong | 163.00 | 65.00 |
| Cerean | 198.00 | 82.00 |
| Nautolan | 196.00 | 87.00 |
| Tholothian | 184.00 | 50.00 |
| Kel Dor | 188.00 | 80.00 |
| Geonosian | 183.00 | 80.00 |
| Mirialan | 168.00 | 53.10 |
| Clawdite | 168.00 | 55.00 |
| Besalisk | 198.00 | 102.00 |
| Kaminoan | 229.00 | 88.00 |
| Skakoan | 193.00 | 48.00 |
| Togruta | 178.00 | 57.00 |
| Kaleesh | 216.00 | 159.00 |
| Pau'an | 206.00 | 80.00 |
| Source: Star Wars API / dplyr package. | ||
gt Extensions
The gt package also has several extensions that provide additional functionality for specific use cases.
gtsummary provides a simple way to create publication-ready summary tables for statistical models and data frames.
gtExtras provides additional themes, color scales, and utilities to enhance gt tables.
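For instance, gtsummary can produce a stratified descriptive table with very little code (a minimal sketch, assuming the penguins data is loaded):

```r
library(gtsummary)
library(palmerpenguins)

# One call: counts and summary statistics for every column, split by species.
penguins |>
  tbl_summary(by = species)
```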
The Grammar of Graphics, by Leland Wilkinson, set the foundation for thinking about data visualization.
We will base this part of the course on the principles found in this book.

(Book cover image from Wilkinson (2005))
ggplot2 Package
(Artwork by Allison Horst)
ggplot2 Package
The most powerful and flexible package for data visualization.
A Tidyverse package based on the principles of The Grammar of Graphics.

(Reproduced from Wickham (2016))


Create a bar plot using the starwars dataset that shows the frequency of characters by their sex variable.
Apply the following customizations:
Remove the missing values (NA) of sex before creating the plot using the drop_na() function from the tidyr package.
Fill the bars with a color of your choice from the colors() function. View available color tones here.
Use coord_flip() to display horizontal bars.

Solution:
library(dplyr)
library(forcats)
library(ggplot2)
library(stringr)
library(tidyr)
starwars |>
drop_na(sex) |>
mutate(
sex = sex |>
str_to_title() |>
as.factor() |>
fct_infreq()
) |>
ggplot(aes(x = sex)) +
geom_bar(fill = "dodgerblue4") +
coord_flip() +
labs(
title = "Distribution of Star Wars Characters by Sex",
x = "Sex",
y = "Frequency",
caption = "Source: Star Wars API / dplyr package"
)
The ggplot2 package includes the diamonds dataset containing information about diamond quality and price. Using this dataset, create a histogram to visualize the distribution of the price variable.
Apply the following customizations:
diamonds |> glimpse()
#> Rows: 53,940
#> Columns: 10
#> $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
#> $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
#> $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
#> $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
#> $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
#> $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
#> $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
#> $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
#> $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
#> $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

Solution:
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
drop_na(
bill_length_mm,
species
) |>
ggplot(
aes(
x = species,
y = bill_length_mm,
fill = species
)
) +
geom_boxplot(
outlier.color = "red"
) +
geom_jitter(
width = 0.2,
alpha = 0.1
) +
labs(
x = "Species",
y = "Bill Length (mm)",
fill = "Species"
)
The ggplot2 package includes the msleep dataset containing information about the sleep habits of various mammals. Using this dataset, create a boxplot to visualize the distribution of the sleep_total variable across different vore (dietary categories).
Apply the following customizations:
Remove the missing values (NA) of sleep_total and vore before creating the plot using the drop_na() function from the tidyr package.
Color the boxplots by vore category by mapping the fill aesthetic to the vore variable.

msleep |> glimpse()
#> Rows: 83
#> Columns: 11
#> $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater shor…
#> $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bra…
#> $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carn…
#> $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Art…
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "dome…
#> $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5…
#> $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, …
#> $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, N…
#> $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 1…
#> $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0…
#> $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.04…

Solution:
library(ggplot2)
library(tidyr)
msleep |>
drop_na(
sleep_total,
vore
) |>
ggplot(
aes(
x = vore,
y = sleep_total,
fill = vore
)
) +
geom_boxplot() +
labs(
title = "Distribution of Total Sleep by Dietary Category",
x = "Dietary Category (Vore)",
y = "Total Sleep (hours)",
caption = "Source: ggplot2 package"
)

library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
ggplot(
aes(
x = body_mass_g,
y = flipper_length_mm,
color = species,
shape = species
)
) +
geom_point() +
geom_smooth(
method = "lm",
formula = y ~ x,
se = FALSE
) +
labs(
x = "Body Mass (g)",
y = "Flipper Length (mm)",
color = "Species",
shape = "Species"
)
The ggplot2 package includes the mpg dataset containing information about various car models and their fuel efficiency. Using this dataset, create a scatter plot to visualize the relationship between displ (engine displacement) and hwy (highway miles per gallon).
Apply the following customizations:
Add a trend line using the geom_smooth() function.

mpg |> glimpse()
#> Rows: 234
#> Columns: 11
#> $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
#> $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
#> $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
#> $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
#> $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
#> $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
#> $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
#> $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
#> $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
#> $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
#> $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…

Solution:
library(dplyr)
library(ggplot2)
library(magrittr)
library(palmerpenguins)
library(stringr)
penguins |>
mutate(
sex = sex |> str_to_title()
) |>
ggplot(
aes(
x = body_mass_g,
y = flipper_length_mm,
color = sex
)
) +
geom_point() +
geom_smooth(
method = "lm",
formula = y ~ x,
se = FALSE,
color = "black"
) +
facet_wrap(vars(species)) +
labs(
x = "Body Mass (g)",
y = "Flipper Length (mm)",
color = "Sex"
)
patchwork Package
(Artwork by Allison Horst)
library(ggplot2)
library(palmerpenguins)
library(patchwork)
library(tidyr)
plot_hist <-
penguins |>
ggplot(
aes(
x = flipper_length_mm,
fill = species
)
) +
geom_histogram(
alpha = 0.5,
position = "identity"
) +
labs(
x = "Flipper Length (mm)",
y = "Frequency"
) +
theme(legend.position = "none")
plot_boxplot <-
penguins |>
ggplot(
aes(
x = species,
y = bill_length_mm,
fill = species
)
) +
geom_boxplot(outlier.color = "red") +
geom_jitter(
width = 0.2,
alpha = 0.1
) +
labs(
x = "Species",
y = "Bill Length (mm)",
fill = "Species"
) +
theme(
axis.title.x = element_blank(),
legend.position = "none"
)
plot_scatter <-
penguins |>
drop_na(
body_mass_g,
flipper_length_mm,
species
) |>
ggplot(
aes(
x = body_mass_g,
y = flipper_length_mm,
color = species,
shape = species
)
) +
geom_point() +
geom_smooth(
method = "lm",
formula = y ~ x,
se = FALSE
) +
labs(
x = "Body Mass (g)",
y = "Flipper Length (mm)",
color = "Species",
shape = "Species"
) +
theme(legend.position = "none")
(plot_hist + plot_boxplot) / plot_scatter + plot_annotation(tag_levels = "A")
ggplot2 Extensions

Like gt, the ggplot2 package has several extensions that provide additional functionality for specific use cases. You can check a list of them here.
ggwordcloud provides a way to create word clouds using ggplot2.
tidyplots provides a user-friendly code-based interface for creating customizable and insightful plots.

Color palettes can be sequential, diverging, or qualitative (discrete). Here are some examples of two popular packages for color palettes in R.
viridis: Colorblind-Friendly Color Maps
library(ggplot2)
library(palmerpenguins)
library(tidyr)
library(viridis)
penguins |>
drop_na(
flipper_length_mm,
species
) |>
ggplot(
aes(
x = flipper_length_mm,
fill = species
)
) +
geom_density(
alpha = 0.5,
position = "identity"
) +
scale_fill_viridis(discrete = TRUE) +
labs(
x = "Flipper Length (mm)",
y = "Density",
fill = "Species"
) +
theme(
text = element_text(size = 20),
legend.position = "none"
)
RColorBrewer: Common Color Palettes
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
drop_na(
flipper_length_mm,
species
) |>
ggplot(
aes(
x = flipper_length_mm,
fill = species
)
) +
geom_density(
alpha = 0.5,
position = "identity"
) +
scale_fill_brewer(palette = "Set1") +
labs(
x = "Flipper Length (mm)",
y = "Density",
fill = "Species"
) +
theme(
text = element_text(size = 20),
legend.position = "none"
)
Check out the amazing work by Yan Holtz. Visit From Data to Viz to see the diagram below and many others.

(Artwork by Yan Holtz)
Another amazing work by Yan Holtz is the R Graph Gallery, which provides numerous examples of R graphics and how to create them.

(Artwork by Yan Holtz)
The Tidymodels framework is a collection of packages for modeling and machine learning using Tidyverse principles.
It is created by the same team that developed Tidyverse and is designed to work seamlessly with it.
Although it is a relative newcomer to the R ecosystem (2018), it has quickly gained popularity due to its simplicity and consistency.
Like Tidyverse, Tidymodels also has a meta-package that installs and loads the most important packages from the collection.
rsample: Provides infrastructure for efficient data splitting and resampling.
parsnip: A tidy, unified interface to models.
recipes: A tidy interface to data pre-processing tools for feature engineering.
workflows: A package to bundle the pre-processing, modeling, and post-processing together.
infer: A statistical grammar for inferential statistics.

A hypothesis is a statement about a population parameter.
The goal of a hypothesis test is to decide, based on a sample from the population, which of two complementary hypotheses is true.
The two complementary hypotheses in a hypothesis testing problem are called the null hypothesis and the alternative hypothesis. They are denoted by \(\text{H}_{0}\) and \(\text{H}_{1}\), respectively.
A hypothesis testing procedure or hypothesis test is a rule that specifies for which sample values \(\text{H}_{0}\) is accepted as true, and for which sample values \(\text{H}_{0}\) is rejected and \(\text{H}_{1}\) is accepted as true. The possible outcomes are summarized below:
| Decision about \(\text{H}_{0}\) | \(\text{H}_{0}\) True | \(\text{H}_{0}\) False |
|---|---|---|
| Accept | Correct inference (True negative) (\(1 - \alpha\)) | Type II error (False negative) (\(\beta\)) |
| Reject | Type I error (False positive) (\(\alpha\)) | Correct inference (True positive) (\(1 - \beta\)) |
(Based on Casella & Berger (2002, p. 383))
library(pwrss)
pwr_analysis <- pwrss.t.2means(
mu1 = 0.2, # Cohen's d for small effect sizes
mu2 = 0,
power = 0.8,
alpha = 0.05,
welch.df = TRUE,
alternative = "greater"
)
#> +--------------------------------------------------+
#> | SAMPLE SIZE CALCULATION |
#> +--------------------------------------------------+
#>
#> Welch's T-Test (Independent Samples)
#>
#> ---------------------------------------------------
#> Hypotheses
#> ---------------------------------------------------
#> H0 (Null Claim) : d - null.d <= 0
#> H1 (Alt. Claim) : d - null.d > 0
#>
#> ---------------------------------------------------
#> Results
#> ---------------------------------------------------
#> Sample Size = 310 and 310 <<
#> Type 1 Error (alpha) = 0.050
#> Type 2 Error (beta) = 0.200
#> Statistical Power = 0.8
power.t.test(
ncp = pwr_analysis$ncp,
df = pwr_analysis$df,
alpha = pwr_analysis$parms$alpha,
alternative = "one.sided",
plot = TRUE,
verbose = FALSE
)
library(pwrss)
pwr_analysis <- pwrss.t.2means(
mu1 = 0.2, # Cohen's d for small effect sizes
mu2 = 0,
power = 0.3,
alpha = 0.05,
welch.df = TRUE,
alternative = "greater"
)
#> +--------------------------------------------------+
#> | SAMPLE SIZE CALCULATION |
#> +--------------------------------------------------+
#>
#> Welch's T-Test (Independent Samples)
#>
#> ---------------------------------------------------
#> Hypotheses
#> ---------------------------------------------------
#> H0 (Null Claim) : d - null.d <= 0
#> H1 (Alt. Claim) : d - null.d > 0
#>
#> ---------------------------------------------------
#> Results
#> ---------------------------------------------------
#> Sample Size = 64 and 64 <<
#> Type 1 Error (alpha) = 0.050
#> Type 2 Error (beta) = 0.698
#> Statistical Power = 0.302
power.t.test(
ncp = pwr_analysis$ncp,
df = pwr_analysis$df,
alpha = pwr_analysis$parms$alpha,
alternative = "one.sided",
plot = TRUE,
verbose = FALSE
)
library(pwrss)
pwr_analysis <- pwrss.t.2means(
mu1 = 0.2, # Cohen's d for small effect sizes
mu2 = 0,
power = 0.999,
alpha = 0.001,
welch.df = TRUE,
alternative = "greater"
)
#> +--------------------------------------------------+
#> | SAMPLE SIZE CALCULATION |
#> +--------------------------------------------------+
#>
#> Welch's T-Test (Independent Samples)
#>
#> ---------------------------------------------------
#> Hypotheses
#> ---------------------------------------------------
#> H0 (Null Claim) : d - null.d <= 0
#> H1 (Alt. Claim) : d - null.d > 0
#>
#> ---------------------------------------------------
#> Results
#> ---------------------------------------------------
#> Sample Size = 1913 and 1913 <<
#> Type 1 Error (alpha) = 0.001
#> Type 2 Error (beta) = 0.001
#> Statistical Power = 0.999
power.t.test(
ncp = pwr_analysis$ncp,
df = pwr_analysis$df,
alpha = pwr_analysis$parms$alpha,
alternative = "one.sided",
plot = TRUE,
verbose = FALSE
)

(Artwork by Allison Horst)

(Artwork by Allison Horst)
Large samples and sensitivity
Is a difference of \(0.00001\) valid?
Statistical ritual versus Statistical thinking
Comparison of a 95% confidence level (\(\alpha = 0.05\)) with an n-dependent p-value curve. The parameter \(n_{\alpha}\) represents the minimum sample size to detect statistically significant differences among compared groups. The parameter \(n_{\gamma}\) represents the convergence point of the p-value curve. When the p-value curve expresses practical differences, the area under the red curve (\(A_{p(n)}\)) is smaller than the area under the constant function \(\alpha = 0.05\) (\(A_{\alpha = 0.05}\)) when it is evaluated between \(0\) and \(n_{\gamma}\).

(Reproduced from Gómez-de-Mariscal et al. (2021, Figure 3))
[…] in many circumstances, all that is intended by “proving” the null hypothesis is that the ES [Effect Size] is not necessarily zero but small enough to be negligible
(Cohen, 1988, p. 461).

(Photo by an unknown author.)
| Test | Relevant Effect Size | Small | Medium | Large |
|---|---|---|---|---|
| Comparison of independent means | \(d\), \(\Delta\), Hedges’ \(g\) | 0.20 | 0.50 | 0.80 |
| Comparison of two correlations | \(q\) | 0.10 | 0.30 | 0.50 |
| Difference between proportions | Cohen’s \(g\) | 0.05 | 0.15 | 0.25 |
| Correlation | \(r\) | 0.10 | 0.30 | 0.50 |
| | \(r^{2}\) | 0.01 | 0.09 | 0.25 |
| Crosstabulation | \(w\), \(\varphi\), \(V\), \(C\) | 0.10 | 0.30 | 0.50 |
| ANOVA | \(f\) | 0.10 | 0.25 | 0.40 |
| | \(\eta^{2}\) | 0.01 | 0.06 | 0.14 |
| Multiple regression | \(R^{2}\) | 0.02 | 0.13 | 0.26 |
| | \(f^{2}\) | 0.02 | 0.15 | 0.35 |
| Notes: The rationale for most of these benchmarks can be found in Cohen (1988) at the following pages: Cohen’s \(d\) (p. 40), \(q\) (p. 115), Cohen’s \(g\) (pp. 147–149), \(r\) and \(r^{2}\) (pp. 79–80), Cohen’s \(w\) (pp. 224–227), \(f\) and \(\eta^{2}\) (pp. 285–287), \(R^{2}\) and \(f^{2}\) (pp. 413–414). | | | | |
\(\Delta\text{R}^{2}\) = 0.00388
Cohen’s \(f^{2}\) = 0.00414
Latitudinal cline of chronotype (Leocadio-Miguel et al., 2017).
Critique of Leocadio-Miguel et al. latitude article (Vartanian, 2024).
The ASA statement on p-values (Wasserstein & Lazar, 2016).
Authors who rely solely on the p-value demonstrate a preference for statistical rituals over statistical reasoning (Gigerenzer, 2004).

Note: The HO score (Horne & Östberg, 1976) goes from 16 to 86, with higher scores indicating a preference for morningness.
(Reproduced from Leocadio-Miguel et al. (2017, Figure 2))
Check Antoine Soetewey’s flowchart to help you decide the appropriate statistical test for your data.
(Artwork by Antoine Soetewey)
infer Package

infer is a Tidymodels package that provides a statistical grammar for inferential statistics.
It also offers pipeline examples for various hypothesis tests. These pipelines can serve as a helpful starting point for implementing other types of models.
infer Package
specify() specifies the variable, or relationship between variables, of interest.
hypothesize() declares the null hypothesis.
generate() generates data reflecting the null hypothesis or using the bootstrap.
calculate() calculates summary statistics from either the observed data to form the observed test statistic, or from the generated data to form the null distribution of test statistics.
visualize() plots the null distribution of test statistics.

Is there a meaningful difference in body mass between male and female Adelie penguins?
Sexual dimorphism (physical differences between sexes) is common in many bird species. This could have implications for understanding their ecology and behavior.
To test this, we will perform a t-Test for Independent Samples using a randomization-based empirical null distribution approach.
Our test type-1 error rate (\(\alpha\)) will be set at 0.05 and our type-2 error rate (\(\beta\)) will be set at 0.2, giving us a power of 0.8.
\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} \\ \text{H}_{1}: \mu_{A} \neq \mu_{B} \\ \end{cases} \]
\[ \begin{cases} \text{H}_{0}: \text{Cohen's} \ d < \text{MES} \\ \text{H}_{1}: \text{Cohen's} \ d \geq \text{MES} \\ \end{cases} \]
Is there a meaningful difference in body mass between male and female Adelie penguins?
To ensure practical significance, we will analyze the difference in means for its effect size, considering a 95% confidence interval. Cohen’s benchmark for a medium effect size (\(d\) = 0.5) will be used as our Minimum Effect Size (MES) (Cohen, 1988).
Tip: See Perezgonzalez (2015) to learn more about data testing and practical significance.
\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} \\ \text{H}_{1}: \mu_{A} \neq \mu_{B} \\ \end{cases} \]
\[ \begin{cases} \text{H}_{0}: \text{Cohen's} \ d < \text{MES} \\ \text{H}_{1}: \text{Cohen's} \ d \geq \text{MES} \\ \end{cases} \]
Cohen’s \(d\) is a measure of effect size that indicates the standardized difference between two means. When comparing two independent means, Cohen’s \(d\) for a two-tailed t-test can be calculated as:
\[ d = \frac{|\text{m}_{A} - \text{m}_{B}|}{s_{\text{pooled}}} \]
Where:

\(\text{m}_{A}\) and \(\text{m}_{B}\) are the sample means of groups A and B.
\(s_{\text{pooled}}\) is the pooled standard deviation of the two samples.
The pooled standard deviation (\(s_{pooled}\)) is calculated using the following formula:
\[ s_{pooled} = \sqrt{\frac{(n_{1} - 1)s_{1}^{2} + (n_{2} - 1)s_{2}^{2}}{n_{1} + n_{2} - 2}} \]
Where:

\(n_{1}\) and \(n_{2}\) are the sample sizes of the two groups.
\(s_{1}\) and \(s_{2}\) are the standard deviations of the two groups.
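To make the two formulas concrete, here is a minimal base-R sketch with made-up values (the vectors a and b are illustrative, not real data):

```r
# Two illustrative independent samples (made-up values)
a <- c(3700, 3800, 3450, 3900, 3650)
b <- c(3300, 3200, 3400, 3100, 3250)

n_a <- length(a)
n_b <- length(b)

# Pooled standard deviation, as in the formula above
s_pooled <- sqrt(((n_a - 1) * var(a) + (n_b - 1) * var(b)) / (n_a + n_b - 2))

# Cohen's d: absolute difference in means, standardized
d <- abs(mean(a) - mean(b)) / s_pooled
d
```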
pwr and pwrss Packages

These two packages provide functions for performing power analysis and sample size calculations for various statistical tests, including t-tests, ANOVA, regression, and more.
pwr provides basic functions for power analysis.
pwrss provides functions to perform power and sample size calculations for various statistical tests.
The power (\(1 - \beta\)) of a statistical test is the probability that it will yield statistically significant results (Cohen, 1988).
A power analysis helps determine the minimum sample size required to detect an effect of a given size with a desired level of confidence (Cohen, 1988). We need to check if our sample size will be sufficient to achieve this confidence.
A power analysis for a t-test with an expected effect size of 0.5 (\(\text{Cohen's} \ d\)) indicates that we would need approximately 64 participants in each group to achieve a power of 0.8 at a significance level of 0.05.
library(pwrss)
pwr_analysis <- pwrss.t.2means(
mu1 = 0.5,
mu2 = 0,
power = 0.8,
alpha = 0.05,
welch.df = TRUE,
alternative = "not equal"
)
#> +--------------------------------------------------+
#> | SAMPLE SIZE CALCULATION |
#> +--------------------------------------------------+
#>
#> Welch's T-Test (Independent Samples)
#>
#> ---------------------------------------------------
#> Hypotheses
#> ---------------------------------------------------
#> H0 (Null Claim) : d - null.d = 0
#> H1 (Alt. Claim) : d - null.d != 0
#>
#> ---------------------------------------------------
#> Results
#> ---------------------------------------------------
#> Sample Size = 64 and 64 <<
#> Type 1 Error (alpha) = 0.050
#> Type 2 Error (beta) = 0.199
#> Statistical Power = 0.801
power.t.test(
ncp = pwr_analysis$ncp,
df = pwr_analysis$df,
alpha = pwr_analysis$parms$alpha,
alternative = "two.sided",
plot = TRUE,
verbose = FALSE
)
🔗 Independence of observations.
🔔 Normality of the distribution of the response variable (body_mass_g) within each group (sex is the explanatory/independent variable).
⚖️ Homogeneity of variances between groups (only if using Student’s t-test; Welch’s t-test and our permutation approach do not require this assumption).
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
filter(species == "Adelie") |>
drop_na(body_mass_g, sex) |>
ggplot(
aes(
x = sex,
y = body_mass_g,
fill = sex
)
) +
geom_boxplot(outlier.color = "red") +
geom_jitter(
width = 0.2,
alpha = 0.1
) +
labs(
x = "Sex",
y = "Body Mass (g)",
fill = "Sex"
)
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
filter(species == "Adelie") |>
drop_na(body_mass_g, sex) |>
ggplot(
aes(
x = body_mass_g,
fill = sex
)
) +
geom_histogram(
position = "identity",
alpha = 0.7
) +
labs(
x = "Body Mass (g)",
y = "Frequency",
fill = "Sex"
)

library(brandr)
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
filter(species == "Adelie") |>
drop_na(body_mass_g, sex) |>
ggplot(
aes(
sample = body_mass_g,
color = sex
)
) +
stat_qq() +
stat_qq_line(color = "black") +
facet_wrap(vars(sex)) +
labs(
x = "Theoretical Quantiles",
y = "Sample Quantiles",
color = "Sex"
)
✅ Independence of observations.
✅ Normality of the distribution of the response variable (body_mass_g) within each group (sex is the explanatory/independent variable).
⏭️ Homogeneity of variances between groups (only if using Student’s t-test; Welch’s t-test and our permutation approach do not require this assumption).
The t-statistic for comparing two independent means can be calculated using the following formula:
\[ t = \frac{\text{m}_{A} - \text{m}_{B}}{\sqrt{\frac{s_{A}^{2}}{n_{A}} + \frac{s_{B}^{2}}{n_{B}}}} \]
Where:

\(\text{m}_{A}\) and \(\text{m}_{B}\) are the sample means of groups A and B.
\(s_{A}^{2}\) and \(s_{B}^{2}\) are the sample variances.
\(n_{A}\) and \(n_{B}\) are the sample sizes.
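As a numeric check of this formula, here is a base-R sketch with made-up values; the hand-computed statistic matches the Welch statistic returned by t.test():

```r
# Two illustrative independent samples (made-up values)
a <- c(3700, 3800, 3450, 3900, 3650)
b <- c(3300, 3200, 3400, 3100, 3250)

# t-statistic for two independent means, as in the formula above
t_manual <- (mean(a) - mean(b)) / sqrt(var(a) / length(a) + var(b) / length(b))

# R's built-in Welch t-test computes the same statistic
t_builtin <- unname(t.test(a, b)$statistic)
```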
What would our data look like if there was no difference between sexes?
One way to simulate this scenario is to randomly shuffle the sex values among the body_mass_g values. This breaks any real association between sex and body mass.
We then calculate the t-statistic for this new dataset. Repeating this process many times allows us to build a distribution of t-statistics that represent what we would expect to see if there were no real difference in body mass between male and female Adelie penguins.
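The shuffling procedure described above can be sketched with the infer pipeline below. This is one plausible way the null_dist object printed next could have been built; the number of replicates and the permute type are assumptions:

```r
library(dplyr)
library(infer)
library(palmerpenguins)
library(tidyr)

# Permutation-based null distribution of t-statistics:
# shuffle sex among body_mass_g values, recompute t, repeat
null_dist <-
  penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  specify(body_mass_g ~ sex) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "t", order = c("male", "female"))
```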
null_dist
#> Response: body_mass_g (numeric)
#> Explanatory: sex (factor)
#> Null Hypothesis: in...
#> # A tibble: 1,000 × 2
#> replicate stat
#> <int> <dbl>
#> 1 1 1.11
#> 2 2 -0.414
#> 3 3 -2.09
#> 4 4 1.23
#> 5 5 0.639
#> 6 6 -0.847
#> 7 7 -0.333
#> 8 8 0.0270
#> 9 9 -1.27
#> 10 10 0.144
#> # ℹ 990 more rows

Is there a meaningful difference in body mass between male and female Adelie penguins?
Our analysis found a statistically significant difference in means (\(t\) = 13.1, \(p\)-value < 0.001). The observed effect size was large and exceeded the Minimal Effect Size (MES) threshold (\(d\) = 2.17, 95% CI [1.76, 2.58]).
Since we could reliably detect effects of 0.5 (\(\text{Cohen's} \ d\)) or larger, the power of our test remains high, indicating that the probability of a false negative (\(\beta\)) is very low.
Based on these results, we conclude that there is a meaningful difference in body mass between male and female Adelie penguins, with male penguins having a higher mean body mass. Therefore, we reject the null hypothesis in favor of the alternative hypothesis.
\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} \\ \text{H}_{1}: \mu_{A} \neq \mu_{B} \\ \end{cases} \]
\[ \begin{cases} \text{H}_{0}: d < \text{MES} \\ \text{H}_{1}: d \geq \text{MES} \\ \end{cases} \]
Model diagnostics are crucial!
It’s essential to verify that all model assumptions hold. However, a discussion on this topic is beyond the scope of this course.
You can find these assumptions in most statistical textbooks, or you can look at the original papers that introduced the models (e.g., for t-tests, see Student (1908)).

(Artwork by Allison Horst)
🚨 Avoid Using! 🚨
Objective assumption tests (e.g., Anderson–Darling test) are not advisable for large samples, as they can be overly sensitive to minor deviations. Additionally, they might overlook visual patterns that are not captured by a single metric.
Usually, a visual inspection of the data is the preferred approach in most cases.
For a straightforward critique of normality tests specifically, refer to this article by Greener (2020).
See also: Kozak & Piepho (2018), Schucany & Ng (2006), and Shatz (2024).
Answer the following question using the data provided:
Was there a meaningful difference in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R)?
The Multidimensional Index for Sustainable Food Systems (MISFS) is a tool designed to assess the sustainability of food systems at a subnational level in Brazil, incorporating local behaviors and practices.
The MISFS-R is a revised version that introduces new indicators and a refined methodology for calculating the index.
For more details, see Carvalho et al. (2021) and Norde et al. (2023).

(Adapted from Norde et al. (2023, Figure 6))

(Reproduced from Norde et al. (2023, Figure 6))
This exercise is for educational purposes only.
The data used in this exercise requires further cleaning and validation before it can be used in real-world applications. For the purposes of this analysis, we will assume the data is valid, reliable, and satisfies all assumptions underlying the statistical tests performed, even though this may not hold in practice.
Please note that the results of the statistical test may not be valid due to these simplifications.
In real-world scenarios, always ensure that the assumptions of statistical tests are rigorously checked and validated before interpreting the results.
Remember the data science workflow:

(Reproduced from Wickham et al. (2023))
1. Create a new project and open it in Positron.
2. Download the Quarto template file, rename it as index.qmd, and store it in the project root.
3. Download the template references file, name it as reference.bib and store it in the project root.
4. Read and review the Quarto template.
5. Perform an a priori power analysis.
6. Download the data.
7. Inspect the data file.
8. Import the data into R.
9. Clean, tidy, and validate the data.
10. Transform the data as needed.
11. Save the processed data.
12. Conduct a brief exploratory data analysis.
13. Assess the model assumptions.
14. Model the data.
15. Write your conclusions.
16. Render the report.
(If you’re unable to download the data, you can access it here)
🧮 Recalculate the percentage.
❌ Remove municipalities with fewer than 10 monitored children.
❌ Remove municipalities where the number of children consuming ultra-processed foods exceeds the number monitored.
🙋♂️ If you’re stuck, ask for help.
🎉 Have fun!
You can find the solution to this exercise at the following links:
Code Repository

(Artwork by Allison Horst)
I hope you feel more confident in your R programming skills and are ready to tackle new challenges. Remember, learning is a continuous journey, and the R community is there to support you.
🇧🇷 For Brazilian graduate students, consider exploring the abnt Quarto format. It helps you write your thesis or dissertation in compliance with the Brazilian Association of Technical Standards (ABNT) standards while ensuring full reproducibility. Check out an example here.
🙋♂️ If you have any questions or need assistance, don’t hesitate to reach out!
Here are other great resources to learn more about R and programming in general.
🎓 Online Courses
Kon, F. (n.d.). Introdução à ciência da computação com python – Parte 1 [Introduction to computer science with python – Part 1] [Online course]. Coursera. https://www.coursera.org/learn/ciencia-computacao-python-conceitos (University of São Paulo, Brazil) (pt-BR)
Kon, F. (n.d.). Introdução à ciência da computação com python – Parte 2 [Introduction to computer science with python – Part 2] [Online course]. Coursera. https://www.coursera.org/learn/ciencia-computacao-python-conceitos-2s (University of São Paulo, Brazil) (pt-BR)
Peng, R. D., Leek, J., & Caffo, B. (n.d.). Data science specialization. [Online course]. Coursera. https://www.coursera.org/specializations/jhu-data-science (Johns Hopkins University, United States)
🎥 Videos
Thenório, I. (2022, March 22). A saga dos computadores [The computer saga] [YouTube video]. https://www.youtube.com/playlist?list=PLYjrJH3e_wDOA5mxhiMxE6yslcIzU5NkX (pt-BR)
Code.org. (2018, January 30). How computers work [YouTube video]. https://youtube.com/playlist?list=PLzdnOPI1iJNcsRwJhvksEo1tJqjIqWbN-&si=WkuM8c-AKI-NZ3td
Jago, M. (Director). (2014, August 29). Turing machines explained [YouTube video]. Computerphile. https://youtu.be/QtW1lQITckE?si=4_wgsHUd96Dvmy-e
Lockerbie, T. (2025, December 26). I made my phone’s chip 2,000,000× bigger and flew inside [YouTube video]. https://youtu.be/QtW1lQITckE?si=4_wgsHUd96Dvmy-e
Zaidan, G., & Saini, S. (2025, February 25). How are microchips made? [YouTube video]. TED-Ed. https://youtu.be/IkRXpFIRUl4?si=iQ7xQuFS6DZLuBY7
📙 Books
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz
Bryan, J., Hester, J., Pileggi, S., & Aja, E. D. (n.d.). What they forgot to teach you about R: The stuff you need to know about R, besides data analysis. https://rstats.wtf
Bryan, J. (n.d.). Happy Git and GitHub for the useR. https://happygitwithr.com (strongly recommended)
Wickham, H. (n.d.). Tidy design principles. https://design.tidyverse.org
Wickham, H. (n.d.). The tidyverse style guide. https://style.tidyverse.org
Wickham, H. (2019). Advanced R (2nd ed.). CRC Press. https://adv-r.hadley.nz
This presentation was created with the Quarto Publishing System. The code and materials are available on GitHub.

(Artwork by Allison Horst)
In accordance with the American Psychological Association (APA) Style, 7th edition.

(Artwork by Allison Horst)