An Introduction to the R Programming Language

Daniel Vartanian

February 2, 2026

Hi There! 👋

This course will introduce you to the R programming language.

Here are the main topics:

  1. Introduction
  2. Computer Science
  3. R Basics
  4. Data Munging
  5. Exploratory Data Analysis
  6. Modeling
  7. Exercise
  8. Conclusion

Introduction

Pre-Course Survey

Main Objective

Provide a solid introduction to R programming, with a focus on the Tidyverse and its ecosystem.

Learning Objectives

  1. Understand key concepts of computer science and programming.
  2. Learn the structure and essential components of the R programming language.
  3. Learn the tidyverse grammar for data manipulation.
  4. Learn the grammar of graphics for data exploration and visualization.
  5. Learn the tidymodels grammar for building statistical and predictive models.

Future courses?

An Introduction to Git and GitHub.

An Introduction to Quarto and Reproducible Workflows.

An Introduction to Geospatial Data Science and the r-spatial Framework.

An Introduction to Shiny Dashboards.

Course Content

Schedule

The course will take place in Laboratory 122 of the School of Philosophy, Letters and Human Sciences (FFLCH) at the University of São Paulo (USP).

The classes are distributed over 5 days, with a total of 30 hours.

Monday (02/02) 10:00-17:00

Tuesday (03/02) 10:00-17:00

Wednesday (04/02) 10:00-17:00

Thursday (05/02) 10:00-17:00

Friday (06/02) 10:00-17:00

Class Dynamic

Theory ➡️ Practice

🏋 In-class activities

☕ 15-minute break between sessions

🍝 1-hour lunch break

📓 Final project

😌 No formal evaluation

Main References

Learn from Mistakes

Mistakes will happen. Don’t let them discourage you!

Getting Help

Don’t be afraid to ask questions, even if they seem simple. Everyone has to start somewhere!

An old but good way to get help is to search or ask questions on forums like Stack Overflow.

Today, you can also use AI-powered tools like Claude to get instant help with your coding questions.

Notes

💾 Bring a flash drive/pen drive

👩‍💻 Try to use the same computer for all classes

💬 Always open this presentation in the browser

❌ Don’t miss a class

🙋‍♂️ Ask questions

🤝 Help each other

🙏 Be kind

🎉 Have fun!

Course Presentation

Always have this presentation open in your browser.

We will use it to navigate through the course content.

Use it to access the course materials and exercises.

Tip: Want to create QR Codes like this? Check the qrcode R package.

danielvartan.github.io/r-course

Module 1
Computer Science

Let’s Start from Zero

This is a programming course.

As such, we must understand the basics of how a computer works.

It is essential to start from materiality; otherwise, things stop making sense.

If you understand R at its core, you can navigate any challenge you encounter.

Computer Science

The science of information processes and their interactions with the world (Denning, 2005).

Computer science studies information processes both artificial and natural (Denning, 2005). It seeks to build a scientific foundation for such topics as computer design, computer programming, information processing, algorithmic solutions of problems, and the algorithmic process itself (Brookshear & Brylow, 2020).

Abstractions

Computer Architecture

Stored-Program Concept (1945) (AKA “von Neumann Architecture”).

First proposed by J. P. Eckert and his team (1944-1945).

Input -> Storage -> Processing -> Output

Binary Arith. & Boolean Logic

Counting in decimal, binary, and trinary:

Decimal  Binary  Trinary
      0       0        0
      1       1        1
      2      10        2
      3      11       10

Adding 1 + 1 + 1 in each base:

Decimal  Binary  Trinary
   01 +    01 +     01 +
   01 +    01 +     01 +
   01 =    01 =     01 =
   03      11       10

Boolean logic truth table:

x  y  x ∧ y  x ∨ y  ¬x  ¬y
0  0      0      0   1   1
0  1      0      1   1   0
1  0      0      1   0   1
1  1      1      1   0   0

Tip 1: See this video by Iberê Thenório (2022), on the Manual do Mundo YouTube channel, to learn more (pt-BR).

Tip 2: Learn more about number bases and their history in Boyer & Merzbach (1968/2011).

Logic Gates

What Are We Actually Doing When We Program?

Basic Terms

An algorithm is a set of steps that defines how a task is performed.

A representation of an algorithm is called a program.

The process of developing a program, encoding it in machine-compatible form, and inserting it into a machine is called programming, or sometimes coding.

Programs, and the algorithms they represent, are collectively referred to as software, in contrast to the machinery itself, which is known as hardware.
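
For instance, here is a minimal sketch of the distinction: an algorithm for averaging a set of numbers (sum the values, then divide by how many there are) and a program representing it in R (the values are hypothetical).

values <- c(2, 4, 6)    # Input
total <- sum(values)    # Step 1: Sum the values
total / length(values)  # Step 2: Divide by the count
#> [1] 4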

Programming Languages

The instructions which govern this operation must be given to the device in absolutely exhaustive detail. […] All these procedures require the use of some code to express the logical and the algebraical definition of the problem under consideration, as well as the necessary numerical material (von Neumann, 1993).

Programming Paradigms

A programming paradigm is a specific approach to the programming process (Brookshear & Brylow, 2020).

Procedural (Imperative)

x <- seq(1, 10)
y <- numeric()
z <- 0

for (i in x) {
  y <- c(y, i + 1)
}

for (i in y) {
  z <- z + i
}

z / length(x)
#> [1] 6.5

Functional (Declarative)

seq(1, 10) |>
  vapply(
    \(x) x + 1,
    numeric(1)
  ) |>
  mean()
#> [1] 6.5

Object-Oriented (Imperative)

library(S7)

dog <- new_class(
  name = "dog",
  properties = list(
    name = class_character,
    age = class_numeric
  )
)

lola <- dog(name = "Lola", age = 11)
speak <- new_generic("speak", "x")

method(speak, dog) <- function(x) "Woof"

speak(lola)
#> [1] "Woof"

Low and High-Level

Open-Source Prog. Lang.

Why R?

R is an excellent language to get started with programming because it’s specialized. Unlike some other languages, it doesn’t have an overwhelming number of features for you to learn.

Python is Fine Too, I Guess…

In a general sense, Python is also good for learning how to program, but it is much easier to learn how to work with data in R. In academia, both programming languages are very important.

It’s Not What You Think

Programming in movies versus programming in real life:

A Bit of R History

R is a free and open-source programming language designed for data analysis and graphics (Ihaka & Gentleman, 1996).

It was developed by Ross Ihaka and Robert Gentleman in the statistics department at the University of Auckland (New Zealand) in 1991 and introduced to the scientific community in 1993 (Peng, 2022).

Community

Important R Com. and Events

How to Learn More

Module 2
R Basics

Installing R

Installing R is very simple.

You just need to download the installer from the Comprehensive R Archive Network (CRAN).

Go to Download R for [YOUR OS] and then click on the base link (the base version of R).

Simply follow the instructions of the installer.

Installing R

The Positron IDE

An IDE (Integrated Development Environment) is software that provides a set of tools to help programmers write code. It brings together things like a text editor, a console, a debugger, and a file manager.

Positron is a flavour of Visual Studio Code, the most widely used IDE for programming in any language. It supersedes RStudio, for many years the main IDE for R, and focuses on providing a better experience for programming in R and Python.

Note: Always use technical software in English.

The Positron IDE

Installing Positron

Creating a Project

Project-oriented workflow.

A project consolidates all related files and resources in one place, ensuring that your work is reproducible and well-organized.

It is important to have a structured project directory. A best practice is to follow the structure used for R packages (Marwick et al., 2018).

Every data project must be accompanied, at a minimum, by a README file explaining the project and a LICENSE file defining the terms of use of the data (see Open Source Licenses here).

Creating a Project

Add. a README & LICENSE File

You can use the usethis R package to create these files.

First, load the package using:

# install.packages("usethis")
library(usethis)

Add a README file using:

use_readme_md()

Add a GNU General Public License version 3 file using:

use_gpl3_license()

Add. a README & LICENSE File

Quarto Notebooks

Markdown Basics

Markdown is a lightweight markup language with plain text formatting syntax. The great thing about Markdown is that you can forget about the formatting and focus on the content.

Quarto Notebooks are a type of Markdown document that allows you to mix code, text, and output in a single document. These slides are all written in Markdown.

Learn more about the Markdown syntax here.
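
Here is a minimal sketch of a Quarto notebook source, mixing Markdown text with an executable R code chunk (the title and content are hypothetical):

---
title: "My First Notebook"
---

# A Heading

Some text with **bold**, *italic*, and `code`.

```{r}
1 + 1
```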

Creating a Quarto Notebook

Objects

Everything in R is an object.

Everything that happens in R is the result of a function call.
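
You can see both ideas in action: even the + operator is an object of class function, and writing it with backticks makes the function call explicit.

`+`(1, 2) # The same as 1 + 2
#> [1] 3
`+` |> class()
#> [1] "function"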

Scalars (0D): 1

Vectors (1D): c(1, 2, 3)

Matrices (2D):
matrix(1:9, nrow = 3, ncol = 3)

Arrays (nD):
array(1:27, dim = c(3, 3, 3))

Vectorized Operations

In R, most operations are vectorized, meaning they operate on all elements of a vector at once, without needing explicit loops.

This makes code more concise, readable, and less error-prone, while improving efficiency compared to non-vectorized languages.

c(1, 2, 3) + 1
#> [1] 2 3 4
c(1, 2, 3) * c(1, 2, 3)
#> [1] 1 4 9

Primary Data Types

These are part of the atomic data types in R.

c(1, 2, 3) |> is.atomic()
#> [1] TRUE
as.Date("2026-01-01") |> class()
#> [1] "Date"
as.Date("2026-01-01") |> typeof()
#> [1] "double"

Data Classes

  • logical (e.g., TRUE, FALSE)
  • integer (e.g., 1, 2, 3)
  • double (e.g., 1.0, 2.0, 3.0)
  • character (e.g., “Maria”, “John”)
  • factor (e.g., 1 = “Male”, 2 = “Female”)
  • complex (e.g., 1+2i, 3+4i)
  • raw (e.g., 01, 0f, 10)
  • ts (e.g., time series data)
  • function (e.g., function objects)

And much more…

Non-Atomic Objects

Non-atomic types are objects that can be composed of different data types.

They can also be recursive objects, meaning they can contain themselves as elements.

list() |> is.atomic()
#> [1] FALSE
list() |> is.recursive()
#> [1] TRUE
c("a", "b", "c", c("d", "e"))
#> [1] "a" "b" "c" "d" "e"
c(1L, 2L, 3L, c(4L, 5L))
#> [1] 1 2 3 4 5
c(1.123, 2.123, 3.123, c(4.123, 5.123))
#> [1] 1.123 2.123 3.123 4.123 5.123
c(TRUE, FALSE, TRUE, c(FALSE, TRUE))
#> [1]  TRUE FALSE  TRUE FALSE  TRUE
list(list(TRUE, 1L, pi, "a"))
#> [[1]]
#> [[1]][[1]]
#> [1] TRUE
#> 
#> [[1]][[2]]
#> [1] 1
#> 
#> [[1]][[3]]
#> [1] 3.141593
#> 
#> [[1]][[4]]
#> [1] "a"

Operations

Left-Hand Side | Operator | Right-Hand Side (e.g., 1 + 1)

Arithmetic Operators

1 + 1
#> [1] 2
1 - 1
#> [1] 0
1 * 2
#> [1] 2
1 / 2
#> [1] 0.5
2^2
#> [1] 4
5 %/% 2
#> [1] 2
5 %% 2
#> [1] 1

Assignment Operators

  • `<-`() Leftwards assignment
    (use this!)
  • `<<-`() Leftwards assignment (Global Environment)
  • `->`() Rightwards assignment
  • `->>`() Rightwards assignment (Global Environment)
  • `=`() Equals sign assignment
x <- 1
x
#> [1] 1
x <<- 1
x
#> [1] 1
1 -> x
x
#> [1] 1
1 ->> x
x
#> [1] 1
x = 1
x
#> [1] 1
1 = x # Error!

Names

Names

R has strict rules regarding the names of objects.

  • Names can only contain letters, numbers, periods (.), and underscores (_).
  • Names can’t start with a number or an underscore.
  • Names can’t start with a period followed by a number.
  • Names are case-sensitive.
  • Names can’t be reserved words.

variable-name Bad (kebab-case)

variable.name Good, but not advisable

variable_name Good (snake_case), most used in R

1variable_name Bad

_variable_name Bad

.1variable_name Bad

variableName Good (camelCase)

VariableName Good (PascalCase)

TRUE Bad (reserved word)

Comparison Operators

1 == 1
#> [1] TRUE
1 != 1
#> [1] FALSE
2 > 1
#> [1] TRUE
2 < 1
#> [1] FALSE
1 >= 1
#> [1] TRUE
2 <= 1
#> [1] FALSE

Logical Operators

c(TRUE, FALSE) & c(TRUE, TRUE)
#> [1]  TRUE FALSE
TRUE && FALSE
#> [1] FALSE
c(TRUE, FALSE) && c(TRUE, TRUE) # Error!
c(TRUE, FALSE) | c(TRUE, TRUE)
#> [1] TRUE TRUE
TRUE || FALSE
#> [1] TRUE
c(TRUE, FALSE) || c(TRUE, TRUE) # Error!
!TRUE
#> [1] FALSE

Other Operators

1:10
#>  [1]  1  2  3  4  5  6  7  8  9 10
x <- -5
y <- 0

x:y # Bad! Use `seq()` instead.
#> [1] -5 -4 -3 -2 -1  0
"a" %in% c("a", "b", "c")
#> [1] TRUE
x <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)
y <- matrix(5:8, nrow = 2, ncol = 2, byrow = TRUE)

x %*% y
#>      [,1] [,2]
#> [1,]   19   22
#> [2,]   43   50

Operator Precedence

  1. `^`() Exponent (Right to Left)
  2. -x, +x Unary minus, Unary plus (Left to Right)
  3. `%%`() Modulus (Left to Right)
  4. `*`(), `/`() Multiplication, Division (Left to Right)
  5. `+`(), `-`() Addition, Subtraction (Left to Right)
  6. `==`(), `!=`() Comparisons (Left to Right)
  7. `!`() Logical NOT (Left to Right)
  8. `&`(), `&&`() Logical AND (Left to Right)
  9. `|`(), `||`() Logical OR (Left to Right)
  10. `->`(), `->>`() Rightward assignment (Left to Right)
  11. `<-`(), `<<-`() Leftward assignment (Right to Left)
  12. `=`() Leftward assignment (Right to Left)
2^2 + 1 * 2
#> [1] 6
!FALSE && TRUE
#> [1] TRUE

Subsetting

Atomic

  • `[`() One-level extraction
c("a", "b", "c")[1]
#> [1] "a"

Indexation

Indexation in R starts at 1!

c("a", "b", "c")[0]
#> character(0)
c("a", "b", "c")[1]
#> [1] "a"
c("a", "b", "c")[2]
#> [1] "b"
c("a", "b", "c")[3]
#> [1] "c"

Non-Atomic

data <- list(a = 1, b = list(2, 3))
data[1]
#> $a
#> [1] 1
data[[1]]
#> [1] 1
data$a # == data[["a"]]
#> [1] 1
data[["b"]][[2]]
#> [1] 3

Subsetting

Data Frames

  • x[i, ] Extract row i
  • x[, j] Extract column/variable j
  • x[i, j] Extract row i of column/variable j
data <- data.frame(a = c(1, 2), b = c(4, 5))

data
#>   a b
#> 1 1 4
#> 2 2 5
data["a"]
#>   a
#> 1 1
#> 2 2
data[["a"]] # == data$a
#> [1] 1 2
data[1, ]
#>   a b
#> 1 1 4
data[, 1]
#> [1] 1 2
data[1, 1]
#> [1] 1

Special Values

NA (Not Available)

Missing values must be explicitly declared in R. For that, R uses the NA value.

NA comes in different flavors, for example: NA_integer_, NA_real_, NA_character_, NA_complex_.

If you use just NA, R will use the most appropriate type.
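
A quick way to see these flavors is to inspect their types:

NA |> typeof()
#> [1] "logical"
NA_real_ |> typeof()
#> [1] "double"
NA_character_ |> typeof()
#> [1] "character"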

c(1, NA, 3)
#> [1]  1 NA  3
is.na(c(1, NA, 3))
#> [1] FALSE  TRUE FALSE

NaN (Not a Number)

0 / 0
#> [1] NaN
is.nan(0 / 0)
#> [1] TRUE

Inf (Infinite)

1 / 0
#> [1] Inf
-1 / 0
#> [1] -Inf
is.infinite(1 / 0)
#> [1] TRUE

Tip: See the naniar package.

Pipes

R Native Pipe
(Pipe introduced in R 4.1.0; the _ placeholder requires R 4.2.0 or later)

object |> function()
object |> function(1, par = _)
pi |>
  round() |>
  rep("pi", times = _)
#> [1] "pi" "pi" "pi"

magrittr Pipe

object %>% function()
object %>% function(1, par = .)
object %>% function(1, .)
library(magrittr)
pi %>%
  round() %>%
  rep("pi", times = .)
#> [1] "pi" "pi" "pi"

Click here to learn the differences between the base R and magrittr pipes.

Commenting

🚨 Only if necessary 🚨

Your code must speak for itself.

In data analysis code, only use comments to record important things. If you need comments to explain what your code is doing, consider rewriting your code to be clearer (Wickham, n.d.-a).

# This is a comment.

1 + 1 # This is also a comment.

# 1 + 1 # You can also comment code that you don't want to run.

Documentation

One of the most important things in programming is knowing how to find help. In R, you can use the help() function to get help on a function or package.

Since R has a more academic focus, documentation is usually plentiful and very well written.

help(sum)
?sum

Documentation Websites: The logolink R package documentation.

Control Flow: Choices

if (condition) true_action
if (condition) true_action else false_action
x <- 60

if (x > 90) {
  "A"
} else if (x > 80) {
  "B"
} else if (x > 50 && x < 100) {
  "C"
} else if (x > 1) {
  "D"
} else {
  "E"
}
#> [1] "C"

else if statements are optional and can be repeated as needed.

else is also optional.

is.* Functions

x <- "a"

if (is.numeric(x)) {
  "A"
} else {
  "B"
}
#> [1] "B"

Exercise (Choices)

Write an R script that asks the user to select a traffic light color (red, yellow, or green). Based on the user’s selection, the script should print the appropriate action:

  • If the user selects red, print Stop!
  • If the user selects yellow, print Slow down!
  • If the user selects green, print Go!

Use menu() to ask for the color selection. Note that the function returns the index of the selected choice (1, 2, or 3).

color <- menu(
  choices = c("red", "yellow", "green"),
  title = "Select a traffic light color:"
)

Use print() to display the action.

print("Your action here")
#> [1] "Your action here"

Exercise (Choices)

Expected output:

color <- menu(
  choices = c("red", "yellow", "green"),
  title = "Select a traffic light color:"
)
#> Select a traffic light color:
#>
#> 1: red
#> 2: yellow
#> 3: green

#> Selection: 2

#>
#> [Your code goes here!]
#>

#> [1] "Slow down!"

Answer:

Code
# color <- menu(
#   choices = c("red", "yellow", "green"),
#   title = "Select a traffic light color:"
# )

color <- 2  # For demonstration purposes

if (color == 1) {
  print("Stop!")
} else if (color == 2) {
  print("Slow down!")
} else if (color == 3) {
  print("Go!")
} else {
  print("Invalid selection")
}
#> [1] "Slow down!"

Control Flow: Loops

🚨 Avoid loops 🚨 (if you can). Use functionals instead.

while (condition) perform_action
x <- 0

while (x < 2) {
  x <- x + 1

  print(x)
}
#> [1] 1
#> [1] 2
repeat perform_action

Use break to exit a loop early.

x <- 0

repeat {
  x <- x + 1

  if (x == 2) break
}

x
#> [1] 2
for (item in vector) perform_action
for (i in 1:3) {
  print(i)
}
#> [1] 1
#> [1] 2
#> [1] 3

Use next to skip to the next iteration.

for (i in 1:3) {
  if (i == 2) {
    next
  }

  print(i)
}
#> [1] 1
#> [1] 3

Exercise (for Loops)

Write an R script that goes through temperatures for 7 days and gives weather advice for each day.

  1. You have temperature data for a week:

temperatures <- c(18, 22, 14, 27, 7, 21, 31)

  2. Use a for loop to go through each temperature.

  3. For each day, use if/else to give advice:

  • If the temperature is below 18°C, print Day x: Cold - Wear a coat!
  • If the temperature is between 18 and 23°C, print Day x: Nice - Perfect weather!
  • If the temperature is above 23°C, print Day x: Hot - Stay hydrated!

  4. Replace x with the day number (1-7).

Exercise (for Loops)

Use seq_along() to iterate over the indices of the temperatures vector.

temperatures |> seq_along()
#> [1] 1 2 3 4 5 6 7

Use paste() or paste0() to paste values.

i <- 1

paste() uses " " as the default separator.

paste("Day", i, ": Cold - Wear a coat!")
#> [1] "Day 1 : Cold - Wear a coat!"

paste0() uses "" (empty string) as the default separator.

paste0("Day ", i, ": Cold - Wear a coat!")
#> [1] "Day 1: Cold - Wear a coat!"

Exercise (for Loops)

Expected output:

temperatures <- c(18, 22, 14, 27, 7, 21, 31)

#>
#> [Your code goes here!]
#>

#> [1] "Day 1: Nice - Perfect weather!"
#> [1] "Day 2: Nice - Perfect weather!"
#> [1] "Day 3: Cold - Wear a coat!"
#> [1] "Day 4: Hot - Stay hydrated!"
#> [1] "Day 5: Cold - Wear a coat!"
#> [1] "Day 6: Nice - Perfect weather!"
#> [1] "Day 7: Hot - Stay hydrated!"

Solution:

Code
temperatures <- c(18, 22, 14, 27, 7, 21, 31)

for (i in seq_along(temperatures)) {
  i_temperature <- temperatures[i]

  if (i_temperature < 18) {
    paste0("Day ", i, ": Cold - Wear a coat!") |> print()
  } else if (i_temperature <= 23) {
    paste0("Day ", i, ": Nice - Perfect weather!") |> print()
  } else {
    paste0("Day ", i, ": Hot - Stay hydrated!") |> print()
  }
}
#> [1] "Day 1: Nice - Perfect weather!"
#> [1] "Day 2: Nice - Perfect weather!"
#> [1] "Day 3: Cold - Wear a coat!"
#> [1] "Day 4: Hot - Stay hydrated!"
#> [1] "Day 5: Cold - Wear a coat!"
#> [1] "Day 6: Nice - Perfect weather!"
#> [1] "Day 7: Hot - Stay hydrated!"

Exercise (while Loops)

Write an R script that keeps asking the user to guess a secret number until they get it right or run out of attempts.

  1. Set a secret number.

  2. Give the user 5 attempts to guess.

  3. Use a while loop that continues as long as the user hasn’t guessed the number and still has attempts left.

  4. Inside the loop:

  • Ask for a guess.
  • Check if the guess is correct. If yes, exit the loop.
  • If wrong, check if the guess is too high or too low and give a hint.
  • Decrease the number of attempts.

  5. After the loop, print whether they won or lost.

Exercise (while Loops)

Use sample() to set a secret number between 1 and 100.

secret <- sample(1:100, 1)
secret
#> [1] 93

Set attempts and correct variables to control the loop.

attempts <- 5
correct <- FALSE

Use readline() to ask for a guess and as.numeric() to convert it to a number.

guess <-
  readline(prompt = "Enter your guess: ") |>
  as.numeric()

Exercise (while Loops)

Expected behavior:

secret <- sample(1:100, 1)
attempts <- 5
correct <- FALSE

#>
#> [Your code goes here!]
#>

#> [1] "You have 5 attempts left."
#> [Enter your guess: 50
#> [[1] "Too high! Try again."
#> [[1] "You have 4 attempts left."
#> [Enter your guess: 30
#> [[1] "Too high! Try again."
#> [[1] "You have 3 attempts left."
#> [Enter your guess: 20
#> [[1] "Too high! Try again."
#> [[1] "You have 2 attempts left."
#> [Enter your guess: 10
#> [[1] "Correct! You won!"

Solution:

Code
secret <- sample(1:100, 1)
attempts <- 5
correct <- FALSE

while (attempts > 0 && !correct) {
  paste("You have", attempts, "attempts left.") |> print()

  guess <-
    readline(prompt = "Enter your guess: ") |>
    as.numeric()

  if (guess == secret) {
    correct <- TRUE
  } else if (guess < secret) {
    print("Too low! Try again.")
  } else {
    print("Too high! Try again.")
  }

  attempts <- attempts - 1
}

if (correct) {
  print("Correct! You won!")
} else {
  print(paste("Game over! The secret number was", secret))
}

Functions

name <- function(parameter_1, parameter_2, ...) expression

The last evaluated expression is the return value of the function.

foo <- function(x, y = 1) {
  x + y
}

foo(9)
#> [1] 10

Shorthand Not./Anonymous Fun.
(Introduced in R 4.1.0)

foo <- function(x) x(9)

foo(\(x) x + 1)
#> [1] 10

You can use return() to return a value before the end of the function, but this is a bad practice.

foo <- function(x) {
  return(x + 1)

  "a"
}

foo(9)
#> [1] 10

Exercise (Functions)

Write a function that checks if a password meets basic security requirements.

  1. Create a function called validate_password that takes one parameter: password.

  2. Check if the password is a character string and has at least 8 characters.

  3. The function should return:

  • Password must be text if it’s not a character.
  • Password too short if it’s less than 8 characters.
  • Valid password if all conditions are met.

Test your function with different inputs to ensure it works correctly.

Exercise (Functions)

Use is.character() to check if the password is a character string.

password <- "mypassword"
password |> is.character()
#> [1] TRUE

Use nchar() to get the number of characters in the password.

password |> nchar()
#> [1] 10

Use stop() to throw an error if the password is invalid.

stop("Password must be text")
#> Error:
#> ! Password must be text

Exercise (Functions)

Expected behavior:

#>
#> [Your code goes here!]
#>

validate_password(12345678)
#> Error:
#> ! Password must be text

validate_password("hello")
#> Error:
#> ! Password too short

validate_password("hello123")
#> [1] "Valid password"

Solution:

Code
validate_password <- function(password) {
  if (!is.character(password)) {
    stop("Password must be text")
  } else if (nchar(password) < 8) {
    stop("Password too short")
  } else {
    "Valid password"
  }
}

validate_password(12345678)
#> Error:
#> ! Password must be text
validate_password("hello")
#> Error:
#> ! Password too short
validate_password("hello123")
#> [1] "Valid password"

Functionals

A functional is a function that takes a function as an input and returns a vector as output (Wickham, 2019).

It makes the programmer’s work easier by applying a function to a vector without needing a loop. These functions are typically written in C (a lower-level programming language), making them very fast.

R has native functionals (e.g., lapply()), but prefer the solutions from the purrr package, part of the Tidyverse.

triple <- function(x) x * 3
with_for_loop <- function(x) {
  out <- numeric()
  for (i in x) out <- c(out, triple(i))
  out
}
library(purrr)

with_map <- function(x) map_dbl(x, triple)
library(microbenchmark)

microbenchmark(
  with_for_loop(1:1000),
  with_map(1:1000),
  times = 100,
  check = "equal"
)
#> Unit: microseconds
#>                   expr      min        lq      mean   median       uq      max
#>  with_for_loop(1:1000) 1054.845 1195.9870 1623.9336 1252.052 1679.917 5111.942
#>       with_map(1:1000)  335.771  361.8655  433.7118  396.855  451.498 1626.814
#>  neval cld
#>    100  a 
#>    100   b

Exercise (Functionals)

Use the map() function, from the purrr package, to adjust student grades.

  1. Load the purrr library using library(purrr).

  2. Use the following test scores from three different classes:

test_scores <- list(
  class_a = c(72, 85, 68, 90, 77),
  class_b = c(65, 70, 62, 75, 68),
  class_c = c(88, 92, 85, 95, 90)
)

  3. Use map() to add 5 points to every score.

  4. Finally, use map() again to multiply each score by 1.1 (a 10% increase).

Hints:

  • Don’t forget to load the purrr package!
  • Always use the pipe operator (|>) to improve code readability.
  • Use anonymous functions with map().
  • If needed, see map() documentation with ?map.

Exercise (Functionals)

Expected output:

library(purrr)

test_scores <- list(
  class_a = c(72, 85, 68, 90, 77),
  class_b = c(65, 70, 62, 75, 68),
  class_c = c(88, 92, 85, 95, 90)
)

#>
#> [Your code goes here!]
#>

#> $class_a
#> [1]  84.7  99.0  80.3 104.5  90.2
#>
#> $class_b
#> [1] 77.0 82.5 73.7 88.0 80.3
#>
#> $class_c
#> [1] 102.3 106.7  99.0 110.0 104.5

Solution:

Code
library(purrr)

test_scores <- list(
  class_a = c(72, 85, 68, 90, 77),
  class_b = c(65, 70, 62, 75, 68),
  class_c = c(88, 92, 85, 95, 90)
)

test_scores |>
  map(\(x) x + 5) |>
  map(\(x) x * 1.1)
#> $class_a
#> [1]  84.7  99.0  80.3 104.5  90.2
#> 
#> $class_b
#> [1] 77.0 82.5 73.7 88.0 80.3
#> 
#> $class_c
#> [1] 102.3 106.7  99.0 110.0 104.5

Environments

Environments are the data structures that power scoping (Wickham, 2019).

Scoping: The act of finding the value associated with a name.

foo <- function() {
  x <- 20

  x
}

x <- 10

foo()
#> [1] 20
y <- 45

bar <- function() {
  y <- 20

  y
}

bar()
#> [1] 20
y
#> [1] 45
z <- 100

foo <- function() {
  z <<- 50

  invisible()
}

z
#> [1] 100
foo()
z
#> [1] 50
.GlobalEnv$foo
#> function () 
#> {
#>     z <<- 50
#>     invisible()
#> }

Packages

The Unix philosophy

Each program must do one thing well (McIlroy et al., 1978).

In R, the fundamental unit of shareable code is the package.

A package bundles together code, data, documentation, and tests, and is easy to share with others (Wickham & Bryan, 2023).

install.packages("package_name")
library("package_name")
library(hms)
library(lubridate)
library(mctq)
so(parse_hm("23:00"), dhours(2))
#> 01:00:00
package_name::function_name()
hms::parse_hm("23:00")
#> 23:00:00

Note: Other programming languages refer to packages as libraries.

Packages

R is just the core. It needs packages to do anything useful.

A typical installation of R comes with a set of packages, like:

  • base: Basic R functions (e.g., sum())
  • datasets: Some datasets for testing and teaching (e.g., mtcars)
  • graphics: The basic graphics functions (e.g., plot())
  • grDevices: The graphics devices (e.g., pdf())
  • methods: The built-in object-oriented programming system
  • parallel: Support for parallel computation
  • stats: Basic statistical functions (e.g., t.test())
  • utils: Utility functions (e.g., install.packages())

Other Kind of Objects

Based on the atomic types, we can create other types of objects.

Using the right kind of object in R is very important, because these objects also include methods to deal with particular types of data.

For example, time can be expressed in different ways, like linear time (e.g., durations, periods, date-times, intervals) and circular time (e.g., time of day). We can also have time series, which are sequences of data points taken at successive points in time.

Date
(Days since 1970-01-01 (UNIX epoch))

as.Date("1970-01-01") |> as.numeric()
#> [1] 0
as.Date("1970-01-01") + 1
#> [1] "1970-01-02"

POSIXct
(Seconds since 1970-01-01 (UNIX epoch))

as.POSIXct("1970-01-01 00:00:00", tz = "UTC") |>
  as.numeric()
#> [1] 0
as.POSIXct("1970-01-01 00:00:00", tz = "UTC") + 1
#> [1] "1970-01-01 00:00:01 UTC"

The Tidyverse

Tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy (the Tidy Tools Manifesto), grammar, and data structures.

It was created by Hadley Wickham, a prominent figure in the R community and author of several key references for this course. Hadley Wickham’s contributions have significantly shaped modern R programming.

Main Tidyverse Packages

The Tidyverse also has a meta-package that installs and loads the most important packages from the collection.

install.packages("tidyverse")
library("tidyverse")
  • readr: Read flat files (e.g., txt, csv, tsv) into R.
  • stringr: A fresh approach to string manipulation in R.
  • lubridate: Make working with dates in R just that little bit easier.
  • dplyr: A grammar of data manipulation.
  • purrr: A functional programming toolkit for R.
  • forcats: Tools for working with categorical variables (factors).
  • ggplot2: An implementation of the Grammar of Graphics in R.

The rOpenSci

rOpenSci is a non-profit initiative founded in 2011 that brings together a community of researchers and developers committed to open science. They create and maintain high-quality R packages that provide access to data and tools from diverse sources, ensuring reliability and reproducibility in scientific research.

All packages go through a peer-review process, which ensures that they are well written and reliable.

Example: mctq R package peer-review.

Base R Solutions

You can do most of the things you need with the packages that come bundled with R. However, that is not the most efficient way to do things today.

In this course we are going to focus on the Tidyverse and rOpenSci packages, which are the most modern and efficient way to work with data in R.

If you want a deeper understanding of the R language, I encourage you to explore the base R solutions.

Best Practices

Always follow the Tidyverse style guide to write clean and efficient R code:

Tidyverse Style Guide

Variable and function names should use only lowercase letters, numbers, and _. Use underscores (_) (so-called snake case) to separate words within a name.

# Good
day_one
day_1

# Bad
DayOne
dayone

Tidy Design Principles

The tidyverse has four guiding principles:

  • It is human centered.
  • It is consistent.
  • It is composable.
  • It is inclusive.

How to Learn More

Module 3
Data Munging

Data Science

Data science is the study of the generalizable extraction of knowledge from data (Dhar, 2023).

For some, data science is just statistics (Broman, 2013) (hype statistics). For others, it’s a new interdisciplinary field that synthesizes statistics, informatics, computing, communication, management, and sociology (Cao, 2017).

Data Engineering

Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning (Reis & Housley, 2022).

First Things First

You only analyze or visualize data when you already have clean, tidy, and validated data.

The process of getting data ready for analysis is called data munging. You may also see it called data wrangling (usually when dealing with machine learning) or, simply, data cleaning.

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets (Lohr, 2014).

What Is Data After All?

Ackoff’s DIKW pyramid (Ackoff, 1989)

Data < Information < Knowledge < Wisdom

Data versus the interpretation of the data.

What Is Data After All?

Data is an abstraction. It’s a representation of the world around us. Without context, it has no meaning.

Statistical Value Chain

A value chain, roughly, consists of a sequence of activities that increase the value of a product step by step. […] One should realize that although the schema nicely organizes data analysis activities, in practice, the process is hardly linear (van der Loo & Jonge, 2018).

Raw Data

With raw data, we mean the data as it arrives at the desk of the analyst. The state of such data may of course vary enormously, depending on the data source (van der Loo & Jonge, 2018).

If the researcher has made any modifications to the raw data, it is not the raw form of the data (S. E. Ellis & Leek, 2018).

Tidy Data

Tidy Data

Untidy to Tidy
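
To make the idea concrete, here is a minimal sketch with hypothetical data: in the tidy version, each variable is a column and each observation is a row. The pivot_longer() function from the tidyr package reshapes the untidy wide table into the tidy long form.

library(tidyr)
library(tibble)

# Untidy: year values are spread across the column names
untidy <- tibble(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 21)
)

untidy |>
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "value"
  )
#> # A tibble: 4 × 3
#>   country year  value
#>   <chr>   <chr> <dbl>
#> 1 A       2019     10
#> 2 A       2020     11
#> 3 B       2019     20
#> 4 B       2020     21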

Valid Data

Data validation techniques are used to ensure that data is accurate, consistent, and reliable.

Examples of invalid data:

  • Impossible values (e.g., negative age).
  • Biologically implausible values (BIVs) (e.g., a child with a height of 2 meters).
  • Improbable values (e.g., a person 200 years old).
  • Duplicated values (e.g., the same person with two different ages).

Tip: For details on identifying biologically implausible values (BIVs) in anthropometric data, see the World Health Organization (WHO) anthro R package.
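
As a minimal sketch (with hypothetical data), a validation rule can be expressed as a logical test and used to flag invalid rows:

library(dplyr)

# Hypothetical data with an impossible and an improbable age
people <- tibble(id = 1:4, age = c(25, -3, 200, 42))

people |> filter(age < 0 | age > 120) # Flag the invalid rows
#> # A tibble: 2 × 2
#>      id   age
#>   <int> <dbl>
#> 1     2    -3
#> 2     3   200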

Tabular Data

Daily air quality measurements in New York (May to September 1973).

?airquality

Hierarchical Data

Data can be hierarchical, with multiple levels of organization.

Interest rate representing the adjusted average rate of overnight repurchase agreements backed by Brazil’s federal government securities held in the Special System for Settlement and Custody (SELIC). Reported in % per day.

library(httr2)

request("https://api.bcb.gov.br/") |>
  req_url_path_append("dados") |>
  req_url_path_append("serie") |>
  req_url_path_append("bcdata.sgs.11") |>
  req_url_path_append("dados") |>
  req_url_query(
    formato = "json",
    dataInicial = "01/12/2024",
    dataFinal = "03/12/2024"
  ) |>
  req_user_agent("github.com/danielvartan/r-course") |>
  req_perform() |>
  resp_body_json()

#> [[1]]
#> [[1]]$data
#> [1] "02/12/2024"
#> 
#> [[1]]$valor
#> [1] "0.041957"
#> 
#> 
#> [[2]]
#> [[2]]$data
#> [1] "03/12/2024"
#> 
#> [[2]]$valor
#> [1] "0.041957"

Spatial Data

Not all data is tabular; spatial data can be very large and complex.

Excel cannot handle spatial data, and GUI-based statistical software, when capable of handling spatial data, is often limited and struggles with performance issues.

Projected monthly average maximum temperatures (°C) for June 2021–2040, based on the global climate model (GCM) ACCESS-CM2.

Spreadsheet Syndrome

Spreadsheet syndrome is a term used to describe the problems that arise from using spreadsheets to manage data (Klein et al., 1992).

Relational Databases

Developed by E. F. Codd of IBM in 1970, the relational model is based on mathematical set theory and represents data as independent relations. Each relation (table) is conceptually represented as a two-dimensional structure of intersecting rows and columns. The relations are related to each other through the sharing of common entity characteristics (values in columns) (Coronel & Morris, 2019).
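
Here is a minimal sketch of the idea with two hypothetical tables: the relations share a common key (species_id), and a join relates them through it.

library(dplyr)

species <- tibble(
  species_id = c(1, 2),
  species = c("Adelie", "Gentoo")
)

measurements <- tibble(
  species_id = c(1, 1, 2),
  body_mass_g = c(3750, 3800, 5000)
)

# Relate the two tables through the shared key
measurements |> left_join(species, by = "species_id")
#> # A tibble: 3 × 3
#>   species_id body_mass_g species
#>        <dbl>       <dbl> <chr>  
#> 1          1        3750 Adelie 
#> 2          1        3800 Adelie 
#> 3          2        5000 Gentoo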

Data Documentation

Proper documentation is crucial for data sharing, reproducibility, and ensuring that others can understand and use your data effectively.

Here are some examples and guides to help you document your data:

Check S. E. Ellis & Leek (2018) for a detailed guide on data documentation.

Open Data Formats

There are many open data formats available for researchers to use. Open can mean different things, but in this context, it means that the format is not proprietary and can be used by anyone. Here are some examples:
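
  • csv/tsv: Delimited plain text for tabular data.
  • json: Hierarchical, human-readable data.
  • parquet: Columnar storage for large tabular data.
  • netcdf/geotiff: Array and spatial raster data.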

⚠️ Excel Files Are Not an Open Data Format! ⚠️

Data Frames

Data frames are a special type of list used for storing data tables. They are the most common way of storing data in R.

data.frame(a = 1:3, b = c("A", "B", "C"))
library(dplyr)
tibble(a = 1:3, b = c("A", "B", "C"))
library(mctq)
?std_mctq
View(std_mctq)

The tibble Package

tibble is a modern take on data frames in R, offering improved usability and integration with the Tidyverse. It enhances readability while retaining essential data frame features.

install.packages("tibble")

Using R native data.frame class:

mtcars
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Using tibble (Tidyverse way):

library(dplyr) # or library(tibble)

mtcars |> as_tibble()
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows

How to Read Data in R

It will depend on what kind of data you are working with.

Unlike Excel or GUI-based statistical software, R can deal with virtually any kind of data.

Examples of R functions to read data:
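
For instance (a minimal, non-exhaustive sketch; the file names are hypothetical, and each function comes from the package before the ::):

readr::read_csv("data.csv")           # Delimited text files
readxl::read_excel("data.xlsx")       # Excel spreadsheets
haven::read_sav("data.sav")           # SPSS data files
jsonlite::read_json("data.json")      # JSON files
arrow::read_parquet("data.parquet")   # Parquet files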

Meet the Penguins

Created by the great Allison Horst, the author of these beautiful illustrations.

install.packages("palmerpenguins")

The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.

These data were collected from 2007–2009 by Dr. Kristen Gorman with the Palmer Station, part of the US Long Term Ecological Research Network (LTER).

We will use this package to get familiar with R.

Tip: See this video by Renan, Michele, and Mucuvinha (2026) on the Mundo Sem Fim YouTube channel to learn more about these penguins.

Meet the Penguins

Inspecting the Raw Data File

🕵 Know your data.

  1. Click here to download the Palmer Penguins raw data file.
  2. Save it in a folder named data-raw in the root of your project.
  3. Using the file manager, open the file in Positron.

The data documentation can be accessed by running:

library(palmerpenguins)
?penguins_raw

Inspecting the Raw Data File

Before importing the data to R, let’s first take a look at the content of the data file in a text editor.

palmerpenguins-raw.csv
studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181,3750,MALE,NA,NA,Not enough blood for isotopes.
PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186,3800,FEMALE,8.94956,-24.69454,NA
PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18,195,3250,FEMALE,8.36821,-25.33302,NA
PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,NA,NA,NA,NA,NA,NA,NA,Adult not sampled.
PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193,3450,FEMALE,8.76651,-25.32426,NA
PAL0708,6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A2,Yes,2007-11-16,39.3,20.6,190,3650,MALE,8.66496,-25.29805,NA
PAL0708,7,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N4A1,No,2007-11-15,38.9,17.8,181,3625,FEMALE,9.18718,-25.21799,Nest never observed with full clutch.
PAL0708,8,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N4A2,No,2007-11-15,39.2,19.6,195,4675,MALE,9.4606,-24.89958,Nest never observed with full clutch.
PAL0708,9,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N5A1,Yes,2007-11-09,34.1,18.1,193,3475,NA,NA,NA,No blood sample obtained.
PAL0708,10,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N5A2,Yes,2007-11-09,42,20.2,190,4250,NA,9.13362,-25.09368,No blood sample obtained for sexing.
PAL0708,11,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N6A1,Yes,2007-11-09,37.8,17.1,186,3300,NA,8.63243,-25.21315,No blood sample obtained for sexing.
PAL0708,12,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N6A2,Yes,2007-11-09,37.8,17.3,180,3700,NA,NA,NA,No blood sample obtained.
PAL0708,13,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N7A1,Yes,2007-11-15,41.1,17.6,182,3200,FEMALE,NA,NA,Not enough blood for isotopes.
PAL0708,14,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N7A2,Yes,2007-11-15,38.6,21.2,191,3800,MALE,NA,NA,Not enough blood for isotopes.
PAL0708,15,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N8A1,Yes,2007-11-16,34.6,21.1,198,4400,MALE,8.55583,-25.22588,NA
PAL0708,16,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N8A2,Yes,2007-11-16,36.6,17.8,185,3700,FEMALE,NA,NA,Not enough blood for isotopes.
PAL0708,17,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N9A1,Yes,2007-11-12,38.7,19,195,3450,FEMALE,9.18528,-25.06691,NA

The Working Directory

Pointing to files inside your project.

getwd()
#> E:/r-course
list.files()
#> [1] "data-raw" "README.md" "LICENSE.md" "r-course.Rproj"
list.files("data-raw")
#> [1] "palmerpenguins-raw.csv"
list.files("./data-raw")
#> [1] "palmerpenguins-raw.csv"
file <- "./data-raw/palmerpenguins-raw.csv"

🚨 Never use setwd()! 🚨

Click here to learn why.

The here Package

here is a package that helps you use relative paths in your R projects.

It makes file management much easier and OS-independent.

install.packages("here")
library(here)
here()
#> E:/r-course
here("data-raw", "palmerpenguins-raw.csv")
#> E:/r-course/data-raw/palmerpenguins-raw.csv

The readr Package

readr provides a fast and easy way to read rectangular data from delimited files, such as Comma-Separated Values (CSV) and Tab-Separated Values (TSV).

install.packages("readr")
library(readr)
?read_csv
?read_csv2
?read_delim

Importing the Raw Data

library(dplyr)
library(here)
library(readr)
data <-
  here("data-raw", "palmerpenguins-raw.csv") |>
  read_csv(col_types = cols(.default = "c"))
data |> glimpse()
#> Rows: 344
#> Columns: 17
#> $ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
#> $ `Sample Number`       <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10…
#> $ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
#> $ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
#> $ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
#> $ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
#> $ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", …
#> $ `Clutch Completion`   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", …
#> $ `Date Egg`            <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-…
#> $ `Culmen Length (mm)`  <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.…
#> $ `Culmen Depth (mm)`   <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8"…
#> $ `Flipper Length (mm)` <chr> "181", "186", "195", NA, "193", "190", "181", "1…
#> $ `Body Mass (g)`       <chr> "3750", "3800", "3250", NA, "3450", "3650", "362…
#> $ Sex                   <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"…
#> $ `Delta 15 N (o/oo)`   <chr> NA, "8.94956", "8.36821", NA, "8.76651", "8.6649…
#> $ `Delta 13 C (o/oo)`   <chr> NA, "-24.69454", "-25.33302", NA, "-25.32426", "…
#> $ Comments              <chr> "Not enough blood for isotopes.", NA, NA, "Adult…

Inspecting the Raw Data

data |> View()

The janitor Package

janitor provides simple functions for examining and cleaning dirty data.

install.packages("janitor")

clean_names()

The clean_names() function transforms data frame column names into machine-readable formats.

This makes it easier to work with the data.

data |> names()
#>  [1] "studyName"           "Sample Number"       "Species"            
#>  [4] "Region"              "Island"              "Stage"              
#>  [7] "Individual ID"       "Clutch Completion"   "Date Egg"           
#> [10] "Culmen Length (mm)"  "Culmen Depth (mm)"   "Flipper Length (mm)"
#> [13] "Body Mass (g)"       "Sex"                 "Delta 15 N (o/oo)"  
#> [16] "Delta 13 C (o/oo)"   "Comments"
library(janitor)

data <- data |> clean_names()
data |> names()
#>  [1] "study_name"        "sample_number"     "species"          
#>  [4] "region"            "island"            "stage"            
#>  [7] "individual_id"     "clutch_completion" "date_egg"         
#> [10] "culmen_length_mm"  "culmen_depth_mm"   "flipper_length_mm"
#> [13] "body_mass_g"       "sex"               "delta_15_n_o_oo"  
#> [16] "delta_13_c_o_oo"   "comments"

Data Masking

Most tidyverse functions use data masking, allowing you to refer to data frame columns directly by their names (e.g., my_variable) instead of using more verbose syntax like data[["my_variable"]].

This is part of a concept called tidy evaluation.

library(dplyr)
some_data <- tibble(a = c(1, 2, 3), b = c(4, 5, 6))

With Data Masking

some_data |> pull(a) # No quotes
#> [1] 1 2 3

Without Data Masking

some_data[["a"]]
#> [1] 1 2 3

The tidyr Package

tidyr provides a set of functions that help you to tidy your data.

install.packages("tidyr")
library(dplyr)
library(tidyr)
library(stringr)
data <-
  data |>
  separate_wider_delim(
    col = stage,
    delim = ", ",
    names = c("life_stage", "egg_stage"),
  )
data |> glimpse()
#> Rows: 344
#> Columns: 18
#> $ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
#> $ sample_number     <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "…
#> $ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
#> $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ life_stage        <chr> "Adult", "Adult", "Adult", "Adult", "Adult", "Adult"…
#> $ egg_stage         <chr> "1 Egg Stage", "1 Egg Stage", "1 Egg Stage", "1 Egg …
#> $ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
#> $ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
#> $ date_egg          <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ culmen_length_mm  <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ culmen_depth_mm   <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g       <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
#> $ delta_15_n_o_oo   <chr> NA, "8.94956", "8.36821", NA, "8.76651", "8.66496", …
#> $ delta_13_c_o_oo   <chr> NA, "-24.69454", "-25.33302", NA, "-25.32426", "-25.…
#> $ comments          <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…

The dplyr Package

dplyr is one the most important packages in the Tidyverse. It provides a grammar for data manipulation.

install.packages("dplyr")
  • mutate(): Create, modify, and delete columns.
  • transmute(): Create a new data frame containing only the specified computations.
  • select(): Keep or drop columns using their names and types.
  • slice(): Subset rows using their positions (see the sketch below).
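
For example, a quick sketch of slice() with a small hypothetical tibble:

library(dplyr)

d <- tibble(x = 1:5, y = letters[1:5])

d |> slice(2:3) # Extract rows 2 and 3
#> # A tibble: 2 × 2
#>       x y    
#>   <int> <chr>
#> 1     2 b    
#> 2     3 c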

select()

The select() function is used to select columns from a data frame.

In our case, we are not interested in using all the variables in the raw data. We will select only the variables we need.

library(dplyr)
data <-
  data |>
  select(
    date_egg, species, island, culmen_length_mm, culmen_depth_mm,
    flipper_length_mm, body_mass_g, sex
  )
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ date_egg          <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ culmen_length_mm  <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ culmen_depth_mm   <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g       <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…

rename()

rename()

Let’s rename culmen to bill for clarity. Likewise, we’ll change date_egg to year; we’ll extract the year value in a later step.

library(dplyr)
data <-
  data |>
  rename(
    bill_length_mm = culmen_length_mm,
    bill_depth_mm = culmen_depth_mm,
    year = date_egg
  )
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ year              <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm    <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ bill_depth_mm     <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g       <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…

mutate()

mutate()

The mutate() function is used to create new columns or modify existing columns in a data frame.

Let’s assign proper classes to the variables. We also need to transform the year column, extracting just the year value.

library(dplyr)
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ year              <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm    <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ bill_depth_mm     <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g       <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…

mutate()

For categorical variables, it is helpful to inspect their unique values.

If the vector contains a limited number of unique values, it is likely a good candidate to be converted into a factor for better data handling and analysis.

data |> pull(species) |> unique()
#> [1] "Adelie Penguin (Pygoscelis adeliae)"      
#> [2] "Gentoo penguin (Pygoscelis papua)"        
#> [3] "Chinstrap penguin (Pygoscelis antarctica)"
data |> pull(island) |> unique()
#> [1] "Torgersen" "Biscoe"    "Dream"
data |> pull(sex) |> unique()
#> [1] "MALE"   "FEMALE" NA

mutate()

library(dplyr)
library(lubridate)
library(stringr)
data <-
  data |>
  mutate(
    year =
      year |>
      as.Date() |>
      year() |>
      as.integer(),
    species =
      case_match(
        species,
        "Adelie Penguin (Pygoscelis adeliae)" ~ "Adelie",
        "Chinstrap penguin (Pygoscelis antarctica)" ~ "Chinstrap",
        "Gentoo penguin (Pygoscelis papua)" ~ "Gentoo"
      ) |>
      as.factor(),
    island = as.factor(island),
    bill_length_mm = bill_length_mm |> as.numeric(),
    bill_depth_mm = bill_depth_mm |> as.numeric(),
    flipper_length_mm = flipper_length_mm |> as.integer(),
    body_mass_g = body_mass_g |> as.integer(),
    sex =
      sex |>
      str_to_lower() |>
      as.factor()
  )

mutate()

library(dplyr)
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…

relocate()

relocate()

Let’s organize our columns in a more logical order.

The year column is best placed next to the sex column.

library(dplyr)
data <-
  data |>
  relocate(year, .after = sex)
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Inspecting the Valid Data

data |> View()

Comparing the Valid Data

library(palmerpenguins)
identical(data, penguins)
#> [1] TRUE
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
penguins |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Exporting the Data

Always save your data in non-binary formats like csv to ensure that it can be read by other software.

rds stands for R Data Serialization. It is a binary data format that allows you to load an object back into R with all of its attributes preserved.

library(here)
library(readr)
if (!dir.exists(here("data"))) dir.create(here("data"))
data |> write_csv(here("data", "palmerpenguins-valid.csv"))
data |> write_rds(here("data", "palmerpenguins-valid.rds"))
test <- here("data", "palmerpenguins-valid.rds") |> read_rds()

Tip: Store your data in research repositories like The Open Science Framework (See the osfr package). If you are working with sensitive or human data, ensure it is encrypted before storing it in the cloud (See the lockr package).

Other Tools

You’ve learned the basic grammar for data manipulation using Tidyverse, but many important tools remain to be covered. The following slides introduce some of them.

I strongly encourage you to take the time to explore the documentation website of each Tidyverse package to discover more functions that can help you in your data analysis workflow. I guarantee your future self will thank you!

Now that we’ve reconstructed the penguins dataset from the palmerpenguins package, we’ll load the package and use the built-in dataset for the following examples. There’s no need to import the valid data into R.

filter()

filter()

The filter() function is used to filter rows from a data frame based on specified conditions.

library(dplyr)
library(palmerpenguins)
penguins |>
  filter(
    species == "Adelie",
    island %in% c("Dream", "Torgersen"),
    body_mass_g > 4000,
    bill_length_mm < 40
  )
#> # A tibble: 10 × 8
#>    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torgersen           39.2          19.6               195        4675
#>  2 Adelie  Torgersen           34.6          21.1               198        4400
#>  3 Adelie  Dream               39.2          21.1               196        4150
#>  4 Adelie  Dream               39.8          19.1               184        4650
#>  5 Adelie  Dream               39.6          18.8               190        4600
#>  6 Adelie  Torgersen           35.1          19.4               193        4200
#>  7 Adelie  Dream               39.6          18.1               186        4450
#>  8 Adelie  Dream               37.5          18.5               199        4475
#>  9 Adelie  Dream               39.7          17.9               193        4250
#> 10 Adelie  Dream               39.2          18.6               190        4250
#> # ℹ 2 more variables: sex <fct>, year <int>

Exercise (filter)

The dplyr package provides a dataset called starwars that contains information about characters from the Star Wars universe. Filter the starwars dataset to include only characters who meet the following criteria:

  1. Belong to the species Droid
  2. Have a height greater than 100 cm
  3. Have a mass less than 150 kg

Exercise (filter)

library(dplyr)
starwars |> glimpse()
#> Rows: 87
#> Columns: 14
#> $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
#> $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
#> $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
#> $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
#> $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
#> $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
#> $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
#> $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
#> $ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
#> $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
#> $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

Exercise (filter)

Solution:

Code
library(dplyr)

starwars |>
  filter(
    species == "Droid",
    height > 100,
    mass < 150
  )
#> # A tibble: 2 × 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender   
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>    
#> 1 C-3PO    167    75 <NA>       gold       yellow           112 none  masculine
#> 2 IG-88    200   140 none       metal      red               15 none  masculine
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Joins

Data analysis typically requires working with multiple data frames that need to be combined to answer your research questions.

Joins

These combinations are called joins. There are different types of joins depending on how you want to combine the data frames.

Joins

A left join retains all rows from x. Each row in x appears in the output, matching with y where possible or with NA values when no match exists.

Joins

library(dplyr)
x <- tibble(key = 1:3, val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))
x
#> # A tibble: 3 × 2
#>     key val_x
#>   <int> <chr>
#> 1     1 x1   
#> 2     2 x2   
#> 3     3 x3
y
#> # A tibble: 3 × 2
#>     key val_y
#>   <dbl> <chr>
#> 1     1 y1   
#> 2     2 y2   
#> 3     4 y3
x |> left_join(y, by = join_by(key))
#> # A tibble: 3 × 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2   
#> 3     3 x3    <NA>
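
dplyr also provides other mutating joins (inner_join(), right_join(), full_join()) and filtering joins (semi_join(), anti_join()). Here is a quick sketch using the same x and y:

library(dplyr)
# Inner join: keep only rows whose key appears in both tables
x |> inner_join(y, by = join_by(key))
# Full join: keep all rows from both tables
x |> full_join(y, by = join_by(key))
# Anti join: keep rows of x that have no match in y
x |> anti_join(y, by = join_by(key))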

Exercise (Joins)

Perform a join to combine band_members with band_instruments2. These two datasets can be found in the dplyr package.

Use left_join() and join_by() for this exercise.

library(dplyr)
band_members
#> # A tibble: 3 × 2
#>   name  band   
#>   <chr> <chr>  
#> 1 Mick  Stones 
#> 2 John  Beatles
#> 3 Paul  Beatles
band_instruments2
#> # A tibble: 3 × 2
#>   artist plays 
#>   <chr>  <chr> 
#> 1 John   guitar
#> 2 Paul   bass  
#> 3 Keith  guitar

Exercise (Joins)

Solution:

Code
library(dplyr)

band_members |>
  left_join(
    band_instruments2,
    join_by(name == artist)
  )
#> # A tibble: 3 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 Mick  Stones  <NA>  
#> 2 John  Beatles guitar
#> 3 Paul  Beatles bass

Pivot

One of the most common issues is data stored in wide format where column names actually represent values of a variable.

library(tidyr)
table4b
#> # A tibble: 3 × 3
#>   country         `1999`     `2000`
#>   <chr>            <dbl>      <dbl>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

Pivot

tidyr provides two functions to deal with this kind of issue: pivot_longer() and pivot_wider().

library(tidyr)
relig_income
#> # A tibble: 18 × 11
#>    religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
#>    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
#>  1 Agnostic      27        34        60        81        76       137        122
#>  2 Atheist       12        27        37        52        35        70         73
#>  3 Buddhist      27        21        30        34        33        58         62
#>  4 Catholic     418       617       732       670       638      1116        949
#>  5 Don’t k…      15        14        15        11        10        35         21
#>  6 Evangel…     575       869      1064       982       881      1486        949
#>  7 Hindu          1         9         7         9        11        34         47
#>  8 Histori…     228       244       236       238       197       223        131
#>  9 Jehovah…      20        27        24        24        21        30         15
#> 10 Jewish        19        19        25        25        30        95         69
#> 11 Mainlin…     289       495       619       655       651      1107        939
#> 12 Mormon        29        40        48        51        56       112         85
#> 13 Muslim         6         7         9        10         9        23         16
#> 14 Orthodox      13        17        23        32        32        47         38
#> 15 Other C…       9         7        11        13        13        14         18
#> 16 Other F…      20        33        40        46        49        63         46
#> 17 Other W…       5         2         3         4         2         7          3
#> 18 Unaffil…     217       299       374       365       341       528        407
#> # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#> #   `Don't know/refused` <dbl>
relig_income |>
  pivot_longer(
    cols = !religion,
    names_to = "income",
    values_to = "count"
  )
#> # A tibble: 180 × 3
#>    religion income             count
#>    <chr>    <chr>              <dbl>
#>  1 Agnostic <$10k                 27
#>  2 Agnostic $10-20k               34
#>  3 Agnostic $20-30k               60
#>  4 Agnostic $30-40k               81
#>  5 Agnostic $40-50k               76
#>  6 Agnostic $50-75k              137
#>  7 Agnostic $75-100k             122
#>  8 Agnostic $100-150k            109
#>  9 Agnostic >150k                 84
#> 10 Agnostic Don't know/refused    96
#> # ℹ 170 more rows
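
The pivot_wider() function performs the inverse transformation: it spreads a pair of name/value columns back into multiple columns. As a quick sketch, we can lengthen table4b and then widen it back to its original shape:

library(tidyr)
table4b |>
  pivot_longer(
    cols = !country,
    names_to = "year",
    values_to = "population"
  ) |>
  pivot_wider(
    names_from = year,
    values_from = population
  )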

Exercise (Pivot)

Perform a pivot to transform the table4a dataset, found in the tidyr package, from wide to long format using the pivot_longer() function.

The table4a dataset displays the number of tuberculosis cases documented by the World Health Organization (WHO) in Afghanistan, Brazil, and China between 1999 and 2000.

library(tidyr)
table4a
#> # A tibble: 3 × 3
#>   country     `1999` `2000`
#>   <chr>        <dbl>  <dbl>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

Exercise (Pivot)

Solution:

Code
library(tidyr)

table4a |>
  pivot_longer(
    cols = !country,
    names_to = "year",
    values_to = "cases"
  )
#> # A tibble: 6 × 3
#>   country     year   cases
#>   <chr>       <chr>  <dbl>
#> 1 Afghanistan 1999     745
#> 2 Afghanistan 2000    2666
#> 3 Brazil      1999   37737
#> 4 Brazil      2000   80488
#> 5 China       1999  212258
#> 6 China       2000  213766

How to Learn More

Module 4
Exploratory Data Analysis

glimpse()

library(dplyr)
library(palmerpenguins)
penguins |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

head() & tail()

library(dplyr)
library(palmerpenguins)
penguins |> head(3)
#> # A tibble: 3 × 8
#>   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#> 1 Adelie  Torgersen           39.1          18.7               181        3750
#> 2 Adelie  Torgersen           39.5          17.4               186        3800
#> 3 Adelie  Torgersen           40.3          18                 195        3250
#> # ℹ 2 more variables: sex <fct>, year <int>
penguins |> tail(3)
#> # A tibble: 3 × 8
#>   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
#> 1 Chinstrap Dream            49.6          18.2               193        3775
#> 2 Chinstrap Dream            50.8          19                 210        4100
#> 3 Chinstrap Dream            50.2          18.7               198        3775
#> # ℹ 2 more variables: sex <fct>, year <int>
penguins |>
  pull(bill_length_mm) |>
  head(3)
#> [1] 39.1 39.5 40.3

slice_sample()

library(dplyr)
library(palmerpenguins)
penguins |> slice_sample(n = 15)
#> # A tibble: 15 × 8
#>    species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
#>  1 Gentoo    Biscoe             46.5          14.4               217        4900
#>  2 Adelie    Dream              40.8          18.4               195        3900
#>  3 Gentoo    Biscoe             50.5          15.9               225        5400
#>  4 Chinstrap Dream              50.8          19                 210        4100
#>  5 Chinstrap Dream              51.3          18.2               197        3750
#>  6 Chinstrap Dream              45.4          18.7               188        3525
#>  7 Gentoo    Biscoe             49.8          15.9               229        5950
#>  8 Adelie    Dream              40.7          17                 190        3725
#>  9 Chinstrap Dream              50.9          17.9               196        3675
#> 10 Gentoo    Biscoe             47.8          15                 215        5650
#> 11 Chinstrap Dream              43.5          18.1               202        3400
#> 12 Adelie    Biscoe             38.6          17.2               199        3750
#> 13 Adelie    Dream              37.3          17.8               191        3350
#> 14 Adelie    Torgers…           39            17.1               191        3050
#> 15 Adelie    Dream              42.3          21.2               191        4150
#> # ℹ 2 more variables: sex <fct>, year <int>

unique()

library(dplyr)
library(palmerpenguins)
penguins |>
  pull(species) |>
  unique()
#> [1] Adelie    Gentoo    Chinstrap
#> Levels: Adelie Chinstrap Gentoo
penguins |>
  pull(island) |>
  unique()
#> [1] Torgersen Biscoe    Dream    
#> Levels: Biscoe Dream Torgersen
penguins |>
  pull(sex) |>
  unique()
#> [1] male   female <NA>  
#> Levels: female male

count()

library(dplyr)
library(palmerpenguins)
penguins |>
  count(species, island, .drop = FALSE)
#> # A tibble: 9 × 3
#>   species   island        n
#>   <fct>     <fct>     <int>
#> 1 Adelie    Biscoe       44
#> 2 Adelie    Dream        56
#> 3 Adelie    Torgersen    52
#> 4 Chinstrap Biscoe        0
#> 5 Chinstrap Dream        68
#> 6 Chinstrap Torgersen     0
#> 7 Gentoo    Biscoe      124
#> 8 Gentoo    Dream         0
#> 9 Gentoo    Torgersen     0

count()

library(dplyr)
library(magrittr)
library(palmerpenguins)
penguins |>
  count(species, island, .drop = FALSE) |>
  mutate(
    n_cum = cumsum(n),
    n_per = n |>
      divide_by(sum(n)) |>
      multiply_by(100) |>
      round(3),
    n_per_cum = n_per |> cumsum()
  )
#> # A tibble: 9 × 6
#>   species   island        n n_cum n_per n_per_cum
#>   <fct>     <fct>     <int> <int> <dbl>     <dbl>
#> 1 Adelie    Biscoe       44    44  12.8      12.8
#> 2 Adelie    Dream        56   100  16.3      29.1
#> 3 Adelie    Torgersen    52   152  15.1      44.2
#> 4 Chinstrap Biscoe        0   152   0        44.2
#> 5 Chinstrap Dream        68   220  19.8      64.0
#> 6 Chinstrap Torgersen     0   220   0        64.0
#> 7 Gentoo    Biscoe      124   344  36.0     100  
#> 8 Gentoo    Dream         0   344   0       100  
#> 9 Gentoo    Torgersen     0   344   0       100

Statistics

The R packages that come with a typical R installation provide a set of basic statistical functions.

library(dplyr)
library(palmerpenguins)
bill_length_mm <- penguins |> pull(bill_length_mm)
bill_length_mm |> mean(na.rm = TRUE)
#> [1] 43.92193
bill_length_mm |> var(na.rm = TRUE)
#> [1] 29.80705
bill_length_mm |> sd(na.rm = TRUE)
#> [1] 5.459584
bill_length_mm |> min(na.rm = TRUE)
#> [1] 32.1
bill_length_mm |> max(na.rm = TRUE)
#> [1] 59.6
bill_length_mm |> range(na.rm = TRUE)
#> [1] 32.1 59.6

Statistics

library(dplyr)
library(palmerpenguins)
bill_length_mm <- penguins |> pull(bill_length_mm)
bill_length_mm |> quantile(na.rm = TRUE)
#>     0%    25%    50%    75%   100% 
#> 32.100 39.225 44.450 48.500 59.600
bill_length_mm |> quantile(prob = 0.25, na.rm = TRUE)
#>    25% 
#> 39.225
bill_length_mm |> median(na.rm = TRUE)
#> [1] 44.45
bill_length_mm |> quantile(prob = 0.75, na.rm = TRUE)
#>  75% 
#> 48.5
bill_length_mm |> IQR(na.rm = TRUE)
#> [1] 9.275

Statistics

For skewness and kurtosis, use the moments package.

library(dplyr)
library(moments)
library(palmerpenguins)
bill_length_mm <- penguins |> pull(bill_length_mm)
bill_depth_mm <- penguins |> pull(bill_depth_mm)
bill_length_mm |> skewness(na.rm = TRUE)
#> [1] 0.05288481
bill_length_mm |> kurtosis(na.rm = TRUE)
#> [1] 2.119235
bill_length_mm |> sum(na.rm = TRUE)
#> [1] 15021.3
bill_length_mm |> prod(na.rm = TRUE)
#> [1] Inf
cor(bill_length_mm, bill_depth_mm, use = "complete.obs")
#> [1] -0.2350529

summarize()

library(dplyr)
library(palmerpenguins)
penguins |>
  summarize(
    n = n(),
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
    mean_bill_length_mm = mean(bill_length_mm, na.rm = TRUE),
    mean_bill_depth_mm = mean(bill_depth_mm, na.rm = TRUE),
    mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE),
    .by = c("species", "sex")
  )
#> # A tibble: 8 × 7
#>   species   sex        n mean_body_mass_g mean_bill_length_mm mean_bill_depth_mm
#>   <fct>     <fct>  <int>            <dbl>               <dbl>              <dbl>
#> 1 Adelie    male      73            4043.                40.4               19.1
#> 2 Adelie    female    73            3369.                37.3               17.6
#> 3 Adelie    <NA>       6            3540                 37.8               18.3
#> 4 Gentoo    female    58            4680.                45.6               14.2
#> 5 Gentoo    male      61            5485.                49.5               15.7
#> 6 Gentoo    <NA>       5            4588.                45.6               14.6
#> 7 Chinstrap female    34            3527.                46.6               17.6
#> 8 Chinstrap male      34            3939.                51.1               19.3
#> # ℹ 1 more variable: mean_flipper_length_mm <dbl>

Exercise (summarize)

Create a comprehensive species profile for the Palmer Penguins dataset. For each penguin species, calculate:

  • Number of individuals observed
  • Number of islands they inhabit
  • Mean body mass (in grams)
  • Mean bill length (in mm)
  • Mean flipper length (in mm)

All of this information should be generated in a single, elegant operation.

penguins |>
  summarize(
    #> [Your code goes here!]
  )

Exercise (summarize)

Use n() to count the number of observations.

library(dplyr)
letters
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"
tibble(a = letters) |>
  summarize(n = n())
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    26

Use n_distinct() to count the number of unique values.

library(dplyr)
c("a", "b", "b", "a") |> n_distinct()
#> [1] 2

Exercise (summarize)

Solution:

Code
library(dplyr)
library(palmerpenguins)

penguins |>
  summarize(
    n = n(),
    n_islands = n_distinct(island),
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
    mean_bill_length_mm = mean(bill_length_mm, na.rm = TRUE),
    mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE),
    .by = "species"
  )
#> # A tibble: 3 × 6
#>   species       n n_islands mean_body_mass_g mean_bill_length_mm
#>   <fct>     <int>     <int>            <dbl>               <dbl>
#> 1 Adelie      152         3            3701.                38.8
#> 2 Gentoo      124         1            5076.                47.5
#> 3 Chinstrap    68         1            3733.                48.8
#> # ℹ 1 more variable: mean_flipper_length_mm <dbl>

The summarytools Package

The summarytools package makes it easy to create frequency and descriptive statistics tables for single variables with minimal code.

install.packages("gtsummary")
library(palmerpenguins)
library(summarytools)
penguins |> freq(var = species, round.digits = 3)
#> Frequencies  
#> penguins$species  
#> Type: Factor  
#> 
#>                   Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
#> --------------- ------ --------- -------------- --------- --------------
#>          Adelie    152    44.186         44.186    44.186         44.186
#>       Chinstrap     68    19.767         63.953    19.767         63.953
#>          Gentoo    124    36.047        100.000    36.047        100.000
#>            <NA>      0                              0.000        100.000
#>           Total    344   100.000        100.000   100.000        100.000
penguins |> descr(var = bill_depth_mm, round.digits = 3)
#> Descriptive Statistics  
#> penguins$bill_depth_mm  
#> N: 344  
#> 
#>                     bill_depth_mm
#> ----------------- ---------------
#>              Mean          17.151
#>           Std.Dev           1.975
#>               Min          13.100
#>                Q1          15.600
#>            Median          17.300
#>                Q3          18.700
#>               Max          21.500
#>               MAD           2.224
#>               IQR           3.100
#>                CV           0.115
#>          Skewness          -0.142
#>       SE.Skewness           0.132
#>          Kurtosis          -0.923
#>           N.Valid         342.000
#>                 N         344.000
#>         Pct.Valid          99.419

summarize() Approach

If you need more control over the descriptive statistics, then you will need to use summarize().

library(dplyr)
library(magrittr)
library(moments)
library(palmerpenguins)
library(tidyr)

# Summarizes a `penguins` column (given as a string in `x`) by species
stats_summary <- function(x) {
  penguins |>
  summarize(
    n = n(),
    n_valid = .data[[x]] |>
      magrittr::extract(!is.na(.data[[x]])) |>
      length(),
    n_valid_per = n_valid |> # == (n_valid / n) * 100
      divide_by(n) |>
      multiply_by(100),
    mean = .data[[x]] |> mean(na.rm = TRUE),
    var = .data[[x]] |> var(na.rm = TRUE),
    sd = .data[[x]] |> sd(na.rm = TRUE),
    min = .data[[x]] |> min(na.rm = TRUE),
    quartile_1 = .data[[x]] |> quantile(prob = 0.25, na.rm = TRUE),
    median = .data[[x]] |> median(na.rm = TRUE),
    quartile_3 = .data[[x]] |> quantile(prob = 0.75, na.rm = TRUE),
    max = .data[[x]] |> max(na.rm = TRUE),
    iqr = .data[[x]] |> IQR(na.rm = TRUE),
    range = .data[[x]] |> range(na.rm = TRUE) |> diff(),
    skewness = .data[[x]] |> skewness(na.rm = TRUE),
    kurtosis = .data[[x]] |> kurtosis(na.rm = TRUE),
    .by = species
  ) |>
  pivot_longer(
    cols = -species,
    names_to = "statistic",
    values_to = "value"
  ) |>
  pivot_wider(
    names_from = species,
    values_from = value
  )
}
stats_summary("bill_length_mm")
#> # A tibble: 15 × 4
#>    statistic    Adelie  Gentoo Chinstrap
#>    <chr>         <dbl>   <dbl>     <dbl>
#>  1 n           152     124       68     
#>  2 n_valid     151     123       68     
#>  3 n_valid_per  99.3    99.2    100     
#>  4 mean         38.8    47.5     48.8   
#>  5 var           7.09    9.50    11.2   
#>  6 sd            2.66    3.08     3.34  
#>  7 min          32.1    40.9     40.9   
#>  8 quartile_1   36.8    45.3     46.3   
#>  9 median       38.8    47.3     49.6   
#> 10 quartile_3   40.8    49.6     51.1   
#> 11 max          46      59.6     58     
#> 12 iqr           4       4.25     4.73  
#> 13 range        13.9    18.7     17.1   
#> 14 skewness      0.160   0.643   -0.0886
#> 15 kurtosis      2.81    4.20     2.95

The Grammar of Tables

The gt package is the most powerful package for creating and customizing tables in R.

It provides a flexible grammar for table creation and styling.

The gt Package

library(dplyr)
library(gt)
library(palmerpenguins)
library(stringr)
penguins_summary <-
  penguins |>
  filter(!is.na(sex)) |>
  summarize(
    n = n(),
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
    mean_bill_length_mm = mean(bill_length_mm, na.rm = TRUE),
    mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE),
    .by = c("species", "sex")
  ) |>
  mutate(sex = sex |> str_to_title()) |>
  arrange(species, sex)

The gt Package

penguins_summary
#> # A tibble: 6 × 6
#>   species   sex        n mean_body_mass_g mean_bill_length_mm
#>   <fct>     <chr>  <int>            <dbl>               <dbl>
#> 1 Adelie    Female    73            3369.                37.3
#> 2 Adelie    Male      73            4043.                40.4
#> 3 Chinstrap Female    34            3527.                46.6
#> 4 Chinstrap Male      34            3939.                51.1
#> 5 Gentoo    Female    58            4680.                45.6
#> 6 Gentoo    Male      61            5485.                49.5
#> # ℹ 1 more variable: mean_flipper_length_mm <dbl>

It’s All About Layers

Code
penguin_table <-
  penguins_summary |>
  gt(groupname_col = "species") |>
  cols_label(
    sex = "Sex",
    n = "N",
    mean_body_mass_g = "Body Mass (g)",
    mean_bill_length_mm = "Bill Length (mm)",
    mean_flipper_length_mm = "Flipper Length (mm)"
  )

penguin_table
Sex N Body Mass (g) Bill Length (mm) Flipper Length (mm)
Adelie
Female 73 3368.836 37.25753 187.7945
Male 73 4043.493 40.39041 192.4110
Chinstrap
Female 34 3527.206 46.57353 191.7353
Male 34 3938.971 51.09412 199.9118
Gentoo
Female 58 4679.741 45.56379 212.7069
Male 61 5484.836 49.47377 221.5410

It’s All About Layers

Code
penguin_table <-
  penguin_table |>
  tab_header(
    title = md("**Antarctic Penguin Characteristics**"),
    subtitle = "Comparison between species and sexes"
  )

penguin_table
Antarctic Penguin Characteristics
Comparison between species and sexes
Sex N Body Mass (g) Bill Length (mm) Flipper Length (mm)
Adelie
Female 73 3368.836 37.25753 187.7945
Male 73 4043.493 40.39041 192.4110
Chinstrap
Female 34 3527.206 46.57353 191.7353
Male 34 3938.971 51.09412 199.9118
Gentoo
Female 58 4679.741 45.56379 212.7069
Male 61 5484.836 49.47377 221.5410

It’s All About Layers

Code
penguin_table <-
  penguin_table |>
  fmt_number(
    columns = starts_with("mean"),
    decimals = 2
  )

penguin_table
Antarctic Penguin Characteristics
Comparison between species and sexes
Sex N Body Mass (g) Bill Length (mm) Flipper Length (mm)
Adelie
Female 73 3,368.84 37.26 187.79
Male 73 4,043.49 40.39 192.41
Chinstrap
Female 34 3,527.21 46.57 191.74
Male 34 3,938.97 51.09 199.91
Gentoo
Female 58 4,679.74 45.56 212.71
Male 61 5,484.84 49.47 221.54

It’s All About Layers

Code
penguin_table <-
  penguin_table |>
  data_color(
    columns = mean_body_mass_g,
    palette = "Blues"
  )

penguin_table
Antarctic Penguin Characteristics
Comparison between species and sexes
Sex N Body Mass (g) Bill Length (mm) Flipper Length (mm)
Adelie
Female 73 3,368.84 37.26 187.79
Male 73 4,043.49 40.39 192.41
Chinstrap
Female 34 3,527.21 46.57 191.74
Male 34 3,938.97 51.09 199.91
Gentoo
Female 58 4,679.74 45.56 212.71
Male 61 5,484.84 49.47 221.54

It’s All About Layers

Code
penguin_table <-
  penguin_table |>
  tab_source_note(
    source_note = "Source: Palmer Station LTER / palmerpenguins package."
  ) |>
  tab_footnote(
    footnote = "Averages calculated after removing missing values.",
    locations = cells_column_labels(columns = starts_with("mean"))
  )

penguin_table
Antarctic Penguin Characteristics
Comparison between species and sexes
Sex N Body Mass (g)1 Bill Length (mm)1 Flipper Length (mm)1
Adelie
Female 73 3,368.84 37.26 187.79
Male 73 4,043.49 40.39 192.41
Chinstrap
Female 34 3,527.21 46.57 191.74
Male 34 3,938.97 51.09 199.91
Gentoo
Female 58 4,679.74 45.56 212.71
Male 61 5,484.84 49.47 221.54
1 Averages calculated after removing missing values.
Source: Palmer Station LTER / palmerpenguins package.

It’s All About Layers

Code
penguin_table <-
  penguin_table |>
  tab_style(
    style = cell_fill(color = "#E8F4F8"),
    locations = cells_row_groups()
  ) |>
  tab_style(
    style = cell_text(size = px(18)),
    locations = cells_title(groups = "title")
  ) |>
  tab_options(
    table.border.top.style = "solid",
    table.border.bottom.style = "solid",
    heading.border.bottom.style = "solid",
    column_labels.border.bottom.style = "solid",
    row_group.border.top.style = "solid",
    row_group.border.bottom.style = "solid"
  )

penguin_table
Antarctic Penguin Characteristics
Comparison between species and sexes
Sex N Body Mass (g)1 Bill Length (mm)1 Flipper Length (mm)1
Adelie
Female 73 3,368.84 37.26 187.79
Male 73 4,043.49 40.39 192.41
Chinstrap
Female 34 3,527.21 46.57 191.74
Male 34 3,938.97 51.09 199.91
Gentoo
Female 58 4,679.74 45.56 212.71
Male 61 5,484.84 49.47 221.54
1 Averages calculated after removing missing values.
Source: Palmer Station LTER / palmerpenguins package.

Exercise (gt)

Using the starwars dataset from the dplyr package, create a summary table that displays the mean height and mass of characters grouped by their species.

Customize the table with appropriate titles, labels, and formatting to enhance its readability.

library(dplyr)
starwars |> glimpse()
#> Rows: 87
#> Columns: 14
#> $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
#> $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
#> $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
#> $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
#> $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
#> $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
#> $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
#> $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
#> $ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
#> $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
#> $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

Exercise (gt)

Solution:

Code
library(dplyr)
library(gt)
library(tidyr)

starwars_summary <-
  starwars |>
  drop_na(species, height, mass) |>
  summarize(
    mean_height = mean(height, na.rm = TRUE),
    mean_mass = mean(mass, na.rm = TRUE),
    .by = "species"
  )

starwars_summary |>
  gt() |>
  cols_label(
    species = "Species",
    mean_height = "Mean Height (cm)",
    mean_mass = "Mean Mass (kg)"
  ) |>
  tab_header(
    title = md("**Star Wars Character Characteristics**"),
    subtitle = "Mean height and mass by species"
  ) |>
  fmt_number(
    columns = starts_with("mean"),
    decimals = 2
  ) |>
  tab_source_note(
    source_note = "Source: Star Wars API / dplyr package."
  )
Star Wars Character Characteristics
Mean height and mass by species
Species Mean Height (cm) Mean Mass (kg)
Human 180.25 81.31
Droid 140.00 69.75
Wookiee 231.00 124.00
Rodian 173.00 74.00
Hutt 175.00 1,358.00
Yoda's species 66.00 17.00
Trandoshan 190.00 113.00
Mon Calamari 180.00 83.00
Ewok 88.00 20.00
Sullustan 160.00 68.00
Neimodian 191.00 90.00
Gungan 210.00 74.00
Dug 112.00 40.00
Zabrak 175.00 80.00
Twi'lek 178.00 55.00
Aleena 79.00 15.00
Vulptereen 94.00 45.00
Toong 163.00 65.00
Cerean 198.00 82.00
Nautolan 196.00 87.00
Tholothian 184.00 50.00
Kel Dor 188.00 80.00
Geonosian 183.00 80.00
Mirialan 168.00 53.10
Clawdite 168.00 55.00
Besalisk 198.00 102.00
Kaminoan 229.00 88.00
Skakoan 193.00 48.00
Togruta 178.00 57.00
Kaleesh 216.00 159.00
Pau'an 206.00 80.00
Source: Star Wars API / dplyr package.

gt Extensions

The gt package also has several extensions that provide additional functionality for specific use cases.

gtsummary provides a simple way to create publication-ready summary tables for statistical models and data frames.

install.packages("gtsummary")

gtExtras provides additional themes, color scales, and utilities to enhance gt tables.

install.packages("gtExtras")

The Grammar of Graphics

The Grammar of Graphics, by Leland Wilkinson, set the foundation for thinking about data visualization.

We will base this part of the course on the principles found in this book.

The ggplot2 Package

The ggplot2 Package

The most powerful and flexible package for data visualization.

A Tidyverse package based on the principles of The Grammar of Graphics.

It’s All About Layers

Code
library(ggplot2)
library(palmerpenguins)

penguins |> ggplot()

Code
library(ggplot2)
library(palmerpenguins)

penguins |>
  ggplot(
    aes(
      x = body_mass_g,
      y = flipper_length_mm,
      color = species,
      shape = species
    )
  )

It’s All About Layers

Code
library(ggplot2)
library(palmerpenguins)

penguins |>
  ggplot(
    aes(
      x = body_mass_g,
      y = flipper_length_mm,
      color = species,
      shape = species
    )
  ) +
  geom_point(size = 4)

Code
library(ggplot2)
library(palmerpenguins)

penguins |>
  ggplot(
    aes(
      x = body_mass_g,
      y = flipper_length_mm,
      color = species,
      shape = species
    )
  ) +
  geom_point() +
  geom_smooth(
    method = "lm",
    se = FALSE
  ) +
  labs(
    x = "Body Mass (g)",
    y = "Flipper Length (mm)",
    color = "Species",
    shape = "Species"
  )

Bar Plots

Code
library(ggplot2)
library(palmerpenguins)

penguins |>
  ggplot(
    aes(
      x = island,
      fill = species
    )
  ) +
  geom_bar() +
  facet_wrap(
    facets = vars(species),
    ncol = 1
  ) +
  coord_flip() +
  labs(
    x = "Island",
    y = "Frequency"
  )

Exercise (Bar Plot)

Create a bar plot using the starwars dataset that shows the frequency of characters by their sex variable.

Apply the following customizations:

  • Drop any missing values from sex before creating the plot using the drop_na() function from the tidyr package.
  • Select a fill color from the colors() function. View available color tones here.
  • Use coord_flip() to display horizontal bars.
  • Add appropriate axis labels and a title to the plot.
  • Include a caption that credits the data source.

Exercise (Bar Plot)

Solution:

Code
library(dplyr)
library(forcats)
library(ggplot2)
library(stringr)
library(tidyr)

starwars |>
  drop_na(sex) |>
  mutate(
    sex = sex |>
      str_to_title() |>
      as.factor() |>
      fct_infreq()
  ) |>
  ggplot(aes(x = sex)) +
  geom_bar(fill = "dodgerblue4") +
  coord_flip() +
  labs(
    title = "Distribution of Star Wars Characters by Sex",
    x = "Sex",
    y = "Frequency",
    caption = "Source: Star Wars API / dplyr package"
  )

Histograms

Code
library(ggplot2)
library(palmerpenguins)

penguins |>
  ggplot(
    aes(
      x = flipper_length_mm,
      fill = species
    )
  ) +
  geom_histogram(
    alpha = 0.5,
    position = "identity"
  ) +
  labs(
    x = "Flipper Length (mm)",
    y = "Frequency",
    fill = "Species"
  )

Exercise (Histogram)

The ggplot2 package includes the diamonds dataset containing information about diamond quality and price. Using this dataset, create a histogram to visualize the distribution of the price variable.

Apply the following customizations:

  • Select a fill color from the colors() function. View available color tones here.
  • Add appropriate axis labels and a title to the plot.
  • Include a caption that credits the data source.

Exercise (Histogram)

library(ggplot2)
diamonds |> glimpse()
#> Rows: 53,940
#> Columns: 10
#> $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
#> $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
#> $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
#> $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
#> $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
#> $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
#> $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
#> $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
#> $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
#> $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

Exercise (Histogram)

Solution:

Code
library(ggplot2)
library(tidyr)

diamonds |>
  ggplot(aes(x = price)) +
  geom_histogram(
    fill = "goldenrod4",
    color = "white"
  ) +
  labs(
    title = "Distribution of Diamond Prices",
    x = "Price (USD)",
    y = "Frequency",
    caption = "Source: ggplot2 package"
  )

Boxplots

Code
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins |>
  drop_na(
    bill_length_mm,
    species
  ) |>
  ggplot(
    aes(
      x = species,
      y = bill_length_mm,
      fill = species
    )
  ) +
  geom_boxplot(
    outlier.color = "red"
  ) +
  geom_jitter(
   width = 0.2,
   alpha = 0.1
  ) +
  labs(
    x = "Species",
    y = "Bill Length (mm)",
    fill = "Species"
  )

Exercise (Boxplot)

The ggplot2 package includes the msleep dataset containing information about the sleep habits of various mammals. Using this dataset, create a boxplot to visualize the distribution of the sleep_total variable across different vore (dietary categories).

Apply the following customizations:

  • Drop any missing values for sleep_total and vore before creating the plot using the drop_na() function from the tidyr package.
  • Use different fill colors for each vore category by mapping the fill aesthetic to the vore variable.
  • Add appropriate axis labels and a title to the plot.
  • Include a caption that credits the data source.

Exercise (Boxplot)

library(ggplot2)
msleep |> glimpse()
#> Rows: 83
#> Columns: 11
#> $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater shor…
#> $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bra…
#> $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carn…
#> $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Art…
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "dome…
#> $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5…
#> $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, …
#> $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, N…
#> $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 1…
#> $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0…
#> $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.04…

Exercise (Boxplot)

Solution:

Code
library(ggplot2)
library(tidyr)

msleep |>
  drop_na(
    sleep_total,
    vore
  ) |>
  ggplot(
    aes(
      x = vore,
      y = sleep_total,
      fill = vore
    )
  ) +
  geom_boxplot() +
  labs(
    title = "Distribution of Total Sleep by Dietary Category",
    x = "Dietary Category (Vore)",
    y = "Total Sleep (hours)",
    caption = "Source: ggplot2 package"
  )

Scatter Plots

Code
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins |>
  ggplot(
    aes(
      x = body_mass_g,
      y = flipper_length_mm,
      color = species,
      shape = species
    )
  ) +
  geom_point() +
  geom_smooth(
    method = "lm",
    formula = y ~ x,
    se = FALSE
  ) +
  labs(
    x = "Body Mass (g)",
    y = "Flipper Length (mm)",
    color = "Species",
    shape = "Species"
  )

Exercise (Scatter Plot)

The ggplot2 package includes the mpg dataset containing information about various car models and their fuel efficiency. Using this dataset, create a scatter plot to visualize the relationship between displ (engine displacement) and hwy (highway miles per gallon).

Apply the following customizations:

  • Add a regression line using the geom_smooth() function.
  • Add appropriate axis labels and a title to the plot.
  • Include a caption that credits the data source.

Exercise (Scatter Plot)

library(ggplot2)
mpg |> glimpse()
#> Rows: 234
#> Columns: 11
#> $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
#> $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
#> $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
#> $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
#> $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
#> $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
#> $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
#> $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
#> $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
#> $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
#> $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Exercise (Scatter Plot)

Solution:

Code
library(ggplot2)

mpg |>
  ggplot(
    aes(
      x = displ,
      y = hwy
    )
  ) +
  geom_point() +
  geom_smooth() +
  labs(
    title = "Relationship between Engine Displacement and Highway MPG",
    x = "Engine Displacement (L)",
    y = "Highway Miles per Gallon",
    caption = "Source: ggplot2 package"
  )

Facets

Code
library(dplyr)
library(ggplot2)
library(magrittr)
library(palmerpenguins)
library(stringr)

penguins |>
  mutate(
    sex = sex |> str_to_title()
  ) |>
  ggplot(
    aes(
      x = body_mass_g,
      y = flipper_length_mm,
      color = sex
    )
  ) +
  geom_point() +
  geom_smooth(
    method = "lm",
    formula = y ~ x,
    se = FALSE,
    color = "black"
  ) +
  facet_wrap(vars(species)) +
  labs(
    x = "Body Mass (g)",
    y = "Flipper Length (mm)",
    color = "Sex"
  )

The patchwork Package

patchwork provides a simple and intuitive way to combine multiple ggplot2 plots into a single layout.

install.packages("patchwork")

Combined Plots

Code
library(ggplot2)
library(palmerpenguins)
library(patchwork)
library(tidyr)

plot_hist <-
  penguins |>
  ggplot(
    aes(
      x = flipper_length_mm,
      fill = species
    )
  ) +
  geom_histogram(
    alpha = 0.5,
    position = "identity"
  ) +
  labs(
    x = "Flipper Length (mm)",
    y = "Frequency"
  ) +
  theme(legend.position = "none")

plot_boxplot <-
  penguins |>
  ggplot(
    aes(
      x = species,
      y = bill_length_mm,
      fill = species
    )
  ) +
  geom_boxplot(outlier.color = "red") +
  geom_jitter(
    width = 0.2,
    alpha = 0.1
  ) +
  labs(
    x = "Species",
    y = "Bill Length (mm)",
    fill = "Species"
  ) +
  theme(
    axis.title.x = element_blank(),
    legend.position = "none"
  )

plot_scatter <-
  penguins |>
  drop_na(
    body_mass_g,
    flipper_length_mm,
    species
  ) |>
  ggplot(
    aes(
      x = body_mass_g,
      y = flipper_length_mm,
      color = species,
      shape = species
    )
  ) +
  geom_point() +
  geom_smooth(
    method = "lm",
    formula = y ~ x,
    se = FALSE
  ) +
  labs(
    x = "Body Mass (g)",
    y = "Flipper Length (mm)",
    color = "Species",
    shape = "Species"
  ) +
  theme(legend.position = "none")

(plot_hist + plot_boxplot) / plot_scatter + plot_annotation(tag_levels = "A")

ggplot2 Extensions

Like gt, the ggplot2 package has several extensions that provide additional functionality for specific use cases. You can check a list of them here.

ggwordcloud provides a way to create word clouds using ggplot2.

install.packages("ggwordcloud")

tidyplots provides a user-friendly code-based interface for creating customizable and insightful plots.

install.packages("tidyplots")

Correlation Matrices

Code
library(dplyr)
library(GGally)
library(ggplot2)
library(palmerpenguins)

penguins |>
  select(species, body_mass_g, ends_with("_mm")) |>
  ggpairs(aes(color = species))

Color Palettes

Color palettes can be sequential, diverging, or qualitative (discrete). Here are examples from two popular R packages for color palettes.

viridis: Colorblind-Friendly Color Maps

Code
library(ggplot2)
library(palmerpenguins)
library(tidyr)
library(viridis)

penguins |>
  drop_na(
    flipper_length_mm,
    species
  ) |>
  ggplot(
    aes(
      x = flipper_length_mm,
      fill = species
    )
  ) +
  geom_density(
    alpha = 0.5,
    position = "identity"
  ) +
  scale_fill_viridis(discrete = TRUE) +
  labs(
    x = "Flipper Length (mm)",
    y = "Density",
    fill = "Species"
  ) +
  theme(
    text = element_text(size = 20),
    legend.position = "none"
  )

RColorBrewer: Common Color Palettes

Code
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins |>
  drop_na(
    flipper_length_mm,
    species
  ) |>
  ggplot(
    aes(
      x = flipper_length_mm,
      fill = species
    )
  ) +
  geom_density(
    alpha = 0.5,
    position = "identity"
  ) +
  scale_fill_brewer(palette = "Set1") +
  labs(
    x = "Flipper Length (mm)",
    y = "Density",
    fill = "Species"
  ) +
  theme(
    text = element_text(size = 20),
    legend.position = "none"
  )

How to Choose the Right Vis.

Check out the amazing work by Yan Holtz. Visit From Data to Viz to see the diagram below and many others.

How to Learn More

Module 5
Modeling

The Tidymodels Framework

The Tidymodels framework is a collection of packages for modeling and machine learning using Tidyverse principles.

It is created by the same team that developed Tidyverse and is designed to work seamlessly with it.

Although it is a relative newcomer to the R ecosystem (2018), it has quickly gained popularity due to its simplicity and consistency.

Tidymodels Packages

Like Tidyverse, Tidymodels also has a meta-package that installs or loads the most important packages from the collection. A minimal sketch of how these pieces fit together appears after the list below.

install.packages("tidymodels")
library("tidymodels")
  • rsample: Provides infrastructure for efficient data splitting and resampling.
  • parsnip: A tidy, unified interface to models.
  • recipes: A tidy interface to data pre-processing tools for feature engineering.
  • workflows: A package to bundle the pre-processing, modeling, and post-processing together.
  • infer: A statistical grammar for inferential statistics.
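
Here is a minimal sketch of how these pieces fit together, predicting penguin body mass from flipper length (the model choice is illustrative; infer is covered separately below):

library(tidymodels)
library(tidyr)
library(palmerpenguins)

set.seed(2026)

# rsample: split the data into training and testing sets
penguins_split <-
  penguins |>
  drop_na() |>
  initial_split(prop = 0.8)

# recipes: declare the outcome and predictor
penguins_recipe <-
  recipe(body_mass_g ~ flipper_length_mm, data = training(penguins_split))

# parsnip: specify the model
lm_spec <- linear_reg() |> set_engine("lm")

# workflows: bundle pre-processing and model together, then fit
workflow() |>
  add_recipe(penguins_recipe) |>
  add_model(lm_spec) |>
  fit(data = training(penguins_split))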

Hypothesis Testing

A hypothesis is a statement about a population parameter.

The goal of a hypothesis test is to decide, based on a sample from the population, which of two complementary hypotheses is true.

The two complementary hypotheses in a hypothesis testing problem are called the null hypothesis and the alternative hypothesis. They are denoted by \(\text{H}_{0}\) and \(\text{H}_{1}\), respectively.

A hypothesis testing procedure or hypothesis test is a rule that specifies:

  1. For which sample values the decision is made to accept \(\text{H}_{0}\) as true.
  2. For which sample values \(\text{H}_{0}\) is rejected and \(\text{H}_{1}\) is accepted as true.

Power Analysis

Decision about \(\text{H}_{0}\)   \(\text{H}_{0}\) True   \(\text{H}_{0}\) False
Accept   Correct inference (True negative, \(1 - \alpha\))   Type II error (False negative, \(\beta\))
Reject   Type I error (False positive, \(\alpha\))   Correct inference (True positive, \(1 - \beta\))

Power Analysis

Code
library(pwrss)

pwr_analysis <- pwrss.t.2means(
  mu1 = 0.2, # Cohen's d for small effect sizes
  mu2 = 0,
  power = 0.8,
  alpha = 0.05,
  welch.df = TRUE,
  alternative = "greater"
)
#> +--------------------------------------------------+
#> |             SAMPLE SIZE CALCULATION              |
#> +--------------------------------------------------+
#> 
#> Welch's T-Test (Independent Samples)
#> 
#> ---------------------------------------------------
#> Hypotheses
#> ---------------------------------------------------
#>   H0 (Null Claim) : d - null.d <= 0 
#>   H1 (Alt. Claim) : d - null.d > 0 
#> 
#> ---------------------------------------------------
#> Results
#> ---------------------------------------------------
#>   Sample Size            = 310 and 310  <<
#>   Type 1 Error (alpha)   = 0.050
#>   Type 2 Error (beta)    = 0.200
#>   Statistical Power      = 0.8

power.t.test(
  ncp = pwr_analysis$ncp,
  df = pwr_analysis$df,
  alpha = pwr_analysis$parms$alpha,
  alternative = "one.sided",
  plot = TRUE,
  verbose = FALSE
)

Power Analysis

Code
library(pwrss)

pwr_analysis <- pwrss.t.2means(
  mu1 = 0.2, # Cohen's d for small effect sizes
  mu2 = 0,
  power = 0.3,
  alpha = 0.05,
  welch.df = TRUE,
  alternative = "greater",
)
#> +--------------------------------------------------+
#> |             SAMPLE SIZE CALCULATION              |
#> +--------------------------------------------------+
#> 
#> Welch's T-Test (Independent Samples)
#> 
#> ---------------------------------------------------
#> Hypotheses
#> ---------------------------------------------------
#>   H0 (Null Claim) : d - null.d <= 0 
#>   H1 (Alt. Claim) : d - null.d > 0 
#> 
#> ---------------------------------------------------
#> Results
#> ---------------------------------------------------
#>   Sample Size            = 64 and 64  <<
#>   Type 1 Error (alpha)   = 0.050
#>   Type 2 Error (beta)    = 0.698
#>   Statistical Power      = 0.302

power.t.test(
  ncp = pwr_analysis$ncp,
  df = pwr_analysis$df,
  alpha = pwr_analysis$parms$alpha,
  alternative = "one.sided",
  plot = TRUE,
  verbose = FALSE
)

Power Analysis

Code
library(pwrss)

pwr_analysis <- pwrss.t.2means(
  mu1 = 0.2, # Cohen's d for small effect sizes
  mu2 = 0,
  power = 0.999,
  alpha = 0.001,
  welch.df = TRUE,
  alternative = "greater",
)
#> +--------------------------------------------------+
#> |             SAMPLE SIZE CALCULATION              |
#> +--------------------------------------------------+
#> 
#> Welch's T-Test (Independent Samples)
#> 
#> ---------------------------------------------------
#> Hypotheses
#> ---------------------------------------------------
#>   H0 (Null Claim) : d - null.d <= 0 
#>   H1 (Alt. Claim) : d - null.d > 0 
#> 
#> ---------------------------------------------------
#> Results
#> ---------------------------------------------------
#>   Sample Size            = 1913 and 1913  <<
#>   Type 1 Error (alpha)   = 0.001
#>   Type 2 Error (beta)    = 0.001
#>   Statistical Power      = 0.999

power.t.test(
  ncp = pwr_analysis$ncp,
  df = pwr_analysis$df,
  alpha = pwr_analysis$parms$alpha,
  alternative = "one.sided",
  plot = TRUE,
  verbose = FALSE
)

Type I Errors (\(\alpha\))

Type II Errors (\(\beta\))

The p-value Problem

Large samples and sensitivity

Is a difference of \(0.00001\) valid?

Statistical ritual versus Statistical thinking

Comparison of a 95% confidence level (\(\alpha = 0.05\)) and an n-dependent p-value curve. The parameter \(n_{\alpha}\) represents the minimum sample size needed to detect statistically significant differences among the compared groups. The parameter \(n_{\gamma}\) represents the convergence point of the p-value curve. When the p-value curve expresses practical differences, the area under the red curve (\(A_{p(n)}\)) is smaller than the area under the constant function \(\alpha = 0.05\) (\(A_{\alpha = 0.05}\)) when evaluated between \(0\) and \(n_{\gamma}\).

Cohen’s Benchmark

[…] in many circumstances, all that is intended by “proving” the null hypothesis is that the ES [Effect Size] is not necessarily zero but small enough to be negligible
(Cohen, 1988, p. 461).

Cohen’s Benchmark

Test                              Relevant Effect Size               Small   Medium   Large
Comparison of independent means   \(d\), \(\Delta\), Hedges' \(g\)   0.20    0.50     0.80
Comparison of two correlations    \(q\)                              0.10    0.30     0.50
Difference between proportions    Cohen's \(g\)                      0.05    0.15     0.25
Correlation                       \(r\)                              0.10    0.30     0.50
                                  \(r^{2}\)                          0.01    0.09     0.25
Crosstabulation                   \(w\), \(\varphi\), \(V\), \(C\)   0.10    0.30     0.50
ANOVA                             \(f\)                              0.10    0.25     0.40
                                  \(\eta^{2}\)                       0.01    0.06     0.14
Multiple regression               \(R^{2}\)                          0.02    0.13     0.26
                                  \(f^{2}\)                          0.02    0.15     0.35

Notes: The rationale for most of these benchmarks can be found in Cohen (1988) at the following pages: Cohen's \(d\) (p. 40), \(q\) (p. 115), Cohen's \(g\) (pp. 147–149), \(r\) and \(r^{2}\) (pp. 79–80), Cohen's \(w\) (pp. 224–227), \(f\) and \(\eta^{2}\) (pp. 285–287), \(R^{2}\) and \(f^{2}\) (pp. 413–414).

The p-value Problem: Example

\(\Delta\text{R}^{2}\) = 0.00388
Cohen’s \(f^{2}\) = 0.00414

Latitudinal cline of chronotype (Leocadio-Miguel et al., 2017).

Critique of Leocadio-Miguel et al. latitude article (Vartanian, 2024).

The ASA statement on p-values (Wasserstein & Lazar, 2016).

Authors who rely solely on the p-value demonstrate a preference for statistical rituals over statistical reasoning (Gigerenzer, 2004).

Note: The HO score (Horne & Östberg, 1976) goes from 16 to 86, with higher scores indicating a preference for morningness.

What Test Should I Use?

Check Antoine Soetewey’s flowchart to help you decide the appropriate statistical test for your data.

The infer Package

infer is a Tidymodels package that provides a statistical grammar for inferential statistics.

install.packages("infer")

It also offers pipeline examples for various hypothesis tests. These pipelines can serve as a helpful starting point for implementing other types of models.

The infer Package

  • specify() specifies the variable, or relationship between variables, of interest.
  • hypothesize() declares the null hypothesis.
  • generate() generates data reflecting the null hypothesis or using the bootstrap.
  • calculate() calculates summary statistics from either the observed data to form the observed test statistic, or from the generated data to form the null distribution of test statistics.
  • visualize() plots the null distribution of test statistics.
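
Chained together, these verbs form a complete randomization-based test. A minimal sketch, previewing the Adelie body mass comparison developed in the next slides:

library(dplyr)
library(infer)
library(palmerpenguins)
library(tidyr)

adelie <-
  penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex)

# Observed test statistic
observed_stat <-
  adelie |>
  specify(body_mass_g ~ sex) |>
  calculate(stat = "diff in means", order = c("male", "female"))

# Randomization-based empirical null distribution
null_dist <-
  adelie |>
  specify(body_mass_g ~ sex) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in means", order = c("male", "female"))

null_dist |> get_p_value(obs_stat = observed_stat, direction = "two-sided")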

t-Test: Hypothesis

Is there a meaningful difference in body mass between male and female Adelie penguins?

Sexual dimorphism (physical differences between sexes) is common in many bird species. This could have implications for understanding their ecology and behavior.

To test this, we will perform a t-Test for Independent Samples using a randomization-based empirical null distribution approach.

Our Type I error rate (\(\alpha\)) will be set at 0.05 and our Type II error rate (\(\beta\)) at 0.2, giving us a power (\(1 - \beta\)) of 0.8.

\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} \\ \text{H}_{1}: \mu_{A} \neq \mu_{B} \\ \end{cases} \]

\[ \begin{cases} \text{H}_{0}: \text{Cohen's} \ d < \text{MES} \\ \text{H}_{1}: \text{Cohen's} \ d \geq \text{MES} \\ \end{cases} \]

t-Test: Hypothesis

Is there a meaningful difference in body mass between male and female Adelie penguins?

To ensure practical significance, we will analyze the difference in means for its effect size, considering a 95% confidence interval. Cohen’s benchmark for a medium effect size (\(d\) = 0.5) will be used as our Minimum Effect Size (MES) (Cohen, 1988).

Tip: See Perezgonzalez (2015) to learn more about data testing and practical significance.

\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} \\ \text{H}_{1}: \mu_{A} \neq \mu_{B} \\ \end{cases} \]

\[ \begin{cases} \text{H}_{0}: \text{Cohen's} \ d < \text{MES} \\ \text{H}_{1}: \text{Cohen's} \ d \geq \text{MES} \\ \end{cases} \]

Cohen’s \(d\)

Cohen’s \(d\) is a measure of effect size that indicates the standardized difference between two means. When comparing two independent means, Cohen’s \(d\) for a two-tailed t-test can be calculated as:

\[ d = \frac{|\text{m}_{A} - \text{m}_{B}|}{s_{pooled}} \]

Where:

  • \(\text{m}_{A}\) and \(\text{m}_{B}\) are the sample means of the two groups.
  • \(s_{pooled}\) is the pooled standard deviation.

Cohen’s \(d\)

The pooled standard deviation (\(s_{pooled}\)) is calculated using the following formula:

\[ s_{pooled} = \sqrt{\frac{(n_{1} - 1)s_{1}^{2} + (n_{2} - 1)s_{2}^{2}}{n_{1} + n_{2} - 2}} \]

Where:

  • \(n_{1}\) and \(n_{2}\) are the sample sizes of the two groups.
  • \(s_{1}^{2}\) and \(s_{2}^{2}\) are the sample variances of the two groups.
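Translating both formulas into R provides a quick way to verify the effect size reported later in this module (a minimal sketch):

library(dplyr)
library(palmerpenguins)
library(tidyr)

adelie <-
  penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex)

male <- adelie |> filter(sex == "male") |> pull(body_mass_g)
female <- adelie |> filter(sex == "female") |> pull(body_mass_g)

# Pooled standard deviation.
s_pooled <- sqrt(
  ((length(male) - 1) * var(male) + (length(female) - 1) * var(female)) /
    (length(male) + length(female) - 2)
)

# Cohen's d: standardized difference between the two means.
# Returns ~2.17, matching effectsize::cohens_d() later in this module.
abs(mean(male) - mean(female)) / s_pooled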

The pwr and pwrss Packages

These two packages provide functions for performing power analysis and sample size calculations for various statistical tests, including t-tests, ANOVA, regression, and more.

pwr provides basic functions for power analysis.

install.packages("pwr")

pwrss provides functions to perform power and sample size calculations for various statistical tests.

install.packages("pwrss")

t-Test: Power Analysis

The power (\(1 - \beta\)) of a statistical test is the probability that it will yield statistically significant results (Cohen, 1988).

A power analysis helps determine the minimum sample size required to detect an effect of a given size with a desired level of confidence (Cohen, 1988). We need to check if our sample size will be sufficient to achieve this confidence.

library(dplyr)
library(palmerpenguins)
library(tidyr)
penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  count(sex)
#> # A tibble: 2 × 2
#>   sex        n
#>   <fct>  <int>
#> 1 female    73
#> 2 male      73

t-Test: Power Analysis

A power analysis for a t-test with an expected effect size of 0.5 (\(\text{Cohen's} \ d\)) indicates that we would need approximately 64 participants in each group to achieve a power of 0.8 at a significance level of 0.05.

library(pwr)
pwr_analysis <- pwr.t.test(
  d = 0.5,
  sig.level = 0.05,
  power = 0.8,
  type = "two.sample",
  alternative = "two.sided"
)
pwr_analysis
#> 
#>      Two-sample t test power calculation 
#> 
#>               n = 63.76561
#>               d = 0.5
#>       sig.level = 0.05
#>           power = 0.8
#>     alternative = two.sided
#> 
#> NOTE: n is number in *each* group

t-Test: Power Analysis

Code
library(pwr)

pwr_analysis |>
  plot.power.htest(
    xlab = "Sample Size Per Group",
    ylab = "Power (1 - Beta)",
    main = NULL
  )

t-Test: Power Analysis

Code
library(pwrss)

pwr_analysis <- pwrss.t.2means(
  mu1 = 0.5,
  mu2 = 0,
  power = 0.8,
  alpha = 0.05,
  welch.df = TRUE,
  alternative = "not equal"
)
#> +--------------------------------------------------+
#> |             SAMPLE SIZE CALCULATION              |
#> +--------------------------------------------------+
#> 
#> Welch's T-Test (Independent Samples)
#> 
#> ---------------------------------------------------
#> Hypotheses
#> ---------------------------------------------------
#>   H0 (Null Claim) : d - null.d = 0 
#>   H1 (Alt. Claim) : d - null.d != 0 
#> 
#> ---------------------------------------------------
#> Results
#> ---------------------------------------------------
#>   Sample Size            = 64 and 64  <<
#>   Type 1 Error (alpha)   = 0.050
#>   Type 2 Error (beta)    = 0.199
#>   Statistical Power      = 0.801

power.t.test(
  ncp = pwr_analysis$ncp,
  df = pwr_analysis$df,
  alpha = pwr_analysis$parms$alpha,
  alternative = "two.sided",
  plot = TRUE,
  verbose = FALSE
)

t-Test: Assumptions

🔗 Independence of observations.

🔔 Normality of the distribution of the response variable (body_mass_g) within each group (sex is the explanatory/independent variable).

⚖️ Homogeneity of variances between groups (only if using Student’s t-test; Welch’s t-test and our permutation approach do not require this assumption).

t-Test: Boxplots

Code
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  ggplot(
    aes(
      x = sex,
      y = body_mass_g,
      fill = sex
    )
  ) +
  geom_boxplot(outlier.color = "red") +
  geom_jitter(
    width = 0.2,
    alpha = 0.1
  ) +
  labs(
    x = "Sex",
    y = "Body Mass (g)",
    fill = "Sex"
  )

t-Test: Histograms

Code
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  ggplot(
    aes(
      x = body_mass_g,
      fill = sex
    )
  ) +
  geom_histogram(
    position = "identity",
    alpha = 0.7
  ) +
  labs(
    x = "Body Mass (g)",,
    y = "Frequency",
    fill = "Sex"
  )

t-Test: Density Plots

Code
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  ggplot(
    aes(
      x = body_mass_g,
      fill = sex
    )
  ) +
  geom_density(alpha = 0.7) +
  labs(
    x = "Body Mass (g)",,
    y = "Density",
    fill = "Sex"
  )

t-Test: Q-Q Plots

Code
library(brandr)
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  ggplot(
    aes(
      sample = body_mass_g,
      color = sex
    )
  ) +
  stat_qq() +
  stat_qq_line(color = "black") +
  facet_wrap(vars(sex)) +
  labs(
    x = "Theoretical Quantiles",
    y = "Sample Quantiles",
    color = "Sex"
  )

t-Test: Assumptions

✅ Independence of observations.

✅ Normality of the distribution of the response variable (body_mass_g) within each group (sex is the explanatory/independent variable).

⏭️ Homogeneity of variances between groups (only if using Student’s t-test; Welch’s t-test and our permutation approach do not require this assumption).

t-Test: Observed Statistic

The t-statistic for comparing two independent means can be calculated using the following formula:

\[ t = \frac{\text{m}_{A} - \text{m}_{B}}{\sqrt{\frac{s_{A}^{2}}{n_{A}} + \frac{s_{B}^{2}}{n_{B}}}} \]

Where:

  • \(\text{m}_{A}\) and \(\text{m}_{B}\) are the sample means of the two groups.
  • \(s_{A}^{2}\) and \(s_{B}^{2}\) are the sample variances of the two groups.
  • \(n_{A}\) and \(n_{B}\) are the sample sizes of the two groups.

t-Test: Observed Statistic

library(dplyr)
library(magrittr)
library(palmerpenguins)
library(tidyr)
male <-
  penguins |>
  filter(
    species == "Adelie",
    sex == "male"
  ) |>
  drop_na(body_mass_g, sex) |>
  pull(body_mass_g)
female <-
  penguins |>
  filter(
    species == "Adelie",
    sex == "female"
  ) |>
  drop_na(body_mass_g, sex) |>
  pull(body_mass_g)
t_statistic <-
  mean(male, na.rm = TRUE) |>
  subtract(mean(female, na.rm = TRUE)) |>
  divide_by(
    var(male, na.rm = TRUE) |>
      divide_by(length(male)) |>
      add(
        var(female, na.rm = TRUE) |>
          divide_by(length(female))
      ) |>
      sqrt()
  )
t_statistic
#> [1] 13.12629
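As a sanity check, base R's t.test() defaults to the same Welch (unpooled) form used above and reproduces the statistic:

t.test(male, female)$statistic
#>        t 
#> 13.12629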

t-Test: Observed Statistic

library(dplyr)
library(infer)
library(palmerpenguins)
library(tidyr)
observed_statistic <-
  penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  specify(body_mass_g ~ sex) |>
  hypothesize(null = "independence") |>
  calculate(
    stat = "t",
    order = c("male", "female")
  )
observed_statistic
#> Response: body_mass_g (numeric)
#> Explanatory: sex (factor)
#> Null Hypothesis: in...
#> # A tibble: 1 × 1
#>    stat
#>   <dbl>
#> 1  13.1

t-Test: Null Distribution

What would our data look like if there was no difference between sexes?

One way to simulate this scenario is to randomly shuffle the sex values among the body_mass_g values. This breaks any real association between sex and body mass.

We then calculate the t-statistic for this new dataset. Repeating this process many times allows us to build a distribution of t-statistics that represent what we would expect to see if there were no real difference in body mass between male and female Adelie penguins.
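To make the idea concrete, here is a minimal by-hand sketch of a single permutation; infer's generate(type = "permute") repeats this shuffle for every replicate:

library(dplyr)
library(palmerpenguins)
library(tidyr)

set.seed(2026) # Any seed works; results will vary slightly.

shuffled <-
  penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  mutate(sex = sample(sex)) # Shuffle: breaks any real association.

# Welch t-statistic on the shuffled labels (typically near zero).
t.test(body_mass_g ~ sex, data = shuffled)$statistic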

t-Test: Null Distribution

library(dplyr)
library(infer)
library(palmerpenguins)
library(tidyr)
null_dist <-
  penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  specify(body_mass_g ~ sex) |>
  hypothesize(null = "independence") |>
  generate(
    reps = 1000,
    type = "permute"
  ) |>
  calculate(
    stat = "t",
    order = c("male", "female")
  )
null_dist
#> Response: body_mass_g (numeric)
#> Explanatory: sex (factor)
#> Null Hypothesis: in...
#> # A tibble: 1,000 × 2
#>    replicate    stat
#>        <int>   <dbl>
#>  1         1  1.11  
#>  2         2 -0.414 
#>  3         3 -2.09  
#>  4         4  1.23  
#>  5         5  0.639 
#>  6         6 -0.847 
#>  7         7 -0.333 
#>  8         8  0.0270
#>  9         9 -1.27  
#> 10        10  0.144 
#> # ℹ 990 more rows

t-Test: Visualizing

Code
library(infer)
library(ggplot2)

null_dist |>
  visualize() +
  shade_p_value(
    obs_stat = observed_statistic,
    direction = "two-sided"
  ) +
  labs(
    title = NULL,
    x = "t-statistic",
    y = "Frequency"
  ) +
  theme(
    text = element_text(size = 14)
  )

t-Test: p-value

null_dist |>
  get_p_value(
    obs_stat = observed_statistic,
    direction = "two-sided"
  )
#> # A tibble: 1 × 1
#>   p_value
#>     <dbl>
#> 1       0
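Note: A simulated p-value of exactly 0 only means that none of the 1,000 permuted statistics was as extreme as the observed one. Report it as \(p\) < 0.001 (i.e., less than 1/1,000), not as zero; infer itself warns about this when it happens.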

t-Test: Observed Effect Size

library(dplyr)
library(effectsize)
library(palmerpenguins)
library(tidyr)
male <-
  penguins |>
  filter(
    species == "Adelie",
    sex == "male"
  ) |>
  drop_na(body_mass_g, sex) |>
  pull(body_mass_g)
female <-
  penguins |>
  filter(
    species == "Adelie",
    sex == "female"
  ) |>
  drop_na(body_mass_g, sex) |>
  pull(body_mass_g)
effect_size <-
  cohens_d(
    x = male,
    y = female,
    mu = 0,
    ci = 0.95,
    alternative = "two.sided"
  )
effect_size
#> Cohen's d |       95% CI
#> ------------------------
#> 2.17      | [1.76, 2.58]
#> 
#> - Estimated using pooled SD.
effect_size |>
  interpret_hedges_g(rules = "cohen1988")
#> Cohen's d |       95% CI | Interpretation
#> -----------------------------------------
#> 2.17      | [1.76, 2.58] |          large
#> 
#> - Estimated using pooled SD.
#> - Interpretation rule: cohen1988

t-Test: Conclusion

Is there a meaningful difference in body mass between male and female Adelie penguins?

Our analysis found a statistically significant difference in means (\(t\) = 13.1, \(p\)-value < 0.001). The observed effect size was large and exceeded the Minimum Effect Size (MES) threshold (\(d\) = 2.17, 95% CI [1.76, 2.58]).

Since our sample size allowed us to reliably detect effects of 0.5 (\(\text{Cohen's} \ d\)) or larger, and the observed effect was much larger than that, the power of our test is high and the probability of a false negative (\(\beta\)) is very low.

Based on these results, we conclude that there is a meaningful difference in body mass between male and female Adelie penguins, with male penguins having a higher mean body mass. Therefore, we reject the null hypothesis in favor of the alternative hypothesis.

\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} \\ \text{H}_{1}: \mu_{A} \neq \mu_{B} \\ \end{cases} \]

\[ \begin{cases} \text{H}_{0}: d < \text{MES} \\ \text{H}_{1}: d \geq \text{MES} \\ \end{cases} \]

Model Diagnostics

Model diagnostics are crucial!

It’s essential to verify that all model assumptions hold. However, a discussion on this topic is beyond the scope of this course.

You can find these assumptions in most statistical textbooks, or you can consult the original papers that introduced the models (e.g., for t-tests, see Student (1908)).

Objective Assumption Tests

🚨 Avoid Using! 🚨

Objective assumption tests (e.g., Anderson–Darling test) are not advisable for large samples, as they can be overly sensitive to minor deviations. Additionally, they might overlook visual patterns that are not captured by a single metric.

A visual inspection of the data is the preferred approach in most cases.

For a straightforward critique of normality tests specifically, refer to this article by Greener (2020).

See also: Kozak & Piepho (2018), Schucany & Ng (2006), and Shatz (2024).
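A quick illustration of this large-sample sensitivity (a hedged sketch: the exact p-values depend on the seed, but the pattern is typical). The same mildly skewed distribution tends to pass the Shapiro–Wilk test at n = 50 and fail it at n = 5,000, even though the deviation is equally negligible in both cases:

set.seed(2026)

# Mildly right-skewed data: normal plus a small exponential component.
mildly_skewed <- function(n) rnorm(n) + 0.5 * rexp(n)

shapiro.test(mildly_skewed(50))$p.value   # Typically > 0.05.
shapiro.test(mildly_skewed(5000))$p.value # Typically < 0.05.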

How to Learn More

How to Learn More

StatQuest
by Josh Starmer

Very Normal
by Christian Pascual

Module 6
Exercise

Your Mission

Answer the following question using the data provided:

Was there a meaningful difference in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R)?

MISFS-R Clusters

The Multidimensional Index for Sustainable Food Systems (MISFS) is a tool designed to assess the sustainability of food systems at a subnational level in Brazil, incorporating local behaviors and practices.

The MISFS-R is a revised version that introduces new indicators and a refined methodology for calculating the index.

For more details, see Carvalho et al. (2021) and Norde et al. (2023).

MISFS-R Clusters

⚠️ Warning ⚠️

This exercise is for educational purposes only.

The data used in this exercise requires further cleaning and validation before it can be used in real-world applications. For the purposes of this analysis, we will assume the data is valid, reliable, and satisfies all assumptions underlying the statistical tests performed, even though this may not hold in practice.

Please note that the results of the statistical test may not be valid due to these simplifications.

In real-world scenarios, always ensure that the assumptions of statistical tests are rigorously checked and validated before interpreting the results.

Data Science Workflow

Remember the data science workflow:

Steps

1. Create a new project and open it in Positron.

2. Download the Quarto template file, rename it as index.qmd, and store it in the project root.

3. Download the template references file, name it reference.bib, and store it in the project root.

4. Read and review the Quarto template.

5. Perform an a priori power analysis.

6. Download the data.

7. Inspect the data file.

8. Import the data into R.

9. Clean, tidy, and validate the data.

10. Transform the data as needed.

11. Save the processed data.

12. Conduct a brief exploratory data analysis.

13. Assess the model assumptions.

14. Model the data.

15. Write your conclusions.

16. Render the report.

The Data

Notes

🧮 Recalculate the percentage.

❌ Remove municipalities with fewer than 10 monitored children.

❌ Remove municipalities where the number of children consuming ultra-processed foods exceeds the number monitored (see the sketch after these notes).

🙋‍♂️ If you’re stuck, ask for help.

🎉 Have fun!
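For reference, the cleaning notes above might translate into dplyr along these lines. This is only a hedged sketch: the toy data and the column names (municipality, monitored_n, upf_n, upf_pct) are hypothetical placeholders, not the actual names in the dataset.

library(dplyr)
library(tibble)

# Toy stand-in for the real data (hypothetical column names).
raw_data <- tribble(
  ~municipality, ~monitored_n, ~upf_n,
  "A",           120,          90,
  "B",           8,            5,  # Fewer than 10 monitored: drop.
  "C",           40,           55  # More consumers than monitored: drop.
)

raw_data |>
  filter(
    monitored_n >= 10,    # Keep municipalities with at least 10 children.
    upf_n <= monitored_n  # Remove impossible counts.
  ) |>
  mutate(upf_pct = upf_n / monitored_n * 100) # Recalculate the percentage.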

Solution

You can find the solution to this exercise at the following links:

Report

https://danielvartan.github.io/r-course-exercise

Code Repository

https://github.com/danielvartan/r-course-exercise

Conclusion

I Hope You Enjoyed the Course!

Now That You’re an R Superstar

I hope you feel more confident in your R programming skills and are ready to tackle new challenges. Remember, learning is a continuous journey, and the R community is there to support you.

🇧🇷 For Brazilian graduate students, consider exploring the abnt Quarto format. It helps you write your thesis or dissertation in compliance with the Brazilian Association of Technical Standards (ABNT) standards while ensuring full reproducibility. Check out an example here.

🙋‍♂️ If you have any questions or need assistance, don’t hesitate to reach out!

Post-Course Survey

How to Learn More?

Here are other great resources to learn more about R and programming in general.

🎓 Online Courses

Kon, F. (n.d.). Introdução à ciência da computação com python – Parte 1 [Introduction to computer science with python – Part 1] [Online course]. Coursera. https://www.coursera.org/learn/ciencia-computacao-python-conceitos (University of São Paulo, Brazil) (pt-BR)

Kon, F. (n.d.). Introdução à ciência da computação com python – Parte 2 [Introduction to computer science with python – Part 2] [Online course]. Coursera. https://www.coursera.org/learn/ciencia-computacao-python-conceitos-2s (University of São Paulo, Brazil) (pt-BR)

Peng, R. D., Leek, J., & Caffo, B. (n.d.). Data science specialization [Online course]. Coursera. https://www.coursera.org/specializations/jhu-data-science (Johns Hopkins University, United States)

How to Learn More?

🎥 Videos

Thenório, I. (2022, March 22). A saga dos computadores [The computer saga] [YouTube video]. https://www.youtube.com/playlist?list=PLYjrJH3e_wDOA5mxhiMxE6yslcIzU5NkX (pt-BR)

Code.org. (2018, January 30). How computers work [YouTube video]. https://youtube.com/playlist?list=PLzdnOPI1iJNcsRwJhvksEo1tJqjIqWbN-&si=WkuM8c-AKI-NZ3td

Jago, M. (Director). (2014, August 29). Turing machines explained [YouTube video]. Computerphile. https://youtu.be/QtW1lQITckE?si=4_wgsHUd96Dvmy-e

Lockerbie, T. (2025, December 26). I made my phone’s chip 2,000,000× bigger and flew inside [YouTube video]. https://youtu.be/QtW1lQITckE?si=4_wgsHUd96Dvmy-e

Zaidan, G., & Saini, S. (2025, February 25). How are microchips made? [YouTube video]. TED-Ed. https://youtu.be/IkRXpFIRUl4?si=iQ7xQuFS6DZLuBY7

How to Learn More?

📙 Books

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz

Bryan, J., Hester, J., Pileggi, S., & Aja, E. D. (n.d.). What they forgot to teach you about R: The stuff you need to know about R, besides data analysis. https://rstats.wtf

Bryan, J. (n.d.). Happy Git and GitHub for the useR. https://happygitwithr.com (strongly recommended)

Wickham, H. (n.d.). Tidy design principles. https://design.tidyverse.org

Wickham, H. (n.d.). The tidyverse style guide. https://style.tidyverse.org

Wickham, H. (2019). Advanced R (2nd ed.). CRC Press. https://adv-r.hadley.nz

Examples of Data Reports

Closing Remarks

Licenses: GPLv3 · CC BY-NC-SA 4.0

This presentation was created with the Quarto Publishing System. The code and materials are available on GitHub.

References

In accordance with the American Psychological Association (APA) Style, 7th edition.

Ackoff, R. (1989). From data to wisdom. Journal of Applied Systems Analysis, 16, 3–9.
Allaire, J. J., Teague, C., Xie, Y., & Dervieux, C. (n.d.). Quarto [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.5960048
Bernardes, M. E. M. (2009). Ensino e aprendizagem como unidade dialética na atividade pedagógica [Teaching and learning as a dialectical unity in pedagogical activity]. Psicologia Escolar e Educacional, 13(2), 235–242. https://doi.org/10.1590/S1413-85572009000200005
Box, G. E. P. (1979). Robustness in the strategy of scientific model building. In R. L. Launer & G. N. Wilkinson (Eds.), Robustness in statistics (pp. 201–236). Academic Press. https://doi.org/10.1016/B978-0-12-438150-6.50018-2
Boyer, C. B., & Merzbach, U. C. (2011). A history of mathematics (3rd ed.). John Wiley & Sons. (Original work published in 1968)
Broman, K. (2013, April 5). Data science is statistics. The stupidest thing... https://kbroman.wordpress.com/2013/04/05/data-science-is-statistics
Brookshear, J. G., & Brylow, D. (2020). Computer science: An overview (13 ed., Global edition). Pearson.
Bryan, J., & Hester, J. (n.d.). Happy Git and GitHub for the useR. https://happygitwithr.com
Bryan, J., Hester, J., Pileggi, S., & Aja, E. D. (n.d.). What they forgot to teach you about R: The stuff you need to know about R, besides data analysis. https://rstats.wtf
Cao, L. (2017). Data science: A comprehensive overview. ACM Computing Surveys, 50(3), 43. https://doi.org/10.1145/3076253
Carvalho, A. M. de, Verly Jr, E., Marchioni, D. M., & Jones, A. D. (2021). Measuring sustainable food systems in Brazil: A framework and multidimensional index to evaluate socioeconomic, nutritional, and environmental aspects. World Development, 143, 105470. https://doi.org/10.1016/j.worlddev.2021.105470
Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Duxbury.
Chang, W. (2018). R graphics cookbook: Practical recipes for visualizing data (2nd ed.). O’Reilly Media. https://r-graphics.org
Code.org. (2018, January 30). How computers work [Video recording]. https://youtube.com/playlist?list=PLzdnOPI1iJNcsRwJhvksEo1tJqjIqWbN-&si=WkuM8c-AKI-NZ3td
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
Coronel, C., & Morris, S. A. (2019). Database systems: Design, implementation, and management (13th ed.). Cengage.
Couch, S., Bray, A., Ismay, C., Chasnovski, E., Baumer, B., & Çetinkaya-Rundel, M. (2021). Infer: An R package for tidyverse-friendly statistical inference. Journal of Open Source Software, 6(65), 3661. https://doi.org/10.21105/joss.03661
DeGroot, M. H., & Schervish, M. J. (2012). Probability and statistics (OCLC: ocn502674206) (4th ed.). Addison-Wesley.
Denning, P. J. (2005). Is computer science science? Communications of the ACM, 48(4), 27–31. https://doi.org/10.1145/1053291.1053309
Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64–73. https://doi.org/10.1145/2500499
Ellis, P. D. (Ed.). (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge University Press.
Ellis, S. E., & Leek, J. T. (2018). How to share data for collaboration. The American Statistician, 72(1), 53–57. https://doi.org/10.1080/00031305.2017.1375987
Engels, F. (2020). Dialética da natureza [Dialektik der natur] (N. Schneider, Trans.). Boitempo. (Original work published in 1873)
Freire, P. (2011). Pedagogia da autonomia: Saberes necessários à prática educativa [Pedagogy of autonomy: Necessary knowledge for educational practice]. Paz e Terra.
Frey, B. B. (Ed.). (2022). The SAGE encyclopedia of research design (2nd ed.). SAGE Publications. https://doi.org/10.4135/9781071812082
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606. https://doi.org/10.1016/j.socec.2004.09.033
Gómez-de-Mariscal, E., Guerrero, V., Sneider, A., Jayatilaka, H., Phillip, J. M., Wirtz, D., & Muñoz-Barrutia, A. (2021). Use of the p-values as a size-dependent function to address practical differences when analyzing large datasets. Scientific Reports, 11(1), 20942. https://doi.org/10.1038/s41598-021-00199-5
Greener, R. (2020, August 4). Stop testing for normality. Medium. https://towardsdatascience.com/stop-testing-for-normality-dba96bb73f90
Greinert, R. A. G., & Martins, M. O. (2026, January 27). Acampamos na Antártida (Ep. 4) [We camped in Antarctica (Ep. 4)] [Video recording]. https://youtu.be/QtW1lQITckE?si=4_wgsHUd96Dvmy-e
Grosser, M., Bumann, H., & Wickham, H. (2021). Advanced R solutions. CRC Press.
Hartnett, K. (2021, March 23). Matrix multiplication inches closer to mythic goal. Quanta Magazine. https://www.quantamagazine.org/mathematicians-inch-closer-to-matrix-multiplication-goal-20210323
Horne, J. A., & Östberg, O. (1976). A self-assessment questionnaire to determine morningness-eveningness in human circadian rhythms. International Journal of Chronobiology, 4(2), 97–110.
Howell, D. C. (2013). Statistical methods for psychology (8th ed.). Wadsworth Cengage Learning.
Ifrah, G. (2001). The universal history of computing: From the abacus to the quantum computer (E. F. Harding, Trans.). John Wiley & Sons. (Original work published in 1994)
Ihaka, R., & Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3), 299–314. https://doi.org/10.1080/10618600.1996.10474713
Jago, M. (2014, August 29). Turing machines explained [Video recording]. Computerphile. https://youtu.be/QtW1lQITckE?si=4_wgsHUd96Dvmy-e
Kahanwal, B. (2013). Abstraction level taxonomy of programming language frameworks. International Journal of Programming Languages and Applications, 3(4). https://doi.org/10.5121/ijpla.2013.3401
Klein, L. A., Cerullo, M. J., & Cerullo, M. V. (1992). Integration of the microcomputer into the changing accounting curriculum: An analysis of recent findings. Computers & Education, 19(3), 209–217. https://doi.org/10.1016/0360-1315(92)90114-K
Kozak, M., & Piepho, H.-P. (2018). What’s normal anyway? Residual plots are more telling than significance tests when checking ANOVA assumptions. Journal of Agronomy and Crop Science, 204(1), 86–98. https://doi.org/10.1111/jac.12220
Kuhn, M., & Silge, J. (2022). Tidy modeling with R: A framework for modeling in the tidyverse. O’Reilly Media. https://www.tmwr.org
Leocadio-Miguel, M. A., Louzada, F. M., Duarte, L. L., Areas, R. P., Alam, M., Freire, M. V., Fontenele-Araujo, J., Menna-Barreto, L., & Pedrazzoli, M. (2017). Latitudinal cline of chronotype. Scientific Reports, 7(1), 5437. https://doi.org/10.1038/s41598-017-05797-w
Lockerbie, T. (2025, December 26). I made my phone’s chip 2,000,000× bigger and flew inside [Video recording]. https://youtu.be/QtW1lQITckE?si=4_wgsHUd96Dvmy-e
Lohr, S. (2014, August 18). For big-data scientists, “janitor work” is key hurdle to insights. The New York Times: Technology. https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
Marwick, B., Boettiger, C., & Mullen, L. (2018). Packaging data analytical work reproducibly using R (and friends). The American Statistician, 72(1), 80–88. https://doi.org/10.1080/00031305.2017.1375986
McIlroy, M. D., Pinson, E. N., & Tague, B. A. (1978). UNIX Time-Sharing System: Forward. Bell System Technical Journal, 57(6), 1899–1904. https://archive.org/details/bstj57-6-1899/mode/2up
Neyman, J., & Pearson, E. S. (1928a). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A(1/2), 175–240. https://doi.org/10.2307/2331945
Neyman, J., & Pearson, E. S. (1928b). On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika, 20A(3/4), 263–294. https://doi.org/10.2307/2332112
Norde, M. M., Porciuncula, L., Garrido, G., Nunes-Galbes, N. M., Sarti, F. M., Marchioni, D. M. L., & de Carvalho, A. M. (2023). Measuring food systems sustainability in heterogenous countries: The Brazilian multidimensional index updated version applicability. Sustainable Development, 31(1), 91–107. https://doi.org/10.1002/sd.2376
Papert, S. (2020). Mindstorms: Children, computers, and powerful ideas. Basic Books.
Peng, R. D. (2022). R programming for data science. https://bookdown.org/rdpeng/rprogdatascience
Perezgonzalez, J. D. (2015). Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Frontiers in Psychology, 6. https://doi.org/10.3389/fpsyg.2015.00223
Popper, K. R. (1979). Objective knowledge: An evolutionary approach. Oxford University Press. (Original work published in 1972)
R Core Team. (n.d.). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.r-project.org
Reis, J., & Housley, M. (2022). Fundamentals of data engineering: Plan and build robust data systems. O’Reilly.
Rowley, J. (2007). The wisdom hierarchy: Representations of the DIKW hierarchy. Journal of Information Science, 33(2), 163–180. https://doi.org/10.1177/0165551506070706
Schucany, W. R., & Ng, H. K. T. (2006). Preliminary goodness-of-fit tests for normality do not validate the one-sample Student t. Communications in Statistics - Theory and Methods, 35(12), 2275–2286. https://doi.org/10.1080/03610920600853308
Shatz, I. (2024). Assumption-checking rather than (just) testing: The importance of visualization and effect size in statistical diagnostics. Behavior Research Methods, 56(2), 826–845. https://doi.org/10.3758/s13428-023-02072-x
Sistema de Vigilância Alimentar e Nutricional, Coordenação-Geral de Alimentação e Nutrição, Departamento de Promoção da Saúde, Coordenação Setorial de Tecnologia da Informação, Secretaria de Atenção Primária à Saúde, & Ministério da Saúde. (n.d.). Microdados dos acompanhamentos de estado nutricional [Microdata on nutritional status monitoring] [Dataset]. openDataSUS. https://dadosabertos.saude.gov.br/dataset/sisvan-estado-nutricional
Student. (1908). The probable error of a mean. Biometrika, 6(1), 1–25. https://doi.org/10.2307/2331554
Thenório, I. (2022, March 22). A saga dos computadores [The computer saga] [Video recording]. https://www.youtube.com/playlist?list=PLYjrJH3e_wDOA5mxhiMxE6yslcIzU5NkX
Tisue, S., & Wilensky, U. (2004). NetLogo: A simple environment for modeling complexity. Center for Connected Learning and Computer-Based Modeling. http://www.ccl.sesp.northwestern.edu/papers/netlogo-iccs2004.pdf
Ushey, K., & Wickham, H. (2025). renv: Project environments [Computer software]. https://doi.org/10.32614/CRAN.package.renv
van der Loo, M., & Jonge, E. de. (2018). Statistical data cleaning with applications in R. John Wiley & Sons.
Vartanian, D. (2024). Is latitude associated with chronotype? [Corrected version, University of São Paulo]. https://doi.org/10.11606/D.100.2025.tde-02042025-063648
von Neumann, J. (1993). First draft of a report on the EDVAC. IEEE Annals of the History of Computing, 15(4), 27–75. https://doi.org/10.1109/85.238389
Vygotsky, L. S. (2012). Thought and language (E. Hanfmann, G. Vakar, & A. Kozulin, Trans.; Rev. exp. ed.). The MIT Press. (Original work published in 1934)
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2). https://doi.org/10.1080/00031305.2016.1154108
Wickham, H. (n.d.-a). The tidyverse style guide. https://style.tidyverse.org
Wickham, H. (n.d.-b). Tidy design principles. https://design.tidyverse.org
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis (2nd ed.). Springer. https://doi.org/10.1007/978-3-319-24277-4
Wickham, H. (2019). Advanced R (2nd ed.). CRC Press. https://adv-r.hadley.nz
Wickham, H. (2023). The tidy tools manifesto. Tidyverse. https://tidyverse.tidyverse.org/articles/manifesto.html
Wickham, H., & Bryan, J. (2023). R packages (2nd ed.). O’Reilly Media. https://r-pkgs.org
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz
Wilkinson, L. (2005). The grammar of graphics (J. Chambers, D. Hand, & W. Härdle, Eds.; 2nd ed.). Springer.
Zaidan, G., & Saini, S. (2025, February 25). How are microchips made? [Video recording]. TED-Ed. https://youtu.be/IkRXpFIRUl4?si=iQ7xQuFS6DZLuBY7

Thank You!