An Introduction to the R Programming Language

Daniel Vartanian

February 2, 2026

Hi There! 👋

This course will introduce you to the R programming language.

Here are the main topics:

  1. Introduction
  2. Computer Science
  3. R Basics
  4. Data Munging
  5. Exploratory Data Analysis
  6. Modeling
  7. Exercise
  8. Conclusion

Introduction

Pre-Course Survey

Main Objective

Provide a solid introduction to R programming, with a focus on the Tidyverse and its ecosystem.

Learning Objectives

  1. Understand key concepts of computer science and programming.
  2. Learn the structure and essential components of the R programming language.
  3. Learn the tidyverse grammar for data manipulation.
  4. Learn the grammar of graphics for data exploration and visualization.
  5. Learn the tidymodels grammar for building statistical and predictive models.

Future courses?

An Introduction to Git and GitHub.

An Introduction to Quarto and Reproducible Workflows.

An Introduction to Geospatial Data Science and the r-spatial Framework.

An Introduction to Shiny Dashboards.

Course Content

Schedule

The course will take place in Laboratory 122 of the School of Philosophy, Letters and Human Sciences (FFLCH) at the University of São Paulo (USP).

The classes are distributed over 5 days, with a total of 30 hours.

Monday (02/02) 10:00-17:00

Tuesday (03/02) 10:00-17:00

Wednesday (04/02) 10:00-17:00

Thursday (05/02) 10:00-17:00

Friday (06/02) 10:00-17:00

Class Dynamic

Theory ➡️ Practice

🏋 In-class activities

☕ 15-minute break between sessions

🍝 1-hour lunch break

📓 Final project

😌 No formal evaluation

Main References

Learn from Mistakes

Mistakes will happen. Don’t let them discourage you!

Getting Help

Don’t be afraid to ask questions, even if they seem simple. Everyone has to start somewhere!

An old but good way to get help is to search or ask questions on forums like Stack Overflow.

Today, you can also use AI-powered tools like Claude to get instant help with your coding questions.

Notes

💾 Bring a flash drive/pen drive

👩‍💻 Try to use the same computer for all classes

💬 Always open this presentation in the browser

❌ Don’t miss a class

🙋‍♂️ Ask questions

🤝 Help each other

🙏 Be kind

🎉 Have fun!

Course Presentation

Always have this presentation open in your browser.

We will use it to navigate through the course content.

Use it to access the course materials and exercises.

Tip: Want to create QR Codes like this? Check the qrcode R package.

danielvartan.github.io/r-course

Module 1
Computer Science

Let’s Start from Zero

This is a programming course.

As such, we must understand the basics of how a computer works.

It is essential to start from materiality; otherwise, things stop making sense.

If you understand R at its core, you can navigate any challenge you encounter.

Computer Science

The science of information processes and their interactions with the world (Denning, 2005).

Computer science studies information processes both artificial and natural (Denning, 2005). It seeks to build a scientific foundation for such topics as computer design, computer programming, information processing, algorithmic solutions of problems, and the algorithmic process itself (Brookshear & Brylow, 2020).

Abstractions

Computer Architecture

Stored-Program Concept (1945) (AKA “von Neumann Architecture”).

First proposed by J. P. Eckert and his team (1944-1945).

Input -> Storage -> Processing -> Output

Binary Arith. & Boolean Logic

Counting in decimal, binary, and trinary:

Decimal  Binary  Trinary
      0       0        0
      1       1        1
      2      10        2
      3      11       10

Adding 1 + 1 + 1 in each base:

Decimal  Binary  Trinary
   01 +    01 +     01 +
   01 +    01 +     01 +
   01 =    01 =     01 =
   03      11       10

Boolean logic truth table:

x  y  x ∧ y  x ∨ y  ¬x  ¬y
0  0      0      0   1   1
0  1      0      1   1   0
1  0      0      1   0   1
1  1      1      1   0   0

Tip 1: See this video by Iberê Thenório (2022), on the Manual do Mundo YouTube channel, to learn more (pt-BR).

Tip 2: Learn more about number bases and their history in Boyer & Merzbach (1968/2011).

Logic Gates

What Are We Actually Doing When We Program?

Basic Terms

An algorithm is a set of steps that defines how a task is performed.

A representation of an algorithm is called a program.

The process of developing a program, encoding it in machine-compatible form, and inserting it into a machine is called programming, or sometimes coding.

Programs, and the algorithms they represent, are collectively referred to as software, in contrast to the machinery itself, which is known as hardware.
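
For instance, here is a minimal sketch of the distinction: an algorithm for averaging a set of numbers (sum the values, then divide by how many there are) and a program representing it in R (the values are hypothetical).

values <- c(2, 4, 6)    # Input
total <- sum(values)    # Step 1: Sum the values
total / length(values)  # Step 2: Divide by the count
#> [1] 4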

Programming Languages

The instructions which govern this operation must be given to the device in absolutely exhaustive detail. […] All these procedures require the use of some code to express the logical and the algebraical definition of the problem under consideration, as well as the necessary numerical material (von Neumann, 1993).

Programming Paradigms

A programming paradigm is a specific approach to the programming process (Brookshear & Brylow, 2020).

Procedural (Imperative)

x <- seq(1, 10)
y <- numeric()
z <- 0

for (i in x) {
  y <- c(y, i + 1)
}

for (i in y) {
  z <- z + i
}

z / length(x)
#> [1] 6.5

Functional (Declarative)

seq(1, 10) |>
  vapply(
    \(x) x + 1,
    numeric(1)
  ) |>
  mean()
#> [1] 6.5

Object-Oriented (Imperative)

library(S7)

dog <- new_class(
  name = "dog",
  properties = list(
    name = class_character,
    age = class_numeric
  )
)

lola <- dog(name = "Lola", age = 11)
speak <- new_generic("speak", "x")

method(speak, dog) <- function(x) "Woof"

speak(lola)
#> [1] "Woof"

Low and High-Level

Open-Source Prog. Lang.

Why R?

R is an excellent language to get started with programming because it’s specialized. Unlike some other languages, it doesn’t have an overwhelming number of features for you to learn.

Python is Fine Too, I Guess…

In a general sense, Python is also good for learning how to program, but it is much easier to learn how to work with data in R. In academia, both programming languages are very important.

It’s Not What You Think

Programming in movies versus programming in real life:

A Bit of R History

R is a free and open-source programming language designed for data analysis and graphics (Ihaka & Gentleman, 1996).

It was developed by Ross Ihaka and Robert Gentleman in the statistics department at the University of Auckland (New Zealand) in 1991 and introduced to the scientific community in 1993 (Peng, 2022).

Community

Important R Com. and Events

How to Learn More

Module 2
R Basics

Installing R

Installing R is very simple.

You just need to download the installer from the Comprehensive R Archive Network (CRAN).

Go to Download R for [YOUR OS] and then click on the base link (the base version of R).

Simply follow the instructions of the installer.

Installing R

The Positron IDE

An IDE (Integrated Development Environment) is software that provides a set of tools to help programmers write code. It brings together things like a text editor, a console, a debugger, and a file manager.

Positron is a flavour of Visual Studio Code, the most widely used IDE for programming in any language. It supersedes RStudio, for many years the main IDE for R, and focuses on providing a better experience for programming in R and Python.

Note: Always use technical software in English.

The Positron IDE

Installing Positron

Creating a Project

Project-oriented workflow.

A project consolidates all related files and resources in one place, ensuring that your work is reproducible and well-organized.

It is important to have a structured project directory. A best practice is to follow the structure used for R packages (Marwick et al., 2018).

Every data project must be accompanied, at a minimum, by a README file explaining the project and a LICENSE file defining the terms of use of the data (see Open Source Licenses here).

Creating a Project

Add. a README & LICENSE File

You can use the usethis R package to create these files.

First, load the package using:

# install.packages("usethis")
library(usethis)

Add a README file using:

use_readme_md()

Add a GNU General Public License version 3 file using:

use_gpl3_license()

Add. a README & LICENSE File

Quarto Notebooks

Markdown Basics

Markdown is a lightweight markup language with plain text formatting syntax. The great thing about Markdown is that you can forget about the formatting and focus on the content.

Quarto Notebooks are a type of Markdown document that allows you to mix code, text, and output in a single document. These slides are all written in Markdown.

Learn more about the Markdown syntax here.
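
Here is a minimal sketch of a Quarto notebook source, mixing Markdown text with an executable R code chunk (the title and content are hypothetical):

---
title: "My First Notebook"
---

# A Heading

Some text with **bold**, *italic*, and `code`.

```{r}
1 + 1
```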

Creating a Quarto Notebook

Objects

Everything in R is an object.

Everything that happens in R is the result of a function call.
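
You can see both ideas in action: even the + operator is an object of class function, and writing it with backticks makes the function call explicit.

`+`(1, 2) # The same as 1 + 2
#> [1] 3
`+` |> class()
#> [1] "function"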

Scalars (0D): 1

Vectors (1D): c(1, 2, 3)

Matrices (2D):
matrix(1:9, nrow = 3, ncol = 3)

Arrays (nD):
array(1:27, dim = c(3, 3, 3))

Vectorized Operations

In R, most operations are vectorized, meaning they operate on all elements of a vector at once, without needing explicit loops.

This makes code more concise, readable, and less error-prone, while improving efficiency compared to non-vectorized languages.

c(1, 2, 3) + 1
#> [1] 2 3 4
c(1, 2, 3) * c(1, 2, 3)
#> [1] 1 4 9

Primary Data Types

These are part of the atomic data types in R.

c(1, 2, 3) |> is.atomic()
#> [1] TRUE
as.Date("2026-01-01") |> class()
#> [1] "Date"
as.Date("2026-01-01") |> typeof()
#> [1] "double"

Data Classes

  • logical (e.g., TRUE, FALSE)
  • integer (e.g., 1, 2, 3)
  • double (e.g., 1.0, 2.0, 3.0)
  • character (e.g., “Maria”, “John”)
  • factor (e.g., 1 = “Male”, 2 = “Female”)
  • complex (e.g., 1+2i, 3+4i)
  • raw (e.g., 01, 0f, 10)
  • ts (e.g., time series data)
  • function (e.g., function objects)

And much more…

Non-Atomic Objects

Non-atomic types are objects that can be composed of different data types.

They can also be recursive objects, meaning they can contain themselves as elements.

list() |> is.atomic()
#> [1] FALSE
list() |> is.recursive()
#> [1] TRUE
c("a", "b", "c", c("d", "e"))
#> [1] "a" "b" "c" "d" "e"
c(1L, 2L, 3L, c(4L, 5L))
#> [1] 1 2 3 4 5
c(1.123, 2.123, 3.123, c(4.123, 5.123))
#> [1] 1.123 2.123 3.123 4.123 5.123
c(TRUE, FALSE, TRUE, c(FALSE, TRUE))
#> [1]  TRUE FALSE  TRUE FALSE  TRUE
list(list(TRUE, 1L, pi, "a"))
#> [[1]]
#> [[1]][[1]]
#> [1] TRUE
#> 
#> [[1]][[2]]
#> [1] 1
#> 
#> [[1]][[3]]
#> [1] 3.141593
#> 
#> [[1]][[4]]
#> [1] "a"

Operations

Left-Hand Side | Operator | Right-Hand Side (e.g., 1 + 1)

Arithmetic Operators

1 + 1
#> [1] 2
1 - 1
#> [1] 0
1 * 2
#> [1] 2
1 / 2
#> [1] 0.5
2^2
#> [1] 4
5 %/% 2
#> [1] 2
5 %% 2
#> [1] 1

Assignment Operators

  • `<-`() Leftwards assignment
    (use this!)
  • `<<-`() Leftwards assignment (Global Environment)
  • `->`() Rightwards assignment
  • `->>`() Rightwards assignment (Global Environment)
  • `=`() Equals sign assignment
x <- 1
x
#> [1] 1
x <<- 1
x
#> [1] 1
1 -> x
x
#> [1] 1
1 ->> x
x
#> [1] 1
x = 1
x
#> [1] 1
1 = x # Error!

Names

Names

R has strict rules regarding the names of objects.

  • Names can only contain letters, numbers, periods (.), and underscores (_).
  • Names can’t start with a number or an underscore.
  • Names can’t start with a period followed by a number.
  • Names are case-sensitive.
  • Names can’t be reserved words.

variable-name Bad (kebab-case)

variable.name Good, but not advisable

variable_name Good (snake_case), most used in R

1variable_name Bad

_variable_name Bad

.1variable_name Bad

variableName Good (camelCase)

VariableName Good (PascalCase)

TRUE Bad (reserved word)

Comparison Operators

1 == 1
#> [1] TRUE
1 != 1
#> [1] FALSE
2 > 1
#> [1] TRUE
2 < 1
#> [1] FALSE
1 >= 1
#> [1] TRUE
2 <= 1
#> [1] FALSE

Logical Operators

c(TRUE, FALSE) & c(TRUE, TRUE)
#> [1]  TRUE FALSE
TRUE && FALSE
#> [1] FALSE
c(TRUE, FALSE) && c(TRUE, TRUE) # Error!
c(TRUE, FALSE) | c(TRUE, TRUE)
#> [1] TRUE TRUE
TRUE || FALSE
#> [1] TRUE
c(TRUE, FALSE) || c(TRUE, TRUE) # Error!
!TRUE
#> [1] FALSE

Other Operators

1:10
#>  [1]  1  2  3  4  5  6  7  8  9 10
x <- -5
y <- 0

x:y # Bad! Use `seq()` instead.
#> [1] -5 -4 -3 -2 -1  0
"a" %in% c("a", "b", "c")
#> [1] TRUE
x <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)
y <- matrix(5:8, nrow = 2, ncol = 2, byrow = TRUE)

x %*% y
#>      [,1] [,2]
#> [1,]   19   22
#> [2,]   43   50

Operator Precedence

  1. `^`() Exponent (Right to Left)
  2. -x, +x Unary minus, Unary plus (Left to Right)
  3. `%%`() Modulus (Left to Right)
  4. `*`(), `/`() Multiplication, Division (Left to Right)
  5. `+`(), `-`() Addition, Subtraction (Left to Right)
  6. `==`(), `!=`() Comparisons (Left to Right)
  7. `!`() Logical NOT (Left to Right)
  8. `&`(), `&&`() Logical AND (Left to Right)
  9. `|`(), `||`() Logical OR (Left to Right)
  10. `->`(), `->>`() Rightward assignment (Left to Right)
  11. `<-`(), `<<-`() Leftward assignment (Right to Left)
  12. `=`() Leftward assignment (Right to Left)
2^2 + 1 * 2
#> [1] 6
!FALSE && TRUE
#> [1] TRUE

Subsetting

Atomic

  • `[`() One-level extraction
c("a", "b", "c")[1]
#> [1] "a"

Indexation

Indexation in R starts at 1!

c("a", "b", "c")[0]
#> character(0)
c("a", "b", "c")[1]
#> [1] "a"
c("a", "b", "c")[2]
#> [1] "b"
c("a", "b", "c")[3]
#> [1] "c"

Non-Atomic

data <- list(a = 1, b = list(2, 3))
data[1]
#> $a
#> [1] 1
data[[1]]
#> [1] 1
data$a # == data[["a"]]
#> [1] 1
data[["b"]][[2]]
#> [1] 3

Subsetting

Data Frames

  • x[i, ] Extract row i
  • x[, j] Extract column/variable j
  • x[i, j] Extract row i of column/variable j
data <- data.frame(a = c(1, 2), b = c(4, 5))

data
#>   a b
#> 1 1 4
#> 2 2 5
data["a"]
#>   a
#> 1 1
#> 2 2
data[["a"]] # == data$a
#> [1] 1 2
data[1, ]
#>   a b
#> 1 1 4
data[, 1]
#> [1] 1 2
data[1, 1]
#> [1] 1

Special Values

NA (Not Available)

Missing values must be explicitly declared in R. For that, R uses the NA value.

NA comes in different flavors, for example: NA_integer_, NA_real_, NA_character_, NA_complex_.

If you use just NA, R will use the most appropriate type.
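
A quick way to see these flavors is to inspect their types:

NA |> typeof()
#> [1] "logical"
NA_real_ |> typeof()
#> [1] "double"
NA_character_ |> typeof()
#> [1] "character"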

c(1, NA, 3)
#> [1]  1 NA  3
is.na(c(1, NA, 3))
#> [1] FALSE  TRUE FALSE

NaN (Not a Number)

0 / 0
#> [1] NaN
is.nan(0 / 0)
#> [1] TRUE

Inf (Infinite)

1 / 0
#> [1] Inf
-1 / 0
#> [1] -Inf
is.infinite(1 / 0)
#> [1] TRUE

Tip: See the naniar package.

Pipes

R Native Pipe
(Pipe introduced in R 4.1.0; the _ placeholder requires R 4.2.0 or later)

object |> function()
object |> function(1, par = _)
pi |>
  round() |>
  rep("pi", times = _)
#> [1] "pi" "pi" "pi"

magrittr Pipe

object %>% function()
object %>% function(1, par = .)
object %>% function(1, .)
library(magrittr)
pi %>%
  round() %>%
  rep("pi", times = .)
#> [1] "pi" "pi" "pi"

Click here to learn the differences between the base R and magrittr pipes.

Commenting

🚨 Only if necessary 🚨

Your code must speak for itself.

In data analysis code, only use comments to record important things. If you need comments to explain what your code is doing, consider rewriting your code to be clearer (Wickham, n.d.-a).

# This is a comment.

1 + 1 # This is also a comment.

# 1 + 1 # You can also comment code that you don't want to run.

Documentation

One of the most important things in programming is knowing how to find help. In R, you can use the help() function to get help on a function or package.

Since R has a more academic focus, documentation is usually plentiful and very well written.

help(sum)
?sum

Documentation Websites: The logolink R package documentation.

Control Flow: Choices

if (condition) true_action
if (condition) true_action else false_action
x <- 60

if (x > 90) {
  "A"
} else if (x > 80) {
  "B"
} else if (x > 50 && x < 100) {
  "C"
} else if (x > 1) {
  "D"
} else {
  "E"
}
#> [1] "C"

else if statements are optional and can be repeated as needed.

else is also optional.

is.* Functions

x <- "a"

if (is.numeric(x)) {
  "A"
} else {
  "B"
}
#> [1] "B"

Exercise (Choices)

Write an R script that asks the user to select a traffic light color (red, yellow, or green). Based on the user’s selection, the script should print the appropriate action:

  • If the user selects red, print Stop!
  • If the user selects yellow, print Slow down!
  • If the user selects green, print Go!

Use menu() to ask for the color selection. Note that the function returns the index of the selected choice (1, 2, or 3).

color <- menu(
  choices = c("red", "yellow", "green"),
  title = "Select a traffic light color:"
)

Use print() to display the action.

print("Your action here")
#> [1] "Your action here"

Exercise (Choices)

Expected output:

color <- menu(
  choices = c("red", "yellow", "green"),
  title = "Select a traffic light color:"
)
#> Select a traffic light color:
#>
#> 1: red
#> 2: yellow
#> 3: green

#> Selection: 2

#>
#> [Your code goes here!]
#>

#> [1] "Slow down!"

Answer:

Code
# color <- menu(
#   choices = c("red", "yellow", "green"),
#   title = "Select a traffic light color:"
# )

color <- 2  # For demonstration purposes

if (color == 1) {
  print("Stop!")
} else if (color == 2) {
  print("Slow down!")
} else if (color == 3) {
  print("Go!")
} else {
  print("Invalid selection")
}
#> [1] "Slow down!"

Control Flow: Loops

🚨 Avoid loops 🚨 (if you can). Use functionals instead.

while (condition) perform_action
x <- 0

while (x < 2) {
  x <- x + 1

  print(x)
}
#> [1] 1
#> [1] 2
repeat perform_action

Use break to exit a loop early.

x <- 0

repeat {
  x <- x + 1

  if (x == 2) break
}

x
#> [1] 2
for (item in vector) perform_action
for (i in 1:3) {
  print(i)
}
#> [1] 1
#> [1] 2
#> [1] 3

Use next to skip to the next iteration.

for (i in 1:3) {
  if (i == 2) {
    next
  }

  print(i)
}
#> [1] 1
#> [1] 3

Exercise (for Loops)

Write an R script that goes through temperatures for 7 days and gives weather advice for each day.

  1. You have temperature data for a week:

temperatures <- c(18, 22, 14, 27, 7, 21, 31)

  2. Use a for loop to go through each temperature.

  3. For each day, use if/else to give advice:

  • If the temperature is below 18°C, print Day x: Cold - Wear a coat!
  • If the temperature is between 18 and 23°C, print Day x: Nice - Perfect weather!
  • If the temperature is above 23°C, print Day x: Hot - Stay hydrated!

  4. Replace x with the day number (1-7).

Exercise (for Loops)

Use seq_along() to iterate over the indices of the temperatures vector.

temperatures |> seq_along()
#> [1] 1 2 3 4 5 6 7

Use paste() or paste0() to paste values.

i <- 1

paste() uses " " as the default separator.

paste("Day", i, ": Cold - Wear a coat!")
#> [1] "Day 1 : Cold - Wear a coat!"

paste0() uses "" (empty string) as the default separator.

paste0("Day ", i, ": Cold - Wear a coat!")
#> [1] "Day 1: Cold - Wear a coat!"

Exercise (for Loops)

Expected output:

temperatures <- c(18, 22, 14, 27, 7, 21, 31)

#>
#> [Your code goes here!]
#>

#> [1] "Day 1: Nice - Perfect weather!"
#> [1] "Day 2: Nice - Perfect weather!"
#> [1] "Day 3: Cold - Wear a coat!"
#> [1] "Day 4: Hot - Stay hydrated!"
#> [1] "Day 5: Cold - Wear a coat!"
#> [1] "Day 6: Nice - Perfect weather!"
#> [1] "Day 7: Hot - Stay hydrated!"

Solution:

Code
temperatures <- c(18, 22, 14, 27, 7, 21, 31)

for (i in seq_along(temperatures)) {
  i_temperature <- temperatures[i]

  if (i_temperature < 18) {
    paste0("Day ", i, ": Cold - Wear a coat!") |> print()
  } else if (i_temperature <= 23) {
    paste0("Day ", i, ": Nice - Perfect weather!") |> print()
  } else {
    paste0("Day ", i, ": Hot - Stay hydrated!") |> print()
  }
}
#> [1] "Day 1: Nice - Perfect weather!"
#> [1] "Day 2: Nice - Perfect weather!"
#> [1] "Day 3: Cold - Wear a coat!"
#> [1] "Day 4: Hot - Stay hydrated!"
#> [1] "Day 5: Cold - Wear a coat!"
#> [1] "Day 6: Nice - Perfect weather!"
#> [1] "Day 7: Hot - Stay hydrated!"

Exercise (while Loops)

Write an R script that keeps asking the user to guess a secret number until they get it right or run out of attempts.

  1. Set a secret number.

  2. Give the user 5 attempts to guess.

  3. Use a while loop that continues as long as the user hasn’t guessed the number and still has attempts left.

  4. Inside the loop:

  • Ask for a guess.
  • Check if the guess is correct. If yes, exit the loop.
  • If wrong, check if the guess is too high or too low and give a hint.
  • Decrease the number of attempts.

  5. After the loop, print whether they won or lost.

Exercise (while Loops)

Use sample() to set a secret number between 1 and 100.

secret <- sample(1:100, 1)
secret
#> [1] 93

Set attempts and correct variables to control the loop.

attempts <- 5
correct <- FALSE

Use readline() to ask for a guess and as.numeric() to convert it to a number.

guess <-
  readline(prompt = "Enter your guess: ") |>
  as.numeric()

Exercise (while Loops)

Expected behavior:

secret <- sample(1:100, 1)
attempts <- 5
correct <- FALSE

#>
#> [Your code goes here!]
#>

#> [1] "You have 5 attempts left."
#> [Enter your guess: 50
#> [[1] "Too high! Try again."
#> [[1] "You have 4 attempts left."
#> [Enter your guess: 30
#> [[1] "Too high! Try again."
#> [[1] "You have 3 attempts left."
#> [Enter your guess: 20
#> [[1] "Too high! Try again."
#> [[1] "You have 2 attempts left."
#> [Enter your guess: 10
#> [[1] "Correct! You won!"

Solution:

Code
secret <- sample(1:100, 1)
attempts <- 5
correct <- FALSE

while (attempts > 0 && !correct) {
  paste("You have", attempts, "attempts left.") |> print()

  guess <-
    readline(prompt = "Enter your guess: ") |>
    as.numeric()

  if (guess == secret) {
    correct <- TRUE
  } else if (guess < secret) {
    print("Too low! Try again.")
  } else {
    print("Too high! Try again.")
  }

  attempts <- attempts - 1
}

if (correct) {
  print("Correct! You won!")
} else {
  print(paste("Game over! The secret number was", secret))
}

Functions

name <- function(parameter_1, parameter_2, ...) expression

The last evaluated expression is the return value of the function.

foo <- function(x, y = 1) {
  x + y
}

foo(9)
#> [1] 10

Shorthand Not./Anonymous Fun.
(Introduced in R 4.1.0)

foo <- function(x) x(9)

foo(\(x) x + 1)
#> [1] 10

You can use return() to return a value before the end of the function, but this is a bad practice.

foo <- function(x) {
  return(x + 1)

  "a"
}

foo(9)
#> [1] 10

Exercise (Functions)

Write a function that checks if a password meets basic security requirements.

  1. Create a function called validate_password that takes one parameter: password.

  2. Check if the password is a character string and has at least 8 characters.

  3. The function should return:

  • Password must be text if it’s not a character.
  • Password too short if it’s less than 8 characters.
  • Valid password if all conditions are met.

Test your function with different inputs to ensure it works correctly.

Exercise (Functions)

Use is.character() to check if the password is a character string.

password <- "mypassword"
password |> is.character()
#> [1] TRUE

Use nchar() to get the number of characters in the password.

password |> nchar()
#> [1] 10

Use stop() to throw an error if the password is invalid.

stop("Password must be text")
#> Error:
#> ! Password must be text

Exercise (Functions)

Expected behavior:

#>
#> [Your code goes here!]
#>

validate_password(12345678)
#> Error:
#> ! Password must be text

validate_password("hello")
#> Error:
#> ! Password too short

validate_password("hello123")
#> [1] "Valid password"

Solution:

Code
validate_password <- function(password) {
  if (!is.character(password)) {
    stop("Password must be text")
  } else if (nchar(password) < 8) {
    stop("Password too short")
  } else {
    "Valid password"
  }
}

validate_password(12345678)
#> Error:
#> ! Password must be text
validate_password("hello")
#> Error:
#> ! Password too short
validate_password("hello123")
#> [1] "Valid password"

Functionals

A functional is a function that takes a function as an input and returns a vector as output (Wickham, 2019).

It makes the programmer’s work easier by applying a function to a vector without needing a loop. These functions are typically written in C (a lower-level programming language), making them very fast.

R has native functionals (e.g., lapply()), but prefer the solutions from the purrr package, part of the Tidyverse.

triple <- function(x) x * 3
with_for_loop <- function(x) {
  out <- numeric()
  for (i in x) out <- c(out, triple(i))
  out
}
library(purrr)

with_map <- function(x) map_dbl(x, triple)
library(microbenchmark)

microbenchmark(
  with_for_loop(1:1000),
  with_map(1:1000),
  times = 100,
  check = "equal"
)
#> Unit: microseconds
#>                   expr      min        lq      mean   median       uq      max
#>  with_for_loop(1:1000) 1054.845 1195.9870 1623.9336 1252.052 1679.917 5111.942
#>       with_map(1:1000)  335.771  361.8655  433.7118  396.855  451.498 1626.814
#>  neval cld
#>    100  a 
#>    100   b

Exercise (Functionals)

Use the map() function, from the purrr package, to adjust student grades.

  1. Load the purrr library using library(purrr).

  2. Use the following test scores from three different classes:

test_scores <- list(
  class_a = c(72, 85, 68, 90, 77),
  class_b = c(65, 70, 62, 75, 68),
  class_c = c(88, 92, 85, 95, 90)
)

  3. Use map() to add 5 points to every score.

  4. Finally, use map() again to multiply each score by 1.1 (a 10% increase).

Hints:

  • Don’t forget to load the purrr package!
  • Always use the pipe operator (|>) to improve code readability.
  • Use anonymous functions with map().
  • If needed, see map() documentation with ?map.

Exercise (Functionals)

Expected output:

library(purrr)

test_scores <- list(
  class_a = c(72, 85, 68, 90, 77),
  class_b = c(65, 70, 62, 75, 68),
  class_c = c(88, 92, 85, 95, 90)
)

#>
#> [Your code goes here!]
#>

#> $class_a
#> [1]  84.7  99.0  80.3 104.5  90.2
#>
#> $class_b
#> [1] 77.0 82.5 73.7 88.0 80.3
#>
#> $class_c
#> [1] 102.3 106.7  99.0 110.0 104.5

Solution:

Code
library(purrr)

test_scores <- list(
  class_a = c(72, 85, 68, 90, 77),
  class_b = c(65, 70, 62, 75, 68),
  class_c = c(88, 92, 85, 95, 90)
)

test_scores |>
  map(\(x) x + 5) |>
  map(\(x) x * 1.1)
#> $class_a
#> [1]  84.7  99.0  80.3 104.5  90.2
#> 
#> $class_b
#> [1] 77.0 82.5 73.7 88.0 80.3
#> 
#> $class_c
#> [1] 102.3 106.7  99.0 110.0 104.5

Environments

Environments are the data structures that power scoping (Wickham, 2019).

Scoping: The act of finding the value associated with a name.

foo <- function() {
  x <- 20

  x
}

x <- 10

foo()
#> [1] 20
y <- 45

bar <- function() {
  y <- 20

  y
}

bar()
#> [1] 20
y
#> [1] 45
z <- 100

foo <- function() {
  z <<- 50

  invisible()
}

z
#> [1] 100
foo()
z
#> [1] 50
.GlobalEnv$foo
#> function () 
#> {
#>     z <<- 50
#>     invisible()
#> }

Packages

The Unix philosophy

Each program must do one thing well (McIlroy et al., 1978).

In R, the fundamental unit of shareable code is the package.

A package bundles together code, data, documentation, and tests, and is easy to share with others (Wickham & Bryan, 2023).

install.packages("package_name")
library("package_name")
library(hms)
library(lubridate)
library(mctq)
so(parse_hm("23:00"), dhours(2))
#> 01:00:00
package_name::function_name()
hms::parse_hm("23:00")
#> 23:00:00

Note: Other programming languages refer to packages as libraries.

Packages

R is just the core. It needs packages to do anything useful.

A typical installation of R comes with a set of packages, like:

  • base: Basic R functions (e.g., sum())
  • datasets: Some datasets for testing and teaching (e.g., mtcars)
  • graphics: The basic graphics functions (e.g., plot())
  • grDevices: The graphics devices (e.g., pdf())
  • methods: The built-in object-oriented programming system
  • parallel: Support for parallel computation
  • stats: Basic statistical functions (e.g., t.test())
  • utils: Utility functions (e.g., install.packages())

Other Kind of Objects

Based on the atomic types, we can create other types of objects.

Using the right kind of object in R is very important, because these objects also include methods to deal with particular types of data.

For example, time can be expressed in different ways, like linear time (e.g., durations, periods, date-times, intervals) and circular time (e.g., time of day). We can also have time series, which are sequences of data points taken at successive points in time.

Date
(Days since 1970-01-01 (UNIX epoch))

as.Date("1970-01-01") |> as.numeric()
#> [1] 0
as.Date("1970-01-01") + 1
#> [1] "1970-01-02"

POSIXct
(Seconds since 1970-01-01 (UNIX epoch))

as.POSIXct("1970-01-01 00:00:00", tz = "UTC") |>
  as.numeric()
#> [1] 0
as.POSIXct("1970-01-01 00:00:00", tz = "UTC") + 1
#> [1] "1970-01-01 00:00:01 UTC"

The Tidyverse

Tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy (the Tidy Tools Manifesto), grammar, and data structures.

It was created by Hadley Wickham, a prominent figure in the R community and author of several key references for this course. Hadley Wickham’s contributions have significantly shaped modern R programming.

Main Tidyverse Packages

The Tidyverse also has a meta-package that installs and loads the most important packages from the collection.

install.packages("tidyverse")
library("tidyverse")
  • readr: Read flat files (e.g., txt, csv, tsv) into R.
  • stringr: A fresh approach to string manipulation in R.
  • lubridate: Make working with dates in R just that little bit easier.
  • dplyr: A grammar of data manipulation.
  • purrr: A functional programming toolkit for R.
  • forcats: Tools for working with categorical variables (factors).
  • ggplot2: An implementation of the Grammar of Graphics in R.

The rOpenSci

rOpenSci is a non-profit initiative founded in 2011 that brings together a community of researchers and developers committed to open science. They create and maintain high-quality R packages that provide access to data and tools from diverse sources, ensuring reliability and reproducibility in scientific research.

All packages go through a peer-review process, which ensures that they are well written and reliable.

Example: mctq R package peer-review.

Base R Solutions

You can do most of the things you need with the packages that come bundled with R. However, that is not the most efficient way to do things today.

In this course we are going to focus on the Tidyverse and rOpenSci packages, which are the most modern and efficient way to work with data in R.

If you want a deeper understanding of the R language, I encourage you to explore the base R solutions.

Best Practices

Always follow the Tidyverse style guide to write clean and efficient R code:

Tidyverse Style Guide

Variable and function names should use only lowercase letters, numbers, and _. Use underscores (_) (so-called snake case) to separate words within a name.

# Good
day_one
day_1

# Bad
DayOne
dayone

Tidy Design Principles

The tidyverse has four guiding principles:

  • It is human centered.
  • It is consistent.
  • It is composable.
  • It is inclusive.

How to Learn More

Module 3
Data Munging

Data Science

Data science is the study of the generalizable extraction of knowledge from data (Dhar, 2023).

For some, data science is just statistics (Broman, 2013) (hype statistics). For others, it’s a new interdisciplinary field that synthesizes statistics, informatics, computing, communication, management, and sociology (Cao, 2017).

Data Engineering

Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning (Reis & Housley, 2022).

First Things First

You only analyze or visualize data when you already have clean, tidy, and validated data.

The process of getting data ready for analysis is called data munging. You may also see it called data wrangling (usually when dealing with machine learning) or, simply, data cleaning.

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets (Lohr, 2014).

What Is Data After All?

Ackoff’s DIKW pyramid (Ackoff, 1989)

Data < Information < Knowledge < Wisdom

Data versus the interpretation of the data.

What Is Data After All?

Data is an abstraction. It’s a representation of the world around us. Without context, it has no meaning.

Statistical Value Chain

A value chain, roughly, consists of a sequence of activities that increase the value of a product step by step. […] One should realize that although the schema nicely organizes data analysis activities, in practice, the process is hardly linear (van der Loo & Jonge, 2018).

Raw Data

With raw data, we mean the data as it arrives at the desk of the analyst. The state of such data may of course vary enormously, depending on the data source (van der Loo & Jonge, 2018).

If the researcher has made any modifications to the raw data, it is not the raw form of the data (S. E. Ellis & Leek, 2018).

Tidy Data

Tidy Data

Untidy to Tidy
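
To make the idea concrete, here is a minimal sketch with hypothetical data: in the tidy version, each variable is a column and each observation is a row. The pivot_longer() function from the tidyr package reshapes the untidy wide table into the tidy long form.

library(tidyr)
library(tibble)

# Untidy: year values are spread across the column names
untidy <- tibble(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 21)
)

untidy |>
  pivot_longer(
    cols = -country,
    names_to = "year",
    values_to = "value"
  )
#> # A tibble: 4 × 3
#>   country year  value
#>   <chr>   <chr> <dbl>
#> 1 A       2019     10
#> 2 A       2020     11
#> 3 B       2019     20
#> 4 B       2020     21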

Valid Data

Data validation techniques are used to ensure that data is accurate, consistent, and reliable.

Examples of invalid data:

  • Impossible values (e.g., negative age).
  • Biologically implausible values (BIVs) (e.g., a child with a height of 2 meters).
  • Improbable values (e.g., a person 200 years old).
  • Duplicated values (e.g., the same person with two different ages).

Tip: For details on identifying biologically implausible values (BIVs) in anthropometric data, see the World Health Organization (WHO) anthro R package.
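
As a minimal sketch (with hypothetical data), a validation rule can be expressed as a logical test and used to flag invalid rows:

library(dplyr)

# Hypothetical data with an impossible and an improbable age
people <- tibble(id = 1:4, age = c(25, -3, 200, 42))

people |> filter(age < 0 | age > 120) # Flag the invalid rows
#> # A tibble: 2 × 2
#>      id   age
#>   <int> <dbl>
#> 1     2    -3
#> 2     3   200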

Tabular Data

Daily air quality measurements in New York (May to September 1973).

?airquality

Hierarchical Data

Data can be hierarchical, with multiple levels of organization.

Interest rate representing the adjusted average rate of overnight repurchase agreements backed by Brazil’s federal government securities held in the Special System for Settlement and Custody (SELIC). Reported in % per day.

library(httr2)

request("https://api.bcb.gov.br/") |>
  req_url_path_append("dados") |>
  req_url_path_append("serie") |>
  req_url_path_append("bcdata.sgs.11") |>
  req_url_path_append("dados") |>
  req_url_query(
    formato = "json",
    dataInicial = "01/12/2024",
    dataFinal = "03/12/2024"
  ) |>
  req_user_agent("github.com/danielvartan/r-course") |>
  req_perform() |>
  resp_body_json()

#> [[1]]
#> [[1]]$data
#> [1] "02/12/2024"
#> 
#> [[1]]$valor
#> [1] "0.041957"
#> 
#> 
#> [[2]]
#> [[2]]$data
#> [1] "03/12/2024"
#> 
#> [[2]]$valor
#> [1] "0.041957"

Spatial Data

Not all data is tabular; spatial data can be very large and complex.

Excel cannot handle spatial data, and GUI-based statistical software, when capable of handling spatial data, is often limited and struggles with performance issues.

Projected monthly average maximum temperatures (°C) for June 2021–2040, based on the global climate model (GCM) ACCESS-CM2.

Spreadsheet Syndrome

Spreadsheet syndrome is a term used to describe the problems that arise from using spreadsheets to manage data (Klein et al., 1992).

Relational Databases

Developed by E. F. Codd of IBM in 1970, the relational model is based on mathematical set theory and represents data as independent relations. Each relation (table) is conceptually represented as a two-dimensional structure of intersecting rows and columns. The relations are related to each other through the sharing of common entity characteristics (values in columns) (Coronel & Morris, 2019).
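
Here is a minimal sketch of the idea with two hypothetical tables: the relations share a common key (species_id), and a join relates them through it.

library(dplyr)

species <- tibble(
  species_id = c(1, 2),
  species = c("Adelie", "Gentoo")
)

measurements <- tibble(
  species_id = c(1, 1, 2),
  body_mass_g = c(3750, 3800, 5000)
)

# Relate the two tables through the shared key
measurements |> left_join(species, by = "species_id")
#> # A tibble: 3 × 3
#>   species_id body_mass_g species
#>        <dbl>       <dbl> <chr>  
#> 1          1        3750 Adelie 
#> 2          1        3800 Adelie 
#> 3          2        5000 Gentoo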

Data Documentation

Proper documentation is crucial for data sharing, reproducibility, and ensuring that others can understand and use your data effectively.

Here are some examples and guides to help you document your data:

Check S. E. Ellis & Leek (2018) for a detailed guide on data documentation.

Open Data Formats

There are many open data formats available for researchers to use. Open can mean different things, but in this context, it means that the format is not proprietary and can be used by anyone. Here are some examples:
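
  • csv/tsv: Delimited plain text for tabular data.
  • json: Hierarchical, human-readable data.
  • parquet: Columnar storage for large tabular data.
  • netcdf/geotiff: Array and spatial raster data.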

⚠️ Excel Files Are Not an Open Data Format! ⚠️

Data Frames

Data frames are a special type of list used for storing data tables. They are the most common way of storing data in R.

data.frame(a = 1:3, b = c("A", "B", "C"))
library(dplyr)
tibble(a = 1:3, b = c("A", "B", "C"))
library(mctq)
?std_mctq
View(std_mctq)

The tibble Package

tibble is a modern take on data frames in R, offering improved usability and integration with the Tidyverse. It enhances readability while retaining essential data frame features.

install.packages("tibble")

Using R native data.frame class:

mtcars
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Using tibble (Tidyverse way):

library(dplyr) # or library(tibble)

mtcars |> as_tibble()
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows

How to Read Data in R

It will depend on what kind of data you are working with.

Unlike Excel or GUI-based statistical software, R can deal with virtually any kind of data.

Examples of R functions to read data:
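
For instance (a minimal, non-exhaustive sketch; the file names are hypothetical, and each function comes from the package before the ::):

readr::read_csv("data.csv")           # Delimited text files
readxl::read_excel("data.xlsx")       # Excel spreadsheets
haven::read_sav("data.sav")           # SPSS data files
jsonlite::read_json("data.json")      # JSON files
arrow::read_parquet("data.parquet")   # Parquet files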

Meet the Penguins

Created by the great Allison Horst, the author of these beautiful illustrations.

install.packages("palmerpenguins")

The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.

These data were collected from 2007–2009 by Dr. Kristen Gorman with the Palmer Station, part of the US Long Term Ecological Research Network (LTER).

We will use this package to get familiar with R.

Tip: See this video by Renan, Michele, and Mucuvinha (2026) on the Mundo Sem Fim YouTube channel to learn more about these penguins.

Meet the Penguins

Inspecting the Raw Data File

🕵 Know your data.

  1. Click here to download the Palmer Penguins raw data file.
  2. Save it in a folder named data-raw in the root of your project.
  3. Using the file manager, open the file in Positron.

The data documentation can be accessed by running:

library(palmerpenguins)
?penguins_raw

Inspecting the Raw Data File

Before importing the data to R, let’s first take a look at the content of the data file in a text editor.

palmerpenguins-raw.csv
studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181,3750,MALE,NA,NA,Not enough blood for isotopes.
PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186,3800,FEMALE,8.94956,-24.69454,NA
PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18,195,3250,FEMALE,8.36821,-25.33302,NA
PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,NA,NA,NA,NA,NA,NA,NA,Adult not sampled.
PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193,3450,FEMALE,8.76651,-25.32426,NA
PAL0708,6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A2,Yes,2007-11-16,39.3,20.6,190,3650,MALE,8.66496,-25.29805,NA
PAL0708,7,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N4A1,No,2007-11-15,38.9,17.8,181,3625,FEMALE,9.18718,-25.21799,Nest never observed with full clutch.
PAL0708,8,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N4A2,No,2007-11-15,39.2,19.6,195,4675,MALE,9.4606,-24.89958,Nest never observed with full clutch.
PAL0708,9,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N5A1,Yes,2007-11-09,34.1,18.1,193,3475,NA,NA,NA,No blood sample obtained.
PAL0708,10,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N5A2,Yes,2007-11-09,42,20.2,190,4250,NA,9.13362,-25.09368,No blood sample obtained for sexing.
PAL0708,11,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N6A1,Yes,2007-11-09,37.8,17.1,186,3300,NA,8.63243,-25.21315,No blood sample obtained for sexing.
PAL0708,12,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N6A2,Yes,2007-11-09,37.8,17.3,180,3700,NA,NA,NA,No blood sample obtained.
PAL0708,13,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N7A1,Yes,2007-11-15,41.1,17.6,182,3200,FEMALE,NA,NA,Not enough blood for isotopes.
PAL0708,14,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N7A2,Yes,2007-11-15,38.6,21.2,191,3800,MALE,NA,NA,Not enough blood for isotopes.
PAL0708,15,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N8A1,Yes,2007-11-16,34.6,21.1,198,4400,MALE,8.55583,-25.22588,NA
PAL0708,16,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N8A2,Yes,2007-11-16,36.6,17.8,185,3700,FEMALE,NA,NA,Not enough blood for isotopes.
PAL0708,17,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N9A1,Yes,2007-11-12,38.7,19,195,3450,FEMALE,9.18528,-25.06691,NA

The Working Directory

Pointing to files inside your project.

getwd()
#> E:/r-course
list.files()
#> [1] "data-raw" "README.md" "LICENSE.md" "r-course.Rproj"
list.files("data-raw")
#> [1] "palmerpenguins-raw.csv"
list.files("./data-raw")
#> [1] "palmerpenguins-raw.csv"
file <- "./data-raw/palmerpenguins-raw.csv"

🚨 Never use setwd()! 🚨

Click here to learn why.

The here Package

here is a package that helps you use relative paths in your R projects.

It makes file management much easier and OS-independent.

install.packages("here")
library(here)
here()
#> E:/r-course
here("data-raw", "palmerpenguins-raw.csv")
#> E:/r-course/data-raw/palmerpenguins-raw.csv

The readr Package

readr provides a fast and easy way to read rectangular data from delimited files, such as Comma-Separated Values (CSV) and Tab-Separated Values (TSV).

install.packages("readr")
library(readr)
?read_csv
?read_csv2
?read_delim

Importing the Raw Data

library(dplyr)
library(here)
library(readr)
data <-
  here("data-raw", "palmerpenguins-raw.csv") |>
  read_csv(col_types = cols(.default = "c"))
data |> glimpse()
#> Rows: 344
#> Columns: 17
#> $ studyName             <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
#> $ `Sample Number`       <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10…
#> $ Species               <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
#> $ Region                <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
#> $ Island                <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
#> $ Stage                 <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
#> $ `Individual ID`       <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", …
#> $ `Clutch Completion`   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", …
#> $ `Date Egg`            <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-…
#> $ `Culmen Length (mm)`  <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.…
#> $ `Culmen Depth (mm)`   <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8"…
#> $ `Flipper Length (mm)` <chr> "181", "186", "195", NA, "193", "190", "181", "1…
#> $ `Body Mass (g)`       <chr> "3750", "3800", "3250", NA, "3450", "3650", "362…
#> $ Sex                   <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"…
#> $ `Delta 15 N (o/oo)`   <chr> NA, "8.94956", "8.36821", NA, "8.76651", "8.6649…
#> $ `Delta 13 C (o/oo)`   <chr> NA, "-24.69454", "-25.33302", NA, "-25.32426", "…
#> $ Comments              <chr> "Not enough blood for isotopes.", NA, NA, "Adult…

Inspecting the Raw Data

data |> View()

The janitor Package

janitor provides simple functions for examining and cleaning dirty data.

install.packages("janitor")

clean_names()

The clean_names() function transforms data frame column names into machine-readable formats.

This makes it easier to work with the data.

data |> names()
#>  [1] "studyName"           "Sample Number"       "Species"            
#>  [4] "Region"              "Island"              "Stage"              
#>  [7] "Individual ID"       "Clutch Completion"   "Date Egg"           
#> [10] "Culmen Length (mm)"  "Culmen Depth (mm)"   "Flipper Length (mm)"
#> [13] "Body Mass (g)"       "Sex"                 "Delta 15 N (o/oo)"  
#> [16] "Delta 13 C (o/oo)"   "Comments"
library(janitor)

data <- data |> clean_names()
data |> names()
#>  [1] "study_name"        "sample_number"     "species"          
#>  [4] "region"            "island"            "stage"            
#>  [7] "individual_id"     "clutch_completion" "date_egg"         
#> [10] "culmen_length_mm"  "culmen_depth_mm"   "flipper_length_mm"
#> [13] "body_mass_g"       "sex"               "delta_15_n_o_oo"  
#> [16] "delta_13_c_o_oo"   "comments"

Data Masking

Most tidyverse functions use data masking, allowing you to refer to data frame columns directly by their names (e.g., my_variable) instead of using more verbose syntax like data[["my_variable"]].

This is part of a concept called tidy evaluation.

library(dplyr)
some_data <- tibble(a = c(1, 2, 3), b = c(4, 5, 6))

With Data Masking

some_data |> pull(a) # No quotes
#> [1] 1 2 3

Without Data Masking

some_data[["a"]]
#> [1] 1 2 3

The tidyr Package

tidyr provides a set of functions that help you to tidy your data.

install.packages("tidyr")
library(dplyr)
library(tidyr)
library(stringr)
data <-
  data |>
  separate_wider_delim(
    col = stage,
    delim = ", ",
    names = c("life_stage", "egg_stage"),
  )
data |> glimpse()
#> Rows: 344
#> Columns: 18
#> $ study_name        <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
#> $ sample_number     <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "…
#> $ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ region            <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
#> $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ life_stage        <chr> "Adult", "Adult", "Adult", "Adult", "Adult", "Adult"…
#> $ egg_stage         <chr> "1 Egg Stage", "1 Egg Stage", "1 Egg Stage", "1 Egg …
#> $ individual_id     <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
#> $ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
#> $ date_egg          <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ culmen_length_mm  <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ culmen_depth_mm   <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g       <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
#> $ delta_15_n_o_oo   <chr> NA, "8.94956", "8.36821", NA, "8.76651", "8.66496", …
#> $ delta_13_c_o_oo   <chr> NA, "-24.69454", "-25.33302", NA, "-25.32426", "-25.…
#> $ comments          <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…

The dplyr Package

dplyr is one the most important packages in the Tidyverse. It provides a grammar for data manipulation.

install.packages("dplyr")
  • mutate(): Create, modify, and delete columns.
  • transmute(): Create a new data frame containing only the specified computations.
  • select(): Keep or drop columns using their names and types.
  • slice(): Subset rows using their positions (see the sketch below).
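
For example, a quick sketch of slice() with a small hypothetical tibble:

library(dplyr)

d <- tibble(x = 1:5, y = letters[1:5])

d |> slice(2:3) # Extract rows 2 and 3
#> # A tibble: 2 × 2
#>       x y    
#>   <int> <chr>
#> 1     2 b    
#> 2     3 c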

select()

The select() function is used to select columns from a data frame.

In our case, we are not interested in using all the variables in the raw data. We will select only the variables we need.

library(dplyr)
data <-
  data |>
  select(
    date_egg, species, island, culmen_length_mm, culmen_depth_mm,
    flipper_length_mm, body_mass_g, sex
  )
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ date_egg          <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ culmen_length_mm  <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ culmen_depth_mm   <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g       <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…

rename()

rename()

Let’s rename culmen to bill for clarity. Likewise, we’ll change date_egg to year; we’ll extract the year value in a later step.

library(dplyr)
data <-
  data |>
  rename(
    bill_length_mm = culmen_length_mm,
    bill_depth_mm = culmen_depth_mm,
    year = date_egg
  )
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ year              <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm    <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ bill_depth_mm     <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g       <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…

mutate()

mutate()

The mutate() function is used to create new columns or modify existing columns in a data frame.

Let’s assign proper classes to the variables. We also need to transform the year column, extracting just the year value.

library(dplyr)
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ year              <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ species           <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm    <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ bill_depth_mm     <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g       <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex               <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…

mutate()

For categorical variables, it is helpful to inspect their unique values.

If the vector contains a limited number of unique values, it is likely a good candidate to be converted into a factor for better data handling and analysis.

data |> pull(species) |> unique()
#> [1] "Adelie Penguin (Pygoscelis adeliae)"      
#> [2] "Gentoo penguin (Pygoscelis papua)"        
#> [3] "Chinstrap penguin (Pygoscelis antarctica)"
data |> pull(island) |> unique()
#> [1] "Torgersen" "Biscoe"    "Dream"
data |> pull(sex) |> unique()
#> [1] "MALE"   "FEMALE" NA

mutate()

library(dplyr)
library(lubridate)
library(stringr)
data <-
  data |>
  mutate(
    year =
      year |>
      as.Date() |>
      year() |>
      as.integer(),
    species =
      case_match(
        species,
        "Adelie Penguin (Pygoscelis adeliae)" ~ "Adelie",
        "Chinstrap penguin (Pygoscelis antarctica)" ~ "Chinstrap",
        "Gentoo penguin (Pygoscelis papua)" ~ "Gentoo"
      ) |>
      as.factor(),
    island = as.factor(island),
    bill_length_mm = bill_length_mm |> as.numeric(),
    bill_depth_mm = bill_depth_mm |> as.numeric(),
    flipper_length_mm = flipper_length_mm |> as.integer(),
    body_mass_g = body_mass_g |> as.integer(),
    sex =
      sex |>
      str_to_lower() |>
      as.factor()
  )

mutate()

library(dplyr)
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…

relocate()

relocate()

Let’s organize our columns in a more logical order.

The year column is best placed next to the sex column.

library(dplyr)
data <-
  data |>
  relocate(year, .after = sex)
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Inspecting the Valid Data

data |> View()

Comparing the Valid Data

library(palmerpenguins)
identical(data, penguins)
#> [1] TRUE
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
penguins |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Exporting the Data

Always save your data in non-binary formats like csv to ensure that it can be read by other software.

rds stands for R Data Serialization. It is a binary data format that allows you to load an object back into R with all of its attributes preserved.

library(here)
library(readr)
if (!dir.exists(here("data"))) dir.create(here("data"))
data |> write_csv(here("data", "palmerpenguins-valid.csv"))
data |> write_rds(here("data", "palmerpenguins-valid.rds"))
test <- here("data", "palmerpenguins-valid.rds") |> read_rds()

Tip: Store your data in research repositories like The Open Science Framework (See the osfr package). If you are working with sensitive or human data, ensure it is encrypted before storing it in the cloud (See the lockr package).

Other Tools

You’ve learned the basic grammar for data manipulation using Tidyverse, but many important tools remain to be covered. The following slides introduce some of them.

I strongly encourage you to take the time to explore the documentation website of each Tidyverse package to discover more functions that can help you in your data analysis workflow. I guarantee your future self will thank you!

Now that we’ve reconstructed the penguins dataset from the palmerpenguins package, we’ll load the package and use the built-in dataset for the following examples. There’s no need to import the valid data into R.

filter()

filter()

The filter() function is used to filter rows from a data frame based on specified conditions.

library(dplyr)
library(palmerpenguins)
penguins |>
  filter(
    species == "Adelie",
    island %in% c("Dream", "Torgersen"),
    body_mass_g > 4000,
    bill_length_mm < 40
  )
#> # A tibble: 10 × 8
#>    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torgersen           39.2          19.6               195        4675
#>  2 Adelie  Torgersen           34.6          21.1               198        4400
#>  3 Adelie  Dream               39.2          21.1               196        4150
#>  4 Adelie  Dream               39.8          19.1               184        4650
#>  5 Adelie  Dream               39.6          18.8               190        4600
#>  6 Adelie  Torgersen           35.1          19.4               193        4200
#>  7 Adelie  Dream               39.6          18.1               186        4450
#>  8 Adelie  Dream               37.5          18.5               199        4475
#>  9 Adelie  Dream               39.7          17.9               193        4250
#> 10 Adelie  Dream               39.2          18.6               190        4250
#> # ℹ 2 more variables: sex <fct>, year <int>

Exercise (filter)

The dplyr package provides a dataset called starwars that contains information about characters from the Star Wars universe. Filter the starwars dataset to include only characters who meet the following criteria:

  1. Belong to the species Droid
  2. Have a height greater than 100 cm
  3. Have a mass less than 150 kg

Exercise (filter)

library(dplyr)
starwars |> glimpse()
#> Rows: 87
#> Columns: 14
#> $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
#> $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
#> $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
#> $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
#> $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
#> $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
#> $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
#> $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
#> $ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
#> $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
#> $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

Exercise (filter)

Solution:

Code
library(dplyr)

starwars |>
  filter(
    species == "Droid",
    height > 100,
    mass < 150
  )
#> # A tibble: 2 × 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender   
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>    
#> 1 C-3PO    167    75 <NA>       gold       yellow           112 none  masculine
#> 2 IG-88    200   140 none       metal      red               15 none  masculine
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Joins

Data analysis typically requires working with multiple data frames that need to be combined to answer your research questions.

Joins

These combinations are called joins. There are different types of joins depending on how you want to combine the data frames.

Joins

A left join retains all rows from x. Each row in x appears in the output, matching with y where possible or with NA values when no match exists.

Joins

library(dplyr)
x <- tibble(key = 1:3, val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y3"))
x
#> # A tibble: 3 × 2
#>     key val_x
#>   <int> <chr>
#> 1     1 x1   
#> 2     2 x2   
#> 3     3 x3
y
#> # A tibble: 3 × 2
#>     key val_y
#>   <dbl> <chr>
#> 1     1 y1   
#> 2     2 y2   
#> 3     4 y3
x |> left_join(y, by = join_by(key))
#> # A tibble: 3 × 3
#>     key val_x val_y
#>   <dbl> <chr> <chr>
#> 1     1 x1    y1   
#> 2     2 x2    y2   
#> 3     3 x3    <NA>
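
dplyr also provides other mutating joins (inner_join(), right_join(), full_join()) and filtering joins (semi_join(), anti_join()). Here is a quick sketch using the same x and y:

library(dplyr)
# Inner join: keep only rows whose key appears in both tables
x |> inner_join(y, by = join_by(key))
# Full join: keep all rows from both tables
x |> full_join(y, by = join_by(key))
# Anti join: keep rows of x that have no match in y
x |> anti_join(y, by = join_by(key))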

Exercise (Joins)

Perform a join to combine band_members with band_instruments2. These two datasets can be found in the dplyr package.

Use left_join() and join_by() for this exercise.

library(dplyr)
band_members
#> # A tibble: 3 × 2
#>   name  band   
#>   <chr> <chr>  
#> 1 Mick  Stones 
#> 2 John  Beatles
#> 3 Paul  Beatles
band_instruments2
#> # A tibble: 3 × 2
#>   artist plays 
#>   <chr>  <chr> 
#> 1 John   guitar
#> 2 Paul   bass  
#> 3 Keith  guitar

Exercise (Joins)

Solution:

Code
library(dplyr)

band_members |>
  left_join(
    band_instruments2,
    join_by(name == artist)
  )
#> # A tibble: 3 × 3
#>   name  band    plays 
#>   <chr> <chr>   <chr> 
#> 1 Mick  Stones  <NA>  
#> 2 John  Beatles guitar
#> 3 Paul  Beatles bass

Pivot

One of the most common issues is data stored in wide format where column names actually represent values of a variable.

library(tidyr)
table4b
#> # A tibble: 3 × 3
#>   country         `1999`     `2000`
#>   <chr>            <dbl>      <dbl>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

Pivot

tidyr provides two functions to deal with this kind of issue: pivot_longer() and pivot_wider().

library(tidyr)
relig_income
#> # A tibble: 18 × 11
#>    religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
#>    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
#>  1 Agnostic      27        34        60        81        76       137        122
#>  2 Atheist       12        27        37        52        35        70         73
#>  3 Buddhist      27        21        30        34        33        58         62
#>  4 Catholic     418       617       732       670       638      1116        949
#>  5 Don’t k…      15        14        15        11        10        35         21
#>  6 Evangel…     575       869      1064       982       881      1486        949
#>  7 Hindu          1         9         7         9        11        34         47
#>  8 Histori…     228       244       236       238       197       223        131
#>  9 Jehovah…      20        27        24        24        21        30         15
#> 10 Jewish        19        19        25        25        30        95         69
#> 11 Mainlin…     289       495       619       655       651      1107        939
#> 12 Mormon        29        40        48        51        56       112         85
#> 13 Muslim         6         7         9        10         9        23         16
#> 14 Orthodox      13        17        23        32        32        47         38
#> 15 Other C…       9         7        11        13        13        14         18
#> 16 Other F…      20        33        40        46        49        63         46
#> 17 Other W…       5         2         3         4         2         7          3
#> 18 Unaffil…     217       299       374       365       341       528        407
#> # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#> #   `Don't know/refused` <dbl>
relig_income |>
  pivot_longer(
    cols = !religion,
    names_to = "income",
    values_to = "count"
  )
#> # A tibble: 180 × 3
#>    religion income             count
#>    <chr>    <chr>              <dbl>
#>  1 Agnostic <$10k                 27
#>  2 Agnostic $10-20k               34
#>  3 Agnostic $20-30k               60
#>  4 Agnostic $30-40k               81
#>  5 Agnostic $40-50k               76
#>  6 Agnostic $50-75k              137
#>  7 Agnostic $75-100k             122
#>  8 Agnostic $100-150k            109
#>  9 Agnostic >150k                 84
#> 10 Agnostic Don't know/refused    96
#> # ℹ 170 more rows
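
The pivot_wider() function performs the inverse transformation: it spreads a pair of name/value columns back into multiple columns. As a quick sketch, we can lengthen table4b and then widen it back to its original shape:

library(tidyr)
table4b |>
  pivot_longer(
    cols = !country,
    names_to = "year",
    values_to = "population"
  ) |>
  pivot_wider(
    names_from = year,
    values_from = population
  )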

Exercise (Pivot)

Perform a pivot to transform the table4a dataset, found in the tidyr package, from wide to long format using the pivot_longer() function.

The table4a dataset displays the number of tuberculosis cases documented by the World Health Organization (WHO) in Afghanistan, Brazil, and China between 1999 and 2000.

library(tidyr)
table4a
#> # A tibble: 3 × 3
#>   country     `1999` `2000`
#>   <chr>        <dbl>  <dbl>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

Exercise (Pivot)

Solution:

Code
library(tidyr)

table4a |>
  pivot_longer(
    cols = !country,
    names_to = "year",
    values_to = "cases"
  )
#> # A tibble: 6 × 3
#>   country     year   cases
#>   <chr>       <chr>  <dbl>
#> 1 Afghanistan 1999     745
#> 2 Afghanistan 2000    2666
#> 3 Brazil      1999   37737
#> 4 Brazil      2000   80488
#> 5 China       1999  212258
#> 6 China       2000  213766

How to Learn More

Module 4
Exploratory Data Analysis

glimpse()

library(dplyr)
library(palmerpenguins)
penguins |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex               <fct> male, female, female, NA, female, male, female, male…
#> $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

head() & tail()

library(dplyr)
library(palmerpenguins)
penguins |> head(3)
#> # A tibble: 3 × 8
#>   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#> 1 Adelie  Torgersen           39.1          18.7               181        3750
#> 2 Adelie  Torgersen           39.5          17.4               186        3800
#> 3 Adelie  Torgersen           40.3          18                 195        3250
#> # ℹ 2 more variables: sex <fct>, year <int>
penguins |> tail(3)
#> # A tibble: 3 × 8
#>   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
#> 1 Chinstrap Dream            49.6          18.2               193        3775
#> 2 Chinstrap Dream            50.8          19                 210        4100
#> 3 Chinstrap Dream            50.2          18.7               198        3775
#> # ℹ 2 more variables: sex <fct>, year <int>
penguins |>
  pull(bill_length_mm) |>
  head(3)
#> [1] 39.1 39.5 40.3

slice_sample()

library(dplyr)
library(palmerpenguins)
penguins |> slice_sample(n = 15)
#> # A tibble: 15 × 8
#>    species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
#>  1 Gentoo    Biscoe             46.5          14.4               217        4900
#>  2 Adelie    Dream              40.8          18.4               195        3900
#>  3 Gentoo    Biscoe             50.5          15.9               225        5400
#>  4 Chinstrap Dream              50.8          19                 210        4100
#>  5 Chinstrap Dream              51.3          18.2               197        3750
#>  6 Chinstrap Dream              45.4          18.7               188        3525
#>  7 Gentoo    Biscoe             49.8          15.9               229        5950
#>  8 Adelie    Dream              40.7          17                 190        3725
#>  9 Chinstrap Dream              50.9          17.9               196        3675
#> 10 Gentoo    Biscoe             47.8          15                 215        5650
#> 11 Chinstrap Dream              43.5          18.1               202        3400
#> 12 Adelie    Biscoe             38.6          17.2               199        3750
#> 13 Adelie    Dream              37.3          17.8               191        3350
#> 14 Adelie    Torgers…           39            17.1               191        3050
#> 15 Adelie    Dream              42.3          21.2               191        4150
#> # ℹ 2 more variables: sex <fct>, year <int>

unique()

library(dplyr)
library(palmerpenguins)
penguins |>
  pull(species) |>
  unique()
#> [1] Adelie    Gentoo    Chinstrap
#> Levels: Adelie Chinstrap Gentoo
penguins |>
  pull(island) |>
  unique()
#> [1] Torgersen Biscoe    Dream    
#> Levels: Biscoe Dream Torgersen
penguins |>
  pull(sex) |>
  unique()
#> [1] male   female <NA>  
#> Levels: female male

count()

library(dplyr)
library(palmerpenguins)
penguins |>
  count(species, island, .drop = FALSE)
#> # A tibble: 9 × 3
#>   species   island        n
#>   <fct>     <fct>     <int>
#> 1 Adelie    Biscoe       44
#> 2 Adelie    Dream        56
#> 3 Adelie    Torgersen    52
#> 4 Chinstrap Biscoe        0
#> 5 Chinstrap Dream        68
#> 6 Chinstrap Torgersen     0
#> 7 Gentoo    Biscoe      124
#> 8 Gentoo    Dream         0
#> 9 Gentoo    Torgersen     0

count()

library(dplyr)
library(magrittr)
library(palmerpenguins)
penguins |>
  count(species, island, .drop = FALSE) |>
  mutate(
    n_cum = cumsum(n),
    n_per = n |>
      divide_by(sum(n)) |>
      multiply_by(100) |>
      round(3),
    n_per_cum = n_per |> cumsum()
  )
#> # A tibble: 9 × 6
#>   species   island        n n_cum n_per n_per_cum
#>   <fct>     <fct>     <int> <int> <dbl>     <dbl>
#> 1 Adelie    Biscoe       44    44  12.8      12.8
#> 2 Adelie    Dream        56   100  16.3      29.1
#> 3 Adelie    Torgersen    52   152  15.1      44.2
#> 4 Chinstrap Biscoe        0   152   0        44.2
#> 5 Chinstrap Dream        68   220  19.8      64.0
#> 6 Chinstrap Torgersen     0   220   0        64.0
#> 7 Gentoo    Biscoe      124   344  36.0     100  
#> 8 Gentoo    Dream         0   344   0       100  
#> 9 Gentoo    Torgersen     0   344   0       100

Statistics

The R packages that come with a typical R installation provide a set of basic statistical functions.

library(dplyr)
library(palmerpenguins)
bill_length_mm <- penguins |> pull(bill_length_mm)
bill_length_mm |> mean(na.rm = TRUE)
#> [1] 43.92193
bill_length_mm |> var(na.rm = TRUE)
#> [1] 29.80705
bill_length_mm |> sd(na.rm = TRUE)
#> [1] 5.459584
bill_length_mm |> min(na.rm = TRUE)
#> [1] 32.1
bill_length_mm |> max(na.rm = TRUE)
#> [1] 59.6
bill_length_mm |> range(na.rm = TRUE)
#> [1] 32.1 59.6

Statistics

library(dplyr)
library(palmerpenguins)
bill_length_mm <- penguins |> pull(bill_length_mm)
bill_length_mm |> quantile(na.rm = TRUE)
#>     0%    25%    50%    75%   100% 
#> 32.100 39.225 44.450 48.500 59.600
bill_length_mm |> quantile(prob = 0.25, na.rm = TRUE)
#>    25% 
#> 39.225
bill_length_mm |> median(na.rm = TRUE)
#> [1] 44.45
bill_length_mm |> quantile(prob = 0.75, na.rm = TRUE)
#>  75% 
#> 48.5
bill_length_mm |> IQR(na.rm = TRUE)
#> [1] 9.275

Statistics

For skewness and kurtosis, use the moments package.

library(dplyr)
library(moments)
library(palmerpenguins)
bill_length_mm <- penguins |> pull(bill_length_mm)
bill_depth_mm <- penguins |> pull(bill_depth_mm)
bill_length_mm |> skewness(na.rm = TRUE)
#> [1] 0.05288481
bill_length_mm |> kurtosis(na.rm = TRUE)
#> [1] 2.119235
bill_length_mm |> sum(na.rm = TRUE)
#> [1] 15021.3
bill_length_mm |> prod(na.rm = TRUE)
#> [1] Inf
cor(bill_length_mm, bill_depth_mm, use = "complete.obs")
#> [1] -0.2350529

summarize()

library(dplyr)
library(palmerpenguins)
penguins |>
  summarize(
    n = n(),
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
    mean_bill_length_mm = mean(bill_length_mm, na.rm = TRUE),
    mean_bill_depth_mm = mean(bill_depth_mm, na.rm = TRUE),
    mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE),
    .by = c("species", "sex")
  )
#> # A tibble: 8 × 7
#>   species   sex        n mean_body_mass_g mean_bill_length_mm mean_bill_depth_mm
#>   <fct>     <fct>  <int>            <dbl>               <dbl>              <dbl>
#> 1 Adelie    male      73            4043.                40.4               19.1
#> 2 Adelie    female    73            3369.                37.3               17.6
#> 3 Adelie    <NA>       6            3540                 37.8               18.3
#> 4 Gentoo    female    58            4680.                45.6               14.2
#> 5 Gentoo    male      61            5485.                49.5               15.7
#> 6 Gentoo    <NA>       5            4588.                45.6               14.6
#> 7 Chinstrap female    34            3527.                46.6               17.6
#> 8 Chinstrap male      34            3939.                51.1               19.3
#> # ℹ 1 more variable: mean_flipper_length_mm <dbl>

Exercise (summarize)

Create a comprehensive species profile for the Palmer Penguins dataset. For each penguin species, calculate:

  • Number of individuals observed
  • Number of islands they inhabit
  • Mean body mass (in grams)
  • Mean bill length (in mm)
  • Mean flipper length (in mm)

All of this information should be generated in a single, elegant operation.

penguins |>
  summarize(
    #> [Your code goes here!]
  )

Exercise (summarize)

Use n() to count the number of observations.

library(dplyr)
letters
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"
tibble(a = letters) |>
  summarize(n = n())
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    26

Use n_distinct() to count the number of unique values.

library(dplyr)
c("a", "b", "b", "a") |> n_distinct()
#> [1] 2

Exercise (summarize)

Solution:

Code
library(dplyr)
library(palmerpenguins)

penguins |>
  summarize(
    n = n(),
    n_islands = n_distinct(island),
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
    mean_bill_length_mm = mean(bill_length_mm, na.rm = TRUE),
    mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE),
    .by = "species"
  )
#> # A tibble: 3 × 6
#>   species       n n_islands mean_body_mass_g mean_bill_length_mm
#>   <fct>     <int>     <int>            <dbl>               <dbl>
#> 1 Adelie      152         3            3701.                38.8
#> 2 Gentoo      124         1            5076.                47.5
#> 3 Chinstrap    68         1            3733.                48.8
#> # ℹ 1 more variable: mean_flipper_length_mm <dbl>

The summarytools Package

The summarytools package makes it easy to create frequency and descriptive statistics tables for single variables with minimal code.

install.packages("gtsummary")
library(palmerpenguins)
library(summarytools)
penguins |> freq(var = species, round.digits = 3)
#> Frequencies  
#> penguins$species  
#> Type: Factor  
#> 
#>                   Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
#> --------------- ------ --------- -------------- --------- --------------
#>          Adelie    152    44.186         44.186    44.186         44.186
#>       Chinstrap     68    19.767         63.953    19.767         63.953
#>          Gentoo    124    36.047        100.000    36.047        100.000
#>            <NA>      0                              0.000        100.000
#>           Total    344   100.000        100.000   100.000        100.000
penguins |> descr(var = bill_depth_mm, round.digits = 3)
#> Descriptive Statistics  
#> penguins$bill_depth_mm  
#> N: 344  
#> 
#>                     bill_depth_mm
#> ----------------- ---------------
#>              Mean          17.151
#>           Std.Dev           1.975
#>               Min          13.100
#>                Q1          15.600
#>            Median          17.300
#>                Q3          18.700
#>               Max          21.500
#>               MAD           2.224
#>               IQR           3.100
#>                CV           0.115
#>          Skewness          -0.142
#>       SE.Skewness           0.132
#>          Kurtosis          -0.923
#>           N.Valid         342.000
#>                 N         344.000
#>         Pct.Valid          99.419

summarize() Approach

If you need more control over the descriptive statistics, then you will need to use summarize().

library(dplyr)
library(magrittr)
library(moments)
library(palmerpenguins)
library(tidyr)

# Summarizes a `penguins` column (given as a string in `x`) by species
stats_summary <- function(x) {
  penguins |>
  summarize(
    n = n(),
    n_valid = .data[[x]] |>
      magrittr::extract(!is.na(.data[[x]])) |>
      length(),
    n_valid_per = n_valid |> # == (n_valid / n) * 100
      divide_by(n) |>
      multiply_by(100),
    mean = .data[[x]] |> mean(na.rm = TRUE),
    var = .data[[x]] |> var(na.rm = TRUE),
    sd = .data[[x]] |> sd(na.rm = TRUE),
    min = .data[[x]] |> min(na.rm = TRUE),
    quartile_1 = .data[[x]] |> quantile(prob = 0.25, na.rm = TRUE),
    median = .data[[x]] |> median(na.rm = TRUE),
    quartile_3 = .data[[x]] |> quantile(prob = 0.75, na.rm = TRUE),
    max = .data[[x]] |> max(na.rm = TRUE),
    iqr = .data[[x]] |> IQR(na.rm = TRUE),
    range = .data[[x]] |> range(na.rm = TRUE) |> diff(),
    skewness = .data[[x]] |> skewness(na.rm = TRUE),
    kurtosis = .data[[x]] |> kurtosis(na.rm = TRUE),
    .by = species
  ) |>
  pivot_longer(
    cols = -species,
    names_to = "statistic",
    values_to = "value"
  ) |>
  pivot_wider(
    names_from = species,
    values_from = value
  )
}
stats_summary("bill_length_mm")
#> # A tibble: 15 × 4
#>    statistic    Adelie  Gentoo Chinstrap
#>    <chr>         <dbl>   <dbl>     <dbl>
#>  1 n           152     124       68     
#>  2 n_valid     151     123       68     
#>  3 n_valid_per  99.3    99.2    100     
#>  4 mean         38.8    47.5     48.8   
#>  5 var           7.09    9.50    11.2   
#>  6 sd            2.66    3.08     3.34  
#>  7 min          32.1    40.9     40.9   
#>  8 quartile_1   36.8    45.3     46.3   
#>  9 median       38.8    47.3     49.6   
#> 10 quartile_3   40.8    49.6     51.1   
#> 11 max          46      59.6     58     
#> 12 iqr           4       4.25     4.73  
#> 13 range        13.9    18.7     17.1   
#> 14 skewness      0.160   0.643   -0.0886
#> 15 kurtosis      2.81    4.20     2.95

The Grammar of Tables

The gt package is the most powerful package for creating and customizing tables in R.

It provides a flexible grammar for table creation and styling.

The gt Package

library(dplyr)
library(gt)
library(palmerpenguins)
library(stringr)
penguins_summary <-
  penguins |>
  filter(!is.na(sex)) |>
  summarize(
    n = n(),
    mean_body_mass_g = mean(body_mass_g, na.rm = TRUE),
    mean_bill_length_mm = mean(bill_length_mm, na.rm = TRUE),
    mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE),
    .by = c("species", "sex")
  ) |>
  mutate(sex = sex |> str_to_title()) |>
  arrange(species, sex)

The gt Package

penguins_summary
#> # A tibble: 6 × 6
#>   species   sex        n mean_body_mass_g mean_bill_length_mm
#>   <fct>     <chr>  <int>            <dbl>               <dbl>
#> 1 Adelie    Female    73            3369.                37.3
#> 2 Adelie    Male      73            4043.                40.4
#> 3 Chinstrap Female    34            3527.                46.6
#> 4 Chinstrap Male      34            3939.                51.1
#> 5 Gentoo    Female    58            4680.                45.6
#> 6 Gentoo    Male      61            5485.                49.5
#> # ℹ 1 more variable: mean_flipper_length_mm <dbl>

It’s All About Layers

Code
penguin_table <-
  penguins_summary |>
  gt(groupname_col = "species") |>
  cols_label(
    sex = "Sex",
    n = "N",
    mean_body_mass_g = "Body Mass (g)",
    mean_bill_length_mm = "Bill Length (mm)",
    mean_flipper_length_mm = "Flipper Length (mm)"
  )

penguin_table
Sex N Body Mass (g) Bill Length (mm) Flipper Length (mm)
Adelie
Female 73 3368.836 37.25753 187.7945
Male 73 4043.493 40.39041 192.4110
Chinstrap
Female 34 3527.206 46.57353 191.7353
Male 34 3938.971 51.09412 199.9118
Gentoo
Female 58 4679.741 45.56379 212.7069
Male 61 5484.836 49.47377 221.5410

It’s All About Layers

Code
penguin_table <-
  penguin_table |>
  tab_header(
    title = md("**Antarctic Penguin Characteristics**"),
    subtitle = "Comparison between species and sexes"
  )

penguin_table
Antarctic Penguin Characteristics
Comparison between species and sexes
Sex N Body Mass (g) Bill Length (mm) Flipper Length (mm)
Adelie
Female 73 3368.836 37.25753 187.7945
Male 73 4043.493 40.39041 192.4110
Chinstrap
Female 34 3527.206 46.57353 191.7353
Male 34 3938.971 51.09412 199.9118
Gentoo
Female 58 4679.741 45.56379 212.7069
Male 61 5484.836 49.47377 221.5410

It’s All About Layers

Code
penguin_table <-
  penguin_table |>
  fmt_number(
    columns = starts_with("mean"),
    decimals = 2
  )

penguin_table
Antarctic Penguin Characteristics
Comparison between species and sexes
Sex N Body Mass (g) Bill Length (mm) Flipper Length (mm)
Adelie
Female 73 3,368.84 37.26 187.79
Male 73 4,043.49 40.39 192.41
Chinstrap
Female 34 3,527.21 46.57 191.74
Male 34 3,938.97 51.09 199.91
Gentoo
Female 58 4,679.74 45.56 212.71
Male 61 5,484.84 49.47 221.54

It’s All About Layers

Code
penguin_table <-
  penguin_table |>
  data_color(
    columns = mean_body_mass_g,
    palette = "Blues"
  )

penguin_table
Antarctic Penguin Characteristics
Comparison between species and sexes
Sex N Body Mass (g) Bill Length (mm) Flipper Length (mm)
Adelie
Female 73 3,368.84 37.26 187.79
Male 73 4,043.49 40.39 192.41
Chinstrap
Female 34 3,527.21 46.57 191.74
Male 34 3,938.97 51.09 199.91
Gentoo
Female 58 4,679.74 45.56 212.71
Male 61 5,484.84 49.47 221.54

It’s All About Layers

Code
penguin_table <-
  penguin_table |>
  tab_source_note(
    source_note = "Source: Palmer Station LTER / palmerpenguins package."
  ) |>
  tab_footnote(
    footnote = "Averages calculated after removing missing values.",
    locations = cells_column_labels(columns = starts_with("mean"))
  )

penguin_table
Antarctic Penguin Characteristics
Comparison between species and sexes
Sex N Body Mass (g)1 Bill Length (mm)1 Flipper Length (mm)1
Adelie
Female 73 3,368.84 37.26 187.79
Male 73 4,043.49 40.39 192.41
Chinstrap
Female 34 3,527.21 46.57 191.74
Male 34 3,938.97 51.09 199.91
Gentoo
Female 58 4,679.74 45.56 212.71
Male 61 5,484.84 49.47 221.54
1 Averages calculated after removing missing values.
Source: Palmer Station LTER / palmerpenguins package.

It’s All About Layers

Code
penguin_table <-
  penguin_table |>
  tab_style(
    style = cell_fill(color = "#E8F4F8"),
    locations = cells_row_groups()
  ) |>
  tab_style(
    style = cell_text(size = px(18)),
    locations = cells_title(groups = "title")
  ) |>
  tab_options(
    table.border.top.style = "solid",
    table.border.bottom.style = "solid",
    heading.border.bottom.style = "solid",
    column_labels.border.bottom.style = "solid",
    row_group.border.top.style = "solid",
    row_group.border.bottom.style = "solid"
  )

penguin_table
Antarctic Penguin Characteristics
Comparison between species and sexes
Sex N Body Mass (g)1 Bill Length (mm)1 Flipper Length (mm)1
Adelie
Female 73 3,368.84 37.26 187.79
Male 73 4,043.49 40.39 192.41
Chinstrap
Female 34 3,527.21 46.57 191.74
Male 34 3,938.97 51.09 199.91
Gentoo
Female 58 4,679.74 45.56 212.71
Male 61 5,484.84 49.47 221.54
1 Averages calculated after removing missing values.
Source: Palmer Station LTER / palmerpenguins package.

Exercise (gt)

Using the starwars dataset from the dplyr package, create a summary table that displays the mean height and mass of characters grouped by their species.

Customize the table with appropriate titles, labels, and formatting to enhance its readability.

library(dplyr)
starwars |> glimpse()
#> Rows: 87
#> Columns: 14
#> $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
#> $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
#> $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
#> $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
#> $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
#> $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
#> $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
#> $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
#> $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
#> $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
#> $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
#> $ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
#> $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
#> $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

Exercise (gt)

Solution:

Code
library(dplyr)
library(gt)
library(tidyr)

starwars_summary <-
  starwars |>
  drop_na(species, height, mass) |>
  summarize(
    mean_height = mean(height, na.rm = TRUE),
    mean_mass = mean(mass, na.rm = TRUE),
    .by = "species"
  )

starwars_summary |>
  gt() |>
  cols_label(
    species = "Species",
    mean_height = "Mean Height (cm)",
    mean_mass = "Mean Mass (kg)"
  ) |>
  tab_header(
    title = md("**Star Wars Character Characteristics**"),
    subtitle = "Mean height and mass by species"
  ) |>
  fmt_number(
    columns = starts_with("mean"),
    decimals = 2
  ) |>
  tab_source_note(
    source_note = "Source: Star Wars API / dplyr package."
  )
Star Wars Character Characteristics
Mean height and mass by species
Species Mean Height (cm) Mean Mass (kg)
Human 180.25 81.31
Droid 140.00 69.75
Wookiee 231.00 124.00
Rodian 173.00 74.00
Hutt 175.00 1,358.00
Yoda's species 66.00 17.00
Trandoshan 190.00 113.00
Mon Calamari 180.00 83.00
Ewok 88.00 20.00
Sullustan 160.00 68.00
Neimodian 191.00 90.00
Gungan 210.00 74.00
Dug 112.00 40.00
Zabrak 175.00 80.00
Twi'lek 178.00 55.00
Aleena 79.00 15.00
Vulptereen 94.00 45.00
Toong 163.00 65.00
Cerean 198.00 82.00
Nautolan 196.00 87.00
Tholothian 184.00 50.00
Kel Dor 188.00 80.00
Geonosian 183.00 80.00
Mirialan 168.00 53.10
Clawdite 168.00 55.00
Besalisk 198.00 102.00
Kaminoan 229.00 88.00
Skakoan 193.00 48.00
Togruta 178.00 57.00
Kaleesh 216.00 159.00
Pau'an 206.00 80.00
Source: Star Wars API / dplyr package.

gt Extensions

The gt package also has several extensions that provide additional functionality for specific use cases.

gtsummary provides a simple way to create publication-ready summary tables for statistical models and data frames.

install.packages("gtsummary")

gtExtras provides additional themes, color scales, and utilities to enhance gt tables.

install.packages("gtExtras")

The Grammar of Graphics

The Grammar of Graphics, by Leland Wilkinson, set the foundation for thinking about data visualization.

We will base this part of the course on the principles found in this book.

The ggplot2 Package

The ggplot2 Package

The most powerful and flexible package for data visualization.

A Tidyverse package based on the principles of The Grammar of Graphics.

It’s All About Layers

Code
library(ggplot2)
library(palmerpenguins)

penguins |> ggplot()

Code
library(ggplot2)
library(palmerpenguins)

penguins |>
  ggplot(
    aes(
      x = body_mass_g,
      y = flipper_length_mm,
      color = species,
      shape = species
    )
  )

It’s All About Layers

Code
library(ggplot2)
library(palmerpenguins)

penguins |>
  ggplot(
    aes(
      x = body_mass_g,
      y = flipper_length_mm,
      color = species,
      shape = species
    )
  ) +
  geom_point(size = 4)

Code
library(ggplot2)
library(palmerpenguins)

penguins |>
  ggplot(
    aes(
      x = body_mass_g,
      y = flipper_length_mm,
      color = species,
      shape = species
    )
  ) +
  geom_point() +
  geom_smooth(
    method = "lm",
    se = FALSE
  ) +
  labs(
    x = "Body Mass (g)",
    y = "Flipper Length (mm)",
    color = "Species",
    shape = "Species"
  )

Bar Plots

Code
library(ggplot2)
library(palmerpenguins)

penguins |>
  ggplot(
    aes(
      x = island,
      fill = species
    )
  ) +
  geom_bar() +
  facet_wrap(
    facets = vars(species),
    ncol = 1
  ) +
  coord_flip() +
  labs(
    x = "Island",
    y = "Frequency"
  )

Exercise (Bar Plot)

Create a bar plot using the starwars dataset that shows the frequency of characters by their sex variable.

Apply the following customizations:

  • Drop any missing values from sex before creating the plot using the drop_na() function from the tidyr package.
  • Select a fill color from the colors() function. View available color tones here.
  • Use coord_flip() to display horizontal bars.
  • Add appropriate axis labels and a title to the plot.
  • Include a caption that credits the data source.

Exercise (Bar Plot)

Solution:

Code
library(dplyr)
library(forcats)
library(ggplot2)
library(stringr)
library(tidyr)

starwars |>
  drop_na(sex) |>
  mutate(
    sex = sex |>
      str_to_title() |>
      as.factor() |>
      fct_infreq()
  ) |>
  ggplot(aes(x = sex)) +
  geom_bar(fill = "dodgerblue4") +
  coord_flip() +
  labs(
    title = "Distribution of Star Wars Characters by Sex",
    x = "Sex",
    y = "Frequency",
    caption = "Source: Star Wars API / dplyr package"
  )

Histograms

Code
library(ggplot2)
library(palmerpenguins)

penguins |>
  ggplot(
    aes(
      x = flipper_length_mm,
      fill = species
    )
  ) +
  geom_histogram(
    alpha = 0.5,
    position = "identity"
  ) +
  labs(
    x = "Flipper Length (mm)",
    y = "Frequency",
    fill = "Species"
  )

Exercise (Histogram)

The ggplot2 package includes the diamonds dataset containing information about diamond quality and price. Using this dataset, create a histogram to visualize the distribution of the price variable.

Apply the following customizations:

  • Select a fill color from the colors() function. View available color tones here.
  • Add appropriate axis labels and a title to the plot.
  • Include a caption that credits the data source.

Exercise (Histogram)

library(ggplot2)
diamonds |> glimpse()
#> Rows: 53,940
#> Columns: 10
#> $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
#> $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
#> $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
#> $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
#> $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
#> $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
#> $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
#> $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
#> $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
#> $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

Exercise (Histogram)

Solution:

Code
library(ggplot2)
library(tidyr)

diamonds |>
  ggplot(aes(x = price)) +
  geom_histogram(
    fill = "goldenrod4",
    color = "white"
  ) +
  labs(
    title = "Distribution of Diamond Prices",
    x = "Price (USD)",
    y = "Frequency",
    caption = "Source: ggplot2 package"
  )

Boxplots

Code
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins |>
  drop_na(
    bill_length_mm,
    species
  ) |>
  ggplot(
    aes(
      x = species,
      y = bill_length_mm,
      fill = species
    )
  ) +
  geom_boxplot(
    outlier.color = "red"
  ) +
  geom_jitter(
   width = 0.2,
   alpha = 0.1
  ) +
  labs(
    x = "Species",
    y = "Bill Length (mm)",
    fill = "Species"
  )

Exercise (Boxplot)

The ggplot2 package includes the msleep dataset containing information about the sleep habits of various mammals. Using this dataset, create a boxplot to visualize the distribution of the sleep_total variable across different vore (dietary categories).

Apply the following customizations:

  • Drop any missing values for sleep_total and vore before creating the plot using the drop_na() function from the tidyr package.
  • Use different fill colors for each vore category by mapping the fill aesthetic to the vore variable.
  • Add appropriate axis labels and a title to the plot.
  • Include a caption that credits the data source.

Exercise (Boxplot)

library(ggplot2)
msleep |> glimpse()
#> Rows: 83
#> Columns: 11
#> $ name         <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Greater shor…
#> $ genus        <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bos", "Bra…
#> $ vore         <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi", "carn…
#> $ order        <chr> "Carnivora", "Primates", "Rodentia", "Soricomorpha", "Art…
#> $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", NA, "dome…
#> $ sleep_total  <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1, 3.0, 5…
#> $ sleep_rem    <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0.6, 0.8, …
#> $ sleep_cycle  <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.3833333, N…
#> $ awake        <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9, 21.0, 1…
#> $ brainwt      <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0.07000, 0…
#> $ bodywt       <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.490, 0.04…

Exercise (Boxplot)

Solution:

Code
library(ggplot2)
library(tidyr)

msleep |>
  drop_na(
    sleep_total,
    vore
  ) |>
  ggplot(
    aes(
      x = vore,
      y = sleep_total,
      fill = vore
    )
  ) +
  geom_boxplot() +
  labs(
    title = "Distribution of Total Sleep by Dietary Category",
    x = "Dietary Category (Vore)",
    y = "Total Sleep (hours)",
    caption = "Source: ggplot2 package"
  )

Scatter Plots

Code
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins |>
  ggplot(
    aes(
      x = body_mass_g,
      y = flipper_length_mm,
      color = species,
      shape = species
    )
  ) +
  geom_point() +
  geom_smooth(
    method = "lm",
    formula = y ~ x,
    se = FALSE
  ) +
  labs(
    x = "Body Mass (g)",
    y = "Flipper Length (mm)",
    color = "Species",
    shape = "Species"
  )

Exercise (Scatter Plot)

The ggplot2 package includes the mpg dataset containing information about various car models and their fuel efficiency. Using this dataset, create a scatter plot to visualize the relationship between displ (engine displacement) and hwy (highway miles per gallon).

Apply the following customizations:

  • Add a regression line using the geom_smooth() function.
  • Add appropriate axis labels and a title to the plot.
  • Include a caption that credits the data source.

Exercise (Scatter Plot)

library(ggplot2)
mpg |> glimpse()
#> Rows: 234
#> Columns: 11
#> $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
#> $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
#> $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
#> $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
#> $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
#> $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
#> $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
#> $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
#> $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
#> $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
#> $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Exercise (Scatter Plot)

Solution:

Code
library(ggplot2)

mpg |>
  ggplot(
    aes(
      x = displ,
      y = hwy
    )
  ) +
  geom_point() +
  geom_smooth() +
  labs(
    title = "Relationship between Engine Displacement and Highway MPG",
    x = "Engine Displacement (L)",
    y = "Highway Miles per Gallon",
    caption = "Source: ggplot2 package"
  )

Facets

Code
library(dplyr)
library(ggplot2)
library(magrittr)
library(palmerpenguins)
library(stringr)

penguins |>
  mutate(
    sex = sex |> str_to_title()
  ) |>
  ggplot(
    aes(
      x = body_mass_g,
      y = flipper_length_mm,
      color = sex
    )
  ) +
  geom_point() +
  geom_smooth(
    method = "lm",
    formula = y ~ x,
    se = FALSE,
    color = "black"
  ) +
  facet_wrap(vars(species)) +
  labs(
    x = "Body Mass (g)",
    y = "Flipper Length (mm)",
    color = "Sex"
  )

The patchwork Package

patchwork provides a simple and intuitive way to combine multiple ggplot2 plots into a single layout.

install.packages("patchwork")

Combined Plots

Code
library(ggplot2)
library(palmerpenguins)
library(patchwork)
library(tidyr)

plot_hist <-
  penguins |>
  ggplot(
    aes(
      x = flipper_length_mm,
      fill = species
    )
  ) +
  geom_histogram(
    alpha = 0.5,
    position = "identity"
  ) +
  labs(
    x = "Flipper Length (mm)",
    y = "Frequency"
  ) +
  theme(legend.position = "none")

plot_boxplot <-
  penguins |>
  ggplot(
    aes(
      x = species,
      y = bill_length_mm,
      fill = species
    )
  ) +
  geom_boxplot(outlier.color = "red") +
  geom_jitter(
    width = 0.2,
    alpha = 0.1
  ) +
  labs(
    x = "Species",
    y = "Bill Length (mm)",
    fill = "Species"
  ) +
  theme(
    axis.title.x = element_blank(),
    legend.position = "none"
  )

plot_scatter <-
  penguins |>
  drop_na(
    body_mass_g,
    flipper_length_mm,
    species
  ) |>
  ggplot(
    aes(
      x = body_mass_g,
      y = flipper_length_mm,
      color = species,
      shape = species
    )
  ) +
  geom_point() +
  geom_smooth(
    method = "lm",
    formula = y ~ x,
    se = FALSE
  ) +
  labs(
    x = "Body Mass (g)",
    y = "Flipper Length (mm)",
    color = "Species",
    shape = "Species"
  ) +
  theme(legend.position = "none")

(plot_hist + plot_boxplot) / plot_scatter + plot_annotation(tag_levels = "A")

ggplot2 Extensions

Like gt, the ggplot2 package has several extensions that provide additional functionality for specific use cases. You can check a list of them here.

ggwordcloud provides a way to create word clouds using ggplot2.

install.packages("ggwordcloud")

tidyplots provides a user-friendly code-based interface for creating customizable and insightful plots.

install.packages("tidyplots")

Correlation Matrices

Code
library(dplyr)
library(GGally)
library(ggplot2)
library(palmerpenguins)

penguins |>
  select(species, body_mass_g, ends_with("_mm")) |>
  ggpairs(aes(color = species))

Color Palettes

Color palettes can be sequential, diverging, or qualitative (discrete). Here are examples from two popular R packages for color palettes.

viridis: Colorblind-Friendly Color Maps

Code
library(ggplot2)
library(palmerpenguins)
library(tidyr)
library(viridis)

penguins |>
  drop_na(
    flipper_length_mm,
    species
  ) |>
  ggplot(
    aes(
      x = flipper_length_mm,
      fill = species
    )
  ) +
  geom_density(
    alpha = 0.5,
    position = "identity"
  ) +
  scale_fill_viridis(discrete = TRUE) +
  labs(
    x = "Flipper Length (mm)",
    y = "Density",
    fill = "Species"
  ) +
  theme(
    text = element_text(size = 20),
    legend.position = "none"
  )

RColorBrewer: Common Color Palettes

Code
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins |>
  drop_na(
    flipper_length_mm,
    species
  ) |>
  ggplot(
    aes(
      x = flipper_length_mm,
      fill = species
    )
  ) +
  geom_density(
    alpha = 0.5,
    position = "identity"
  ) +
  scale_fill_brewer(palette = "Set1") +
  labs(
    x = "Flipper Length (mm)",
    y = "Density",
    fill = "Species"
  ) +
  theme(
    text = element_text(size = 20),
    legend.position = "none"
  )

How to Choose the Right Vis.

Check out the amazing work by Yan Holtz. Visit From Data to Viz to see the diagram below and many others.

How to Learn More

Module 5
Modeling

The Tidymodels Framework

The Tidymodels framework is a collection of packages for modeling and machine learning using Tidyverse principles.

It is created by the same team that developed Tidyverse and is designed to work seamlessly with it.

Although it is a relative newcomer to the R ecosystem (2018), it has quickly gained popularity due to its simplicity and consistency.

Tidymodels Packages

Like Tidyverse, Tidymodels also has a meta-package that installs or loads the most important packages from the collection. A minimal sketch of how these pieces fit together appears after the list below.

install.packages("tidymodels")
library("tidymodels")
  • rsample: Provides infrastructure for efficient data splitting and resampling.
  • parsnip: A tidy, unified interface to models.
  • recipes: A tidy interface to data pre-processing tools for feature engineering.
  • workflows: A package to bundle the pre-processing, modeling, and post-processing together.
  • infer: A statistical grammar for inferential statistics.
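
Here is a minimal sketch of how these pieces fit together, predicting penguin body mass from flipper length (the model choice is illustrative; infer is covered separately below):

library(tidymodels)
library(tidyr)
library(palmerpenguins)

set.seed(2026)

# rsample: split the data into training and testing sets
penguins_split <-
  penguins |>
  drop_na() |>
  initial_split(prop = 0.8)

# recipes: declare the outcome and predictor
penguins_recipe <-
  recipe(body_mass_g ~ flipper_length_mm, data = training(penguins_split))

# parsnip: specify the model
lm_spec <- linear_reg() |> set_engine("lm")

# workflows: bundle pre-processing and model together, then fit
workflow() |>
  add_recipe(penguins_recipe) |>
  add_model(lm_spec) |>
  fit(data = training(penguins_split))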

Hypothesis Testing

A hypothesis is a statement about a population parameter.

The goal of a hypothesis test is to decide, based on a sample from the population, which of two complementary hypotheses is true.

The two complementary hypotheses in a hypothesis testing problem are called the null hypothesis and the alternative hypothesis. They are denoted by \(\text{H}_{0}\) and \(\text{H}_{1}\), respectively.

A hypothesis testing procedure or hypothesis test is a rule that specifies:

  1. For which sample values the decision is made to accept \(\text{H}_{0}\) as true.
  2. For which sample values \(\text{H}_{0}\) is rejected and \(\text{H}_{1}\) is accepted as true.

Power Analysis

Decision about \(\text{H}_{0}\)   \(\text{H}_{0}\) True   \(\text{H}_{0}\) False
Accept   Correct inference (True negative, \(1 - \alpha\))   Type II error (False negative, \(\beta\))
Reject   Type I error (False positive, \(\alpha\))   Correct inference (True positive, \(1 - \beta\))

Power Analysis

Code
library(pwrss)

pwr_analysis <- pwrss.t.2means(
  mu1 = 0.2, # Cohen's d for small effect sizes
  mu2 = 0,
  power = 0.8,
  alpha = 0.05,
  welch.df = TRUE,
  alternative = "greater"
)
#> +--------------------------------------------------+
#> |             SAMPLE SIZE CALCULATION              |
#> +--------------------------------------------------+
#> 
#> Welch's T-Test (Independent Samples)
#> 
#> ---------------------------------------------------
#> Hypotheses
#> ---------------------------------------------------
#>   H0 (Null Claim) : d - null.d <= 0 
#>   H1 (Alt. Claim) : d - null.d > 0 
#> 
#> ---------------------------------------------------
#> Results
#> ---------------------------------------------------
#>   Sample Size            = 310 and 310  <<
#>   Type 1 Error (alpha)   = 0.050
#>   Type 2 Error (beta)    = 0.200
#>   Statistical Power      = 0.8

power.t.test(
  ncp = pwr_analysis$ncp,
  df = pwr_analysis$df,
  alpha = pwr_analysis$parms$alpha,
  alternative = "one.sided",
  plot = TRUE,
  verbose = FALSE
)

Power Analysis

Code
library(pwrss)

pwr_analysis <- pwrss.t.2means(
  mu1 = 0.2, # Cohen's d for small effect sizes
  mu2 = 0,
  power = 0.3,
  alpha = 0.05,
  welch.df = TRUE,
  alternative = "greater",
)
#> +--------------------------------------------------+
#> |             SAMPLE SIZE CALCULATION              |
#> +--------------------------------------------------+
#> 
#> Welch's T-Test (Independent Samples)
#> 
#> ---------------------------------------------------
#> Hypotheses
#> ---------------------------------------------------
#>   H0 (Null Claim) : d - null.d <= 0 
#>   H1 (Alt. Claim) : d - null.d > 0 
#> 
#> ---------------------------------------------------
#> Results
#> ---------------------------------------------------
#>   Sample Size            = 64 and 64  <<
#>   Type 1 Error (alpha)   = 0.050
#>   Type 2 Error (beta)    = 0.698
#>   Statistical Power      = 0.302

power.t.test(
  ncp = pwr_analysis$ncp,
  df = pwr_analysis$df,
  alpha = pwr_analysis$parms$alpha,
  alternative = "one.sided",
  plot = TRUE,
  verbose = FALSE
)

Power Analysis

Code
library(pwrss)

pwr_analysis <- pwrss.t.2means(
  mu1 = 0.2, # Cohen's d for small effect sizes
  mu2 = 0,
  power = 0.999,
  alpha = 0.001,
  welch.df = TRUE,
  alternative = "greater",
)
#> +--------------------------------------------------+
#> |             SAMPLE SIZE CALCULATION              |
#> +--------------------------------------------------+
#> 
#> Welch's T-Test (Independent Samples)
#> 
#> ---------------------------------------------------
#> Hypotheses
#> ---------------------------------------------------
#>   H0 (Null Claim) : d - null.d <= 0 
#>   H1 (Alt. Claim) : d - null.d > 0 
#> 
#> ---------------------------------------------------
#> Results
#> ---------------------------------------------------
#>   Sample Size            = 1913 and 1913  <<
#>   Type 1 Error (alpha)   = 0.001
#>   Type 2 Error (beta)    = 0.001
#>   Statistical Power      = 0.999

power.t.test(
  ncp = pwr_analysis$ncp,
  df = pwr_analysis$df,
  alpha = pwr_analysis$parms$alpha,
  alternative = "one.sided",
  plot = TRUE,
  verbose = FALSE
)

Type I Errors (\(\alpha\))

Type II Errors (\(\beta\))

The p-value Problem

Large samples and sensitivity

Is a difference of \(0.00001\) valid?

Statistical ritual versus Statistical thinking

Comparison of a 95% confidence level (\(\alpha = 0.05\)) and an n-dependent p-value curve. The parameter \(n_{\alpha}\) represents the minimum sample size needed to detect statistically significant differences among the compared groups. The parameter \(n_{\gamma}\) represents the convergence point of the p-value curve. When the p-value curve expresses practical differences, the area under the red curve (\(A_{p(n)}\)) is smaller than the area under the constant function \(\alpha = 0.05\) (\(A_{\alpha = 0.05}\)) when evaluated between \(0\) and \(n_{\gamma}\).

Cohen’s Benchmark

[…] in many circumstances, all that is intended by “proving” the null hypothesis is that the ES [Effect Size] is not necessarily zero but small enough to be negligible
(Cohen, 1988, p. 461).

Cohen’s Benchmark

Test                              Relevant Effect Size               Small   Medium   Large
Comparison of independent means   \(d\), \(\Delta\), Hedges' \(g\)   0.20    0.50     0.80
Comparison of two correlations    \(q\)                              0.10    0.30     0.50
Difference between proportions    Cohen's \(g\)                      0.05    0.15     0.25
Correlation                       \(r\)                              0.10    0.30     0.50
                                  \(r^{2}\)                          0.01    0.09     0.25
Crosstabulation                   \(w\), \(\varphi\), \(V\), \(C\)   0.10    0.30     0.50
ANOVA                             \(f\)                              0.10    0.25     0.40
                                  \(\eta^{2}\)                       0.01    0.06     0.14
Multiple regression               \(R^{2}\)                          0.02    0.13     0.26
                                  \(f^{2}\)                          0.02    0.15     0.35

Notes: The rationale for most of these benchmarks can be found in Cohen (1988) at the following pages: Cohen's \(d\) (p. 40), \(q\) (p. 115), Cohen's \(g\) (pp. 147–149), \(r\) and \(r^{2}\) (pp. 79–80), Cohen's \(w\) (pp. 224–227), \(f\) and \(\eta^{2}\) (pp. 285–287), \(R^{2}\) and \(f^{2}\) (pp. 413–414).

The p-value Problem: Example

\(\Delta\text{R}^{2}\) = 0.00388
Cohen’s \(f^{2}\) = 0.00414

Latitudinal cline of chronotype (Leocadio-Miguel et al., 2017).

Critique of Leocadio-Miguel et al. latitude article (Vartanian, 2024).

The ASA statement on p-values (Wasserstein & Lazar, 2016).

Authors who rely solely on the p-value demonstrate a preference for statistical rituals over statistical reasoning (Gigerenzer, 2004).

Note: The HO score (Horne & Östberg, 1976) goes from 16 to 86, with higher scores indicating a preference for morningness.

What Test Should I Use?

Check Antoine Soetewey’s flowchart to help you decide the appropriate statistical test for your data.

The infer Package

infer is a Tidymodels package that provides a statistical grammar for inferential statistics.

install.packages("infer")

It also offers pipeline examples for various hypothesis tests. These pipelines can serve as a helpful starting point for implementing other types of models.

The infer Package

  • specify() specifies the variable, or relationship between variables, of interest.
  • hypothesize() declares the null hypothesis.
  • generate() generates data reflecting the null hypothesis or using the bootstrap.
  • calculate() calculates summary statistics from either the observed data to form the observed test statistic, or from the generated data to form the null distribution of test statistics.
  • visualize() plots the null distribution of test statistics.
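
Chained together, these verbs form a complete randomization-based test. A minimal sketch, previewing the Adelie body mass comparison developed in the next slides:

library(dplyr)
library(infer)
library(palmerpenguins)
library(tidyr)

adelie <-
  penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex)

# Observed test statistic
observed_stat <-
  adelie |>
  specify(body_mass_g ~ sex) |>
  calculate(stat = "diff in means", order = c("male", "female"))

# Randomization-based empirical null distribution
null_dist <-
  adelie |>
  specify(body_mass_g ~ sex) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in means", order = c("male", "female"))

null_dist |> get_p_value(obs_stat = observed_stat, direction = "two-sided")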

t-Test: Hypothesis

Is there a meaningful difference in body mass between male and female Adelie penguins?

Sexual dimorphism (physical differences between sexes) is common in many bird species. This could have implications for understanding their ecology and behavior.

To test this, we will perform a t-Test for Independent Samples using a randomization-based empirical null distribution approach.

Our Type I error rate (\(\alpha\)) will be set at 0.05 and our Type II error rate (\(\beta\)) at 0.2, giving us a power (\(1 - \beta\)) of 0.8.

\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} \\ \text{H}_{1}: \mu_{A} \neq \mu_{B} \\ \end{cases} \]

\[ \begin{cases} \text{H}_{0}: \text{Cohen's} \ d < \text{MES} \\ \text{H}_{1}: \text{Cohen's} \ d \geq \text{MES} \\ \end{cases} \]

t-Test: Hypothesis

Is there a meaningful difference in body mass between male and female Adelie penguins?

To ensure practical significance, we will analyze the difference in means for its effect size, considering a 95% confidence interval. Cohen’s benchmark for a medium effect size (\(d\) = 0.5) will be used as our Minimum Effect Size (MES) (Cohen, 1988).

Tip: See Perezgonzalez (2015) to learn more about data testing and practical significance.

\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} \\ \text{H}_{1}: \mu_{A} \neq \mu_{B} \\ \end{cases} \]

\[ \begin{cases} \text{H}_{0}: \text{Cohen's} \ d < \text{MES} \\ \text{H}_{1}: \text{Cohen's} \ d \geq \text{MES} \\ \end{cases} \]

Cohen’s \(d\)

Cohen’s \(d\) is a measure of effect size that indicates the standardized difference between two means. When comparing two independent means, Cohen’s \(d\) for a two-tailed t-test can be calculated as:

\[ d = \frac{|\text{m}_{A} - \text{m}_{B}|}{s_{pooled}} \]

Where:

  • \(\text{m}_{A}\) and \(\text{m}_{B}\) are the sample means of the two groups.
  • \(s_{pooled}\) is the pooled standard deviation.

Cohen’s \(d\)

The pooled standard deviation (\(s_{pooled}\)) is calculated using the following formula:

\[ s_{pooled} = \sqrt{\frac{(n_{1} - 1)s_{1}^{2} + (n_{2} - 1)s_{2}^{2}}{n_{1} + n_{2} - 2}} \]

Where:

  • \(n_{1}\) and \(n_{2}\) are the sample sizes of the two groups.
  • \(s_{1}^{2}\) and \(s_{2}^{2}\) are the sample variances of the two groups.
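Translating both formulas into R provides a quick way to verify the effect size reported later in this module (a minimal sketch):

library(dplyr)
library(palmerpenguins)
library(tidyr)

adelie <-
  penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex)

male <- adelie |> filter(sex == "male") |> pull(body_mass_g)
female <- adelie |> filter(sex == "female") |> pull(body_mass_g)

# Pooled standard deviation.
s_pooled <- sqrt(
  ((length(male) - 1) * var(male) + (length(female) - 1) * var(female)) /
    (length(male) + length(female) - 2)
)

# Cohen's d: standardized difference between the two means.
# Returns ~2.17, matching effectsize::cohens_d() later in this module.
abs(mean(male) - mean(female)) / s_pooled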

The pwr and pwrss Packages

These two packages provide functions for performing power analysis and sample size calculations for various statistical tests, including t-tests, ANOVA, regression, and more.

pwr provides basic functions for power analysis.

install.packages("pwr")

pwrss provides functions to perform power and sample size calculations for various statistical tests.

install.packages("pwrss")

t-Test: Power Analysis

The power (\(1 - \beta\)) of a statistical test is the probability that it will yield statistically significant results (Cohen, 1988).

A power analysis helps determine the minimum sample size required to detect an effect of a given size with a desired level of confidence (Cohen, 1988). We need to check if our sample size will be sufficient to achieve this confidence.

library(dplyr)
library(palmerpenguins)
library(tidyr)
penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  count(sex)
#> # A tibble: 2 × 2
#>   sex        n
#>   <fct>  <int>
#> 1 female    73
#> 2 male      73

t-Test: Power Analysis

A power analysis for a t-test with an expected effect size of 0.5 (\(\text{Cohen's} \ d\)) indicates that we would need approximately 64 participants in each group to achieve a power of 0.8 at a significance level of 0.05.

library(pwr)
pwr_analysis <- pwr.t.test(
  d = 0.5,
  sig.level = 0.05,
  power = 0.8,
  type = "two.sample",
  alternative = "two.sided"
)
pwr_analysis
#> 
#>      Two-sample t test power calculation 
#> 
#>               n = 63.76561
#>               d = 0.5
#>       sig.level = 0.05
#>           power = 0.8
#>     alternative = two.sided
#> 
#> NOTE: n is number in *each* group

t-Test: Power Analysis

Code
library(pwr)

pwr_analysis |>
  plot.power.htest(
    xlab = "Sample Size Per Group",
    ylab = "Power (1 - Beta)",
    main = NULL
  )

t-Test: Power Analysis

Code
library(pwrss)

pwr_analysis <- pwrss.t.2means(
  mu1 = 0.5,
  mu2 = 0,
  power = 0.8,
  alpha = 0.05,
  welch.df = TRUE,
  alternative = "not equal"
)
#> +--------------------------------------------------+
#> |             SAMPLE SIZE CALCULATION              |
#> +--------------------------------------------------+
#> 
#> Welch's T-Test (Independent Samples)
#> 
#> ---------------------------------------------------
#> Hypotheses
#> ---------------------------------------------------
#>   H0 (Null Claim) : d - null.d = 0 
#>   H1 (Alt. Claim) : d - null.d != 0 
#> 
#> ---------------------------------------------------
#> Results
#> ---------------------------------------------------
#>   Sample Size            = 64 and 64  <<
#>   Type 1 Error (alpha)   = 0.050
#>   Type 2 Error (beta)    = 0.199
#>   Statistical Power      = 0.801

power.t.test(
  ncp = pwr_analysis$ncp,
  df = pwr_analysis$df,
  alpha = pwr_analysis$parms$alpha,
  alternative = "two.sided",
  plot = TRUE,
  verbose = FALSE
)

t-Test: Assumptions

🔗 Independence of observations.

🔔 Normality of the distribution of the response variable (body_mass_g) within each group (sex is the explanatory/independent variable).

⚖️ Homogeneity of variances between groups (only if using Student’s t-test; Welch’s t-test and our permutation approach do not require this assumption).

t-Test: Boxplots

Code
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  ggplot(
    aes(
      x = sex,
      y = body_mass_g,
      fill = sex
    )
  ) +
  geom_boxplot(outlier.color = "red") +
  geom_jitter(
    width = 0.2,
    alpha = 0.1
  ) +
  labs(
    x = "Sex",
    y = "Body Mass (g)",
    fill = "Sex"
  )

t-Test: Histograms

Code
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  ggplot(
    aes(
      x = body_mass_g,
      fill = sex
    )
  ) +
  geom_histogram(
    position = "identity",
    alpha = 0.7
  ) +
  labs(
    x = "Body Mass (g)",,
    y = "Frequency",
    fill = "Sex"
  )

t-Test: Density Plots

Code
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  ggplot(
    aes(
      x = body_mass_g,
      fill = sex
    )
  ) +
  geom_density(alpha = 0.7) +
  labs(
    x = "Body Mass (g)",,
    y = "Density",
    fill = "Sex"
  )

t-Test: Q-Q Plots

Code
library(brandr)
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)

penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  ggplot(
    aes(
      sample = body_mass_g,
      color = sex
    )
  ) +
  stat_qq() +
  stat_qq_line(color = "black") +
  facet_wrap(vars(sex)) +
  labs(
    x = "Theoretical Quantiles",
    y = "Sample Quantiles",
    color = "Sex"
  )

t-Test: Assumptions

✅ Independence of observations.

✅ Normality of the distribution of the response variable (body_mass_g) within each group (sex is the explanatory/independent variable).

⏭️ Homogeneity of variances between groups (only if using Student’s t-test; Welch’s t-test and our permutation approach do not require this assumption).

t-Test: Observed Statistic

The t-statistic for comparing two independent means can be calculated using the following formula:

\[ t = \frac{\text{m}_{A} - \text{m}_{B}}{\sqrt{\frac{s_{A}^{2}}{n_{A}} + \frac{s_{B}^{2}}{n_{B}}}} \]

Where:

  • \(\text{m}_{A}\) and \(\text{m}_{B}\) are the sample means of the two groups.
  • \(s_{A}^{2}\) and \(s_{B}^{2}\) are the sample variances of the two groups.
  • \(n_{A}\) and \(n_{B}\) are the sample sizes of the two groups.

t-Test: Observed Statistic

library(dplyr)
library(magrittr)
library(palmerpenguins)
library(tidyr)
male <-
  penguins |>
  filter(
    species == "Adelie",
    sex == "male"
  ) |>
  drop_na(body_mass_g, sex) |>
  pull(body_mass_g)
female <-
  penguins |>
  filter(
    species == "Adelie",
    sex == "female"
  ) |>
  drop_na(body_mass_g, sex) |>
  pull(body_mass_g)
t_statistic <-
  mean(male, na.rm = TRUE) |>
  subtract(mean(female, na.rm = TRUE)) |>
  divide_by(
    var(male, na.rm = TRUE) |>
      divide_by(length(male)) |>
      add(
        var(female, na.rm = TRUE) |>
          divide_by(length(female))
      ) |>
      sqrt()
  )
t_statistic
#> [1] 13.12629
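As a sanity check, base R's t.test() defaults to the same Welch (unpooled) form used above and reproduces the statistic:

t.test(male, female)$statistic
#>        t 
#> 13.12629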

t-Test: Observed Statistic

library(dplyr)
library(infer)
library(palmerpenguins)
library(tidyr)
observed_statistic <-
  penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  specify(body_mass_g ~ sex) |>
  hypothesize(null = "independence") |>
  calculate(
    stat = "t",
    order = c("male", "female")
  )
observed_statistic
#> Response: body_mass_g (numeric)
#> Explanatory: sex (factor)
#> Null Hypothesis: in...
#> # A tibble: 1 × 1
#>    stat
#>   <dbl>
#> 1  13.1

t-Test: Null Distribution

What would our data look like if there was no difference between sexes?

One way to simulate this scenario is to randomly shuffle the sex values among the body_mass_g values. This breaks any real association between sex and body mass.

We then calculate the t-statistic for this new dataset. Repeating this process many times allows us to build a distribution of t-statistics that represent what we would expect to see if there were no real difference in body mass between male and female Adelie penguins.
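To make the idea concrete, here is a minimal by-hand sketch of a single permutation; infer's generate(type = "permute") repeats this shuffle for every replicate:

library(dplyr)
library(palmerpenguins)
library(tidyr)

set.seed(2026) # Any seed works; results will vary slightly.

shuffled <-
  penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  mutate(sex = sample(sex)) # Shuffle: breaks any real association.

# Welch t-statistic on the shuffled labels (typically near zero).
t.test(body_mass_g ~ sex, data = shuffled)$statistic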

t-Test: Null Distribution

library(dplyr)
library(infer)
library(palmerpenguins)
library(tidyr)
null_dist <-
  penguins |>
  filter(species == "Adelie") |>
  drop_na(body_mass_g, sex) |>
  specify(body_mass_g ~ sex) |>
  hypothesize(null = "independence") |>
  generate(
    reps = 1000,
    type = "permute"
  ) |>
  calculate(
    stat = "t",
    order = c("male", "female")
  )
null_dist
#> Response: body_mass_g (numeric)
#> Explanatory: sex (factor)
#> Null Hypothesis: in...
#> # A tibble: 1,000 × 2
#>    replicate    stat
#>        <int>   <dbl>
#>  1         1  1.11  
#>  2         2 -0.414 
#>  3         3 -2.09  
#>  4         4  1.23  
#>  5         5  0.639 
#>  6         6 -0.847 
#>  7         7 -0.333 
#>  8         8  0.0270
#>  9         9 -1.27  
#> 10        10  0.144 
#> # ℹ 990 more rows

t-Test: Visualizing

Code
library(infer)
library(ggplot2)

null_dist |>
  visualize() +
  shade_p_value(
    obs_stat = observed_statistic,
    direction = "two-sided"
  ) +
  labs(
    title = NULL,
    x = "t-statistic",
    y = "Frequency"
  ) +
  theme(
    text = element_text(size = 14)
  )

t-Test: p-value

null_dist |>
  get_p_value(
    obs_stat = observed_statistic,
    direction = "two-sided"
  )
#> # A tibble: 1 × 1
#>   p_value
#>     <dbl>
#> 1       0
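Note: A simulated p-value of exactly 0 only means that none of the 1,000 permuted statistics was as extreme as the observed one. Report it as \(p\) < 0.001 (i.e., less than 1/1,000), not as zero; infer itself warns about this when it happens.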

t-Test: Observed Effect Size

library(dplyr)
library(effectsize)
library(palmerpenguins)
library(tidyr)
male <-
  penguins |>
  filter(
    species == "Adelie",
    sex == "male"
  ) |>
  drop_na(body_mass_g, sex) |>
  pull(body_mass_g)
female <-
  penguins |>
  filter(
    species == "Adelie",
    sex == "female"
  ) |>
  drop_na(body_mass_g, sex) |>
  pull(body_mass_g)
effect_size <-
  cohens_d(
    x = male,
    y = female,
    mu = 0,
    ci = 0.95,
    alternative = "two.sided"
  )
effect_size
#> Cohen's d |       95% CI
#> ------------------------
#> 2.17      | [1.76, 2.58]
#> 
#> - Estimated using pooled SD.
effect_size |>
  interpret_hedges_g(rules = "cohen1988")
#> Cohen's d |       95% CI | Interpretation
#> -----------------------------------------
#> 2.17      | [1.76, 2.58] |          large
#> 
#> - Estimated using pooled SD.
#> - Interpretation rule: cohen1988

t-Test: Conclusion

Is there a meaningful difference in body mass between male and female Adelie penguins?

Our analysis found a statistically significant difference in means (\(t\) = 13.1, \(p\)-value < 0.001). The observed effect size was large and exceeded the Minimum Effect Size (MES) threshold (\(d\) = 2.17, 95% CI [1.76, 2.58]).

Since our sample size allowed us to reliably detect effects of 0.5 (\(\text{Cohen's} \ d\)) or larger, and the observed effect was much larger than that, the power of our test is high and the probability of a false negative (\(\beta\)) is very low.

Based on these results, we conclude that there is a meaningful difference in body mass between male and female Adelie penguins, with male penguins having a higher mean body mass. Therefore, we reject the null hypothesis in favor of the alternative hypothesis.

\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} \\ \text{H}_{1}: \mu_{A} \neq \mu_{B} \\ \end{cases} \]

\[ \begin{cases} \text{H}_{0}: d < \text{MES} \\ \text{H}_{1}: d \geq \text{MES} \\ \end{cases} \]

Model Diagnostics

Model diagnostics are crucial!

It’s essential to verify that all model assumptions hold. However, a discussion on this topic is beyond the scope of this course.

You can find these assumptions in most statistical textbooks, or you can consult the original papers that introduced the models (e.g., for t-tests, see Student (1908)).

Objective Assumption Tests

🚨 Avoid Using! 🚨

Objective assumption tests (e.g., Anderson–Darling test) are not advisable for large samples, as they can be overly sensitive to minor deviations. Additionally, they might overlook visual patterns that are not captured by a single metric.

A visual inspection of the data is the preferred approach in most cases.

For a straightforward critique of normality tests specifically, refer to this article by Greener (2020).

See also: Kozak & Piepho (2018), Schucany & Ng (2006), and Shatz (2024).
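A quick illustration of this large-sample sensitivity (a hedged sketch: the exact p-values depend on the seed, but the pattern is typical). The same mildly skewed distribution tends to pass the Shapiro–Wilk test at n = 50 and fail it at n = 5,000, even though the deviation is equally negligible in both cases:

set.seed(2026)

# Mildly right-skewed data: normal plus a small exponential component.
mildly_skewed <- function(n) rnorm(n) + 0.5 * rexp(n)

shapiro.test(mildly_skewed(50))$p.value   # Typically > 0.05.
shapiro.test(mildly_skewed(5000))$p.value # Typically < 0.05.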

How to Learn More

How to Learn More

StatQuest
by Josh Starmer

Very Normal
by Christian Pascual

Module 6
Exercise

Your Mission

Answer the following question using the data provided:

Was there a meaningful difference in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 between municipalities in clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R)?

MISFS-R Clusters

The Multidimensional Index for Sustainable Food Systems (MISFS) is a tool designed to assess the sustainability of food systems at a subnational level in Brazil, incorporating local behaviors and practices.

The MISFS-R is a revised version that introduces new indicators and a refined methodology for calculating the index.

For more details, see Carvalho et al. (2021) and Norde et al. (2023).

MISFS-R Clusters

⚠️ Warning ⚠️

This exercise is for educational purposes only.

The data used in this exercise requires further cleaning and validation before it can be used in real-world applications. For the purposes of this analysis, we will assume the data is valid, reliable, and satisfies all assumptions underlying the statistical tests performed, even though this may not hold in practice.

Please note that the results of the statistical test may not be valid due to these simplifications.

In real-world scenarios, always ensure that the assumptions of statistical tests are rigorously checked and validated before interpreting the results.

Data Science Workflow

Remember the data science workflow:

Steps

1. Create a new project and open it in Positron.

2. Download the Quarto template file, rename it as index.qmd, and store it in the project root.

3. Download the template references file, name it reference.bib, and store it in the project root.

4. Read and review the Quarto template.

5. Perform an a priori power analysis.

6. Download the data.

7. Inspect the data file.

8. Import the data into R.

9. Clean, tidy, and validate the data.

10. Transform the data as needed.

11. Save the processed data.

12. Conduct a brief exploratory data analysis.

13. Assess the model assumptions.

14. Model the data.

15. Write your conclusions.

16. Render the report.

The Data

Notes

🧮 Recalculate the percentage.

❌ Remove municipalities with fewer than 10 monitored children.

❌ Remove municipalities where the number of children consuming ultra-processed foods exceeds the number monitored (see the sketch after these notes).

🙋‍♂️ If you’re stuck, ask for help.

🎉 Have fun!
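For reference, the cleaning notes above might translate into dplyr along these lines. This is only a hedged sketch: the toy data and the column names (municipality, monitored_n, upf_n, upf_pct) are hypothetical placeholders, not the actual names in the dataset.

library(dplyr)
library(tibble)

# Toy stand-in for the real data (hypothetical column names).
raw_data <- tribble(
  ~municipality, ~monitored_n, ~upf_n,
  "A",           120,          90,
  "B",           8,            5,  # Fewer than 10 monitored: drop.
  "C",           40,           55  # More consumers than monitored: drop.
)

raw_data |>
  filter(
    monitored_n >= 10,    # Keep municipalities with at least 10 children.
    upf_n <= monitored_n  # Remove impossible counts.
  ) |>
  mutate(upf_pct = upf_n / monitored_n * 100) # Recalculate the percentage.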

Solution

You can find the solution to this exercise at the following links:

Report

https://danielvartan.github.io/r-course-exercise

Code Repository

https://github.com/danielvartan/r-course-exercise

Conclusion

I Hope You Enjoyed the Course!

Now That You’re an R Superstar

I hope you feel more confident in your R programming skills and are ready to tackle new challenges. Remember, learning is a continuous journey, and the R community is there to support you.

🇧🇷 For Brazilian graduate students, consider exploring the abnt Quarto format. It helps you write your thesis or dissertation in compliance with the Brazilian Association of Technical Standards (ABNT) standards while ensuring full reproducibility. Check out an example here.

🙋‍♂️ If you have any questions or need assistance, don’t hesitate to reach out!

Post-Course Survey

How to Learn More?

Here are other great resources to learn more about R and programming in general.

🎓 Online Courses

Kon, F. (n.d.). Introdução à ciência da computação com python – Parte 1 [Introduction to computer science with python – Part 1] [Online course]. Coursera. https://www.coursera.org/learn/ciencia-computacao-python-conceitos (University of São Paulo, Brazil) (pt-BR)

Kon, F. (n.d.). Introdução à ciência da computação com python – Parte 2 [Introduction to computer science with python – Part 2] [Online course]. Coursera. https://www.coursera.org/learn/ciencia-computacao-python-conceitos-2s (University of São Paulo, Brazil) (pt-BR)

Peng, R. D., Leek, J., & Caffo, B. (n.d.). Data science specialization [Online course]. Coursera. https://www.coursera.org/specializations/jhu-data-science (Johns Hopkins University, United States)

How to Learn More?

🎥 Videos

Thenório, I. (2022, March 22). A saga dos computadores [The computer saga] [YouTube video]. https://www.youtube.com/playlist?list=PLYjrJH3e_wDOA5mxhiMxE6yslcIzU5NkX (pt-BR)

Code.org. (2018, January 30). How computers work [YouTube video]. https://youtube.com/playlist?list=PLzdnOPI1iJNcsRwJhvksEo1tJqjIqWbN-&si=WkuM8c-AKI-NZ3td

Jago, M. (Director). (2014, August 29). Turing machines explained [YouTube video]. Computerphile. https://youtu.be/QtW1lQITckE?si=4_wgsHUd96Dvmy-e

Lockerbie, T. (2025, December 26). I made my phone’s chip 2,000,000× bigger and flew inside [YouTube video]. https://youtu.be/QtW1lQITckE?si=4_wgsHUd96Dvmy-e

Zaidan, G., & Saini, S. (2025, February 25). How are microchips made? [YouTube video]. TED-Ed. https://youtu.be/IkRXpFIRUl4?si=iQ7xQuFS6DZLuBY7

How to Learn More?

📙 Books

Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz

Bryan, J., Hester, J., Pileggi, S., & Aja, E. D. (n.d.). What they forgot to teach you about R: The stuff you need to know about R, besides data analysis. https://rstats.wtf

Bryan, J. (n.d.). Happy Git and GitHub for the useR. https://happygitwithr.com (strongly recommended)

Wickham, H. (n.d.). Tidy design principles. https://design.tidyverse.org

Wickham, H. (n.d.). The tidyverse style guide. https://style.tidyverse.org

Wickham, H. (2019). Advanced R (2nd ed.). CRC Press. https://adv-r.hadley.nz

Examples of Data Reports

Closing Remarks

Licenses: GPLv3 · CC BY-NC-SA 4.0

This presentation was created with the Quarto Publishing System. The code and materials are available on GitHub.

References

In accordance with the American Psychological Association (APA) Style, 7th edition.

Ackoff, R. (1989). From data to wisdom. Journal of Applied Systems Analysis, 16, 3–9.
Allaire, J. J., Teague, C., Xie, Y., & Dervieux, C. (n.d.). Quarto [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.5960048
Bernardes, M. E. M. (2009). Ensino e aprendizagem como unidade dialética na atividade pedagógica [Teaching and learning as a dialectical unity in pedagogical activity]. Psicologia Escolar e Educacional, 13(2), 235–242. https://doi.org/10.1590/S1413-85572009000200005
Box, G. E. P. (1979). Robustness in the strategy of scientific model building. In R. L. Launer & G. N. Wilkinson (Eds.), Robustness in statistics (pp. 201–236). Academic Press. https://doi.org/10.1016/B978-0-12-438150-6.50018-2
Boyer, C. B., & Merzbach, U. C. (2011). A history of mathematics (3rd ed.). John Wiley & Sons. (Original work published in 1968)
Broman, K. (2013, April 5). Data science is statistics. The stupidest thing... https://kbroman.wordpress.com/2013/04/05/data-science-is-statistics
Brookshear, J. G., & Brylow, D. (2020). Computer science: An overview (13 ed., Global edition). Pearson.
Bryan, J., & Hester, J. (n.d.). Happy Git and GitHub for the useR. https://happygitwithr.com
Bryan, J., Hester, J., Pileggi, S., & Aja, E. D. (n.d.). What they forgot to teach you about R: The stuff you need to know about R, besides data analysis. https://rstats.wtf
Cao, L. (2017). Data science: A comprehensive overview. ACM Computing Surveys, 50(3), 43. https://doi.org/10.1145/3076253
Carvalho, A. M. de, Verly Jr, E., Marchioni, D. M., & Jones, A. D. (2021). Measuring sustainable food systems in Brazil: A framework and multidimensional index to evaluate socioeconomic, nutritional, and environmental aspects. World Development, 143, 105470. https://doi.org/10.1016/j.worlddev.2021.105470
Casella, G., & Berger, R. L. (2002). Statistical inference (2nd ed.). Duxbury.
Chang, W. (2018). R graphics cookbook: Practical recipes for visualizing data (2nd ed.). O’Reilly Media. https://r-graphics.org
Code.org. (2018, January 30). How computers work [Video recording]. https://youtube.com/playlist?list=PLzdnOPI1iJNcsRwJhvksEo1tJqjIqWbN-&si=WkuM8c-AKI-NZ3td
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
Coronel, C., & Morris, S. A. (2019). Database systems: Design, implementation, and management (13th ed.). Cengage.
Couch, S., Bray, A., Ismay, C., Chasnovski, E., Baumer, B., & Çetinkaya-Rundel, M. (2021). Infer: An R package for tidyverse-friendly statistical inference. Journal of Open Source Software, 6(65), 3661. https://doi.org/10.21105/joss.03661
DeGroot, M. H., & Schervish, M. J. (2012). Probability and statistics (OCLC: ocn502674206) (4th ed.). Addison-Wesley.
Denning, P. J. (2005). Is computer science science? Communications of the ACM, 48(4), 27–31. https://doi.org/10.1145/1053291.1053309
Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64–73. https://doi.org/10.1145/2500499
Ellis, P. D. (Ed.). (2010). The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results. Cambridge University Press.
Ellis, S. E., & Leek, J. T. (2018). How to share data for collaboration. The American Statistician, 72(1), 53–57. https://doi.org/10.1080/00031305.2017.1375987
Engels, F. (2020). Dialética da natureza [Dialektik der natur] (N. Schneider, Trans.). Boitempo. (Original work published in 1873)
Freire, P. (2011). Pedagogia da autonomia: Saberes necessários à prática educativa [Pedagogy of autonomy: Necessary knowledge for educational practice]. Paz e Terra.
Frey, B. B. (Ed.). (2022). The SAGE encyclopedia of research design (2nd ed.). SAGE Publications. https://doi.org/10.4135/9781071812082
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606. https://doi.org/10.1016/j.socec.2004.09.033
Gómez-de-Mariscal, E., Guerrero, V., Sneider, A., Jayatilaka, H., Phillip, J. M., Wirtz, D., & Muñoz-Barrutia, A. (2021). Use of the p-values as a size-dependent function to address practical differences when analyzing large datasets. Scientific Reports, 11(1), 20942. https://doi.org/10.1038/s41598-021-00199-5
Greener, R. (2020, August 4). Stop testing for normality. Medium. https://towardsdatascience.com/stop-testing-for-normality-dba96bb73f90
Greinert, R. A. G., & Martins, M. O. (2026, January 27). Acampamos na Antártida (Ep. 4) [We camped in Antarctica (Ep. 4)] [Video recording]. https://youtu.be/QtW1lQITckE?si=4_wgsHUd96Dvmy-e
Grosser, M., Bumann, H., & Wickham, H. (2021). Advanced R solutions. CRC Press.
Hartnett, K. (2021, March 23). Matrix multiplication inches closer to mythic goal. Quanta Magazine. https://www.quantamagazine.org/mathematicians-inch-closer-to-matrix-multiplication-goal-20210323
Horne, J. A., & Östberg, O. (1976). A self-assessment questionnaire to determine morningness-eveningness in human circadian rhythms. International Journal of Chronobiology, 4(2), 97–110.
Howell, D. C. (2013). Statistical methods for psychology (8th ed.). Wadsworth Cengage Learning.
Ifrah, G. (2001). The universal history of computing: From the abacus to the quantum computer (E. F. Harding, Trans.). John Wiley & Sons. (Original work published in 1994)
Ihaka, R., & Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3), 299–314. https://doi.org/10.1080/10618600.1996.10474713
Jago, M. (2014, August 29). Turing machines explained [Video recording]. Computerphile. https://youtu.be/QtW1lQITckE?si=4_wgsHUd96Dvmy-e
Kahanwal, B. (2013). Abstraction level taxonomy of programming language frameworks. International Journal of Programming Languages and Applications, 3(4). https://doi.org/10.5121/ijpla.2013.3401
Klein, L. A., Cerullo, M. J., & Cerullo, M. V. (1992). Integration of the microcomputer into the changing accounting curriculum: An analysis of recent findings. Computers & Education, 19(3), 209–217. https://doi.org/10.1016/0360-1315(92)90114-K
Kozak, M., & Piepho, H.-P. (2018). What’s normal anyway? Residual plots are more telling than significance tests when checking ANOVA assumptions. Journal of Agronomy and Crop Science, 204(1), 86–98. https://doi.org/10.1111/jac.12220
Kuhn, M., & Silge, J. (2022). Tidy modeling with R: A framework for modeling in the tidyverse. O’Reilly Media. https://www.tmwr.org
Leocadio-Miguel, M. A., Louzada, F. M., Duarte, L. L., Areas, R. P., Alam, M., Freire, M. V., Fontenele-Araujo, J., Menna-Barreto, L., & Pedrazzoli, M. (2017). Latitudinal cline of chronotype. Scientific Reports, 7(1), 5437. https://doi.org/10.1038/s41598-017-05797-w
Lockerbie, T. (2025, December 26). I made my phone’s chip 2,000,000× bigger and flew inside [Video recording]. https://youtu.be/QtW1lQITckE?si=4_wgsHUd96Dvmy-e
Lohr, S. (2014, August 18). For big-data scientists, “janitor work” is key hurdle to insights. The New York Times: Technology. https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
Marwick, B., Boettiger, C., & Mullen, L. (2018). Packaging data analytical work reproducibly using R (and friends). The American Statistician, 72(1), 80–88. https://doi.org/10.1080/00031305.2017.1375986
McIlroy, M. D., Pinson, E. N., & Tague, B. A. (1978). UNIX Time-Sharing System: Forward. Bell System Technical Journal, 57(6), 1899–1904. https://archive.org/details/bstj57-6-1899/mode/2up
Neyman, J., & Pearson, E. S. (1928a). On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A(1/2), 175–240. https://doi.org/10.2307/2331945
Neyman, J., & Pearson, E. S. (1928b). On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika, 20A(3/4), 263–294. https://doi.org/10.2307/2332112
Norde, M. M., Porciuncula, L., Garrido, G., Nunes-Galbes, N. M., Sarti, F. M., Marchioni, D. M. L., & de Carvalho, A. M. (2023). Measuring food systems sustainability in heterogenous countries: The Brazilian multidimensional index updated version applicability. Sustainable Development, 31(1), 91–107. https://doi.org/10.1002/sd.2376
Papert, S. (2020). Mindstorms: Children, computers, and powerful ideas. Basic Books.
Peng, R. D. (2022). R programming for data science. https://bookdown.org/rdpeng/rprogdatascience
Perezgonzalez, J. D. (2015). Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Frontiers in Psychology, 6. https://doi.org/10.3389/fpsyg.2015.00223
Popper, K. R. (1979). Objective knowledge: An evolutionary approach. Oxford University Press. (Original work published in 1972)
R Core Team. (n.d.). R: A language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.r-project.org
Reis, J., & Housley, M. (2022). Fundamentals of data engineering: Plan and build robust data systems. O’Reilly.
Rowley, J. (2007). The wisdom hierarchy: Representations of the DIKW hierarchy. Journal of Information Science, 33(2), 163–180. https://doi.org/10.1177/0165551506070706
Schucany, W. R., & Ng, H. K. T. (2006). Preliminary goodness-of-fit tests for normality do not validate the one-sample Student t. Communications in Statistics - Theory and Methods, 35(12), 2275–2286. https://doi.org/10.1080/03610920600853308
Shatz, I. (2024). Assumption-checking rather than (just) testing: The importance of visualization and effect size in statistical diagnostics. Behavior Research Methods, 56(2), 826–845. https://doi.org/10.3758/s13428-023-02072-x
Sistema de Vigilância Alimentar e Nutricional, Coordenação-Geral de Alimentação e Nutrição, Departamento de Promoção da Saúde, Coordenação Setorial de Tecnologia da Informação, Secretaria de Atenção Primária à Saúde, & Ministério da Saúde. (n.d.). Microdados dos acompanhamentos de estado nutricional [Microdata on nutritional status monitoring] [Dataset]. openDataSUS. https://dadosabertos.saude.gov.br/dataset/sisvan-estado-nutricional
Student. (1908). The probable error of a mean. Biometrika, 6(1), 1–25. https://doi.org/10.2307/2331554
Thenório, I. (2022, March 22). A saga dos computadores [The computer saga] [Video recording]. https://www.youtube.com/playlist?list=PLYjrJH3e_wDOA5mxhiMxE6yslcIzU5NkX
Tisue, S., & Wilensky, U. (2004). NetLogo: A simple environment for modeling complexity. Center for Connected Learning and Computer-Based Modeling. http://www.ccl.sesp.northwestern.edu/papers/netlogo-iccs2004.pdf
Ushey, K., & Wickham, H. (2025). renv: Project environments [Computer software]. https://doi.org/10.32614/CRAN.package.renv
van der Loo, M., & Jonge, E. de. (2018). Statistical data cleaning with applications in R. John Wiley & Sons.
Vartanian, D. (2024). Is latitude associated with chronotype? [Corrected version, University of São Paulo]. https://doi.org/10.11606/D.100.2025.tde-02042025-063648
von Neumann, J. (1993). First draft of a report on the EDVAC. IEEE Annals of the History of Computing, 15(4), 27–75. https://doi.org/10.1109/85.238389
Vygotsky, L. S. (2012). Thought and language (E. Hanfmann, G. Vakar, & A. Kozulin, Trans.; Rev. exp. ed.). The MIT Press. (Original work published in 1934)
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70(2). https://doi.org/10.1080/00031305.2016.1154108
Wickham, H. (n.d.-a). The tidyverse style guide. https://style.tidyverse.org
Wickham, H. (n.d.-b). Tidy design principles. https://design.tidyverse.org
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis (2nd ed.). Springer. https://doi.org/10.1007/978-3-319-24277-4
Wickham, H. (2019). Advanced R (2nd ed.). CRC Press. https://adv-r.hadley.nz
Wickham, H. (2023). The tidy tools manifesto. Tidyverse. https://tidyverse.tidyverse.org/articles/manifesto.html
Wickham, H., & Bryan, J. (2023). R packages (2nd ed.). O’Reilly Media. https://r-pkgs.org
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly Media. https://r4ds.hadley.nz
Wilkinson, L. (2005). The grammar of graphics (J. Chambers, D. Hand, & W. Härdle, Eds.; 2nd ed.). Springer.
Zaidan, G., & Saini, S. (2025, February 25). How are microchips made? [Video recording]. TED-Ed. https://youtu.be/IkRXpFIRUl4?si=iQ7xQuFS6DZLuBY7

Thank You!