Daniel Vartanian
Jaqueline Lopes Pereira
This course will introduce you to the R programming language.
Here are the main topics:
(Artwork by Allison Horst)
(Future course? An Introduction to Git)
(Artwork by Allison Horst)
The course will take place in one of the laboratories of the Public Health School (FSP) of the University of São Paulo (USP).
The classes are distributed over 5 days, with a total of 9 hours.
Monday (24/03) 13:00-14:00
Tuesday (25/03) 13:00-14:00
Wednesday (26/03) 13:00-18:00
Thursday (27/03) 13:00-14:00
Friday (28/03) 13:00-14:00
Theory ➡️ Practice.
In-class activities.
Final project.
No formal evaluation.
Mistakes will happen. Don't let them discourage you!
(Artwork by Allison Horst)
Don't be afraid to ask questions, even if they seem simple. Everyone has to start somewhere!
The programming community is very active and there are many resources available to help you learn and solve problems.
One of the best ways to get help is to search or ask questions on forums like Stack Overflow.
Bring a flash drive/pen drive.
Try to use the same computer for all classes.
Always open this presentation in the browser.
Don't miss a class.
Ask questions.
Help each other.
Have fun!
Always have this presentation open in your browser.
We will use it to navigate through the course content.
Use it to access the course materials and exercises.
Tip: Want to create QR Codes like this? Check the qrcode R package.
This is a programming course.
As such, we must understand the basics of how a computer works.
It is essential to start from materiality; otherwise, things stop making sense.
If you understand R at its core, you can navigate any challenge you encounter.
(Ariadne's thread. Artwork by Eroshka)
The science of information processes and their interactions with the world (Denning, 2005).
Computer science studies information processes both artificial and natural (Denning, 2005). It seeks to build a scientific foundation for such topics as computer design, computer programming, information processing, algorithmic solutions of problems, and the algorithmic process itself (Brookshear & Brylow, 2020).
(Photo by Mark Richards)
(Artwork by Pablo Picasso, Le Taureau (1945-46))
Stored-Program Concept (1945) (AKA "von Neumann Architecture")
First proposed by J. P. Eckert and his team (1944-1945).
Input -> Storage -> Processing -> Output
(Reproduced from Brookshear & Brylow (2020))
Decimal | Binary |
---|---|
0 | 0 |
1 | 1 |
2 | 10 |
3 | 11 |
4 | 100 |
Addition |
---|
01 + |
01 + |
01 = |
11 |
x | y | x ∧ y | x ∨ y |
---|---|---|---|
0 | 0 | 0 | 0 |
0 | 1 | 0 | 1 |
1 | 0 | 0 | 1 |
1 | 1 | 1 | 1 |
x | ¬x |
---|---|
0 | 1 |
1 | 0 |
(Adapted from Brookshear & Brylow (2020))
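In R, these Boolean operations correspond to the logical operators & (AND), | (OR), and ! (NOT). A quick check against the truth tables above:

```r
x <- c(FALSE, FALSE, TRUE, TRUE)
y <- c(FALSE, TRUE, FALSE, TRUE)

x & y # AND: TRUE only when both are TRUE -> FALSE FALSE FALSE TRUE
x | y # OR: TRUE when at least one is TRUE -> FALSE TRUE TRUE TRUE
!x    # NOT: inverts each value -> TRUE TRUE FALSE FALSE
```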
(Video by George Zaidan and Sajan Saini, on TED-ED)
The instructions which govern this operation must be given to the device in absolutely exhaustive detail. […] All these procedures require the use of some code to express the logical and the algebraical definition of the problem under consideration, as well as the necessary numerical material. (von Neumann, 1993)
(Photo by Corbis Historical/Getty Images)
(Artwork by Calltutors)
R is an excellent language to get started with programming because it's specialized. Unlike some other languages, it doesn't have an overwhelming amount of features for you to learn.
(Artwork by Allison Horst)
In a general sense, Python is also good for learning how to program, but it is much easier to learn how to work with data in R. In academia, both programming languages are very important.
Programming in movies versus programming in real life:
R is a free and open-source programming language designed for data analysis and graphics (Ihaka & Gentleman, 1996).
It was developed by Ross Ihaka and Robert Gentleman in the statistics department at the University of Auckland (New Zealand) in 1991 and introduced to the scientific community in 1993 (Peng, 2022).
(Robert Gentleman (Left) and Ross Ihaka (Right). Photos by an Unknown Author.)
(Artworks by Allison Horst)
Installing R is very simple.
You just need to download the installer from the Comprehensive R Archive Network (CRAN).
Go to Download R for [YOUR OS] and then click on the base link (the base version of R).
Simply follow the instructions of the installer.
An IDE (Integrated Development Environment) is a software that provides a set of tools to help the programmer to write code. It puts together things like a text editor, a console, a debugger, and a file manager.
RStudio is the most popular IDE for R. Like R, it is also free and open-source.
Project-oriented workflow: A project consolidates all related files and resources in one place, ensuring that your work is reproducible and well-organized.
It is important to have a structured project directory. A best practice is to follow the structure made for R Packages (Marwick et al., 2018).
Every data project must be accompanied, at least, by a README file, explaining the project, and a LICENSE file, defining the terms of use of the data (See Open Source Licenses here).
(Reproduced from Marwick et al. (2018))
You can use the usethis R package to create these files.
(Artwork by Allison Horst)
Markdown is a lightweight markup language with plain text formatting syntax. The great thing about Markdown is that you can forget about the formatting and focus on the content.
Quarto Notebooks are a type of Markdown document that allows you to mix code, text, and output in a single document. These slides are all written in Markdown.
Learn more about the Markdown syntax here.
Everything in R is an object.
Everything that happens in R is the result of a function call.
Scalars (0D): 1
Vectors (1D): c(1, 2, 3)
Matrices (2D): matrix(1:9, nrow = 3, ncol = 3)
Arrays (nD): array(1:27, dim = c(3, 3, 3))
(Artwork by Kanishk Srivastava)
In R, most operations are vectorized, meaning they operate on all elements of a vector at once, without needing explicit loops.
This makes code more concise, readable, and less error-prone, while improving efficiency compared to non-vectorized languages.
(Artwork by Kanishk Srivastava)
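For instance, doubling every element of a vector needs no loop at all:

```r
x <- c(1, 2, 3, 4)

# Vectorized: the operation is applied to every element at once
x * 2 # -> 2 4 6 8

# Equivalent loop version: more verbose and slower in R
out <- numeric(length(x))
for (i in seq_along(x)) {
  out[i] <- x[i] * 2
}
out # -> 2 4 6 8
```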
These are part of the atomic data types in R.

- logical: TRUE/FALSE.
- integer: 1, 2, 3.
- double: 1.0, 2.0, 3.0.
- character: "Hello, World!".

Common types and classes for values:

- character (e.g., "Maria", "John")
- factor
- integer (e.g., 1, 2, 3)
- double (e.g., 1.0, 2.0, 3.0)
- complex (e.g., 1 + 2i, 3 + 4i)
- logical (e.g., TRUE, FALSE)
- Date (e.g., 2023-01-01) (linear time)
- POSIXct (e.g., 2023-01-01 00:00:00) (linear time)
- Interval (e.g., 2023-01-01 00:00:00–2023-12-15 15:40:00) (linear time)
- Duration (e.g., 1 year, 2 months, 3 days) (linear time)
- Period (e.g., 1 year, 2 months, 3 days) (linear(ish) time)
- hms (e.g., 01:00:00) (circular time)

Atomic types are non-recursive objects, i.e., objects that can't hold themselves as an entry (e.g., logical, integer, double, character).
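You can inspect the atomic type of a value with typeof():

```r
typeof(TRUE)            # "logical"
typeof(1L)              # "integer" (the L suffix creates an integer literal)
typeof(1.0)             # "double"
typeof("Hello, World!") # "character"
```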
Non-Atomic Objects

- matrix() (2D)
- array() (nD)
- list() (nD)
- data.frame() (a special case of list) (2D)
) (2D)list(
list(list(1), list(2)),
list(list("a"), list("b")),
list(TRUE)
)
#> [[1]]
#> [[1]][[1]]
#> [[1]][[1]][[1]]
#> [1] 1
#>
#>
#> [[1]][[2]]
#> [[1]][[2]][[1]]
#> [1] 2
#>
#>
#>
#> [[2]]
#> [[2]][[1]]
#> [[2]][[1]][[1]]
#> [1] "a"
#>
#>
#> [[2]][[2]]
#> [[2]][[2]][[1]]
#> [1] "b"
#>
#>
#>
#> [[3]]
#> [[3]][[1]]
#> [1] TRUE
Left-Hand Side | Operator | Right-Hand Side
(Artwork by Allison Horst)
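A minimal example: the name on the left receives the value on the right via the assignment operator <-.

```r
x <- c(1, 2, 3) # left-hand side: name; operator: <-; right-hand side: value
x               # -> 1 2 3
```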
R has strict rules regarding the names of objects.
Some names are reserved (e.g., if, else, TRUE, FALSE).

Name | Verdict |
---|---|
variable-name | Bad |
variable.name | Good, but not advisable |
variable_name | Good (snake_case); most used in R |
1variable_name | Bad |
.variable_name | Bad |
variableName | Good (camelCase) |
VariableName | Good (PascalCase) |
Atomic

- `[`(): 1-level extraction

Data Frames

- x[i, ]: Extract row i
- x[, j]: Extract column/variable j
- x[i, j]: Extract row i from column/variable j
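These subsetting forms in action, on a small data frame:

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

df[1, ]    # first row (still a data frame)
df[, "b"]  # column b, returned as a vector
df[2, "a"] # single element: row 2 of column a -> 2
```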
NA (Not Available)

Missing values must be explicitly declared in R. For that, R uses the NA value.

NA comes in different flavors, for example: NA_integer_, NA_real_, NA_character_, NA_complex_.

If you use just NA, R will use the most appropriate type.
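Missing values propagate through computations unless you handle them explicitly:

```r
x <- c(1, NA, 3)

mean(x)               # NA: the missing value propagates
mean(x, na.rm = TRUE) # 2: NAs are removed before computing
is.na(x)              # FALSE TRUE FALSE: detect missing values
```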
NaN (Not a Number)

Tip: See the naniar package.
Click here to learn the differences between the base R and magrittr pipes.
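The pipe rewrites nested calls as a left-to-right chain. With the base pipe |>:

```r
# Without the pipe: nested calls read inside-out
sqrt(sum(c(1, 4, 9, 11)))

# With the pipe: reads left to right
c(1, 4, 9, 11) |> sum() |> sqrt() # -> 5
```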
is.* Functions
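The is.*() family of predicate functions tests what kind of object you have, returning TRUE or FALSE:

```r
is.numeric(1.5)   # TRUE
is.character("a") # TRUE
is.logical(TRUE)  # TRUE
is.numeric("1.5") # FALSE: a character string, even if it looks numeric
```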
Avoid loops (if you can).
Especially when dealing with large datasets. Use functionals instead.
A functional is a function that takes a function as an input and returns a vector as output (Wickham, 2019).
It makes the programmer's work easier, because it allows you to apply a function to a vector without the need of a loop. Because these functions are usually written in C (a low-level programming language), they are very fast.
R has native functionals (e.g., lapply()), but prefer solutions from the purrr package.
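For example, summing each element of a list with a functional instead of a loop (this sketch assumes the purrr package is installed):

```r
library(purrr)

x <- list(1:3, 4:6, 7:9)

# Base R functional: returns a list
lapply(x, sum)

# purrr equivalent: map_dbl() returns a double vector instead of a list
map_dbl(x, sum) # -> 6 15 24
```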
library(microbenchmark)
microbenchmark(
with_for_loop(1:1000),
with_map(1:1000),
times = 100,
check = "equal"
)
#> Unit: microseconds
#> expr min lq mean median uq max
#> with_for_loop(1:1000) 902.153 1134.307 2893.9184 1270.279 1494.831 123181.670
#> with_map(1:1000) 360.080 448.497 537.5125 513.731 578.594 1480.732
#> neval cld
#> 100 a
#> 100 a
Scoping: The act of finding the value associated with a name.
Each program must do one thing well (McIlroy et al., 1978).
In R, the fundamental unit of shareable code is the package.
A package bundles together code, data, documentation, and tests, and is easy to share with others (Wickham & Bryan, 2023).
Documentation example:
The mctq R package.
Note: Other programming languages refer to packages as libraries.
R is just the core. It needs packages to do anything useful.
A typical installation of R comes with a set of packages, like:

- base: Basic R functions (e.g., sum())
- datasets: Some datasets for testing and teaching (e.g., mtcars)
- graphics: The basic graphics functions (e.g., plot())
- grDevices: The graphics devices (e.g., pdf())
- methods: The built-in object-oriented programming system
- parallel: Support for parallel computation
- stats: Basic statistical functions (e.g., t.test())
- utils: Utility functions (e.g., install.packages())

Based on the atomic types, we can create other types of objects.
)Based on the atomic types, we can create other types of objects.
Using the right kind of object in R is very important, because these objects also include methods to deal with particular types of data.
For example, time can be expressed in different ways, like linear time (e.g., durations, periods, date-times, intervals) and circular time (e.g., time of day). We can also have time series, which are sequences of data points taken at successive points in time.
Date (days since 1970-01-01, the UNIX epoch)
POSIXct (seconds since 1970-01-01, the UNIX epoch)
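You can see these epoch-based representations by stripping the class from a date or date-time:

```r
# Dates are stored as days since the UNIX epoch (1970-01-01)
d <- as.Date("2023-01-01")
unclass(d) # -> 19358

# POSIXct stores seconds since the epoch
t <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
as.numeric(t) # -> 60
```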
Only if necessary!
Your code must speak for itself.
In data analysis code, only use comments to record important things. If you need comments to explain what your code is doing, consider rewriting your code to be clearer (Wickham, n.d.-a).
One of the most important things in programming is to know how to find help. In R, you can use the help() function to get help on a function or package.
Since R has a more academic focus, documentation is usually plentiful and very well written.
Documentation Websites: The mctq R package documentation.
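Some ways to open the documentation from the console:

```r
help("sum")             # open the documentation for sum()
?sum                    # shortcut for the same thing
help(package = "stats") # overview of a package's documentation
```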
Tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy (The Tidy Manifesto), grammar, and data structures.
It was created by Hadley Wickham, a prominent figure in the R community and author of several key references for this course. Hadley Wickham's contributions have significantly shaped modern R programming.
The Tidyverse also has a meta-package that installs or loads the most important packages from the collection:

- readr: Read flat files (csv, tsv, fwf) into R.
- stringr: A fresh approach to string manipulation in R.
- lubridate: Make working with dates in R just that little bit easier.
- dplyr: A grammar of data manipulation.
- purrr: A functional programming toolkit for R.
- forcats: Tools for working with categorical variables (factors).
- ggplot2: An implementation of the Grammar of Graphics in R.

rOpenSci is a non-profit initiative founded in 2011 that brings together a community of researchers and developers committed to open science. They create and maintain high-quality R packages that provide access to data and tools from diverse sources, ensuring reliability and reproducibility in scientific research.
: An implementation of the Grammar of Graphics in R.rOpenSci is a non-profit initiative founded in 2011 that brings together a community of researchers and developers committed to open science. They create and maintain high-quality R packages that provide access to data and tools from diverse sources, ensuring reliability and reproducibility in scientific research.
All packages go through a peer-review process, which ensures that they are well written and reliable.
Example: mctq R package peer-review.
You can do most of the things you need with the packages that come bundled with R. However, that is not the most efficient way to do things today.
In this course we are going to focus on the Tidyverse and rOpenSci packages, which are the most modern and efficient way to work with data in R.
If you want a deeper understanding of the R language, we encourage you to explore the base R solutions.
You can learn a lot about a person just by looking at their code.
Variable and function names should use only lowercase letters, numbers, and _. Use underscores (_) (so-called snake case) to separate words within a name.
The tidyverse has four guiding principles:
Data science is the study of the generalizable extraction of knowledge from data (Dhar, 2023).
For some, data science is just statistics (Broman, 2013) (hype statistics). For others, it's a new interdisciplinary field that synthesizes statistics, informatics, computing, communication, management, and sociology (Cao, 2017).
(Reproduced from Wickham et al. (2023))
You only analyze or visualize data when you already have clean, tidy, and validated data.
The process of getting data ready for analysis is called data munging. You can also see it as data wrangling (usually when dealing with machine learning) or, simply, data cleaning.
Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets (Lohr, 2014).
Ackoffâs DIKW pyramid (Ackoff, 1989)
Data < Information < Knowledge < Wisdom
Data versus the interpretation of the data.
Data is an abstraction. Itâs a representation of the world around us. Without context, it has no meaning.
(Artwork by Pablo Picasso, Le Taureau (1945-46))
A value chain, roughly, consists of a sequence of activities that increase the value of a product step by step. […] One should realize that although the schema nicely organizes data analysis activities, in practice, the process is hardly linear (van der Loo & Jonge, 2018).
(Reproduced from van der Loo & Jonge (2018))
With raw data, we mean the data as it arrives at the desk of the analyst. The state of such data may of course vary enormously, depending on the data source. (van der Loo & Jonge, 2018).
If the researcher has made any modifications to the raw data, it is not the raw form of the data. (S. E. Ellis & Leek, 2018).
(Photo by Unknown Author)
(Artwork by Allison Horst)
(Artwork by Allison Horst)
(Reproduced from Wickham et al. (2023))
Data validation techniques are used to ensure that data is accurate, consistent, and reliable.
Examples of invalid data:
Tip: Check out Mark van der Loo's validate R package for data validation techniques.
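As a small illustration of rule-based validation (a sketch, assuming the validate package is installed; the rules here are made up for the built-in cars dataset):

```r
library(validate)

# Declare validation rules, then confront the data with them
rules <- validator(
  speed >= 0,  # speeds cannot be negative
  dist >= 0,   # distances cannot be negative
  speed < 200  # implausibly high speeds should be flagged
)

out <- confront(cars, rules)
summary(out) # one row per rule: passes, fails, NAs
```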
Daily air quality measurements in New York (May to September 1973).
Data can be hierarchical, with multiple levels of organization.
Interest rate representing the adjusted average rate of overnight repurchase agreements backed by Brazil's federal government securities held in the Special System for Settlement and Custody (SELIC). Reported in % per day.
library(jsonlite)
paste0(
"https://api.bcb.gov.br/dados/serie/",
"bcdata.sgs.11", "/",
"dados", "?",
"formato=json", "&",
"dataInicial=01/12/2024", "&",
"dataFinal=03/12/2024"
) |>
read_json()
#> [[1]]
#> [[1]]$data
#> [1] "02/12/2024"
#> [[1]]$valor
#> [1] "0.041957"
#> [[2]]
#> [[2]]$data
#> [1] "03/12/2024"
#> [[2]]$valor
#> [1] "0.041957"
Not all data is tabular; spatial data can be very large and complex.
Excel cannot handle spatial data, and GUI-based statistical software, when capable of handling spatial data, is often limited and struggles with performance issues.
Projected monthly average maximum temperatures (°C) for June 2021â2040, based on the global climate model (GCM) ACCESS-CM2.
(WorldClim 2.1 data. June mean temperature (°C) in South America (1970-2000))
Spreadsheet syndrome is a term used to describe the problems that arise from using spreadsheets to manage data.
(Artwork by 9Dots Management)
Developed by E. F. Codd of IBM in 1970, the relational model is based on mathematical set theory and represents data as independent relations. Each relation (table) is conceptually represented as a two-dimensional structure of intersecting rows and columns. The relations are related to each other through the sharing of common entity characteristics (values in columns) (Coronel & Morris, 2019).
Proper documentation is crucial for data sharing, reproducibility, and ensuring that others can understand and use your data effectively.
Here are some examples and guides to help you document your data:
Check S. E. Ellis & Leek (2018) for a detailed guide on data documentation.
(Adapted from S. E. Ellis & Leek (2018, Figure 1))
There are many open data formats available for researchers to use. Open can mean different things, but in this context, it means that the format is not proprietary and can be used by anyone. Here are some examples:
⚠️ Excel files are not an open data format! ⚠️
Data frames are a special type of list used for storing data tables. They are the most common way of storing data in R.
The tibble Package

tibble is a modern take on data frames in R, offering improved usability and integration with the Tidyverse. It enhances readability while retaining essential data frame features.
Using the native R data.frame class.
mtcars
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
#> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
#> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
#> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
#> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
#> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
#> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
#> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
#> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
#> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Using tibble (the Tidyverse way).
library(dplyr) # or library(tibble)
mtcars |> as_tibble()
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rows
It will depend on what kind of data you are working with.
Unlike Excel or GUI-based statistical software, R can deal with any kind of data.
Examples of R functions to read data:

- utils::read.csv(), readr::read_csv()
- terra::vect(), sf::st_read()
- readxl::read_excel(), haven::read_dta() (Stata), haven::read_sav() (SPSS), haven::read_sas() (SAS)

Created by the great Allison Horst, the author of these beautiful illustrations.
The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.
These data were collected from 2007–2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network.
We will use this package to get familiar with R.
(Artwork by Allison Horst)
Know your data.
Save the data file in a folder named data in the root of your project.
The data documentation can be accessed by running:
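Assuming the palmerpenguins package is installed, a likely call is:

```r
library(palmerpenguins)

?penguins_raw # or: help("penguins_raw", package = "palmerpenguins")
```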
You can also use the following code to open the file in RStudio:
Before importing the data to R, let's first take a look at the content of the data file in a text editor.
palmerpenguins-raw.csv
studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
PAL0708,1,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,39.1,18.7,181,3750,MALE,NA,NA,Not enough blood for isotopes.
PAL0708,2,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186,3800,FEMALE,8.94956,-24.69454,NA
PAL0708,3,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18,195,3250,FEMALE,8.36821,-25.33302,NA
PAL0708,4,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,NA,NA,NA,NA,NA,NA,NA,Adult not sampled.
PAL0708,5,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193,3450,FEMALE,8.76651,-25.32426,NA
PAL0708,6,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A2,Yes,2007-11-16,39.3,20.6,190,3650,MALE,8.66496,-25.29805,NA
PAL0708,7,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N4A1,No,2007-11-15,38.9,17.8,181,3625,FEMALE,9.18718,-25.21799,Nest never observed with full clutch.
PAL0708,8,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N4A2,No,2007-11-15,39.2,19.6,195,4675,MALE,9.4606,-24.89958,Nest never observed with full clutch.
PAL0708,9,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N5A1,Yes,2007-11-09,34.1,18.1,193,3475,NA,NA,NA,No blood sample obtained.
PAL0708,10,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N5A2,Yes,2007-11-09,42,20.2,190,4250,NA,9.13362,-25.09368,No blood sample obtained for sexing.
PAL0708,11,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N6A1,Yes,2007-11-09,37.8,17.1,186,3300,NA,8.63243,-25.21315,No blood sample obtained for sexing.
PAL0708,12,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N6A2,Yes,2007-11-09,37.8,17.3,180,3700,NA,NA,NA,No blood sample obtained.
PAL0708,13,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N7A1,Yes,2007-11-15,41.1,17.6,182,3200,FEMALE,NA,NA,Not enough blood for isotopes.
PAL0708,14,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N7A2,Yes,2007-11-15,38.6,21.2,191,3800,MALE,NA,NA,Not enough blood for isotopes.
PAL0708,15,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N8A1,Yes,2007-11-16,34.6,21.1,198,4400,MALE,8.55583,-25.22588,NA
PAL0708,16,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N8A2,Yes,2007-11-16,36.6,17.8,185,3700,FEMALE,NA,NA,Not enough blood for isotopes.
PAL0708,17,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N9A1,Yes,2007-11-12,38.7,19,195,3450,FEMALE,9.18528,-25.06691,NA
Pointing to files inside your project.
The here Package

here is a package that helps you use relative paths in your R projects.
It makes file management much easier and OS-independent.
(Artwork by Allison Horst)
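For example (assuming here is installed and the project contains the raw CSV shown earlier), building a path relative to the project root:

```r
library(here)

# Builds an absolute path from the project root, regardless of the OS
here("data", "palmerpenguins-raw.csv")
# e.g., "/home/user/my-project/data/palmerpenguins-raw.csv"
```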
The readr Package

data |> glimpse()
#> Rows: 344
#> Columns: 17
#> $ studyName <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL…
#> $ `Sample Number` <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10…
#> $ Species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie P…
#> $ Region <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers"…
#> $ Island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgerse…
#> $ Stage <chr> "Adult, 1 Egg Stage", "Adult, 1 Egg Stage", "Adu…
#> $ `Individual ID` <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", …
#> $ `Clutch Completion` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", …
#> $ `Date Egg` <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-…
#> $ `Culmen Length (mm)` <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.…
#> $ `Culmen Depth (mm)` <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8"…
#> $ `Flipper Length (mm)` <chr> "181", "186", "195", NA, "193", "190", "181", "1…
#> $ `Body Mass (g)` <chr> "3750", "3800", "3250", NA, "3450", "3650", "362…
#> $ Sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE"…
#> $ `Delta 15 N (o/oo)` <chr> NA, "8.94956", "8.36821", NA, "8.76651", "8.6649…
#> $ `Delta 13 C (o/oo)` <chr> NA, "-24.69454", "-25.33302", NA, "-25.32426", "…
#> $ Comments <chr> "Not enough blood for isotopes.", NA, NA, "Adult…
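readr's read_csv() returns a tibble like the one above. A minimal self-contained example (in the real import, the first argument would be the path to the raw CSV, e.g. via here("data", ...)):

```r
library(readr)

# read_csv() also accepts literal text wrapped in I(), handy for quick demos
read_csv(I("name,mass_g\nAdelie,3750\nGentoo,5000"), show_col_types = FALSE)
```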
The janitor Package

janitor provides simple functions for examining and cleaning dirty data.
(Artwork by Allison Horst)
clean_names()

The clean_names() function transforms data frame column names into machine-readable formats.
This makes it easier to work with the data.
data |> names()
#> [1] "studyName" "Sample Number" "Species"
#> [4] "Region" "Island" "Stage"
#> [7] "Individual ID" "Clutch Completion" "Date Egg"
#> [10] "Culmen Length (mm)" "Culmen Depth (mm)" "Flipper Length (mm)"
#> [13] "Body Mass (g)" "Sex" "Delta 15 N (o/oo)"
#> [16] "Delta 13 C (o/oo)" "Comments"
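A small demonstration on names like the ones above (assuming janitor is installed):

```r
library(janitor)

df <- data.frame(
  `Sample Number` = 1,
  `Body Mass (g)` = 3750,
  check.names = FALSE # keep the messy names so clean_names() has work to do
)

df |> clean_names() |> names() # -> "sample_number" "body_mass_g"
```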
Most tidyverse functions use data masking, allowing you to refer to data frame columns directly by their names (e.g., my_variable) instead of using more verbose syntax like data[["my_variable"]].
This is part of a concept called tidy evaluation.
The tidyr Package

tidyr provides a set of functions that help you to tidy your data.
data |> glimpse()
#> Rows: 344
#> Columns: 18
#> $ study_name <chr> "PAL0708", "PAL0708", "PAL0708", "PAL0708", "PAL0708…
#> $ sample_number <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "…
#> $ species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ region <chr> "Anvers", "Anvers", "Anvers", "Anvers", "Anvers", "A…
#> $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ life_stage <chr> "Adult", "Adult", "Adult", "Adult", "Adult", "Adult"…
#> $ egg_stage <chr> "1 Egg Stage", "1 Egg Stage", "1 Egg Stage", "1 Egg …
#> $ individual_id <chr> "N1A1", "N1A2", "N2A1", "N2A2", "N3A1", "N3A2", "N4A…
#> $ clutch_completion <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "No"…
#> $ date_egg <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ culmen_length_mm <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ culmen_depth_mm <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
#> $ delta_15_n_o_oo <chr> NA, "8.94956", "8.36821", NA, "8.76651", "8.66496", …
#> $ delta_13_c_o_oo <chr> NA, "-24.69454", "-25.33302", NA, "-25.32426", "-25.…
#> $ comments <chr> "Not enough blood for isotopes.", NA, NA, "Adult not…
The dplyr Package

dplyr is one of the most important packages in the Tidyverse. It provides a grammar for data manipulation.
- mutate(): Create, modify, and delete columns.
- transmute(): Creates a new data frame containing only specified computations.
- select(): Keep or drop columns using their names and types.
- slice(): Subset rows using their positions.
- filter(): Keep rows that match a condition.
- arrange(): Order rows using column values.
- rename(): Rename columns.
- relocate(): Change column order.
- summarize(): Summarise each group down to one row.

select()
The select() function is used to select columns from a data frame.
In our case, we are not interested in using all the variables in the raw data. We will select only the variables we need.
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ date_egg <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ culmen_length_mm <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ culmen_depth_mm <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
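As a self-contained illustration of select() (mtcars stands in for the penguins data here):

```r
library(dplyr)

mtcars |>
  as_tibble() |>
  select(mpg, cyl) # keep only these columns; everything else is dropped
```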
rename()
(Artwork by Allison Horst)
rename()
Let's rename culmen to bill for clarity. Likewise, we'll change date_egg to year, extracting the year value in a later step.
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ year <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ bill_depth_mm <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
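A sketch of that renaming step, with rename(new_name = old_name), shown on a one-row stand-in for the penguins data:

```r
library(dplyr)

df <- tibble(
  date_egg = "2007-11-11",
  culmen_length_mm = "39.1",
  culmen_depth_mm = "18.7"
)

df |>
  rename(
    year = date_egg,
    bill_length_mm = culmen_length_mm,
    bill_depth_mm = culmen_depth_mm
  ) |>
  names()
```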
mutate()
(Artwork by Allison Horst)
mutate()
The mutate() function is used to create new columns or modify existing columns in a data frame.
Let's assign proper classes to the variables. We also need to transform the year column, extracting just the year value.
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ year <chr> "2007-11-11", "2007-11-11", "2007-11-16", "2007-11-1…
#> $ species <chr> "Adelie Penguin (Pygoscelis adeliae)", "Adelie Pengu…
#> $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
#> $ bill_length_mm <chr> "39.1", "39.5", "40.3", NA, "36.7", "39.3", "38.9", …
#> $ bill_depth_mm <chr> "18.7", "17.4", "18", NA, "19.3", "20.6", "17.8", "1…
#> $ flipper_length_mm <chr> "181", "186", "195", NA, "193", "190", "181", "195",…
#> $ body_mass_g <chr> "3750", "3800", "3250", NA, "3450", "3650", "3625", …
#> $ sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "F…
mutate()
For categorical variables, it is helpful to inspect their unique values.
If the vector contains a limited number of unique values, it is likely a good candidate to be converted into a factor for better data handling and analysis.
mutate()
data <-
data |>
mutate(
year =
year |>
as.Date() |>
year() |>
as.integer(),
species =
case_match(
species,
"Adelie Penguin (Pygoscelis adeliae)" ~ "Adelie",
"Chinstrap penguin (Pygoscelis antarctica)" ~ "Chinstrap",
"Gentoo penguin (Pygoscelis papua)" ~ "Gentoo"
) |>
as.factor(),
island = as.factor(island),
bill_length_mm = bill_length_mm |> as.numeric(),
bill_depth_mm = bill_depth_mm |> as.numeric(),
flipper_length_mm = flipper_length_mm |> as.integer(),
body_mass_g = body_mass_g |> as.integer(),
sex =
sex |>
str_to_lower() |>
as.factor()
)
mutate()
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007âŠ
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, AdelâŠ
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, TorgerseâŠ
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, âŠ
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, âŠ
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186âŠ
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, âŠ
#> $ sex <fct> male, female, female, NA, female, male, female, maleâŠ
filter()
(Artwork by Allison Horst)
filter()
Let's filter the data to keep only penguins with a body mass greater than 2000 grams and a bill length less than 50 mm.
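A `filter()` call matching that description might look like this (a sketch; the course's exact code may differ):

```r
library(dplyr)

# Keep only rows that satisfy both conditions.
data |>
  filter(body_mass_g > 2000, bill_length_mm < 50)
```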
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007âŠ
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, AdelâŠ
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, TorgerseâŠ
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, âŠ
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, âŠ
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186âŠ
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, âŠ
#> $ sex <fct> male, female, female, NA, female, male, female, maleâŠ
relocate()
(Artwork by Allison Horst)
relocate()
Let's organize our columns in a more logical order.
The year
column is best placed next to the sex
column.
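One way to achieve this is with `relocate()`'s `.after` argument (a sketch):

```r
library(dplyr)

# Move the `year` column so it comes right after `sex`.
data |>
  relocate(year, .after = sex)
```

Assign the result back to `data` to persist the change.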
data |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, AdelâŠ
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, TorgerseâŠ
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, âŠ
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, âŠ
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186âŠ
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, âŠ
#> $ sex <fct> male, female, female, NA, female, male, female, maleâŠ
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007âŠ
penguins |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, AdelâŠ
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, TorgerseâŠ
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, âŠ
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, âŠ
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186âŠ
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, âŠ
#> $ sex <fct> male, female, female, NA, female, male, female, maleâŠ
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007âŠ
Always save your data in non-binary formats like csv
to ensure that it can be read by other software.
rds
stands for R Data Serialization. It is a binary data format that allows you to load an object into R preserving all its attributes.
Tip: Store your data in research repositories like The Open Science Framework (See the osfr
package). If you are working with sensitive or human data, ensure it is encrypted before storing it in the cloud (See the lockr
package).
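For illustration, writing the same data to both formats might look like this (a sketch; the file paths are hypothetical):

```r
library(readr)

# Non-binary format: readable by virtually any software.
data |> write_csv("penguins-valid.csv")

# Binary format: preserves classes, factor levels, and other attributes.
data |> write_rds("penguins-valid.rds")

# Reading the rds file back restores the object exactly as it was saved.
data <- read_rds("penguins-valid.rds")
```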
The R packages that come with a typical R installation provide a set of basic statistical functions.
head()
& tail()
penguins |> head(3)
#> # A tibble: 3 Ă 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750
#> 2 Adelie Torgersen 39.5 17.4 186 3800
#> 3 Adelie Torgersen 40.3 18 195 3250
#> # âč 2 more variables: sex <fct>, year <int>
penguins |> tail(3)
#> # A tibble: 3 Ă 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Chinstrap Dream 49.6 18.2 193 3775
#> 2 Chinstrap Dream 50.8 19 210 4100
#> 3 Chinstrap Dream 50.2 18.7 198 3775
#> # âč 2 more variables: sex <fct>, year <int>
slice_sample()
penguins |> slice_sample(n = 15)
#> # A tibble: 15 Ă 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Gentoo Biscoe 48.4 14.4 203 4625
#> 2 Adelie Dream 43.2 18.5 192 4100
#> 3 Gentoo Biscoe 42 13.5 210 4150
#> 4 Adelie Biscoe 41.1 18.2 192 4050
#> 5 Adelie Biscoe 38.6 17.2 199 3750
#> 6 Chinstrap Dream 55.8 19.8 207 4000
#> 7 Adelie Dream 39.8 19.1 184 4650
#> 8 Adelie Dream 40.3 18.5 196 4350
#> 9 Gentoo Biscoe 45.5 13.9 210 4200
#> 10 Adelie Torgers⊠36.6 17.8 185 3700
#> 11 Gentoo Biscoe 44.5 15.7 217 4875
#> 12 Chinstrap Dream 45.2 16.6 191 3250
#> 13 Chinstrap Dream 49 19.6 212 4300
#> 14 Adelie Biscoe 36.4 17.1 184 2850
#> 15 Gentoo Biscoe 46.2 14.5 209 4800
#> # âč 2 more variables: sex <fct>, year <int>
glimpse()
penguins |> glimpse()
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, AdelâŠ
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, TorgerseâŠ
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, âŠ
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, âŠ
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186âŠ
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, âŠ
#> $ sex <fct> male, female, female, NA, female, male, female, maleâŠ
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007âŠ
count()
penguins |> count(species, island, .drop = FALSE)
#> # A tibble: 9 Ă 3
#> species island n
#> <fct> <fct> <int>
#> 1 Adelie Biscoe 44
#> 2 Adelie Dream 56
#> 3 Adelie Torgersen 52
#> 4 Chinstrap Biscoe 0
#> 5 Chinstrap Dream 68
#> 6 Chinstrap Torgersen 0
#> 7 Gentoo Biscoe 124
#> 8 Gentoo Dream 0
#> 9 Gentoo Torgersen 0
summarize()
summarytools
Package
library(summarytools)
penguins |> descr(bill_depth_mm)
#> Descriptive Statistics
#> penguins
#> N: 344
#>
#> bill_depth_mm
#> ----------------- ---------------
#> Mean 17.15
#> Std.Dev 1.97
#> Min 13.10
#> Q1 15.60
#> Median 17.30
#> Q3 18.70
#> Max 21.50
#> MAD 2.22
#> IQR 3.10
#> CV 0.12
#> Skewness -0.14
#> SE.Skewness 0.13
#> Kurtosis -0.92
#> N.Valid 342.00
#> N 344.00
#> Pct.Valid 99.42
janitor
Package
The tabyl
system from the janitor
package provides more control over the output compared to the summarytools
package.
| island | Adelie | Chinstrap | Gentoo | Total |
|---|---|---|---|---|
| Biscoe | 26% (44) | 0% (0) | 74% (124) | 100% (168) |
| Dream | 45% (56) | 55% (68) | 0% (0) | 100% (124) |
| Torgersen | 100% (52) | 0% (0) | 0% (0) | 100% (52) |
| Total | 44% (152) | 20% (68) | 36% (124) | 100% (344) |
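A table like the one above can be produced with janitor's tabyl system (a sketch; this particular chain of `adorn_*` calls is one possible combination, not necessarily the course's exact code):

```r
library(janitor)
library(palmerpenguins)

# Cross-tabulate island by species, then add totals, row percentages,
# formatted percentages, and the underlying counts.
penguins |>
  tabyl(island, species) |>
  adorn_totals(c("row", "col")) |>
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 0) |>
  adorn_ns()
```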
gt
Package
The Grammar of Graphics, by Leland Wilkinson, laid the foundation for thinking about data visualization.
We will base this part of the course on the principles found in this book.
(Book cover image from L. Wilkinson (2005))
ggplot2
Package
(Artwork by Allison Horst)
ggplot2
Package
The most powerful and flexible package for data visualization.
A Tidyverse package based on the principles of The Grammar of Graphics.
(Artwork by R Chart Gallery)
library(brandr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
drop_na(body_mass_g, flipper_length_mm, species) |>
ggplot(
aes(
x = body_mass_g,
y = flipper_length_mm,
color = species,
shape = species
)
) +
geom_point(size = 4) +
scale_color_brand_d(alpha = 0.7) +
theme(text = element_text(size = 20), legend.position = "none")
library(brandr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
drop_na(body_mass_g, flipper_length_mm, species) |>
ggplot(
aes(
x = body_mass_g,
y = flipper_length_mm,
color = species,
shape = species
)
) +
geom_point(size = 4) +
geom_smooth(
method = "lm",
formula = y ~ x,
se = FALSE,
linewidth = 2
) +
labs(
x = "Body Mass (g)",
y = "Flipper Length (mm)",
color = "Species",
shape = "Species"
) +
scale_color_brand_d(alpha = 0.7) +
theme(text = element_text(size = 20), legend.position = "none")
library(brandr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
drop_na(flipper_length_mm, species) |>
ggplot(aes(x = flipper_length_mm, fill = species)) +
geom_histogram(bins = 30, alpha = 0.5, position = "identity") +
labs(x = "Flipper Length (mm)", y = "Frequency", fill = "Species") +
scale_fill_brand_d()
library(brandr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
drop_na(bill_length_mm, species) |>
ggplot(aes(x = species, y = bill_length_mm, fill = species)) +
geom_boxplot(outlier.color = get_brand_color("dark-red")) +
geom_jitter(width = 0.2, alpha = 0.1) +
scale_fill_brand_d(alpha = 0.7) +
labs(x = "Species", y = "Bill Length (mm)", fill = "Species")
library(brandr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
drop_na(body_mass_g, flipper_length_mm, species) |>
ggplot(
aes(
x = body_mass_g,
y = flipper_length_mm,
color = species,
shape = species
)
) +
geom_point(size = 2) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
labs(
x = "Body Mass (g)",
y = "Flipper Length (mm)",
color = "Species",
shape = "Species"
) +
scale_color_brand_d(alpha = 0.7)
library(brandr)
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(stringr)
library(tidyr)
penguins |>
drop_na(body_mass_g, flipper_length_mm, sex) |>
mutate(sex = `levels<-`(sex, str_to_title(levels(sex)))) |>
ggplot(aes(x = body_mass_g, y = flipper_length_mm, color = sex)) +
geom_point(size = 2) +
geom_smooth(
method = "lm",
formula = y ~ x,
se = FALSE,
color = get_brand_color("brown")
) +
labs(
x = "Body Mass (g)",
y = "Flipper Length (mm)",
color = "Sex"
) +
scale_color_brand_d(alpha = 0.7) +
facet_wrap(~species)
library(brandr)
library(ggplot2)
library(palmerpenguins)
library(patchwork)
library(tidyr)
plot_hist <-
penguins |>
drop_na(flipper_length_mm, species) |>
ggplot(aes(x = flipper_length_mm, fill = species)) +
geom_histogram(bins = 30, alpha = 0.5, position = "identity") +
labs(x = "Flipper Length (mm)", y = "Frequency") +
scale_fill_brand_d() +
theme(legend.position = "none")
plot_boxplot <-
penguins |>
drop_na(bill_length_mm, species) |>
ggplot(aes(x = species, y = bill_length_mm, fill = species)) +
geom_boxplot(outlier.color = get_brand_color("dark-red")) +
geom_jitter(width = 0.2, alpha = 0.1) +
labs(x = "Species", y = "Bill Length (mm)", fill = "Species") +
scale_fill_brand_d(alpha = 0.7) +
theme(axis.title.x = element_blank(), legend.position = "none")
plot_scatter <-
penguins |>
drop_na(body_mass_g, flipper_length_mm, species) |>
ggplot(
aes(
x = body_mass_g,
y = flipper_length_mm,
color = species,
shape = species
)
) +
geom_point(size = 2) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
labs(
x = "Body Mass (g)",
y = "Flipper Length (mm)",
color = "Species",
shape = "Species"
) +
scale_color_brand_d(alpha = 0.7) +
theme(legend.position = "none")
(plot_hist + plot_boxplot) / plot_scatter + plot_annotation(tag_levels = "A")
Color palettes can be sequential, diverging, or qualitative (discrete). Here are some examples from two popular color palette packages in R.
viridis
: Colorblind-Friendly Color Maps
library(ggplot2)
library(palmerpenguins)
library(tidyr)
library(viridis)
penguins |>
drop_na(flipper_length_mm, species) |>
ggplot(aes(x = flipper_length_mm, fill = species)) +
geom_density(alpha = 0.5, position = "identity") +
scale_fill_viridis(discrete = TRUE) +
labs(x = "Flipper Length (mm)", y = "Density", fill = "Species") +
theme(text = element_text(size = 20), legend.position = "none")
RColorBrewer
: Common Color Palettes
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
drop_na(flipper_length_mm, species) |>
ggplot(aes(x = flipper_length_mm, fill = species)) +
geom_density(alpha = 0.5, position = "identity") +
scale_fill_brewer(palette = "Set1") +
labs(x = "Flipper Length (mm)", y = "Density", fill = "Species") +
theme(text = element_text(size = 20), legend.position = "none")
Check out the amazing work by Yan Holtz. Visit From Data to Viz to see the diagram below and many others.
(Artwork by Yan Holtz)
Another amazing work by Yan Holtz is the R Graph Gallery, which provides numerous examples of R graphics and how to create them.
(Artwork by Yan Holtz)
The Tidymodels framework is a collection of packages for modeling and machine learning using Tidyverse principles.
It was created by the same team that developed the tidyverse
and is designed to work seamlessly with it.
Although it is a relative newcomer to the R ecosystem (2018), it has quickly gained popularity due to its simplicity and consistency.
Like the Tidyverse, Tidymodels also has a meta-package that installs and loads the most important packages from the collection.
rsample
: Provides infrastructure for efficient data splitting and resampling.
parsnip
: A tidy, unified interface to models.
recipes
: A tidy interface to data pre-processing tools for feature engineering.
workflows
: A package to bundle the pre-processing, modeling, and post-processing together.
infer
: A statistical grammar for inferential statistics.

| Decision about \(\text{H}_{0}\) | \(\text{H}_{0}\) True | \(\text{H}_{0}\) False |
|---|---|---|
| Accept | Correct inference (True negative) (\(1 - \alpha\)) | Type II error (False negative) (\(\beta\)) |
| Reject | Type I error (False positive) (\(\alpha\)) | Correct inference (True positive) (\(1 - \beta\)) |
(Based on Casella & Berger (2002, p. 383))
library(pwrss)
pwr_analysis <- pwrss.t.2means(
mu1 = 0.2, # Cohen's d for small effect sizes
mu2 = 0,
power = 0.8,
alpha = 0.05,
welch.df = TRUE,
alternative = "greater"
)
#> Difference between Two means
#> (Independent Samples t Test)
#> H0: mu1 = mu2
#> HA: mu1 > mu2
#> ------------------------------
#> Statistical power = 0.8
#> n1 = 310
#> n2 = 310
#> ------------------------------
#> Alternative = "greater"
#> Degrees of freedom = 618
#> Non-centrality parameter = 2.49
#> Type I error rate = 0.05
#> Type II error rate = 0.2
power.t.test(
ncp = pwr_analysis$ncp,
df = pwr_analysis$df,
alpha = pwr_analysis$parms$alpha,
alternative = "greater",
plot = TRUE,
verbose = FALSE
)
library(pwrss)
pwr_analysis <- pwrss.t.2means(
mu1 = 0.2, # Cohen's d for small effect sizes
mu2 = 0,
power = 0.3,
alpha = 0.05,
welch.df = TRUE,
alternative = "greater",
)
#> Difference between Two means
#> (Independent Samples t Test)
#> H0: mu1 = mu2
#> HA: mu1 > mu2
#> ------------------------------
#> Statistical power = 0.3
#> n1 = 64
#> n2 = 64
#> ------------------------------
#> Alternative = "greater"
#> Degrees of freedom = 126
#> Non-centrality parameter = 1.131
#> Type I error rate = 0.05
#> Type II error rate = 0.7
power.t.test(
ncp = pwr_analysis$ncp,
df = pwr_analysis$df,
alpha = pwr_analysis$parms$alpha,
alternative = "greater",
plot = TRUE,
verbose = FALSE
)
library(pwrss)
pwr_analysis <- pwrss.t.2means(
mu1 = 0.2, # Cohen's d for small effect sizes
mu2 = 0,
power = 0.999,
alpha = 0.001,
welch.df = TRUE,
alternative = "greater",
)
#> Difference between Two means
#> (Independent Samples t Test)
#> H0: mu1 = mu2
#> HA: mu1 > mu2
#> ------------------------------
#> Statistical power = 0.999
#> n1 = 1913
#> n2 = 1913
#> ------------------------------
#> Alternative = "greater"
#> Degrees of freedom = 3824
#> Non-centrality parameter = 6.185
#> Type I error rate = 0.001
#> Type II error rate = 0.001
power.t.test(
ncp = pwr_analysis$ncp,
df = pwr_analysis$df,
alpha = pwr_analysis$parms$alpha,
alternative = "greater",
plot = TRUE,
verbose = FALSE
)
(Artwork by Allison Horst)
(Artwork by Allison Horst)
Large samples and sensitivity
Is a difference of \(0.00001\) valid?
Statistical ritual versus Statistical thinking
Comparison of a 95% confidence level (\(\alpha = 0.05\)) and an n-dependent p-value curve. The parameter \(n_{\alpha}\) represents the minimum sample size to detect statistically significant differences among compared groups. The parameter \(n_{\gamma}\) represents the convergence point of the p-value curve. When the p-value curve expresses practical differences, the area under the red curve (\(A_{p(n)}\)) is smaller than the area under the constant function \(\alpha = 0.05\) (\(A_{\alpha = 0.05}\)) when it is evaluated between \(0\) and \(n_{\gamma}\).
(Reproduced from Gómez-de-Mariscal et al. (2021, Figure 3))
[…] in many circumstances, all that is intended by "proving" the null hypothesis is that the ES [Effect Size] is not necessarily zero but small enough to be negligible
(Cohen, 1988, p. 461).
(Photo by an unknown author.)
| Test | Relevant Effect Size | Small | Medium | Large |
|---|---|---|---|---|
| Comparison of independent means | \(d\), \(\Delta\), Hedges' \(g\) | 0.20 | 0.50 | 0.80 |
| Comparison of two correlations | \(q\) | 0.10 | 0.30 | 0.50 |
| Difference between proportions | Cohen's \(g\) | 0.05 | 0.15 | 0.25 |
| Correlation | \(r\) | 0.10 | 0.30 | 0.50 |
| | \(r^{2}\) | 0.01 | 0.09 | 0.25 |
| Crosstabulation | \(w\), \(\varphi\), \(V\), \(C\) | 0.10 | 0.30 | 0.50 |
| ANOVA | \(f\) | 0.10 | 0.25 | 0.40 |
| | \(\eta^{2}\) | 0.01 | 0.06 | 0.14 |
| Multiple regression | \(R^{2}\) | 0.02 | 0.13 | 0.26 |
| | \(f^{2}\) | 0.02 | 0.15 | 0.35 |

Notes: The Small, Medium, and Large columns are Cohen's effect size classes. The rationale for most of these benchmarks can be found in Cohen (1988) at the following pages: Cohen's \(d\) (p. 40), \(q\) (p. 115), Cohen's \(g\) (pp. 147–149), \(r\) and \(r^{2}\) (pp. 79–80), Cohen's \(w\) (pp. 224–227), \(f\) and \(\eta^{2}\) (pp. 285–287), \(R^{2}\) and \(f^{2}\) (pp. 413–414).
\(\Delta\text{R}^{2}\) = 0.00388
Cohenâs \(f^{2}\) = 0.00414
Latitudinal cline of chronotype (Leocadio-Miguel et al., 2017).
Critique of Leocadio-Miguel et al. latitude article (Vartanian, 2025).
The ASA statement on p-values (Wasserstein & Lazar, 2016).
Authors who rely solely on the p-value demonstrate a preference for statistical rituals over statistical reasoning (Gigerenzer, 2004).
Note: The HO score (Horne & Östberg, 1976) goes from 16 to 86, with higher scores indicating a preference for morningness.
(Reproduced from Leocadio-Miguel et al. (2017, Figure 2))
Check Antoine Soetewey's flowchart to help you decide the appropriate statistical test for your data.
(Artwork by Antoine Soetewey)
infer
Pipeline Examples
infer
is a Tidymodels package that provides a statistical grammar for inferential statistics.
It also offers pipeline examples for various hypothesis tests. These pipelines can serve as a helpful starting point for implementing other types of models.
The examples provided here are for educational purposes only.
In these examples, we are not verifying the assumptions underlying the statistical tests. For simplicity, we assume that the data satisfies all necessary assumptions, and the resulting models are valid.
Please note that the results of the statistical tests might not be valid due to these simplifications.
In practice, always ensure that the assumptions of the statistical tests you use are thoroughly checked and validated.
Is there a meaningful difference in flipper length between Adelie and Gentoo penguins?
To test this, we will perform a t-Test for Independent Samples using a randomization-based empirical null distribution approach.
To ensure practical significance, we will analyze the effect size of the difference in means, considering a 95% confidence interval.
Cohen's benchmark for a small effect size will be used as our Minimum Effect Size (MES) (Cohen, 1988).
\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} \\ \text{H}_{a}: \mu_{A} \neq \mu_{B} \\ \end{cases} \]
\[ \begin{cases} \text{H}_{0}: d < \text{MES} \\ \text{H}_{a}: d \geq \text{MES} \\ \end{cases} \]
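One way to estimate Cohen's \(d\) with a 95% confidence interval is the `effectsize` package (an assumption for illustration; the course may compute it differently):

```r
library(dplyr)
library(effectsize)
library(palmerpenguins)
library(tidyr)

# Keep only the two species being compared and drop missing values.
peng_sub <-
  penguins |>
  filter(species %in% c("Adelie", "Gentoo")) |>
  drop_na(flipper_length_mm) |>
  droplevels()

# Cohen's d for the difference in flipper length, with a 95% CI.
cohens_d(flipper_length_mm ~ species, data = peng_sub)
```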
library(brandr)
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
filter(species %in% c("Adelie", "Gentoo")) |>
drop_na(flipper_length_mm, species) |>
ggplot(aes(x = species, y = flipper_length_mm, fill = species)) +
geom_boxplot(outlier.color = get_brand_color("dark-red")) +
geom_jitter(width = 0.2, alpha = 0.1) +
scale_fill_brand_d(alpha = 0.7) +
labs(x = "Species", y = "Flipper Length (mm)", fill = "Species")
library(brandr)
library(dplyr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
filter(species %in% c("Adelie", "Gentoo")) |>
drop_na(flipper_length_mm, species) |>
ggplot(aes(x = flipper_length_mm, fill = species)) +
geom_density(alpha = 0.7) +
scale_fill_brand_d() +
labs(x = "Flipper Length (mm)", y = "Density", fill = "Species")
library(pwr)
pwr_analysis <- pwr.t.test(
d = 0.2, # Cohen's d for small effect sizes
sig.level = 0.05,
power = 0.8,
type = "two.sample",
alternative = "two.sided"
)
pwr_analysis
#>
#> Two-sample t test power calculation
#>
#> n = 393.4057
#> d = 0.2
#> sig.level = 0.05
#> power = 0.8
#> alternative = two.sided
#>
#> NOTE: n is number in *each* group
pwr_analysis <- pwr.t.test(
d = 0.5, # Cohen's d for medium effect sizes
sig.level = 0.05,
power = 0.8,
type = "two.sample",
alternative = "two.sided"
)
pwr_analysis
#>
#> Two-sample t test power calculation
#>
#> n = 63.76561
#> d = 0.5
#> sig.level = 0.05
#> power = 0.8
#> alternative = two.sided
#>
#> NOTE: n is number in *each* group
library(dplyr)
library(infer)
library(palmerpenguins)
library(tidyr)
observed_statistic <-
penguins |>
filter(species %in% c("Gentoo", "Adelie")) |>
drop_na(flipper_length_mm, species) |>
specify(flipper_length_mm ~ species) |>
hypothesize(null = "independence") |>
calculate(stat = "t", order = c("Gentoo", "Adelie"))
observed_statistic
#> Response: flipper_length_mm (numeric)
#> Explanatory: species (factor)
#> Null Hypothesis: independence
#> # A tibble: 1 Ă 1
#> stat
#> <dbl>
#> 1 34.4
null_dist <-
penguins |>
filter(species %in% c("Gentoo", "Adelie")) |>
drop_na(flipper_length_mm, species) |>
specify(flipper_length_mm ~ species) |>
hypothesize(null = "independence") |>
generate(reps = 1000, type = "permute") |>
calculate(stat = "t", order = c("Gentoo", "Adelie"))
null_dist
#> Response: flipper_length_mm (numeric)
#> Explanatory: species (factor)
#> Null Hypothesis: independence
#> # A tibble: 1,000 Ă 2
#> replicate stat
#> <int> <dbl>
#> 1 1 1.72
#> 2 2 0.121
#> 3 3 -1.98
#> 4 4 0.267
#> 5 5 0.161
#> 6 6 0.0241
#> 7 7 0.532
#> 8 8 -0.954
#> 9 9 -0.751
#> 10 10 -0.395
#> # âč 990 more rows
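With the observed statistic and the null distribution in hand, the p-value (and a visual check) can be obtained with infer's helpers (a sketch):

```r
# Compare the observed statistic against the randomization-based
# null distribution.
null_dist |>
  get_p_value(obs_stat = observed_statistic, direction = "two-sided")

# Visualize where the observed statistic falls.
null_dist |>
  visualize() +
  shade_p_value(obs_stat = observed_statistic, direction = "two-sided")
```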
# Note: `gentoo` and `adelie` are assumed to be per-species vectors of
# flipper lengths extracted from the data beforehand.
pwr_analysis <- pwrss.t.2means(
mu1 = gentoo |> mean(na.rm = TRUE),
mu2 = adelie |> mean(na.rm = TRUE),
sd1 = gentoo |> sd(na.rm = TRUE),
sd2 = adelie |> sd(na.rm = TRUE),
paired = FALSE,
n2 = length(adelie),
kappa = length(gentoo) / length(adelie),
power = NULL,
alpha = 0.05,
welch.df = TRUE,
alternative = "not equal"
)
#> Difference between Two means
#> (Independent Samples t Test)
#> H0: mu1 = mu2
#> HA: mu1 != mu2
#> ------------------------------
#> Statistical power = 1
#> n1 = 119
#> n2 = 146
#> ------------------------------
#> Alternative = "not equal"
#> Degrees of freedom = 251.35
#> Non-centrality parameter = 33.506
#> Type I error rate = 0.05
#> Type II error rate = 0
Is there a meaningful difference in flipper length between Adelie and Gentoo penguins?
Our analysis found a statistically significant difference in means (\(t\) = 34.4, \(p\)-value < 0.001). The observed effect size was very large and exceeded the Minimal Effect Size (MES) threshold (\(d\) = 4.13, 95% CI [3.70, 4.56]).
The power (1 - \(\beta\)) of the test is 1, indicating that the probability of a false negative is zero.
Based on these results, we conclude that there is a meaningful difference in flipper length between Adelie and Gentoo penguins, with Gentoo penguins having significantly longer flippers. Consequently, we reject the null hypothesis in favor of the alternative hypothesis.
\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} \\ \text{H}_{a}: \mu_{A} \neq \mu_{B} \\ \end{cases} \]
\[ \begin{cases} \text{H}_{0}: d < \text{MES} \\ \text{H}_{a}: d \geq \text{MES} \\ \end{cases} \]
Is there a meaningful difference in flipper length between Adelie, Chinstrap, and Gentoo penguins?
To test this, we will perform a One-Way ANOVA using a randomization-based empirical null distribution approach.
To ensure practical significance, we will analyze the effect size of the difference in means, considering a 95% confidence interval.
Cohen's benchmark for a small effect size will be used as our Minimum Effect Size (MES) (Cohen, 1988).
\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} = \mu_{C} \\ \text{H}_{a}: \mu_{i} \neq \mu_{j}, \ \text{for some} \ i, j \end{cases} \]
\[ \begin{cases} \text{H}_{0}: f < \text{MES} \\ \text{H}_{a}: f \geq \text{MES} \\ \end{cases} \]
library(brandr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
drop_na(flipper_length_mm, species) |>
ggplot(aes(x = species, y = flipper_length_mm, fill = species)) +
geom_boxplot(outlier.color = get_brand_color("dark-red")) +
geom_jitter(width = 0.2, alpha = 0.1) +
scale_fill_brand_d(alpha = 0.7) +
labs(x = "Species", y = "Flipper Length (mm)", fill = "Species")
library(pwr)
pwr_analysis <- pwr.anova.test(
k = 3,
f = 0.25, # Cohen's f for medium effect sizes
sig.level = 0.05,
power = 0.8
)
pwr_analysis
#>
#> Balanced one-way analysis of variance power calculation
#>
#> k = 3
#> n = 52.3966
#> f = 0.25
#> sig.level = 0.05
#> power = 0.8
#>
#> NOTE: n is number in each group
library(infer)
library(palmerpenguins)
library(tidyr)
observed_statistic <-
penguins |>
drop_na(flipper_length_mm, species) |>
specify(flipper_length_mm ~ species) |>
hypothesize(null = "independence") |>
calculate(stat = "F")
observed_statistic
#> Response: flipper_length_mm (numeric)
#> Explanatory: species (factor)
#> Null Hypothesis: independence
#> # A tibble: 1 Ă 1
#> stat
#> <dbl>
#> 1 595.
null_dist <-
penguins |>
drop_na(flipper_length_mm, species) |>
specify(flipper_length_mm ~ species) |>
hypothesize(null = "independence") |>
generate(reps = 1000, type = "permute") |>
calculate(stat = "F")
null_dist
#> Response: flipper_length_mm (numeric)
#> Explanatory: species (factor)
#> Null Hypothesis: independence
#> # A tibble: 1,000 Ă 2
#> replicate stat
#> <int> <dbl>
#> 1 1 0.367
#> 2 2 0.113
#> 3 3 1.18
#> 4 4 1.60
#> 5 5 0.182
#> 6 6 2.62
#> 7 7 1.75
#> 8 8 1.00
#> 9 9 0.0295
#> 10 10 1.48
#> # âč 990 more rows
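The p-value can then be obtained by comparing the observed \(F\) statistic against the null distribution (a sketch, using infer's `get_p_value()`):

```r
# F statistics are right-tailed, hence direction = "greater".
null_dist |>
  get_p_value(obs_stat = observed_statistic, direction = "greater")
```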
Tukey Honest Significant Differences (HSD) test.
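A common way to run this post-hoc test is base R's `aov()` and `TukeyHSD()` (a sketch; not necessarily the course's exact approach):

```r
library(palmerpenguins)
library(tidyr)

# Fit a one-way ANOVA, then run Tukey's HSD on all pairwise comparisons.
fit_aov <- aov(
  flipper_length_mm ~ species,
  data = penguins |> drop_na(flipper_length_mm, species)
)

fit_aov |> TukeyHSD()
```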
Is there a meaningful difference in flipper length between Adelie, Chinstrap, and Gentoo penguins?
Our analysis found a statistically significant difference in means (\(\text{F}\) = 595, \(p\)-value < 0.001). The observed effect size was very large and exceeded the Minimal Effect Size (MES) threshold (\(f\) = 1.873, 95% CI [1.72, Inf]).
The power (1 - \(\beta\)) of the test is 1, indicating that the probability of a false negative is zero.
Based on these results, we conclude that there is a meaningful difference in flipper length between Adelie, Chinstrap, and Gentoo penguins, with Gentoo penguins having significantly longer flippers, followed by Chinstrap penguins. Consequently, we reject the null hypothesis in favor of the alternative hypothesis.
\[ \begin{cases} \text{H}_{0}: \mu_{A} = \mu_{B} = \mu_{C} \\ \text{H}_{a}: \mu_{i} \neq \mu_{j}, \ \text{for some} \ i, j \end{cases} \]
\[ \begin{cases} \text{H}_{0}: f < \text{MES} \\ \text{H}_{a}: f \geq \text{MES} \\ \end{cases} \]
Does adding bill_depth_mm
as a predictor improve the prediction of flipper_length_mm
?
To test this, we will use Nested Linear Models.
To ensure practical significance, we will analyze the delta of the adjusted \(\text{R}^2\) of the models considering a 95% confidence interval.
We will use Cohen's benchmark for a small effect size as our Minimum Effect Size (MES) (Cohen, 1988).
\[ \begin{cases} \text{H}_{0}: \Delta \text{Adjusted} \ \text{R}^{2} = 0 \\ \text{H}_{a}: \Delta \text{Adjusted} \ \text{R}^{2} > 0 \end{cases} \]
\[ \begin{cases} \text{H}_{0}: \Delta \text{Adjusted} \ \text{R}^{2} < \text{MES} \\ \text{H}_{a}: \Delta \text{Adjusted} \ \text{R}^{2} \geq \text{MES} \end{cases} \]
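The nested models can be fit and compared with a standard \(F\) test (a sketch; the object names are illustrative):

```r
library(palmerpenguins)
library(tidyr)

data <- penguins |> drop_na(flipper_length_mm, bill_length_mm, bill_depth_mm)

# Restricted model: one predictor.
fit_restricted <- lm(flipper_length_mm ~ bill_length_mm, data = data)

# Full model: adds bill_depth_mm as a second predictor.
fit_full <- lm(flipper_length_mm ~ bill_length_mm + bill_depth_mm, data = data)

# F test comparing the nested models.
anova(fit_restricted, fit_full)
```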
pwrss.f.reg(
r2 = 0.02, # Cohen's f^2 for small effect sizes
k = 2,
power = 0.8,
alpha = 0.05
)
#> Linear Regression (F test)
#> R-squared Deviation from 0 (zero)
#> H0: r2 = 0
#> HA: r2 > 0
#> ------------------------------
#> Statistical power = 0.8
#> n = 476
#> ------------------------------
#> Numerator degrees of freedom = 2
#> Denominator degrees of freedom = 472.108
#> Non-centrality parameter = 9.696
#> Type I error rate = 0.05
#> Type II error rate = 0.2
pwrss.f.reg(
r2 = 0.15, # Cohen's f^2 for medium effect sizes
k = 2,
power = 0.8,
alpha = 0.05
)
#> Linear Regression (F test)
#> R-squared Deviation from 0 (zero)
#> H0: r2 = 0
#> HA: r2 > 0
#> ------------------------------
#> Statistical power = 0.8
#> n = 58
#> ------------------------------
#> Numerator degrees of freedom = 2
#> Denominator degrees of freedom = 54.7
#> Non-centrality parameter = 10.182
#> Type I error rate = 0.05
#> Type II error rate = 0.2
library(brandr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
drop_na(bill_length_mm, flipper_length_mm, species) |>
ggplot(
aes(
x = bill_length_mm,
y = flipper_length_mm,
color = species,
shape = species
)
) +
geom_smooth(
inherit.aes = FALSE,
mapping = aes(x = bill_length_mm, y = flipper_length_mm),
method = "lm",
formula = y ~ x,
se = FALSE,
color = get_brand_color("black")
) +
geom_point(size = 2) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
labs(
x = "Bill Length (mm)",
y = "Flipper Length (mm)",
color = "Species",
shape = "Species"
) +
scale_color_brand_d(alpha = 0.7)
library(brandr)
library(ggplot2)
library(palmerpenguins)
library(tidyr)
penguins |>
drop_na(bill_depth_mm, flipper_length_mm, species) |>
ggplot(
aes(
x = bill_depth_mm,
y = flipper_length_mm,
color = species,
shape = species
)
) +
geom_point(size = 2) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
geom_smooth(
inherit.aes = FALSE,
mapping = aes(x = bill_depth_mm, y = flipper_length_mm),
method = "lm",
formula = y ~ x,
se = FALSE,
color = get_brand_color("black")
) +
labs(
x = "Bill Depth (mm)",
y = "Flipper Length (mm)",
color = "Species",
shape = "Species"
) +
scale_color_brand_d(alpha = 0.7)
library(brandr)
library(broom)
library(ggplot2)
library(palmerpenguins)
# library(rutils) # github.com/danielvartan/rutils
library(tidyr)
data <- penguins |> drop_na(flipper_length_mm, bill_length_mm, bill_depth_mm)
fit <-
formula(flipper_length_mm ~ bill_length_mm + bill_depth_mm) |>
lm(data)
fit |>
augment(data) |>
ggplot(aes(x = bill_length_mm, y = flipper_length_mm)) +
geom_point() +
geom_line(
aes(y = .fitted, color = "Prediction"),
linewidth = 0.5,
alpha = 0.5,
) +
geom_function(
aes(color = "Adjusted prediction"),
fun = rutils:::lm_fun(fit, fix_all_but = 1, data = data),
linewidth = 1
) +
labs(
x = "Bill depth (mm)",
y = "Flipper length (mm)",
subtitle = rutils:::lm_str_fun(fit, fix_all_but = 1),
color = NULL
) +
ggplot2::scale_color_manual(
values = c(
"Prediction" = get_brand_color("green"),
"Adjusted prediction" = get_brand_color("red")
)
)
library(dplyr)
library(palmerpenguins)
library(predict3d)
library(rgl)
library(tidyr)
user_matrix <-
dplyr::tribble(
~a, ~b, ~c, ~d,
0.6233152, -0.7817951, -0.01657271, 0,
0.1739255, 0.1179437, 0.97767037, 0,
-0.7623830, -0.6122792, 0.20949011, 0,
0, 0, 0, 1
) |>
as.matrix() |>
`colnames<-`(NULL)
data <- penguins |> drop_na(flipper_length_mm, bill_length_mm, bill_depth_mm)
formula(flipper_length_mm ~ bill_length_mm + bill_depth_mm) |>
lm(data) |>
predict3d(
xlab = "Bill length (mm)",
ylab = "Bill depth (mm)",
zlab = "Flipper length (mm)",
radius = 0.75,
type = "s",
color = "red",
show.subtitle = FALSE,
show.error = FALSE
)
view3d(userMatrix = user_matrix, zoom = 0.9)
rglwidget(elementId = "1st") |>
suppressMessages() |>
suppressWarnings()
fit_restricted_stats |>
tidyr::pivot_longer(cols = dplyr::everything()) |>
adorn_rounding(5)
#> # A tibble: 12 Ă 2
#> name value
#> <chr> <dbl>
#> 1 r.squared 0.431
#> 2 adj.r.squared 0.429
#> 3 sigma 10.6
#> 4 statistic 257.
#> 5 p.value 0
#> 6 df 1
#> 7 logLik -1293.
#> 8 AIC 2591.
#> 9 BIC 2603.
#> 10 deviance 38394.
#> 11 df.residual 340
#> 12 nobs 342
fit_full_stats |>
tidyr::pivot_longer(cols = dplyr::everything()) |>
adorn_rounding(5)
#> # A tibble: 12 Ă 2
#> name value
#> <chr> <dbl>
#> 1 r.squared 0.626
#> 2 adj.r.squared 0.624
#> 3 sigma 8.63
#> 4 statistic 284.
#> 5 p.value 0
#> 6 df 2
#> 7 logLik -1221.
#> 8 AIC 2449.
#> 9 BIC 2465.
#> 10 deviance 25222.
#> 11 df.residual 339
#> 12 nobs 342
pwrss.f.reg(
r2 = fit_full_stats$adj.r.squared,
k = 2,
n =
penguins |>
drop_na(flipper_length_mm, bill_length_mm, bill_depth_mm) |>
nrow(),
power = NULL,
alpha = 0.05
)
#> Linear Regression (F test)
#> R-squared Deviation from 0 (zero)
#> H0: r2 = 0
#> HA: r2 > 0
#> ------------------------------
#> Statistical power = 1
#> n = 342
#> ------------------------------
#> Numerator degrees of freedom = 2
#> Denominator degrees of freedom = 339
#> Non-centrality parameter = 566.919
#> Type I error rate = 0.05
#> Type II error rate = 0
Does adding bill_depth_mm as a predictor meaningfully improve the prediction of flipper_length_mm?
Our analysis indicates that adding bill_depth_mm as a predictor improves the model's adjusted \(\text{R}^{2}\) by 0.19483 (95% CI [0.12069, 0.26897]) (\(\text{F}\)(2, 339) = 284, \(p\)-value < 0.001), exceeding the Minimal Effect Size (MES) threshold.
The power (1 - \(\beta\)) of the test is 1, indicating that the probability of a false negative (a Type II error) is virtually zero.
Based on these results, we conclude that bill_depth_mm meaningfully improves the prediction of flipper_length_mm. Consequently, we reject the null hypothesis in favor of the alternative hypothesis.
\[ \begin{cases} \text{H}_{0}: \Delta \text{Adjusted} \ \text{R}^{2} = 0 \\ \text{H}_{a}: \Delta \text{Adjusted} \ \text{R}^{2} > 0 \end{cases} \]
\[ \begin{cases} \text{H}_{0}: \Delta \text{Adjusted} \ \text{R}^{2} < \text{MES} \\ \text{H}_{a}: \Delta \text{Adjusted} \ \text{R}^{2} \geq \text{MES} \end{cases} \]
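As a hedged sketch (this is illustrative code, not the deck's own chunk), the restricted vs. full model comparison summarized above can be reproduced with base R's anova(). The object names fit_restricted and fit_full are assumptions here, chosen to mirror the fit_restricted_stats and fit_full_stats outputs shown earlier.

```r
library(palmerpenguins)
library(tidyr)

# Keep only complete cases for the variables of interest
data <- penguins |>
  drop_na(flipper_length_mm, bill_length_mm, bill_depth_mm)

# Restricted model (one predictor) vs. full model (two predictors)
fit_restricted <- lm(flipper_length_mm ~ bill_length_mm, data = data)
fit_full <- lm(flipper_length_mm ~ bill_length_mm + bill_depth_mm, data = data)

# F test for the contribution of bill_depth_mm (nested model comparison)
anova(fit_restricted, fit_full)

# Change in adjusted R^2 between the two models
summary(fit_full)$adj.r.squared - summary(fit_restricted)$adj.r.squared
```

The last line recovers the adjusted \(\text{R}^{2}\) improvement reported above, and the anova() table gives the F statistic for the nested comparison.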
Model diagnostics are crucial!
Itâs essential to verify that all model assumptions hold. However, a discussion on this topic is beyond the scope of this course.
You can find these assumptions in most statistical textbooks, or you can look at the original papers that introduced the models (e.g., for t-tests, see Student (1908)).
(Artwork by Allison Horst)
đš Avoid Using! đš
Objective assumption tests (e.g., AndersonâDarling test) are not advisable for large samples, as they can be overly sensitive to minor deviations. Additionally, they might overlook visual patterns that are not captured by a single metric.
Usually, a visual inspection of the data is the preferred approach in most cases.
For a straightforward critique of normality tests specifically, refer to this article by Greener (2020).
See also: Kozak & Piepho (2018), Schucany & Ng (2006), and Shatz (2024).
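A minimal sketch of what such a visual inspection can look like (illustrative code, assuming the palmerpenguins package is installed): base R's plot() method for lm objects draws the standard residual diagnostics, including the normal Q-Q plot often used in place of a formal normality test.

```r
library(palmerpenguins)
library(tidyr)

data <- penguins |>
  drop_na(flipper_length_mm, bill_length_mm, bill_depth_mm)
fit <- lm(flipper_length_mm ~ bill_length_mm + bill_depth_mm, data = data)

# Residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
par(mfrow = c(2, 2))
plot(fit)
```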
Answer the following question using the data provided:
Was there a meaningful difference in ultra-processed food consumption among Brazilian children aged 2 to 4 in 2022 between clusters B and D of the Revised Multidimensional Index for Sustainable Food Systems (MISFS-R)?
The Multidimensional Index for Sustainable Food Systems (MISFS) is a tool designed to assess the sustainability of food systems at a subnational level in Brazil, incorporating local behaviors and practices.
The MISFS-R is a revised version that introduces new indicators and a refined methodology for calculating the index.
For more details, see Carvalho et al. (2021) and Norde et al. (2023).
(Adapted from Norde et al. (2023, Figure 6))
(Reproduced from Norde et al. (2023, Figure 6))
This exercise is for educational purposes only.
The data used in this exercise requires further cleaning and validation to ensure its suitability for real-world applications. For the purposes of this analysis, we will assume the data is valid and reliable, though this assumption may not hold in practice.
Additionally, we will not verify the assumptions underlying the statistical test. To simplify the exercise, we will proceed as though all necessary assumptions are satisfied.
Please note that the results of the statistical test may not be valid due to these simplifications.
In real-world scenarios, always ensure that the assumptions of statistical tests are rigorously checked and validated before interpreting the results.
(If you were unable to download the data, you can access it here)
To assist you with this task, we have prepared a comprehensive template. You can download it here.
This template outlines all the essential steps required to complete the exercise.
We have provided a head start on the analysis, so your task is to complete the remaining sections and ensure all steps are thoroughly addressed.
Remember the data science workflow:
(Reproduced from Wickham et al. (2023))
đŁ Here are the steps you should follow:
1. Open a new Quarto project.
2. Copy the content of the template into the projectâs Quarto file.
3. Download the data.
4. Inspect the data file.
5. Import the data into R.
6. Clean, tidy, and validate the data.
7. Save the validated data.
8. Conduct a brief exploratory data analysis.
9. Conduct a t-test to compare the groups.
10. Calculate the effect size.
11. Perform a power analysis.
12. Write your conclusions.
đ§ź Recalculate the relative frequency.
â Remove municipalities with less than 10 monitored children.
đââïž If youâre stuck, ask for help.
đ Have fun!
We sure enjoyed teaching you!
(Artwork by Allison Horst)
We hope you feel more confident in your R programming skills and are ready to tackle new challenges. Remember, learning is a continuous journey, and the R community is here to support you.
đ§đ· For Brazilian graduate students, consider exploring the abnt
Quarto format. It helps you write your thesis or dissertation in compliance with the Brazilian Association of Technical Standards (ABNT) standards while ensuring full reproducibility. Check out an example here.
đââïž If you have any questions or need assistance, donât hesitate to reach out to us!
Here are other great resources to learn more about R and programming in general.
đ Online Courses
Kon, F. (n.d.). Introdução Ă ciĂȘncia da computação com python â Parte 1 [Introduction to computer science with python â Part 1] [Online course]. Coursera. https://www.coursera.org/learn/ciencia-computacao-python-conceitos (University of SĂŁo Paulo, Brazil) (pt-BR)
Kon, F. (n.d.). Introdução Ă ciĂȘncia da computação com python â Parte 2 [Introduction to computer science with python â Part 2] [Online course]. Coursera. https://www.coursera.org/learn/ciencia-computacao-python-conceitos-2s (University of SĂŁo Paulo, Brazil) (pt-BR)
Peng, R. D., Leek, J., & Caffo, B. (n.d.). Data science specialization. [Online course]. Coursera. https://www.coursera.org/specializations/jhu-data-science (Johns Hopkins University, United States) (en-US)
đ„ Videos
Code.org. (2018, January 30). How computers work [YouTube video]. https://youtube.com/playlist?list=PLzdnOPI1iJNcsRwJhvksEo1tJqjIqWbN-&si=WkuM8c-AKI-NZ3td
Zaidan, G., & Saini, S. (2025, February 25). How are microchips made? [YouTube video]. TED-Ed. https://youtu.be/IkRXpFIRUl4?si=iQ7xQuFS6DZLuBY7
đ Books
Wickham, H., Ăetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). OâReilly Media. https://r4ds.hadley.nz
Bryan, J., Hester, J., Pileggi, S., & Aja, E. D. (n.d.). What they forgot to teach you about R: The stuff you need to know about R, besides data analysis. https://rstats.wtf/
Bryan, J. (n.d.). Happy Git and GitHub for the useR. https://happygitwithr.com/ (Strongly recommended)
Wickham, H. (n.d.). Tidy design principles. https://design.tidyverse.org
Wickham, H. (n.d.). The tidyverse style guide. Retrieved July 17, 2023, from https://style.tidyverse.org
Wickham, H. (2019). Advanced R (2nd ed.). CRC Press. https://adv-r.hadley.nz
This presentation was created with the Quarto Publishing System. The code and materials are available on GitHub.
(Artwork by Allison Horst)
In accordance with the American Psychological Association (APA) Style, 7th edition.
(Artwork by Allison Horst)