University of São Paulo
2024-06-20
This presentation aims to organize our research efforts in a way that makes them transparent, reproducible, and in accordance with the best international standards.
Here is our itinerary:
Center for Open Science. (February 15, 2023) Introduction to OSF. https://youtu.be/X07mBq2tnMg?si=m_mXwKrw0LvHazTg
Ellis, S. E., & Leek, J. T. (2018). How to share data for collaboration. The American Statistician, 72(1), 53–57. https://doi.org/10.1080/00031305.2017.1375987
“The scientific research enterprise is built on a foundation of trust. Scientists trust that the results reported by others are valid. Society trusts that the results of research reflect an honest attempt by scientists to describe the world accurately and without bias. But this trust will endure only if the scientific community devotes itself to exemplifying and transmitting the values associated with ethical scientific conduct.”
(Drawings by John McKiernan)
“Open Science is an umbrella term encompassing a multitude of assumptions about the future of knowledge creation and dissemination.” (Fecher & Friesike, 2014)
At a high level, it can be defined as follows.
“Open Science is scholarly research that is collaborative, transparent and reproducible and whose outputs are publicly available.” (European Commission & Directorate-General for Research and Innovation, 2018)
(Figure by Lotta Tomasson)
OSF is a free and open source project management tool, created by the Center for Open Science, that supports researchers throughout their entire project lifecycle.
(Image by Center for Open Science)
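As a sketch of what this looks like in practice, the osfr R package talks to the OSF API. The project GUID and file name below are hypothetical placeholders, and authentication assumes a personal access token stored in the `OSF_PAT` environment variable:

```r
library(osfr)

osf_auth()  # reads a personal access token from the OSF_PAT env variable

# Retrieve an existing project by its GUID (the 5-character code in its URL);
# "abc12" is a hypothetical placeholder
project <- osf_retrieve_node("abc12")

# Upload a file and list the project's contents
osf_upload(project, path = "research-protocol.pdf")
osf_ls_files(project)
```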
Popular registries
OSF offers 50 GB of free storage for each public project (on Google Cloud) and allows you to connect your project with other cloud storage services.
(Image by Center for Open Science)
(Image by Center for Open Science)
OSF lets you have fine-grained control over each component of your project.
📁 Research protocol ✅
📁 Research data ✅
📁 Research code ✅
📁 Research results ✅
📁 Secrets of the Masonic society (Level 5!) ❌
A preprint repository is a platform that allows researchers to share their research outputs before they are peer-reviewed.
An unlicensed work is an “all rights reserved” work. This means that you can’t use it without the author’s permission.
“Reproducibility means that research data and code are made available so that others are able to reach the same results as are claimed in scientific outputs. Closely related is the concept of replicability, the act of repeating a scientific methodology to reach similar conclusions. These concepts are core elements of empirical research.”
(Illustration by Scriberia)
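A tiny R illustration of the idea (not from the original slides): fixing the random seed is often the first step toward making a stochastic analysis reproducible.

```r
# With a fixed seed, "random" draws are identical on every run
set.seed(2024)
sample_a <- rnorm(3)

set.seed(2024)
sample_b <- rnorm(3)

identical(sample_a, sample_b)  # TRUE: same seed, same results
```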
“Big data will be dead in 5 years: Everything will be big data, so it won’t need a special name.” (Gavin, 2020)
(Cartoon by David Fletcher)
Working with big data in Excel or in GUI-based (graphical user interface) statistical software is extremely challenging, if not impossible.
Excel, for example, can struggle with performance issues and has a maximum row limit (1,048,576 rows), which is often insufficient for big data projects.
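For scale beyond that, R offers several options. Here is a minimal sketch with the arrow package, which can query files larger than memory (the file name and columns are hypothetical):

```r
library(arrow)
library(dplyr)

# Open the file as a dataset: rows are scanned lazily, not loaded into RAM
ds <- open_dataset("measurements.csv", format = "csv")

ds |>
  filter(temperature > 30) |>
  summarise(mean_temp = mean(temperature)) |>
  collect()  # only the small summary table is pulled into memory
```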
(WorldClim 2.1 data. June mean temperature (°C) in South America (1970-2000))
Python is good for learning how to program, but it is much easier to learn how to work with data in R. In academia, both programming languages are very important.
(Image author unknown)
Programming in movies versus programming in real life:
I don’t know how to code! 😭
How do I learn R?
(Illustration by Allison Horst)
R was created by scientists for scientists.
It’s supported by a very diverse, inclusive, and non-toxic community.
(Illustration by Allison Horst)
(Illustration by Allison Horst)
Dynamic documents seamlessly integrate text with output from a programming language, automatically updating the output whenever the code is executed.
Examples
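As a minimal sketch, here is what a dynamic document looks like in Quarto (the same system used to build this presentation); the inline value is recomputed at every render:

````markdown
---
title: "A minimal dynamic document"
format: html
---

```{r}
mean_mpg <- mean(mtcars$mpg)
```

The mean fuel efficiency in `mtcars` is `r round(mean_mpg, 1)` miles
per gallon. This sentence updates automatically whenever the document
is rendered.
````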
Pipeline tools coordinate the pieces of computationally demanding analysis projects. They can be used to automate the execution of a series of tasks.
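In R, the targets package is one such tool. A minimal sketch of a `_targets.R` file (the data file and cleaning steps are hypothetical):

```r
library(targets)

tar_option_set(packages = c("readr", "dplyr"))

# Each target is a step; targets re-runs a step only when its inputs change
list(
  tar_target(raw_data, read_csv("data/raw.csv")),
  tar_target(clean_data, filter(raw_data, !is.na(value))),
  tar_target(model, lm(value ~ group, data = clean_data))
)
```

Running `targets::tar_make()` then executes the pipeline, skipping any target that is already up to date.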
Reproducible environments ensure that your code runs the same way on different machines, regardless of when it is run.
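In R, the renv package is a common way to achieve this. A typical workflow, run from the project root:

```r
renv::init()      # create a project-local package library and a lockfile
# ...install and use packages as usual...
renv::snapshot()  # record the exact package versions in renv.lock
renv::restore()   # on another machine: reinstall those exact versions
```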
We’ve only touched upon the vast landscape of open science. There are many other tools and concepts that we didn’t cover, such as:
(Illustration by Allison Horst)
Things will not always go as planned. But that’s ok. We’ll figure it out together.
This presentation was created using the Quarto Publishing System. Code and materials are available on GitHub.
These beautiful illustrations were made by Allison Horst. Thank you Allison!
(Illustration by Allison Horst)
In accordance with the American Psychological Association (APA) Style, 7th edition.
(Illustration by Allison Horst)
APIs are a set of rules that allow different software applications to communicate with each other.
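For example, here is a sketch of querying the public Crossref API from R with the httr2 package, requesting metadata for the Ellis & Leek (2018) article cited earlier:

```r
library(httr2)

# Crossref exposes article metadata at https://api.crossref.org/works/<DOI>
resp <- request("https://api.crossref.org/works/10.1080/00031305.2017.1375987") |>
  req_perform()

metadata <- resp_body_json(resp)
metadata$message$title  # the article's title, parsed from the JSON response
```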
These suggestions may feel overwhelming at first. It’s important to start small and gradually incorporate these practices into the project workflow.
The order of the suggestions is not important.
“Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning.” (Reis & Housley, 2022)
You can think of data engineering as the plumbing of data science.
It’s a confusing term, with many definitions and interpretations.
For some, data science is just statistics with extra hype (Broman, 2013). For others, it’s a new interdisciplinary field that synthesizes statistics, informatics, computing, communication, management, and sociology (Cao, 2017).
A high-level definition: “Data science is the study of the generalizable extraction of knowledge from data.” (Dhar, 2013)
(Figure from Wickham et al. (2023))
“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.” (Lohr, 2014).
(Figure by Reis & Housley (2022))
Data is an abstraction. It’s a representation of the world around us. Without context, it has no meaning.
“A value chain, roughly, consists of a sequence of activities that increase the value of a product step by step. […] One should realize that although the schema nicely organizes data analysis activities, in practice, the process is hardly linear.”
Some R data classes
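A quick illustration of how R reports them:

```r
class(TRUE)          # "logical"
class(42L)           # "integer"
class(3.14)          # "numeric"
class("abc")         # "character"
class(factor("a"))   # "factor"
class(Sys.Date())    # "Date"
class(data.frame())  # "data.frame"
```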
Data validation techniques are used to ensure that data is accurate, consistent, and reliable.
Examples
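One option in R is the validate package. A sketch with a small, made-up data frame (the rules and variables are hypothetical):

```r
library(validate)

patients <- data.frame(
  id  = c(1, 2, 3),
  age = c(34, -5, 120)
)

# Declare the rules the data must satisfy...
rules <- validator(
  !is.na(id),
  age >= 0,
  age <= 110
)

# ...and check every record against them
summary(confront(patients, rules))
```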
(Illustration by Allison Horst)
(Illustration by Allison Horst)
Learn more in Wickham et al. (2023), chap. 5.
(Figure from Wickham et al. (2023))
(Illustration by Allison Horst)
(Illustration by Allison Horst)
Spreadsheet syndrome is a term used to describe the problems that arise from using spreadsheets to manage data.
(Image by 9Dots Management)
“Developed by E. F. Codd of IBM in 1970, the relational model is based on mathematical set theory and represents data as independent relations. Each relation (table) is conceptually represented as a two-dimensional structure of intersecting rows and columns. The relations are related to each other through the sharing of common entity characteristics (values in columns)” (Coronel & Morris, 2019).
(Figure by Ellis & Leek (2018))
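A minimal R illustration of two relations related through a common column (the tables are made up):

```r
patients <- data.frame(
  patient_id = c(1, 2),
  name       = c("Ana", "Bruno")
)

visits <- data.frame(
  patient_id = c(1, 1, 2),
  visit_date = as.Date(c("2024-01-10", "2024-03-02", "2024-02-15"))
)

# The shared "patient_id" column relates the two tables
merge(patients, visits, by = "patient_id")
```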
There are many open data formats available for researchers to use. Here are some examples:
(Excel files are not an open data format!)
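A sketch of writing the same table to two open formats from R; either file can be read by virtually any tool, with no proprietary software required:

```r
data <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

write.csv(data, "data.csv", row.names = FALSE)   # plain-text CSV
jsonlite::write_json(data, "data.json")          # JSON
```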
A data management plan (DMP) is a formal document that outlines how data will be managed throughout the research process.
“Kanban is a tool that allows you to fully visualize the status of your processes through a board with dynamic columns that make all tasks and process steps clear.”
(Author unknown)