Enhancing the Global Syndemic project with Open Science practices

Daniel Vartanian

University of São Paulo

2024-06-20

Hi there! 👋

This presentation aims to organize our research efforts in a way that makes them transparent, reproducible, and in accordance with the best international standards.

Here is our itinerary:

  1. Introduction
  2. Fostering a culture of open science
  3. Implementing a comprehensive project and data management system
  4. Promoting reproducible research practices
  5. Final remarks

Materials sent before the presentation

Center for Open Science. (2023, February 15). Introduction to OSF [Video]. YouTube. https://youtu.be/X07mBq2tnMg?si=m_mXwKrw0LvHazTg

Ellis, S. E., & Leek, J. T. (2018). How to share data for collaboration. The American Statistician, 72(1), 53–57. https://doi.org/10.1080/00031305.2017.1375987

Why is this important?

Key points

  1. Fostering a culture of open science.
  2. Implementing a comprehensive project and data management system.
  3. Promoting reproducible research practices.

Fostering a culture of open science

We have a problem…

“The scientific research enterprise is built on a foundation of trust. Scientists trust that the results reported by others are valid. Society trusts that the results of research reflect an honest attempt by scientists to describe the world accurately and without bias. But this trust will endure only if the scientific community devotes itself to exemplifying and transmitting the values associated with ethical scientific conduct.” (National Academy of Sciences et al., 2009)

Closed doors science

Reproducibility crisis

A real fictional example

  • Multicentric cohort study.
  • No centralized data storage.
  • No data management plan.
  • Hierarchical communication.
  • Reluctance to share data and code.
  • Complete absence of standards.
  • Complete absence of documentation.
  • Proprietary data formats.
  • Non-transparent research process.
  • Unreproducible results.

Open Science

“Open Science is an umbrella term encompassing a multitude of assumptions about the future of knowledge creation and dissemination.” (Fecher & Friesike, 2014)

At a high level, it can be defined as follows:

“Open Science is scholarly research that is collaborative, transparent and reproducible and whose outputs are publicly available.” (European Commission & Directorate-General for Research and Innovation, 2018)

Implementing a comprehensive project and data management system

The Open Science Framework (OSF)

OSF is a free, open-source project management tool, created by the Center for Open Science, that supports researchers throughout the entire project lifecycle.

Registrations

  1. “It is easy to obtain confirmations, or verifications, for nearly every theory—if we look for confirmations.”
  2. “Confirmations should count only if they are the result of risky predictions; that is to say, if, unenlightened by the theory in question, we should have expected an event which was incompatible with the theory—an event which would have refuted the theory.”

(Popper, 2002)

Popular registries

Data storage

OSF offers 50 GB of free storage for each public project (on Google Cloud) and allows you to connect your project with other cloud storage services.

Components

  • Everything is a component in OSF, including projects.
  • Subprojects can be nested within main projects.

Managing access

OSF lets you exercise fine-grained control over each component of your project.

📁 Research protocol ✅

📁 Research data ✅

📁 Research code ✅

📁 Research results ✅

📁 Secrets of the masonic society (Level 5!) ❌

Digital objects & version control

  • OSF provides unique, persistent URLs for all components and files.
  • It can also provide DOIs (Digital Object Identifiers) for all components.

  • OSF has a built-in version control system; all changes are tracked and can be reverted.

Preprints

A preprint repository is a platform that allows researchers to share their research outputs before they are peer-reviewed.

Open licenses

An unlicensed work is an “all rights reserved” work. This means that you can’t use it without the author’s permission.

Project page

Suggestions

  1. Project and data management system
  • Implement OSF as the central system for managing the project and associated data.
  2. Subprojects integration
  • Integrate each subproject (e.g., scope review, data analysis/dashboard, causal model, agent-based modeling) into the main project on OSF.
  3. Protocol registration
  • Register research protocols for both the main project and each subproject.
  • Ensure alignment of protocols before registration by using drafts.

Promoting reproducible research practices

Reproducibility versus Replicability

“Reproducibility means that research data and code are made available so that others are able to reach the same results as are claimed in scientific outputs. Closely related is the concept of replicability, the act of repeating a scientific methodology to reach similar conclusions. These concepts are core elements of empirical research.”

(Bezjak et al., 2018)

Big data

“Big data will be dead in 5 years: Everything will be big data, so it won’t need a special name.” (Gavin, 2020)

Big data wrangling

  • Working with big data using Excel or GUI-based (Graphical User Interface) statistical software is extremely challenging, if not impossible.

  • Excel, for example, can struggle with performance issues and has a maximum row limit (1,048,576 rows), which is often insufficient for big data projects.

  • The best tools for handling big data are R and Python.
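To make this concrete, here is a minimal sketch of querying a CSV file too large for Excel using the {arrow} and {dplyr} R packages, which can scan a file without loading it all into memory. The file path and the `uf` (state) column are hypothetical:

```r
library(arrow)
library(dplyr)

# Scan the CSV lazily; arrow reads only the columns
# and rows that the query below actually needs.
open_dataset("data/sisvan_2023.csv", format = "csv") |>
  group_by(uf) |>              # hypothetical state column
  summarise(records = n()) |>
  collect()                    # materialize the small result in R
```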

Example: SISVAN data (Tabular)

  • 2023: 34 cols × 50,544,073 rows = 1,718,498,482 data points.
  • 2022: 34 cols × 45,862,105 rows = 1,559,311,570 data points.
  • 2021: 34 cols × 29,853,217 rows = 1,015,009,378 data points.
  • 2020: 34 cols × 22,720,515 rows = 772,497,510 data points.
  • 2019: 34 cols × 30,175,272 rows = 1,025,959,248 data points.

Example: Spatial data (Raster)

  • Not all data is tabular; spatial data can be very large and complex.
  • Excel cannot handle spatial data, and GUI-based statistical software, even when capable of handling spatial data, is often limited and struggles with performance issues.

Open-source programming languages

Now, let’s be rational about this…

Python is fine too

Python is a good language for learning how to program, but learning how to work with data is much easier in R. In academia, both languages are very important.

It’s not what you think

Programming in movies versus programming in real life:

Best (and free!) resources

I don’t know how to code! 😭

How do I learn R?

R has the best communities

R was created by scientists for scientists.

Its community is diverse, inclusive, and non-toxic.

Important R communities and events

You’ll be up and running in no time

Stata versus R

Dynamic documents

Dynamic documents seamlessly integrate text with output from a programming language, automatically updating the output whenever the code is executed.
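In a Quarto document, for instance, prose and R code live in the same file and the rendered output always reflects the current data. A minimal `.qmd` sketch (the `data` object and its columns are illustrative):

````markdown
---
title: "Analysis report"
format: html
---

The sample contains `r nrow(data)` participants.

```{r}
summary(data$age)
```
````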

Examples

Pipelines

Pipeline tools coordinate the pieces of computationally demanding analysis projects. They can be used to automate the execution of a series of tasks.
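In R, the {targets} package is one widely used pipeline tool. A minimal `_targets.R` sketch (file and column names are hypothetical):

```r
# _targets.R
library(targets)

tar_option_set(packages = c("readr", "dplyr"))

list(
  # Tracked as a file, so downstream targets re-run when it changes
  tar_target(raw_file, "data/sisvan_2023.csv", format = "file"),
  tar_target(raw_data, readr::read_csv(raw_file)),
  tar_target(by_state, dplyr::count(raw_data, uf))
)
```

Running `targets::tar_make()` then executes only the targets whose code or upstream data have changed.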

Reproducible environments

Reproducible environments ensure that your code will run the same way on different machines, regardless of when it is run.
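In R, the {renv} package is one way to achieve this. The basic workflow:

```r
renv::init()      # create a project-local package library
renv::snapshot()  # record exact package versions in renv.lock
renv::restore()   # on another machine, reinstall those same versions
```

Committing the `renv.lock` file alongside the code lets collaborators recreate the original package environment.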

Example of a reproducible research

Suggestions

  1. Dynamic documents
  • Use dynamic documents to maintain an updated and reproducible record of data analyses.
  2. Programmatic approach
  • Emphasize learning and applying programmatic methods for handling large datasets, as this is the most effective approach for managing and analyzing extensive data collections.
  3. Reproducible environments
  • Use reproducible environments to time-proof the code and ensure that it will run the same way on different machines.
  • Employ pipelines to streamline and automate the analysis process.

Some other things that we didn’t cover

We’ve only touched upon the vast landscape of open science. There are many other tools and concepts that we didn’t cover, such as:

Theory versus practice

Things will not always go as planned. But that’s ok. We’ll figure it out together.

Final remarks

Licenses: MIT (code); CC BY 4.0 (content).

This presentation was created using the Quarto Publishing System. Code and materials are available on GitHub.

These beautiful illustrations were made by Allison Horst. Thank you Allison!

References

In accordance with the American Psychological Association (APA) Style, 7th edition.

Ackoff, R. (1989). From data to wisdom. Journal of Applied Systems Analysis, 16, 3–9.
Alionço, A. (2017, May 2). What is kanban? The beginner’s guide. Pipefy. https://www.pipefy.com/blog/what-is-kanban/
Baker, M. (2015). Over half of psychology studies fail reproducibility test. Nature. https://doi.org/10.1038/nature.2015.18248
Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452–454. https://doi.org/10.1038/533452a
Bezjak, S., Clyburne-Sherin, A., Conzett, P., Fernandes, P., Görögh, E., Helbig, K., Kramer, B., Labastida, I., Niemeyer, K., Psomopoulos, F., Ross-Hellauer, T., Schneider, R., Tennant, J., Verbakel, E., Brinken, H., & Heller, L. (2018). Open science training handbook. Zenodo. https://doi.org/10.5281/ZENODO.1212496
Broman, K. (2013, April 5). Data science is statistics. The stupidest thing... https://kbroman.wordpress.com/2013/04/05/data-science-is-statistics/
Cao, L. (2017). Data science: A comprehensive overview. ACM Computing Surveys, 50(3), 43. https://doi.org/10.1145/3076253
Coronel, C., & Morris, S. A. (2019). Database systems: Design, implementation, and management (13th ed.). Cengage.
Dhar, V. (2023). Data science and prediction. Communications of the ACM, 56(12), 64–73. https://doi.org/10.1145/2500499
Ellis, S. E., & Leek, J. T. (2018). How to share data for collaboration. The American Statistician, 72(1), 53–57. https://doi.org/10.1080/00031305.2017.1375987
European Commission, & Directorate-General for Research and Innovation. (2018). OSPP-REC: Open science policy platform recommendations. European Union. https://doi.org/10.2777/958647
Fecher, B., & Friesike, S. (2014). Open science: One term, five schools of thought. In S. Bartling & S. Friesike (Eds.), Opening science: The evolving guide on how the internet is changing research, collaboration and scholarly publishing (pp. 17–47). Springer. https://doi.org/10.1007/978-3-319-00026-8
Gavin, L. (2020, October 20). Big data will be dead in 5 years. Towards Data Science. https://towardsdatascience.com/big-data-will-be-dead-in-5-years-ef4344269aef
GO FAIR initiative. (n.d.). GO FAIR initiative: Make your data & services FAIR. GO FAIR. Retrieved June 10, 2024, from https://www.go-fair.org/
Jonge, E. de, & Loo, M. van der. (2018). Statistical data cleaning with applications in R. John Wiley & Sons.
Lohr, S. (2014, August 18). For big-data scientists, “janitor work” is key hurdle to insights. The New York Times: Technology. https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
Meyer, M. N. (2018). Practical tips for ethical data sharing. Advances in Methods and Practices in Psychological Science, 1(1), 131–144. https://doi.org/10.1177/2515245917747656
National Academy of Sciences, National Academy of Engineering, & Institute of Medicine of the National Academies. (2009). On being a scientist: A guide to responsible conduct in research: Third edition (3rd ed.). The National Academies Press. https://doi.org/10.17226/12192
Popper, K. R. (2002). Conjectures and refutations: The growth of scientific knowledge. Routledge.
Project Management Institute. (2017). The agile practice guide. The Project Management Institute.
Reis, J., & Housley, M. (2022). Fundamentals of data engineering: Plan and build robust data systems. O’Reilly.
Rowley, J. (2007). The wisdom hierarchy: Representations of the DIKW hierarchy. Journal of Information Science, 33(2), 163–180. https://doi.org/10.1177/0165551506070706
Stellman, A., & Greene, J. (2014). Learning agile. O’Reilly.
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for data science: Import, tidy, transform, visualize, and model data (2nd ed.). O’Reilly. https://r4ds.hadley.nz/
Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., Da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3(1), 160018. https://doi.org/10.1038/sdata.2016.18

Thank you!

(AP) Appendices

(AP) App. Programming Interfaces (APIs)

APIs are a set of rules that allow different software applications to communicate with each other.

# Brazil's population in 2022 (census data), retrieved from IBGE's SIDRA API
sidrar::get_sidra(
  api = "/t/7358/n1/all/v/all/p/all/c2/6794/c287/100362/c1933/49039"
) |>
  suppressMessages() |>
  magrittr::extract2("Valor") |>
  prettyNum(big.mark = ",")
#> [1] "216,284,269"

(AP) Summary of points and suggestions

These suggestions may feel overwhelming at first. It’s important to start small and gradually incorporate these practices into the project workflow.

The order of the suggestions is not important.

(AP) Summary of points and suggestions

  1. Project and data management system
  • Implement OSF as the central system for managing the project and associated data.
  2. Subprojects integration
  • Integrate each subproject (e.g., scope review, data analysis, causal model, agent-based modeling) into the main project on OSF.
  3. Protocol registration
  • Register research protocols for both the main project and each subproject.
  • Ensure alignment of protocols before registration by using drafts.

(AP) Summary of points and suggestions

  4. Dynamic documents
  • Use dynamic documents to maintain an updated and reproducible record of data analyses.
  5. Programmatic approach
  • Emphasize learning and applying programmatic methods for handling large datasets, as this is the most effective approach for managing and analyzing extensive data collections.
  6. Reproducible environments
  • Use reproducible environments to time-proof the code and ensure that it will run the same way on different machines.
  • Employ pipelines to streamline and automate the analysis process.

(AP) Summary of points and suggestions

  7. Data management plan
  • Develop one data management plan for the whole project.
  8. Open licenses
  • Apply open licenses to both the data and the code to ensure transparency and accessibility.
  9. Agile management methodology
  • Use a project management tool to track the progress of the project and subprojects (e.g., Taiga, Jira, Trello).
  • A simple Kanban board for each subproject can be very helpful. There’s no need to overcomplicate things.

(AP) Summary of points and suggestions

  10. Data science program
  • Follow the guidelines and methodologies proposed by Wickham et al. (2023) when conducting data analyses.
  • Tidy and document all the code and data.

(AP) Establishing standardized guidelines for data practices

(AP) Data engineering

“Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning.” (Reis & Housley, 2022)

You can think of data engineering as the plumbing of data science.

(AP) Data science

It’s a confusing term, with many definitions and interpretations.

For some, data science is just statistics (Broman, 2013) (hype statistics). For others, it’s a new interdisciplinary field that synthesizes statistics, informatics, computing, communication, management, and sociology (Cao, 2017).

A high-level definition: “Data science is the study of the generalizable extraction of knowledge from data.” (Dhar, 2023)

(AP) Data engineering versus data science

“Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.” (Lohr, 2014).

(AP) What is data after all?

Data is an abstraction. It’s a representation of the world around us. Without context, it has no meaning.

(AP) Statistical value chain

“A value chain, roughly, consists of a sequence of activities that increase the value of a product step by step. […] One should realize that although the schema nicely organizes data analysis activities, in practice, the process is hardly linear.” (Jonge & Loo, 2018)

(AP) Raw data

  • “With raw data, we mean the data as it arrives at the desk of the analyst. The state of such data may of course vary enormously, depending on the data source.” (Jonge & Loo, 2018).
  • “If the researcher has made any modifications to the raw data, it is not the raw form of the data.” (Ellis & Leek, 2018).

(AP) Data classes

Some R data classes

  • Character (e.g., “Maria”, “John”).
  • Factor (e.g., 1 = “Male”, 2 = “Female”).
  • Integer (e.g., 1, 2, 3).
  • Float (e.g., 1.0, 2.0, 3.0).
  • Complex (e.g., 1 + 2i, 3 + 4i).
  • Boolean (e.g., TRUE, FALSE).
  • Date (e.g., 2023-01-01) (linear time).
  • Date-time (e.g., 2023-01-01 00:00:00) (linear time).
  • Interval (e.g., 2023-01-01 00:00:00–2023-12-15 15:40:00) (linear time).
  • Duration (e.g., 1 year, 2 months, 3 days) (linear time).
  • Period (e.g., 1 year, 2 months, 3 days) (linear(ish) time).
  • Time of day (e.g., 01:00:00) (circular time).
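These classes can be inspected directly in R with `class()`. Note that R reports floats as "numeric" and Booleans as "logical", and that interval, duration, and period classes come from add-on packages such as {lubridate}:

```r
class("Maria")                            #> "character"
class(factor("Male"))                     #> "factor"
class(1L)                                 #> "integer"
class(1.5)                                #> "numeric"
class(1 + 2i)                             #> "complex"
class(TRUE)                               #> "logical"
class(as.Date("2023-01-01"))              #> "Date"
class(as.POSIXct("2023-01-01 00:00:00"))  #> "POSIXct" "POSIXt"
```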

(AP) Valid data

Data validation techniques are used to ensure that data is accurate, consistent, and reliable.

Examples

  • Impossible values (e.g., negative age);
  • Inconsistent values (e.g., a person with a height of 2 meters and a weight of 20 kg);
  • Improbable values (e.g., a person 200 years old);
  • Duplicated values (e.g., the same person with two different ages).
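In R, rules like these can be written explicitly, for example with the {validate} package. A sketch, where the variable names and thresholds are illustrative:

```r
library(validate)

rules <- validator(
  age >= 0,                       # impossible: negative age
  age <= 120,                     # improbable: age over 120
  weight_kg / height_m^2 >= 10,   # inconsistent: implausible BMI
  is_unique(person_id)            # duplicated: repeated records
)

# Confront the data with the rules and summarize pass/fail counts
summary(confront(data, rules))
```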

(AP) Tidy data

(AP) Untidy to tidy

Learn more in Wickham et al. (2023), chap. 5.

(AP) Spreadsheet syndrome

Spreadsheet syndrome is a term used to describe the problems that arise from using spreadsheets to manage data.

(AP) Relational databases

“Developed by E. F. Codd of IBM in 1970, the relational model is based on mathematical set theory and represents data as independent relations. Each relation (table) is conceptually represented as a two-dimensional structure of intersecting rows and columns. The relations are related to each other through the sharing of common entity characteristics (values in columns)” (Coronel & Morris, 2019).

(AP) Data documentation

(AP) The codebook

(AP) Open data formats

There are many open data formats available for researchers to use. Here are some examples:

(Excel files are not an open data format!)

(AP) Fair principles

(AP) Data management plans

A data management plan (DMP) is a formal document that outlines how data will be managed throughout the research process.

(AP) Project management

(AP) Why is this important?

(AP) What is a project?

  1. A unique and temporary endeavor.
  2. Has a defined beginning and end.
  3. Its purpose is to create or change a specific product or service.
  4. Has limited resources.

(AP) KISS principle

(AP) Kanban

“Kanban is a tool that allows you to fully visualize the status of your processes through a board with dynamic columns that make all tasks and process steps clear.” (Alionço, 2017)

(AP) Kanban board

(AP) Kanban origin

(AP) Kanban principles

  1. Start with what you do now.
  2. Agree to pursue incremental, evolutionary change.
  3. Respect the current process, roles, responsibilities, and titles.