Installation Instructions

What Is Reproducibility?

Reproducibility versus replication

  • Replicating a result
    • Same approach, different data
    • Ideal, but generating new data is costly
    • Journals prefer novel findings
  • Reproducing a result
    • Same approach, same data

What is reproducibility?

An attainable minimum standard for assessing the value of scientific claims.

– Sandve et al., 2013

Reproducibility crisis

Reproducibility exercise

  • Could I reproduce Figure 1 from your last publication using only what is available online?
    • Minimum: Is the data available?
    • Better: Is the code available?
    • Best: Did you detail which software was used, including versions?

As computational researchers, we can rely on the fact that computers are very good at following instructions.

Why are we facing this crisis?

  • Reproducibility isn’t taught to researchers
  • Little incentive to spend time on this due to publish or perish model
  • Now: Funding agencies are catching on to the problem
  • They are starting to require:
    • Open data (e.g. data management plans)
    • Open access publications
    • Is open-source code the next requirement?

Why should I care?

  • Moral responsibility as scientists (bla, bla, bla…)
  • It makes your life easier as a researcher
  • You often have to revisit past analyses
    • Realizing a mistake
    • New or updated data and/or methods

Your primary collaborator is yourself 6 months from now, and your past self doesn’t answer emails.

Today’s project

Principle #1: Organize your files

Why should you organize your files?

  • There is a huge difference between navigating an organized project and a disorganized one
  • Data analyses are becoming more complex
    • Different data types
    • External datasets
    • Multiple analyses in different scripts

How can you organize your files?

  • Everything in one folder
    • Helps with Principle #2
  • There are many possible project structures
    • Most important: Be consistent
    • Create a README file with a description of your project structure
    • Every file should have a single logical location

Using sub-folders effectively

Naming sub-folders

Note to self: The README file

  • Context of data collection (goals, hypotheses)
  • Data collection methods
  • Structure of files
  • Quality assurance (data validation, checking)
  • Data modifications
  • Confidentiality and permissions
  • Names of labels and variables
  • Explanations of codes and abbreviations

Naming data files

ISO-8601: The date format to rule them all

Tying everything together

|-- data
|  |-- yvr.2016-06-13.bird_counts.csv
|  |-- yvr.2016-07-27.bird_counts.csv
|  `-- yul.2016-06-13.bird_counts.csv
|
|-- docs  # Notes and manuscript
|  |-- notebook.md
|  |-- manuscript.md
|  `-- changelog.txt
|
|-- results  # Output (disposable)
|  `-- summarized_results.csv
|
`-- scripts  # Code
   |-- sightings_analysis.R
   `-- runall.R

Hands-on practice

Organize your files

  • Create project directory: predict_hd
  • Create sub-directories
    • data, docs, results, scripts
  • Create a simple README
  • Alternative: ProjectTemplate package
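
A minimal sketch of these steps from the R console (the README text is only an illustration):

dir.create("predict_hd")
for (d in c("data", "docs", "results", "scripts")) {
  dir.create(file.path("predict_hd", d))
}
writeLines(
  c("# predict_hd",
    "",
    "Predicting heart disease from the Heart.csv dataset.",
    "Layout: data/ (inputs), docs/ (notes), results/ (outputs), scripts/ (code)."),
  file.path("predict_hd", "README.md")
)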

Principle #2: Be portable

Why should the project be portable?

  • You shouldn’t assume your code will only be run on your computer
  • You can’t collaborate otherwise
    • With others
    • With yourself (e.g. home and work computers)

How can a project be made portable?

  • Again: Everything in one folder
  • Avoid hard-coding absolute file paths; parameterize any that are unavoidable
  • Use relative paths instead
  • Small files can be tracked using version control

Hands-on practice

Be portable

  • Try setwd() and note why it is not portable
  • Create R Project, close RStudio, open the project and run getwd()
  • Even better: The here R package
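
A small sketch contrasting the two approaches (the absolute path is made up):

# Not portable: a hard-coded absolute path that only exists on one machine
# setwd("C:/Users/alice/projects/predict_hd")
# heart <- read.csv("data/raw_data/Heart.csv")

# Portable: here() builds paths relative to the project root (e.g. where the .Rproj file lives)
library(here)
heart <- read.csv(here("data", "raw_data", "Heart.csv"))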

Principle #3: Track your project history

Why should you track your project history?

  • Small changes can have big—often unintended—consequences
  • You can revert to a version that generated a specific result
  • If you use GitHub, it can act as a back-up
  • Simplifies collaboration
    • With others
    • With yourself on different computers

Which tools can track your project history?

  • Version control systems
    • Git, Subversion, Mercurial
  • Git is popular because of GitHub
  • Git integrates nicely with RStudio
  • Alternatively, you can maintain copies of your scripts
  • Look into SFU Vault
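
A minimal sketch, assuming you want to set things up from the R console with the usethis package (the RStudio Git pane or the command line work just as well):

# install.packages("usethis")   # once, if not already installed
usethis::use_git()       # initialize a Git repository in the current project
usethis::use_github()    # create a matching GitHub repository (requires a GitHub token)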

How can you track your project history?

  • Rule of thumb: Track any file created by a human as soon as it is created
  • Works best with plain text files
  • You can also track your data if it’s not too big
  • But you shouldn’t track results
    • Ideally, you can regenerate all of your results from scratch (Principle #5)
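
As an illustration, a hypothetical .gitignore for the project layout above, excluding regenerable results and R session files:

results/
*.rds
.Rhistory
.RData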

Hands-on practice

Track your project history

When should you commit to Git?

  • Small changes
  • Related in some way (“atomic”)
  • There’s no specific time interval
    • Whatever time it takes to complete one task
  • Sync with GitHub regularly (e.g. at the end of each day)

Principle #4: Treat data as read-only

Why shouldn’t you edit your data manually?

  • Error-prone
  • Inefficient
  • Difficult to reproduce
  • Irreversible
    • Hopefully you have backups of your data

How can you treat data as read-only?

  • Leave the data as it was given to you
    • From an instrument, a survey, a collaborator, etc.
  • Separate raw data from cleaned-up data
  • Configure raw data as read-only
  • Avoid manual changes
    • Ideally, automate changes with a script
    • Otherwise, make a copy and note what changes you made and why

Example

|-- README
|-- data
|  |
|  |-- raw_data
|  |   `-- birds_count_table.csv  # Never edited!
|  |
|  `-- clean_data
|      `-- birds_count_table.clean.csv
|
|-- src
|  `-- clean_data.R  # Script instead of manual editing
|
[...]

Hands-on practice

Treat data as read-only

  • Split data folder into data/raw_data and data/clean_data
  • Move raw data (Heart.csv) into data/raw_data
  • Set files as read-only
    • Sys.chmod("data/raw_data/Heart.csv", "555")
  • Attempt to edit the Heart.csv raw data file
  • Write R script to fix dataset (NA values)
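
A minimal sketch of these steps in R (file names follow the exercise; na.omit() is used here, though tidyr::drop_na() as in the lesson code works too):

dir.create("data/raw_data")
dir.create("data/clean_data")
file.rename("data/Heart.csv", "data/raw_data/Heart.csv")
Sys.chmod("data/raw_data/Heart.csv", "555")   # make the raw file read-only

# Scripted clean-up: the raw file is never modified
heart_raw <- read.csv("data/raw_data/Heart.csv")
heart_clean <- na.omit(heart_raw)             # drop rows containing NA values
write.csv(heart_clean, "data/clean_data/Heart.csv", row.names = FALSE)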

Principle #5: Store code in scripts

Why should you store code in scripts?

  • Any code or command you run is necessary for reproducing your results
  • You will likely have to re-run parts of your analysis
    • Realizing a mistake
    • New or updated data and/or methods
  • Relying on your command history (or memory) is very risky

What code should you store in scripts?

  • Yes, all code that you run at least once should be stored in a script
  • Even “throwaway” code that will “probably never be used again”
  • Ideally, you should be able to regenerate all of your results from scratch
    • Using just your scripts and no manual commands

How can you store code in scripts?

  • You can still use an interactive console
    • Just be diligent about saving the commands afterwards
  • Scripts should be under version control
  • Manual interventions, if unavoidable, should be tracked
  • Bonus: Follow a consistent coding style
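
For instance, a hypothetical runall.R that regenerates everything from scratch by sourcing each script in order (script names are made up):

source("scripts/clean_data.R")
source("scripts/train_model.R")
source("scripts/plot_results.R")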

Hands-on practice

Store code in scripts

  • Copy commands for cleaning data to analysis.R
  • Expand analysis.R
    • Train random forest model
    • Save fit and plot results
  • Adopt a consistent coding style

Principle #6: Document code and results

Why should you document code and results?

  • Scripting is the documentation of what you did
  • But why did you code something one way versus another?
  • And how do you interpret the results?
  • What if there was a way of achieving all of the above in one fell swoop?

Documenting your code with comments

  • Code comments are a common approach
  • Start every script with a brief comment explaining:
    • What it does
    • Usage information
    • An example (worth a thousand words)
    • Reasonable parameter values
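
A hypothetical example of such a header comment for an R script (names and paths are made up):

# clean_data.R
#
# What:    Removes rows with missing values from the raw Heart.csv data.
# Usage:   Rscript scripts/clean_data.R
# Example: Rscript scripts/clean_data.R   # writes data/clean_data/Heart.csv
# Notes:   Expects the raw file at data/raw_data/Heart.csv; re-run after every data update.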

Level up: Literate programming

  • Code and textual annotations are intertwined
  • You explain in text what you are doing in the code and, more importantly, why you are doing it
  • You also interpret results (e.g. figures) as they’re generated
  • Options: R Markdown, Jupyter Notebooks

What is R Markdown?

Quick Demo Video

Hands-on practice

Document code and results

  • Introduce Markdown (see cheat sheet)
  • Convert analysis.R to analysis.Rmd
  • Look over some common R Markdown options (see cheat sheet)
    • warning=FALSE, message=FALSE
  • Output document to Word
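
A minimal sketch of what the top of analysis.Rmd could look like after conversion (the title and chunk name are made up; warning and message are standard knitr chunk options):

---
title: "Predicting heart disease"
output: word_document
---

```{r clean-data, warning=FALSE, message=FALSE}
library(here)
heart_raw <- read.csv(here("data/raw_data/Heart.csv"))
```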

Principle #7: Track software versions

Why should you track software versions?

  • Software versions are important
    • Inputs and outputs can change
    • Software behaviour can change
    • End result: Potentially very different results
  • Version mismatches are very difficult to track down when trying to reproduce a result
    • Even when everything else was done properly

Case in point

library(dplyr)

by_species <- group_by(iris, Species)

by_sepal_width  <- arrange(by_species, Sepal.Width)

# Petal length of the first row after sorting by sepal width
by_sepal_width[[1, "Petal.Length"]]
  • My output: [1] 1.3
  • After version 0.4.3, arrange() no longer sorts within groups, so newer dplyr versions return a different value

How can you track software versions?

  • If you use little software:
    • Manually keep track of software versions
  • If you use a lot of software:
    • Use package managers like Conda/Bioconda
    • In R, print the session info at the end of each analysis
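
For the R case, a few commands that record versions (all base R or utils functions; dplyr is used only as an example package):

R.version.string          # the R version itself
packageVersion("dplyr")   # the version of a single package
sessionInfo()             # R version, platform, and all attached/loaded packages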

Hands-on practice

Track software versions

  • Add sessionInfo() at the end of the R Markdown file
    • Alternatively, you can use sessioninfo::session_info()
  • Introduce the Conda package manager and the Bioconda recipes
    • I do not recommend using Conda for R

Principle #8: Modularize your analyses

Why should you modularize your analyses?

  • Easier to re-run parts of the analysis
    • Especially useful when some parts are computationally intensive
  • Easier to understand, share, describe and modify the code
  • Naturally leads to the creation of useful intermediate files

How can you modularize your analyses?

  • Split scripts into multiple components that tie into one another
  • Use intermediate files as connections between scripts
  • Use standard formats instead of binary/proprietary formats
    • For example, CSV is better than RDS/RData
  • If you need speed, export a plain text version and a binary version
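
For example, an intermediate result could be exported in both forms (file names are hypothetical):

write.csv(heart_clean, "results/heart_clean.csv", row.names = FALSE)  # open, portable format
saveRDS(heart_clean, "results/heart_clean.rds")                       # binary, fast to reload in R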

Hands-on practice

Modularize your analyses

  • Split analysis.Rmd into 01-tidy.Rmd, 02-train.Rmd and 03-plot.Rmd
    • Don’t forget to output intermediate files
  • Demo R Markdown website
    • Create index.Rmd and _site.yml files
    • Run rmarkdown::render_site("src")

Principle #9: Be deterministic

Why should you be deterministic?

  • RNG = Random number generator
  • A result that relies on an RNG cannot be perfectly reproduced
  • Unless: A seed is set, which makes the RNG deterministic
  • Perfect for reproducibility!

How can you be deterministic?

  • Generate a random number that can act as the seed
  • Set the RNG seed to that random number
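
A minimal sketch (the seed value itself is arbitrary; the exercise suggests generating one with runif(1, 0, 1e8) and then hard-coding it):

set.seed(20160613)   # fixed seed chosen for illustration
runif(3)             # returns the same three numbers on every run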

Hands-on practice

Be deterministic

  • Compare error rates among learners
    • They should be different
  • Generate a random number using runif(1, 0, 1e8)
  • Set the seed at the top of each script that uses an RNG
    • Here, only 02-train.Rmd relies on an RNG
    • Use the set.seed() function
  • Compare error rates again

Principle #10: Share openly and freely!

Why should you share openly and freely?

Why Open Research?

  • Increase your visibility
  • Reduce publishing costs
  • Take back control
  • Get more funding
  • Publish where you want
  • Get that promotion

How can you share more openly and freely?

  • Share data, code and results
    • Store data in an open, non-proprietary file format
  • Use reputable repositories that can mint DOIs
    • Figshare, Dryad, Zenodo
  • Include any metadata expected by your research community
  • Include a CITATION file
  • Include a LICENSE file

Where can you openly and freely share your data?

It depends on what form the data is in

  • If it’s small, it can be included in your repository
  • Institutional repository (e.g. SFU RADAR)
  • Reputable repositories (e.g. Figshare, Dryad, Zenodo)
  • Domain-specific repository (e.g. NCBI SRA for sequencing data)

Hands-on practice

Share openly and freely

  • Sharing code is as easy as switching the GitHub repository to public
    • You can create a tag in Git to mark the version that was used
  • Share results in open-access publications
    • You can also post preprints
    • Journal policies can be found on SHERPA/RoMEO

The End

Thank you for your attention!

✌🏻

The Heart.csv dataset used in this presentation was taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Extra Stuff

R code for lesson

# Tidy ---------------------------------------------------
library(here)
library(tidyr)

heart_raw_file <- here("data/raw_data/Heart.csv")
heart_raw <- read.csv(heart_raw_file)

heart_clean <- drop_na(heart_raw)

heart_clean_file <- here("data/clean_data/Heart.csv")
write.csv(heart_clean, heart_clean_file, row.names = FALSE)  # avoid writing an extra row-number column

sessionInfo()


# Train --------------------------------------------------
library(here)
library(randomForest)

heart_clean_file <- here("data/clean_data/Heart.csv")
heart_clean <- read.csv(heart_clean_file)

# AHD must be a factor for classification (read.csv leaves strings as character in R >= 4.0)
heart_clean$AHD <- as.factor(heart_clean$AHD)

rf_model <- randomForest(AHD ~ ., heart_clean,
                         importance = TRUE,
                         keep.forest = TRUE)

rf_model_file <- here("results/random_forest_model.rds")
saveRDS(rf_model, rf_model_file)

sessionInfo()


# Plot ---------------------------------------------------
library(here)
library(randomForest)

rf_model_file <- here("results/random_forest_model.rds")
rf_model <- readRDS(rf_model_file)

heart_clean_file <- here("data/clean_data/Heart.csv")
heart_clean <- read.csv(heart_clean_file)

print(rf_model)

plot(rf_model)

varImpPlot(rf_model)

partialPlot(rf_model, heart_clean, "Ca")

sessionInfo()