install.packages(c("tidyverse", "rmarkdown", "randomForest"))
An attainable minimum standard for assessing the value of scientific claims.
– Sandve et al., 2013
As computational researchers, we can rely on the fact that computers are very good at following instructions.
Your primary collaborator is yourself 6 months from now, and your past self doesn’t answer emails.
README file with a description of your project structure
|-- data
|   |-- yvr.2016-06-13.bird_counts.csv
|   |-- yvr.2016-07-27.bird_counts.csv
|   `-- yul.2016-06-13.bird_counts.csv
|
|-- docs                              # Notes and manuscript
|   |-- notebook.md
|   |-- manuscript.md
|   `-- changelog.txt
|
|-- results                           # Output (disposable)
|   `-- summarized_results.csv
|
`-- scripts                           # Code
    |-- sightings_analysis.R
    `-- runall.R
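The top-level folders of the layout above can be created in one step from R. This is a minimal sketch using base R only; the helper name create_skeleton and the temporary project root are made up for illustration.

```r
# Hypothetical helper: create the four top-level folders of the layout above.
create_skeleton <- function(root) {
  folders <- file.path(root, c("data", "docs", "results", "scripts"))
  for (folder in folders) {
    dir.create(folder, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(folders)
}

# Demonstrate in a throwaway location rather than in a real project.
project_root <- file.path(tempdir(), "my_project")
created <- create_skeleton(project_root)
print(basename(created))
```

Running the helper twice is harmless, because existing folders are left untouched.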
Organize your files
predict_hd
data, docs, results, scripts
ProjectTemplate package
Be portable
setwd() (not portable)
getwd()
here R package
Track your project history
(README) to Git repository
data folder and commit
|-- README
|-- data
|   |
|   |-- raw_data
|   |   `-- birds_count_table.csv         # Never edited!
|   |
|   `-- clean_data
|       `-- birds_count_table.clean.csv
|
|-- src
|   `-- clean_data.R                      # Script instead of manual editing
|
[...]
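The "Never edited!" rule for raw data can be backed by file permissions. The following is a minimal sketch, assuming a Unix-like file system (Sys.chmod() has only a limited effect on Windows). The file is a temporary stand-in rather than a real dataset, and mode "444" (read-only for everyone) is just one possible choice.

```r
# Stand-in for a raw data file; a real project would point at data/raw_data.
raw_file <- tempfile(fileext = ".csv")
write.csv(data.frame(count = c(3, 5)), raw_file, row.names = FALSE)

# Remove write permission for owner, group and others.
Sys.chmod(raw_file, mode = "444", use_umask = FALSE)

# Inspect the resulting permission bits.
file_mode <- format(file.info(raw_file)$mode)
print(file_mode)
```

Any later attempt to overwrite the file from an ordinary user account should then fail, which turns an accidental edit into a visible error.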
Treat data as read-only
data folder into data/raw_data and data/clean_data
(Heart.csv) into data/raw_data
Sys.chmod("data/raw_data/Heart.csv", "555")
Heart.csv raw data file
Store code in scripts
analysis.R
analysis.R
Document code and results
analysis.R to analysis.Rmd
warning=FALSE, message=FALSE
library(dplyr)
by_species <- group_by(iris, Species)
by_sepal_width <- arrange(by_species, Sepal.Width)
# Shortest petal length among setosa flowers
by_sepal_width[[1, "Petal.Length"]]
- My output:
[1] 1.3
- Major dplyr change after version 0.4.3: arrange() ignores grouping in later releases, so the same code can return a different value
Track software versions
sessionInfo() at the end of the R Markdown file
sessioninfo::session_info()
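Besides printing it in the rendered document, the session information can be saved next to the results. A sketch with base R only; the output file name and location are assumptions.

```r
# Capture the printed session information as plain text lines.
session_lines <- capture.output(print(sessionInfo()))

# Hypothetical location; in a project this might live under results/.
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(session_lines, session_file)

# The first line normally names the R version in use.
print(session_lines[1])
```

Committing this file together with the results keeps a record of which versions produced them.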
Modularize your analyses
analysis.Rmd into 01-tidy.Rmd, 02-train.Rmd and 03-plot.Rmd
index.Rmd and _site.yml files
rmarkdown::render_site("src")
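The numeric prefixes make the run order explicit. As a rough sketch of how a driver script might pick up the steps in order (the helper numbered_rmd is hypothetical, and the files are empty placeholders in a temporary directory):

```r
# Hypothetical helper: list R Markdown files that start with a two-digit
# prefix, sorted so that 01-... comes before 02-... and so on.
numbered_rmd <- function(dir) {
  files <- list.files(dir, pattern = "^[0-9]{2}-.*\\.Rmd$")
  sort(files)
}

# Placeholder files standing in for the real analysis steps.
demo_dir <- file.path(tempdir(), "src_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("03-plot.Rmd", "01-tidy.Rmd",
                                  "02-train.Rmd", "index.Rmd")))

steps <- numbered_rmd(demo_dir)
print(steps)
```

Each element of steps could then be passed to rmarkdown::render() in turn; index.Rmd is skipped here because it carries no step number.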
Be deterministic
runif(1, 0, 1e8)
train.Rmd relies on an RNG
set.seed() function
✌🏻
The Heart.csv dataset used in this presentation was taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
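# Tidy ---------------------------------------------------
library(here)
library(tidyr)
# Read the raw data (never modified in place)
heart_raw_file <- here("data/raw_data/Heart.csv")
heart_raw <- read.csv(heart_raw_file)
# Drop rows with missing values
heart_clean <- drop_na(heart_raw)
# Write the cleaned copy; row.names = FALSE avoids adding an extra index column
heart_clean_file <- here("data/clean_data/Heart.csv")
write.csv(heart_clean, heart_clean_file, row.names = FALSE)
sessionInfo()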
# Train --------------------------------------------------
library(here)
library(randomForest)
# Read text columns as factors (not the default since R 4.0.0) so that
# AHD is treated as a class label
heart_clean_file <- here("data/clean_data/Heart.csv")
heart_clean <- read.csv(heart_clean_file, stringsAsFactors = TRUE)
rf_model <- randomForest(AHD ~ ., heart_clean,
                         importance = TRUE,
                         keep.forest = TRUE)
rf_model_file <- here("results/random_forest_model.rds")
saveRDS(rf_model, rf_model_file)
sessionInfo()
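randomForest() draws random samples while growing its trees, so the training step gives a slightly different model on every run unless the random number generator is seeded first. The following base R sketch shows the effect of seeding; the seed value is arbitrary and chosen only for illustration.

```r
# Arbitrary seed, chosen only for illustration.
seed <- 42

set.seed(seed)
first_draw <- runif(1, 0, 1e8)

set.seed(seed)
second_draw <- runif(1, 0, 1e8)

# Same seed, same "random" number.
print(identical(first_draw, second_draw))
```

In the training script, a single set.seed() call placed before randomForest() would play the same role.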
# Plot ---------------------------------------------------
library(here)
library(randomForest)
# Load the model saved by the training step
rf_model_file <- here("results/random_forest_model.rds")
rf_model <- readRDS(rf_model_file)
# Read the cleaned data with text columns as factors, as in training
heart_clean_file <- here("data/clean_data/Heart.csv")
heart_clean <- read.csv(heart_clean_file, stringsAsFactors = TRUE)
print(rf_model)
plot(rf_model)
varImpPlot(rf_model)
partialPlot(rf_model, heart_clean, "Ca")
sessionInfo()