Thursday Overview: R for Reproducible Research

Goal:
Learn how to set up a reproducible research project in R, manage your code and data responsibly, and build a simple, tested analysis pipeline using RStudio.

What We Cover

Project Setup & Data Privacy:
Reviewed a sensible folder structure for research projects.
Discussed the importance of .gitignore to keep sensitive or unnecessary files (like raw data and environment folders) out of version control.
Version Control with Git & RStudio:
Used Git (with RStudio or command line) to track project changes and collaborate safely.
Reproducible R Environments:
Used renv to manage project-specific R packages, ensuring reproducibility.
Downloading and Preparing Data:
Downloaded a real-world dataset from cBioPortal.
Used bash commands to organize and extract data into the project.
Building a Simple R Pipeline:
Wrote clear, well-documented R functions to:
- Load tabular data
- Clean/filter relevant columns
- Save processed data
- Analyze and plot the top 10 mutated genes
Saved results and plots to the appropriate folders.
Testing Your Code:
Wrote a basic test (using testthat) to check that the pipeline runs end-to-end and produces expected outputs.
Discussed the value of both unit and smoke tests for research code.
Ethical Coding & Documentation:
Emphasized the importance of documentation and code reuse.
Added a README to the analysis folder to explain what each script does.

Key Takeaways

Organize your project folders and use .gitignore to protect sensitive data.
Use version control (Git) for all your code and documentation.
Always use a reproducible environment for R projects (e.g., renv).
Write modular, well-documented code for each step of your analysis.
Test your code to catch errors early and ensure reproducibility.
Document your scripts and results for yourself and others.

Videos of session

Section 1/3 15.55
Section 2/3 17.21
Section 3/3 17.15

Homework:
- Practice writing a simple R function for data loading or analysis. - Write a short test to check your function works. - Add a brief README or comments to explain your code. - Explore The Turing Way’s code reuse checklist for more ideas.

Session Cheat Sheet: Thursday

Bash/Terminal and RStudio Commands

# List files and check .gitignore
ls
cat .gitignore

# Add common R exclusions to .gitignore
echo ".Rhistory\n.RData\n.Rproj.user/\nrenv/" >> .gitignore
cat .gitignore

# Create folders for R code and tests
mkdir -p src/R tests/R
ls

# Check git status
git status

R Environment Setup

# Using the R console in RStudio
install.packages("renv")
renv::init()
install.packages(c("tidyverse", "testthat"))
renv::snapshot()

Data Download and Extraction

# Using the bash terminal in R Studio or your system terminal
mkdir -p data/raw
wget -O data/raw/aml_tcga_gdc.tar.gz https://cbioportal-datahub.s3.amazonaws.com/aml_tcga_gdc.tar.gz
tar -xzvf data/raw/aml_tcga_gdc.tar.gz -C data/raw

R Code Snippets

src/R/analysis.R

#!/usr/bin/env Rscript

library(readr)
library(dplyr)
library(ggplot2)

load_data <- function(filepath) {
  df <- read_tsv(filepath, skip = 2)
  return(df)
}

clean_data <- function(df) {
  df <- df %>%
    select(Hugo_Symbol, Variant_Classification, Tumor_Sample_Barcode) %>%
    na.omit()
  return(df)
}

save_data <- function(df, output_path) {
  write_csv(df, output_path)
}

analyze_data <- function(df) {
  summary <- df %>%
    count(Hugo_Symbol, sort = TRUE) %>%
    head(10)
  print("Top 10 mutated genes:")
  print(summary)
  p <- ggplot(summary, aes(x = reorder(Hugo_Symbol, -n), y = n)) +
    geom_bar(stat = "identity") +
    xlab("Gene") +
    ylab("Mutation Count") +
    ggtitle("Top 10 Mutated Genes") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  return(p)
}

# Example usage
# df <- load_data("data/raw/aml_tcga_gdc/data_mutations.txt")
# cleaned_df <- clean_data(df)
# save_data(cleaned_df, "data/processed/cleaned_mutations.csv")
# p <- analyze_data(cleaned_df)
# ggsave("results/top10_mutated_genes.png", plot = p, width = 8, height = 5)

tests/R/test_analysis.R

library(testthat)
source("../../src/R/analysis.R")

test_that("pipeline smoke test", {
  df <- data.frame(
    Hugo_Symbol = c("TP53", "BRCA1", "TP53", "EGFR"),
    Variant_Classification = c("Missense", "Nonsense", "Missense", "Silent"),
    Tumor_Sample_Barcode = c("S1", "S2", "S3", "S4")
  )
  cleaned <- clean_data(df)
  expect_true(nrow(cleaned) > 0)
  expect_true(all(c("Hugo_Symbol", "Variant_Classification", "Tumor_Sample_Barcode") %in% colnames(cleaned)))
})

Add a README to your analysis folder

bash

echo -e "# Analysis Scripts\nThis folder contains R scripts for data analysis. Each script is documented and tested." > src/R/README.md
cat src/R/README.md

References

Cerami et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discovery. May 2012; 401. PubMed
Gao et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal. 6, pl1 (2013). PubMed
de Bruijn et al. Analysis and Visualization of Longitudinal Genomic and Clinical Data from the AACR Project GENIE Biopharma Collaborative in cBioPortal. Cancer Res (2023). PubMed
DataSet: https://www.cbioportal.org/study/plots?id=aml_tcga_gdc