Wednesday Overview: Python for Reproducible Research
Goal:
Learn how to set up a reproducible research project in Python, manage your code and data responsibly, and build a simple, tested analysis pipeline.
What We Cover
- Project Setup & Data Privacy:
  - Reviewed a sensible folder structure for research projects.
  - Discussed the importance of .gitignore to keep sensitive or unnecessary files (like raw data and environment folders) out of version control.
- Version Control with Git & VSCode:
  - Used Git (with VSCode or command line) to track project changes and collaborate safely.
- Python Virtual Environments:
  - Created a virtual environment to manage project-specific Python packages, ensuring reproducibility.
- Downloading and Preparing Data:
  - Downloaded a real-world dataset from cBioPortal.
  - Used bash commands to organize and extract data into the project.
- Building a Simple Python Pipeline:
  - Wrote clear, well-documented Python functions to:
    - Load tabular data
    - Clean/filter relevant columns
    - Save processed data
    - Analyze and plot the top 10 mutated genes
  - Saved results and plots to the appropriate folders.
- Testing Your Code:
  - Wrote a basic test (using pytest) to check that the pipeline runs end-to-end and produces expected outputs.
  - Discussed the value of both unit and smoke tests for research code.
- Ethical Coding & Documentation:
  - Emphasized the importance of documentation and code reuse.
  - Added a README to the analysis folder to explain what each script does.
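The folder structure reviewed in the session can also be created from Python in one step. The following is a minimal sketch, assuming the folder names that appear in the cheat-sheet commands and scripts (data/raw, data/processed, results, src/python, tests/python); the function name `create_project_layout` is hypothetical and not part of the session code.

```python
from pathlib import Path

# Folder names taken from the commands and scripts used in this session;
# adjust them to match your own project.
PROJECT_FOLDERS = [
    "data/raw",        # downloaded, unmodified data (keep out of Git)
    "data/processed",  # cleaned data written by the pipeline
    "results",         # plots and summary outputs
    "src/python",      # analysis code
    "tests/python",    # pytest tests
]


def create_project_layout(root):
    """Create the project folders under `root` and return their paths.

    Hypothetical helper for illustration only.
    """
    root = Path(root)
    created = []
    for folder in PROJECT_FOLDERS:
        path = root / folder
        # parents=True builds intermediate folders; exist_ok=True makes re-runs safe
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_project_layout(".")` from the project root would set up all five folders. Creating data/processed and results up front is useful because the analysis script writes into both and does not create them itself.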
Key Takeaways
- Organize your project folders and use .gitignore to protect sensitive data.
- Use version control (Git) for all your code and documentation.
- Always use a virtual environment for Python projects.
- Write modular, well-documented code for each step of your analysis.
- Test your code to catch errors early and ensure reproducibility.
- Document your scripts and results for yourself and others.
Videos of session
- Section 1/3 21.19
- Section 2/3 15.06
- Section 3/3 19.43
Homework:
- Practice writing a simple Python function for data loading or analysis.
- Write a short test to check your function works.
- Add a brief README or comments to explain your code.
- Explore The Turing Way’s code reuse checklist for more ideas.
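As one possible starting point for the first two homework items, here is a small sketch of a data-loading function together with a pytest test. The function `load_gene_list` and the file format (one gene symbol per line, with `#` comment lines) are made up for illustration and are not part of the session code.

```python
def load_gene_list(filepath):
    """Return gene symbols from a text file with one symbol per line.

    Blank lines and lines starting with '#' are skipped.
    Hypothetical example function for the homework.
    """
    genes = []
    with open(filepath) as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                genes.append(line)
    return genes


def test_load_gene_list(tmp_path):
    # tmp_path is pytest's built-in fixture: a fresh temporary folder per test
    gene_file = tmp_path / "genes.txt"
    gene_file.write_text("# genes of interest\nTP53\n\nEGFR\n")
    assert load_gene_list(gene_file) == ["TP53", "EGFR"]
```

Saved in a file whose name starts with `test_`, the test is picked up automatically when you run pytest. The docstring and the comment in the test cover the third homework item.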
Session Cheat Sheet: Wednesday
Bash/Terminal Commands
# List files and check .gitignore
ls
cat .gitignore
# Add common Python exclusions to .gitignore
# (printf interprets \n as a newline; plain echo in bash does not)
printf '__pycache__/\n*.pyc\nenvironment/\n' >> .gitignore
cat .gitignore
# Create folders for Python code and tests
mkdir -p src/python tests/python
ls
# Check git status
git status
# Create and activate a Python virtual environment
python3 -m venv environment
source environment/bin/activate
# Install required Python packages
pip install pandas pytest matplotlib
# Record exact package versions outside environment/, which is ignored by Git
pip freeze > requirements.txt
# Download and extract a dataset from cBioPortal
mkdir -p data/raw
wget -O data/raw/aml_tcga_gdc.tar.gz https://cbioportal-datahub.s3.amazonaws.com/aml_tcga_gdc.tar.gz
tar -xzvf data/raw/aml_tcga_gdc.tar.gz -C data/raw
# Run your analysis script
python src/python/analysis.py
# Run your tests
pytest
Python Code Snippets
src/python/analysis.py
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; select it before importing pyplot
import matplotlib.pyplot as plt
import pandas as pd


def load_data(filepath):
    """Load a tab-delimited file and return a DataFrame."""
    # header=2: the third line holds the column names; the two comment rows above it are skipped
    df = pd.read_csv(filepath, sep='\t', header=2)
    return df


def clean_data(df):
    """Keep the gene, variant class, and sample columns and drop incomplete rows."""
    df = df[['Hugo_Symbol', 'Variant_Classification', 'Tumor_Sample_Barcode']]
    df = df.dropna()
    return df


def save_data(df, output_path):
    """Write the DataFrame to a CSV file without the index column."""
    df.to_csv(output_path, index=False)


def analyze_data(df):
    """Print the 10 most frequently mutated genes and return a bar plot of them."""
    summary = df['Hugo_Symbol'].value_counts().head(10)
    print("Top 10 mutated genes:")
    print(summary)
    fig, ax = plt.subplots()
    summary.plot(kind='bar', ax=ax)
    ax.set_xlabel('Gene')
    ax.set_ylabel('Mutation Count')
    ax.set_title('Top 10 Mutated Genes')
    return fig


if __name__ == "__main__":
    df = load_data("data/raw/aml_tcga_gdc/data_mutations.txt")
    cleaned_df = clean_data(df)
    save_data(cleaned_df, "data/processed/cleaned_mutations.csv")
    fig = analyze_data(cleaned_df)
    fig.savefig('results/top10_mutated_genes.png')
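clean_data selects three columns by name, so an input file with a different header (for example, one with a different number of comment rows) fails with a pandas KeyError. An explicit check can give a clearer message. The following is a minimal sketch; `check_required_columns` and `REQUIRED_COLUMNS` are hypothetical names, not part of the session code, and the example table is made up.

```python
import pandas as pd

# The columns that clean_data() selects in the script above.
REQUIRED_COLUMNS = ["Hugo_Symbol", "Variant_Classification", "Tumor_Sample_Barcode"]


def check_required_columns(df, required=REQUIRED_COLUMNS):
    """Raise a ValueError naming any required column missing from `df`.

    Hypothetical helper for illustration only.
    """
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Input is missing required columns: {missing}")
    return df


# Example with a small made-up table that lacks the sample barcode column.
example = pd.DataFrame({
    "Hugo_Symbol": ["TP53"],
    "Variant_Classification": ["Missense"],
})
try:
    check_required_columns(example)
except ValueError as err:
    print(err)
```

One option would be to call such a check at the start of clean_data, so that a malformed file is reported before any filtering happens.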
tests/python/test_analysis.py
import os
import sys
from pathlib import Path

import pandas as pd

# Put the project root (two levels above tests/python/) on the import path
sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
from src.python.analysis import load_data, clean_data, save_data, analyze_data


def create_test_data(tmp_path):
    """Write a small mutation table in the cBioPortal layout and return its path."""
    df = pd.DataFrame({
        'Hugo_Symbol': ['TP53', 'BRCA1', 'TP53', 'EGFR'],
        'Variant_Classification': ['Missense', 'Nonsense', 'Missense', 'Silent'],
        'Tumor_Sample_Barcode': ['S1', 'S2', 'S3', 'S4']
    })
    input_file = tmp_path / 'test_mutations.txt'
    with open(input_file, 'w') as f:
        f.write("# The cBioPortalFiles\n")
        f.write("# Have 2 comment rows before the dataframe\n")
    # Append the table only after the comment rows are written and the file is closed
    df.to_csv(input_file, sep='\t', index=False, mode='a')
    return input_file


# tmp_path is pytest's built-in temporary-folder fixture; giving it a default
# value would stop pytest from injecting it and write files into the current folder
def test_pipeline_smoke(tmp_path):
    """Run the whole pipeline on a tiny dataset and check the outputs exist."""
    input_file = create_test_data(tmp_path)
    loaded = load_data(input_file)
    cleaned = clean_data(loaded)
    output_file = tmp_path / 'cleaned.csv'
    save_data(cleaned, output_file)
    fig = analyze_data(cleaned)
    plot_file = tmp_path / 'plot.png'
    fig.savefig(plot_file)
    assert os.path.exists(output_file)
    out_df = pd.read_csv(output_file)
    assert not out_df.empty
    assert set(['Hugo_Symbol', 'Variant_Classification', 'Tumor_Sample_Barcode']).issubset(out_df.columns)
    assert os.path.exists(plot_file)
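The smoke test above checks that the pipeline runs from start to finish. A unit test instead pins down the behaviour of a single function. The following is a minimal sketch of a unit test for clean_data; the function body is copied here only so the example runs on its own, and the input table is made up. In the real test file you would import clean_data from src.python.analysis instead.

```python
import pandas as pd


def clean_data(df):
    """Copy of clean_data from src/python/analysis.py, repeated so this sketch is self-contained."""
    df = df[['Hugo_Symbol', 'Variant_Classification', 'Tumor_Sample_Barcode']]
    return df.dropna()


def test_clean_data_drops_incomplete_rows():
    # Made-up input: the second row has no sample barcode, and there is one extra column
    raw = pd.DataFrame({
        'Hugo_Symbol': ['TP53', 'EGFR'],
        'Variant_Classification': ['Missense', 'Silent'],
        'Tumor_Sample_Barcode': ['S1', None],
        'Extra_Column': [1, 2],
    })
    cleaned = clean_data(raw)
    # Only the three expected columns remain, in the order clean_data selects them
    assert list(cleaned.columns) == ['Hugo_Symbol', 'Variant_Classification', 'Tumor_Sample_Barcode']
    # The row with the missing barcode is removed
    assert cleaned['Hugo_Symbol'].tolist() == ['TP53']
```

Because the test builds its input in memory, it needs no files and runs quickly, which makes it easy to add one such test per pipeline function.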
Add a README to your analysis folder
echo -e "# Analysis Scripts\nThis folder contains Python scripts for data analysis. Each script is documented and tested." > src/python/README.md
cat src/python/README.md
References
- Cerami et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discovery. May 2012; 401.
- Gao et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal. 6, pl1 (2013).
- de Bruijn et al. Analysis and Visualization of Longitudinal Genomic and Clinical Data from the AACR Project GENIE Biopharma Collaborative in cBioPortal. Cancer Res (2023).
- DataSet: https://www.cbioportal.org/study/plots?id=aml_tcga_gdc