--- title: "Visual Quality Control and Interpretation" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Visual Quality Control and Interpretation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) ``` ```{r setup, message = FALSE} library(pepdiff) library(dplyr) library(ggplot2) ``` ## Why visualise? Numbers summarise; plots reveal. A p-value tells you something is significant, but a volcano plot shows you whether that significance is driven by a few outliers or a robust pattern. Always plot your data before trusting the statistics. This vignette shows what "good" and "bad" look like for each diagnostic plot. ## Creating example datasets We'll create two datasets: one well-behaved, one problematic. ```{r generate-good-data} set.seed(123) # Good data: clean separation, no outliers, balanced # Using 10 reps and 3-fold changes for good power make_good_data <- function() { n_peptides <- 40 n_reps <- 10 peptides <- paste0("PEP_", sprintf("%03d", 1:n_peptides)) genes <- paste0("GENE_", LETTERS[((1:n_peptides - 1) %% 26) + 1]) diff_peptides <- peptides[1:12] # 30% differential sim_data <- expand.grid( peptide = peptides, treatment = c("ctrl", "trt"), bio_rep = 1:n_reps, stringsAsFactors = FALSE ) %>% mutate( gene_id = genes[match(peptide, peptides)], base = rep(rgamma(n_peptides, shape = 8, rate = 0.8), each = 2 * n_reps), effect = ifelse(peptide %in% diff_peptides & treatment == "trt", sample(c(0.33, 3), length(diff_peptides) * n_reps, replace = TRUE), 1), value = rgamma(n(), shape = 20, rate = 20 / (base * effect)) ) %>% select(peptide, gene_id, treatment, bio_rep, value) temp_file <- tempfile(fileext = ".csv") write.csv(sim_data, temp_file, row.names = FALSE) read_pepdiff(temp_file, id = "peptide", gene = "gene_id", value = "value", factors = "treatment", replicate = "bio_rep") } good_data <- make_good_data() ``` ```{r generate-bad-data} # Problematic data: outlier sample, batch effect, systematic missingness make_bad_data <- function() { n_peptides <- 40 n_reps <- 10 peptides <- paste0("PEP_", sprintf("%03d", 1:n_peptides)) genes <- paste0("GENE_", LETTERS[((1:n_peptides - 1) %% 26) + 1]) sim_data <- expand.grid( peptide = peptides, treatment = c("ctrl", "trt"), bio_rep = 1:n_reps, stringsAsFactors = FALSE ) %>% mutate( gene_id = genes[match(peptide, peptides)], base = rep(rgamma(n_peptides, shape = 5, rate = 0.5), each = 2 * n_reps), # Outlier: one control sample has 3x higher values batch_effect = ifelse(treatment == "ctrl" & bio_rep == 1, 3, 1), value = rgamma(n(), shape = 10, rate = 10 / (base * batch_effect)) ) %>% select(peptide, gene_id, treatment, bio_rep, value) # Add systematic missingness: low-abundance peptides missing in treatment low_abundance <- peptides[31:40] sim_data <- sim_data %>% mutate(value = ifelse(peptide %in% low_abundance & treatment == "trt" & runif(n()) < 0.7, NA, value)) temp_file <- tempfile(fileext = ".csv") write.csv(sim_data, temp_file, row.names = FALSE) read_pepdiff(temp_file, id = "peptide", gene = "gene_id", value = "value", factors = "treatment", replicate = "bio_rep") } bad_data <- make_bad_data() ``` ## Data-level diagnostics ### PCA plot PCA (Principal Component Analysis) projects your high-dimensional data onto 2D. Samples that are similar cluster together. 
**Good PCA:**

```{r pca-good, fig.height = 4}
plot_pca_simple(good_data)
```

- Replicates from the same condition cluster together
- Groups are separated (if there's a biological effect)
- No wild outliers

**Problematic PCA:**

```{r pca-bad, fig.height = 4}
plot_pca_simple(bad_data)
```

Warning signs:

- **Outlier sample**: One point far from its group suggests a sample quality issue
- **No separation**: If groups overlap completely, your treatment may have no effect (or the effect is too small to detect)
- **Unexpected clustering**: Samples clustering by batch/run date instead of treatment suggests a batch effect

**What to do:**

- Investigate outliers - check sample prep notes, consider excluding
- If batch effects dominate, you may need to include batch in your model or use batch correction

### Distribution plots

```{r dist-good, fig.height = 4}
plot_distributions_simple(good_data)
```

**What to look for:** Similar shapes and locations across samples. Proteomics data is typically right-skewed - that's expected.

```{r dist-bad, fig.height = 4}
plot_distributions_simple(bad_data)
```

Warning signs:

- **Shifted distributions**: One sample systematically higher/lower suggests normalisation issues
- **Different shapes**: One sample much more spread out suggests technical problems
- **Bimodal distributions**: May indicate a mixture of populations

**What to do:**

- Check if data was normalised appropriately
- Investigate shifted samples for technical issues
- Consider whether the shifted sample should be excluded

### Missingness plot

```{r miss-good, fig.height = 4}
plot_missingness_simple(good_data)
```

**Good missingness:** Random scatter, or no missing values at all. Missing values occur independently of abundance or condition.

```{r miss-bad, fig.height = 4}
plot_missingness_simple(bad_data)
```

Warning signs:

- **MNAR pattern**: Low-abundance peptides preferentially missing. This is "missing not at random" - the value is missing *because* it's low (below detection limit).
- **Condition-specific missingness**: Peptides missing only in treatment (or only in control) may indicate true biological absence, or may be technical artifacts.

**What to do:**

- MNAR is common in proteomics and difficult to handle properly
- Peptides with high missingness often can't be reliably analysed
- pepdiff doesn't impute - if data is missing, the peptide may be excluded from analysis
- See peppwR for understanding missingness patterns and power implications

The sketch below shows one way to quantify condition-specific missingness directly.
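A minimal sketch, assuming your data are still in long format with `peptide`, `treatment`, and `value` columns (as in the simulated CSVs above); `raw_long` is a placeholder for such a table, and the pepdiff object itself is not required for this check.

```{r missingness-by-condition, eval = FALSE}
# Tabulate missingness per peptide and condition from a long-format
# table (columns: peptide, treatment, value). Rates that are high in
# one condition only are the condition-specific warning sign above.
raw_long %>%
  group_by(peptide, treatment) %>%
  summarise(pct_missing = 100 * mean(is.na(value)), .groups = "drop") %>%
  arrange(desc(pct_missing)) %>%
  head(10)
```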
## Results-level diagnostics

Let's run analyses on both datasets:

```{r run-analyses}
good_results <- compare(good_data, compare = "treatment", ref = "ctrl")
bad_results <- compare(bad_data, compare = "treatment", ref = "ctrl")
```

### Volcano plot

The volcano plot shows effect size (fold change) vs statistical significance. It's the most informative single plot for differential abundance results.

**Good volcano:**

```{r volcano-good, fig.height = 4}
plot_volcano_new(good_results)
```

- **Symmetric spread**: Roughly equal up and down regulation
- **Significant hits at edges**: Large fold changes with low p-values
- **Cloud in centre**: Most peptides show small, non-significant changes

**Problematic volcano:**

```{r volcano-bad, fig.height = 4}
plot_volcano_new(bad_results)
```

Warning signs:

- **Asymmetric**: All significant hits in one direction may indicate a global shift (normalisation problem) rather than true differential expression
- **Vertical stripe at FC = 1 (log2 FC = 0)**: Many significant hits with tiny fold changes suggests p-value inflation
- **Everything significant**: Likely a technical artifact or analysis error

### P-value histogram

The p-value histogram reveals whether your statistical assumptions are met.

**Understanding the shapes:** Under the null hypothesis (no true effects), p-values are uniformly distributed - every value from 0 to 1 is equally likely. When true effects exist, those peptides get pulled toward 0, creating a spike.

**Good p-value histogram:**

```{r pval-good, fig.height = 4}
plot_pvalue_histogram(good_results)
```

- **Uniform base + spike near 0**: This is ideal. The uniform part represents true nulls, the spike represents true positives.
- The height of the spike relative to the uniform part tells you roughly what fraction of peptides have real effects - the sketch after the next plot shows one way to put a number on this.

**Problematic p-value histogram:**

```{r pval-bad, fig.height = 4}
plot_pvalue_histogram(bad_results)
```

Warning signs:

- **U-shape** (spikes at both 0 and 1): P-value inflation. Your test is anti-conservative - it's giving too many small AND too many large p-values. Often indicates violated assumptions.
- **Spike at 1 only**: Test is too conservative. May happen with very small sample sizes.
- **Completely uniform**: No signal at all. Either no true effects, or not enough power to detect them.
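To put a rough number on the spike-vs-uniform intuition, you can estimate the fraction of true nulls from the flat right-hand half of the histogram (the idea behind Storey's pi0 estimator). A minimal sketch, assuming the results coerce to a data frame with a `p_value` column - a hypothetical name, so check `names()` on your own results object:

```{r pi0-sketch, eval = FALSE}
# Under the null, p-values are uniform, so roughly half of the true
# nulls land above 0.5. Scaling that proportion back up estimates the
# null fraction; one minus it estimates the fraction with real effects.
# `p_value` is an assumed column name - inspect your results object.
p <- as.data.frame(good_results)$p_value
pi0 <- min(1, mean(p > 0.5, na.rm = TRUE) / 0.5)
round(c(true_nulls = pi0, real_effects = 1 - pi0), 2)
```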
### Fold change distribution

```{r fc-good, fig.height = 4}
plot_fc_distribution_new(good_results)
```

**Good:** Centred near zero (log2 scale), roughly symmetric. Most peptides don't change much.

```{r fc-bad, fig.height = 4}
plot_fc_distribution_new(bad_results)
```

Warning signs:

- **Systematic shift**: Distribution centred away from zero suggests a global offset (normalisation issue)
- **Bimodal**: Two peaks may indicate batch effects or distinct populations

## Individual plot functions

The `plot()` method gives you a multi-panel grid. For individual plots:

**Data-level:**

- `plot_pca_simple(data)` - PCA
- `plot_distributions_simple(data)` - Abundance distributions
- `plot_missingness_simple(data)` - Missingness patterns

**Results-level:**

- `plot_volcano_new(results)` - Volcano plot
- `plot_pvalue_histogram(results)` - P-value histogram
- `plot_fc_distribution_new(results)` - Fold change distribution

## Exporting for publication

All plots are ggplot2 objects, so you can customise and save them:

```{r export-example, eval = FALSE}
p <- plot_volcano_new(good_results) +
  labs(title = "Treatment vs Control") +
  theme_minimal(base_size = 14)

ggsave("volcano.pdf", p, width = 6, height = 5)
```

You can also modify themes, colours, and labels:

```{r customise-example, fig.height = 4}
plot_volcano_new(good_results) +
  theme_classic() +
  labs(title = "My Custom Volcano Plot")
```

## Summary: what to check

**Before analysis:**

1. PCA - do replicates cluster? Any outliers?
2. Distributions - similar across samples?
3. Missingness - random or systematic?

**After analysis:**

1. Volcano - symmetric? Hits make sense?
2. P-value histogram - uniform + spike, or something wrong?
3. Fold changes - centred at zero?

If something looks off, investigate before trusting the statistics. Plots often reveal problems that summary numbers hide. The checklist also runs naturally as code, as in the sketch below.
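A compact way to step through the checklist, using the individual plot functions listed above; `my_data` is a placeholder for your own pepdiff object.

```{r checklist-recap, eval = FALSE}
# Before analysis - data-level checks
plot_pca_simple(my_data)            # replicates cluster? outliers?
plot_distributions_simple(my_data)  # similar shapes and locations?
plot_missingness_simple(my_data)    # random or systematic gaps?

# After analysis - results-level checks
results <- compare(my_data, compare = "treatment", ref = "ctrl")
plot_volcano_new(results)           # symmetric? hits make sense?
plot_pvalue_histogram(results)      # uniform base + spike near 0?
plot_fc_distribution_new(results)   # centred at zero?
```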