© 2022 Janghyun Choi
Docer CC BY-NC-SA 4.0

Visualizing Differentially Expressed Genes: Volcano Plot and CPM Scatter Plot

Visual representation of differential expression analysis results is crucial for interpreting and communicating findings in RNA-seq experiments. Two common and highly informative plots used in this context are the volcano plot and the CPM (Counts Per Million) scatter plot. These visualizations provide intuitive insights into the data, highlighting genes that are significantly differentially expressed and their expression levels across conditions. This protocol was developed using a system equipped with an Intel 10th generation i9-10910 processor and 48GB of memory. The test environment includes R version 4.4.0 running on macOS 14.4.1.

Volcano Plot

A volcano plot combines statistical significance with the magnitude of change (fold change) in a single plot, making it a powerful tool for identifying DEGs. It plots the negative log10 adjust p-value or false discovery rate (FDR) against the log2 fold change, allowing for easy identification of genes that are both statistically significant and biologically meaningful.

Required Packages:

library(ggplot2) 
library(ggrepel)
library(egg)

Step-by-Step Guide

1. Load Your DEG Data as Base Type

  • Load the DEG data from a CSV (can be used a txt file) file into a data frame. Ensure that the first row contains headers and the first column contains row names.
    Ctr <- read.csv("Volcano.csv", head = T, row.names = 1)
    
  • The DEG data can be utilized directly for analysis. In this instance, ROS marker genes are identified (ROS.hallmark) and labeled (ROS) separately. Example of the first few rows of the data:
    > head(Ctr)
                    SYMBOL    logFC      logCPM   PValue      FDR ROS.hallmark
    ENSG00000204472    AIF1 9.025174  1.58214476 2.34e-40 1.62e-38             
    ENSG00000142168    SOD1 7.315282 -0.07862934 3.15e-13 3.49e-12          ROS
    ENSG00000268651  CTAG1A 6.390608  0.43013880 1.64e-07 1.06e-06             
    ENSG00000170417 TMEM182 6.254823  0.32867916 6.20e-07 3.72e-06             
    ENSG00000214960   CRPPA 6.254823  0.37677955 6.20e-07 3.72e-06             
    ENSG00000164675    IQUB 6.104918 -0.43713437 2.35e-06 1.32e-05 
    

2. Draw a Basal Plot

  • Use ggplot2 to create the basl Volcano plot:
    ggplot(data = Ctr, aes(x = logFC, y = FDR)) + geom_point()
    

3. Adjust the y-axis to Display in Negative log10 FDR Format and Add Guide Lines

  • Update the plot to transform the y-axis and add guide lines:
    p <- ggplot(data = Ctr, aes(x = logFC, y = -log10(FDR))) + geom_point() + theme_minimal()
    p2 <- p + geom_vline(xintercept=c(-1, 1), col = "red") + geom_hline(yintercept = -log10(0.05), col = "red") 
    

4. Seperate the Section of Significance by Appying FC and FDR

  • Update the plot to distinguish the section of significant DEGs by appying FC and FDR:

    # Filter data
    Ctr$diffexpressed <- "NO"
    Ctr$diffexpressed[Ctr$logFC > 1 & Ctr$PValue < 0.05] <- "UP"
    Ctr$diffexpressed[Ctr$logFC < -1 & Ctr$PValue < 0.05] <- "DOWN"
    
    # Plotting
    p <- ggplot(data = Ctr, aes(x = logFC, y = -log10(FDR))) +
      geom_point(aes(color = diffexpressed), 
                size = 4, 
                shape = 21, 
                fill = c("#294A99", "#ECECEC", "#BE1826")[factor(Ctr$diffexpressed)], 
                alpha = 0.7, 
                stroke = 0.1) +
      scale_color_manual(values = c("DOWN" = "black", "UP" = "black", "NO" = "black")) +
      theme_minimal() +
      theme(legend.position = "none") +
      geom_vline(xintercept = c(-1, 1), col = "red") +
      geom_hline(yintercept = -log10(0.05), col = "red")
    
  • In these codes, you can adjust the size (size), shape (shape), transparency (alpha), and stroke (stroke) of the points to fit your preferences. The fill parameter allows you to specify colors for "DOWN", "NO", and "UP" states using hex codes. Refer to the ggplot2 documentation for more detailed information.

  • Output:

    Volcano_step1

5. Represent Gene Names that Satisfy Specific Conditions Using ggrepel

  • In this case, I found ROS-related genes in papers and databases and labeled them as ROS in the ROS.hallmark column. Replace the values in this column with the corresponding values in the SYMBOL column for plotting:

    p2 <- p + 
      geom_text_repel(
        data = subset(Ctr, ROS.hallmark == "ROS"), 
        aes(label = SYMBOL),
        box.padding = 0.35, 
        point.padding = 0.3,
        max.overlaps = Inf,
        segment.color = 'black'
      )
    
  • To display only genes that meet specific criteria (e.g. DEGs), filter the data accordingly:
    # Filter data
    filtered_data <- subset(Ctr, ROS.hallmark == "ROS" & (logFC >= 1 | logFC <= -1) & FDR < 0.05)
    
    # Plotting
    p2 <- p + 
      geom_text_repel(
        data = filtered_data, 
        aes(label = SYMBOL),
        box.padding = 0.35, 
        point.padding = 0.3,
        max.overlaps = Inf,
        segment.color = 'black'
      )
    
  • Output:

    Volcano_step2

Summarize

Ctr <- read.csv("Volcano.csv", head = T, row.names = 1)
filtered_data <- subset(Ctr, ROS.hallmark == "ROS" & (logFC >= 1 | logFC <= -1) & FDR < 0.05)
p <- ggplot(data = Ctr, aes(x = logFC, y = -log10(FDR))) +
  geom_point(aes(color = diffexpressed), size = 4, shape = 21, 
             fill = c("#294A99", "#ECECEC", "#BE1826")[factor(Ctr$diffexpressed)], 
             alpha = 0.7, stroke = 0.1) +
  scale_color_manual(values = c("DOWN" = "black", "UP" = "black", "NO" = "black")) +
  theme_minimal() +
  theme(legend.position = "none",
        panel.border = element_rect(color = "black", fill = NA, size = 1)) +
  geom_vline(xintercept = c(-1, 1), col = "red") +
  geom_hline(yintercept = -log10(0.05), col = "red") +
  geom_text_repel(
    data = filtered_data, 
    aes(label = SYMBOL),
    box.padding = 0.35, 
    point.padding = 0.3,
    max.overlaps = Inf,
    segment.color = 'black',
    size = 3  
  )

# Adjust plot size for publication
p2 <- set_panel_size(p, width = unit(49, "mm"), height = unit(49, "mm"))
grid.arrange(p2)

CPM Scatter Plot

The CPM scatter plot is used to visualize the normalized expression levels of genes between two conditions. By plotting the log-transformed CPM values of one condition against another, this scatter plot highlights the overall distribution and correlation of gene expression levels, helping to identify outliers and trends.

Step-by-Step Guide

1. Load Your CPM Data as Base Type

  • Load the CPM data from a CSV file into a data frame. Ensure that the first row contains headers and the first column contains row names:
    Ctr <- read.csv("Scatter_CPM.csv", head = T, row.names = 1)
    
  • Example of the first few rows of the data:
    > head(Ctr)
                        shCon    shMLL1    shNRF2     shUTX
    ENSG00000000003 -2.680285 -2.680285 -2.680285 -1.549635
    ENSG00000000005  4.307581  4.677028  4.580042  4.872485
    ENSG00000000419 -1.749770 -1.753980 -1.171245 -1.549635
    ENSG00000000460  5.275807  5.093826  5.081307  4.986667
    ENSG00000000938  3.287921  3.437214  4.075670  3.266172
    ENSG00000000971  5.988133  4.288445  5.394514  4.854088
    

2. Caculate Correlation Coefficiency Between Samples

  • Calculate the linear model to assess the relationship between two samples. This can help determine how similar the expression levels are between two conditions:
    # Fit a lineat model
    model <- lm(formula = shCon ~ shMLL1, data = Ctr)
    
    # Summarize the model to get the correlation coefficient
    model_summary <- summary(model)
    
    # Extract coefficiency
    intercept <- model$coefficients[1]
    slope <- model$coefficients[2]
    r_squared <- model_summary$r.squared
    
  • Output:
    > model <- lm(formula = shCon ~ shMLL1, data = Ctr)
    > model
    Call:
    lm(formula = shCon ~ shMLL1, data = Ctr)
    
    Coefficients:
    (Intercept)       shMLL1  
      -0.01902      0.97918  
    
    > model_summary <- summary(model)
    > model_summary
    Call:
    lm(formula = shCon ~ shMLL1, data = Ctr)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -4.9938 -0.2207 -0.0368  0.2283  7.4868 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -0.019021   0.007697  -2.471   0.0135 *  
    shMLL1       0.979179   0.001696 577.328   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.6539 on 10026 degrees of freedom
    Multiple R-squared:  0.9708,	Adjusted R-squared:  0.9708 
    F-statistic: 3.333e+05 on 1 and 10026 DF,  p-value: < 2.2e-16
    

3. Make a scatter plot

  • In this section, you will create a scatter plot to visualize the relationship between two conditions (shCon and shMLL1) and classify genes based on their expression changes:
    # Calculate difference between samples
    classify <- (Ctr$shMLL1) - (Ctr$shCon)
    
    # Classify genes
    Ctr$diffexpressed <- "NO"
    Ctr$diffexpressed[classify > 0.5] <- "UP"
    Ctr$diffexpressed[classify < (-0.5)] <- "DOWN"
    
    # Depict a plot with the linear model
    p <- ggplot(data = Ctr, aes(x = shCon, y = shMLL1)) +
      geom_point(aes(color = diffexpressed), size = 4, shape = 21, 
                fill = c("#294A99", "#ECECEC", "#BE1826")[factor(Ctr$diffexpressed)], 
                alpha = 0.7, stroke = 0.1) +
      scale_color_manual(values = c("DOWN" = "black", "UP" = "black", "NO" = "black")) +
      theme_minimal() +
      theme(legend.position = "none",
            panel.border = element_rect(color = "black", fill = NA, size = 1)) +
      geom_smooth(method = "lm", se = FALSE, color = "blue") +  
      annotate("text", x = Inf, y = -Inf, label = sprintf("y = %.4fx + %.4f\nR² = %.4f", slope, intercept, r_squared),
              hjust = 1.1, vjust = -0.1, size = 5, color = "blue")
    
  • Output:

    Scatter

Summarize

Ctr <- read.csv("Scatter_CPM.csv", head = T, row.names = 1)
model <- lm(formula = shCon ~ shMLL1, data = Ctr)
model_summary <- summary(model)
intercept <- model$coefficients[1]
slope <- model$coefficients[2]
r_squared <- model_summary$r.squared

p <- ggplot(data = Ctr, aes(x = shCon, y = shMLL1)) +
  geom_point(aes(color = diffexpressed), size = 4, shape = 21, 
            fill = c("#294A99", "#ECECEC", "#BE1826")[factor(Ctr$diffexpressed)], 
            alpha = 0.7, stroke = 0.1) +
  scale_color_manual(values = c("DOWN" = "black", "UP" = "black", "NO" = "black")) +
  theme_minimal() +
  theme(legend.position = "none",
        panel.border = element_rect(color = "black", fill = NA, size = 1)) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  
  annotate("text", x = Inf, y = -Inf, label = sprintf("y = %.4fx + %.4f\nR² = %.4f", slope, intercept, r_squared),
          hjust = 1.1, vjust = -0.1, size = 5, color = "blue")

# Adjust plot size for publication
p2 <- set_panel_size(p, width = unit(49, "mm"), height = unit(49, "mm"))
grid.arrange(p2)

Citations

R/ggplot
  1. R Core Team, R. (2013). R: A language and environment for statistical computing. HTML
  2. Wilkinson, L. (2011). ggplot2: elegant graphics for data analysis by WICKHAM, H. DOI