© 2022 Janghyun Choi
Docer CC BY-NC-SA 4.0

Exploring Deep-sequencing Data with deepTools

The deepTools effectively manages the vast data volumes routinely produced from sequencing platforms. This tool suite includes versatile modules designed to process mapped reads, performing numerous quality assessments. It also generates normalized coverage files in standard bedGraph and bigWig formats, facilitating comparisons between various datasets, such as treatment and control groups. This protocol was created based on deepTools version 3.5.1 running on a system equipped with an Intel 10th generation i9-10910 processor and 48GB of memory. The test environment includes Python version 3.8.5, SciPy version 1.6.2, NumPy version 1.20.1, and conda version 22.9.0 under macOS 12.4 environment.

Installation deepTools

EnvTest EnvTest gith githio test PyPI Version

To install deepTools using Anaconda, use the following command:

$ conda install -c conda-forge -c bioconda deeptools

Troubleshooting in Linux Environment

  1. If error occurs for freeze environment, update conda as follow;
     $ sudo apt-get update
     $ conda update xz
    
  2. If error occurs for repository, remove it as follow;
     $ sudo add-apt-repository --remove ppa:[Dir]/[Sub-Dir]
    
     # For example
     $ sudo add-apt-repository --remove ppa:webupd8team/sublime-text-3
    
  3. If deeptools cannot be run due to libcryto.so.1.0.0 source error, refer to document of ‘phyton setting’

BamCoverage: Convert BAM into bigWig

  • BamCoverage module takes an alignment of reads or fragments as input (BAM file) and generates a coverage track (bigWig or bedGraph) as output. Use the following command to convert BAM to BigWig with BamCoverage:

    This module requires a BAM file sorted by position and a bai file in the same folder.

      $ bamCoverage -b <input.bam> -o <output.bigwig> -of <outputType> -bs <int> \
          --normalizeUsing <Norm> --effectiveGenomeSize <int> --smoothLength <int> \
          --region <region> -p <int> -e
    
  • In these commands,

    ParameterDescription
    -b <input.bam>Specifies the input data.
    -o <output.bigwig>Specifies output file.
    -of <outputType>Specifies format of the output data. Possible values are bigWig or bdg.
    -bs <int>Size of the bins, in bases, for the output of the bigwig/bedgraph file (Default: 50).
    -bs 1 setting is recommended for publication. The smaller the value of -bs, the more computation.
    --normalizeUsing <Norm>Specifies whether to apply a method to normalize the number of reads per bin.
    Possible values for <Norm> are RPCG, RPKM, CPM, or none (default: none).
    --effectiveGenomeSize <int>The effective genome size (EGS) is the portion of the genome that is mappable. this parameter is couple with RPGC normalization.
    human: 2.7e9 (2700000000), mouse: 1.87e9 (1870000000).
    --smoothLength <int>The smooth length defines a window, larger than the binSize, to average the number of reads. For example, if the –binSize is set to 20 and the –smoothLength is set to 60, then, for each bin, the average of the bin and its left and right neighbors is considered.
    For Publication, -bs 1, --smoothLength 10settting is recommended.
    The larger the value of ---smoothLength, the more computation.
    --region <region>Region of the genome to limit the operation to - this is useful when testing parameters to reduce the computing time.
    The format is chr:start:end, for example --region chr2:178104147:178154424.
    -p <int>Specifies thread numbers <int>
    -e optionThis parameter allows the extension of reads to fragment size. If set, each read is extended, without exception. Can not apply to RNA-seq data.

Example code

  • Here is an example command to perform converting a bigWig file from the BAM file:

      # Coded to quickly identify track patterns.
      $ bamcoverage -b Final.sort.Unt_IPT.bam -o /Users/jchoi/desktop/Unt_IPT.bigwig \
          -bs 100 --normalizeUsing RPGC --effectiveGenomeSize 2700000000 -e -p 10
    
      # For GEO submission, it takes a very long time.
      $ bamcoverage -b Final.sort.Unt_IPT.bam -o /Users/jchoi/desktop/Unt_IPT.bigwig -bs 1 \
          --normalizeUsing RPGC --effectiveGenomeSize 2700000000 --smoothLength 10 -e -p 10
      $ bamcoverage -b Final.sort.Unt_IPT.bam -o /Users/jchoi/desktop/Unt_IPT.bigwig -bs 5 \
          --normalizeUsing RPGC --effectiveGenomeSize 2700000000 --smoothLength 20 -e -p 10
    
      # This code makes the bigWig file by specifying a target gene region as wide, which can be a time saver.
      $ bamcoverage -b Final.sort.Unt_MLL1.bam -o /Users/jchoi/desktop/Unt_MLL1_NRF2.bigwig -bs 1 \
          --normalizeUsing RPGC --effectiveGenomeSize 2700000000 --region chr2:178104147:178154424 \ 
          --smoothLength 20 -e
    

Output

  • When BamCoverage running, it prints messages printing as the follow:
          bamFilesList: ['Final.sort.Unt_MLL1.bam']
                  binLength: 1
                  numberOfSamples: None
                  blackListFileName: None
                  skipZeroOverZero: False
                  bed_and_bin: False
                  genomeChunkSize: None
                  defaultFragmentLength: 254
                  numberOfProcessors: 2
                  verbose: False
                  region: None
                  bedFile: None
                  minMappingQuality: None
                  ignoreDuplicates: False
                  chrsToSkip: []
                  stepSize: 1
                  center_read: False
                  samFlag_include: None
                  samFlag_exclude: None
                  minFragmentLength: 0
                  maxFragmentLength: 0
                  zerosToNans: False
                  smoothLength: None
                  save_data: False
                  out_file_for_raw_data: None
                  maxPairedFragmentLength: 1016
    

bigwigCompare and bamCompare

  • bigwigCompare compares two bigWig files based on the number of mapped reads. To compare the bigWig files, the genome is partitioned into bins of equal size, then the number of reads found in each BAM file are counted per bin and finally a summary value is reported. Similarly, bamCompare can be used to generate a bigWig or bedGraph file based on two BAM files that are compared to each other while being simultaneously normalized for sequencing depth. Use the following command to compare two bigWigs or two BAMs:

      # bigwigCompare
      $ bigwigCompare -b1 <target.bigwig> -b2 <control.bigwig> --operation <ratio> \
          -o <output.bigwig> -p <INT>
    
      # bamCompare
      $ bamCompare -b1 <target.bam> -b2 <control.bam> --operation <ratio> \
          -o <output.bigwig> -p <INT>
    
  • In these commands,

    ParameterDescription
    -b1 <target.bigwig>Specifies the bigwig or the sorted BAM file for the target/treatment.
    -b2 <control.bigwig>Specifies the bigwig or the sorted BAM file for the control.
    ---operation <ratio>Specifies how to determine two values are compared.
    Possible values for <ratio> are log2, ratio, substrate, add, or mean (default: log2).
    -o <output.bigwig>Specifies output file.
    -p <int>Specifies thread numbers <int>

Example code (Results are not printed)

  • Here is an example command to perform bigwigCompare and bamCompare:

      # bigwigCompare
      $ bigwigCompare -b1 /Users/jchoi/desktop/Unt_H3K4me3.bigwig \
          -b2 /Users/jchoi/desktop/Unt_IPT.bigwig --operation log2 \
          -o /Users/jchoi/desktop/Matrix/Unt_IPT_H3K4me3.bigwig -p 10
      $ bigwigCompare -b1 /Users/jchoi/desktop/Unt_H3K27me3.bigwig \
          -b2 /Users/jchoi/desktop/Unt_IPT.bigwig --operation log2 \
          -o /Users/jchoi/desktop/Matrix/Unt_IPT_H3K27me3.bigwig -p 10
    
      # bamCompare
      $ bamCompare -b1 /Users/jchoi/desktop/BAM/Sorted.Unt_NFIB.bam \
          -b2 /Users/jchoi/desktop/BAM/Sorted.Unt_IPT.bam --operation log2 \
          -o /Users/jchoi/desktop/BAM/Matrix/Unt_IPT_NFIB.bigwig -p 10
      $ bamCompare -b1 /Users/jchoi/desktop/BAM/Sorted.Unt_MLL1.bam \
          -b2 /Users/jchoi/desktop/BAM/Sorted.Unt_IPT.bam --operation log2 \
          -o /Users/jchoi/desktop/BAM/Matrix/Unt_IPT_MLL1.bigwig -p 10
    

computeMatrix: Prepare matrix data

  • computeMatrix calculates scores per genome regions and prepares an intermediate file that can be used with plotHeatmap and plotProfiles. computeMatrix accepts multiple score files (bigWig format) and multiple regions files (BED format). This tool can also be used to filter and sort regions according to their score. Use the following command to make a matrix data using bigWig files generated from bigwigCompare:

      $ computeMatrix <mode> --referencePoint <point> -b <int> -a <int> -R <referenceGenome> \
          -S <bigwigCompared_1.bigwig>...<bigwigCompared_N.bigwig> \
          --skipZeros --missingDataAsZero -o <matrix file.gz> -p <int>
    
  • In these commands,

    ParameterDescription
    <mode>Possible values are reference-point or scale-regions. See the bellow
    --referencePoint <point>Specifies a reference point. The reference point for the plotting could be either the region start (TSS), the region end (TES) or the center of the region.
    -b <int>Distance upstream of the reference-point selected.
    -a <int>Distance downstream of the reference-point selected.
    -R <referenceGenome>File name or names, in BED or GTF format, containing the regions to plot. It is recommend to use the GTF file obtained in the htseeq-count step.
    -S <bigwigCompared_1.bigwig>...<bigwigCompared_N.bigwig>specifies input data, it is separated by spaces. These files should be obtained by using the bamCoverage or bamCompare modules.
    --skipZeros optionWhether regions with only scores of zero should be included or not.
    --missingDataAsZero optionIf set, missing data (NAs) will be treated as zeros.
    -o <matrix file.gz>Specifies output file. The matrix file must have a ‘gz’ extension.
    -p <int>Specifies thread numbers <int>

    The reference-point mode plots genomic positions upstream and downstream from a specified position within a genome region, such as the starting point. Conversely, the scale-regions mode adjusts all regions in the genome file to a user-defined length, stretching or shrinking them as necessary.

Example code

  • Here is an example command to prepare a matrix data with computeMatrix:

      # Mouse, reference-point mode, TSS, TF ChIP-seq
      $ computeMatrix reference-point --referencePoint TSS -b 3000 -a 3000 \
          -R /Users/jchoi/desktop/RefGene/mm10.refGene.gtf
          -S IPT_shCtrl_rep1.bigwig IPT_shCtrl_rep2.bigwig IPT_shNFIB_rep1.bigwig IPT_shNFIB_rep2.bigwig \
          --skipZeros --missingDataAsZero -o /Users/jchoi/desktop/bigWig/Matrix/matrix_TSS.gz -p 10
    
      # Rice, reference-point mode, TES, histone ChIP-seq
      $ computeMatrix reference-point --referencePoint TES -b 3000 -a 3000 \
          -R /Users/jchoi/Desktop/ProfYoon/O_Sativa_Genome/RGAP_Used/all.gtf \
          -S /Users/jchoi/Desktop/ProfYoon/ChIPSeq_Track/LowBin_WTe.bigwig /Users/jchoi/Desktop/ProfYoon/ChIPSeq_Track/LowBin_ehd1e.bigwig \
          --skipZeros --missingDataAsZero -o /Users/jchoi/Desktop/matrix_TES.gz -p 10
    

Output

  • This code does not output any results or progress, but sometimes you will see a message like the one below:
      Skipping XX_XXXXXX_XX, due to being absent in the computeMatrix output.
      Skipping NR_164356_11, due to being absent in the computeMatrix output. 
    
  • Ignore these mentions because they mean missing data and has been processed by --skipZeros and --missingDataAsZero commands.

plotHeatmap

  • plotHeatmap creates a heatmap for scores associated with genomic regions. The program requires a matrix file generated by the computeMatrix. Use the following command to make a heatmap:

      $ plotHeatmap -m <matrixFile.gz> -out <output.extention> --colorMap <colorCode> --colorList <"colorCode"> \
          --whatToShow <'type'> --zMin <group1 ... groupN> --zMax <group1 ... groupN> \
          --yMin <group1 ... groupN> --yMax <group1 ... groupN> \
          --kmeans <int> --perGroup --dpi <int> 
    
  • In these commands,

    ParameterDescription
    -m <matrixFile.gz>Specifies matrix file.
    -out <output.extention>Specifies output file. Possible values for file extention are png, eps, pdf, and svg.
    --colorMap <colorCode>Color map to use for the heatmap. If more than one heatmap is being plotted the color of each heatmap can be enter individually (e.g. –colorMap Reds Blues). Available color codes
    --colorList <colorCode>List of colors to use to create a colormap. For example, if –colorList black,yellow,blue is set (colors separated by comas) then a color map that starts with black, continues to yellow and finishes in blue is created. You can also use hexagonal color codes (#000000).
    --whatToShow <'type'>The default is to include a summary or profile plot on top of the heatmap and a heatmap colorbar. Other options are 'plot and heatmap', 'heatmap only', and 'heatmap and colorbar'. Not recommended for use.
    --zMin/--zMax optionadjust heatmap intensities. If more than one heatmap is being plotted the intensity of each heatmap can be enter individually.
    --yMin/--yMax optionadjust plot range of the y-axis. If more than one heatmap is being plotted the intensity of each heatmap can be enter individually.
    --kmeans <int>Number of clusters to compute. When this option is set, the matrix is split into clusters using the k-means algorithm. --hclustoption is also possible for Hierarchical clustering.
    --perGroup optionThe default is to plot all groups of regions by sample. Using this option instead plots all samples by group of regions.

Example Code (Results are not printed)

  • Here is an example command to plot heatmap with plotHeatmap:

      $ plotHeatmap -m /Users/jchoi/desktop/bigWig/Matrix/matrix_TSS.gz \
          -out /Users/jchoi/desktop/Distribution.svg --colorMap RdBu
          --zMin -1 --zMax 6 --yMin 0 --yMax 4 --dpi 300
    
      $ plotHeatmap -m /Users/jchoi/desktop/bigWig/Matrix/matrix_TSS.gz \
          -out /Users/jchoi/desktop/Distribution_heatmap.svg --colorMap RdBu
          --whatToShow 'heatmap only' --zMin -1 --zMax 6
          --kmeans 3 --dpi 300
    
      $ plotHeatmap -m /Users/jchoi/Desktop/matrix_TSS.gz 
          -out /Users/jchoi/Desktop/Rice_H3ac.svg --colorList "white,#bd1829" 
          --yMin 20 --yMax 50 --zMin 40 
    

plotProfile

  • plotProfile creates a profile plot for scores over sets of genomic regions. This is a simplified version of the plotHeatmap provided above that only plots individual line graphs. The parameters for this function are quite similar to others, so I will keep the explanation brief without example code.

    Unfortunately, plotProfile is currently not used by many users.

      $ plotProfile -m <matrix file.gz> -out <output.extention> --plotType=<select>
          --color <'color code'> --numPlotsPerRow <int> --plotTitle <"Test data profile">
          --kmeans <int> --perGroup --dpi <int>
    
  • In these commands,

    • --plotType=<select>: Possible values are lines, fill, se, std, overlapped_lines, and heatmap.
    • --color <'color code'>: List of colors to use for the plotted lines (color names or hex strings). For example --colors '#e89b81' '#959dc8' '#f6f0a8'
    • --numPlotsPerRow <int>: Number of plots per row
    Critical Note
    1. Overall, the plotHeatmap and plotProfile functions, powered by computeMatrix, are commonly used to visualize the tag distribution of peaks and are featured prominently in numerous publications.
    2. However, for more detailed analysis, I recommend using Python-based ngs.plot.r or the ChIPseeker R package , which offer enhanced detail and additional analytical capabilities.

multiBigwigSummary and multiBamSummary

  • multiBigwigSummary and multiBamSummary compute the average score/read coverages for genomic regions for two or more bigWig and BAM files, respectively. This analysis is performed for the entire genome by running the program in bins mode, or for certain user selected regions in BED-file mode. Most commonly, the default output of multiBigwigSummary (a compressed numpy array, .npz) is used by other tools such as plotCorrelation or plotPCA for visualization and diagnostic purposes. Use the following command to make a summary matrix from bigWig files with multiBigwigSummary:

      $ multiBigwigSummary bins -b <file1.bigwig>...<fileN.bigwig> -l <file1>...<fileN> -o <results.npz>
      $ multiBamSummary bins -b <file1.bam>...<fileN.bam> -l <file1>...<fileN> -e -o <results.npz>
    
  • In these commands,

    • -b <file1.bigwig>...<fileN.bigwig>: Specifies input data, it is separated by spaces.
    • -l <file1>...<fileN>: User defined labels instead of default labels from file names. Multiple labels have to be separated by a space.
    • -o <results.npz>: Specifies output data as npz format.
    • -eoption: only multiBamSummary, allows the extension of reads to fragment size.

Example Code (Results are not printed)

  • Here is an example command to make a coverage matrix with multiBigwigSummary:

      $ multiBigwigSummary bins -b shControl_Input.bigwig shControl_H3K4me3.bigwig shNFIB_H3K4me3.bigwig shMLL1_H3K4me3.bigwig \
          -l Input shControl shNFIB shMLL1 -o /Users/jchoi/Desktop/CorrMatrix.npz
    

plotCorrelation

  • plotCorrelationanalyzes and visualizes sample correlations based on the output of multiBamSummary or multiBigwigSummary. Pearson or Spearman methods are available to compute correlation coefficients. Use the following command to perform plotting for correlation with plotCorrelation:

      $ plotCorrelation -in <CorrMatrix.npz> --corMethod <select> --skipZeros --whatToPlot <type> \
          --removeOutliers --colorMap <color> --plotNumbers -o <output.extention> --outFileCorMatrix <matrix_name.tab>
    
  • In these commands,

    ParameterDescription
    -in <CorrMatrix.npz>Specifies a coverage matrix file.
    ---corMethod <select>Specifies a correlation method. Possible values for <select> are spearman and pearson.
    --skipZeros optionBy setting this option, genomic regions that have zero or missing (NaN) values in all samples are excluded.
    --whatToPlot <type>Choose between a heatmap or pairwise scatter plots. Possible values for <type> are heatmap and scatterplot.
    --removeOutliers optionIf set, bins with very large counts are removed (only pearson).
    --colorMap <color>Color map to use for the heatmap.
    --zMin/--zMax optionadjust heatmap intensities. If not specified, the value is set automatically.
    --xRange/--yRange optionadjust plot range of the x/y-axis. If not specified, the value is set automatically.
    --plotNumbers optionIf set, then the correlation number is plotted on top of the heatmap (only heatmap).
    -o <output.extention>Specifies output file. Possible values for file extention are png, eps, pdf, and svg (recommended).
    --outFileCorMatrix <matrix_name.tab>Save matrix with pairwise correlation values to a tab-separated file.

Example Code (Results are not printed)

  • Here is an example command to make a correlation plot with plotCorrelation:
      # Heatmap, Pearson r
      $ plotCorrelation -in /Users/jchoi/Desktop/CorrMatrix.npz --corMethod pearson --skipZeros \
          --whatToPlot heatmap --removeOutliers --colorMap RdYlBu --plotNumbers \
          -o /Users/jchoi/Desktop/heatmap_PearsonCorr.svg --outFileCorMatrix /Users/jchoi/Desktop/heatmap_PearsonCorr.tab
    
      # Scatter plot, Spearman r
      $ plotCorrelation -in /Users/jchoi/Desktop/CorrMatrix.npz --corMethod spearman --skipZeros \
          --whatToPlot scatterplot --colorMap RdYlBu -o /Users/jchoi/Desktop/heatmap_SpearmanCorr.svg \
          --outFileCorMatrix /Users/jchoi/Desktop/heatmap_SpearmanCorr.tab
    

Citations

deeptools
  1. Ramírez, F., Dündar, F., Diehl, S., Grüning, B. A., & Manke, T. (2014). deepTools: a flexible platform for exploring deep-sequencing data. Nucleic acids research, 42(W1), W187-W191. DOI
  2. Ramírez, F., Ryan, D. P., Grüning, B., Bhardwaj, V., Kilpert, F., Richter, A. S., ... & Manke, T. (2016). deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic acids research, 44(Web Server issue), W160. DOI