Exploring Deep-sequencing Data with deepTools
The deepTools effectively manages the vast data volumes routinely produced from sequencing platforms. This tool suite includes versatile modules designed to process mapped reads, performing numerous quality assessments. It also generates normalized coverage files in standard bedGraph and bigWig formats, facilitating comparisons between various datasets, such as treatment and control groups. This protocol was created based on deepTools version 3.5.1 running on a system equipped with an Intel 10th generation i9-10910 processor and 48GB of memory. The test environment includes Python version 3.8.5, SciPy version 1.6.2, NumPy version 1.20.1, and conda version 22.9.0 under macOS 12.4 environment.
Installation deepTools
To install deepTools using Anaconda, use the following command:
$ conda install -c conda-forge -c bioconda deeptools
Troubleshooting in Linux Environment
- If error occurs for freeze environment, update conda as follow;
$ sudo apt-get update $ conda update xz
- If error occurs for repository, remove it as follow;
$ sudo add-apt-repository --remove ppa:[Dir]/[Sub-Dir] # For example $ sudo add-apt-repository --remove ppa:webupd8team/sublime-text-3
- If deeptools cannot be run due to
libcryto.so.1.0.0
source error, refer to document of ‘phyton setting’
BamCoverage: Convert BAM into bigWig
BamCoverage module takes an alignment of reads or fragments as input (BAM file) and generates a coverage track (bigWig or bedGraph) as output. Use the following command to convert BAM to BigWig with BamCoverage:
This module requires a BAM file sorted by position and a bai file in the same folder.
$ bamCoverage -b <input.bam> -o <output.bigwig> -of <outputType> -bs <int> \ --normalizeUsing <Norm> --effectiveGenomeSize <int> --smoothLength <int> \ --region <region> -p <int> -e
In these commands,
Parameter Description -b <input.bam>
Specifies the input data. -o <output.bigwig>
Specifies output file. -of <outputType>
Specifies format of the output data. Possible values are bigWig or bdg. -bs <int>
Size of the bins, in bases, for the output of the bigwig/bedgraph file (Default: 50). -bs 1
setting is recommended for publication. The smaller the value of-bs
, the more computation.--normalizeUsing <Norm>
Specifies whether to apply a method to normalize the number of reads per bin.
Possible values for<Norm>
areRPCG
,RPKM
,CPM
, ornone
(default:none
).--effectiveGenomeSize <int>
The effective genome size (EGS) is the portion of the genome that is mappable. this parameter is couple with RPGC
normalization.
human: 2.7e9 (2700000000), mouse: 1.87e9 (1870000000).--smoothLength <int>
The smooth length defines a window, larger than the binSize, to average the number of reads. For example, if the –binSize is set to 20 and the –smoothLength is set to 60, then, for each bin, the average of the bin and its left and right neighbors is considered.
For Publication,-bs 1, --smoothLength 10
settting is recommended.
The larger the value of---smoothLength
, the more computation.--region <region>
Region of the genome to limit the operation to - this is useful when testing parameters to reduce the computing time.
The format is chr:start:end, for example--region chr2:178104147:178154424
.-p <int>
Specifies thread numbers <int>
-e
optionThis parameter allows the extension of reads to fragment size. If set, each read is extended, without exception. Can not apply to RNA-seq data.
Example code
Here is an example command to perform converting a bigWig file from the BAM file:
# Coded to quickly identify track patterns. $ bamcoverage -b Final.sort.Unt_IPT.bam -o /Users/jchoi/desktop/Unt_IPT.bigwig \ -bs 100 --normalizeUsing RPGC --effectiveGenomeSize 2700000000 -e -p 10 # For GEO submission, it takes a very long time. $ bamcoverage -b Final.sort.Unt_IPT.bam -o /Users/jchoi/desktop/Unt_IPT.bigwig -bs 1 \ --normalizeUsing RPGC --effectiveGenomeSize 2700000000 --smoothLength 10 -e -p 10 $ bamcoverage -b Final.sort.Unt_IPT.bam -o /Users/jchoi/desktop/Unt_IPT.bigwig -bs 5 \ --normalizeUsing RPGC --effectiveGenomeSize 2700000000 --smoothLength 20 -e -p 10 # This code makes the bigWig file by specifying a target gene region as wide, which can be a time saver. $ bamcoverage -b Final.sort.Unt_MLL1.bam -o /Users/jchoi/desktop/Unt_MLL1_NRF2.bigwig -bs 1 \ --normalizeUsing RPGC --effectiveGenomeSize 2700000000 --region chr2:178104147:178154424 \ --smoothLength 20 -e
Output
- When
BamCoverage
running, it prints messages printing as the follow:bamFilesList: ['Final.sort.Unt_MLL1.bam'] binLength: 1 numberOfSamples: None blackListFileName: None skipZeroOverZero: False bed_and_bin: False genomeChunkSize: None defaultFragmentLength: 254 numberOfProcessors: 2 verbose: False region: None bedFile: None minMappingQuality: None ignoreDuplicates: False chrsToSkip: [] stepSize: 1 center_read: False samFlag_include: None samFlag_exclude: None minFragmentLength: 0 maxFragmentLength: 0 zerosToNans: False smoothLength: None save_data: False out_file_for_raw_data: None maxPairedFragmentLength: 1016
bigwigCompare and bamCompare
bigwigCompare
compares two bigWig files based on the number of mapped reads. To compare the bigWig files, the genome is partitioned into bins of equal size, then the number of reads found in each BAM file are counted per bin and finally a summary value is reported. Similarly,bamCompare
can be used to generate a bigWig or bedGraph file based on two BAM files that are compared to each other while being simultaneously normalized for sequencing depth. Use the following command to compare two bigWigs or two BAMs:# bigwigCompare $ bigwigCompare -b1 <target.bigwig> -b2 <control.bigwig> --operation <ratio> \ -o <output.bigwig> -p <INT> # bamCompare $ bamCompare -b1 <target.bam> -b2 <control.bam> --operation <ratio> \ -o <output.bigwig> -p <INT>
In these commands,
Parameter Description -b1 <target.bigwig>
Specifies the bigwig or the sorted BAM file for the target/treatment. -b2 <control.bigwig>
Specifies the bigwig or the sorted BAM file for the control. ---operation <ratio>
Specifies how to determine two values are compared.
Possible values for<ratio>
arelog2
,ratio
,substrate
,add
, ormean
(default:log2
).-o <output.bigwig>
Specifies output file. -p <int>
Specifies thread numbers <int>
Example code (Results are not printed)
Here is an example command to perform
bigwigCompare
andbamCompare
:# bigwigCompare $ bigwigCompare -b1 /Users/jchoi/desktop/Unt_H3K4me3.bigwig \ -b2 /Users/jchoi/desktop/Unt_IPT.bigwig --operation log2 \ -o /Users/jchoi/desktop/Matrix/Unt_IPT_H3K4me3.bigwig -p 10 $ bigwigCompare -b1 /Users/jchoi/desktop/Unt_H3K27me3.bigwig \ -b2 /Users/jchoi/desktop/Unt_IPT.bigwig --operation log2 \ -o /Users/jchoi/desktop/Matrix/Unt_IPT_H3K27me3.bigwig -p 10 # bamCompare $ bamCompare -b1 /Users/jchoi/desktop/BAM/Sorted.Unt_NFIB.bam \ -b2 /Users/jchoi/desktop/BAM/Sorted.Unt_IPT.bam --operation log2 \ -o /Users/jchoi/desktop/BAM/Matrix/Unt_IPT_NFIB.bigwig -p 10 $ bamCompare -b1 /Users/jchoi/desktop/BAM/Sorted.Unt_MLL1.bam \ -b2 /Users/jchoi/desktop/BAM/Sorted.Unt_IPT.bam --operation log2 \ -o /Users/jchoi/desktop/BAM/Matrix/Unt_IPT_MLL1.bigwig -p 10
computeMatrix: Prepare matrix data
computeMatrix
calculates scores per genome regions and prepares an intermediate file that can be used withplotHeatmap
andplotProfiles
.computeMatrix
accepts multiple score files (bigWig format) and multiple regions files (BED format). This tool can also be used to filter and sort regions according to their score. Use the following command to make a matrix data using bigWig files generated frombigwigCompare
:$ computeMatrix <mode> --referencePoint <point> -b <int> -a <int> -R <referenceGenome> \ -S <bigwigCompared_1.bigwig>...<bigwigCompared_N.bigwig> \ --skipZeros --missingDataAsZero -o <matrix file.gz> -p <int>
In these commands,
Parameter Description <mode>
Possible values are reference-point
orscale-regions
. See the bellow--referencePoint <point>
Specifies a reference point. The reference point for the plotting could be either the region start ( TSS
), the region end (TES
) or thecenter
of the region.-b <int>
Distance upstream of the reference-point selected. -a <int>
Distance downstream of the reference-point selected. -R <referenceGenome>
File name or names, in BED or GTF format, containing the regions to plot. It is recommend to use the GTF file obtained in the htseeq-count step. -S <bigwigCompared_1.bigwig>...<bigwigCompared_N.bigwig>
specifies input data, it is separated by spaces. These files should be obtained by using the bamCoverage
orbamCompare
modules.--skipZeros
optionWhether regions with only scores of zero should be included or not. --missingDataAsZero
optionIf set, missing data (NAs) will be treated as zeros. -o <matrix file.gz>
Specifies output file. The matrix file must have a ‘gz’ extension. -p <int>
Specifies thread numbers <int>
The
reference-point
mode plots genomic positions upstream and downstream from a specified position within a genome region, such as the starting point. Conversely, thescale-regions
mode adjusts all regions in the genome file to a user-defined length, stretching or shrinking them as necessary.
Example code
Here is an example command to prepare a matrix data with
computeMatrix
:# Mouse, reference-point mode, TSS, TF ChIP-seq $ computeMatrix reference-point --referencePoint TSS -b 3000 -a 3000 \ -R /Users/jchoi/desktop/RefGene/mm10.refGene.gtf -S IPT_shCtrl_rep1.bigwig IPT_shCtrl_rep2.bigwig IPT_shNFIB_rep1.bigwig IPT_shNFIB_rep2.bigwig \ --skipZeros --missingDataAsZero -o /Users/jchoi/desktop/bigWig/Matrix/matrix_TSS.gz -p 10 # Rice, reference-point mode, TES, histone ChIP-seq $ computeMatrix reference-point --referencePoint TES -b 3000 -a 3000 \ -R /Users/jchoi/Desktop/ProfYoon/O_Sativa_Genome/RGAP_Used/all.gtf \ -S /Users/jchoi/Desktop/ProfYoon/ChIPSeq_Track/LowBin_WTe.bigwig /Users/jchoi/Desktop/ProfYoon/ChIPSeq_Track/LowBin_ehd1e.bigwig \ --skipZeros --missingDataAsZero -o /Users/jchoi/Desktop/matrix_TES.gz -p 10
Output
- This code does not output any results or progress, but sometimes you will see a message like the one below:
Skipping XX_XXXXXX_XX, due to being absent in the computeMatrix output. Skipping NR_164356_11, due to being absent in the computeMatrix output.
- Ignore these mentions because they mean missing data and has been processed by
--skipZeros
and--missingDataAsZero
commands.
plotHeatmap
plotHeatmap
creates a heatmap for scores associated with genomic regions. The program requires a matrix file generated by thecomputeMatrix
. Use the following command to make a heatmap:$ plotHeatmap -m <matrixFile.gz> -out <output.extention> --colorMap <colorCode> --colorList <"colorCode"> \ --whatToShow <'type'> --zMin <group1 ... groupN> --zMax <group1 ... groupN> \ --yMin <group1 ... groupN> --yMax <group1 ... groupN> \ --kmeans <int> --perGroup --dpi <int>
In these commands,
Parameter Description -m <matrixFile.gz>
Specifies matrix file. -out <output.extention>
Specifies output file. Possible values for file extention are png
,eps
,pdf
, andsvg
.--colorMap <colorCode>
Color map to use for the heatmap. If more than one heatmap is being plotted the color of each heatmap can be enter individually (e.g. –colorMap Reds Blues). Available color codes --colorList <colorCode>
List of colors to use to create a colormap. For example, if –colorList black,yellow,blue is set (colors separated by comas) then a color map that starts with black, continues to yellow and finishes in blue is created. You can also use hexagonal color codes (#000000). --whatToShow <'type'>
The default is to include a summary or profile plot on top of the heatmap and a heatmap colorbar. Other options are 'plot and heatmap'
,'heatmap only'
, and'heatmap and colorbar'
. Not recommended for use.--zMin/--zMax
optionadjust heatmap intensities. If more than one heatmap is being plotted the intensity of each heatmap can be enter individually. --yMin/--yMax
optionadjust plot range of the y-axis. If more than one heatmap is being plotted the intensity of each heatmap can be enter individually. --kmeans <int>
Number of clusters to compute. When this option is set, the matrix is split into clusters using the k-means algorithm. --hclust
option is also possible for Hierarchical clustering.--perGroup
optionThe default is to plot all groups of regions by sample. Using this option instead plots all samples by group of regions.
Example Code (Results are not printed)
Here is an example command to plot heatmap with
plotHeatmap
:$ plotHeatmap -m /Users/jchoi/desktop/bigWig/Matrix/matrix_TSS.gz \ -out /Users/jchoi/desktop/Distribution.svg --colorMap RdBu --zMin -1 --zMax 6 --yMin 0 --yMax 4 --dpi 300 $ plotHeatmap -m /Users/jchoi/desktop/bigWig/Matrix/matrix_TSS.gz \ -out /Users/jchoi/desktop/Distribution_heatmap.svg --colorMap RdBu --whatToShow 'heatmap only' --zMin -1 --zMax 6 --kmeans 3 --dpi 300 $ plotHeatmap -m /Users/jchoi/Desktop/matrix_TSS.gz -out /Users/jchoi/Desktop/Rice_H3ac.svg --colorList "white,#bd1829" --yMin 20 --yMax 50 --zMin 40
plotProfile
plotProfile
creates a profile plot for scores over sets of genomic regions. This is a simplified version of theplotHeatmap
provided above that only plots individual line graphs. The parameters for this function are quite similar to others, so I will keep the explanation brief without example code.Unfortunately,
plotProfile
is currently not used by many users.$ plotProfile -m <matrix file.gz> -out <output.extention> --plotType=<select> --color <'color code'> --numPlotsPerRow <int> --plotTitle <"Test data profile"> --kmeans <int> --perGroup --dpi <int>
In these commands,
--plotType=<select>
: Possible values arelines
,fill
,se
,std
,overlapped_lines
, andheatmap
.--color <'color code'>
: List of colors to use for the plotted lines (color names or hex strings). For example--colors '#e89b81' '#959dc8' '#f6f0a8'
--numPlotsPerRow <int>
: Number of plots per row
Critical Note- Overall, the
plotHeatmap
andplotProfile
functions, powered bycomputeMatrix
, are commonly used to visualize the tag distribution of peaks and are featured prominently in numerous publications. - However, for more detailed analysis, I recommend using Python-based
ngs.plot.r
or theChIPseeker
R package , which offer enhanced detail and additional analytical capabilities.
multiBigwigSummary and multiBamSummary
multiBigwigSummary
andmultiBamSummary
compute the average score/read coverages for genomic regions for two or more bigWig and BAM files, respectively. This analysis is performed for the entire genome by running the program in bins mode, or for certain user selected regions in BED-file mode. Most commonly, the default output ofmultiBigwigSummary
(a compressed numpy array, .npz) is used by other tools such asplotCorrelation
orplotPCA
for visualization and diagnostic purposes. Use the following command to make a summary matrix from bigWig files withmultiBigwigSummary
:$ multiBigwigSummary bins -b <file1.bigwig>...<fileN.bigwig> -l <file1>...<fileN> -o <results.npz> $ multiBamSummary bins -b <file1.bam>...<fileN.bam> -l <file1>...<fileN> -e -o <results.npz>
In these commands,
-b <file1.bigwig>...<fileN.bigwig>
: Specifies input data, it is separated by spaces.-l <file1>...<fileN>
: User defined labels instead of default labels from file names. Multiple labels have to be separated by a space.-o <results.npz>
: Specifies output data as npz format.-e
option: onlymultiBamSummary
, allows the extension of reads to fragment size.
Example Code (Results are not printed)
Here is an example command to make a coverage matrix with
multiBigwigSummary
:$ multiBigwigSummary bins -b shControl_Input.bigwig shControl_H3K4me3.bigwig shNFIB_H3K4me3.bigwig shMLL1_H3K4me3.bigwig \ -l Input shControl shNFIB shMLL1 -o /Users/jchoi/Desktop/CorrMatrix.npz
plotCorrelation
plotCorrelation
analyzes and visualizes sample correlations based on the output ofmultiBamSummary
ormultiBigwigSummary
. Pearson or Spearman methods are available to compute correlation coefficients. Use the following command to perform plotting for correlation withplotCorrelation
:$ plotCorrelation -in <CorrMatrix.npz> --corMethod <select> --skipZeros --whatToPlot <type> \ --removeOutliers --colorMap <color> --plotNumbers -o <output.extention> --outFileCorMatrix <matrix_name.tab>
In these commands,
Parameter Description -in <CorrMatrix.npz>
Specifies a coverage matrix file. ---corMethod <select>
Specifies a correlation method. Possible values for <select>
arespearman
andpearson
.--skipZeros
optionBy setting this option, genomic regions that have zero or missing (NaN) values in all samples are excluded. --whatToPlot <type>
Choose between a heatmap or pairwise scatter plots. Possible values for <type>
areheatmap
andscatterplot
.--removeOutliers
optionIf set, bins with very large counts are removed (only pearson
).--colorMap <color>
Color map to use for the heatmap. --zMin/--zMax
optionadjust heatmap intensities. If not specified, the value is set automatically. --xRange/--yRange
optionadjust plot range of the x/y-axis. If not specified, the value is set automatically. --plotNumbers
optionIf set, then the correlation number is plotted on top of the heatmap (only heatmap
).-o <output.extention>
Specifies output file. Possible values for file extention are png
,eps
,pdf
, andsvg
(recommended).--outFileCorMatrix <matrix_name.tab>
Save matrix with pairwise correlation values to a tab-separated file.
Example Code (Results are not printed)
- Here is an example command to make a correlation plot with
plotCorrelation
:# Heatmap, Pearson r $ plotCorrelation -in /Users/jchoi/Desktop/CorrMatrix.npz --corMethod pearson --skipZeros \ --whatToPlot heatmap --removeOutliers --colorMap RdYlBu --plotNumbers \ -o /Users/jchoi/Desktop/heatmap_PearsonCorr.svg --outFileCorMatrix /Users/jchoi/Desktop/heatmap_PearsonCorr.tab # Scatter plot, Spearman r $ plotCorrelation -in /Users/jchoi/Desktop/CorrMatrix.npz --corMethod spearman --skipZeros \ --whatToPlot scatterplot --colorMap RdYlBu -o /Users/jchoi/Desktop/heatmap_SpearmanCorr.svg \ --outFileCorMatrix /Users/jchoi/Desktop/heatmap_SpearmanCorr.tab
Citations
- Ramírez, F., Dündar, F., Diehl, S., Grüning, B. A., & Manke, T. (2014). deepTools: a flexible platform for exploring deep-sequencing data. Nucleic acids research, 42(W1), W187-W191. DOI
- Ramírez, F., Ryan, D. P., Grüning, B., Bhardwaj, V., Kilpert, F., Richter, A. S., ... & Manke, T. (2016). deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic acids research, 44(Web Server issue), W160. DOI