© 2022 Janghyun Choi
Docer CC BY-NC-SA 4.0

File Conversion and Management with SAMtools

The reads mapped onto the reference genome generate substantial data files in SAM format, ranging from 20 to 30 GB per sample, which could result in prolonged further analysis durations. SAMtools that is coded in python can be converted SAM into BAM to fast access the sequence data and analyze the sequence depth. This protocol was created based on SAMtools version 1.12 running on a system equipped with an Intel 10th generation i9-10910 processor and 48GB of memory. The test environment includes Python version 3.8.5, SciPy version 1.6.2, NumPy version 1.20.1, and pySam version 0.16.0.1 under macOS 12.4 OS environment.** The tool offers a variety of functions to analyze sequencing data, but this part will cover only the most commonly used features.

Installation samtools

Platform Original GitHub release Build status

To install HISAT2 via homebrew, type the following commands:

$ brew install Samtools

Stats - Check the mapping rate

SAMtools stats collects statistics from SAM/BAM files and outputs in a text format. The output can be visualized graphically using multiqc. Use the following command to perform the stats function of samtools:

$ samtools stats -@ <int> <input sam file.sam> > <output file.stats>

In these commands,

  • -@ <int>: specifies core numbers used in this work.
  • <input.sam>: specifies input file as a SAM format.
  • > <output.stats>: specifies output file as a stats format. > symbol must appear before the <output file.txt> syntax.

Example code (Results are not printed)

$ samtools stats -@ 10 /Users/jchoi/Desktop/SAM/Unt_UTX.sam > /Users/jchoi/Desktop/SAM/Unt_UTX.sam.stats 
  • The resulting stats file can be opened text editors and it consists of alignment information:
      # This file was produced by samtools stats (1.12+htslib-1.12) and can be plotted using plot-bamstats
      # This file contains statistics for all reads.
      # The command line was:  stats -@ 10 /Users/jchoi/Desktop/SAM/Unt_UTX.sam
      # CHK, Checksum	[2]Read Names	[3]Sequences	[4]Qualities
      # CHK, CRC32 of reads which passed filtering followed by addition (32bit overflow)
      CHK	6b2cf2a8	0ec11f5c	a187d603
      # Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
      SN	raw total sequences:	45639340	# excluding supplementary and secondary reads
      SN	filtered sequences:	0
      SN	sequences:	45639340
      SN	is sorted:	0
      SN	1st fragments:	22819670
      SN	last fragments:	22819670
      SN	reads mapped:	40370690
      SN	reads mapped and paired:	40243620	# paired-end technology bit set + both mates mapped
      SN	reads unmapped:	5268650
      SN	reads properly paired:	36659322	# proper-pair bit set
      SN	reads paired:	45639340	# paired-end technology bit set
      SN	reads duplicated:	0	# PCR or optical duplicate bit set
      SN	reads MQ0:	227872	# mapped and MQ=0
      SN	reads QC failed:	0
      SN	non-primary alignments:	0
      SN	supplementary alignments:	0
      SN	total length:	4562450957	# ignores clipping
      SN	total first fragment length:	2281908536	# ignores clipping
      SN	total last fragment length:	2280542421	# ignores clipping
      SN	bases mapped:	4038619347	# ignores clipping
      SN	bases mapped (cigar):	3942644569	# more accurate
      ...................
    

Export stats report via multiqc

The resulting stats files under the presence of the same folder can be compared and analyzed collectively with multiqc, use the following command:

$ multiqc ./

View - Convert to BAM file or View SAM or BAM file

SAMtools view, with no options or regions specified, prints all alignments in the specified input alignment file (in SAM, BAM, or CRAM format) to standard output in SAM format (with no header). Ultimately, the view function can convert text-format SAM files into binary BAM files and vice versa, enabling faster and more detailed manipulation of sequencing reads. Use the following command to perform the view function of samtools:

# For explore the strucure of them.
$ samtools view <SAM or BAM files> | head -<int> ## Usage is the same as the 'gzip' and 'cat' commands.
 
 # For converting BAM file.
$ samtools view -b -q <int> -f <FLAG> -F <FLAG> -@ <int> <input.sam> > <output.bam>

In these commands,

ParameterDescription
-b optionspecifies output file format as a SAM.
-q <int> optionskip alignments with MAPQ smaller than .
**Critical option,** A MAPQ > 30 is considered the highest quality.
-f <FLAG> optiononly output alignments with all bits set in <FLAG>.
-F <FLAG> optionDo not include output alignments with any bits set in <FLAG>.
FLAG field refers as bellow table.
-f and -F options cannot be used at the same time (mutually exclusive).
-@ <int> optionspecifies thread numbers <int>.
<input.sam>specifies input sam file.
> <output.bam>specifies output bam file. ‘>’ symbol must appear before the <output.bam> syntax.

There are no parameters when viewing or converting a specific region, only the following format would be used: chr:start-end.

For more information such as global parameter, refers to samtools’ guidelines.

FLAG field table

BitFlagDescription
00x1read paired
10x2read mapped in proper pair
20x4read unmapped
30x8mate unmapped
40x10read reverse strand
50x20mate reverse strand
60x40first in pair
70x80second in pair
80x100not primary alignment
90x200read fails platform/vendor quality checks
100x400read is PCR or optical duplicate
110x800supplementary alignment

Example code (Results are not printed)

$ samtools view -b -@ 20 shCon_Unt.sam > ~/Desktop/BAM/shCon_Unt.bam
$ samtools view -b -@ 20 shCon_H2O2.sam > ~/Desktop/BAM/shCon_H2O2.bam
$ samtools view -b -@ 20 shCon_NP.sam > ~/Desktop/BAM/shCon_NP.bam

# Convert to BAM file, which has a MAPQ score > 40 and marked FLAG field corresponded to 0x1 and 0x2.
$ samtools view -b -q 40 -f 0x1,0x2 -@ 20 shControl.sam > shControl.bam 

# Convert to BAM file from SAM file, which has a MAPQ score > 30 and excepts FLAG field corresponded to 0x8.
$ samtools view -b -q 30 -F 0x8 -@ 20 shNFIB.sam > shNFIB.bam 

Rmdup - Remove duplicative reads

SAMtools rmdup removes potential PCR duplicates (only retain the pair with highest mapping quality). The official manual describes “This command is obsolete. Use markdup instead.”, so it probably not operate in lastly versions. Use the following command to perform the rmdup function of samtools:

$ samtools rmdup -s <input.bam> <output.bam>

In these commands,

  • -s/-S option: -s is responsible for pair-end mode, whereas -S is for single-end mode.
  • <input.bam>: specifies input data.
  • <output.bam>: specifies output data.

the rmdupfunction does not provide multi-threads processing.

Example code and output

$ samtools rmdup -s shCon_Unt.bam rmdup.shCon_Unt.bam
$ samtools rmdup -s shCon_H2O2.bam rmdup.shCon_H2O2.bam
$ samtools rmdup -s shCon_NP.bam rmdup.shCon_NP.bam
  • When rmdup finishes running, it prints messages summarizing what happened.
      After operation of rmdup, terminal appears to as follows,
          [bam_rmdupse_core] 590 / 46146253 = 0.0000 in library
    

Sort - Sort reads by position or name

SAMtools sort function sorts BAM files by leftmost coordinates of references (chromosome position, no option) or by read name (-noption). In general, RNA-seq for feature counting (htseq-count) uses the -n option to sort by read name, while ChIP-seq for visualization or a BED file conversion does not use the -n option to sort by position. Use the following command to perform the sort function of samtools:

$ samtools sort -n -@ <int> <input.bam> -o <output.bam>

In these commands,

  • -n option: sort by name, if want to sort by position, remove -n.
  • -@ option: specifies thread numbers <int>.
  • <input.bam>: specifies input file.
  • -o <output.bam>: specifies output file.

Example code and output

# with -n option
$ samtools sort -n -@ 20 rmdup.shCon_Unt.bam -o sort.shCon_Unt.bam
$ samtools sort -n -@ 20 rmdup.shCon_H2O2.bam -o sort.shCon_H2O2.bam
$ samtools sort -n -@ 20 rmdup.shCon_NP.bam -o sort.shCon_NP.bam

# with no option
$ samtools sort -@ 20 /home/jh/Desktop/Bam/Rmdup_shControl.bam -o /home/jh/Desktop/Bam/Sort_shControl.bam
$ samtools sort -@ 20 /home/jh/Desktop/Bam/Rmdup_shNFIB.bam -o /home/jh/Desktop/Bam/Sort_shNFIB.bam
  • When sort finishes running, it prints messages summarizing what happened.
      After operation of sort, terminal appears to as follows,
          [bam_sort_core] merging from 20 files and 1 in-memory blocks...
    

Index - for ChIP-seq analysis

SAMtools index function Indexes a coordinate-sorted BGZIP-compressed SAM, BAM, or CRAM file for fast random access. This function requires a BAM file sorted by position. Use the following command to perform the index function of samtools:

$ samtools index <input.bam>

In these commands,

  • <input.bam>: specifies input file.

This command has no parameters that specifically define the output. The resulting bai file is gennerated in the input folder.

Preferably, the sorted BAM file and the bai file should exist in the same directory for further analysis.

Example code (Results are not printed)

$ samtools index /home/jh/Desktop/Bam/Sort_input.bam
$ samtools index /home/jh/Desktop/Bam/Sort_shControl.bam

Coverage - Check the sequencing depth

SAMtools coverage function computes the coverage at each position or region and draws an ASCII-art histogram or tabulated text. The coverage is defined as the percentage of positions within each bin with at least one base aligned against it. This function requires a BAM file sorted by position and a bai file. Use the following command to perform the coverage function of samtools:

$ samtools coverage -A -r <region> <input.bam>

In these commands,

  • -A option: Show only ASCII characters in histogram using colon and fullstop for full and half height characters.
  • -r <region>: Show specified region. Format: chr:start-end.
  • <input.bam>: specifies input file.

Example code and output

# Coverage across all chromosome without ASCII histogram
$ samtools coverage sort_MA_coverage.bam

# Coverage with ASCII histogram on chr1
$ samtools coverage -A -r chr1 sort_MA_coverage.bam
  • When coverage with no option finishes running, it prints messages summarizing what happened.
      #rname	numreads	covbases	coverage	meandepth	meanbaseq	meanmapq
      1	1582614	6662248	11.1823	2.66237	36.3	53.9
      2	1847826	7551970	12.6625	3.10498	36.3	56.5
      3	2100124	8084882	12.9093	3.36035	36.3	54.9
      4	1195390	9339824	11.9598	1.5328	36.3	54.2
      5	2054896	8817714	12.1623	2.84217	36.3	45
      6	1548488	6936500	11.509	2.57578	36.3	53
      7	1837930	8069184	10.8628	2.48051	36.3	54.7
      8	1513826	6812219	12.5444	2.79332	36.3	54.4
      9	1194217	6236630	11.0461	2.1184	36.3	53.5
      10	1347613	5541625	12.2006	2.972	36.3	52.1
      11	1352685	5446330	11.9739	2.97988	36.3	55.7
      12	1125210	5623379	11.4336	2.29375	36.3	52.8
      13	1427718	6164305	11.8122	2.74218	36.3	54.2
      14	1581018	5558226	10.5549	3.00721	36.3	51.7
      15	1194286	5929981	12.3437	2.49079	36.3	55.8
      16	1721170	6981746	12.6329	3.12217	36.3	56.2
      17	1239184	6120904	11.4493	2.32304	36.3	54.6
      18	1010858	5168646	10.1299	1.98516	36.3	51.8
      19	2085292	6031878	12.4498	4.31301	36.3	57.3
      20	1839414	6841016	12.3928	3.33999	36.3	57.2
      21	1744680	6044533	13.1592	3.80718	36.3	57.4
      22	1028090	5238163	13.3855	2.63309	36.3	54.9
      23	1436699	5811855	12.5734	3.11603	36.3	51.5
      24	911126	4395211	10.4219	2.1645	36.3	55.8
      25	1158715	4763344	12.7016	3.09503	36.3	55.1
      MT	3248278	16596	100	19679.7	36.3	60
    
  • When coverage with -A option finishes running, it prints messages summarizing what happened.
      1 (59.58Mbp)
      >  25.52% |.                                :                                                                       :                       :          | Number of reads: 1582614
      >  22.69% |:                                :                                 :                                     : :  .                  :          |     (288307 filtered)
      >  19.85% |:                                :                               . :                                     : : .:                 .:       .  | Covered bases:   6.66Mbp
      >  17.02% |::   .                     :     ::      .  .                 .  : :                                     : : ::        :.     :.::      .:::| Percent covered: 11.18%
      >  14.18% |::   :        :     .      :.    ::      :  : .       .       :  : ::                        .:         .::: ::     :. ::. : :::::      ::::| Mean coverage:   2.66x
      >  11.34% |::  ::        :  .:.:  .   ::    ::   .  :  :.:       :     :::  : ::    ..   .              ::  :.    ::::: ::     ::.:::.: :::::      ::::| Mean baseQ:      36.3
      >   8.51% |::  ::      : :  ::::.::.: ::    ::   :  :  :::  ..: ::    ::::  ::::: : ::   :    ..  .     ::  ::    :::::.::     :::::::: ::::: :  . ::::| Mean mapQ:       53.9
      >   5.67% |::  ::      :.:: ::::::::: ::   .:::  :. : .:::  ::: ::... ::::  :::::.::::.  :.:: ::.::  :. :::.::   :::::::::::   :::::::: ::::: :. ::::::|
      >   2.84% |:::.:: . .. ::::.:::::::::.:::..::::.::: : ::::..::::::::: ::::: :::::::::::: :::::::::: ::: ::::::   :::::::::::...::::::::::::::.:: ::::::| Histo bin width: 425.6Kbp
      >   0.00% |:::::::::::.:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::.::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::| Histo max bin:   28.359%
              1       4.26M     8.51M     12.77M    17.02M    21.28M    25.53M    29.79M    34.04M    38.30M    42.56M    46.81M    51.07M    55.32M     59.58M
    

Merge

merge function combines multiple sorted alignment files, producing a single sorted output file that contains all the input records and maintains the existing sort order. Use the following command to perform the merge function of samtools:

$ samtools merge -@ <int> <output.bam> <input-1.bam>...<input-N.bam>

In these commands,

  • -@ option: specifies thread numbers <int>.
  • <output.bam>: specifies output file.
  • <input.bam>...<input-N.bam>: specifies input file.

Example code (Results are not printed)

$ samtools merge -@ 20 merged_H3K4me3.bam sort.H3K4me3_rep1.bam sort.H3K4me3_rep2.bam sort.H3K4me3_rep3.bam
# After merge, performs sort and index.
$ samtools sort -@ 20 merged_H3K4me3.bam -o sort.merged_H3K4me3.bam
$ samtools index sort.merged_H3K4me3.bam

Citation

Samtools
  1. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... & 1000 Genome Project Data Processing Subgroup. (2009). The sequence alignment/map format and SAMtools. bioinformatics, 25(16), 2078-2079. DOI