Alignment with Reference Genome using HISAT2

A key first step in RNA-seq analysis is aligning short reads to a reference genome. Several mappers have been developed for different sample types and experimental conditions. HISAT2 uses an algorithm specialized for transcriptome analysis and can map whole-genome, transcriptome, and exome sequencing reads to a reference genome quickly and accurately. HISAT2 runs on any computer with Linux or macOS and requires Python version > 2.6. Because alignment takes a long time and consumes substantial computational resources, the minimum hardware specification is higher than for most other bioinformatics tools (threads > 4, memory > 16 GB). This protocol was created based on HISAT2 version 2.2.1 running on a system equipped with an Intel 10th generation i9-10910 processor and 48 GB of memory. The test environment includes Python version 3.8.5, SciPy version 1.6.2, NumPy version 1.20.1, and pySam, running under macOS 12.4.

Installing HISAT2

To install HISAT2 via Anaconda, use one of the following commands:

$ conda install -c bioconda hisat2
# OR
$ conda install -c bioconda/label/cf201901 hisat2
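
After installation, it can be useful to confirm that the executables are reachable before starting a long run. The following is a minimal sketch, assuming the conda environment is active and that the package places `hisat2` and `hisat2-build` on the PATH; the helper name `missing_tools` is hypothetical and not part of HISAT2.

```python
import shutil


def missing_tools(tools=("hisat2", "hisat2-build")):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("hisat2 and hisat2-build are available.")
```

If a tool is reported as missing, activating the conda environment in which HISAT2 was installed is usually the first thing to check.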

Establish the Genome Builder

  1. Download the desired index files from the official HISAT2 site.
  2. Uncompress the archive and move it to the desired folder.

Prepare the Genome Builder with `hisat2-build` if no suitable index file is available

hisat2-build builds a HISAT2 index from a set of DNA sequences. It outputs a set of 8 files with suffixes .1.ht2, .2.ht2, .3.ht2, .4.ht2, .5.ht2, .6.ht2, .7.ht2, and .8.ht2. Use the following command to prepare the genome builder with hisat2-build:

$ hisat2-build <genome_sequence.fa> <output index base name>

The second argument is the base name shared by all index files (for example, /genome/builder/path/genome yields /genome/builder/path/genome.1.ht2 and so on), not just a folder. For the genome sequence file, download the toplevel FASTA (.fa) file (e.g. species.version.dna.toplevel.fa) provided by Ensembl.
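
Whether the index was downloaded or built locally, a quick check that all eight files are present can save a failed run. The sketch below assumes the standard `.N.ht2` suffixes listed above; the function name `missing_index_files` and the example base path are hypothetical.

```python
from pathlib import Path


def missing_index_files(index_base, n_files=8):
    """List the expected index files (<base>.1.ht2 ... <base>.8.ht2) that are absent."""
    expected = [Path(f"{index_base}.{i}.ht2") for i in range(1, n_files + 1)]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Hypothetical base name; replace it with the location of your own index.
    base = "/genome/builder/path/genome"
    missing = missing_index_files(base)
    print("index complete" if not missing else f"missing: {missing}")
```

The same base name that passes this check is the value to give to `-x` when running hisat2.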

  • Here is an example command to perform hisat2-build on the reference file:
      $ hisat2-build /Users/jchoi/Desktop/Build/Danio_rerio.GRCz11.dna.toplevel.fa \
  • When hisat2-build finishes, it prints messages about the genome builder files it processed, which look like this:

Running HISAT2

Use the following command to perform mapping to the genome with hisat2:

$ hisat2 -x <GenomeBuilder> -1 <forward> -2 <reverse> -S <output.sam> <OptionalParameters>

In this command,

  • -x <GenomeBuilder>: Specifies the genome index. Because the index files are named genome.X.ht2, give the path up to the shared base name, without the .X.ht2 suffix, as follows: /genome/builder/path/genome
  • -1, -2: Specify the forward (-1) and reverse (-2) input files. Gzip-compressed FASTQ (.fq.gz) files are supported.
  • -S <output.sam>: Specifies the name of the output SAM file.
  • <OptionalParameters>: A wide range of options is available, including those listed below; see the HISAT2 manual for details.

    -5/--trim5 <int>, -3/--trim3 <int>: Trim <int> bases from the 5' (left) or 3' (right) end of each read before alignment (default: 0).
    --mp MX,MN: Sets the maximum (MX) and minimum (MN) mismatch penalties, both integers.
    --sp MX,MN: Sets the maximum (MX) and minimum (MN) penalties for soft-clipping per base, both integers.
    --np <int>: Sets the penalty for positions where the read, the reference, or both contain an ambiguous character such as N.
    --dta-cufflinks: Reports alignments tailored specifically for Cufflinks. In addition to what HISAT2 does with the --dta option, HISAT2 looks for novel splice sites with three signals (GT/AG, GC/AG, AT/AC), but all user-provided splice sites are used irrespective of their signals.
    -k <int>: Searches for at most <int> distinct, primary alignments for each read. Primary alignments are alignments whose score is equal to or higher than that of any other alignment (default: 5).
    -p <int>: Sets the number of CPU threads that HISAT2 uses for multi-threaded tasks.
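
When many samples are processed, the flags described above can be assembled programmatically instead of being typed by hand. The following is a minimal sketch, not part of HISAT2 itself: the function name `build_hisat2_command`, its parameter names, and the sample file names are hypothetical, and only the flags covered in this section (-x, -1, -2, -S, -p, -k, --dta-cufflinks) are used.

```python
import shlex


def build_hisat2_command(index_base, forward, reverse, out_sam,
                         threads=None, max_alignments=None, dta_cufflinks=False):
    """Build a paired-end hisat2 call as an argument list for subprocess.run."""
    cmd = ["hisat2", "-x", index_base, "-1", forward, "-2", reverse, "-S", out_sam]
    if threads is not None:
        cmd += ["-p", str(threads)]          # number of CPU threads
    if dta_cufflinks:
        cmd.append("--dta-cufflinks")        # Cufflinks-tailored alignments
    if max_alignments is not None:
        cmd += ["-k", str(max_alignments)]   # max distinct primary alignments
    return cmd


if __name__ == "__main__":
    # Hypothetical paths; replace them with your own index and FASTQ files.
    cmd = build_hisat2_command(
        "/genome/builder/path/genome",
        "sample_1.fq.gz", "sample_2.fq.gz", "sample.sam",
        threads=10, max_alignments=5, dta_cufflinks=True,
    )
    print(shlex.join(cmd))
    # To actually run the alignment: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.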

Example Code

Here is an example command to perform alignment with the human hg19 genome on trimmed fastq files:

$ hisat2 -x /Users/jchoi/Desktop/RNA-seq/GenomeIndex/hg19_HiSat2Builder/genome \
    -1 /Users/jchoi/Desktop/Trim/Trim_shCon_H2O2_1.fq.gz \
    -2 /Users/jchoi/Desktop/Trim/Trim_shCon_H2O2_2.fq.gz \
    -S /Users/jchoi/Desktop/SAM/shCon_H2O2.sam -p 10 --dta-cufflinks -k 5


When HISAT2 finishes running, it prints messages summarizing what happened.

24403217 reads; of these:
  24403217 (100.00%) were paired; of these:
    3213764 (13.17%) aligned concordantly 0 times
    18006379 (73.79%) aligned concordantly exactly 1 time
    3183074 (13.04%) aligned concordantly >1 times
    3213764 pairs aligned concordantly 0 times; of these:
      316476 (9.85%) aligned discordantly 1 time
    2897288 pairs aligned 0 times concordantly or discordantly; of these:
      5794576 mates make up the pairs; of these:
        3288372 (56.75%) aligned 0 times
        1992820 (34.39%) aligned exactly 1 time
        513384 (8.86%) aligned >1 times
93.26% overall alignment rate
  • Comparable alignment statistics can also be produced from the SAM file with samtools (e.g. samtools flagstat).
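
The overall alignment rate in the last line can be recomputed from the counts above it, which is a useful sanity check when collecting summaries from many samples. The sketch below assumes the rate is counted per mate: each concordantly or discordantly aligned pair contributes two mates, individually aligned mates are added, and the sum is divided by twice the number of pairs. This interpretation reproduces the figure in the example summary; the parsing helper and its names are hypothetical.

```python
import re

SUMMARY = """\
24403217 reads; of these:
  24403217 (100.00%) were paired; of these:
    3213764 (13.17%) aligned concordantly 0 times
    18006379 (73.79%) aligned concordantly exactly 1 time
    3183074 (13.04%) aligned concordantly >1 times
    3213764 pairs aligned concordantly 0 times; of these:
      316476 (9.85%) aligned discordantly 1 time
    2897288 pairs aligned 0 times concordantly or discordantly; of these:
      5794576 mates make up the pairs; of these:
        3288372 (56.75%) aligned 0 times
        1992820 (34.39%) aligned exactly 1 time
        513384 (8.86%) aligned >1 times
93.26% overall alignment rate
"""

# Leading count, optional "(xx.xx%)", then the label without "; of these:".
LINE = re.compile(r"\s*(\d+)\s+(?:\([\d.]+%\)\s+)?(.+?)(?:; of these:)?$")


def parse_summary(text):
    """Map each summary label to its leading integer count."""
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def overall_alignment_rate(text):
    """Recompute the per-mate overall alignment rate (in percent)."""
    c = parse_summary(text)
    aligned_pairs = (c["aligned concordantly exactly 1 time"]
                     + c["aligned concordantly >1 times"]
                     + c["aligned discordantly 1 time"])
    aligned_mates = 2 * aligned_pairs + c["aligned exactly 1 time"] + c["aligned >1 times"]
    return 100.0 * aligned_mates / (2 * c["were paired"])


if __name__ == "__main__":
    print(f"{overall_alignment_rate(SUMMARY):.2f}% overall alignment rate")
```

For the example summary, 45518062 of 48806434 mates are aligned, which rounds to the 93.26% that HISAT2 reports.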

Subsequent steps use the resulting SAM file.


