.TH samtools 1 "23 January 2009" "samtools-0.1.2" "Bioinformatics tools"
.SH NAME
.PP
samtools - Utilities for the Sequence Alignment/Map (SAM) format
.SH SYNOPSIS
.PP
samtools import ref_list.txt aln.sam.gz aln.bam
.PP
samtools sort aln.bam aln.sorted
.PP
samtools index aln.sorted.bam
.PP
samtools view aln.sorted.bam chr2:20,100,000-20,200,000
.PP
samtools merge out.bam in1.bam in2.bam in3.bam
.PP
samtools faidx ref.fasta
.PP
samtools pileup -f ref.fasta aln.sorted.bam
.PP
samtools tview aln.sorted.bam ref.fasta

.SH DESCRIPTION
.PP
Samtools is a set of utilities that manipulate alignments in the BAM
format. It imports from and exports to the SAM (Sequence
Alignment/Map) format, sorts, merges and indexes alignments, and
allows reads in any region to be retrieved swiftly.

.SH COMMANDS AND OPTIONS
.TP 10
.B import
samtools import <in.ref_list> <in.sam> <out.bam>

Convert alignments in SAM format to BAM format. File
.I <in.ref_list>
is TAB-delimited. Each line must contain the reference name and the
length of the reference, one line for each distinct reference;
additional fields are ignored. This file also defines the order of the
reference sequences in sorting. File
.I <in.sam>
can optionally be compressed by zlib or gzip. A single hyphen is
recognized as stdin or stdout, depending on the context. If you run
`samtools faidx <ref.fa>', the resulting index file
.I <ref.fa>.fai
can be used as this
.I <in.ref_list>
file.
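.IP
As an illustration of the
.I <in.ref_list>
layout described above, the following Python sketch reads such a file
into an ordered list. It is not part of samtools; the function name
parse_ref_list and the sample names and lengths are hypothetical, and
skipping blank lines is an assumption of the sketch.
.nf
```python
def parse_ref_list(lines):
    """Return [(name, length), ...] in file order from TAB-delimited lines."""
    refs = []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # assumption of this sketch: blank lines are skipped
        fields = line.split("\t")
        # Only the first two fields are used; additional fields are ignored.
        refs.append((fields[0], int(fields[1])))
    return refs


# Hypothetical reference names and lengths, with one extra field on line 1.
example = ["chr1\t1000\textra\n", "chr2\t500\n"]
print(parse_ref_list(example))
```
.fi
.IP
The list keeps the order of the lines because, as noted above, that
order also defines the order of the reference sequences in sorting.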

.TP
.B sort
samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>

Sort alignments by leftmost coordinates. File
.I <out.prefix>.bam
will be created. This command may also create temporary files
.I <out.prefix>.%d.bam
when the whole alignment cannot fit into memory (controlled by
option -m).

.B OPTIONS:
.RS
.TP 8
.B -n
Sort by read names rather than by chromosomal coordinates
.TP
.B -m INT
Approximate maximum memory required. [500000000]
.RE

.TP
.B merge
samtools merge [-n] <out.bam> <in1.bam> <in2.bam> [...]

Merge multiple sorted alignments. The header of
.I <in1.bam>
will be copied to
.I <out.bam>
and the headers of the other files will be ignored.

.B OPTIONS:
.RS
.TP 8
.B -n
The input alignments are sorted by read names rather than by chromosomal
coordinates
.RE

.TP
.B index
samtools index <aln.bam>

Index a sorted alignment for fast random access. Index file
.I <aln.bam>.bai
will be created.

.TP
.B view
samtools view [-b] <in.bam> [region1 [...]]

Extract and print all alignments, or a subset of them, in SAM or BAM
format. If no region is specified, all the alignments will be printed;
otherwise only alignments overlapping the specified regions will be
output. An alignment may be output multiple times if it overlaps several
regions. A region can be specified, for example, in one of the following
formats: `chr2', `chr2:1000000' or `chr2:1,000,000-2,000,000'.

.B OPTIONS:
.RS
.TP 8
.B -b
Output in the BAM format.
.RE
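.IP
The three region formats shown above can be taken apart with a few
lines of Python. This is only a sketch and not the parser samtools
uses: the function name parse_region is hypothetical, it assumes that
reference names contain no colon, and it reports a missing start or end
as None without saying how samtools interprets the omission.
.nf
```python
def parse_region(region):
    """Split 'name', 'name:start' or 'name:start-end' into (name, start, end).

    Missing parts are returned as None; commas inside numbers are dropped.
    """
    name, sep, span = region.partition(":")
    if not sep:
        return (name, None, None)
    start_text, dash, end_text = span.replace(",", "").partition("-")
    start = int(start_text)
    end = int(end_text) if dash else None
    return (name, start, end)


for text in ("chr2", "chr2:1000000", "chr2:1,000,000-2,000,000"):
    print(text, "->", parse_region(text))
```
.fi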

.TP
.B faidx
samtools faidx <ref.fasta> [region1 [...]]

Index a reference sequence in the FASTA format or extract subsequences
from an indexed reference sequence. If no region is specified,
.B faidx
will index the file and create
.I <ref.fasta>.fai
on the disk. If regions are specified, the subsequences will be
retrieved and printed to stdout in the FASTA format. The input file can
be compressed in the
.B RAZF
format.

.TP
.B pileup
samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l in.site_list]
[-iscg] [-T theta] [-N nHap] [-r pairDiffRate] <in.alignment>

Print the alignment in the pileup format. In the pileup format, each
line represents a genomic position, consisting of chromosome name,
coordinate, reference base, read bases, read qualities and alignment
mapping qualities. Information on match, mismatch, indel, strand,
mapping quality and start and end of a read are all encoded in the read
base column. In this column, a dot stands for a match to the reference
base on the forward strand, a comma for a match on the reverse strand,
`ACGTN' for a mismatch on the forward strand and `acgtn' for a mismatch
on the reverse strand. A pattern `\\+[0-9]+[ACGTNacgtn]+' indicates
there is an insertion between this reference position and the next
reference position. The length of the insertion is given by the integer
in the pattern, followed by the inserted sequence. Similarly, a pattern
`-[0-9]+[ACGTNacgtn]+' represents a deletion from the reference. Also in
the read base column, a symbol `^' marks the start of a read segment,
which is a contiguous subsequence on the read separated by `N/S/H' CIGAR
operations. The ASCII code of the character following `^' minus 33 gives
the mapping quality. A symbol `$' marks the end of a read segment.
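.IP
The encoding of the read base column described above can be sketched
as a small Python tokenizer. This is an illustration, not code from
samtools: the function name tokenize_read_bases and the token labels
are hypothetical, and any character not covered by the description is
passed through as `other'.
.nf
```python
def tokenize_read_bases(column):
    """Split a pileup read-base column into (kind, value) tokens."""
    tokens = []
    i = 0
    while i < len(column):
        ch = column[i]
        if ch == "^":
            # The next character encodes the mapping quality as ASCII - 33.
            tokens.append(("start", ord(column[i + 1]) - 33))
            i += 2
        elif ch == "$":
            tokens.append(("end", None))
            i += 1
        elif ch in "+-":
            # Read the length digits, then that many indel bases.
            j = i + 1
            while j < len(column) and column[j].isdigit():
                j += 1
            length = int(column[i + 1:j])
            kind = "insertion" if ch == "+" else "deletion"
            tokens.append((kind, column[j:j + length]))
            i = j + length
        elif ch == ".":
            tokens.append(("match_forward", None))
            i += 1
        elif ch == ",":
            tokens.append(("match_reverse", None))
            i += 1
        elif ch in "ACGTN":
            tokens.append(("mismatch_forward", ch))
            i += 1
        elif ch in "acgtn":
            tokens.append(("mismatch_reverse", ch))
            i += 1
        else:
            tokens.append(("other", ch))
            i += 1
    return tokens


# A made-up column: read start (mapping quality 37), forward match, read
# end, reverse match, forward mismatch, 2-base insertion, reverse
# mismatch, 1-base deletion.
for token in tokenize_read_bases("^F.$,A+2AGt-1c"):
    print(token)
```
.fi
.IP
The indel length must be read before the indel sequence, because the
bases of an indel would otherwise be mistaken for mismatches.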

If option
.B -c
is applied, the consensus base, consensus quality, SNP quality and
maximum mapping quality of the reads covering the site will be inserted
between the `reference base' and the `read bases' columns. An indel
occupies an additional line. Each indel line consists of chromosome
name, coordinate, a star, the two highest-scoring ins/del sequences, the
number of reads strongly supporting the first indel, the number of reads
strongly supporting the second indel, the number of reads that confer
little information on distinguishing indels and the number of reads that
contain indels different from the top two.

.B OPTIONS:
.RS

.TP 10
.B -s
Print the mapping quality as the last column. This option makes the
output easier to parse, although this format is not space efficient.

.TP
.B -i
Only output pileup lines containing indels.

.TP
.B -f FILE
The reference sequence in the FASTA format. Index file
.I FILE.fai
will be created if
absent.

.TP
.B -t FILE
List of reference names and sequence lengths, in the format described
for the
.B import
command. If this option is present, samtools assumes the input
.I <in.alignment>
is in SAM format; otherwise it assumes BAM format.

.TP
.B -l FILE
List of sites at which pileup is output. This file is space
delimited. The first two columns are required to be chromosome and
1-based coordinate. Additional columns are ignored. It is
recommended to use option
.B -s
together with
.B -l
because in the default format the mapping quality may not be known.

.TP
.B -c
Call the consensus sequence using the MAQ consensus model. Options
.B -T,
.B -N
and
.B -r
are only effective when
.B -c
is in use.

.TP
.B -g
Generate genotype likelihood in the binary GLFv2 format. This option
suppresses -c, -i and -s.

.TP
.B -T FLOAT
The theta parameter (error dependency coefficient) in the MAQ consensus
calling model [0.85]

.TP
.B -N INT
Number of haplotypes in the sample (>=2) [2]

.TP
.B -r FLOAT
Expected fraction of differences between a pair of haplotypes [0.001]

.RE

.TP
.B tview
samtools tview <in.sorted.bam> [ref.fasta]

Text alignment viewer (based on the ncurses library). In the viewer,
press `?' for help and press `g' to view the alignment starting from a
region given in a format like `chr10:10,000,000'. Note that if the
region shown on the screen contains no mapped reads, a blank screen will
be seen. This is a known issue and will be improved later.

.RE

.TP
.B fixmate
samtools fixmate <in.nameSrt.bam> <out.bam>

Fill in mate coordinates, ISIZE and mate-related flags from a
name-sorted alignment.

.TP
.B rmdup
samtools rmdup <input.srt.bam> <out.bam>

Remove potential PCR duplicates: if multiple read pairs have identical
external coordinates, only retain the pair with the highest mapping
quality. This command
.B ONLY
works with FR orientation and requires that ISIZE be correctly set.
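.IP
The retention rule stated above can be sketched in Python as follows.
This is not the rmdup implementation. It assumes each read pair has
already been reduced to a dict with the hypothetical keys `chrom',
`left', `right' and `mapq', where `left' and `right' stand for the
external coordinates, and it keeps the first pair seen when mapping
qualities tie, which is a choice of the sketch only.
.nf
```python
def remove_duplicate_pairs(pairs):
    """Keep one pair per (chrom, left, right): the one with highest mapq.

    Each pair is a dict with the hypothetical keys 'chrom', 'left',
    'right' and 'mapq'.
    """
    best = {}
    for pair in pairs:
        key = (pair["chrom"], pair["left"], pair["right"])
        # Replace the stored pair only when this one maps strictly better.
        if key not in best or pair["mapq"] > best[key]["mapq"]:
            best[key] = pair
    return list(best.values())


pairs = [
    {"chrom": "chr1", "left": 100, "right": 300, "mapq": 20},
    {"chrom": "chr1", "left": 100, "right": 300, "mapq": 60},
    {"chrom": "chr1", "left": 150, "right": 350, "mapq": 10},
]
for kept in remove_duplicate_pairs(pairs):
    print(kept)
```
.fi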

.RE


.SH SAM FORMAT

SAM is TAB-delimited. Apart from the header lines, which start
with the `@' symbol, each alignment line consists of:

.TS
center box;
cb | cb | cb
n | l | l .
Col	Field	Description
_
1	QNAME	Query (pair) NAME
2	FLAG	bitwise FLAG
3	RNAME	Reference sequence NAME
4	POS	1-based leftmost POSition/coordinate of clipped sequence
5	MAPQ	MAPping Quality (Phred-scaled)
6	CIGAR	extended CIGAR string
7	MRNM	Mate Reference sequence NaMe (`=' if same as RNAME)
8	MPOS	1-based Mate POSition
9	ISIZE	Inferred insert SIZE
10	SEQ	query SEQuence on the same strand as the reference
11	QUAL	query QUALity (ASCII-33 gives the Phred base quality)
12	OPT	variable OPTional fields in the format TAG:VTYPE:VALUE
.TE
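.PP
As an illustration of the column layout in the table above, the
following Python sketch splits one alignment line into named fields and
decodes the QUAL string. The names SamRecord, parse_sam_line and
phred_qualities are hypothetical, the sample record is made up, and the
sketch assumes the numeric columns are written as decimal integers.
.nf
```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar mrnm mpos isize seq qual opt",
)


def parse_sam_line(line):
    """Parse one non-header SAM alignment line into a SamRecord."""
    f = line.rstrip().split("\t")
    return SamRecord(
        qname=f[0], flag=int(f[1]), rname=f[2], pos=int(f[3]),
        mapq=int(f[4]), cigar=f[5], mrnm=f[6], mpos=int(f[7]),
        isize=int(f[8]), seq=f[9], qual=f[10],
        opt=f[11:],  # zero or more TAG:VTYPE:VALUE strings
    )


def phred_qualities(qual):
    """Decode QUAL: the ASCII code minus 33 gives each Phred quality."""
    return [ord(c) - 33 for c in qual]


# A made-up record, used only to exercise the parser.
fields = ["read1", "99", "chr2", "100", "30", "4M", "=", "200", "104",
          "ACGT", "IIII", "NM:i:0"]
record = parse_sam_line("\t".join(fields))
print(record.qname, record.pos, record.opt)
print(phred_qualities(record.qual))
```
.fi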

.PP
Each bit in the FLAG field is defined as:

.TS
center box;
cb | cb
l | l .
Flag	Description
_
0x0001	the read is paired in sequencing
0x0002	the read is mapped in a proper pair
0x0004	the query sequence itself is unmapped
0x0008	the mate is unmapped
0x0010	strand of the query (1 for reverse)
0x0020	strand of the mate
0x0040	the read is the first read in a pair
0x0080	the read is the second read in a pair
0x0100	the alignment is not primary
0x0200	the read fails platform/vendor quality checks
0x0400	the read is either a PCR or an optical duplicate
.TE
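.PP
Because FLAG is a bitwise field, a value is interpreted by testing each
bit of the table above in turn. The Python sketch below does this; the
names FLAG_BITS and describe_flag and the short descriptions are
hypothetical, and the wording for the two strand bits assumes that a
set bit means the reverse strand, as stated for 0x0010.
.nf
```python
FLAG_BITS = [
    (0x0001, "paired in sequencing"),
    (0x0002, "mapped in a proper pair"),
    (0x0004, "query unmapped"),
    (0x0008, "mate unmapped"),
    (0x0010, "query on reverse strand"),
    (0x0020, "mate on reverse strand"),
    (0x0040, "first read in a pair"),
    (0x0080, "second read in a pair"),
    (0x0100, "alignment not primary"),
    (0x0200, "fails platform/vendor quality checks"),
    (0x0400, "PCR or optical duplicate"),
]


def describe_flag(flag):
    """Return the descriptions of all bits that are set in flag."""
    return [text for bit, text in FLAG_BITS if flag & bit]


# 99 = 0x0001 + 0x0002 + 0x0020 + 0x0040
for text in describe_flag(99):
    print(text)
```
.fi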

.SH LIMITATIONS
.PP
.IP o 2
In general, more testing is needed to ensure there are no severe bugs.
.IP o 2
Reference sequence names and lengths are not acquired from the BAM/SAM header.
.IP o 2
CIGAR operations N and P may not be properly handled.
.IP o 2
There is a small known memory leak in the viewer.

.SH AUTHOR
.PP
Heng Li from the Sanger Institute is the author of the C version of
samtools. Bob Handsaker from the Broad Institute implemented the BGZF
library and Jue Ruan from Beijing Genomics Institute wrote the RAZF
library. Various people in the 1000 Genomes Project contributed to the
SAM format specification.

.SH SEE ALSO
.PP
Samtools website: http://samtools.sourceforge.net