samtools(1)                  Bioinformatics tools                  samtools(1)

NAME
       samtools - Utilities for the Sequence Alignment/Map (SAM) format

SYNOPSIS
       samtools view -bt ref_list.txt -o aln.bam aln.sam.gz

       samtools sort aln.bam aln.sorted

       samtools index aln.sorted.bam

       samtools idxstats aln.sorted.bam

       samtools view aln.sorted.bam chr2:20,100,000-20,200,000

       samtools merge out.bam in1.bam in2.bam in3.bam

       samtools faidx ref.fasta

       samtools pileup -f ref.fasta aln.sorted.bam

       samtools mpileup -f ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam

       samtools tview aln.sorted.bam ref.fasta

DESCRIPTION
       Samtools is a set of utilities that manipulate alignments in the BAM
       format. It imports from and exports to the SAM (Sequence Alignment/Map)
       format, does sorting, merging and indexing, and allows reads in any
       region to be retrieved quickly.

       Samtools is designed to work on a stream. It regards an input file `-'
       as the standard input (stdin) and an output file `-' as the standard
       output (stdout). Several commands can thus be combined with Unix pipes.
       Samtools always outputs warning and error messages to the standard
       error output (stderr).

       Samtools is also able to open a BAM (not SAM) file on a remote FTP or
       HTTP server if the BAM file name starts with `ftp://' or `http://'.
       Samtools checks the current working directory for the index file and
       downloads the index if it is absent. Samtools does not retrieve the
       entire alignment file unless it is asked to do so.

COMMANDS AND OPTIONS
       view      samtools view [-bhuHS] [-t in.refList] [-o output]
                 [-f reqFlag] [-F skipFlag] [-q minMapQ] [-l library]
                 [-r readGroup] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]

                 Extract and print all alignments, or a subset of them, in
                 SAM or BAM format. If no region is specified, all the
                 alignments will be printed; otherwise only alignments
                 overlapping the specified regions will be output. An
                 alignment may be output multiple times if it overlaps
                 several regions. A region can be given, for example, in
                 one of the following formats: `chr2' (the whole chr2),
                 `chr2:1000000' (the region starting from 1,000,000bp) or
                 `chr2:1,000,000-2,000,000' (the region between 1,000,000
                 and 2,000,000bp, including the end points). The
                 coordinate is 1-based.
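
                 The region syntax above can be illustrated with a small
                 Python sketch. The helper name parse_region and its return
                 shape are assumptions made for this example; this is not
                 how samtools itself parses regions.

```python
import re

# Hypothetical helper (not part of samtools): split a region string of the
# forms `chr2', `chr2:1000000' or `chr2:1,000,000-2,000,000' into
# (name, start, end). Missing parts are returned as None.
_REGION = re.compile(
    r"^(?P<name>[^:]+)(?::(?P<start>[\d,]+)(?:-(?P<end>[\d,]+))?)?$"
)

def parse_region(text):
    match = _REGION.match(text)
    if match is None:
        raise ValueError(f"unrecognised region: {text!r}")
    start = match.group("start")
    end = match.group("end")
    return (
        match.group("name"),
        int(start.replace(",", "")) if start else None,  # commas are cosmetic
        int(end.replace(",", "")) if end else None,
    )

print(parse_region("chr2"))
print(parse_region("chr2:1000000"))
print(parse_region("chr2:1,000,000-2,000,000"))
```

                 In this sketch both coordinates are kept as given, so the
                 start and end are 1-based and inclusive, matching the
                 description above.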

                 OPTIONS:

                 -b      Output in the BAM format.
                 -u      Output uncompressed BAM. This option saves time
                         spent on compression/decompression and is thus
                         preferred when the output is piped to another
                         samtools command.
                 -h      Include the header in the output.
                 -H      Output the header only.
                 -S      Input is in SAM. If @SQ header lines are absent,
                         the `-t' option is required.
                 -t FILE This file is TAB-delimited. Each line must contain
                         the reference name and the length of the
                         reference, one line for each distinct reference;
                         additional fields are ignored. This file also
                         defines the order of the reference sequences in
                         sorting. If you run `samtools faidx <ref.fa>', the
                         resultant index file <ref.fa>.fai can be used as
                         this <in.refList> file.
                 -o FILE Output file [stdout]
                 -f INT  Only output alignments with all bits in INT
                         present in the FLAG field. INT can be given in hex
                         (for example 0x10).
                 -F INT  Skip alignments with bits present in INT [0]
                 -q INT  Skip alignments with MAPQ smaller than INT [0]
                 -l STR  Only output reads in library STR [null]
                 -r STR  Only output reads in read group STR [null]
                 -R FILE Output reads in read groups listed in FILE [null]
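
                 As a minimal sketch of how the -f, -F and -q tests
                 described above combine, the Python function below applies
                 them to a FLAG and MAPQ value. The function and parameter
                 names are invented for illustration, and the reading of -F
                 as "skip if any listed bit is set" is an assumption based
                 on the wording of the option.

```python
def keep_alignment(flag, mapq, req_flag=0, skip_flag=0, min_mapq=0):
    """Return True if an alignment passes -f/-F/-q style filtering.

    Hypothetical helper: req_flag, skip_flag and min_mapq stand in for the
    values given to -f, -F and -q respectively.
    """
    # -f: every bit in req_flag must be present in the FLAG field.
    if (flag & req_flag) != req_flag:
        return False
    # -F: assumed to skip the alignment if any bit in skip_flag is present.
    if flag & skip_flag:
        return False
    # -q: skip alignments whose MAPQ is smaller than the threshold.
    return mapq >= min_mapq

# FLAG 99 = 0x1 + 0x2 + 0x20 + 0x40 (paired, proper pair, mate reverse, first)
print(keep_alignment(99, 60, req_flag=0x2))
print(keep_alignment(99, 60, skip_flag=0x40))
print(keep_alignment(99, 20, min_mapq=30))
```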

       tview     samtools tview <in.sorted.bam> [ref.fasta]

                 Text alignment viewer (based on the ncurses library). In
                 the viewer, press `?' for help and press `g' to view the
                 alignment starting from a region given in a format such as
                 `chr10:10,000,000', or `=10,000,000' when viewing the same
                 reference sequence.

       pileup    samtools pileup [-f in.ref.fasta] [-t in.ref_list]
                 [-l in.site_list] [-iscgS2] [-T theta] [-N nHap]
                 [-r pairDiffRate] <in.bam>|<in.sam>

                 Print the alignment in the pileup format. In the pileup
                 format, each line represents a genomic position and
                 consists of the chromosome name, coordinate, reference
                 base, read bases, read qualities and alignment mapping
                 qualities. Information on match, mismatch, indel, strand,
                 mapping quality and the start and end of a read is all
                 encoded in the read base column. In this column, a dot
                 stands for a match to the reference base on the forward
                 strand, a comma for a match on the reverse strand, `ACGTN'
                 for a mismatch on the forward strand and `acgtn' for a
                 mismatch on the reverse strand. A pattern
                 `\+[0-9]+[ACGTNacgtn]+' indicates an insertion between
                 this reference position and the next reference position.
                 The length of the insertion is given by the integer in the
                 pattern, followed by the inserted sequence. Similarly, a
                 pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from
                 the reference. The deleted bases will be presented as `*'
                 in the following lines. Also in the read base column, a
                 symbol `^' marks the start of a read segment, which is a
                 contiguous subsequence on the read separated by `N/S/H'
                 CIGAR operations. The ASCII code of the character
                 following `^' minus 33 gives the mapping quality. A symbol
                 `$' marks the end of a read segment.
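
                 The encoding of the read base column can be sketched as a
                 small tokenizer. The Python below follows only the rules
                 stated above; the function name and the token labels are
                 assumptions made for this example, and the input string is
                 made up.

```python
def tokenize_read_bases(column):
    """Split a pileup read-base column into (kind, value) tokens.

    Hypothetical helper that follows only the encoding described in the
    text above; it is not taken from the samtools source.
    """
    tokens = []
    i = 0
    while i < len(column):
        c = column[i]
        if c == "^":
            # The next character encodes the mapping quality (ASCII - 33).
            tokens.append(("start", ord(column[i + 1]) - 33))
            i += 2
        elif c == "$":
            tokens.append(("end", None))
            i += 1
        elif c in "+-":
            # An integer length follows, then exactly that many bases.
            j = i + 1
            while j < len(column) and column[j].isdigit():
                j += 1
            length = int(column[i + 1:j])
            kind = "insertion" if c == "+" else "deletion"
            tokens.append((kind, column[j:j + length]))
            i = j + length
        elif c in ".,":
            tokens.append(("match", "forward" if c == "." else "reverse"))
            i += 1
        elif c == "*":
            tokens.append(("deleted", None))
            i += 1
        else:
            # Upper case: forward-strand mismatch; lower case: reverse.
            tokens.append(("mismatch", c))
            i += 1
    return tokens

for token in tokenize_read_bases("^I.,A+2ag-1c*$"):
    print(token)
```

                 Reading the indel length first and then consuming exactly
                 that many characters avoids mistaking a mismatch that
                 directly follows an indel for part of the indel sequence.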

                 If option -c is applied, the consensus base, Phred-scaled
                 consensus quality, SNP quality (i.e. the Phred-scaled
                 probability of the consensus being identical to the
                 reference) and root mean square (RMS) mapping quality of
                 the reads covering the site will be inserted between the
                 `reference base' and the `read bases' columns. An indel
                 occupies an additional line. Each indel line consists of
                 the chromosome name, coordinate, a star, the genotype,
                 consensus quality, SNP quality, RMS mapping quality,
                 # covering reads, the first allele, the second allele,
                 # reads supporting the first allele, # reads supporting
                 the second allele and # reads containing indels different
                 from the top two alleles.

                 The position of indels is offset by -1.
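
                 The qualities above are Phred-scaled. Assuming the usual
                 Phred definition Q = -10 * log10(p), the conversion in
                 both directions can be sketched in Python as follows. The
                 function names are invented for this example.

```python
import math

def phred_to_prob(q):
    """Convert a Phred-scaled quality to a probability: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def prob_to_phred(p):
    """Convert a probability (0 < p <= 1) to a Phred-scaled quality."""
    return -10 * math.log10(p)

# Under this definition a SNP quality of 30 corresponds to a probability of
# about 0.001 that the consensus is identical to the reference.
print(phred_to_prob(30))
print(prob_to_phred(0.01))
```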

                 OPTIONS:

                 -s        Print the mapping quality as the last column.
                           This option makes the output easier to parse,
                           although this format is not space efficient.
                 -S        The input file is in SAM.
                 -i        Only output pileup lines containing indels.
                 -f FILE   The reference sequence in the FASTA format.
                           Index file FILE.fai will be created if absent.
                 -M INT    Cap mapping quality at INT [60]
                 -m INT    Filter reads with flag containing bits in INT
                 -d INT    Use only the first INT reads in the pileup for
                           indel calling, to speed it up. Zero for
                           unlimited. [0]
                 -t FILE   List of reference names and sequence lengths, in
                           the format described for the `-t' option of the
                           view command. If this option is present,
                           samtools assumes the input alignment is in SAM
                           format; otherwise it assumes BAM format.
                 -l FILE   List of sites at which pileup is output. This
                           file is space delimited. The first two columns
                           are required to be chromosome and 1-based
                           coordinate. Additional columns are ignored. It
                           is recommended to use option -s together with
                           -l, as in the default format the mapping quality
                           may not be known.
                 -c        Call the consensus sequence using SOAPsnp
                           consensus model. Options -T, -N, -I and -r are
                           only effective when -c or -g is in use.
                 -g        Generate genotype likelihood in the binary GLFv3
                           format. This option suppresses -c, -i and -s.
                 -T FLOAT  The theta parameter (error dependency
                           coefficient) in the maq consensus calling model
                           [0.85]
                 -N INT    Number of haplotypes in the sample (>=2) [2]
                 -r FLOAT  Expected fraction of differences between a pair
                           of haplotypes
                 -I INT    Phred-scaled probability of an indel in
                           sequencing/prep.

       mpileup   samtools mpileup [-r reg] [-f in.fa] in.bam [in2.bam [...]]

                 Generate pileup for multiple BAM files. Consensus calling is

                 OPTIONS:

                 -r STR    Only generate pileup in region STR [all sites]
                 -f FILE   The reference file [null]

       reheader  samtools reheader <in.header.sam> <in.bam>

                 Replace the header in in.bam with the header in
                 in.header.sam. This command is much faster than replacing
                 the header with a BAM->SAM->BAM conversion.

       sort      samtools sort [-no] [-m maxMem] <in.bam> <out.prefix>

                 Sort alignments by leftmost coordinates. File
                 <out.prefix>.bam will be created. This command may also
                 create temporary files <out.prefix>.%d.bam when the whole
                 alignment cannot be fitted into memory (controlled by
                 option -m).

                 OPTIONS:

                 -o      Output the final alignment to the standard output.
                 -n      Sort by read names rather than by chromosomal
                         coordinates.
                 -m INT  Approximately the maximum required memory.

       merge     samtools merge [-h inh.sam] [-nr] <out.bam> <in1.bam>
                 <in2.bam> [...]

                 Merge multiple sorted alignments. The header reference
                 lists of all the input BAM files, and the @SQ headers of
                 inh.sam, if any, must all refer to the same set of
                 reference sequences. The header reference list and (unless
                 overridden by -h) `@' headers of in1.bam will be copied to
                 out.bam, and the headers of other files will be ignored.

                 OPTIONS:

                 -h FILE Use the lines of FILE as `@' headers to be copied
                         to out.bam, replacing any header lines that would
                         otherwise be copied from in1.bam. (FILE is
                         actually in SAM format, though any alignment
                         records it may contain are ignored.)
                 -r      Attach an RG tag to each alignment. The tag value
                         is inferred from file names.
                 -n      The input alignments are sorted by read names
                         rather than by chromosomal coordinates.

       index     samtools index <aln.bam>

                 Index sorted alignment for fast random access. Index file
                 <aln.bam>.bai will be created.

       idxstats  samtools idxstats <aln.bam>

                 Retrieve and print stats in the index file. The output is
                 TAB delimited with each line consisting of reference
                 sequence name, sequence length, # mapped reads and
                 # unmapped reads.
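
                 Because the idxstats output is a plain four-column TAB
                 delimited table, it is easy to post-process. The Python
                 below is a minimal sketch; the names IdxStat and
                 parse_idxstats are assumptions for this example and the
                 sample values are made up.

```python
from typing import NamedTuple

class IdxStat(NamedTuple):
    name: str      # reference sequence name
    length: int    # sequence length
    mapped: int    # number of mapped reads
    unmapped: int  # number of unmapped reads

def parse_idxstats(text):
    """Parse idxstats-style TAB-delimited text into IdxStat rows."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, length, mapped, unmapped = line.split("\t")[:4]
        rows.append(IdxStat(name, int(length), int(mapped), int(unmapped)))
    return rows

# Made-up sample text in the documented column order.
sample = "chr1\t1000\t25\t1\nchr2\t500\t10\t0\n"
rows = parse_idxstats(sample)
print(rows)
print(sum(row.mapped for row in rows))
```

                 In practice the text would come from the standard output
                 of `samtools idxstats aln.sorted.bam'.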

       faidx     samtools faidx <ref.fasta> [region1 [...]]

                 Index reference sequence in the FASTA format or extract
                 subsequence from indexed reference sequence. If no region
                 is specified, faidx will index the file and create
                 <ref.fasta>.fai on the disk. If regions are specified, the
                 subsequences will be retrieved and printed to stdout in
                 the FASTA format. The input file can be compressed in the
                 RAZF format.

       fixmate   samtools fixmate <in.nameSrt.bam> <out.bam>

                 Fill in mate coordinates, ISIZE and mate related flags
                 from a name-sorted alignment.

       rmdup     samtools rmdup [-sS] <input.srt.bam> <out.bam>

                 Remove potential PCR duplicates: if multiple read pairs
                 have identical external coordinates, only retain the pair
                 with the highest mapping quality. In the paired-end mode,
                 this command ONLY works with FR orientation and requires
                 that ISIZE be set correctly. It does not work for unpaired
                 reads (e.g. two ends mapped to different chromosomes or
                 orphan reads).
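
                 The rule stated above (keep the pair with the highest
                 mapping quality among pairs sharing the same external
                 coordinates) can be sketched in Python as follows. This is
                 only an illustration of the rule, not the rmdup
                 implementation; the dictionary keys `chrom', `start',
                 `end', `mapq' and `name' are hypothetical.

```python
def dedup_pairs(pairs):
    """Keep one pair per (chrom, start, end), preferring the highest MAPQ.

    `pairs' is an iterable of dicts with the hypothetical keys 'chrom',
    'start', 'end' (outer coordinates of the pair) and 'mapq'.
    """
    best = {}
    for pair in pairs:
        key = (pair["chrom"], pair["start"], pair["end"])
        kept = best.get(key)
        # Replace the stored pair only if this one has a higher MAPQ.
        if kept is None or pair["mapq"] > kept["mapq"]:
            best[key] = pair
    return list(best.values())

# Made-up example: pairs a and b share external coordinates.
pairs = [
    {"name": "a", "chrom": "chr1", "start": 100, "end": 350, "mapq": 20},
    {"name": "b", "chrom": "chr1", "start": 100, "end": 350, "mapq": 60},
    {"name": "c", "chrom": "chr1", "start": 120, "end": 400, "mapq": 30},
]
print([p["name"] for p in dedup_pairs(pairs)])
```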

                 OPTIONS:

                 -s      Remove duplicates for single-end reads. By
                         default, the command works for paired-end reads
                         only.
                 -S      Treat paired-end reads as single-end reads.

       calmd     samtools calmd [-eubS] <aln.bam> <ref.fasta>

                 Generate the MD tag. If the MD tag is already present,
                 this command will give a warning if the MD tag generated
                 is different from the existing tag. Output SAM by default.

                 OPTIONS:

                 -e      Convert a read base to `=' if it is identical to
                         the aligned reference base. The indel caller does
                         not support `=' bases at the moment.
                 -u      Output uncompressed BAM
                 -b      Output compressed BAM
                 -S      The input is SAM with header lines

SAM FORMAT
       SAM is TAB-delimited. Apart from the header lines, which start with
       the `@' symbol, each alignment line consists of:

       +----+-------+----------------------------------------------------------+
       |Col | Field | Description                                              |
       +----+-------+----------------------------------------------------------+
       | 1  | QNAME | Query (pair) NAME                                        |
       | 2  | FLAG  | bitwise FLAG                                             |
       | 3  | RNAME | Reference sequence NAME                                  |
       | 4  | POS   | 1-based leftmost POSition/coordinate of clipped sequence |
       | 5  | MAPQ  | MAPping Quality (Phred-scaled)                           |
       | 6  | CIGAR | extended CIGAR string                                    |
       | 7  | MRNM  | Mate Reference sequence NaMe (`=' if same as RNAME)      |
       | 8  | MPOS  | 1-based Mate POSition                                    |
       | 9  | ISIZE | Inferred insert SIZE                                     |
       |10  | SEQ   | query SEQuence on the same strand as the reference       |
       |11  | QUAL  | query QUALity (ASCII-33 gives the Phred base quality)    |
       |12  | OPT   | variable OPTional fields in the format TAG:VTYPE:VALUE   |
       +----+-------+----------------------------------------------------------+
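
       As a rough illustration of the column layout above, the following
       Python sketch splits one alignment line into named fields and
       decodes QUAL by subtracting 33 from each character code. The
       function name, the extra BASEQ key and the sample line are
       assumptions made for this example; treating a QUAL of `*' as
       "qualities not stored" follows the SAM specification rather than
       this page.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}

def parse_sam_line(line):
    """Parse one TAB-delimited alignment line (not a header line)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 TAB-delimited columns")
    record = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    # Optional fields have the form TAG:VTYPE:VALUE; VALUE may contain ':'.
    record["OPT"] = [tuple(field.split(":", 2)) for field in cols[11:]]
    # Phred base qualities are the ASCII codes minus 33.
    qual = record["QUAL"]
    record["BASEQ"] = [] if qual == "*" else [ord(ch) - 33 for ch in qual]
    return record

# Made-up alignment line for illustration only.
line = "read1\t99\tchr2\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\tNM:i:0"
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["BASEQ"], record["OPT"])
```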

       Each bit in the FLAG field is defined as:

       +-------+-----+--------------------------------------------------+
       | Flag  | Chr | Description                                      |
       +-------+-----+--------------------------------------------------+
       |0x0001 |  p  | the read is paired in sequencing                 |
       |0x0002 |  P  | the read is mapped in a proper pair              |
       |0x0004 |  u  | the query sequence itself is unmapped            |
       |0x0008 |  U  | the mate is unmapped                             |
       |0x0010 |  r  | strand of the query (1 for reverse)              |
       |0x0020 |  R  | strand of the mate                               |
       |0x0040 |  1  | the read is the first read in a pair             |
       |0x0080 |  2  | the read is the second read in a pair            |
       |0x0100 |  s  | the alignment is not primary                     |
       |0x0200 |  f  | the read fails platform/vendor quality checks    |
       |0x0400 |  d  | the read is either a PCR or an optical duplicate |
       +-------+-----+--------------------------------------------------+
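
       The table above can be used to decode a numeric FLAG by hand or in
       code. The Python sketch below maps each set bit to the character in
       the `Chr' column; the names FLAG_BITS and flag_to_chars are invented
       for this example.

```python
# (bit, character) pairs copied from the FLAG table above.
FLAG_BITS = [
    (0x0001, "p"), (0x0002, "P"), (0x0004, "u"), (0x0008, "U"),
    (0x0010, "r"), (0x0020, "R"), (0x0040, "1"), (0x0080, "2"),
    (0x0100, "s"), (0x0200, "f"), (0x0400, "d"),
]

def flag_to_chars(flag):
    """Return the `Chr' characters for every bit set in flag."""
    return "".join(ch for bit, ch in FLAG_BITS if flag & bit)

# 99 = 0x0001 + 0x0002 + 0x0020 + 0x0040
print(flag_to_chars(99))
```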

LIMITATIONS
       o Unaligned words used in bam_import.c, bam_endian.h, bam.c and

       o In merging, the input files are required to have the same number of
         reference sequences. The requirement can be relaxed. In addition,
         merging does not reconstruct the header dictionaries automatically.
         End users have to provide the correct header. Picard is better at
         merging.

       o Samtools paired-end rmdup does not work for unpaired reads (e.g.
         orphan reads or ends mapped to different chromosomes). If this is a
         concern, please use Picard's MarkDuplicates, which handles these
         cases correctly, although it is a little slower.

AUTHOR
       Heng Li from the Sanger Institute wrote the C version of samtools. Bob
       Handsaker from the Broad Institute implemented the BGZF library and Jue
       Ruan from Beijing Genomics Institute wrote the RAZF library. Various
       people in the 1000 Genomes Project contributed to the SAM format
       specification.

SEE ALSO
       Samtools website: <http://samtools.sourceforge.net>

samtools-0.1.8                   11 July 2010                      samtools(1)