1 samtools(1) Bioinformatics tools samtools(1)
6 samtools - Utilities for the Sequence Alignment/Map (SAM) format
9 samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
11 samtools sort aln.bam aln.sorted
13 samtools index aln.sorted.bam
15 samtools idxstats aln.sorted.bam
17 samtools view aln.sorted.bam chr2:20,100,000-20,200,000
19 samtools merge out.bam in1.bam in2.bam in3.bam
21 samtools faidx ref.fasta
23 samtools pileup -vcf ref.fasta aln.sorted.bam
25 samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
27 samtools tview aln.sorted.bam ref.fasta
31 Samtools is a set of utilities that manipulate alignments in the BAM
32 format. It imports from and exports to the SAM (Sequence Alignment/Map)
33 format, does sorting, merging and indexing, and allows to retrieve
34 reads in any regions swiftly.
36 Samtools is designed to work on a stream. It regards an input file `-'
37 as the standard input (stdin) and an output file `-' as the standard
38 output (stdout). Several commands can thus be combined with Unix pipes.
39 Samtools always output warning and error messages to the standard error
42 Samtools is also able to open a BAM (not SAM) file on a remote FTP or
43 HTTP server if the BAM file name starts with `ftp://' or `http://'.
44 Samtools checks the current working directory for the index file and
45 will download the index upon absence. Samtools does not retrieve the
46 entire alignment file unless it is asked to do so.
50 view samtools view [-bchuHS] [-t in.refList] [-o output] [-f
51 reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r read-
52 Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
54 Extract/print all or sub alignments in SAM or BAM format. If
55 no region is specified, all the alignments will be printed;
56 otherwise only alignments overlapping the specified regions
57 will be output. An alignment may be given multiple times if
58 it is overlapping several regions. A region can be presented,
59 for example, in the following format: `chr2' (the whole
60 chr2), `chr2:1000000' (region starting from 1,000,000bp) or
61 `chr2:1,000,000-2,000,000' (region between 1,000,000 and
62 2,000,000bp including the end points). The coordinate is
67 -b Output in the BAM format.
69 -f INT Only output alignments with all bits in INT present
70 in the FLAG field. INT can be in hex in the format of
73 -F INT Skip alignments with bits present in INT [0]
75 -h Include the header in the output.
77 -H Output the header only.
79 -l STR Only output reads in library STR [null]
81 -o FILE Output file [stdout]
83 -q INT Skip alignments with MAPQ smaller than INT [0]
85 -r STR Only output reads in read group STR [null]
87 -R FILE Output reads in read groups listed in FILE [null]
89 -S Input is in SAM. If @SQ header lines are absent, the
90 `-t' option is required.
92 -c Instead of printing the alignments, only count them
93 and print the total number. All filter options, such
94 as `-f', `-F' and `-q' , are taken into account.
96 -t FILE This file is TAB-delimited. Each line must contain
97 the reference name and the length of the reference,
98 one line for each distinct reference; additional
99 fields are ignored. This file also defines the order
100 of the reference sequences in sorting. If you run
101 `samtools faidx <ref.fa>', the resultant index file
102 <ref.fa>.fai can be used as this <in.ref_list> file.
104 -u Output uncompressed BAM. This option saves time spent
105 on compression/decomprssion and is thus preferred
106 when the output is piped to another samtools command.
109 tview samtools tview <in.sorted.bam> [ref.fasta]
111 Text alignment viewer (based on the ncurses library). In the
112 viewer, press `?' for help and press `g' to check the align-
113 ment start from a region in the format like
114 `chr10:10,000,000' or `=10,000,000' when viewing the same
118 mpileup samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l
119 list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam
122 Generate BCF or pileup for one or multiple BAM files. Align-
123 ment records are grouped by sample identifiers in @RG header
124 lines. If sample identifiers are absent, each input file is
125 regarded as one sample.
129 -B Disable probabilistic realignment for the computation
130 of base alignment quality (BAQ). BAQ is the Phred-
131 scaled probability of a read base being misaligned.
132 Applying this option greatly helps to reduce false
133 SNPs caused by misalignments.
135 -C INT Coefficient for downgrading mapping quality for reads
136 containing excessive mismatches. Given a read with a
137 phred-scaled probability q of being generated from
138 the mapped position, the new mapping quality is about
139 sqrt((INT-q)/INT)*INT. A zero value disables this
140 functionality; if enabled, the recommended value for
143 -e INT Phred-scaled gap extension sequencing error probabil-
144 ity. Reducing INT leads to longer indels. [20]
146 -f FILE The reference file [null]
148 -g Compute genotype likelihoods and output them in the
149 binary call format (BCF).
151 -h INT Coefficient for modeling homopolymer errors. Given an
152 l-long homopolymer run, the sequencing error of an
153 indel of size s is modeled as INT*s/l. [100]
155 -l FILE File containing a list of sites where pileup or BCF
158 -o INT Phred-scaled gap open sequencing error probability.
159 Reducing INT leads to more indel calls. [40]
161 -P STR Comma dilimited list of platforms (determined by @RG-
162 PL) from which indel candidates are obtained. It is
163 recommended to collect indel candidates from sequenc-
164 ing technologies that have low indel error rate such
167 -q INT Minimum mapping quality for an alignment to be used
170 -Q INT Minimum base quality for a base to be considered [13]
172 -r STR Only generate pileup in region STR [all sites]
174 -u Similar to -g except that the output is uncompressed
175 BCF, which is preferred for piping.
178 reheader samtools reheader <in.header.sam> <in.bam>
180 Replace the header in in.bam with the header in
181 in.header.sam. This command is much faster than replacing
182 the header with a BAM->SAM->BAM conversion.
185 sort samtools sort [-no] [-m maxMem] <in.bam> <out.prefix>
187 Sort alignments by leftmost coordinates. File <out.pre-
188 fix>.bam will be created. This command may also create tempo-
189 rary files <out.prefix>.%d.bam when the whole alignment can-
190 not be fitted into memory (controlled by option -m).
194 -o Output the final alignment to the standard output.
196 -n Sort by read names rather than by chromosomal coordi-
199 -m INT Approximately the maximum required memory.
203 merge samtools merge [-nur] [-h inh.sam] [-R reg] <out.bam>
204 <in1.bam> <in2.bam> [...]
206 Merge multiple sorted alignments. The header reference lists
207 of all the input BAM files, and the @SQ headers of inh.sam,
208 if any, must all refer to the same set of reference
209 sequences. The header reference list and (unless overridden
210 by -h) `@' headers of in1.bam will be copied to out.bam, and
211 the headers of other files will be ignored.
215 -h FILE Use the lines of FILE as `@' headers to be copied to
216 out.bam, replacing any header lines that would other-
217 wise be copied from in1.bam. (FILE is actually in
218 SAM format, though any alignment records it may con-
221 -R STR Merge files in the specified region indicated by STR
223 -r Attach an RG tag to each alignment. The tag value is
224 inferred from file names.
226 -n The input alignments are sorted by read names rather
227 than by chromosomal coordinates
229 -u Uncompressed BAM output
232 index samtools index <aln.bam>
234 Index sorted alignment for fast random access. Index file
235 <aln.bam>.bai will be created.
238 idxstats samtools idxstats <aln.bam>
240 Retrieve and print stats in the index file. The output is TAB
241 delimited with each line consisting of reference sequence
242 name, sequence length, # mapped reads and # unmapped reads.
245 faidx samtools faidx <ref.fasta> [region1 [...]]
247 Index reference sequence in the FASTA format or extract sub-
248 sequence from indexed reference sequence. If no region is
249 specified, faidx will index the file and create
250 <ref.fasta>.fai on the disk. If regions are speficified, the
251 subsequences will be retrieved and printed to stdout in the
252 FASTA format. The input file can be compressed in the RAZF
256 fixmate samtools fixmate <in.nameSrt.bam> <out.bam>
258 Fill in mate coordinates, ISIZE and mate related flags from a
259 name-sorted alignment.
262 rmdup samtools rmdup [-sS] <input.srt.bam> <out.bam>
264 Remove potential PCR duplicates: if multiple read pairs have
265 identical external coordinates, only retain the pair with
266 highest mapping quality. In the paired-end mode, this com-
267 mand ONLY works with FR orientation and requires ISIZE is
268 correctly set. It does not work for unpaired reads (e.g. two
269 ends mapped to different chromosomes or orphan reads).
273 -s Remove duplicate for single-end reads. By default,
274 the command works for paired-end reads only.
276 -S Treat paired-end reads and single-end reads.
279 calmd samtools calmd [-eubSr] [-C capQcoef] <aln.bam> <ref.fasta>
281 Generate the MD tag. If the MD tag is already present, this
282 command will give a warning if the MD tag generated is dif-
283 ferent from the existing tag. Output SAM by default.
287 -A When used jointly with -r this option overwrites the
288 original base quality.
290 -e Convert a the read base to = if it is identical to
291 the aligned reference base. Indel caller does not
292 support the = bases at the moment.
294 -u Output uncompressed BAM
296 -b Output compressed BAM
298 -S The input is SAM with header lines
300 -C INT Coefficient to cap mapping quality of poorly mapped
301 reads. See the pileup command for details. [0]
303 -r Compute the BQ tag without changing the base quality.
306 pileup samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
307 [-l in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N
308 nHap] [-r pairDiffRate] [-m mask] [-d maxIndelDepth] [-G
309 indelPrior] <in.bam>|<in.sam>
311 Print the alignment in the pileup format. In the pileup for-
312 mat, each line represents a genomic position, consisting of
313 chromosome name, coordinate, reference base, read bases, read
314 qualities and alignment mapping qualities. Information on
315 match, mismatch, indel, strand, mapping quality and start and
316 end of a read are all encoded at the read base column. At
317 this column, a dot stands for a match to the reference base
318 on the forward strand, a comma for a match on the reverse
319 strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
320 match on the forward strand and `acgtn' for a mismatch on the
321 reverse strand. A pattern `\+[0-9]+[ACGTNacgtn]+' indicates
322 there is an insertion between this reference position and the
323 next reference position. The length of the insertion is given
324 by the integer in the pattern, followed by the inserted
325 sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+' repre-
326 sents a deletion from the reference. The deleted bases will
327 be presented as `*' in the following lines. Also at the read
328 base column, a symbol `^' marks the start of a read. The
329 ASCII of the character following `^' minus 33 gives the map-
330 ping quality. A symbol `$' marks the end of a read segment.
332 If option -c is applied, the consensus base, Phred-scaled
333 consensus quality, SNP quality (i.e. the Phred-scaled proba-
334 bility of the consensus being identical to the reference) and
335 root mean square (RMS) mapping quality of the reads covering
336 the site will be inserted between the `reference base' and
337 the `read bases' columns. An indel occupies an additional
338 line. Each indel line consists of chromosome name, coordi-
339 nate, a star, the genotype, consensus quality, SNP quality,
340 RMS mapping quality, # covering reads, the first alllele, the
341 second allele, # reads supporting the first allele, # reads
342 supporting the second allele and # reads containing indels
343 different from the top two alleles.
345 NOTE: Since 0.1.10, the `pileup' command is deprecated by
350 -B Disable the BAQ computation. See the mpileup com-
353 -c Call the consensus sequence. Options -T, -N, -I and
354 -r are only effective when -c or -g is in use.
356 -C INT Coefficient for downgrading the mapping quality of
357 poorly mapped reads. See the mpileup command for
360 -d INT Use the first NUM reads in the pileup for indel
361 calling for speed up. Zero for unlimited. [1024]
363 -f FILE The reference sequence in the FASTA format. Index
364 file FILE.fai will be created if absent.
366 -g Generate genotype likelihood in the binary GLFv3
367 format. This option suppresses -c, -i and -s. This
368 option is deprecated by the mpileup command.
370 -i Only output pileup lines containing indels.
372 -I INT Phred probability of an indel in sequencing/prep.
375 -l FILE List of sites at which pileup is output. This file
376 is space delimited. The first two columns are
377 required to be chromosome and 1-based coordinate.
378 Additional columns are ignored. It is recommended
381 -m INT Filter reads with flag containing bits in INT
384 -M INT Cap mapping quality at INT [60]
386 -N INT Number of haplotypes in the sample (>=2) [2]
388 -r FLOAT Expected fraction of differences between a pair of
391 -s Print the mapping quality as the last column. This
392 option makes the output easier to parse, although
393 this format is not space efficient.
395 -S The input file is in SAM.
397 -t FILE List of reference names ane sequence lengths, in
398 the format described for the import command. If
399 this option is present, samtools assumes the input
400 <in.alignment> is in SAM format; otherwise it
401 assumes in BAM format. -s together with -l as in
402 the default format we may not know the mapping
405 -T FLOAT The theta parameter (error dependency coefficient)
406 in the maq consensus calling model [0.85]
410 SAM is TAB-delimited. Apart from the header lines, which are started
411 with the `@' symbol, each alignment line consists of:
414 +----+-------+----------------------------------------------------------+
415 |Col | Field | Description |
416 +----+-------+----------------------------------------------------------+
417 | 1 | QNAME | Query (pair) NAME |
418 | 2 | FLAG | bitwise FLAG |
419 | 3 | RNAME | Reference sequence NAME |
420 | 4 | POS | 1-based leftmost POSition/coordinate of clipped sequence |
421 | 5 | MAPQ | MAPping Quality (Phred-scaled) |
422 | 6 | CIAGR | extended CIGAR string |
423 | 7 | MRNM | Mate Reference sequence NaMe (`=' if same as RNAME) |
424 | 8 | MPOS | 1-based Mate POSistion |
425 | 9 | ISIZE | Inferred insert SIZE |
426 |10 | SEQ | query SEQuence on the same strand as the reference |
427 |11 | QUAL | query QUALity (ASCII-33 gives the Phred base quality) |
428 |12 | OPT | variable OPTional fields in the format TAG:VTYPE:VALUE |
429 +----+-------+----------------------------------------------------------+
431 Each bit in the FLAG field is defined as:
434 +-------+-----+--------------------------------------------------+
435 | Flag | Chr | Description |
436 +-------+-----+--------------------------------------------------+
437 |0x0001 | p | the read is paired in sequencing |
438 |0x0002 | P | the read is mapped in a proper pair |
439 |0x0004 | u | the query sequence itself is unmapped |
440 |0x0008 | U | the mate is unmapped |
441 |0x0010 | r | strand of the query (1 for reverse) |
442 |0x0020 | R | strand of the mate |
443 |0x0040 | 1 | the read is the first read in a pair |
444 |0x0080 | 2 | the read is the second read in a pair |
445 |0x0100 | s | the alignment is not primary |
446 |0x0200 | f | the read fails platform/vendor quality checks |
447 |0x0400 | d | the read is either a PCR or an optical duplicate |
448 +-------+-----+--------------------------------------------------+
451 o Import SAM to BAM when @SQ lines are present in the header:
453 samtools view -bS aln.sam > aln.bam
455 If @SQ lines are absent:
457 samtools faidx ref.fa
458 samtools view -bt ref.fa.fai aln.sam > aln.bam
460 where ref.fa.fai is generated automatically by the faidx command.
463 o Attach the RG tag while merging sorted alignments:
465 perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
466 mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
467 samtools merge -rh rg.txt merged.bam ga.bam 454.bam
469 The value in a RG tag is determined by the file name the read is com-
470 ing from. In this example, in the merged.bam, reads from ga.bam will
471 be attached RG:Z:ga, while reads from 454.bam will be attached
475 o Call SNPs and short indels for one diploid individual:
477 samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - >
479 bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 >
482 The -D option of varFilter controls the maximum read depth, which
483 should be adjusted to about twice the average read depth. One may
484 consider to add -C50 to mpileup if mapping quality is overestimated
485 for reads containing excessive mismatches. Applying this option usu-
486 ally helps BWA-short but may not other mappers.
489 o Call SNPs and short indels for multiple diploid individuals:
491 samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view
492 -bcvg - > var.raw.bcf
493 bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 >
496 Individuals are identified from the SM tags in the @RG header lines.
497 Individuals can be pooled in one alignment file; one individual can
498 also be separated into multiple files. The -P option specifies that
499 indel candidates should be collected only from read groups with the
500 @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads
501 sequenced by an indel-prone technology may affect the performance of
505 o Derive the allele frequency spectrum (AFS) on a list of sites from
506 multiple individuals:
508 samtools mpileup -Igf ref.fa *.bam > all.bcf
509 bcftools view -bl sites.list all.bcf > sites.bcf
510 bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
511 bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
512 bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs
515 where sites.list contains the list of sites with each line consisting
516 of the reference sequence name and position. The following bcftools
517 commands estimate AFS by EM.
520 o Dump BAQ applied alignment for other SNP callers:
522 samtools calmd -bAr aln.bam > aln.baq.bam
524 It adds and corrects the NM and MD tags at the same time. The calmd
525 command also comes with the -C option, the same as the one in pileup
526 and mpileup. Apply if it helps.
530 o Unaligned words used in bam_import.c, bam_endian.h, bam.c and
533 o In merging, the input files are required to have the same number of
534 reference sequences. The requirement can be relaxed. In addition,
535 merging does not reconstruct the header dictionaries automatically.
536 Endusers have to provide the correct header. Picard is better at
539 o Samtools paired-end rmdup does not work for unpaired reads (e.g.
540 orphan reads or ends mapped to different chromosomes). If this is a
541 concern, please use Picard's MarkDuplicate which correctly handles
542 these cases, although a little slower.
546 Heng Li from the Sanger Institute wrote the C version of samtools. Bob
547 Handsaker from the Broad Institute implemented the BGZF library and Jue
548 Ruan from Beijing Genomics Institute wrote the RAZF library. John Mar-
549 shall and Petr Danecek contribute to the source code and various people
550 from the 1000 Genomes Project have contributed to the SAM format speci-
555 Samtools website: <http://samtools.sourceforge.net>
559 samtools-0.1.12 2 December 2010 samtools(1)