samtools.1

   1 .TH samtools 1 "21 April 2011" "samtools-0.1.16" "Bioinformatics tools"
   2 .SH NAME
   3 .PP
   4 samtools - Utilities for the Sequence Alignment/Map (SAM) format
   5 .SH SYNOPSIS
   6 .PP
   7 samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
   8 .PP
   9 samtools sort aln.bam aln.sorted
  10 .PP
  11 samtools index aln.sorted.bam
  12 .PP
  13 samtools idxstats aln.sorted.bam
  14 .PP
  15 samtools view aln.sorted.bam chr2:20,100,000-20,200,000
  16 .PP
  17 samtools merge out.bam in1.bam in2.bam in3.bam
  18 .PP
  19 samtools faidx ref.fasta
  20 .PP
  21 samtools pileup -vcf ref.fasta aln.sorted.bam
  22 .PP
  23 samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
  24 .PP
  25 samtools tview aln.sorted.bam ref.fasta
  26
  27 .SH DESCRIPTION
  28 .PP
  29 Samtools is a set of utilities that manipulate alignments in the BAM
  30 format. It imports from and exports to the SAM (Sequence Alignment/Map)
  31 format, does sorting, merging and indexing, and allows to retrieve reads
  32 in any regions swiftly.
  33
  34 Samtools is designed to work on a stream. It regards an input file `-'
  35 as the standard input (stdin) and an output file `-' as the standard
  36 output (stdout). Several commands can thus be combined with Unix
  37 pipes. Samtools always output warning and error messages to the standard
  38 error output (stderr).
  39
  40 Samtools is also able to open a BAM (not SAM) file on a remote FTP or
  41 HTTP server if the BAM file name starts with `ftp://' or `http://'.
  42 Samtools checks the current working directory for the index file and
  43 will download the index upon absence. Samtools does not retrieve the
  44 entire alignment file unless it is asked to do so.
  45
  46 .SH COMMANDS AND OPTIONS
  47
  48 .TP 10
  49 .B view
  50 samtools view [-bchuHS] [-t in.refList] [-o output] [-f reqFlag] [-F
  51 skipFlag] [-q minMapQ] [-l library] [-r readGroup] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
  52
  53 Extract/print all or sub alignments in SAM or BAM format. If no region
  54 is specified, all the alignments will be printed; otherwise only
  55 alignments overlapping the specified regions will be output. An
  56 alignment may be given multiple times if it is overlapping several
  57 regions. A region can be presented, for example, in the following
  58 format: `chr2' (the whole chr2), `chr2:1000000' (region starting from
  59 1,000,000bp) or `chr2:1,000,000-2,000,000' (region between 1,000,000 and
  60 2,000,000bp including the end points). The coordinate is 1-based.
  61
  62 .B OPTIONS:
  63 .RS
  64 .TP 8
  65 .B -b
  66 Output in the BAM format.
  67 .TP
  68 .BI -f \ INT
  69 Only output alignments with all bits in INT present in the FLAG
  70 field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
  71 .TP
  72 .BI -F \ INT
  73 Skip alignments with bits present in INT [0]
  74 .TP
  75 .B -h
  76 Include the header in the output.
  77 .TP
  78 .B -H
  79 Output the header only.
  80 .TP
  81 .BI -l \ STR
  82 Only output reads in library STR [null]
  83 .TP
  84 .BI -o \ FILE
  85 Output file [stdout]
  86 .TP
  87 .BI -q \ INT
  88 Skip alignments with MAPQ smaller than INT [0]
  89 .TP
  90 .BI -r \ STR
  91 Only output reads in read group STR [null]
  92 .TP
  93 .BI -R \ FILE
  94 Output reads in read groups listed in
  95 .I FILE
  96 [null]
  97 .TP
  98 .B -S
  99 Input is in SAM. If @SQ header lines are absent, the
 100 .B `-t'
 101 option is required.
 102 .TP
 103 .B -c
 104 Instead of printing the alignments, only count them and print the
 105 total number. All filter options, such as
 106 .B `-f',
 107 .B `-F'
 108 and
 109 .B `-q'
 110 , are taken into account.
 111 .TP
 112 .BI -t \ FILE
 113 This file is TAB-delimited. Each line must contain the reference name
 114 and the length of the reference, one line for each distinct reference;
 115 additional fields are ignored. This file also defines the order of the
 116 reference sequences in sorting. If you run `samtools faidx <ref.fa>',
 117 the resultant index file
 118 .I <ref.fa>.fai
 119 can be used as this
 120 .I <in.ref_list>
 121 file.
 122 .TP
 123 .B -u
 124 Output uncompressed BAM. This option saves time spent on
 125 compression/decomprssion and is thus preferred when the output is piped
 126 to another samtools command.
 127 .RE
 128
 129 .TP
 130 .B tview
 131 samtools tview <in.sorted.bam> [ref.fasta]
 132
 133 Text alignment viewer (based on the ncurses library). In the viewer,
 134 press `?' for help and press `g' to check the alignment start from a
 135 region in the format like `chr10:10,000,000' or `=10,000,000' when
 136 viewing the same reference sequence.
 137
 138 .TP
 139 .B mpileup
 140 samtools mpileup [-EBug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]
 141
 142 Generate BCF or pileup for one or multiple BAM files. Alignment records
 143 are grouped by sample identifiers in @RG header lines. If sample
 144 identifiers are absent, each input file is regarded as one sample.
 145
 146 .B OPTIONS:
 147 .RS
 148 .TP 10
 149 .B -A
 150 Do not skip anomalous read pairs in variant calling.
 151 .TP
 152 .B -B
 153 Disable probabilistic realignment for the computation of base alignment
 154 quality (BAQ). BAQ is the Phred-scaled probability of a read base being
 155 misaligned. Applying this option greatly helps to reduce false SNPs
 156 caused by misalignments.
 157 .TP
 158 .BI -C \ INT
 159 Coefficient for downgrading mapping quality for reads containing
 160 excessive mismatches. Given a read with a phred-scaled probability q of
 161 being generated from the mapped position, the new mapping quality is
 162 about sqrt((INT-q)/INT)*INT. A zero value disables this
 163 functionality; if enabled, the recommended value for BWA is 50. [0]
 164 .TP
 165 .BI -d \ INT
 166 At a position, read maximally
 167 .I INT
 168 reads per input BAM. [250]
 169 .TP
 170 .B -D
 171 Output per-sample read depth
 172 .TP
 173 .BI -e \ INT
 174 Phred-scaled gap extension sequencing error probability. Reducing
 175 .I INT
 176 leads to longer indels. [20]
 177 .TP
 178 .B -E
 179 Extended BAQ computation. This option helps sensitivity especially for MNPs, but may hurt
 180 specificity a little bit.
 181 .TP
 182 .BI -f \ FILE
 183 The reference file [null]
 184 .TP
 185 .B -g
 186 Compute genotype likelihoods and output them in the binary call format (BCF).
 187 .TP
 188 .BI -h \ INT
 189 Coefficient for modeling homopolymer errors. Given an
 190 .IR l -long
 191 homopolymer
 192 run, the sequencing error of an indel of size
 193 .I s
 194 is modeled as
 195 .IR INT * s / l .
 196 [100]
 197 .TP
 198 .B -I
 199 Do not perform INDEL calling
 200 .TP
 201 .BI -l \ FILE
 202 File containing a list of sites where pileup or BCF is outputted [null]
 203 .TP
 204 .BI -L \ INT
 205 Skip INDEL calling if the average per-sample depth is above
 206 .IR INT .
 207 [250]
 208 .TP
 209 .BI -o \ INT
 210 Phred-scaled gap open sequencing error probability. Reducing
 211 .I INT
 212 leads to more indel calls. [40]
 213 .TP
 214 .BI -P \ STR
 215 Comma dilimited list of platforms (determined by
 216 .BR @RG-PL )
 217 from which indel candidates are obtained. It is recommended to collect
 218 indel candidates from sequencing technologies that have low indel error
 219 rate such as ILLUMINA. [all]
 220 .TP
 221 .BI -q \ INT
 222 Minimum mapping quality for an alignment to be used [0]
 223 .TP
 224 .BI -Q \ INT
 225 Minimum base quality for a base to be considered [13]
 226 .TP
 227 .BI -r \ STR
 228 Only generate pileup in region
 229 .I STR
 230 [all sites]
 231 .TP
 232 .B -S
 233 Output per-sample Phred-scaled strand bias P-value
 234 .TP
 235 .B -u
 236 Similar to
 237 .B -g
 238 except that the output is uncompressed BCF, which is preferred for piping.
 239 .RE
 240
 241 .TP
 242 .B reheader
 243 samtools reheader <in.header.sam> <in.bam>
 244
 245 Replace the header in
 246 .I in.bam
 247 with the header in
 248 .I in.header.sam.
 249 This command is much faster than replacing the header with a
 250 BAM->SAM->BAM conversion.
 251
 252 .TP
 253 .B cat
 254 samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [ ... ]
 255
 256 Concatenate BAMs. The sequence dictionary of each input BAM must be identical,
 257 although this command does not check this. This command uses a similar trick
 258 to
 259 .B reheader
 260 which enables fast BAM concatenation.
 261
 262 .TP
 263 .B sort
 264 samtools sort [-no] [-m maxMem] <in.bam> <out.prefix>
 265
 266 Sort alignments by leftmost coordinates. File
 267 .I <out.prefix>.bam
 268 will be created. This command may also create temporary files
 269 .I <out.prefix>.%d.bam
 270 when the whole alignment cannot be fitted into memory (controlled by
 271 option -m).
 272
 273 .B OPTIONS:
 274 .RS
 275 .TP 8
 276 .B -o
 277 Output the final alignment to the standard output.
 278 .TP
 279 .B -n
 280 Sort by read names rather than by chromosomal coordinates
 281 .TP
 282 .BI -m \ INT
 283 Approximately the maximum required memory. [500000000]
 284 .RE
 285
 286 .TP
 287 .B merge
 288 samtools merge [-nur1f] [-h inh.sam] [-R reg] <out.bam> <in1.bam> <in2.bam> [...]
 289
 290 Merge multiple sorted alignments.
 291 The header reference lists of all the input BAM files, and the @SQ headers of
 292 .IR inh.sam ,
 293 if any, must all refer to the same set of reference sequences.
 294 The header reference list and (unless overridden by
 295 .BR -h )
 296 `@' headers of
 297 .I in1.bam
 298 will be copied to
 299 .IR out.bam ,
 300 and the headers of other files will be ignored.
 301
 302 .B OPTIONS:
 303 .RS
 304 .TP 8
 305 .B -1
 306 Use zlib compression level 1 to comrpess the output
 307 .TP
 308 .B -f
 309 Force to overwrite the output file if present.
 310 .TP 8
 311 .BI -h \ FILE
 312 Use the lines of
 313 .I FILE
 314 as `@' headers to be copied to
 315 .IR out.bam ,
 316 replacing any header lines that would otherwise be copied from
 317 .IR in1.bam .
 318 .RI ( FILE
 319 is actually in SAM format, though any alignment records it may contain
 320 are ignored.)
 321 .TP
 322 .B -n
 323 The input alignments are sorted by read names rather than by chromosomal
 324 coordinates
 325 .TP
 326 .BI -R \ STR
 327 Merge files in the specified region indicated by
 328 .I STR
 329 [null]
 330 .TP
 331 .B -r
 332 Attach an RG tag to each alignment. The tag value is inferred from file names.
 333 .TP
 334 .B -u
 335 Uncompressed BAM output
 336 .RE
 337
 338 .TP
 339 .B index
 340 samtools index <aln.bam>
 341
 342 Index sorted alignment for fast random access. Index file
 343 .I <aln.bam>.bai
 344 will be created.
 345
 346 .TP
 347 .B idxstats
 348 samtools idxstats <aln.bam>
 349
 350 Retrieve and print stats in the index file. The output is TAB delimited
 351 with each line consisting of reference sequence name, sequence length, #
 352 mapped reads and # unmapped reads.
 353
 354 .TP
 355 .B faidx
 356 samtools faidx <ref.fasta> [region1 [...]]
 357
 358 Index reference sequence in the FASTA format or extract subsequence from
 359 indexed reference sequence. If no region is specified,
 360 .B faidx
 361 will index the file and create
 362 .I <ref.fasta>.fai
 363 on the disk. If regions are speficified, the subsequences will be
 364 retrieved and printed to stdout in the FASTA format. The input file can
 365 be compressed in the
 366 .B RAZF
 367 format.
 368
 369 .TP
 370 .B fixmate
 371 samtools fixmate <in.nameSrt.bam> <out.bam>
 372
 373 Fill in mate coordinates, ISIZE and mate related flags from a
 374 name-sorted alignment.
 375
 376 .TP
 377 .B rmdup
 378 samtools rmdup [-sS] <input.srt.bam> <out.bam>
 379
 380 Remove potential PCR duplicates: if multiple read pairs have identical
 381 external coordinates, only retain the pair with highest mapping quality.
 382 In the paired-end mode, this command
 383 .B ONLY
 384 works with FR orientation and requires ISIZE is correctly set. It does
 385 not work for unpaired reads (e.g. two ends mapped to different
 386 chromosomes or orphan reads).
 387
 388 .B OPTIONS:
 389 .RS
 390 .TP 8
 391 .B -s
 392 Remove duplicate for single-end reads. By default, the command works for
 393 paired-end reads only.
 394 .TP 8
 395 .B -S
 396 Treat paired-end reads and single-end reads.
 397 .RE
 398
 399 .TP
 400 .B calmd
 401 samtools calmd [-EeubSr] [-C capQcoef] <aln.bam> <ref.fasta>
 402
 403 Generate the MD tag. If the MD tag is already present, this command will
 404 give a warning if the MD tag generated is different from the existing
 405 tag. Output SAM by default.
 406
 407 .B OPTIONS:
 408 .RS
 409 .TP 8
 410 .B -A
 411 When used jointly with
 412 .B -r
 413 this option overwrites the original base quality.
 414 .TP 8
 415 .B -e
 416 Convert a the read base to = if it is identical to the aligned reference
 417 base. Indel caller does not support the = bases at the moment.
 418 .TP
 419 .B -u
 420 Output uncompressed BAM
 421 .TP
 422 .B -b
 423 Output compressed BAM
 424 .TP
 425 .B -S
 426 The input is SAM with header lines
 427 .TP
 428 .BI -C \ INT
 429 Coefficient to cap mapping quality of poorly mapped reads. See the
 430 .B pileup
 431 command for details. [0]
 432 .TP
 433 .B -r
 434 Compute the BQ tag (without -A) or cap base quality by BAQ (with -A).
 435 .TP
 436 .B -E
 437 Extended BAQ calculation. This option trades specificity for sensitivity, though the
 438 effect is minor.
 439 .RE
 440
 441 .TP
 442 .B targetcut
 443 samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] <in.bam>
 444
 445 This command identifies target regions by examining the continuity of read depth, computes
 446 haploid consensus sequences of targets and outputs a SAM with each sequence corresponding
 447 to a target. When option
 448 .B -f
 449 is in use, BAQ will be applied. This command is
 450 .B only
 451 designed for cutting fosmid clones from fosmid pool sequencing [Ref. Kitzman et al. (2010)].
 452 .RE
 453
 454 .TP
 455 .B phase
 456 samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ] <in.bam>
 457
 458 Call and phase heterozygous SNPs.
 459 .B OPTIONS:
 460 .RS
 461 .TP 8
 462 .B -A
 463 Drop reads with ambiguous phase.
 464 .TP 8
 465 .BI -b \ STR
 466 Prefix of BAM output. When this option is in use, phase-0 reads will be saved in file
 467 .BR STR .0.bam
 468 and phase-1 reads in
 469 .BR STR .1.bam.
 470 Phase unknown reads will be randomly allocated to one of the two files. Chimeric reads
 471 with switch errors will be saved in
 472 .BR STR .chimeric.bam.
 473 [null]
 474 .TP
 475 .B -F
 476 Do not attempt to fix chimeric reads.
 477 .TP
 478 .BI -k \ INT
 479 Maximum length for local phasing. [13]
 480 .TP
 481 .BI -q \ INT
 482 Minimum Phred-scaled LOD to call a heterozygote. [40]
 483 .TP
 484 .BI -Q \ INT
 485 Minimum base quality to be used in het calling. [13]
 486 .RE
 487
 488 .TP
 489 .B pileup
 490 samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list] [-l
 491 in.site_list] [-C capMapQ] [-M maxMapQ] [-T theta] [-N nHap] [-r
 492 pairDiffRate] [-m mask] [-d maxIndelDepth] [-G indelPrior]
 493 <in.bam>|<in.sam>
 494
 495 Print the alignment in the pileup format. In the pileup format, each
 496 line represents a genomic position, consisting of chromosome name,
 497 coordinate, reference base, read bases, read qualities and alignment
 498 mapping qualities. Information on match, mismatch, indel, strand,
 499 mapping quality and start and end of a read are all encoded at the read
 500 base column. At this column, a dot stands for a match to the reference
 501 base on the forward strand, a comma for a match on the reverse strand,
 502 a '>' or '<' for a reference skip, `ACGTN' for a mismatch on the forward
 503 strand and `acgtn' for a mismatch on the reverse strand. A pattern
 504 `\\+[0-9]+[ACGTNacgtn]+' indicates there is an insertion between this
 505 reference position and the next reference position. The length of the
 506 insertion is given by the integer in the pattern, followed by the
 507 inserted sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'
 508 represents a deletion from the reference. The deleted bases will be
 509 presented as `*' in the following lines. Also at the read base column, a
 510 symbol `^' marks the start of a read. The ASCII of the character
 511 following `^' minus 33 gives the mapping quality. A symbol `$' marks the
 512 end of a read segment.
 513
 514 If option
 515 .B -c
 516 is applied, the consensus base, Phred-scaled consensus quality, SNP
 517 quality (i.e. the Phred-scaled probability of the consensus being
 518 identical to the reference) and root mean square (RMS) mapping quality
 519 of the reads covering the site will be inserted between the `reference
 520 base' and the `read bases' columns. An indel occupies an additional
 521 line. Each indel line consists of chromosome name, coordinate, a star,
 522 the genotype, consensus quality, SNP quality, RMS mapping quality, #
 523 covering reads, the first alllele, the second allele, # reads supporting
 524 the first allele, # reads supporting the second allele and # reads
 525 containing indels different from the top two alleles.
 526
 527 .B NOTE:
 528 Since 0.1.10, the `pileup' command is deprecated by `mpileup'.
 529
 530 .B OPTIONS:
 531 .RS
 532 .TP 10
 533 .B -B
 534 Disable the BAQ computation. See the
 535 .B mpileup
 536 command for details.
 537 .TP
 538 .B -c
 539 Call the consensus sequence. Options
 540 .BR -T ", " -N ", " -I " and " -r
 541 are only effective when
 542 .BR -c " or " -g
 543 is in use.
 544 .TP
 545 .BI -C \ INT
 546 Coefficient for downgrading the mapping quality of poorly mapped
 547 reads. See the
 548 .B mpileup
 549 command for details. [0]
 550 .TP
 551 .BI -d \ INT
 552 Use the first
 553 .I NUM
 554 reads in the pileup for indel calling for speed up. Zero for unlimited. [1024]
 555 .TP
 556 .BI -f \ FILE
 557 The reference sequence in the FASTA format. Index file
 558 .I FILE.fai
 559 will be created if
 560 absent.
 561 .TP
 562 .B -g
 563 Generate genotype likelihood in the binary GLFv3 format. This option
 564 suppresses -c, -i and -s. This option is deprecated by the
 565 .B mpileup
 566 command.
 567 .TP
 568 .B -i
 569 Only output pileup lines containing indels.
 570 .TP
 571 .BI -I \ INT
 572 Phred probability of an indel in sequencing/prep. [40]
 573 .TP
 574 .BI -l \ FILE
 575 List of sites at which pileup is output. This file is space
 576 delimited. The first two columns are required to be chromosome and
 577 1-based coordinate. Additional columns are ignored. It is
 578 recommended to use option
 579 .TP
 580 .BI -m \ INT
 581 Filter reads with flag containing bits in
 582 .I INT
 583 [1796]
 584 .TP
 585 .BI -M \ INT
 586 Cap mapping quality at INT [60]
 587 .TP
 588 .BI -N \ INT
 589 Number of haplotypes in the sample (>=2) [2]
 590 .TP
 591 .BI -r \ FLOAT
 592 Expected fraction of differences between a pair of haplotypes [0.001]
 593 .TP
 594 .B -s
 595 Print the mapping quality as the last column. This option makes the
 596 output easier to parse, although this format is not space efficient.
 597 .TP
 598 .B -S
 599 The input file is in SAM.
 600 .TP
 601 .BI -t \ FILE
 602 List of reference names ane sequence lengths, in the format described
 603 for the
 604 .B import
 605 command. If this option is present, samtools assumes the input
 606 .I <in.alignment>
 607 is in SAM format; otherwise it assumes in BAM format.
 608 .B -s
 609 together with
 610 .B -l
 611 as in the default format we may not know the mapping quality.
 612 .TP
 613 .BI -T \ FLOAT
 614 The theta parameter (error dependency coefficient) in the maq consensus
 615 calling model [0.85]
 616 .RE
 617
 618 .SH SAM FORMAT
 619
 620 SAM is TAB-delimited. Apart from the header lines, which are started
 621 with the `@' symbol, each alignment line consists of:
 622
 623 .TS
 624 center box;
 625 cb | cb | cb
 626 n | l | l .
 627 Col     Field   Description
 628 _
 629 1       QNAME   Query (pair) NAME
 630 2       FLAG    bitwise FLAG
 631 3       RNAME   Reference sequence NAME
 632 4       POS     1-based leftmost POSition/coordinate of clipped sequence
 633 5       MAPQ    MAPping Quality (Phred-scaled)
 634 6       CIAGR   extended CIGAR string
 635 7       MRNM    Mate Reference sequence NaMe (`=' if same as RNAME)
 636 8       MPOS    1-based Mate POSistion
 637 9       ISIZE   Inferred insert SIZE
 638 10      SEQ     query SEQuence on the same strand as the reference
 639 11      QUAL    query QUALity (ASCII-33 gives the Phred base quality)
 640 12      OPT     variable OPTional fields in the format TAG:VTYPE:VALUE
 641 .TE
 642
 643 .PP
 644 Each bit in the FLAG field is defined as:
 645
 646 .TS
 647 center box;
 648 cb | cb | cb
 649 l | c | l .
 650 Flag    Chr     Description
 651 _
 652 0x0001  p       the read is paired in sequencing
 653 0x0002  P       the read is mapped in a proper pair
 654 0x0004  u       the query sequence itself is unmapped
 655 0x0008  U       the mate is unmapped
 656 0x0010  r       strand of the query (1 for reverse)
 657 0x0020  R       strand of the mate
 658 0x0040  1       the read is the first read in a pair
 659 0x0080  2       the read is the second read in a pair
 660 0x0100  s       the alignment is not primary
 661 0x0200  f       the read fails platform/vendor quality checks
 662 0x0400  d       the read is either a PCR or an optical duplicate
 663 .TE
 664
 665 .SH EXAMPLES
 666 .IP o 2
 667 Import SAM to BAM when
 668 .B @SQ
 669 lines are present in the header:
 670
 671   samtools view -bS aln.sam > aln.bam
 672
 673 If
 674 .B @SQ
 675 lines are absent:
 676
 677   samtools faidx ref.fa
 678   samtools view -bt ref.fa.fai aln.sam > aln.bam
 679
 680 where
 681 .I ref.fa.fai
 682 is generated automatically by the
 683 .B faidx
 684 command.
 685
 686 .IP o 2
 687 Attach the
 688 .B RG
 689 tag while merging sorted alignments:
 690
 691   perl -e 'print "@RG\\tID:ga\\tSM:hs\\tLB:ga\\tPL:Illumina\\n@RG\\tID:454\\tSM:hs\\tLB:454\\tPL:454\\n"' > rg.txt
 692   samtools merge -rh rg.txt merged.bam ga.bam 454.bam
 693
 694 The value in a
 695 .B RG
 696 tag is determined by the file name the read is coming from. In this
 697 example, in the
 698 .IR merged.bam ,
 699 reads from
 700 .I ga.bam
 701 will be attached
 702 .IR RG:Z:ga ,
 703 while reads from
 704 .I 454.bam
 705 will be attached
 706 .IR RG:Z:454 .
 707
 708 .IP o 2
 709 Call SNPs and short indels for one diploid individual:
 710
 711   samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf
 712   bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf
 713
 714 The
 715 .B -D
 716 option of varFilter controls the maximum read depth, which should be
 717 adjusted to about twice the average read depth.  One may consider to add
 718 .B -C50
 719 to
 720 .B mpileup
 721 if mapping quality is overestimated for reads containing excessive
 722 mismatches. Applying this option usually helps
 723 .B BWA-short
 724 but may not other mappers.
 725
 726 .IP o 2
 727 Generate the consensus sequence for one diploid individual:
 728
 729   samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
 730
 731 .IP o 2
 732 Phase one individual:
 733
 734   samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out
 735
 736 The
 737 .B calmd
 738 command is used to reduce false heterozygotes around INDELs.
 739
 740 .IP o 2
 741 Call SNPs and short indels for multiple diploid individuals:
 742
 743   samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view -bcvg - > var.raw.bcf
 744   bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > var.flt.vcf
 745
 746 Individuals are identified from the
 747 .B SM
 748 tags in the
 749 .B @RG
 750 header lines. Individuals can be pooled in one alignment file; one
 751 individual can also be separated into multiple files. The
 752 .B -P
 753 option specifies that indel candidates should be collected only from
 754 read groups with the
 755 .B @RG-PL
 756 tag set to
 757 .IR ILLUMINA .
 758 Collecting indel candidates from reads sequenced by an indel-prone
 759 technology may affect the performance of indel calling.
 760
 761 .IP o 2
 762 Derive the allele frequency spectrum (AFS) on a list of sites from multiple individuals:
 763
 764   samtools mpileup -Igf ref.fa *.bam > all.bcf
 765   bcftools view -bl sites.list all.bcf > sites.bcf
 766   bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
 767   bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
 768   bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs
 769   ......
 770
 771 where
 772 .I sites.list
 773 contains the list of sites with each line consisting of the reference
 774 sequence name and position. The following
 775 .B bcftools
 776 commands estimate AFS by EM.
 777
 778 .IP o 2
 779 Dump BAQ applied alignment for other SNP callers:
 780
 781   samtools calmd -bAr aln.bam > aln.baq.bam
 782
 783 It adds and corrects the
 784 .B NM
 785 and
 786 .B MD
 787 tags at the same time. The
 788 .B calmd
 789 command also comes with the
 790 .B -C
 791 option, the same as the one in
 792 .B pileup
 793 and
 794 .BR mpileup .
 795 Apply if it helps.
 796
 797 .SH LIMITATIONS
 798 .PP
 799 .IP o 2
 800 Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c.
 801 .IP o 2
 802 In merging, the input files are required to have the same number of
 803 reference sequences. The requirement can be relaxed. In addition,
 804 merging does not reconstruct the header dictionaries
 805 automatically. Endusers have to provide the correct header. Picard is
 806 better at merging.
 807 .IP o 2
 808 Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan
 809 reads or ends mapped to different chromosomes). If this is a concern,
 810 please use Picard's MarkDuplicate which correctly handles these cases,
 811 although a little slower.
 812
 813 .SH AUTHOR
 814 .PP
 815 Heng Li from the Sanger Institute wrote the C version of samtools. Bob
 816 Handsaker from the Broad Institute implemented the BGZF library and Jue
 817 Ruan from Beijing Genomics Institute wrote the RAZF library. John
 818 Marshall and Petr Danecek contribute to the source code and various
 819 people from the 1000 Genomes Project have contributed to the SAM format
 820 specification.
 821
 822 .SH SEE ALSO
 823 .PP
 824 Samtools website: <http://samtools.sourceforge.net>