sam/samtools.txt

   1 samtools(1)                  Bioinformatics tools                  samtools(1)
   2
   3
   4
   5 NAME
   6        samtools - Utilities for the Sequence Alignment/Map (SAM) format
   7
   8 SYNOPSIS
   9        samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
  10
  11        samtools sort aln.bam aln.sorted
  12
  13        samtools index aln.sorted.bam
  14
  15        samtools idxstats aln.sorted.bam
  16
  17        samtools view aln.sorted.bam chr2:20,100,000-20,200,000
  18
  19        samtools merge out.bam in1.bam in2.bam in3.bam
  20
  21        samtools faidx ref.fasta
  22
  23        samtools pileup -vcf ref.fasta aln.sorted.bam
  24
  25        samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
  26
  27        samtools tview aln.sorted.bam ref.fasta
  28
  29
  30 DESCRIPTION
  31        Samtools is a set of utilities that manipulate alignments  in  the  BAM
  32        format. It imports from and exports to the SAM (Sequence Alignment/Map)
  33        format, does sorting, merging and  indexing,  and  allows  to  retrieve
  34        reads in any regions swiftly.
  35
  36        Samtools  is designed to work on a stream. It regards an input file `-'
  37        as the standard input (stdin) and an output file `-'  as  the  standard
  38        output (stdout). Several commands can thus be combined with Unix pipes.
  39        Samtools always output warning and error messages to the standard error
  40        output (stderr).
  41
  42        Samtools  is  also able to open a BAM (not SAM) file on a remote FTP or
  43        HTTP server if the BAM file name starts  with  `ftp://'  or  `http://'.
  44        Samtools  checks  the  current working directory for the index file and
  45        will download the index upon absence. Samtools does  not  retrieve  the
  46        entire alignment file unless it is asked to do so.
  47
  48
  49 COMMANDS AND OPTIONS
  50        view      samtools  view  [-bchuHS]  [-t  in.refList]  [-o  output] [-f
  51                  reqFlag] [-F skipFlag] [-q minMapQ] [-l  library]  [-r  read-
  52                  Group] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]
  53
  54                  Extract/print  all or sub alignments in SAM or BAM format. If
  55                  no region is specified, all the alignments will  be  printed;
  56                  otherwise  only  alignments overlapping the specified regions
  57                  will be output. An alignment may be given multiple  times  if
  58                  it is overlapping several regions. A region can be presented,
  59                  for example, in  the  following  format:  `chr2'  (the  whole
  60                  chr2),  `chr2:1000000'  (region starting from 1,000,000bp) or
  61                  `chr2:1,000,000-2,000,000'  (region  between  1,000,000   and
  62                  2,000,000bp  including  the  end  points).  The coordinate is
  63                  1-based.
  64
  65                  OPTIONS:
  66
  67                  -b      Output in the BAM format.
  68
  69                  -f INT  Only output alignments with all bits in  INT  present
  70                          in the FLAG field. INT can be in hex in the format of
  71                          /^0x[0-9A-F]+/ [0]
  72
  73                  -F INT  Skip alignments with bits present in INT [0]
  74
  75                  -h      Include the header in the output.
  76
  77                  -H      Output the header only.
  78
  79                  -l STR  Only output reads in library STR [null]
  80
  81                  -o FILE Output file [stdout]
  82
  83                  -q INT  Skip alignments with MAPQ smaller than INT [0]
  84
  85                  -r STR  Only output reads in read group STR [null]
  86
  87                  -R FILE Output reads in read groups listed in FILE [null]
  88
  89                  -S      Input is in SAM. If @SQ header lines are absent,  the
  90                          `-t' option is required.
  91
  92                  -c      Instead  of  printing the alignments, only count them
  93                          and print the total number. All filter options,  such
  94                          as `-f', `-F' and `-q' , are taken into account.
  95
  96                  -t FILE This  file  is  TAB-delimited. Each line must contain
  97                          the reference name and the length of  the  reference,
  98                          one  line  for  each  distinct  reference; additional
  99                          fields are ignored. This file also defines the  order
 100                          of  the  reference  sequences  in sorting. If you run
 101                          `samtools faidx <ref.fa>', the resultant  index  file
 102                          <ref.fa>.fai  can be used as this <in.ref_list> file.
 103
 104                  -u      Output uncompressed BAM. This option saves time spent
 105                          on  compression/decomprssion  and  is  thus preferred
 106                          when the output is piped to another samtools command.
 107
 108
 109        tview     samtools tview <in.sorted.bam> [ref.fasta]
 110
 111                  Text  alignment viewer (based on the ncurses library). In the
 112                  viewer, press `?' for help and press `g' to check the  align-
 113                  ment    start    from   a   region   in   the   format   like
 114                  `chr10:10,000,000' or `=10,000,000'  when  viewing  the  same
 115                  reference sequence.
 116
 117
 118        mpileup   samtools mpileup [-Bug] [-C capQcoef] [-r reg] [-f in.fa] [-l
 119                  list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam
 120                  [...]]
 121
 122                  Generate  BCF or pileup for one or multiple BAM files. Align-
 123                  ment records are grouped by sample identifiers in @RG  header
 124                  lines.  If  sample identifiers are absent, each input file is
 125                  regarded as one sample.
 126
 127                  OPTIONS:
 128
 129                  -B      Disable probabilistic realignment for the computation
 130                          of  base  alignment  quality (BAQ). BAQ is the Phred-
 131                          scaled probability of a read base  being  misaligned.
 132                          Applying  this  option  greatly helps to reduce false
 133                          SNPs caused by misalignments.
 134
 135                  -C INT  Coefficient for downgrading mapping quality for reads
 136                          containing  excessive mismatches. Given a read with a
 137                          phred-scaled probability q of  being  generated  from
 138                          the mapped position, the new mapping quality is about
 139                          sqrt((INT-q)/INT)*INT. A  zero  value  disables  this
 140                          functionality;  if enabled, the recommended value for
 141                          BWA is 50. [0]
 142
 143                  -e INT  Phred-scaled gap extension sequencing error probabil-
 144                          ity. Reducing INT leads to longer indels. [20]
 145
 146                  -f FILE The reference file [null]
 147
 148                  -g      Compute  genotype  likelihoods and output them in the
 149                          binary call format (BCF).
 150
 151                  -h INT  Coefficient for modeling homopolymer errors. Given an
 152                          l-long  homopolymer  run,  the sequencing error of an
 153                          indel of size s is modeled as INT*s/l.  [100]
 154
 155                  -l FILE File containing a list of sites where pileup  or  BCF
 156                          is outputted [null]
 157
 158                  -o INT  Phred-scaled  gap  open sequencing error probability.
 159                          Reducing INT leads to more indel calls. [40]
 160
 161                  -P STR  Comma dilimited list of platforms (determined by @RG-
 162                          PL)  from  which indel candidates are obtained. It is
 163                          recommended to collect indel candidates from sequenc-
 164                          ing  technologies that have low indel error rate such
 165                          as ILLUMINA. [all]
 166
 167                  -q INT  Minimum mapping quality for an alignment to  be  used
 168                          [0]
 169
 170                  -Q INT  Minimum base quality for a base to be considered [13]
 171
 172                  -r STR  Only generate pileup in region STR [all sites]
 173
 174                  -u      Similar to -g except that the output is  uncompressed
 175                          BCF, which is preferred for piping.
 176
 177
 178        reheader  samtools reheader <in.header.sam> <in.bam>
 179
 180                  Replace   the   header   in   in.bam   with   the  header  in
 181                  in.header.sam.  This command is much  faster  than  replacing
 182                  the header with a BAM->SAM->BAM conversion.
 183
 184
 185        sort      samtools sort [-no] [-m maxMem] <in.bam> <out.prefix>
 186
 187                  Sort  alignments  by  leftmost  coordinates.  File  <out.pre-
 188                  fix>.bam will be created. This command may also create tempo-
 189                  rary  files <out.prefix>.%d.bam when the whole alignment can-
 190                  not be fitted into memory (controlled by option -m).
 191
 192                  OPTIONS:
 193
 194                  -o      Output the final alignment to the standard output.
 195
 196                  -n      Sort by read names rather than by chromosomal coordi-
 197                          nates
 198
 199                  -m INT  Approximately    the    maximum    required   memory.
 200                          [500000000]
 201
 202
 203        merge     samtools  merge  [-nur]  [-h  inh.sam]  [-R  reg]   <out.bam>
 204                  <in1.bam> <in2.bam> [...]
 205
 206                  Merge multiple sorted alignments.  The header reference lists
 207                  of all the input BAM files, and the @SQ headers  of  inh.sam,
 208                  if  any,  must  all  refer  to  the  same  set  of  reference
 209                  sequences.  The header reference list and (unless  overridden
 210                  by  -h) `@' headers of in1.bam will be copied to out.bam, and
 211                  the headers of other files will be ignored.
 212
 213                  OPTIONS:
 214
 215                  -h FILE Use the lines of FILE as `@' headers to be copied  to
 216                          out.bam, replacing any header lines that would other-
 217                          wise be copied from in1.bam.  (FILE  is  actually  in
 218                          SAM  format, though any alignment records it may con-
 219                          tain are ignored.)
 220
 221                  -R STR  Merge files in the specified region indicated by STR
 222
 223                  -r      Attach an RG tag to each alignment. The tag value  is
 224                          inferred from file names.
 225
 226                  -n      The  input alignments are sorted by read names rather
 227                          than by chromosomal coordinates
 228
 229                  -u      Uncompressed BAM output
 230
 231
 232        index     samtools index <aln.bam>
 233
 234                  Index sorted alignment for fast  random  access.  Index  file
 235                  <aln.bam>.bai will be created.
 236
 237
 238        idxstats  samtools idxstats <aln.bam>
 239
 240                  Retrieve and print stats in the index file. The output is TAB
 241                  delimited with each line  consisting  of  reference  sequence
 242                  name, sequence length, # mapped reads and # unmapped reads.
 243
 244
 245        faidx     samtools faidx <ref.fasta> [region1 [...]]
 246
 247                  Index  reference sequence in the FASTA format or extract sub-
 248                  sequence from indexed reference sequence.  If  no  region  is
 249                  specified,   faidx   will   index   the   file   and   create
 250                  <ref.fasta>.fai on the disk. If regions are speficified,  the
 251                  subsequences  will  be retrieved and printed to stdout in the
 252                  FASTA format. The input file can be compressed  in  the  RAZF
 253                  format.
 254
 255
 256        fixmate   samtools fixmate <in.nameSrt.bam> <out.bam>
 257
 258                  Fill in mate coordinates, ISIZE and mate related flags from a
 259                  name-sorted alignment.
 260
 261
 262        rmdup     samtools rmdup [-sS] <input.srt.bam> <out.bam>
 263
 264                  Remove potential PCR duplicates: if multiple read pairs  have
 265                  identical  external  coordinates,  only  retain the pair with
 266                  highest mapping quality.  In the paired-end mode,  this  com-
 267                  mand  ONLY  works  with  FR orientation and requires ISIZE is
 268                  correctly set. It does not work for unpaired reads (e.g.  two
 269                  ends mapped to different chromosomes or orphan reads).
 270
 271                  OPTIONS:
 272
 273                  -s      Remove  duplicate  for  single-end reads. By default,
 274                          the command works for paired-end reads only.
 275
 276                  -S      Treat paired-end reads and single-end reads.
 277
 278
 279        calmd     samtools calmd [-eubSr] [-C capQcoef] <aln.bam> <ref.fasta>
 280
 281                  Generate the MD tag. If the MD tag is already  present,  this
 282                  command  will  give a warning if the MD tag generated is dif-
 283                  ferent from the existing tag. Output SAM by default.
 284
 285                  OPTIONS:
 286
 287                  -A      When used jointly with -r this option overwrites  the
 288                          original base quality.
 289
 290                  -e      Convert  a  the  read base to = if it is identical to
 291                          the aligned reference base.  Indel  caller  does  not
 292                          support the = bases at the moment.
 293
 294                  -u      Output uncompressed BAM
 295
 296                  -b      Output compressed BAM
 297
 298                  -S      The input is SAM with header lines
 299
 300                  -C INT  Coefficient  to  cap mapping quality of poorly mapped
 301                          reads. See the pileup command for details. [0]
 302
 303                  -r      Compute the BQ tag without changing the base quality.
 304
 305
 306        pileup    samtools pileup [-2sSBicv] [-f in.ref.fasta] [-t in.ref_list]
 307                  [-l in.site_list] [-C capMapQ] [-M maxMapQ]  [-T  theta]  [-N
 308                  nHap]  [-r  pairDiffRate]  [-m  mask]  [-d maxIndelDepth] [-G
 309                  indelPrior] <in.bam>|<in.sam>
 310
 311                  Print the alignment in the pileup format. In the pileup  for-
 312                  mat,  each  line represents a genomic position, consisting of
 313                  chromosome name, coordinate, reference base, read bases, read
 314                  qualities  and  alignment  mapping  qualities. Information on
 315                  match, mismatch, indel, strand, mapping quality and start and
 316                  end  of  a  read  are all encoded at the read base column. At
 317                  this column, a dot stands for a match to the  reference  base
 318                  on  the  forward  strand,  a comma for a match on the reverse
 319                  strand, a '>' or '<' for a reference skip, `ACGTN' for a mis-
 320                  match on the forward strand and `acgtn' for a mismatch on the
 321                  reverse strand. A pattern  `\+[0-9]+[ACGTNacgtn]+'  indicates
 322                  there is an insertion between this reference position and the
 323                  next reference position. The length of the insertion is given
 324                  by  the  integer  in  the  pattern,  followed by the inserted
 325                  sequence. Similarly, a pattern `-[0-9]+[ACGTNacgtn]+'  repre-
 326                  sents  a  deletion from the reference. The deleted bases will
 327                  be presented as `*' in the following lines. Also at the  read
 328                  base  column,  a  symbol  `^'  marks the start of a read. The
 329                  ASCII of the character following `^' minus 33 gives the  map-
 330                  ping quality. A symbol `$' marks the end of a read segment.
 331
 332                  If  option  -c  is  applied, the consensus base, Phred-scaled
 333                  consensus quality, SNP quality (i.e. the Phred-scaled  proba-
 334                  bility of the consensus being identical to the reference) and
 335                  root mean square (RMS) mapping quality of the reads  covering
 336                  the  site  will  be inserted between the `reference base' and
 337                  the `read bases' columns. An  indel  occupies  an  additional
 338                  line.  Each  indel  line consists of chromosome name, coordi-
 339                  nate, a star, the genotype, consensus quality,  SNP  quality,
 340                  RMS mapping quality, # covering reads, the first alllele, the
 341                  second allele, # reads supporting the first allele,  #  reads
 342                  supporting  the  second  allele and # reads containing indels
 343                  different from the top two alleles.
 344
 345                  NOTE: Since 0.1.10, the `pileup'  command  is  deprecated  by
 346                  `mpileup'.
 347
 348                  OPTIONS:
 349
 350                  -B        Disable  the  BAQ computation. See the mpileup com-
 351                            mand for details.
 352
 353                  -c        Call the consensus sequence. Options -T, -N, -I and
 354                            -r are only effective when -c or -g is in use.
 355
 356                  -C INT    Coefficient  for downgrading the mapping quality of
 357                            poorly mapped reads. See the  mpileup  command  for
 358                            details. [0]
 359
 360                  -d INT    Use  the  first  NUM  reads in the pileup for indel
 361                            calling for speed up. Zero for unlimited. [1024]
 362
 363                  -f FILE   The reference sequence in the FASTA  format.  Index
 364                            file FILE.fai will be created if absent.
 365
 366                  -g        Generate  genotype  likelihood  in the binary GLFv3
 367                            format. This option suppresses -c, -i and -s.  This
 368                            option is deprecated by the mpileup command.
 369
 370                  -i        Only output pileup lines containing indels.
 371
 372                  -I INT    Phred  probability  of an indel in sequencing/prep.
 373                            [40]
 374
 375                  -l FILE   List of sites at which pileup is output. This  file
 376                            is  space  delimited.  The  first  two  columns are
 377                            required to be chromosome and  1-based  coordinate.
 378                            Additional  columns  are ignored. It is recommended
 379                            to use option
 380
 381                  -m INT    Filter reads  with  flag  containing  bits  in  INT
 382                            [1796]
 383
 384                  -M INT    Cap mapping quality at INT [60]
 385
 386                  -N INT    Number of haplotypes in the sample (>=2) [2]
 387
 388                  -r FLOAT  Expected  fraction of differences between a pair of
 389                            haplotypes [0.001]
 390
 391                  -s        Print the mapping quality as the last column.  This
 392                            option  makes  the output easier to parse, although
 393                            this format is not space efficient.
 394
 395                  -S        The input file is in SAM.
 396
 397                  -t FILE   List of reference names ane  sequence  lengths,  in
 398                            the  format  described  for  the import command. If
 399                            this option is present, samtools assumes the  input
 400                            <in.alignment>  is  in  SAM  format;  otherwise  it
 401                            assumes in BAM format.  -s together with -l  as  in
 402                            the  default  format  we  may  not know the mapping
 403                            quality.
 404
 405                  -T FLOAT  The theta parameter (error dependency  coefficient)
 406                            in the maq consensus calling model [0.85]
 407
 408
 409 SAM FORMAT
 410        SAM  is  TAB-delimited.  Apart from the header lines, which are started
 411        with the `@' symbol, each alignment line consists of:
 412
 413
 414        +----+-------+----------------------------------------------------------+
 415        |Col | Field |                       Description                        |
 416        +----+-------+----------------------------------------------------------+
 417        | 1  | QNAME | Query (pair) NAME                                        |
 418        | 2  | FLAG  | bitwise FLAG                                             |
 419        | 3  | RNAME | Reference sequence NAME                                  |
 420        | 4  | POS   | 1-based leftmost POSition/coordinate of clipped sequence |
 421        | 5  | MAPQ  | MAPping Quality (Phred-scaled)                           |
 422        | 6  | CIAGR | extended CIGAR string                                    |
 423        | 7  | MRNM  | Mate Reference sequence NaMe (`=' if same as RNAME)      |
 424        | 8  | MPOS  | 1-based Mate POSistion                                   |
 425        | 9  | ISIZE | Inferred insert SIZE                                     |
 426        |10  | SEQ   | query SEQuence on the same strand as the reference       |
 427        |11  | QUAL  | query QUALity (ASCII-33 gives the Phred base quality)    |
 428        |12  | OPT   | variable OPTional fields in the format TAG:VTYPE:VALUE   |
 429        +----+-------+----------------------------------------------------------+
 430
 431        Each bit in the FLAG field is defined as:
 432
 433
 434           +-------+-----+--------------------------------------------------+
 435           | Flag  | Chr |                   Description                    |
 436           +-------+-----+--------------------------------------------------+
 437           |0x0001 |  p  | the read is paired in sequencing                 |
 438           |0x0002 |  P  | the read is mapped in a proper pair              |
 439           |0x0004 |  u  | the query sequence itself is unmapped            |
 440           |0x0008 |  U  | the mate is unmapped                             |
 441           |0x0010 |  r  | strand of the query (1 for reverse)              |
 442           |0x0020 |  R  | strand of the mate                               |
 443           |0x0040 |  1  | the read is the first read in a pair             |
 444           |0x0080 |  2  | the read is the second read in a pair            |
 445           |0x0100 |  s  | the alignment is not primary                     |
 446           |0x0200 |  f  | the read fails platform/vendor quality checks    |
 447           |0x0400 |  d  | the read is either a PCR or an optical duplicate |
 448           +-------+-----+--------------------------------------------------+
 449
 450 EXAMPLES
 451        o Import SAM to BAM when @SQ lines are present in the header:
 452
 453            samtools view -bS aln.sam > aln.bam
 454
 455          If @SQ lines are absent:
 456
 457            samtools faidx ref.fa
 458            samtools view -bt ref.fa.fai aln.sam > aln.bam
 459
 460          where ref.fa.fai is generated automatically by the faidx command.
 461
 462
 463        o Attach the RG tag while merging sorted alignments:
 464
 465            perl      -e       'print       "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illu-
 466          mina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
 467            samtools merge -rh rg.txt merged.bam ga.bam 454.bam
 468
 469          The value in a RG tag is determined by the file name the read is com-
 470          ing from. In this example, in the merged.bam, reads from ga.bam  will
 471          be  attached  RG:Z:ga,  while  reads  from  454.bam  will be attached
 472          RG:Z:454.
 473
 474
 475        o Call SNPs and short indels for one diploid individual:
 476
 477            samtools mpileup -ugf ref.fa aln.bam |  bcftools  view  -bvcg  -  >
 478          var.raw.bcf
 479            bcftools   view  var.raw.bcf  |  vcfutils.pl  varFilter  -D  100  >
 480          var.flt.vcf
 481
 482          The -D option of varFilter controls the  maximum  read  depth,  which
 483          should  be  adjusted  to about twice the average read depth.  One may
 484          consider to add -C50 to mpileup if mapping quality  is  overestimated
 485          for  reads containing excessive mismatches. Applying this option usu-
 486          ally helps BWA-short but may not other mappers.
 487
 488
 489        o Call SNPs and short indels for multiple diploid individuals:
 490
 491            samtools mpileup -P ILLUMINA -ugf  ref.fa  *.bam  |  bcftools  view
 492          -bcvg - > var.raw.bcf
 493            bcftools  view  var.raw.bcf  |  vcfutils.pl  varFilter  -D  2000  >
 494          var.flt.vcf
 495
 496          Individuals are identified from the SM tags in the @RG header  lines.
 497          Individuals  can  be pooled in one alignment file; one individual can
 498          also be separated into multiple files. The -P option  specifies  that
 499          indel  candidates  should be collected only from read groups with the
 500          @RG-PL tag set to ILLUMINA.  Collecting indel candidates  from  reads
 501          sequenced  by an indel-prone technology may affect the performance of
 502          indel calling.
 503
 504
 505        o Derive the allele frequency spectrum (AFS) on a list  of  sites  from
 506          multiple individuals:
 507
 508            samtools mpileup -Igf ref.fa *.bam > all.bcf
 509            bcftools view -bl sites.list all.bcf > sites.bcf
 510            bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs
 511            bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs
 512            bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs
 513            ......
 514
 515          where sites.list contains the list of sites with each line consisting
 516          of the reference sequence name and position. The  following  bcftools
 517          commands estimate AFS by EM.
 518
 519
 520        o Dump BAQ applied alignment for other SNP callers:
 521
 522            samtools calmd -bAr aln.bam > aln.baq.bam
 523
 524          It  adds  and corrects the NM and MD tags at the same time. The calmd
 525          command also comes with the -C option, the same as the one in  pileup
 526          and mpileup.  Apply if it helps.
 527
 528
 529 LIMITATIONS
 530        o Unaligned   words  used  in  bam_import.c,  bam_endian.h,  bam.c  and
 531          bam_aux.c.
 532
 533        o In merging, the input files are required to have the same  number  of
 534          reference  sequences.  The  requirement  can be relaxed. In addition,
 535          merging does not reconstruct the header  dictionaries  automatically.
 536          Endusers  have  to  provide  the  correct header. Picard is better at
 537          merging.
 538
 539        o Samtools paired-end rmdup does not  work  for  unpaired  reads  (e.g.
 540          orphan  reads  or ends mapped to different chromosomes). If this is a
 541          concern, please use Picard's MarkDuplicate  which  correctly  handles
 542          these cases, although a little slower.
 543
 544
 545 AUTHOR
 546        Heng  Li from the Sanger Institute wrote the C version of samtools. Bob
 547        Handsaker from the Broad Institute implemented the BGZF library and Jue
 548        Ruan  from Beijing Genomics Institute wrote the RAZF library. John Mar-
 549        shall and Petr Danecek contribute to the source code and various people
 550        from the 1000 Genomes Project have contributed to the SAM format speci-
 551        fication.
 552
 553
 554 SEE ALSO
 555        Samtools website: <http://samtools.sourceforge.net>
 556
 557
 558
 559 samtools-0.1.12                 2 December 2010                    samtools(1)