.TH samtools 1 "22 December 2008" "samtools-0.1.1" "Bioinformatics tools"
.SH NAME
.PP
samtools - Utilities for the Sequence Alignment/Map (SAM) format
.SH SYNOPSIS
.PP
samtools import ref_list.txt aln.sam.gz aln.bam
.PP
samtools sort aln.bam aln.sorted
.PP
samtools index aln.sorted.bam
.PP
samtools view aln.sorted.bam chr2:20,100,000-20,200,000
.PP
samtools merge out.bam in1.bam in2.bam in3.bam
.PP
samtools faidx ref.fasta
.PP
samtools pileup -f ref.fasta aln.sorted.bam
.PP
samtools tview aln.sorted.bam ref.fasta

.SH DESCRIPTION
.PP
Samtools is a set of utilities that manipulate alignments in the BAM
format. It imports from and exports to the SAM (Sequence
Alignment/Map) format, performs sorting, merging and indexing, and
allows reads in any region to be retrieved quickly.

.SH COMMANDS AND OPTIONS
.TP 10
.B import
samtools import <in.ref_list> <in.sam> <out.bam>

Convert alignments in SAM format to BAM format. File
.I <in.ref_list>
is TAB-delimited. Each line must contain the reference name and the
length of the reference, one line for each distinct reference;
additional fields are ignored. This file also defines the order of the
reference sequences in sorting. File
.I <in.sam>
can optionally be compressed with zlib or gzip. A single hyphen is
recognized as stdin or stdout, depending on the context.
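
The reference list layout described above can be illustrated with a
short Python sketch. It is not part of samtools; the function name
read_ref_list and the sample lengths are made up for illustration, and
skipping blank lines is an assumption rather than documented behaviour.

```python
import io


def read_ref_list(handle):
    """Parse a reference list: one 'name<TAB>length' record per line.

    Returns (name, length) tuples in file order, because that order is
    what defines the order of the references in sorting.
    """
    refs = []
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are skipped, not an error
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {line_no}: expected name and length")
        # Only the first two fields are used; additional fields are ignored.
        refs.append((fields[0], int(fields[1])))
    return refs


# Illustrative input only; the lengths are placeholders.
example = io.StringIO("chr1\t247249719\nchr2\t242951149\tsome note\n")
refs = read_ref_list(example)
```

Keeping the records in a list rather than a dictionary preserves the
file order, which matters because the order is meaningful here.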

.TP
.B sort
samtools sort [-n] [-m maxMem] <in.bam> <out.prefix>

Sort alignments by their leftmost coordinate. File
.I <out.prefix>.bam
will be created. This command may also create temporary files
.I <out.prefix>.%d.bam
when the whole alignment does not fit into memory (the limit is
controlled by option -m).

.B OPTIONS:
.RS
.TP 8
.B -n
Sort by read names rather than by chromosomal coordinates.
.TP
.B -m INT
Approximate maximum amount of memory to use.
.RE

.TP
.B merge
samtools merge [-n] <out.bam> <in1.bam> <in2.bam> [...]

Merge multiple sorted alignments. The header of
.I <in1.bam>
will be copied to
.I <out.bam>
and the headers of the other files will be ignored.

.B OPTIONS:
.RS
.TP 8
.B -n
The input alignments are sorted by read names rather than by chromosomal
coordinates.
.RE

.TP
.B index
samtools index <aln.bam>

Index a sorted alignment for fast random access. The index file
.I <aln.bam>.bai
will be created.

.TP
.B view
samtools view [-b] <in.bam> [region1 [...]]

Extract and print all alignments, or a subset of them, in SAM or BAM
format. If no region is specified, all the alignments will be printed;
otherwise only alignments overlapping the specified regions will be
output. An alignment may be output multiple times if it overlaps several
regions. A region can be specified, for example, in any of the following
formats: `chr2', `chr2:1000000' or `chr2:1,000,000-2,000,000'.

.B OPTIONS:
.RS
.TP 8
.B -b
Output in the BAM format.
.RE
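
The three region formats listed for the view command can be split
apart with a few string operations. The Python sketch below is only an
illustration and is not how samtools itself parses regions; the
function name parse_region is hypothetical, and it assumes that
reference names contain no colon. Parts that are absent from the string
are returned as None, since this page does not say how a missing end
coordinate is interpreted.

```python
def parse_region(region):
    """Split a region string into (name, start, end).

    Parts that are not present in the string are returned as None.
    Thousands separators (commas) in the coordinates are dropped.
    """
    name, colon, span = region.partition(":")
    if not colon:
        return name, None, None
    start_text, dash, end_text = span.replace(",", "").partition("-")
    start = int(start_text)
    end = int(end_text) if dash else None
    return name, start, end


whole = parse_region("chr2")
start_only = parse_region("chr2:1000000")
interval = parse_region("chr2:1,000,000-2,000,000")
```

Here whole is ("chr2", None, None), start_only is
("chr2", 1000000, None) and interval is ("chr2", 1000000, 2000000).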

.TP
.B faidx
samtools faidx <ref.fasta> [region1 [...]]

Index a reference sequence in the FASTA format, or extract subsequences
from an indexed reference sequence. If no region is specified,
.B faidx
will index the file and create
.I <ref.fasta>.fai
on disk. If regions are specified, the subsequences will be
retrieved and printed to stdout in the FASTA format. The input file can
be compressed in the
.B RAZF
format.

.TP
.B pileup
samtools pileup [-f in.ref.fasta] [-t in.ref_list] [-l in.site_list]
[-s] [-c] [-T theta] [-N nHap] [-r pairDiffRate] <in.alignment>

Print the alignment in the pileup format. In the pileup format, each
line represents a genomic position and consists of the chromosome name,
coordinate, reference base, read bases, read qualities and alignment
mapping qualities. Information on match, mismatch, indel, strand,
mapping quality and the start and end of a read is all encoded in the
read base column. In this column, a dot stands for a match to the
reference base on the forward strand, a comma for a match on the reverse
strand, `ACGTN' for a mismatch on the forward strand and `acgtn' for a
mismatch on the reverse strand. A pattern `\\+[0-9]+[ACGTNacgtn]+'
indicates that there is an insertion between this reference position and
the next reference position. The length of the insertion is given by the
integer in the pattern, followed by the inserted sequence. Similarly, a
pattern `-[0-9]+[ACGTNacgtn]+' represents a deletion from the reference.
Also in the read base column, a symbol `^' marks the start of a read
segment, that is, a contiguous subsequence of the read delimited by
`N/S/H' CIGAR operations. The ASCII code of the character following `^',
minus 33, gives the mapping quality. A symbol `$' marks the end of a
read segment.
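
The encoding of the read base column can be made concrete with a small
tokenizer. The Python sketch below follows only the symbols described
in the paragraph above; it is not taken from the samtools source, and
the function name decode_read_bases, the event names and the sample
column are all made up for illustration. Any character not covered by
the description is passed through as an `other' event rather than
rejected.

```python
def decode_read_bases(column):
    """Split a pileup read base column into (kind, value) events."""
    simple = {".": "match_forward", ",": "match_reverse", "$": "segment_end"}
    events = []
    i = 0
    while i < len(column):
        ch = column[i]
        if ch == "^":
            # Segment start: the next character carries the mapping
            # quality, encoded as its ASCII code minus 33.
            events.append(("segment_start", ord(column[i + 1]) - 33))
            i += 2
        elif ch in "+-":
            # Indel: digits give the length, then that many bases follow.
            j = i + 1
            while j < len(column) and column[j].isdigit():
                j += 1
            length = int(column[i + 1:j])
            kind = "insertion" if ch == "+" else "deletion"
            events.append((kind, column[j:j + length]))
            i = j + length
        else:
            if ch in simple:
                events.append((simple[ch], None))
            elif ch in "ACGTN":
                events.append(("mismatch_forward", ch))
            elif ch in "acgtn":
                events.append(("mismatch_reverse", ch))
            else:
                # Not described in the text above; kept verbatim.
                events.append(("other", ch))
            i += 1
    return events


# Synthetic column: a segment start with mapping quality 40, two matches,
# a segment end, a mismatch, a 2-base insertion, a mismatch, a deletion.
events = decode_read_bases("^I.,$A+2AGt-1c")
```

Two details matter in this sketch. The character after `^' is consumed
together with it, so a quality character that happens to be `+', `-' or
`$' is not mistaken for another symbol. The indel sequence is cut to
exactly the stated length, so a mismatch base that directly follows an
indel is not absorbed into it.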

If option
.B -c
is applied, the consensus base, consensus quality, SNP quality and
maximum mapping quality of the reads covering the site will be inserted
between the `reference base' and the `read bases' columns. An indel
occupies an additional line. Each indel line consists of the chromosome
name, the coordinate, a star, the two highest-scoring insertion/deletion
sequences, the number of reads strongly supporting the first indel, the
number of reads strongly supporting the second indel, the number of
reads that provide little information for distinguishing the indels, and
the number of reads that contain indels different from the top two.

.B OPTIONS:
.RS

.TP 10
.B -s
Print the mapping quality as the last column. This option makes the
output easier to parse, although this format is not space efficient.

.TP
.B -f FILE
The reference sequence in the FASTA format. Index file
.I FILE.fai
will be created if
absent.

.TP
.B -t FILE
List of reference names and sequence lengths, in the format described
for the
.B import
command. If this option is present, samtools assumes the input
.I <in.alignment>
is in SAM format; otherwise it assumes BAM format.

.TP
.B -l FILE
List of sites at which pileup is output. This file is space
delimited. The first two columns are required to be chromosome and
1-based coordinate. Additional columns are ignored. It is
recommended to use option
.B -s
together with
.B -l
because in the default format the mapping quality may not be available.

.TP
.B -c
Call the consensus sequence using the MAQ consensus model. Options
.B -T,
.B -N
and
.B -r
are only effective when
.B -c
is in use.

.TP
.B -T FLOAT
The theta parameter (error dependency coefficient) in the MAQ consensus
calling model [0.85]

.TP
.B -N INT
Number of haplotypes in the sample (>=2) [2]

.TP
.B -r FLOAT
Expected fraction of differences between a pair of haplotypes [0.001]

.RE
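
The site list accepted by option -l has a simple layout, which the
following Python sketch reads into a set for quick membership tests. It
is an illustration under stated assumptions, not samtools code: the
function name load_sites and the sample lines are hypothetical, and
skipping lines with fewer than two columns is an assumption.

```python
def load_sites(lines):
    """Collect (chromosome, position) pairs from a space-delimited site list.

    Positions are 1-based, as in the file. Columns after the first two
    are ignored.
    """
    sites = set()
    for line in lines:
        fields = line.split()  # split on runs of whitespace
        if len(fields) < 2:
            continue  # assumption: short or blank lines are skipped
        sites.add((fields[0], int(fields[1])))
    return sites


# Illustrative input; the third column on the first line is ignored.
sites = load_sites(["chr1 1000 rs123", "chr2 2500", ""])
wanted = ("chr1", 1000) in sites
```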

.TP
.B tview
samtools tview <in.sorted.bam> [ref.fasta]

Text alignment viewer (based on the ncurses library). In the viewer,
press `?' for help and press `g' to view the alignment starting from a
region given in a format like `chr10:10,000,000'. Note that if the
region shown on the screen contains no mapped reads, a blank screen will
be displayed. This is a known issue and will be improved later.

.RE

.SH LIMITATIONS
.PP
.IP o 2
In general, more testing is needed to ensure there are no severe bugs.
.IP o 2
PCR duplicate removal has not been implemented.
.IP o 2
Only a MAQ->SAM converter is implemented. More converters are needed.
.IP o 2
Reference sequence names and lengths are not acquired from the BAM/SAM header.
.IP o 2
CIGAR operations N and P may not be properly handled.
.IP o 2
There is a small known memory leak in the viewer.

.SH AUTHOR
.PP
Heng Li from the Sanger Institute is the author of samtools. Bob
Handsaker from the Broad Institute implemented the BGZF library and Jue
Ruan from the Beijing Genomics Institute wrote the RAZF library. Various
people in the 1000 Genomes Project contributed to the SAM format
specification.

.SH SEE ALSO
.PP
Samtools website: http://samtools.sourceforge.net