\title{Read DNA Sequences in a File}
read.dna(file, format = "interleaved", skip = 0,
         nlines = 0, comment.char = "#",
         as.character = FALSE, as.matrix = NULL)
  \item{file}{a file name specified by either a variable of mode character,
    or a double-quoted string.}
  \item{format}{a character string specifying the format of the DNA
    sequences. Four choices are possible: \code{"interleaved"},
    \code{"sequential"}, \code{"clustal"}, or \code{"fasta"}, or any
    unambiguous abbreviation of these.}
  \item{skip}{the number of lines of the input file to skip before
    beginning to read data.}
  \item{nlines}{the number of lines to be read (by default the file is
    read until its end).}
  \item{comment.char}{a single character; the remainder of the line
    after this character is ignored.}
  \item{as.character}{a logical controlling whether to return the
    sequences in character mode (\code{TRUE}) or as an object of class
    \code{"DNAbin"} (\code{FALSE}, the default).}
  \item{as.matrix}{(used if \code{format = "fasta"}) one of the
    following three values: (i) \code{NULL}: returns the sequences in a
    matrix if they are of the same length, otherwise in a list;
    (ii) \code{TRUE}: returns the sequences in a matrix, or stops with
    an error if they are of different lengths; (iii) \code{FALSE}:
    always returns the sequences in a list.}
  This function reads DNA sequences from a file, and returns a matrix or a
  list of DNA sequences with the names of the taxa read in the file as
  rownames or names, respectively. By default, the sequences are stored
  in binary format, otherwise (if \code{as.character = TRUE}) in lower
  case.
  This function follows the interleaved and sequential formats defined
  in PHYLIP (Felsenstein, 1993) but with the original feature that there
  is no restriction on the lengths of the taxa names. For these two
  formats, the first line of the file must contain the dimensions of the
  data (the number of taxa and the number of nucleotides); the
  sequences are considered as aligned and thus must be of the same
  length for all taxa. For the FASTA format, the conventions defined in
  the URL below (see References) are followed; the sequences are taken as
  non-aligned. For all formats, the nucleotides can be arranged in any
  way with blanks and line-breaks inside (with the restriction that the
  first ten nucleotides must be contiguous for the interleaved and
  sequential formats, see below). The names of the sequences are read
  from the file. Particularities for each format are detailed below.
    \item{Interleaved:}{the function starts to read the sequences after it
      finds one or more spaces (or tabulations). All characters before the
      sequences are taken as the taxa names after removing the leading and
      trailing spaces (so spaces in taxa names are allowed). It is assumed
      that the taxa names are not repeated in the subsequent blocks of
      nucleotides.}
    \item{Sequential:}{the same criterion as for the interleaved format
      is used to start reading the sequences and the taxa names; the
      sequences are then read until the number of nucleotides specified in
      the first line of the file is reached. This is repeated for each taxon.}
    \item{Clustal:}{this is the format output by the Clustal programs
      (.aln). It is somewhat similar to the interleaved format; the
      differences are that the dimensions of the data are not indicated
      in the file, and the names of the sequences are repeated in each block.}
    \item{FASTA:}{This looks like the sequential format but the taxa names
      (or rather a description of the sequence) are on separate lines
      beginning with a `greater than' character ``>'' (there may be
      leading spaces before this character). These lines are taken as taxa
      names after removing the ``>'' and the possible leading and trailing
      spaces. All the data in the file before the first sequence is ignored.}
  a matrix or a list (if \code{format = "fasta"}) of DNA sequences
  stored in binary format, or of mode character (if \code{as.character =
    TRUE}).
  Anonymous. FASTA format description.
  \url{http://www.ncbi.nlm.nih.gov/BLAST/fasta.html}

  Anonymous. IUPAC ambiguity codes.
  \url{http://www.ncbi.nlm.nih.gov/SNP/iupac.html}

  Felsenstein, J. (1993) Phylip (Phylogeny Inference Package) version
  3.5c. Department of Genetics, University of Washington.
  \url{http://evolution.genetics.washington.edu/phylip/phylip.html}
  \code{\link{read.GenBank}}, \code{\link{write.dna}},
  \code{\link{DNAbin}}, \code{\link{dist.dna}}, \code{\link{woodmouse}}
\author{Emmanuel Paradis}
### a small extract from `data(woodmouse)'
### (the first line gives the dimensions: 3 taxa, 40 nucleotides)
cat("3 40",
"No305 NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
"No304 ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
"No306 ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
file = "exdna.txt", sep = "\n")
ex.dna <- read.dna("exdna.txt", format = "sequential")
### the same data in interleaved format...
cat("3 40",
"No305 NTTCGAAAAA CACACCCACT",
"No304 ATTCGAAAAA CACACCCACT",
"No306 ATTCGAAAAA CACACCCACT",
" ACTAAAANTT ATCAGTCACT",
" ACTAAAAATT ATCAACCACT",
" ACTAAAAATT ATCAATCACT",
file = "exdna.txt", sep = "\n")
ex.dna2 <- read.dna("exdna.txt")
### ... in clustal format...
cat("CLUSTAL (ape) multiple sequence alignment", "",
"No305 NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
"No304 ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
"No306 ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
" ************************** ****** ****",
file = "exdna.txt", sep = "\n")
ex.dna3 <- read.dna("exdna.txt", format = "clustal")
### ... and in FASTA format
### (each sequence is preceded by a line starting with ">")
cat(">No305",
"NTTCGAAAAACACACCCACTACTAAAANTTATCAGTCACT",
">No304",
"ATTCGAAAAACACACCCACTACTAAAAATTATCAACCACT",
">No306",
"ATTCGAAAAACACACCCACTACTAAAAATTATCAATCACT",
file = "exdna.txt", sep = "\n")
ex.dna4 <- read.dna("exdna.txt", format = "fasta")
### The first three are the same!
identical(ex.dna, ex.dna2)
identical(ex.dna, ex.dna3)
identical(ex.dna, ex.dna4)
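### (added sketch, not part of the original examples) reading sequences
### in character mode with as.character = TRUE, as described for that
### argument; the file name "exchr.txt" and the taxa names SeqA and SeqB
### are made up for illustration
library(ape)
cat(">SeqA", "ATTCGAAAAA",
    ">SeqB", "ATTCGNAAAA",
    file = "exchr.txt", sep = "\n")
ex.chr <- read.dna("exchr.txt", format = "fasta", as.character = TRUE)
is.character(ex.chr) # checks that the result is of mode character
unlink("exchr.txt") # clean-up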
unlink("exdna.txt") # clean-up
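### (added sketch, not part of the original examples) the effect of
### as.matrix on FASTA sequences of unequal lengths, following the
### description of that argument; the file name "exfas.txt" and the
### taxa names SeqA and SeqB are made up for illustration
library(ape)
cat(">SeqA", "ATTCGAAAAACACAC",
    ">SeqB", "ATTCGAAAAA",
    file = "exfas.txt", sep = "\n")
## default (as.matrix = NULL): the lengths differ, so a list is expected
ex.list <- read.dna("exfas.txt", format = "fasta")
is.list(ex.list)
## as.matrix = TRUE is documented to stop with an error in this case;
## try() keeps the example running
res <- try(read.dna("exfas.txt", format = "fasta", as.matrix = TRUE),
           silent = TRUE)
inherits(res, "try-error")
unlink("exfas.txt") # clean-up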