3 \title{Pairwise Distances from DNA Sequences}
5 dist.dna(x, model = "K80", variance = FALSE,
6 gamma = FALSE, pairwise.deletion = FALSE,
7 base.freq = NULL, as.matrix = FALSE)
10 \item{x}{a matrix or a list containing the DNA sequences.}
11 \item{model}{a character string specifying the evolutionary model to be
12 used; must be one of \code{"raw"}, \code{"JC69"}, \code{"K80"} (the
13 default), \code{"F81"}, \code{"K81"}, \code{"F84"}, \code{"BH87"},
14 \code{"T92"}, \code{"TN93"}, \code{"GG95"}, \code{"logdet"}, or
16 \item{variance}{a logical indicating whether to compute the variances
17 of the distances; defaults to \code{FALSE} so the variances are not
19 \item{gamma}{a value for the gamma parameter which is possibly used to
20 apply a gamma correction to the distances (by default \code{gamma =
21 FALSE} so no correction is applied).}
22 \item{pairwise.deletion}{a logical indicating whether to delete the
23 sites with missing data in a pairwise way. The default is to delete,
24 for all sequences, any site with at least one missing value.}
25 \item{base.freq}{the base frequencies to be used in the computations
26 (if applicable, i.e. if \code{model = "F84"}). By default, the
27 base frequencies are computed from the whole sample of sequences.}
28 \item{as.matrix}{a logical indicating whether to return the results as
29 a matrix. The default is to return an object of class
33 This function computes a matrix of pairwise distances from DNA
34 sequences using a model of DNA evolution. Eleven substitution models
35 (and the raw distance) are currently available.
38 The molecular evolutionary models available through the option
39 \code{model} have been extensively described in the literature. A
40 brief description is given below; more details can be found in the
43 \item{``raw''}{This is simply the proportion of sites that differ
44 between each pair of sequences. This may be useful to draw
45 ``saturation plots''. The options \code{variance} and \code{gamma}
46 have no effect, but \code{pairwise.deletion} can.}
48 \item{``JC69''}{This model was developed by Jukes and Cantor (1969). It
49 assumes that all substitutions (i.e. the replacement of one base by
50 another) have the same probability. This probability is the same for all
51 sites along the DNA sequence. This last assumption can be relaxed by
52 assuming that the substitution rate varies among sites following a
53 gamma distribution whose parameter must be given by the user. By
54 default, no gamma correction is applied. Another assumption is that
55 the base frequencies are balanced and thus equal to 0.25.}
57 \item{``K80''}{The distance derived by Kimura (1980), sometimes referred
58 to as ``Kimura's 2-parameters distance'', has the same underlying
59 assumptions as the Jukes--Cantor distance except that two kinds of
60 substitutions are considered: transitions (A <-> G, C <-> T), and
61 transversions (A <-> C, A <-> T, C <-> G, G <-> T). They are assumed
62 to have different probabilities. A transition is the substitution of
63 a purine (A, G) by another one, or the substitution of a pyrimidine
64 (C, T) by another one. A transversion is the substitution of a
65 purine by a pyrimidine, or vice-versa. Both transition and
66 transversion rates are the same for all sites along the DNA
67 sequence. Jin and Nei (1990) modified the Kimura model to allow for
68 variation among sites following a gamma distribution. As for the
69 Jukes--Cantor model, the gamma parameter must be given by the
70 user. By default, no gamma correction is applied.}
72 \item{``F81''}{Felsenstein (1981) generalized the Jukes--Cantor model
73 by relaxing the assumption of equal base frequencies. The formulae
74 used in this function were taken from McGuire et al. (1999).}
76 \item{``K81''}{Kimura (1981) generalized his model (Kimura 1980) by
77 assuming different rates for two kinds of transversions: A <-> C and
78 G <-> T on one side, and A <-> T and C <-> G on the other. This is
79 what Kimura called his ``three substitution types model'' (3ST), and
80 is sometimes referred to as ``Kimura's 3-parameters distance''.}
82 \item{``F84''}{This model generalizes K80 by relaxing the assumption
83 of equal base frequencies. It was first introduced by Felsenstein in
84 1984 in Phylip, and is fully described by Felsenstein and Churchill
85 (1996). The formulae used in this function were taken from McGuire
88 \item{``BH87''}{Barry and Hartigan (1987) developed a distance based
89 on the observed proportions of changes among the four bases. This
90 distance is not symmetric.}
92 \item{``T92''}{Tamura (1992) generalized the Kimura model by relaxing
93 the assumption of equal base frequencies. This is done by taking
94 into account the bias in G+C content in the sequences. The
95 substitution rates are assumed to be the same for all sites along
98 \item{``TN93''}{Tamura and Nei (1993) developed a model which assumes
99 distinct rates for both kinds of transition (A <-> G versus C <->
100 T), and transversions. The base frequencies are not assumed to be
101 equal and are estimated from the data. A gamma correction of the
102 inter-site variation in substitution rates is possible.}
104 \item{``GG95''}{Galtier and Gouy (1995) introduced a model where the
105 G+C content may change through time. Different rates are assumed for
106 transitions and transversions.}
108 \item{``logdet''}{The Log-Det distance, developed by Lockhart et
109 al. (1994), is related to BH87. However, this distance is symmetric.}
111 \item{``paralin''}{Lake (1994) developed the paralinear distance which
112 can be viewed as another variant of the Barry--Hartigan distance.}
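
For orientation, the two simplest corrections described above can be
written in their usual textbook form. The notation is introduced here
and is not taken from the package sources: \eqn{p} is the observed
proportion of differing sites, \eqn{P} and \eqn{Q} are the observed
proportions of transitions and transversions (so \eqn{p = P + Q}), and
\eqn{a} is the gamma shape parameter. The code may arrange the
computation differently.

Jukes--Cantor distance:
\deqn{d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)}{d = -(3/4) ln(1 - 4p/3)}

Jukes--Cantor distance with a gamma correction:
\deqn{d = \frac{3}{4}a\left[\left(1 - \frac{4}{3}p\right)^{-1/a} - 1\right]}{d = (3/4) a [(1 - 4p/3)^(-1/a) - 1]}

Kimura's 2-parameters distance:
\deqn{d = -\frac{1}{2}\ln\left[(1 - 2P - Q)\sqrt{1 - 2Q}\right]}{d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)]}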
115 an object of class \link[stats]{dist} (by default), or a numeric
116 matrix if \code{as.matrix = TRUE}. If \code{model = "BH87"}, a numeric
117 matrix is returned because the Barry--Hartigan distance is not
120 If \code{variance = TRUE} an attribute called \code{"variance"} is
121 given to the returned object.
124 Barry, D. and Hartigan, J. A. (1987) Asynchronous distance between
125 homologous DNA sequences. \emph{Biometrics}, \bold{43}, 261--276.
127 Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a
128 maximum likelihood approach. \emph{Journal of Molecular Evolution},
131 Felsenstein, J. and Churchill, G. A. (1996) A Hidden Markov model
132 approach to variation among sites in rate of evolution.
133 \emph{Molecular Biology and Evolution}, \bold{13}, 93--104.
135 Galtier, N. and Gouy, M. (1995) Inferring phylogenies from DNA
136 sequences of unequal base compositions. \emph{Proceedings of the
137 National Academy of Sciences USA}, \bold{92}, 11317--11321.
139 Jukes, T. H. and Cantor, C. R. (1969) Evolution of protein
140 molecules. In \emph{Mammalian Protein Metabolism}, ed. Munro, H. N.,
141 pp. 21--132, New York: Academic Press.
143 Kimura, M. (1980) A simple method for estimating evolutionary rates of
144 base substitutions through comparative studies of nucleotide
145 sequences. \emph{Journal of Molecular Evolution}, \bold{16}, 111--120.
147 Kimura, M. (1981) Estimation of evolutionary distances between
148 homologous nucleotide sequences. \emph{Proceedings of the National
149 Academy of Sciences USA}, \bold{78}, 454--458.
151 Jin, L. and Nei, M. (1990) Limitations of the evolutionary parsimony
152 method of phylogenetic analysis. \emph{Molecular Biology and
153 Evolution}, \bold{7}, 82--102.
155 Lake, J. A. (1994) Reconstructing evolutionary trees from DNA and
156 protein sequences: paralinear distances. \emph{Proceedings of the
157 National Academy of Sciences USA}, \bold{91}, 1455--1459.
159 Lockhart, P. J., Steel, M. A., Hendy, M. D. and Penny, D. (1994)
160 Recovering evolutionary trees under a more realistic model of sequence
161 evolution. \emph{Molecular Biology and Evolution}, \bold{11},
164 McGuire, G., Prentice, M. J. and Wright, F. (1999) Improved error
165 bounds for genetic distances from DNA sequences. \emph{Biometrics},
166 \bold{55}, 1064--1070.
168 Tamura, K. (1992) Estimation of the number of nucleotide substitutions
169 when there are strong transition-transversion and G + C-content
170 biases. \emph{Molecular Biology and Evolution}, \bold{9}, 678--687.
172 Tamura, K. and Nei, M. (1993) Estimation of the number of nucleotide
173 substitutions in the control region of mitochondrial DNA in humans and
174 chimpanzees. \emph{Molecular Biology and Evolution}, \bold{10}, 512--526.
176 \author{Emmanuel Paradis \email{Emmanuel.Paradis@mpl.ird.fr}}
178 \code{\link{read.GenBank}}, \code{\link{read.dna}},
179 \code{\link{write.dna}}, \code{\link{DNAbin}},
180 \code{\link{dist.gene}}, \code{\link{cophenetic.phylo}},
181 \code{\link[stats]{dist}}
184 \keyword{multivariate}
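\examples{
## A minimal sketch, not the package's internal code: the "raw", "JC69"
## and "K80" distances worked out by hand for one pair of sequences,
## using the usual textbook formulae. The two sequences and every object
## name below are made up for illustration.
s1 <- strsplit("ACGTACGTAC", "")[[1]]
s2 <- strsplit("ACGTGCGTCC", "")[[1]]

## Purines are A and G; pyrimidines are C and T.
is.purine <- function(b) b == "A" | b == "G"

differ <- s1 != s2
## A transition keeps the base within its class (purine or pyrimidine).
transition <- differ & (is.purine(s1) == is.purine(s2))

p <- mean(differ)                # "raw" distance: proportion of differing sites
P <- mean(transition)            # proportion of transitions
Q <- mean(differ & !transition)  # proportion of transversions

d.jc69 <- -0.75 * log(1 - 4 * p / 3)
d.k80 <- -0.5 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))
c(raw = p, JC69 = d.jc69, K80 = d.k80)

## The same models requested from dist.dna(), assuming that the ape
## package and its 'woodmouse' example data set are installed.
if (requireNamespace("ape", quietly = TRUE)) {
  data(woodmouse, package = "ape")
  d <- ape::dist.dna(woodmouse, model = "K80", as.matrix = TRUE)
  dim(d)
}
}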