inst/doc/MoranI.Rnw

   1 \documentclass[a4paper]{article}
   2 %\VignetteIndexEntry{Moran's I}
   3 %\VignettePackage{ape}
   4 \usepackage{fancyvrb}
   5 \usepackage{color}
   6
   7 \newcommand{\code}{\texttt}
   8 \newcommand{\pkg}{\textsf}
   9 \newcommand{\ape}{\pkg{ape}}
  10 \newcommand{\ade}{\pkg{ade4}}
  11 \newcommand{\spatial}{\pkg{spatial}}
  12 \renewcommand{\sp}{\pkg{sp}}
  13
  14 \author{Emmanuel Paradis}
  15 \title{Moran's Autocorrelation Coefficient in Comparative Methods}
  16
  17 \begin{document}
  18
  19 \maketitle
  20
  21 <<echo=false,quiet=true>>=
  22 options(width=60)
  23 @
  24
  25 This document clarifies the use of Moran's autocorrelation coefficient
  26 to quantify whether the distribution of a trait among a set of species
  27 is affected or not by their phylogenetic relationships.
  28
  29 \section{Theoretical Background}
  30
  31 Moran's autocorrelation coefficient (often denoted as $I$) is an
  32 extension of Pearson product-moment correlation coefficient to a univariate
  33 series \cite{Cliff1973, Moran1950}. Recall that Pearson's correlation
  34 (denoted as $\rho$) between two variables $x$ and $y$ both of length $n$ is:
  35
  36 \begin{displaymath}
  37 \rho = \frac{\displaystyle\sum_{i=1}^n(x_i - \bar{x})(y_i -
  38   \bar{y})}{\displaystyle\left[{\sum_{i=1}^n(x_i - \bar{x})^2\sum_{i=1}^n(y_i - \bar{y})^2}\right]^{1/2}},
  39 \end{displaymath}
  40 where $\bar{x}$ and $\bar{y}$ are the sample means of both
  41 variables. $\rho$ measures whether, on average, $x_i$ and $y_i$ are
  42 associated. For a single variable, say $x$, $I$ will
  43 measure whether $x_i$ and $x_j$, with $i\ne j$, are associated. Note
  44 that with $\rho$, $x_i$ and $x_j$ are {\em not} associated since the
  45 pairs $(x_i,y_i)$ are assumed to be independent of each other.
  46
  47 In the study of spatial patterns and processes, we may logically
  48 expect that close observations are more likely to be similar than
  49 those far apart. It is usual to associate a {\em weight} to each
  50 pair $(x_i,x_j)$ which quantifies this \cite{Cliff1981}. In its simplest
  51 form, these weights will take values 1 for close neighbours, and 0
  52 otherwise. We also set $w_{ii}=0$. These weights are sometimes
  53 referred to as a {\em neighbouring function}.
  54
  55 $I$'s formula is:
  56
  57 \begin{equation}
  58 I = \frac{n}{S_0}
  59 \frac{\displaystyle\sum_{i=1}^n \sum_{j=1}^n w_{ij}(x_i - \bar{x})(x_j -
  60   \bar{x})}{\displaystyle\sum_{i=1}^n (x_i - \bar{x})^2},\label{eq:morani}
  61 \end{equation}
  62 where $w_{ij}$ is the weight between observation $i$ and $j$, and
  63 $S_0$ is the sum of all $w_{ij}$'s:
  64
  65 \begin{displaymath}
  66 S_0 = \sum_{i=1}^n \sum_{j=1}^n w_{ij}.
  67 \end{displaymath}
  68
  69 Quite not so intuitively, the expected value of $I$ under the null
  70 hypothesis of no autocorrelation is not equal to zero but given by
  71 $I_0 = -1/(n-1)$. The expected variance of  $I_0$ is also known, and
  72 so we can make a test of the null hypothesis. If the observed value
  73 of $I$ (denoted $\hat{I}$) is significantly greater than $I_0$, then
  74 values of $x$ are positively autocorrelated, whereas if $\hat{I}<I_0$,
  75 this will indicate negative autocorrelation. This allows us to design
  76 one- or two-tailed tests in the standard way.
  77
  78 Gittleman \& Kot \cite{Gittleman1990} proposed to use Moran's $I$ to
  79 test for ``phylogenetic effects''. They considered two ways to
  80 calculate the weights $w$:
  81
  82 \begin{itemize}
  83 \item With phylogenetic distances among species, e.g., $w_{ij} =
  84   1/d_{ij}$, where $1/d_{ij}$ are distances measured on a tree.
  85 \item With taxonomic levels where $w_{ij} = 1$ if species $i$ and $j$
  86   belong to the same group, 0 otherwise.
  87 \end{itemize}
  88
  89 Note that in the first situation, there are quite a lot of
  90 possibilities to set the weights. For instance, Gittleman \& Kot also proposed:
  91
  92 \[\begin{array}{ll}
  93 w_{ij} = 1/d_{ij}^\alpha & \mathrm{if}\ d_{ij} \le c\\
  94 w_{ij} = 0 & \mathrm{if}\ d_{ij} > c,\\
  95 \end{array}\]
  96 where $c$ is a cut-off phylogenetic distance above which the species
  97 are considered to have evolved completely independently, and $\alpha$
  98 is a coefficient (see \cite{Gittleman1990} for details).
  99 By analogy to the use of a spatial correlogram where coefficients are
 100 calculated assuming different sizes of the ``neighbourhood'' and then
 101 plotted to visualize the spatial extent of autocorrelation, they
 102 proposed to calculate $I$ at different taxonomic levels.
 103
 104 \section{Implementation in \ape}
 105
 106 From version 1.2-6, \ape\ has functions \code{Moran.I} and
 107 \code{correlogram.formula} implementing the approach developed by Gittleman \&
 108 Kot. There was an error in the help pages of \code{?Moran.I}
 109 (corrected in ver.\ 2.1) where the weights were referred to as
 110 ``distance weights''. This has been wrongly interpreted in my book
 111 \cite[pp.~139--142]{Paradis2006}. The analyses below aim to correct
 112 this.
 113
 114 \subsection{Phylogenetic Distances}
 115
 116 The data, taken from \cite{Cheverud1985}, are the log-transformed
 117 body mass and longevity of five species of primates:
 118
 119 <<>>=
 120 body <- c(4.09434, 3.61092, 2.37024, 2.02815, -1.46968)
 121 longevity <- c(4.74493, 3.3322, 3.3673, 2.89037, 2.30259)
 122 names(body) <- names(longevity) <- c("Homo", "Pongo", "Macaca", "Ateles", "Galago")
 123 @
 124
 125 The tree has branch lengths scaled so that the root age is one. We
 126 read the tree with \ape, and plot it:
 127
 128 <<fig=true>>=
 129 library(ape)
 130 trnwk <- "((((Homo:0.21,Pongo:0.21):0.28,Macaca:0.49):0.13,Ateles:0.62)"
 131 trnwk[2] <- ":0.38,Galago:1.00);"
 132 tr <- read.tree(text = trnwk)
 133 plot(tr)
 134 axisPhylo()
 135 @
 136
 137 We choose the weights as $w_{ij}=1/d_{ij}$, where the $d$'s is the
 138 distances measured on the tree:
 139
 140 <<>>=
 141 w <- 1/cophenetic(tr)
 142 w
 143 @
 144 Of course, we must set the diagonal to zero:
 145
 146 <<>>=
 147 diag(w) <- 0
 148 @
 149 We can now perform the analysis with Moran's $I$:
 150
 151 <<>>=
 152 Moran.I(body, w)
 153 @
 154
 155 Not surprisingly, the results are opposite to those in
 156 \cite{Paradis2006} since, there, the distances (given by
 157 \code{cophenetic(tr)}) were used as weights. (Note that the argument
 158 \code{dist} has been since renamed \code{weight}.\footnote{The older
 159   code was actually correct; nevertheless, it has been rewritten, and
 160   is now much faster. The documentation has been clarified.
 161   The function \code{correlogram.phylo}, which computed
 162   Moran's $I$ for a tree given as argument using the distances among
 163   taxa, has been removed.}) We can now conclude for a slighly
 164 significant positive phylogenetic correlation among body mass values
 165 for these five species.
 166
 167 The new version of \code{Moran.I} gains the option \code{alternative}
 168 which specifies the alternative hypothesis (\code{"two-sided"} by
 169 default, i.e., H$_1$: $I \ne I_0$). As expected from the above result, we divide the $P$-value
 170 be two if we define H$_1$ as $I > I_0$:
 171
 172 <<>>=
 173 Moran.I(body, w, alt = "greater")
 174 @
 175
 176 The same analysis with \code{longevity} gives:
 177
 178 <<>>=
 179 Moran.I(longevity, w)
 180 @
 181
 182 As for \code{body}, the results are nearly mirrored compared to
 183 \cite{Paradis2006} where a non-significant negative phylogenetic
 184 correlation was found: it is now positive but still largely not
 185 significant.
 186
 187 \subsection{Taxonomic Levels}
 188
 189 The function \code{correlogram.formula} provides an interface to
 190 calculate Moran's $I$ for one or several variables giving a series of
 191 taxonomic levels. An example of its use was provided in
 192 \cite[pp.~141--142]{Paradis2006}. The code of this function has been
 193 simplified, and the graphical presentation of the results have been improved.
 194
 195 \code{correlogram.formula}'s main argument is a formula which is ``sliced'',
 196 and \code{Moran.I} is called for each of these elements. Two things
 197 have been changed for the end-user at this level:
 198
 199 \begin{enumerate}
 200 \item In the old version, the rhs of the formula was given in the
 201   order of the taxonomic hierarchy: e.g.,
 202   \code{Order/SuperFamily/Family/Genus}. Not respecting this order
 203   resulted in an error. In the new version, any order is accepted, but
 204   the order given it is then respected when plotted the correlogram.
 205 \item Variable transformations (e.g., log) were allowed on the lhs of
 206   the formula. Because of the simplification of the code, this is no
 207   more possible. So it is the responsibility of the user to apply any
 208   tranformation before the analysis.
 209 \end{enumerate}
 210
 211 Following Gittleman \& Kot \cite{Gittleman1990}, the autocorrelation at a higher level
 212 (e.g., family) is calculated among species belonging to the same
 213 category and to different categories at the level below (genus).
 214 To formalize this, let us write the different levels as
 215 $X^1/X^2/X^3/\dots/X^n$ with $X^n$ being the lowest one (\code{Genus} in the
 216 above formula):
 217
 218 \begin{displaymath}
 219 \begin{array}{l}
 220 \left.\begin{array}{ll}
 221 w_{ij}=1 & \mathrm{if}\ X_i^k = X_j^k\ \mathrm{and}\ X_i^{k+1} \ne X_j^{k+1}\\
 222 w_{ij}=0 & \mathrm{otherwise}\\
 223 \end{array} \right\} k < n
 224 \\\\
 225 \left.\begin{array}{ll}
 226 w_{ij}=1 & \mathrm{if}\ X_i^k = X_j^k\\
 227 w_{ij}=0 & \mathrm{otherwise}\\
 228 \end{array} \right\} k = n
 229 \end{array}
 230 \end{displaymath}
 231 This is thus different from the idea of a ``neighbourhood'' of
 232 different sizes, but rather similar to the idea of partial correlation
 233 where the influence of the lowest level is removed when considering
 234 the highest ones \cite{Gittleman1990}.
 235
 236 To repeat the analyses on the \code{carnivora} data set, we first
 237 log$_{10}$-transform the variables mean body mass (\code{SW}) and the
 238 mean female body mass (\code{FW}):
 239
 240 <<>>=
 241 data(carnivora)
 242 carnivora$log10SW <- log10(carnivora$SW)
 243 carnivora$log10FW <- log10(carnivora$FW)
 244 @
 245 We first consider a single variable analysis (as in \cite{Paradis2006}):
 246
 247 <<fig=true>>=
 248 fm1.carn <- log10SW ~ Order/SuperFamily/Family/Genus
 249 co1 <- correlogram.formula(fm1.carn, data = carnivora)
 250 plot(co1)
 251 @
 252
 253 A legend now appears by default, but can be removed with \code{legend
 254 = FALSE}. Most of the appearance of the graph can be customized via
 255 the option of the plot method (see \code{?plot.correlogram} for
 256 details). This is the same analysis than the one displayed on Fig.~6.3
 257 of \cite{Paradis2006}.
 258
 259 When a single variable is given in the lhs in
 260 \code{correlogram.formula}, an object of class \code{"correlogram"} is
 261 returned as above. If several variables are analysed simultaneously,
 262 the object returned is of class \code{"correlogramList"}, and the
 263 correlograms can be plotted together with the appropriate plot method:
 264
 265 <<fig=true>>=
 266 fm2.carn <- log10SW + log10FW ~ Order/SuperFamily/Family/Genus
 267 co2 <- correlogram.formula(fm2.carn, data = carnivora)
 268 print(plot(co2))
 269 @
 270
 271 By default, lattice is used to plot the correlograms on separate
 272 panels; using \code{lattice = FALSE} (actually the second argument,
 273 see \code{?plot.correlogramList}) makes a standard graph superimposing
 274 the different correlograms:
 275
 276 <<fig=true>>=
 277 plot(co2, FALSE)
 278 @
 279
 280 The options are roughly the same than above, but do not have always
 281 the same effect since lattice and base graphics do not have the same
 282 graphical parameters. For instance, \code{legend = FALSE} has no
 283 effect if \code{lattice = TRUE}.
 284
 285 \section{Implementation in \ade}
 286
 287 The analysis done with \ade\ in \cite{Paradis2006} suffers from the
 288 same error than the one done with \code{Moran.I} since it was also
 289 done with a distance matrix. So I correct this below:
 290
 291 \begin{Schunk}
 292 \begin{Sinput}
 293 > library(ade4)
 294 > gearymoran(w, data.frame(body, longevity))
 295 \end{Sinput}
 296 \begin{Soutput}
 297 class: krandtest
 298 Monte-Carlo tests
 299 Call: as.krandtest(sim = matrix(res$result, ncol = nvar, byr = TRUE),
 300     obs = res$obs, alter = alter, names = test.names)
 301
 302 Test number:   2
 303 Permutation number:   999
 304 Alternative hypothesis: greater
 305
 306        Test         Obs   Std.Obs Pvalue
 307 1      body -0.06256789 2.1523342  0.001
 308 2 longevity -0.22990437 0.3461414  0.414
 309
 310 other elements: NULL
 311 \end{Soutput}
 312 \end{Schunk}
 313
 314 The results are wholly consistent with those from \ape, but the
 315 estimated coefficients are substantially different. This is because
 316 the computational methods are not the same in both packages. In \ade,
 317 the weight matrix is first transformed as a relative frequency matrix with
 318 $\tilde{w}_{ij} = w_{ij}/S_0$. The weights are further transformed with:
 319
 320 \begin{displaymath}
 321 p_{ij} = \tilde{w}_{ij} - \sum_{i=1}^n\tilde{w}_{ij}\sum_{j=1}^n\tilde{w}_{ij},
 322 \end{displaymath}
 323 with $p_{ij}$ being the elements of the matrix denoted as $P$. Moran's
 324 $I$ is finally computed with $x^\mathrm{T}Px$. In \ape, the weights
 325 are first row-normalized:
 326
 327 \begin{displaymath}
 328 w_{ij} \Big/ \sum_{i=1}^n w_{ij},
 329 \end{displaymath}
 330 then eq.~\ref{eq:morani} is applied.
 331
 332 Another difference between both packages, though less
 333 important, is that in \ade\ the weight matrix is forced to be
 334 symmetric with $(W+W^\mathrm{T})/2$. In \ape, this matrix is assumed
 335 to be symmetric, which is likely to be the case like in the examples above.
 336
 337 \section{Other Implementations}
 338
 339 Package \sp\ has several functions, including
 340 \code{moran.test}, that are more specifically targeted to the analysis
 341 of spatial data. Package \spatial\ has the function \code{correlogram}
 342 that computes and plots spatial correlograms.
 343
 344 \section*{Acknowledgements}
 345
 346 I am thankful to Thibaut Jombart for clarifications on Moran's $I$.
 347
 348 \bibliographystyle{plain}
 349 \bibliography{ape}
 350
 351 \end{document}