Page 7 - Lovison_alii_2010
G. LOVISON ET AL.
the distribution of the response variable Y, conditional on the covariates, need not be Normal, but can be any distribution belonging to the
Natural Exponential family (denoted by $\mathrm{N.E.F.}(\theta_i; \phi)$); i.e., the probability density function of the response must be of the form:
\[
f(y_i; \theta_i; \phi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i; \phi) \right\} \quad \forall i \qquad (1)
\]
where the natural parameter $\theta_i$ is a known function of the expected value $\mu_i = E(Y_i \mid x_i)$, i.e., $\theta_i = m(\mu_i)$, while the dispersion parameter $\phi$
plays a role similar to that of the variance $\sigma^2$ in the Normal distribution. It is important to recall that many distributions used in modeling
biological phenomena, like the Normal, Gamma, Binomial, and Poisson belong to this family;
the scale on which the explanatory variables act linearly need not be the original one of the expected values $\mu_i$, but can now be any
monotonic transformation; in other words, the link function $g(\cdot)$, connecting the linear predictor and the expected value of the response,
can be any invertible function, not just the identity. This can be rephrased in terms of the response function $g^{-1}(\cdot) = h(\cdot)$, by saying that the
expected values $\mu_i = E(Y_i \mid x_i)$ are modeled by a nonlinear (but invertible) function of the linear predictor.
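As an illustration of form (1), assuming a Poisson response with mean $\mu_i$ (a standard textbook case, chosen here only as an example), the probability function can be rewritten so that each component of the family can be read off directly:

```latex
\[
f(y_i; \theta_i)
  = \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}
  = \exp\left\{ y_i \log \mu_i - \mu_i - \log y_i! \right\},
\]
\[
\theta_i = \log \mu_i, \qquad
b(\theta_i) = e^{\theta_i} = \mu_i, \qquad
a_i(\phi) = 1, \qquad
c(y_i; \phi) = -\log y_i! .
\]
```

In this case the natural parameter is the logarithm of the mean, so the log is the link for which $\theta_i$ coincides with the linear predictor, and the corresponding response function is the exponential.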
Summarizing, and following the same scheme used above, any GLM is characterized by:
error distribution: $Y_i \mid x_i \sim \mathrm{N.E.F.}(\theta_i; \phi)$ with $\theta_i = m(\mu_i)$
linear predictor: $\eta_i = x_i^T \beta$
link function: $g(\mu_i) = \eta_i$ with $g(\cdot)$ any invertible function
(or, response function: $\mu_i = g^{-1}(\eta_i)$).
Finally, as in classical linear models, standard GLMs assume independence among observations:
\[
Y_i \perp\!\!\!\perp Y_j \quad \forall i \neq j
\]
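The three components summarized above can be made concrete with a minimal sketch. It assumes a Poisson error distribution, a log link, and a single covariate plus intercept, and it fits the model by iteratively reweighted least squares (IRLS), the usual fitting algorithm for GLMs. The function name `fit_poisson_glm` and the counts below are hypothetical and are not taken from the paper or from Sicily PosiData-1.

```python
import math


def fit_poisson_glm(x, y, n_iter=25, tol=1e-10):
    """Fit log(mu_i) = b0 + b1 * x_i for a Poisson response by IRLS.

    Assumptions (illustrative only): one covariate, log link,
    independent observations.
    """
    # Start from the intercept-only fit: b0 = log(mean(y)), b1 = 0.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(n_iter):
        # Linear predictor eta and response function mu = exp(eta).
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working response and weights for the log link with Poisson variance:
        # z = eta + (y - mu) / mu, w = mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Weighted least squares of z on (1, x): 2x2 normal equations.
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        s_wz = sum(wi * zi for wi, zi in zip(w, z))
        s_wxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1


# Hypothetical counts observed at six covariate values.
x_obs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [1, 3, 2, 6, 8, 14]
b0_hat, b1_hat = fit_poisson_glm(x_obs, y_obs)
# Fitted means on the response scale, obtained by inverting the link.
mu_hat = [math.exp(b0_hat + b1_hat * xi) for xi in x_obs]
print(b0_hat, b1_hat, mu_hat)
```

The coefficients are estimated on the link scale, while `mu_hat` is expressed on the original scale of the response simply by applying the response function.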
4.2. GLM versus Linear Models for transformed data
As recalled in Section 2, an approach that has been extensively used in ecological applications to fit a Gaussian linear model to data with
nonlinear relationships, and/or unequal variances and/or non-Normal distribution, consists in transforming the data so that the new scores
satisfy at least approximately the assumptions of the Gaussian linear model (Digby and Kempton, 1987). This “data transformation”
approach has a long history in Statistics, its formal introduction dating back to Box and Cox (1964) and Grizzle et al. (1969).
In this approach, the goal is to find a function $t(\cdot)$ such that, at least approximately, the standard methods of inference for Gaussian linear
models can be applied to $t(Y_i)$ rather than to $Y_i$ directly.
Although this approach is conceptually and computationally simple and appealing, it is not without drawbacks. The two
main problems are:
(1) in general, it is hard to find a unique transformation that satisfies all assumptions simultaneously;
(2) even when a unique transformation is able to account for all the departures from the Gaussian linear model, the use of transformations can
still be problematic, due to the increased difficulty of interpretation of the results. If the original scale of the data is ecologically
meaningful, then it may be required to express the results back on this original scale, but this is not, in general, an easy task.
This raises the question of the relative merits of the “data transformation” approach versus the GLM approach.
The main advantage of the transformation approach is its conceptual and computational simplicity. Basically, it builds upon the huge and
widespread body of statistical knowledge on Gaussian linear models, trying to “force” all types of data to fit, at least approximately, into the
assumptions of such well known models. However, this is also its main weakness: the attempt to find a unique transformation which satisfies
all such assumptions usually fails because they are often separately violated by real data. In this respect, a Generalized Linear Model is
superior because it is less rigid, since it addresses the violations separately: e.g., a nonlinear regression function can be combined with a
Normal distribution for the conditional distribution of the response, or a linear regression with a non-Normal, heteroscedastic distribution,
etc. Moreover, the interpretation of the results on the original scale is more natural within a GLM than within the transformation approach,
since the invertibility of the link function provides a straightforward way to transfer the results from the link scale to the response scale. This
is particularly appealing in the ecological context, since model parameters, predictions, etc. usually have a definite biological meaning only
on the original scale.
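The difficulty of returning to the original scale can be seen in a small numerical sketch. It assumes a log transformation and an intercept-only model, and it uses made-up values rather than the paper's data. Averaging on the log scale and back-transforming gives the geometric mean, whereas a GLM with log link applies the link to the expected value itself, so inverting the link recovers the mean of the response.

```python
import math
import statistics

# Hypothetical right-skewed responses (not taken from Sicily PosiData-1).
y = [1.0, 2.0, 4.0, 8.0, 16.0]

# Transformation approach: model E[log Y], then back-transform.
mean_log = statistics.fmean(math.log(v) for v in y)
back_transformed = math.exp(mean_log)  # geometric mean of y

# GLM approach with log link: model log E[Y].
# For an intercept-only model the fitted mean is the sample mean,
# so the fitted linear predictor is the log of that mean.
eta_hat = math.log(statistics.fmean(y))
glm_mean = math.exp(eta_hat)  # arithmetic mean of y

print(back_transformed, glm_mean)
```

The back-transformed value is smaller than the mean on the original scale, which is why results from the transformation approach cannot simply be inverted to obtain expected values of the response.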
4.3. Cross-sectional analysis of Sicily PosiData-1
The methodological issues discussed in the previous Sections are here illustrated with real examples taken from the Sicily PosiData-1 dataset.
In order to keep the sample sizes reasonably large, the analysis has been carried out on the lepidochronological years from 1991 to 1998, for
which a larger number of shoots is available. This gives a total of 400 year/meadow combinations, i.e., 8 years for three stations at different
depths at each of 16 sites, plus 8 years for two stations at a site where only two depths were available ($8 \times (16 \times 3 + 2) = 400$).
4.3.1. Exploring violations: the departure from Normality
In order to check the Normality assumption in our dataset, we performed both informal graphical checks and formal tests systematically on
all of the year/meadow combinations available in Sicily PosiData-1. We chose to work on Rhizome elongation as response variable and
wileyonlinelibrary.com/journal/environmetrics Copyright © 2010 John Wiley & Sons, Ltd. Environmetrics 2011; 22: 370–382