Article | Open access | Peer reviewed

Repeated measures: There's added value in modelling over time

2019; Wiley; Volume: 175; Issue: 2; Language: English

10.1111/aab.12534

ISSN

1744-7348

Authors

Stephen J. Powers, Marcin Kozak

Topic(s)

Turfgrass Adaptation and Management

Abstract

Repeated measures of the height of willow plants can be made and analysed. How does a grass plot develop its yield over time, and which stage is crucial for this process? Is irrigation of potatoes equally important throughout the growing season? How does the rate of incidence of a pathogen vary? At what rate do different cultivars of winter cereals develop before and after winter? How does the growth rate of lettuce-fed slugs alter when they are given a less palatable food? Applied biologists often study the effect of time on traits of interest.

In experimentation, such information can be obtained in two ways. First, you can design an experiment in which you analyse replicate experimental units for a given treatment at one time point, different experimental units for that treatment at the next time point, and so on. This is termed a destructive experiment because, whether the experimental units are physically destroyed or not, each one is visited only once. Although easy to design, such an approach has drawbacks. If many time points are required, the experiment must be large, with many experimental units. Moreover, it disregards crucial – from a biological point of view – information about how each experimental unit would have progressed over time. For instance, you could measure the number of cereal plants before winter on one plot and the number of plants in spring on another plot. In doing so, you would miss important information on how the number of plants changes on particular plots.

Thus, a second approach to studying the effect of time is to design an experiment in which you measure traits of interest on the same experimental units. That way, you make repeated measures. So, repeated measures data occur when a trait of interest is measured on each experimental unit at least twice over time. Statistically, destructive experiments are more powerful (they have more residual degrees of freedom for making comparisons of treatments) than their repeated measures equivalents, so, from a statistical point of view, we would prefer the first approach. However, if you want to study how the effect of different treatments varies over time, the advantages of being able to follow each experimental unit throughout the course of a longitudinal experiment often outweigh the disadvantages. The advent of automatic data recorders set up to monitor experimental units continuously has led to an explosion of repeated measures data sets.

In the optimal scenario, the measurements to be taken do not require experimental units to be destroyed. In such a scenario, the consecutive measurements are non-destructive, and so they can be taken not only on the very same experimental units but also on the very same specimens. Consider chlorophyll measurements using a soil–plant analysis development (SPAD) meter: you can use it without destroying a plant. Or consider counting the number of branches on a plant in a pot, or the number of plants in a field plot, or determining average plant height in the plot: all this can be done without destroying the experimental unit (pot or plot). Such measurement is perfect when you want to study the effect of time on a response – you can record it at a number of time points and analyse the time effect, accounting for the specificity of individual specimens.

An example of repeated discrete data measured over time would be comparing a number of wheat cultivars in terms of their speed of germination.
In such an experiment, for each cultivar, 75 wheat seeds would be sown in three Petri dishes (25 seeds per dish) and checked at daily intervals to count how many of them have germinated. The experimental unit is the Petri dish, and there would be at least three Petri dishes, as biological replicates, per cultivar. These measurements are non-destructive. An example of repeated continuous data measured over time would be monitoring biomass yields from ryegrass plots at weekly intervals after a first cut of silage, with different quantities of nitrogen (N) applied to the plots as the treatments. The experiment would aim to discover an economical N level giving the most rapid growth response towards the second cut. Each week, biomass yield is estimated using the normalised difference vegetation index (NDVI) from satellite imagery and converted to t ha−1. The experimental unit is the plot, and there would be at least three plots per N level. These measurements also are non-destructive.

As statistical editors for Annals of Applied Biology (AAB), we have encountered situations in which the statistical analysis of repeated measures experiments could have extracted more biological information through the application of appropriate statistical techniques. As such data are neither easy nor straightforward to analyse, we present this editorial to make you aware of the various methods available, in order to help you choose the best one for your particular experiment. Such knowledge is important not only for analysing repeated measures data, but also for designing future experiments. Do not believe anyone claiming that reading a short paper will be sufficient for you to learn a difficult statistical method. We simply wish this editorial to make a point – and help you remember – that repeated measures data call for dedicated statistical methods. Failing to use them would likely lead you not only to incorrect analysis, but also to incorrect interpretation and conclusions, a failure any biologist strives to avoid. If you do not want to experience such a failure, never ignore the specific structure of repeated measures data; if in doubt, consult a statistician!

It was Ronald A. Fisher who wrote, "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of" (Fisher, 1938). We could not agree more: we had better consider the design of a repeated measures experiment before discussing the analysis of data arising from it.

You should never underestimate the advantages of blocking to account for any features of the experimentation, sufficient replication to allow power in the statistical comparison of treatments, and randomisation to ensure no bias in the allocation of treatments to experimental units. But when an experiment is to involve repeated measures, you should consider three additional aspects: how should the measurements be taken, how many time points should there be, and when should the measurements be taken?

We have largely covered the first question in the introduction. If you can take non-destructive measurements from experimental units, do so throughout the whole experiment; only the last (harvest) measurement may need to be destructive. Some measurements, however, require at least part of an experimental unit to be destroyed.
If we want to make repeated measures, then each experimental unit must be only partially destroyed at a particular observation time point, so that the remaining part of the experimental unit is available for subsequent (repeated) measures. In one scenario – the better one – you destroy just a small part of an experimental unit, for example, several leaves from a large tree. In this situation, although you do not take into account the specificity of the leaves, you do take into account the specificity of the trees – the sampling alters the tree, but only in a biologically insignificant way. In another scenario, however, you have a number of technical replicates per experimental unit and sample from a technical replicate too severely for it to be used at the next time point. In other words, you have to destroy it: a different technical replicate is used instead. So, the number of time points you can have is restricted by the number of technical replicates in the experimental units. Note that these types of experiments lie between the experiment without repeated measures and the one with repeated measures and non-destructive samples: while we take into account the specificity of the experimental units, we disregard the specificity of the particular specimens being sampled on the subsequent occasions. By doing so, we mix two sources of information – that about the time effect and that about specimen-to-specimen variation – and we can do nothing about it. If this variation is large, we risk the time effect being lost in it.

For example, consider measuring the contents of microelements in leaves of spring barley grown in plots in a field trial. In order to measure them, you have to take samples of leaves; these are destructive samples. At subsequent time points you will do the same, but you will not take samples from the same plants – you will sample different plants within the same experimental units. So, you will miss the information about the specificity of individual plants. You will, however, account for the correlation between plants growing in the same experimental unit, and you will not be affecting the remaining plants when taking repeated measures over time, as long as the plots are sufficiently large.

However, consider an experiment in which the experimental unit is a pot of five oilseed rape plants, and one plant is selected for a destructive sample at each of five time points. Again, the same experimental unit is being revisited, and so individual items within an experimental unit are not independent. However, if the experiment is supposed to provide repeated measures, this protocol is ill-advised because the experimental unit is severely altered after each removal, and the growth of the remaining plants will change for reasons not associated with any treatment. Moreover, although the five plants are all in the same pot, they are still different plants. Hence, at the experimental scale of pots, we would prefer repeated measurements from the same plant (if this were possible, which depends on the character of the sampling procedure), whereas if we had plots of plants in a field trial then, clearly, we would have repeated measures as multiple samples from the same plot. If your samples must be destructive, you need to determine the optimal level of sampling: can you take such multiple destructive samples over time from the same specimens and be certain that you are causing only biologically insignificant changes over time? If so, do so.
If not, you are left with taking samples from different specimens from the same experimental units. Both statistically and biologically, the first scenario is optimal and the second is the worst.

Let us now move on to the other two aspects, regarding the number and choice of time points. Measurements at extra time points do not constitute extra replication: replication is based on the number of experimental units per treatment, not on the number of observations per experimental unit. This is because variation between replicates reflects the variation between experimental units, while variation between repeated measurements within experimental units represents their internal variation. With automatic continuous recording of data from experimental units, we could easily take many such measurements. That might lead to more observations than we actually need in order to identify whether there are statistical differences between treatments. Hence, be cautious: if a response does not change to any biological extent in the course of an hour or a day, then recording need not be hourly or daily, but at a longer interval. When recording is by hand, the physical time required to obtain a set of measures from all the experimental units can be an issue: you need to be able to complete one set of measurements before it is time to start the next. Moreover, the order of recording needs to be the same over the experimental units (and block by block in the case of randomised block designs) at every time point, so that the required duration between time points is achieved – and is the same – for all experimental units.

Hence, the number of time points depends on cost and physical constraints but also, of course, on the last of the three aspects mentioned above: when the measurements should be taken. Time points need to be chosen to cover the biological process unfolding and, vitally, to cover the varying rates of change in the process. This is crucial for traits that are known to have diurnal variation, such as those related to photosynthesis, including the expression of genes involved in this process. Equidistant time points may be appropriate for an expected linear rate of change, but non-equidistant time points are essential for expected nonlinear patterns, with time points being closer together during times of rapid change or around proposed maxima or minima in the response. Clearly, this is where a pilot study can be useful to investigate where time points should be placed in the actual experiment, and recording at points along a derived time scale, such as thermal time, can be more appropriate. Lovell, Powers, Welham, and Parker (2004) discuss such aspects of sampling in the context of plant pathogens. Typically, a response that is nonlinear with respect to time can become linear with respect to a derived time scale.

Finally, in some experiments, the treatments applied to certain (or all) experimental units change at proposed time points. For example, in studies of the effects of water stress, plants are typically monitored over time in a well-watered condition until they reach a certain growth stage, at which point they are subjected to the stress condition (withholding of water). The length of the stress period varies between such studies; the plants are then watered again and their recovery assessed.
You need to consider the design of such experiments carefully, for example with three treatments (i.e., well-watered, stressed and recovery) being applied to each experimental unit consecutively over time. The number and positioning of measurement time points within each treatment period is vital to the success of the experiment, in order to assess the effect of each treatment within each experimental unit and then across experimental units.

Usually, the first analysis of a repeated measures data set is to examine the data from each time point separately. Typically, such analysis will take the form of t-tests or analysis of variance (ANOVA) for continuous data (Gomez & Gomez, 1984; Quinn & Keough, 2002), or generalised linear modelling (GLM) for discrete data (McCullagh & Nelder, 1989). And, rest assured, we are not saying that this approach is wrong. On the contrary, it is usually a valuable initial check on the nature and extent of the effects of treatments, and it is certainly worthwhile when particular time points are more important than others. For example, if comparison of some form of final yield is paramount, analysis of the data at a final (say, harvest) time point is most likely required in isolation from the other time points. So it may be that only statistical differences between treatments at a certain time point are of interest – for example, at anthesis in an experiment comparing wheat cultivars, or at a certain instar in an experiment comparing different chemicals applied to larvae of a certain insect. These two analyses – of the final yield and of yield-contributing traits at a particular time – answer different questions, and both can be important.

Unfortunately, such analysis, disregarding changes over time, is often as far as the analysis goes, even though this approach can detract from the understanding of the biological process unfolding over time. What is more, it is certainly less statistically powerful than analysing all the data together, which enables you to compare treatments at a time point, or to compare time points for a given treatment, against the full underlying biological variance. In other words, statistical analysis of the data at each time point separately is not the most powerful approach to revealing at which time point the first, or the greatest, statistically significant difference between treatments occurs. And if repeated measures were taken throughout an experiment, then why not use all of them to look at that process? In particular, through statistical modelling of the data, the rate(s) of change over time in a response can be estimated, and the treatments can then be compared in terms of those rates.

Fundamentally, however, the statistical problem lies in how to deal with the lack of independence in measurements taken over time from the same experimental unit. ANOVA assumes independent observations, so you would commit a statistical crime by simply fitting a treatment-by-time factorial structure (which is still quite a common mistake). The statistical assumption of independence must be checked, and if independence cannot be assumed, then the extent of the dependence must be accounted for in the subsequent analysis (see the more comprehensive statistical approaches described below). What can be done, then? A two-stage modelling approach is the simplest applicable method for repeated measures data (Crowder & Hand, 1990): it gets around the non-independence problem neatly.
The first stage involves plotting the response data for each experimental unit, to consider the overall shape of the response over time. If the same form of the shape is applicable to all experimental units, regardless of the treatments applied to them, an appropriate model is fitted to each experimental unit. For example, if the response over time appears linear for all experimental units, then the first stage would be to fit a simple linear regression to the data for each one, assuming there are at least three measurement time points to allow such a regression. Next, the second stage consists of analysing the estimated sets of parameters from the fitted models. This is typically done via ANOVA, making sure that any design structure (e.g., blocks in a randomised complete block design; or blocks, main plots and sub-plots in a split-plot design) is included, that any factorial treatment structure and/or contrasts are included, and that any subsequent comparisons of means are only those of most biological importance (Kozak & Powers, 2017). So, if we have a simple linear relationship, we would analyse the linear rates in order to test (using F-tests) whether the treatments differ in terms of those rates. Analysis of the estimated intercepts (if fitted) allows the treatments to be compared at "time zero". Vitally, this approach overcomes the issue of non-independence because ANOVA is applied to a single set of independent observations from the experimental units. A minimal sketch of such a two-stage analysis is given after the following example.

Response data on cellular proliferation are a good example to which the above two-stage approach may be successfully applied. As cellular growth can be in terms of rings (Barlow, Brain, & Powers, 2002), the growth in the radius of the corresponding organisms over time may well be linear. Esteves, Peteira, Powers, Magan, and Kerry (2009) give a good example. They studied the growth of colonies of three isolates of the fungus Pochonia chlamydosporia, using seven time points (5, 7, 10, 12, 14, 18 and 25 days after inoculation of Petri dishes). Two different solutes, KCl or glycerol, were used, with stress being applied in terms of four different water potentials for osmotic and matric stress separately in two independent experiments, one for osmotic and the other for matric stress. Thus, for each type of stress, there was a three (isolates) by two (solutes) by four (stress levels) factorial treatment structure at each time point. Using a completely randomised design for each experiment, the authors used five dishes per treatment combination. In the first stage of the analysis, they plotted the radii of the fungal colonies (one per Petri dish) against time. For each replicate dish (experimental unit), the radius of its colony increased linearly, so linear regression was applied to estimate the radial growth rate (mm day−1) per colony (dish). For the second stage of the analysis, the authors compared the radial growth rates in two separate factorial ANOVAs (testing the main effects of, and interactions between, isolate, solute and stress level), one for each type of stress. Moreover, the four water potentials for each stress were ordinal – for example, −7.1, −2.8, −1.4 and −0.7 MPa for osmotic stress, with stress increasing as the water potential decreases – so it was also possible to use regression to model the estimated growth rates on the water potentials for each type of stress (see figs. 2 and 3 of Esteves et al., 2009).
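To make the two stages concrete, here is a minimal sketch in Python (an illustration only, not the analysis code of Esteves et al., 2009). It assumes a hypothetical long-format file colony_radii.csv with columns dish, treatment, day and radius, and uses a single treatment factor for simplicity; a real analysis would include the full factorial structure and any design terms.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical long-format repeated measures data:
# one row per dish x day, columns: dish, treatment, day, radius
df = pd.read_csv("colony_radii.csv")

# Stage 1: fit a simple linear regression per experimental unit (dish)
# and keep the estimated slope (radial growth rate, mm per day).
def growth_rate(unit):
    slope, _intercept = np.polyfit(unit["day"], unit["radius"], deg=1)
    return pd.Series({"rate": slope, "treatment": unit["treatment"].iloc[0]})

rates = df.groupby("dish").apply(growth_rate).reset_index()

# Stage 2: ANOVA of the per-dish rates across the treatment structure.
# There is now one value per experimental unit, so the observations are
# independent and the usual ANOVA assumptions apply.
fit = smf.ols("rate ~ C(treatment)", data=rates).fit()
print(sm.stats.anova_lm(fit, typ=2))
```

The second stage is an ordinary ANOVA precisely because the per-unit rates, unlike the raw repeated measures, are independent observations.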
Regressing the estimated growth rates on the water potentials made it possible to consider how the relationship between the isolates and the solutes changed as stress increased, for each type of stress, and also to observe the different overall relationships (nonlinear for osmotic, linear for matric) apparent for each type of stress.

The two-stage method extends naturally to situations where a nonlinear response over time appears to be, for example, exponential or logistic. In such nonlinear models, the (exponential) rate parameters are of interest. The other model parameters, however – for example, the maximal (asymptotic) response and the time to 50% of the maximal response in the case of the logistic – should also be analysed in the second stage. Furthermore, differentiating the nonlinear model equation allows the rate at any time point to be estimated and thence compared between the treatments in the second stage. Having described the given developmental process over time, the analysis provides further biological insight into the differences between the treatments as a whole. Classic texts on nonlinear regression modelling – such as Ratkowsky (1983) and Seber and Wild (1989) for theory, and Ratkowsky (1990) and Bates and Watts (2007) for application – can be consulted for details and, in particular, for the possible models that can be used to describe the observed shape of a given response.

Wilkinson et al. (2017) provide an example: they fitted logistic curves – with parameters for the asymptote (C), the rate (B) and the time to 50% of the asymptote (M) – to repeated measures recorded on plates of material from null and transgenic wheat lines. The authors also considered the asymmetric Gompertz and critical exponential models, but comparison of the models using F-tests showed the logistic to perform best. For the second stage, the sets of parameter estimates – for C, B and M – were analysed across the treatment structure, to test the significance of differences between genetic types (null vs. transgenic) and lines nested within types. Wilkinson et al. (2017) used the residual maximum likelihood (REML) method to fit a linear mixed model to the sets of estimated parameters, accounting for blocks and plates nested within blocks as random terms (variance components) and testing (using approximate F-tests) the main fixed effects of types and of lines nested within types. The authors chose linear mixed modelling over ANOVA because the observations of lines were unbalanced over the plates, there being only four of the 16 lines per plate and four plates per block (run together). We note that these blocks corresponded to the blocks in the glasshouse in which the wheat plants were grown. The relevant predicted means of the estimated parameters were presented along with the standard errors of the difference (SED). The authors used the predicted means for the three parameters from the output of the linear mixed models to draw the predicted mean logistic curves for the lines (see fig. 2 of Wilkinson et al., 2017).

Simple and straightforward, the two-stage method for analysing repeated measures data does not come without cost: it does not account for the entire underlying variation from all experimental units. In other words, it does not model the data all in one go to provide a more powerful comparison of treatments, tested against the residual variation arising from a single model while vitally accounting for the lack of independence over time. Thus, the two-stage approach is statistically conservative. This is not necessarily a bad thing if only the strongest differences between treatments are to be exposed, but do we really want to focus only on such effects? As we have seen, the two-stage approach consists of two separate stages, and so statisticians have been trying to combine them, hoping to gain additional benefit by doing so.
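Before turning to such combined approaches, here is a minimal stage-one sketch for a logistic response (an illustration under assumed column names plate, line, t and y, not the analysis of Wilkinson et al., 2017), using scipy to estimate C, B and M per experimental unit:

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

# Logistic model: y = C / (1 + exp(-B * (t - M)))
# C = asymptote, B = rate parameter, M = time at 50% of the asymptote.
def logistic(t, C, B, M):
    return C / (1.0 + np.exp(-B * (t - M)))

df = pd.read_csv("germination_counts.csv")  # assumed long format: plate, line, t, y

estimates = []
for plate, unit in df.groupby("plate"):
    # Rough starting values: asymptote near the maximum observed response,
    # inflection near the midpoint of the time range.
    p0 = [unit["y"].max(), 1.0, unit["t"].median()]
    (C, B, M), _ = curve_fit(logistic, unit["t"], unit["y"], p0=p0, maxfev=10000)
    estimates.append({"plate": plate, "line": unit["line"].iloc[0],
                      "C": C, "B": B, "M": M})

params = pd.DataFrame(estimates)
print(params.head())
```

Stage two would then analyse the per-plate estimates of C, B and M across the design and treatment structure, by ANOVA or, as in Wilkinson et al. (2017), by a linear mixed model.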
Hand and Crowder (1996) combine the two stages using a random coefficient model when the shape of the response is linear; Davidian and Giltinan (1995) do so when nonlinear models are in use. But what if there is no well-defined shape to the response over time? Or, even if there is, can we also apply ANOVA? Well, yes: ANOVA has in fact received much attention in the context of repeated measures, the underlying premise being to consider the repeated observations per experimental unit as split-plots (with the experimental units being main plots), giving the so-called "split-plot in time" design structure. But the issue with this premise is that you cannot randomise time! The time factor (which we should not call a treatment) is most certainly ordered, meaning that the key ANOVA assumption of independent observations is violated when analysing data from such a split-plot design. The issue, therefore, is to judge to what extent these split-plots (i.e., time points) are non-independent within the main plots. This is done by considering the covariance between pairs of time points. If all these pairwise covariances are statistically small and quite similar, despite varying variances at each time point, then we have a uniform covariance structure and can assume independence. We can test for departure from independence using a χ2 test. If we cannot assume independence, we need to account for the covariance over time in modified tests (F-tests) of the treatment, time and treatment-by-time interaction terms of the ANOVA. To reflect the loss of power to detect statistical differences that is associated with non-independence, SEs, SEDs and least significant differences (LSDs) will all be adjusted upwards from what they would be under independence. Hence, many statistical packages (such as Genstat, R, SAS, SPSS and Statistica) employ methods – such as the one developed by Greenhouse and Geisser (1959) or its adjusted version by Huynh and Feldt (1976), with various other methods available depending on the package – to estimate the extent of non-independence in repeated measures ANOVA.

Repeated measures ANOVA has been quite popular. In AAB, Abeli et al. (2017) employed repeated measures ANOVA with the Greenhouse and Geisser (1959) correction to study vegetative growth (the numbers of buds, flowers and fruits) of seashore mallow plants (Kosteletzkya pentacarpos) over time, having treated them with different combinations of salt (NaCl) and fertiliser (NPK). de Pedro et al. (2019) used repeated measures ANOVA to compare the performance (represented by percentage parasitism) of two medfly parasitoids, Diachasmimorpha longicaudata and Aganaspis daci, over time – albeit without stating which method of accounting for non-independence they used. The point is that, although such ANOVA can account for the experimental structure of a repeated measures experiment, sufficiently detailed output from the analysis is needed. In both of the above-mentioned studies, more information about the outcome of the repeated measures ANOVA could have been included in the results sections, to tell us the extent of non-independence over time. Following such an analysis, the relevant statistics (e.g., the Greenhouse–Geisser epsilon) should be presented and briefly interpreted. However useful, repeated measures ANOVA is not the only method for analysing repeated measures experiments – and, in fact, it is not the best one.
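As an aside, and as a minimal sketch only (assuming complete, balanced wide-format data and not reproducing the output of any particular package), the Greenhouse–Geisser epsilon mentioned above can be estimated directly from the covariance matrix of the repeated measures:

```python
import numpy as np

def greenhouse_geisser_epsilon(Y):
    """Estimate the Greenhouse-Geisser epsilon from a wide-format array Y
    (rows = experimental units, columns = k repeated time points).
    Epsilon ranges from 1/(k-1) (strong non-independence) to 1 (sphericity)."""
    n, k = Y.shape
    S = np.cov(Y, rowvar=False)            # k x k sample covariance over time
    C = np.eye(k) - np.ones((k, k)) / k    # centring matrix
    V = C @ S @ C                          # double-centred covariance
    eps = np.trace(V) ** 2 / ((k - 1) * np.trace(V @ V))
    return max(eps, 1.0 / (k - 1))

# Hypothetical example: 10 units measured at 5 time points, with strong
# serial correlation built in (a cumulative-sum random walk over time).
rng = np.random.default_rng(1)
Y = np.cumsum(rng.normal(size=(10, 5)), axis=1)
print(greenhouse_geisser_epsilon(Y))  # well below 1, signalling non-independence
```

The closer the estimate is to its lower bound of 1/(k − 1), the stronger the non-independence; the estimated epsilon is used to scale down the degrees of freedom of the time and treatment-by-time F-tests, which inflates the corresponding SEDs and LSDs as described above.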
Linear mixed modelling (LMM) is a more comprehensive approach, which allows the form of the non-independence to be modelled in more than just one way. It allows us to test for non-independence over time by including a model term for first-order auto-regression (AR1) when specifying the random part of the model. This is a standard method when time points are equidistant. When they are not – or simply as an alternative approach – an ante-dependence covariance structure may be used (Kenward, 1987); Zimmerman and Núñez-Antón (2017) provide a detailed exposition of this approach. Otherwise, in the situation of non-equidistant time points, a simple power-law relationship may be used to model the covariance structure over time. These three approaches make the intuitive assumption that there is less correlation the greater the duration between any chosen pair of time points.

The LMM approach has gained some interest. In AAB, Gontijo et al. (2018) used it to model counts of aphids and their natural enemies over time. They tested for the best model term to include from among the following covariance structures: unstructured, ante-dependence or auto-regression. In the case of the auto-regression covariance structure, the authors also tested whether the variance at each time point should be the same or different (heterogeneous). This allowed them to present correct F-test results for interpretation, given the inclusion of the appropriate model terms (see table 1 of Gontijo et al., 2018). LMM also allows more complex variance–covariance relationships over time to be constructed and tested for statistical significance (Rao, 1997). In a study of the outdoor storage of biofuel, Whittaker, Yates, Powers, Misselbrook, and Shield (2016) modelled the emission of CO2, CH4 and N2O from two willow wood-chip heaps (one in the East Midlands and one in Hertfordshire, UK). Repeated measures from probes in the heaps were taken at non-equidistant time points, and the authors accounted for the non-independence by imposing a power-model structure in the LMM. They fitted the model assuming a common variance structure (the same for all time points), and then with a different variance at each time point, to test for heterogeneity of variance over time, as Gontijo et al. (2018) did (see above). Whittaker et al. (2016) also added spline terms in time (see Verbyla, Cullis, Kenward, & Welham, 1999) to the model, to test for statistically significant curvature, either over time as a whole or separately over time for the different spatial locations (depth and side of heap) of the probes. Thanks to this, they compared – with sufficient statistical rigour – the emission responses from the different depths and sides of the heaps (see figs. 5 and 7 of Whittaker et al., 2016).

For repeated measures of discrete data, such as those pertaining to the binomial or Poisson distributions, you can use generalised linear mixed models (GLMM). The principles of testing for non-independence over time, for example by the inclusion of an AR1 term, remain the same as for continuous response data. The analysis of repeated measures categorical data forming contingency tables of counts, where either the rows or the columns denote the time points, has also come under statistical scrutiny (Agresti, 2013); repeated measures of discrete data can also be analysed with empirical generalised least squares estimation and log-linear modelling.
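As a minimal sketch of modelling all the data in one go (a random coefficient model in the spirit of Hand and Crowder, 1996, rather than the analyses of the papers cited above), assuming a hypothetical long-format data set with columns unit, treatment, time and y:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("repeated_measures.csv")  # assumed columns: unit, treatment, time, y

# Random coefficient model (random intercept + random slope in time per unit):
# the fixed effects test treatment, time and their interaction, while the
# random part induces correlation between measurements taken on the same
# experimental unit.
model = smf.mixedlm("y ~ C(treatment) * time", data=df,
                    groups="unit", re_formula="~time")
result = model.fit(reml=True)
print(result.summary())

# Note: serial correlation structures such as AR1, ante-dependence or a
# power law over time (as discussed above) are not fitted this way in
# statsmodels' MixedLM; they are typically specified directly in packages
# such as Genstat, SAS or ASReml when required.
```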
Repeated measures data also arise when the time axis is not time itself but a derived time axis in the form of accumulated "developmental units". An example in the discrete case is research aiming to compare oilseed rape cultivars (Powers, Pirie, & Nemeth, 2009; Powers, Pirie, Latunde-Dada, & Fitt, 2010), in which the numbers of healthy, diseased and dead leaves on plants were counted over thermal time from sowing. Here, there were multiple (and correlated) responses with repeated measures for each experimental unit, giving rise to the opportunity for innovative statistical modelling of all the responses together per plant. In the continuous case, an example is the height of willow trees over estimated day-length developmental units (estimating the base day-length), with the trees being subjected to different levels of mechanical defoliation simulating damage due to willow beetles (Powers, Peacock, Yap, & Brain, 2006). In both of these examples, the two-stage modelling approach was used.

AAB has a panel of statistical editors who, as part of the review process, report on the state and appropriateness of the statistical design and analysis of experiments in papers submitted to the journal. The journal's general expectations for statistical analysis and its presentation, in both the materials and methods and the results sections, can be found here: https://onlinelibrary.wiley.com/page/journal/17447348/homepage/forauthors.html. The statistical editors will query situations where repeated measures are likely to have been taken but where the authors have not made this explicit in their materials and methods section. Moreover, where they see it as appropriate, they will ask for a more comprehensive analysis of repeated measures data. This will certainly happen when they see that such a comprehensive analysis would add value to the study, by extracting more biological information through the application of appropriate statistical modelling. And, clearly, authors should not miss the opportunity to enhance their work by reporting illuminating results, based on a meaningful statistical analysis of their data.
