Popularity leads to bad habits: Alternatives to “the statistics” routine of significance, “alphabet soup” and dynamite plots
2021; Wiley; Volume 180, Issue 2; DOI: 10.1111/aab.12734; ISSN 1744-7348
Combine 3 with 1 in the presentation: draw a bar chart of the means, with "SEM" error bars, and add stars/letters to each bar. Table 1a and Figures 1a and 2a show typical examples of results presented using the Routine. These are fabricated, but one or other of these forms of presentation can be found in most issues of almost all biological journals, from the most humble to the most exalted. Other routines (Zuur & Ieno, 2016) are available, but are less widely followed. Example 1a also shows common shortcomings: poor legend wording and no reference to the multiple range test used (here, Duncan, 1955). This journal (AAB) has some reasonably strong guidelines for statistical presentation; these essentially prohibit use of the Routine.

The roots of the Routine are hard to identify, but one of the seeds must surely be Ronald Fisher's book "Statistical Methods for Research Workers" (Fisher, 1925), where tables of the values of statistics for particular p-values were first published, making statistical testing accessible. In conjunction with this, the last few decades have seen the widespread availability of software that makes the steps in the Routine very easy to carry out, with the consequent explosion in its use. Meanwhile, developments in statistical methodology have proceeded at exponential rates, driven by major developments in theory and computing resources. Whilst still including some significance testing, these newer techniques provide additional and more reliable information.

Despite its widespread use, the Routine tends to mean that some useful information in the data is obscured, sometimes leading to unsupported conclusions. There is a very large body of literature, written over several decades, critically addressing the steps, generally suggesting alternative approaches that make more informative use of the data (Vail & Wilkinson, 2020). It would seem that no paper addresses all the steps in the Routine, and many of these papers are not published in journals relevant to experimental biologists. This paper provides a partial review of publications relevant to all parts of the Routine. Informative data summary and data analysis are assisted by an understanding of the ideas behind the methods used. Therefore, ideas relevant to the steps in the Routine are briefly summarised, along with the suggested alternative practices.

The Routine results in a summary of statistical analyses, so understanding the purpose of a statistical analysis is a first step to gaining more information from data. Statistical analysis is about summarising data to find information, and about making estimates and predictions (inferences) about what might happen in the future (Fisher, 1925). The need to deal with variation is inherent in these activities. If your dataset includes every possible data point, variation is not a problem because the dataset contains everything: the results of all football matches in a competition are known exactly. Very few research datasets contain all possible data points, so the key problem is to obtain representative summaries, estimates or predictions, and associated measures of uncertainty for these. Significance testing is a part of assessing uncertainty (Fisher, 1925), but has, in the view of many scientists, become synonymous with "the statistics," with the consequence that potentially useful information is not found. An increased awareness of the fuller range of available analysis and data presentation tools can thus lead to an increase in the information gained from a dataset.
For many biologists, the "p-value" is the entire point of a statistical analysis. Some journals require a capital P, or even a capital, italicised P, perhaps reflecting the high importance attributed to p-values. AAB uses p, as do the majority of statistical and mathematical publications. There is a dichotomous interpretation of p: a significant "p-value" (usually one less than .05) is interpreted to mean that a difference between treatments is "real"; conversely, if p > .05, then the treatments are "the same." For a given comparison, it is also widely believed that there is only one "true value" of p. These ideas are appealing, because they appear to give simple and easily interpretable conclusions. Sadly, the beliefs are mistaken (Goodman, 2008; Sterne, Cox, & Smith, 2001a; Wasserstein, Schirm, & Lazar, 2019).

In the context of most significance tests, p simply stands for "probability," and is the probability of getting a result as extreme as, or more extreme than, the one observed, GIVEN THAT the null hypothesis of "no difference" is true, AND THAT the assumptions behind the analysis carried out are (sufficiently) true. The last part is often not recognised, but it is very important. p-values can be small because the null hypothesis is not true, or because the result obtained is just one of the rare ones, or because the assumptions about the data required for a valid analysis are not appropriately satisfied. So, for a p-value to be meaningful, the data must sufficiently satisfy the assumptions: where this is not the case, a lack of adherence to assumptions can be the primary cause of a significant result. Statistical significance (or not) says nothing about biological importance (Ziliak & McCloskey, 2008): any "test" result must be interpreted in the context of the trial and the relevant biology.

Values other than .05 can be used to determine significance. Fisher (1925) promoted choosing a value to reflect the context (e.g., information from previous trials), so sometimes p = .1 might be a good choice, sometimes a value smaller than .05. Fisher suggested p = .05 was convenient, partly because, with the assumption of Normality, it approximately corresponds to a difference being equal to twice its standard error, and partly because he thought a 1 in 20 chance was sufficiently small to be interesting (Fisher, 1925). Today, computers can calculate exact p-values in a fraction of a second, so these can be presented, allowing readers to make their own decisions as to the importance of effects (Webster, 2001).

A natural interpretation is that the smaller the p-value, the more "evidence" there is that the effect is "real," interpreting 1 − p as "evidence against the null hypothesis." Unfortunately, this is not the case. "Evidence against the null hypothesis" could be calculated using Bayes' theorem, but this requires further, usually unavailable, information (Matthews, 2001). In practice, it is reasonable to interpret a large value of p (close to 1) as evidence in support of the null hypothesis (no effect), and very small p-values (such as p < .001) as evidence against the hypothesis (Sterne et al., 2001a). Ultimately though, results from many trials and sources are needed to confirm that an effect is "real."
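To illustrate reporting an exact p-value rather than a star category, here is a minimal sketch in Python (hypothetical t statistic and residual degrees of freedom; SciPy assumed): the two-sided p is simply twice the upper-tail probability of the t distribution.

```python
# Sketch only: hypothetical numbers, not from any trial in this paper.
from scipy import stats

t_obs = 2.31   # hypothetical t statistic for a treatment comparison
df = 12        # hypothetical residual degrees of freedom from the ANOVA
p_two_sided = 2 * stats.t.sf(abs(t_obs), df)   # sf() gives the upper-tail probability
print(f"t({df}) = {t_obs:.2f}, p = {p_two_sided:.3f}")   # quote this value, not just "*"
```

Quoting the value itself (here roughly p = .04) lets readers judge its weight in the context of the trial.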
Use of asterisks/stars is very common: the practice probably originated as a reference to a footnote referring readers to published tables of critical levels of a statistic. Stars convert a continuous 0 to 1 p-value scale into classes (usually n.s., *, **, ***), losing both subtlety and information (Wasserstein et al., 2019) whilst simultaneously attributing undue accuracy to p-values near the cut-off points between the classes. It is more informative to present actual p-values, especially for n.s. (not significant), because n.s. can mean anything from a shade larger than .05 (slightly "interesting") all the way up to p = 1 (identical means). p-values are only estimates: analysis assumptions are never exactly satisfied. Thus, very similar p-values, including p = .051 and p = .049, are essentially the "same" (Wasserstein et al., 2019) and should be interpreted as such. In general, p-values should be quoted to only two or three decimal places. One or two decimals are enough for associated statistics (F, t, etc.) if they are also presented.

Trial design is fundamental to the value of a study (Finney, 1988), but is not mentioned in the Routine. Frequently, only the numbers of replicates and treatments used are provided (Haddaway & Verhoeven, 2015), perhaps because it is assumed that the design is always a randomised complete block design. There is a huge range of potential designs, old (Fisher, 1926) to recent (Williams & Piepho, 2019), and sound experimental design underpins effective and efficient studies (Finney, 1988; Kilkenny et al., 2009; Smith & Cullis, 2019). Appropriate analysis methods can vary substantially between different design types. An accurate and complete description of the design is therefore essential, to support the validity of trial results and enable reproducibility (Haddaway & Verhoeven, 2015).

ANOVA is the most frequently used method for data summarised using the Routine. Like all statistical methods, ANOVA is underpinned by a set of assumptions that must be satisfied for the results to be valid. ANOVA has five underlying assumptions, three of which are the most important (Finney, 1989). First, the analysis assesses the differences (meanB − meanA) between treatments rather than other possible relationships (e.g., meanB being a multiple of meanA). Second, the variance is the same for each treatment: Fisher (Fisher & Mackenzie, 1923) developed ANOVA with the primary aim of obtaining a more robust estimate of the underlying variation (Finney, 1988). If the variation around each treatment mean is (close to) similar, a pooled (i.e., combined) error can be calculated, based on more data. The pooled error, and any test using it, is more reliable than when individually calculated errors are used. Therefore, the pooled measure of error should be used in the presentation of analysis results (Welham, Gezan, Clark, & Mead, 2015, chap. 5). Third, any data point is independent of every other data point. Lack of independence is often caused by taking multiple measurements of the same thing, either at different times ("repeated measures") or at the same time ("pseudo-replication"). The analysis needs to be adjusted to allow for this, otherwise test results will be unreliable, frequently with p-values being too small. The remaining assumptions are less important. One is that the data around each mean are Normally distributed. Normality underlies the validity of the testing of the F or t statistics, although not always the validity of the statistics themselves (Finney, 1989). ANOVA is "robust" to departures from Normality (Lumley, Diehr, Emerson, & Chen, 2002), whereas it is not strongly robust to variance heterogeneity (Welham et al., 2015). Bias is when the result is skewed because of experimental procedures, such as the use of a systematic layout where treatments have not been randomly allocated (such layouts can sometimes be justified).
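To make the pooled-error point concrete, the sketch below (made-up values for three treatments; Python with SciPy assumed) contrasts per-treatment "SEM"s, each based on only 3 df, with the pooled SEM formed from the residual mean square of a one-way ANOVA, which here has 9 df and hence a smaller t multiplier.

```python
# Sketch only: invented data for three treatments with four replicates each.
import numpy as np
from scipy import stats

data = {
    "A": np.array([5.1, 4.8, 5.6, 5.3]),
    "B": np.array([6.2, 6.8, 6.0, 6.5]),
    "C": np.array([4.4, 4.9, 4.1, 4.6]),
}
n = 4

for name, y in data.items():
    sem_i = y.std(ddof=1) / np.sqrt(n)          # individually calculated "SEM", df = n - 1
    print(f"{name}: mean = {y.mean():.2f}, individual SEM = {sem_i:.3f} (df = {n - 1})")

# Pooled residual mean square, as produced by a one-way ANOVA
resid_ss = sum(((y - y.mean()) ** 2).sum() for y in data.values())
resid_df = sum(len(y) - 1 for y in data.values())        # 9 df here
pooled_sem = np.sqrt((resid_ss / resid_df) / n)
print(f"pooled SEM = {pooled_sem:.3f} (df = {resid_df})")
print(f"5% t multipliers: {stats.t.ppf(0.975, n - 1):.2f} (df = {n - 1}) "
      f"vs {stats.t.ppf(0.975, resid_df):.2f} (df = {resid_df})")
```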
Assumptions need to be checked, but only need to be approximately satisfied. However, assumption checking is frequently confined to assessing Normality, or assumptions are not checked at all (Warton & Hui, 2011). Where assumptions are not sufficiently satisfied, many scientists now address the problem by using modern statistical methods. However, the two most common ways to address violations of assumptions remain to transform the data and then use the same analysis method, or to use a nonparametric method. Each has drawbacks.

ANOVA and standard regression are parametric methods because the primary output from the analyses is estimates of parameters (predicted means, regression parameters, etc.). Parametric techniques involve fairly strong assumptions, whereas nonparametric analyses require much less stringent assumptions to be satisfied. The most commonly used nonparametric methods (e.g., Mann–Whitney U) obtain ranks of the data, and then carry out tests on those ranks rather than on the actual data. Thus, the primary output from these methods is not estimates such as treatment means, but a test statistic assessing the similarity of the mean of the ranks for the treatments. p-values for these statistics will usually be larger than those for the same comparison from a parametric analysis, because more assumptions are made for a parametric analysis. Thus, in general, the parametric test has more "power." The assessment of differences between mean ranks does not automatically apply to differences between treatment means, so care needs to be taken when interpreting the results of a nonparametric analysis.

Historically, data transformation prior to analysis was used to enable the assumptions behind standard ANOVA or regression to be sufficiently satisfied, thus enabling these methods to be used. The primary reasons to transform data were a lack of variance homogeneity and the need to analyse non-Normal data (Faraway, 2002; Welham et al., 2015). In an age with only rudimentary computing power, this was the best that could reasonably be done. Nowadays, methods that more directly model the characteristics of the data are available and should be used in preference to data transformation (Lane, 2002). A key advantage is that the results are then generally more interpretable from a biological point of view.

A transformation modifies several things (Lane, 2002) that need to be considered when interpreting analysis results. These modifications are illustrated with a log transform. If the data for each treatment are log-Normally distributed, then the logged data are Normally distributed (Evans, Hastings, & Peacock, 1993). For log-Normally distributed data, the variance increases with the size of the treatment means; the variance of the log(data) is often constant across treatments. ANOVA of the log(data) gives the means of the log of the data, which, when back-transformed, are geometric means. For n data points, the geometric mean is the nth root (i.e., to the power of 1/n) of the product of all of the data points. The geometric mean can be quite different from the usual (arithmetic) mean: the mean of 5, 9, 12, 25, 27 is 15.6, whereas the geometric mean is (5 × 9 × 12 × 25 × 27)^(1/5) = 12.95.
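The geometric-mean arithmetic above is easy to verify; this minimal Python check uses the same five values and shows that back-transforming the mean of the logged data gives the geometric mean, not the arithmetic mean.

```python
import numpy as np

y = np.array([5, 9, 12, 25, 27], dtype=float)
print(y.mean())                    # arithmetic mean: 15.6
print(np.exp(np.log(y).mean()))    # geometric mean, (5*9*12*25*27)**(1/5) = 12.95
```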
Means and differences between means obtained on the log scale can be back-transformed, as can confidence limits or a least significant difference (LSD), which becomes a ratio. (LSD: the smallest difference between two means such that the means are significantly different at the required significance level.) However, standard errors cannot meaningfully be back-transformed. ANOVA of the log(data) compares the differences between the means of the logged data. Back-transforming such a difference gives the ratio between the geometric means, which does not directly say anything about differences between the raw means:

Untransformed data: meanA − meanB
Logged data: log(GeoMeanA) − log(GeoMeanB), which, when back-transformed, is GeoMeanA/GeoMeanB.

Historically, the square root transform was used for count data, and the arcsine transform for proportions out of a fixed total. These both change the four features listed above. However, unlike for the log transform, there is no meaningful way to back-transform the difference between the transformed means or the LSDs. As Warton and Hui (2011) demonstrate for the arcsine transformation, using ANOVA subsequent to these transformations gives uninterpretable results. There are cases where transformation makes sense, such as a square root transform for areas or a cube root transform for weights (both giving quantities roughly proportional to a length). Welham et al. (2015, section 6.2.4) say: "In an ideal case, a transformation will give interpretable physical [or biological] representation of the response as well as enabling it to satisfy the assumptions of the analysis so that the conclusions are valid".

Where the data do not conform sufficiently well to the assumptions, data transformation and nonparametric methods can generally be avoided, because methods are available that allow the characteristics of the data, as collected, to be modelled directly. Such methods are already used in some papers in many journals and are described in many introductory texts (Welham et al., 2015). Generalized linear models (GLMs; McCullagh & Nelder, 1989) provide suitable methods for the analysis of both counts and proportions, with interpretable results. (Note: the use of "GLM" for "General Linear Model" is very widespread, but that term applies to general regression/ANOVA for Normally distributed data, which is just one type of GLM. In this paper, GLM always refers to a generalized linear model.) For a (simple) GLM analysis, the means are identical to the raw means, and the relationship between the explanatory variables and the means is determined by the link function, which is an integral part of a GLM. For a log link and a simple model with one explanatory variable X and associated parameter b to estimate, log(mean) = bX. Other useful analysis methods are REML (restricted maximum likelihood; for mixed models with fixed and random effects (Welham et al., 2015), repeated measures, etc.) and variants of the generalized linear mixed model (Breslow & Clayton, 1993).
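As a hedged illustration of the GLM route for counts (invented counts, not the paper's Table 2a; Python with pandas and statsmodels assumed), the sketch below fits a Poisson GLM with a log link to two treatments: the fitted means equal the raw treatment means, and back-transforming the treatment coefficient gives the ratio meanB/meanA rather than the difference.

```python
# Sketch only: invented counts for two treatments, five replicates each.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "treatment": ["A"] * 5 + ["B"] * 5,
    "count": [4, 7, 5, 3, 6, 9, 12, 8, 11, 10],
})

fit = smf.glm("count ~ treatment", data=df, family=sm.families.Poisson()).fit()

means = df.groupby("treatment")["count"].mean()
ratio = np.exp(fit.params["treatment[T.B]"])    # back-transformed treatment coefficient
print(means)                                    # raw means: A = 5, B = 10
print(ratio, means["B"] / means["A"])           # both equal 2: a ratio, not a difference
```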
If a dataset is analysed using more than one method, a range of p-values will be obtained for comparing the same treatments, and interpretations can vary. The p-value for a comparison is affected by several things, including the type of comparison between treatments (difference between means, ratio of means, etc.), the number of explanatory variables and the analysis assumptions. Analyses of the fabricated data in Table 2a illustrate how test results can change between different methods. The data are counts, for five replicates of two treatments, A and B. Table 2b has the summary statistics. The treatments were compared using four different but valid "tests" (Table 2c), with p-values as shown. The first test is a standard t-test, using the pooled variance. The second is the same t-test, but with the p-value calculated using a permutation test, making the test partly nonparametric. In this test, all 10 values of the data are permuted (re-ordered) many times, each time re-allocating five data values to each treatment. The p-value then assesses how extreme the result with the actual data was relative to the results for all permutations of the data. The third test is the Mann–Whitney U test, a nonparametric version of a t-test. This test compares the means of the ranks (sorting order) of the data (Siegel & Castellan Jr, 1988). The fourth test compares the treatments using a Poisson GLM. The model includes a log link relating the explanatory variables (treatments) to the means, which leads to the ratio meanB/meanA of the means being assessed, rather than the difference: this is a change in the hypothesis being assessed.

The p-values vary from .010 for the standard t-test to .212 for the Poisson test. The two nonparametric tests have similar, but not identical, p-values, which are larger than those for the parametric t-test: this is expected, because of the extra assumptions underlying the parametric test. The interpretation of the test result varies with the test, because the treatments are being compared in different ways. It would be incorrect to infer that the treatment means differ significantly because the result of the Mann–Whitney U test was significant; however, you can say that the values for treatment A were mostly lower than those for B. Similarly, the nonsignificant ratio (A/B) of the treatment means, as tested in the Poisson GLM, does not necessarily tell you much about the difference (A − B) between the means. Just as test results can vary greatly between different methods, so can estimates, errors and significance letters: results from one analysis method are not generally applicable to those from another method. Consequently, analysis methods need to be clearly described, and the interpretation and presentation of results will be influenced by that method. Thus, it is usually better not to mix results from different analysis methods, including keeping simple summary statistics separate from the results of a formal analysis.
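For the nonparametric tests described above, a sketch is given below (again with invented counts, not the paper's Table 2 data; a reasonably recent SciPy is assumed for the permutations argument to ttest_ind). As noted, the p-values need not agree with the standard t-test, because the assumptions, and the hypotheses being assessed, differ.

```python
# Sketch only: invented counts for treatments A and B, five replicates each.
import numpy as np
from scipy import stats

a = np.array([4, 7, 5, 3, 6])
b = np.array([9, 5, 8, 11, 10])

t_res = stats.ttest_ind(a, b)                                         # pooled-variance t-test
perm_res = stats.ttest_ind(a, b, permutations=9999, random_state=1)   # permutation p-value
mw_res = stats.mannwhitneyu(a, b, alternative="two-sided")            # test on the ranks

print(f"t-test p = {t_res.pvalue:.3f}")
print(f"permutation p = {perm_res.pvalue:.3f}")
print(f"Mann-Whitney p = {mw_res.pvalue:.3f}")
```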
A well-planned trial or other type of study has clearly defined aims that determine how the study is carried out (Mead, Gilmour, & Mead, 2012; Webster & Lark, 2018). The study aims lead to questions to be answered, which lead to the choice of variables to be measured, treatments and specific treatment comparisons (Chew, 1977; Kozak & Powers, 2017; Little, 1981). The appropriate statistical analysis, including tests, should, at least in broad terms, be clear even before the data are collected. Thus, the need to compare each treatment with each other treatment should be very rare. All-pairwise testing (or "means separation") is a core component of the Routine, leading to "alphabet soup." However, the results are often not easily interpretable or particularly informative. Little (1981) describes a hypothetical plant growth trial, with treatments comprising seven different amounts of a substance. The question would be "what is the relationship between plant growth and the amount of applied substance?" and is unlikely to be "Of the 21 possible pairs of treatments, which ones are significantly different from each other?".

Where there is a large number of unstructured treatments (e.g., several cultivars), it is often useful to make an overall test (e.g., the F-test in ANOVA) and then present the means sorted in order of size, along with a measure of precision. This can be done graphically (Figure 2b) by plotting the means in increasing order, and can reveal groupings in the treatments. The aim for unstructured sets is often to find the "best" few treatments, so testing every pair of treatments is often not informative. Some comparisons are of more interest than others (Finney, 1988: is the comparison of the cultivars with the 31st and 30th highest yields really of as much interest as that of the highest vs the second highest yield?).

Results of all-pairwise comparison procedures are generally presented using letters (Table 1a; Figures 1a and 2a). Letters sometimes help identify important features in the data. However, they convert a continuous 0 to 1 p-value scale into two classes (significant and not significant), losing even more information than stars. They reinforce the "real difference" versus "the same" false dichotomy. They also add clutter, obscuring the key information (means/estimates, changes between treatments, estimates of error/precision). In Figure 3, if the letters were not there, the conclusion for Figure 3a would be "T1 is lower than T2 and T3, which are reasonably similar." The letters add nothing informative: they imply that T1 is "the same" as T2, but "different" to T3, which is "the same" as T2! It is more helpful to instead include an error bar, such as the LSD, as in Figure 3b. For Figure 3c, the letters make sensible interpretation rather hard: how can S1 and S3 be different from S2 whilst S1 and S3 are "the same"? The effect seen can be a result of varying replication. Here again, LSD or confidence limit bars are more helpful (Figure 3d).

The ease with which all the appropriate treatment comparisons can be included within an analysis varies substantially between statistical packages. With some packages, the only way to make the required comparisons may be to do t-tests that use the standard error of a difference or an LSD produced from the ANOVA. Comparisons done this way will give identical p-values to the same contrast assessed within an ANOVA. Sometimes it is appropriate to make unplanned ("post hoc") treatment comparisons: in such cases, Altman, Gore, Gardner, and Pocock (1983) suggest that the tests (including all-pairwise tests) should generally be treated as exploratory only, with a view to forming new hypotheses to be assessed in future studies.

When presenting analysis results, it is most useful to focus on the estimates/size of treatment effects/patterns in the data, especially those of biological interest, with actual p-values used as supporting evidence: to quote just the p-value or "it was significant" is almost completely uninformative (Lang, 2004). Nonsignificant effects can be as biologically interesting as significant effects, and so should also be discussed. Wasserstein et al. (2019) suggest avoiding using the word "significant" altogether.
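Relating to the LSD error bars suggested for Figures 3b and 3d, the sketch below (made-up data; Python with SciPy assumed) shows how a 5% LSD for equally replicated treatments is obtained from the ANOVA residual mean square: the standard error of a difference is sqrt(2s²/r), and the LSD is that multiplied by the 5% t value on the residual df.

```python
# Sketch only: invented data, three treatments with four replicates each.
import numpy as np
from scipy import stats

treatments = {
    "T1": np.array([10.2, 11.1, 9.8, 10.6]),
    "T2": np.array([12.4, 13.0, 12.1, 12.8]),
    "T3": np.array([12.9, 13.5, 12.6, 13.2]),
}
r = 4                                                            # replicates per treatment
resid_df = sum(len(y) - 1 for y in treatments.values())          # residual df (9 here)
s2 = sum(((y - y.mean()) ** 2).sum() for y in treatments.values()) / resid_df
sed = np.sqrt(2 * s2 / r)                                        # standard error of a difference
lsd = stats.t.ppf(0.975, resid_df) * sed                         # 5% least significant difference

for name, y in treatments.items():
    print(f"{name}: mean = {y.mean():.2f}")
print(f"SED = {sed:.3f}, LSD(5%) = {lsd:.3f} (df = {resid_df})")
```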
A major challenge for the presentation of results is how to represent variability. There are many ways of showing variability, and the choice as to which is most useful varies between studies. To show variability, the Routine uses either the individually calculated "SEM" or the SD.

The SD is a descriptive statistic (Altman & Bland, 2005; Cumming & Finch, 2005; Nagele, 2003; Vail & Wilkinson, 2020): it is the square root of the average of the squared deviations from the mean, so it reflects the variation in the data themselves. For Normally distributed data, on average 95% of the data will fall within 2 SD of the mean (Altman & Bland, 2005; Cumming & Finch, 2005). In contrast, the "SEM" is an inferential statistic and is a rudimentary formal analysis (Vail & Wilkinson, 2020). It is a measure "of how the mean of a sample relates to the mean of the underlying population" (Nagele, 2003). As such, an "SEM" depends on analysis assumptions, such as Normality, being adhered to if it is to have any real meaning. The "SEM" calculated individually for a treatment gives a misleading indication of the variation in the raw data, because it is always smaller than the SD. As an estimate of the variation or precision of a mean, the "SEM" is unreliable when, as is usual, it is calculated from a small number of values (often 3 or 4). The unreliability is reflected in the large associated t-values (at the 5% level, t = 12.71, 4.30 and 3.18 for 2, 3 and 4 data points, respectively). ANOVA addresses this unreliability by pooling information over the whole dataset: the resulting pooled standard error (SEM) has larger associated degrees of freedom (df) and smaller t-values, so it is preferable to individually calculated "SEM"s.

Given the above, why is the "SEM" so widely used? The reasons would appear to include repeating what "has always been done," the fact that the "SEM" is smaller than the SD, so it looks "better," and a (misguided) belief that the "SEM" can easily be used to assess whether two means are "significantly different" (Cumming, 2009). Belia, Fidler, Williams, and Cumming (2005) demonstrated that, on average, scientists cannot reliably say how far apart means need to be for "SEM" bars to allow a judgement of "significance," making "SEM" bars rather uninformative.

In addition to the SD and "SEM," there are many other ways of showing variation. There are (broadly) two sorts (Cumming & Finch, 2005): variation in the raw data, and precision of estimates as derived from an analysis. Appropriate ways to show variability (errors) differ between the two sorts, so the form of appropriate tables and figures also varies. The various types of error are widely described (Cumming, 2009; Cumming, Fidler, & Vaux, 2007; Cumming & Finch, 2005; Saville & Rowarth, 2008); Table 3 summarises the main types. Whichever type(s) of error are shown in a table or figure, the error needs to be described in the legend, because interpretation of the information varies substantially between the error types: error bars depicting descriptive statistics are drawn in the same way as error bars relating to inferential analyses, but they are fundamentally different. "Different types of error bars give quite different information, and so figure legends must make clear what error bars represent" (Cumming & Finch, 2005).

In tables, the use of "±" to precede a measure of error is superfluous and can be misleading (Drummond & Tom, 2011), because it implies a range within which the statistic can vary. Primarily, ± just adds visual clutter, so it should generally be avoided. In text, errors should be named: for example, use "mean = 4.35 (SEM = 0.27; df = 12)."
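The value of quoting the df alongside an error is easy to see numerically; the short sketch below (hypothetical values) computes the SD and the individually calculated "SEM" for a small sample and prints the 5% t multipliers quoted above for 2, 3 and 4 data points.

```python
# Sketch only: three hypothetical values for one treatment.
import numpy as np
from scipy import stats

y = np.array([4.1, 5.3, 4.8])
sd = y.std(ddof=1)                 # describes the spread of the data themselves
sem = sd / np.sqrt(len(y))         # always smaller than the SD
print(f"SD = {sd:.2f}, individual SEM = {sem:.2f} (df = {len(y) - 1})")

for n in (2, 3, 4):                # the 5% t multipliers: 12.71, 4.30, 3.18
    print(f"n = {n}: t(0.975, df = {n - 1}) = {stats.t.ppf(0.975, n - 1):.2f}")
```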
It is frequently useful to include the df with an error (McCullagh & Nelder, 1989; Saville & Rowarth, 2008), because the df give a measure of the reliability of the error and allow it to be converted into an alternative error type (e.g., converting an SEM into confidence limits).

In Figure 4, six sets of data are summarised (following fig. 1 of Drummond & Vowler, 2011). The mean and SD are exactly the same for each of the six sets (mean = 4; SD = 1). Thus, the bar charts of the mean with SD (b) are all identical. The bar charts with "SEM" bars (a) differ only because the errors are calculated individually for each treatment, so they are shorter where n = 50: the length is driven by n, not by the variation in the data itself. Such "dynamite" or "detonator" plots are widely derided by the statistical community (Drummond & Vowler, 2011; Koyama, 2011; Vail & Wilkinson, 2020) because they "hide" many of the details in the data. The more detailed presentation of the data, as illustrated with dot-histograms (c) (Wilkinson, 1999), contrasts with the impression given by plots (a) and (b), showing that the sets differ substantially, both in the distribution of the data and in the number of values. For both values of n, Set 1 is sampled from a Normal distribution, which is symmetrical. Set 2 is drawn from a log-Normal distribution, so is asymmetrical. Set 3 is an artificially constructed set where the data fall into two groups. For a summary of the raw data, plot (c) is the most informative: plots (a) and (b) show little of the information that can be seen in (c). Where n is too large for dot-histogram presentation as in (c), box plots (d) (Tukey, 1977) or related plots such as violin plots (Wickham & Stryjewski, 2011) may be appropriate. However, the standard box plot does not reveal the grouping seen for Set 3 in the dot-histogram: some related plot types may show this.

Given that the "SEM" does not reflect the variation in the raw data, it is more informative to present the SD in tables or, particularly for non-Normal data, the minimum and maximum of the data or the range (Drummond & Tom, 2011; Lang & Altman, 2013). For graphical presentation, dot-histograms or variants (Drummond & Vowler, 2011; Koyama, 2011; Ovais, 2016; Vail & Wilkinson, 2020; Weissgerber et al., 2019) or box plots (Tukey, 1977) use less ink than the standard bar chart with error bars whilst showing more information (Tufte, 2001: the data-to-ink ratio; higher values are better).

All parametric analyses carried out with sound statistical software calculate estimates, and the errors associated with them, as part of the analysis, because obtaining these is the primary reason to carry out the analysis. These estimates and errors should therefore be presented, including when the estimates are identical to the raw means. In tables, it can be appropriate to show descriptive statistics alongside results from the analysis; within figures, however, it is less misleading to keep the two types completely separate. The type of errors presented should be chosen depending on the aims and structure of the trial, the nature of the data collected, and what you are trying to demonstrate. In a figure, presenting confidence limits on each mean may be most appropriate. If you have carried out ANOVA on the raw data, the means and an LSD may be useful, especially where the replication is the same for all treatments. Improved presentations following the above recommendations for the examples in Table 1a and Figures 1a and 2a are shown in Table 1b and Figures 1b and 2b.
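In the spirit of Figure 4, the sketch below (simulated data; Python with matplotlib assumed) draws three sets both as a "dynamite" bar chart with "SEM" bars and as a jittered dot/strip plot of the raw values, which reveals the differing distributions that the bars hide.

```python
# Sketch only: simulated data loosely echoing the three kinds of set described above.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
groups = {
    "Normal": rng.normal(4, 1, 15),
    "Log-Normal": rng.lognormal(mean=np.log(4) - 0.125, sigma=0.5, size=15),
    "Two groups": np.concatenate([rng.normal(3, 0.3, 8), rng.normal(5, 0.3, 7)]),
}

names = list(groups)
means = [groups[g].mean() for g in names]
sems = [groups[g].std(ddof=1) / np.sqrt(len(groups[g])) for g in names]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3.5))

# "dynamite" plot: bars of the means with individually calculated "SEM" bars
ax1.bar(names, means, yerr=sems, capsize=4, color="lightgrey", edgecolor="black")
ax1.set_title('Bar chart with "SEM" bars')

# dot/strip plot of the raw values, with a little horizontal jitter
for i, g in enumerate(names):
    x = i + rng.uniform(-0.08, 0.08, len(groups[g]))
    ax2.plot(x, groups[g], "o", mfc="none", color="black")
ax2.set_xticks(range(len(names)))
ax2.set_xticklabels(names)
ax2.set_title("Raw values (dot/strip plot)")

plt.tight_layout()
plt.show()
```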
The bar chart is all-pervasive, despite the very many other ways to display data graphically, as a web search and the large literature discussing graphical presentation will show (Cleveland, 1994; Robbins, 2013; Tufte, 2001; Weissgerber et al., 2019). The ubiquity of bar charts appears to be partly because that is what everyone uses, partly because of a lack of familiarity with other types, and partly because of a perception that "everyone understands a bar chart." Software plays a role: in Microsoft® Excel®, bar charts are both the default type and the easiest to create. Whilst bar charts can be a useful way to present data, other graphical types will often be more informative. Three alternatives to bar charts have been presented above: a line-and-point plot (Figure 1b), a dot-plot (Figure 2b, sometimes called a Cleveland dot-plot) and a dot-histogram (Figure 4c), which is also called a dot-density plot or Wilkinson's dot-plot (Wilkinson, 1999). The first two are generally most useful for displaying the results from an analysis, and the third for data exploration. Of these graph types, only the line-and-point plot can easily be made with Excel. However, Excel macros can be obtained to create the other types (Peltier, 2020; XLSTAT). Statistical software (Minitab, R's ggplot2, Genstat, SAS, etc.) and scientific graphing software (Sigmaplot, Origin, etc.) enable these other plot types to be constructed relatively easily.

Good graphical presentation takes time, experimentation and effort, both in choosing what form the graph should take and in the detail within it, such as symbols, colour and text labelling. It is important to think about what you are trying to show. Default settings can be poor: do not just accept the first thing you get. Try out several different graph types and settings. Sometimes, try different software packages: the choice of available graph types varies, and some packages make it very hard to get the detail right.

In summary, recommended practice is as follows:

- Include a statistical expert as part of your project team. Where this is not possible, consult with an expert.
- Develop a clear research question with specific rather than general aims.
- Choose treatments that can answer this question and identify which treatment comparisons are of interest.
- Use a statistically sound trial design, clearly defining the trial structure (blocks, plots, layout, etc.).
- Explore your data using summary statistics and appropriate graphs.
- Carry out a sound statistical analysis, using an analysis method that, as far as is practical, reflects the characteristics of the data. Check assumptions.
- Obtain estimates and associated measures of variability from that analysis.
- Compare the treatments as determined by their structure, the pre-determined comparisons and the research question.
- Properly describe the trial design and statistical analysis methods.
- If appropriate, present raw data summaries and/or figures.
- Present results derived from the analysis: estimates and associated measures of variability.
- Present the informative parts of your data exploration and formal analysis by including appropriate tables and figure types. Include legends that fully describe all parts of the table/figure.
- With p-values included as supporting evidence (significant or not), interpret the results, discussing the biologically interesting estimates/sizes of treatment effects/patterns in the data.
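Relating to the dot-plot (Cleveland dot-plot) mentioned above and the sorted-means presentation of Figure 2b, here is a minimal matplotlib sketch (hypothetical cultivar means and a hypothetical LSD, not taken from the paper) that plots means in increasing order with a single LSD bar rather than letters.

```python
# Sketch only: hypothetical cultivar means and LSD.
import matplotlib.pyplot as plt

means = {"Cv A": 4.2, "Cv B": 5.9, "Cv C": 3.6, "Cv D": 5.1, "Cv E": 4.8}
lsd = 0.7                                               # hypothetical 5% LSD from the ANOVA

ordered = sorted(means.items(), key=lambda kv: kv[1])   # sort means in increasing order
labels = [k for k, _ in ordered]
values = [v for _, v in ordered]

fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(values, range(len(values)), "o", color="black")
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel("Yield (t/ha)")                           # hypothetical units
# a single LSD bar, drawn away from the points, for judging pairwise differences
ax.errorbar([max(values)], [len(values) - 0.5], xerr=lsd / 2, fmt="none",
            capsize=4, color="black")
ax.annotate("LSD (5%)", (max(values), len(values) - 0.5),
            textcoords="offset points", xytext=(0, 6), ha="center")
plt.tight_layout()
plt.show()
```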
Use of alternative practices to the standard Routine often takes more time, effort and understanding, particularly for graphical presentation. This is at least partly because software makes using the Routine easier. However, the barriers to more informative practice are not substantial after some work to learn more about appropriate statistical methods and how to use a wider range of software. The reward is more, and higher quality, information for the same effort in collecting data. There is a vast literature relevant to the topics covered in this paper. The bibliography includes several of these publications, in addition to those referenced in the paper; many were used to inform the ideas presented in this paper.