Recurring controversies about P values and confidence intervals revisited
2014; Wiley; Volume: 95; Issue: 3; Language: English
10.1890/13-1291.1
ISSN 1939-9170
Topic(s): Bayesian Modeling and Causal Inference
Abstract: The use, abuse, interpretations, and reinterpretations of the notion of a P value have been a hot topic of controversy since the 1950s in statistics and several applied fields, including psychology, sociology, ecology, medicine, and economics. The initial controversy between Fisher's significance testing and the Neyman and Pearson (N-P; 1933) hypothesis testing concerned the extent to which the pre-data Type I error probability α can address the arbitrariness and potential abuse of Fisher's post-data threshold for the P value. Fisher adopted a falsificationist stance and viewed the P value as an indicator of disagreement (inconsistency, contradiction) between data x0 and the null hypothesis (H0). Indeed, Fisher (1925:80) went as far as to claim that "The actual value of p … indicates the strength of evidence against the hypothesis." Neyman's behavioristic interpretation of the pre-data Type I and II error probabilities precluded any evidential interpretation of the accept/reject rules for the null (H0), insisting that accepting (rejecting) H0 does not connote the truth (falsity) of H0. The last exchange between these protagonists (Fisher 1955, Pearson 1955, Neyman 1956) did nothing to shed light on these issues. By the early 1960s, it was clear that neither account of frequentist testing provided an adequate answer to the question (Mayo 1996): When do data x0 provide evidence for or against a hypothesis H?

The primary aim of this paper is to revisit several charges, interpretations, and comparisons of the P value with other procedures as they relate to their primary aims and objectives, the nature of the questions posed to the data, and the nature of their underlying reasoning and the ensuing inferences. The idea is to shed light on some of these issues using the error-statistical perspective; see Mayo and Spanos (2011).

A crucial difference between the P value and the Type I and II error probabilities is that the former is defined post-data, since it requires the observed test statistic τ(x0), whereas the latter are defined pre-data, since they only require n and the choice of α. Despite that, the P value is often viewed by practitioners as the observed significance level, and the accept/reject rules are recast as (Lehmann 1986): reject H0 if p(x0) ≤ α, accept H0 if p(x0) > α, because the data specificity of p(x0) seems more informative than the dichotomous accept/reject decisions. A crucial weakness of both the P value and the N-P error probabilities is the so-called large n problem: there is always a large enough sample size n for which any simple null hypothesis H0: μ = μ0 will be rejected by a frequentist α-significance level test; see Lindley (1957). As argued in Spanos (2013), there is nothing paradoxical about a small P value, or a rejection of H0, when n is large enough. The large n problem constitutes an example of a broader problem known as the fallacy of rejection: (mis)interpreting reject H0 (evidence against H0) as evidence for a particular H1; this can arise when a test has very high power, e.g., when n is large. A number of attempts have been made to alleviate the large n problem, including rules of thumb for decreasing α as n increases; see Lehmann (1986). Due to the trade-off between the Type I and II error probabilities, however, any attempt to ameliorate the problem renders the inference susceptible to the reverse fallacy, known as the fallacy of acceptance: (mis)interpreting accept H0 (no evidence against H0) as evidence for H0; this can easily arise when a test has very low power, e.g., when α is tiny or n is too small.
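To make the large n problem concrete, here is a minimal sketch (my own illustration, not taken from the paper) for the one-sided test of H0: μ = μ0 vs. H1: μ > μ0 in the simple Normal model with known σ, where τ(x0) = √n(x̄ − μ0)/σ and p(x0) = P(τ(X) ≥ τ(x0); H0). A fixed, substantively negligible discrepancy yields an arbitrarily small P value once n is large enough (fallacy of rejection territory), while a sizeable discrepancy can yield a large P value when n is small (fallacy of acceptance territory).

```python
# Minimal sketch of the "large n problem" for a one-sided z-test
# of H0: mu = mu0 vs H1: mu > mu0 with known sigma (illustrative only).
import numpy as np
from scipy.stats import norm

def p_value(xbar, n, mu0=0.0, sigma=1.0):
    """P value p(x0) = P(tau(X) >= tau(x0); H0) for the one-sided z-test."""
    tau = np.sqrt(n) * (xbar - mu0) / sigma
    return 1.0 - norm.cdf(tau)

# A tiny discrepancy of 0.01*sigma: the P value shrinks towards 0 as n grows.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,d}  xbar = 0.01  p(x0) = {p_value(0.01, n):.4f}")

# A sizeable discrepancy of 0.5*sigma with only n = 5: the P value stays large.
print(f"n = {5:>9,d}  xbar = 0.50  p(x0) = {p_value(0.50, 5):.4f}")
```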
These fallacies are routinely committed by practitioners in many applied fields. After numerous unsuccessful attempts, Mayo (1996) provided a reasoned answer to these fallacies in the form of a post-data severity assessment. Whether data x0 provide evidence for or against a particular hypothesis H depends crucially on the generic capacity (power) of the test to detect discrepancies from the null. This stems from the intuition that a small P value or a rejection of H0 based on a test with low power (e.g., a small n) for detecting a particular discrepancy γ provides stronger evidence for γ than one based on a test with much higher power (e.g., a large n). This intuition is harnessed by a post-data severity evaluation of accept/reject based on custom-tailoring the generic capacity of the test to establish the discrepancy γ warranted by data x0; see Mayo (1996). The severity evaluation is a post-data appraisal of the accept/reject and P value results with a view to providing an evidential interpretation; see Mayo and Spanos (2011). A hypothesis H (H0 or H1) "passes" a severe test Tα with data x0 if (i) x0 accords with H and (ii) with very high probability, test Tα would have produced a result that accords less well with H than x0 does, if H were false (Mayo and Spanos 2006). The notion of severity can be used to bridge the gap between the accept/reject rules and P values and an evidential interpretation, insofar as the result that H passes test Tα provides good evidence for inferring H (is correct) to the extent that Tα severely passes H with data x0.

The severity assessment allows one to determine whether there is evidence for (or against) inferential claims of the form μ1 = μ0 + γ, for γ ≥ 0, in terms of a discrepancy γ from μ0, which includes H0 as well as any hypothesis belonging to the alternative parameter space μ1 > μ0. It should be emphasized that what is important for interpretation purposes is not the numerics of the tail areas, but the coherence of the underlying reasoning. The severity evaluation remedies the key weakness of the P value, its neglect of the test's generic capacity, by taking that capacity into account to output the magnitude of the discrepancy γ warranted by data x0 and test Tα. This, however, necessitates considering alternative values of μ within the same statistical model. This is because N-P testing is inherently testing within the boundaries of a statistical model, as opposed to mis-specification (M-S) testing, which probes outside those boundaries, with the prespecified model representing the null; see Mayo and Spanos (2004).

The post-data severity evaluation in the case of a rejection of H0 outputs which inferential claims of the form μ > μ1 are warranted (high severity) or unwarranted (low severity) on the basis of test Tα and data x0. This provides the basis for addressing the statistical vs. substantive significance problem that has bedeviled practitioners in several fields since the 1950s. Once the warranted discrepancy γ* is established, one needs to confer with substantive subject matter information to decide whether this discrepancy is substantively significant or not. Hence, not only does statistical significance not imply substantive significance, but the reverse is also true: a statistically insignificant result can involve a substantively significant discrepancy; see Spanos (2010a) for an empirical example.
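As a concrete illustration (my own sketch, following the severity definitions in Mayo and Spanos 2006, not code from the paper), consider the one-sided z-test of H0: μ = μ0 vs. H1: μ > μ0 with known σ. After a rejection with observed statistic τ(x0), the severity of the claim μ > μ1 = μ0 + γ is SEV(μ > μ1) = P(τ(X) ≤ τ(x0); μ = μ1). The sketch shows that two tests rejecting with the same τ(x0), and hence the same P value, but very different n warrant very different discrepancies.

```python
# Post-data severity for the claim "mu > mu0 + gamma" after rejecting H0
# in a one-sided z-test with known sigma (illustrative sketch).
import numpy as np
from scipy.stats import norm

def severity_reject(tau_obs, n, sigma, gamma):
    """SEV(mu > mu0 + gamma) = P(tau(X) <= tau(x0); mu = mu0 + gamma)."""
    delta1 = np.sqrt(n) * gamma / sigma   # mean of tau(X) under mu = mu0 + gamma
    return norm.cdf(tau_obs - delta1)

# Both tests reject H0 with tau(x0) = 2.0 (P value ~ 0.023), but with very
# different sample sizes; the large-n rejection warrants only a tiny discrepancy.
for n in (25, 2500):
    for gamma in (0.05, 0.2, 0.4):
        sev = severity_reject(tau_obs=2.0, n=n, sigma=1.0, gamma=gamma)
        print(f"n = {n:5d}  gamma = {gamma:.2f}  SEV(mu > mu0 + gamma) = {sev:.3f}")
```

With n = 25 the claim μ > μ0 + 0.05 passes with severity about 0.96, whereas with n = 2500 the same borderline rejection warrants virtually no discrepancy of that size; this is the sense in which a rejection by a high-power test licenses only a smaller warranted discrepancy.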
The severity perspective calls into question the use of effect size measures, based on "distance functions" using point estimators, as flawed attempts to evaluate the warranted discrepancy that seek to eliminate the influence of the sample size n in an ad hoc way. Indeed, classifying effect sizes as "small," "medium," and "large" (Cumming 2011), without invoking subject matter information, seems highly questionable. In contrast, the post-data severity evaluation accounts for the effect of the sample size n by taking into consideration the generic capacity of the test to output the warranted discrepancy γ in a principled manner, and then lets the subject matter information make the call about substantive significance. More generally, in addition to circumventing the fallacies of acceptance and rejection, severity can be used to address other charges like the "arbitrariness" of the significance level, the one-sided vs. two-sided framing of hypotheses, the reversing of the null and alternative hypotheses, the effect size problem, etc.; see Mayo and Spanos (2011). In particular, the post-data severity evaluation addresses the initial arbitrariness of any threshold relating to the significance level or the P value by relying on the sign of τ(x0), and not on the critical value cα, to indicate the direction of the inferential claim that "passed." Indeed, this addresses the concerns about the dichotomy created by any threshold; see Spanos (2011).

Inference procedures associated with hypothesis testing and CIs share a common objective: to learn from data about the "true" (μ = μ*) statistical model M*(x) = {f(x; θ*)}, x ∈ ℝⁿ, yielding data x0. What about the questions posed? The question posed by a CI is: How often will a random interval [L(X), U(X)] cover the true value μ* of μ, whatever that unknown value μ* happens to be? The answer comes in the form of a (1 − α) CI using factual reasoning. The question posed by a test is: How close is the prespecified value μ0 to μ*?

A different set of issues arises with Akaike-type model selection procedures, which rank different models on goodness-of-fit/prediction grounds. Using goodness-of-fit/prediction as the primary criterion for "ranking the different models," however, can potentially undermine the reliability of any inference in two ways. First, goodness-of-fit/prediction is neither necessary nor sufficient for statistical adequacy, i.e., the validity of the model assumptions, such as NIID, for data z0. Statistical adequacy is what ensures that the actual error probabilities approximate closely the nominal error probabilities. Applying a 0.05 significance level test when the actual Type I error is closer to 0.60 can easily lead an inference astray! Indeed, the appropriateness of particular goodness-of-fit/prediction measures, such as the maximized log-likelihood ln f(z0; θ̂i), is questionable when the model in question is statistically misspecified; see Spanos (2007). One might object to this argument on the grounds that all inference procedures are vulnerable to statistical misspecification. Why single out Akaike-type model selection? The reason is that model validation based on thorough M-S testing to secure statistical adequacy (Mayo and Spanos 2004) is in direct conflict with such model selection procedures. This is because model validation will give rise to a choice of a particular model within Eq. 17 on statistical adequacy grounds, assuming Eq. 15 includes such an adequate model. This choice would render model selection procedures redundant and often misleading, because the highest ranked model will rarely coincide with the statistically adequate one, largely due to the second way model selection procedures could undermine the reliability of inference.
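The gap between nominal and actual error probabilities under misspecification can be illustrated with a small simulation (my own sketch with assumed settings, not an example from the paper): the one-sample t-test assumes NIID data, and when the independence assumption fails (AR(1) dependence with ρ = 0.75 is assumed here purely for illustration), the actual Type I error of the nominal 0.05 test is many times larger.

```python
# Nominal vs. actual Type I error of the 0.05-level t-test when the
# independence assumption is violated (AR(1) dependence), with H0: mu = 0 true.
import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(0)

def ar1_sample(n, rho, sigma=1.0):
    """Zero-mean stationary AR(1) series, so H0: mu = 0 is true by construction."""
    e = rng.normal(0.0, sigma, size=n)
    x = np.empty(n)
    x[0] = e[0] / np.sqrt(1.0 - rho**2)   # start in the stationary distribution
    for t in range(1, n):
        x[t] = rho * x[t - 1] + e[t]
    return x

def rejects(x, alpha=0.05):
    """Two-sided one-sample t-test of H0: mu = 0 at nominal level alpha."""
    n = len(x)
    tau = np.sqrt(n) * x.mean() / x.std(ddof=1)
    return abs(tau) > t_dist.ppf(1.0 - alpha / 2.0, df=n - 1)

n_reps, n, rho = 5000, 100, 0.75
actual = np.mean([rejects(ar1_sample(n, rho)) for _ in range(n_reps)])
print(f"nominal Type I error = 0.05, simulated actual Type I error ~ {actual:.2f}")
```

With these assumed settings the simulated rejection frequency comes out many times larger than 0.05, which is the sense in which statistical misspecification detaches the actual error probabilities from the nominal ones.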
As shown below, the ranking of the different models is inferentially equivalent to N-P testing comparisons with a serious weakness: model selection procedures ignore the relevant error probabilities. If the implicit error probabilities are too low/high, that could give rise to unreliable inferences. In addition, if no statistically adequate model exists within Eq. 17, M-S testing would confirm that and no choice would be made, but model selection procedures would nevertheless indicate a highest ranked model; see Spanos (2010b) for empirical examples.

At first sight, the Akaike model selection procedure's reliance on minimizing a distance function, combining the log-likelihood and the number of unknown parameters, seems to circumvent hypothesis testing and the controversies surrounding P values and accept/reject rules. Indeed, its simplicity and apparent objectivity made it a popular procedure among practitioners. Murtaugh (2013) brings out the connections between P values, CIs, and the AIC, and argues that: "Since P values, confidence intervals, and ΔAIC [difference of AIC] are based on the same statistical information, all have their places in modern statistical practice. The choice of which to use should be stylistic, dictated by details of the application rather than by dogmatic, a priori considerations." This argument is misleading because, on closer examination, minimizing the AIC does not circumvent these problems and controversies. Although proponents of AIC generally discourage comparisons of only two models, the ranking of the different models by the AIC is inferentially equivalent to pairwise comparisons among the different models in {Mi(z), i = 1, 2, …, m} using N-P testing, with a serious flaw: it ignores the relevant error probabilities (illustrated in the sketch below); see Spanos (2010b).

The paper focused primarily on certain charges, claims, and interpretations of the P value as they relate to CIs and the AIC. It is argued that some of these comparisons and claims are misleading because they ignore key differences in the procedures being compared, such as (1) their primary aims and objectives, (2) the nature of the questions posed to the data, as well as (3) the nature of their underlying reasoning and the ensuing inferences. In the case of the P value, the crucial issue is whether Fisher's evidential interpretation of the P value as "indicating the strength of evidence against H0" is appropriate. It is argued that, despite Fisher's maligning of the Type II error, a principled way to provide an adequate evidential account, in the form of the post-data severity evaluation, calls for taking into account the power of the test. The error-statistical perspective brings out a key weakness of the P value and addresses several foundational issues raised in frequentist testing, including the fallacies of acceptance and rejection as well as misinterpretations of observed CIs; see Mayo and Spanos (2011). The paper also uncovers the connection between model selection procedures and hypothesis testing, revealing the inherent unreliability of the former. Hence, the choice between different procedures should not be "stylistic" (Murtaugh 2013), but should depend on the questions of interest, the answers sought, and the reliability of the procedures.

I would like to thank D. G. Mayo for numerous discussions on issues discussed in this paper.
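To illustrate the "implicit error probabilities" point above, here is a small sketch (my own arithmetic for the standard nested-model case, in the spirit of Spanos 2010b, not a computation from the paper). For two nested models differing by k parameters, AIC prefers the larger model exactly when the likelihood-ratio statistic LR = 2(ln Lℓ − ln Ls) exceeds 2k; since LR is asymptotically chi-square with k degrees of freedom when the smaller model is valid, AIC selection behaves like an N-P test whose Type I error is fixed implicitly by k rather than chosen and controlled by the practitioner.

```python
# Implicit Type I error of AIC selection between nested models differing by k
# parameters: AIC picks the larger model iff LR > 2k, and LR ~ chi2(k)
# asymptotically under the smaller model, so the implicit alpha is P(chi2_k > 2k).
from scipy.stats import chi2

for k in (1, 2, 5, 10):
    implicit_alpha = chi2.sf(2 * k, df=k)
    print(f"k = {k:2d} extra parameters -> implicit alpha ~ {implicit_alpha:.3f}")
```

For k = 1 the implicit significance level is about 0.16, falling toward 0.03 by k = 10, so the de facto error probabilities vary with the parameter count and are never explicitly assessed; this is the sense in which the AIC ranking does not escape the testing framework but merely leaves its error probabilities implicit.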