Review, Open Access, Peer Reviewed

Some Common Misperceptions About P Values

2014; Lippincott Williams & Wilkins; Volume: 45; Issue: 12; Language: English

10.1161/strokeaha.114.006138

ISSN

1524-4628

Authors

Yuko Y. Palesch

Topic(s)

Bayesian Modeling and Causal Inference

Abstract

Yuko Y. Palesch, PhD, Department of Public Health Sciences, Medical University of South Carolina, Charleston. Originally published November 6, 2014. https://doi.org/10.1161/STROKEAHA.114.006138. Stroke. 2014;45:e244–e246.

A P value <0.05 is perceived by many as the Holy Grail of clinical trials (as with most research in the natural and social sciences). It is greatly sought after because of its (undeserved) power to persuade the clinical community to accept or not accept a new treatment into practice. Yet few, if any, of us know why 0.05 is so sacred. The literature abounds with answers to the question "What is a P value?" and with accounts of how the value 0.05 was adopted, more or less arbitrarily or subjectively, by R.A. Fisher in the 1920s. He selected 0.05 partly because of the convenient fact that in a normal distribution, the 5% cutoff falls around the second standard deviation away from the mean.1

However, little is written on how 0.05 became the standard by which many clinical trial results are judged. A commentary2 ponders whether this phenomenon is similar to the results of the monkeys-on-the-stairs experiment, in which a group of monkeys was placed in a cage with a set of stairs with some fruit at the top. When a monkey climbed the steps, blasts of air descended on it as a deterrent. After a while, any monkey that attempted to climb the steps was dissuaded by the group.
Eventually, the monkeys were gradually replaced by new monkeys, but the practice of dissuasion continued even when the deterrent was no longer applied. In other words, the new monkeys were unaware of the reason they were not supposed to go up the steps, yet the practice continued.

In the following, I first review what a P value is. Then, I address 2 of the many issues regarding P values in clinical trials. The first challenges the conventional need to show P<0.05 to conclude statistical significance of a treatment effect; the second addresses the misuse of P values in the context of testing group differences in baseline characteristics in randomized trials. Many excellent articles and books address these topics; nevertheless, the intention of this article is to revive and renew them (using less statistical language) to aid clinical investigators in planning studies and reporting their results.

What Is a P Value Anyway?

We equate P<0.05 with statistical significance. Statistical significance is about hypothesis testing, specifically of the null hypothesis (H0) that the treatment has no effect. For example, if the outcome measure is continuous, the H0 may be that the group difference in mean response (Δ) is equal to zero. Statistical significance is the rejection of the H0 based on the level of evidence in the study data. Note that failure to reject the H0 does not imply that Δ=0 is necessarily true, just that the data from the study provide insufficient evidence to show that Δ≠0.

To declare statistical significance, we need a criterion. The α (also known as the type I error probability or the significance level) is that criterion. The α does not change with the data. In contrast, the P value depends on the data. A P value is defined as the probability of observing a treatment effect (eg, a group difference in mean response) as extreme as or more extreme than (away from the H0) the one observed, if the H0 is true.
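This definition can be made concrete by simulation: generate many studies in which the H0 really is true, and count how often a group difference at least as extreme as some observed value arises. The numbers below (n=50 per group, an observed mean difference of 0.5 on a standardized outcome) are wholly hypothetical, chosen only to illustrate the idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: two groups of n=50, outcome ~ N(0, 1) under H0,
# and a supposed observed difference in mean response of 0.5.
n, observed_diff, sims = 50, 0.5, 100_000

# Simulate many studies in which H0 is true (both groups share the same mean)
# and record the group difference in means from each simulated study.
null_diffs = (rng.normal(0, 1, (sims, n)).mean(axis=1)
              - rng.normal(0, 1, (sims, n)).mean(axis=1))

# Two-sided P value: the fraction of null studies at least as extreme
# as the observed difference.
p_sim = float(np.mean(np.abs(null_diffs) >= observed_diff))
print(f"simulated two-sided P = {p_sim:.3f}")
```

With these assumed numbers, the simulated P value should land near the analytic value 2Φ(−0.5/√(2/50)) ≈ 0.012: a difference this large is rare if the H0 holds, so the H0 would be rejected at α=0.05.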
Hence, the smaller the P value, the more extreme or rare the observed data are, given the H0 to be true. The P value obtained from the data is judged against the α. If α=0.05 and P=0.03, statistical significance is achieved; if α=0.01 and P=0.03, it is not. Intuitively, if the P value is less than the prespecified α, the data suggest that the study result is so rare that it does not seem consistent with the H0, leading to rejection of the H0. For example, a P value of 0.001 indicates that if the null hypothesis were indeed true, there would be only a 1 in 1000 chance of observing data this extreme. So either unusual data have been observed, or the supposition regarding the veracity of the H0 is incorrect. Therefore, small P values (<α) lead to rejection of the H0.

In the Interventional Management of Stroke (IMS) III Trial, which compared the efficacy of intravenous tissue-type plasminogen activator (n=222) and intravenous tissue-type plasminogen activator plus endovascular treatment (n=434) for acute ischemic stroke, the α was specified as 0.05. The unadjusted absolute group difference in the proportion of good outcome, defined as a modified Rankin Scale score of 0 to 2, was 2.1% (40.8% in the endovascular group and 38.7% in the intravenous tissue-type plasminogen activator group).3 Under the normal theory test for binomial proportions, this yields a P value of 0.30, meaning that if the H0 were true (ie, the treatment did not work), there would be a 30% chance of observing a difference between the treatment groups at least as large as 2.1%. Because this is not so unusual, we fail to reject H0: Δ=0 and conclude that the difference of 2.1% is not statistically significant.

Thinking Outside the P<0.05 Box

Another interpretation of the α is that it is the probability of rejecting the H0 when in fact it is true. In other words, α is the false-positive probability.
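Returning briefly to the IMS III comparison above, the normal-theory (pooled) test for two binomial proportions can be sketched as follows. The counts (177/434 and 86/222) are my reconstruction from the reported percentages, not figures taken from the trial report, so treat them as approximate.

```python
from math import sqrt
from scipy.stats import norm

# Counts reconstructed from the reported percentages (an assumption):
x1, n1 = 177, 434  # good outcome, endovascular group (40.8%)
x2, n2 = 86, 222   # good outcome, IV tPA group (38.7%)

p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)                        # common proportion under H0
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # standard error under H0
z = (p1 - p2) / se

# One-sided P: chance, under H0, of a difference at least as large as observed.
p_one_sided = norm.sf(z)
print(f"difference = {p1 - p2:.3f}, z = {z:.2f}, one-sided P = {p_one_sided:.2f}")
```

With these reconstructed counts, the one-sided P value comes out close to the 0.30 quoted in the text: a 2.1% difference is quite consistent with a treatment that does not work.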
Typically, we choose an α of 0.05, and hence our desire to obtain P<0.05. But there is nothing magical about 0.05. Why not consider the risk (or cost)-to-benefit ratio when choosing the false-positive probability the research community is willing to tolerate for a particular study? For some studies, should one consider a more conservative (eg, 0.01) or more liberal (eg, 0.10) α? In a comparative effectiveness trial, where ≥2 treatments that are similar in cost and safety profile and have already been adopted in clinical practice are tested to identify the best one, one might be willing to risk a higher likelihood of a false-positive finding with an α of, say, 0.10. In contrast, if a new intervention to be tested is associated with high safety risks or is expensive, one would want to be sure that the treatment is effective by minimizing the false-positive probability to, say, 0.01. For a certain phase II clinical trial, where the safety and efficacy of a new treatment are still being explored, one can argue for a more liberal α to give the treatment a higher level of the benefit of the doubt, especially when the disease or condition has few, if any, effective treatment options. If an ineffective treatment should pass, it would be weeded out in a phase III trial with a more stringent significance level. Also, if the H0 is widely accepted as true (as, perhaps, in the case of hyperbaric oxygen treatment for stroke), one might wish to be more sure that rejecting the H0 implies the treatment is effective by using an α of 0.01 or even lower. Of course, this means a study with a larger sample size has to be conducted.

Although proposing anything greater than an α of 0.05 may be challenging, especially for studies to be submitted to the US Food and Drug Administration for New Drug Application approval, scientifically sound rationale and experienced clinical judgment should encourage one to think outside the box about the choice of the α.
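The point that a smaller α demands a larger study can be made concrete with the standard normal-approximation sample-size formula for comparing two proportions. The effect sizes below (40% vs 45% good outcome, 80% power) are illustrative assumptions, not values from any trial discussed here.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(p1: float, p2: float, alpha: float, power: float = 0.80) -> int:
    """Approximate per-group n for a two-sided two-proportion comparison."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the chosen alpha
    z_beta = norm.ppf(power)            # quantile corresponding to the power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Same hypothesized effect, three choices of alpha:
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha}: n per group = {n_per_group(0.40, 0.45, alpha)}")
```

Moving from α=0.10 to α=0.01 roughly doubles the required sample size for the same hypothesized effect and power, which is the trade-off a more stringent significance level buys.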
In doing so, one should ensure that scientific and ethical rationale, and not only the financial savings (from the smaller required sample size with a larger α), is the driving argument for proposing a larger α.

P Values in the Group Comparison of Baseline Characteristics in Clinical Trials

Primary publications of many clinical trials present, in the typical Table 1, a long list of baseline characteristics of the study sample and their summary statistics (eg, mean and standard deviation; median and interquartile range; or proportions). In addition, many include P values associated with statistical tests comparing the groups, or denote with a variety of asterisks the variables for which the comparison yields P<0.05, P<0.01, or P<0.001. Some authors assume that journal editors require them. The instructions to authors of prospective New England Journal of Medicine (NEJM) manuscripts state, under statistical methods:

For tables comparing treatment groups in a randomized trial (usually the first table in the trial report), significant differences between or among groups (i.e. P < 0.05) should be identified in a table footnote and the P value should be provided in the format specified in the immediately preceding paragraph. The body of the table should not include a column of P values. (http://www.nejm.org/page/author-center/manuscript-submission; obtained on August 18, 2014)

Meanwhile, according to the current Consolidated Standards of Reporting Trials (CONSORT) 2010 guidelines on the publication of clinical trials:

Unfortunately significance tests of baseline differences are still common; they were reported in half of 50 RCTs published in leading general journals in 1997. Such significance tests assess the probability that observed baseline differences could have occurred by chance; however, we already know that any differences are caused by chance. Tests of baseline differences are not necessarily wrong, just illogical.
Such hypothesis testing is superfluous and can mislead investigators and their readers. Rather, comparisons at baseline should be based on consideration of the prognostic strength of the variables measured and the size of any chance imbalances that have occurred. (http://www.consort-statement.org/checklists/view/32-consort/510-baseline-data; obtained on August 18, 2014)

The 2 are somewhat contradictory: one (NEJM) requires that statistical tests be performed on the baseline characteristics, and the other (CONSORT) discourages such tests.

Recall that P values are associated with hypothesis testing. The hypotheses tested for these baseline characteristics evaluate whether the differences between the groups are statistically significant, but that does not necessarily equate to clinical significance or relevance of the difference. Note that the P value is partially influenced by sample size. Generally, the larger the sample size, the easier it is to obtain a smaller P value from the data for the same difference. For any study with a large enough sample size, statistical significance can be achieved; however, the observed mean group difference is not necessarily clinically relevant. Conversely, one may note a clinically relevant difference in a baseline characteristic, but the P value from the test may not reach statistical significance with a small sample size. Therefore, an important clinical difference in a baseline characteristic may be overlooked.

Suppose that in a large (n=2100 per group) clinical trial of acute stroke designed to detect a difference of 5% in good outcome between 2 treatment groups, the typical Table 1 shows mean baseline systolic blood pressures of 125 and 120 mm Hg, each with a standard deviation of 15 mm Hg. The difference is 5 mm Hg, and the t test yields P<0.01. But one could hardly argue that this difference is clinically significant.
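The blood-pressure example can be checked directly from the summary statistics alone; this is a sketch using scipy's t test from summary statistics, with the hypothetical numbers given above.

```python
from scipy.stats import ttest_ind_from_stats

# Hypothetical Table 1 values from the text: mean SBP 125 vs 120 mm Hg,
# SD 15 mm Hg in each group, n = 2100 per group.
t_stat, p_value = ttest_ind_from_stats(mean1=125, std1=15, nobs1=2100,
                                       mean2=120, std2=15, nobs2=2100)
print(f"t = {t_stat:.1f}, P = {p_value:.1e}")
```

A 5 mm Hg difference that few would call clinically meaningful nevertheless yields a t statistic near 11 and a vanishingly small P value, purely because the sample size is large.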
In contrast, suppose a small study (say, n=40 in each group) to test intensive serum glucose control in patients with acute stroke had enrolled subjects with a history of diabetes mellitus: 20% in one group and 33% in the other. The χ2 test yields P=0.20, not a statistically significant difference at an α of 0.05. Nevertheless, a 13% difference in the proportion of subjects with a history of diabetes mellitus is likely to be a clinically important factor to consider in the analysis and interpretation of the primary outcome. In other words, P values are meaningless at best, and potentially misleading, for ascertaining whether the treatment groups are balanced in their baseline characteristics. The same issue of seeking statistical significance without consideration of clinical relevance also applies to analyses of outcome data. Many articles addressing this topic have been published in both statistical and clinical journals, so it will not be addressed further here.4,5

Recommendation

So what are we to do? Should we stop using P values altogether? No, but additional information, such as the prespecified minimum clinically important difference, the observed group differences, and their confidence intervals, will enable other investigators to better assess the level of evidence for or against the treatment effect, because confidence intervals provide a range of plausible values for the unknown true difference between the groups.

For example, in the IMS III Trial,3 the study investigators prespecified a minimum clinically important difference of 10%. The reported difference, adjusted for baseline National Institutes of Health Stroke Scale score per the study analysis plan, was 1.5%, with a 95% confidence interval of (−6.1%, 9.1%). Because the 95% confidence interval includes 0, the result is not statistically significant at α=0.05.
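A minimal sketch of such a confidence interval, again using counts I have reconstructed from the reported IMS III percentages (an assumption), is below. Note that this gives the unadjusted interval; the published (−6.1%, 9.1%) is adjusted for baseline NIHSS score, so the two differ slightly.

```python
from math import sqrt
from scipy.stats import norm

# Counts reconstructed from the reported percentages (an assumption):
x1, n1 = 177, 434  # good outcome, endovascular group
x2, n2 = 86, 222   # good outcome, IV tPA group

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2
# Unpooled standard error is conventional for a confidence interval
# (pooling applies to the hypothesis test, not the interval).
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = norm.ppf(0.975)  # 1.96 for a 95% interval
lo, hi = diff - z * se, diff + z * se
print(f"difference = {diff:.1%}, 95% CI = ({lo:.1%}, {hi:.1%})")
```

The interval contains 0, so the difference is not statistically significant at α=0.05; more usefully, it displays the whole range of group differences that remain plausible given the data.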
In addition, if the confidence interval had included 10%, the study result could be interpreted as inconclusive, because 10% would remain a plausible value for the true but unknown group difference; otherwise, the study could be viewed as negative. Such information allows readers to apply their knowledge, experience, and judgment to the importance and relevance of the study results, beyond whether they are statistically significant or not.

Conclusions

In conclusion, R.A. Fisher did not intend for the P value, much less P<0.05, to be the be-all and end-all of an experiment (or a clinical trial). He meant it as a guide to determine whether the study result is worthy of another look through replication. In spite of increasingly vocal criticism of our sole dependence on P values by the biostatistical and even some clinical communities, it will take some time to change the culture, but the change should be embraced.

Acknowledgments

I thank the 2 anonymous reviewers for their thorough and constructive comments to clarify and improve the discussions in this article.

Sources of Funding

This work was partially supported by National Institutes of Health (NIH) grants U01-NS087748 and U01-NS077304.

Disclosures

The author is a Data Monitoring Committee member for a study of Brainsgate Ltd and for a study by Biogen Idec Inc.

Footnotes

Correspondence to Yuko Y. Palesch, PhD, Department of Public Health Sciences, Medical University of South Carolina, 135 Cannon Street, Suite 303, Charleston, SC 29425. E-mail [email protected]

References

1. Cowles M, Davis C. On the origins of the 0.05 level of statistical significance. Am Psychol. 1982;37:553–558.
2. Kelly M. Emily Dickinson and monkeys on the stair, or what is the significance of the 5% significance level. Significance. 2013;10:21–22.
3. Broderick JP, Palesch YY, Demchuk AM, Yeatts SD, Khatri P, Hill MD, et al; Interventional Management of Stroke (IMS) III Investigators.
Endovascular therapy after intravenous t-PA versus t-PA alone for stroke. N Engl J Med. 2013;368:893–903.
4. Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems. Stat Med. 2002;21:2917–2930.
5. Senn S. Seven myths of randomisation in clinical trials. Stat Med. 2013;32:1439–1450.
December 2014, Vol 45, Issue 12. © 2014 American Heart Association, Inc. PMID: 25378423. Manuscript received July 14, 2014; accepted October 14, 2014; originally published November 6, 2014. Keywords: statistical data interpretation.
