Peer-Reviewed Article

Is that outcome different or not? The effect of experimental design and statistics on neurobehavioral outcome studies

2000; Elsevier BV; Volume: 70; Issue: 5; Language: English

10.1016/s0003-4975(00)02202-5

ISSN

1552-6259

Authors

David A. Stump, Robert L. James, John M. Murkin,

Topic(s)

Aortic aneurysm repair treatments

Abstract

Experimental design and statistics are two different but related arts. The art of experimental design is how you ask the question, and how you ask the question dictates which statistics you use to answer it. How you phrase the question depends upon your assumptions. For example, if your question is: “Are age and bypass time associated with negative outcomes after cardiac surgery?” then a common statistical approach is to make a tactical decision, treat both sets of numbers as continuous variables, and use a linear regression model. The problem is in the assumption that the variables are continuous. All of us know empirically that the 5 years between 40 and 45 do not impart the same risk or rate of deterioration as the 5 years between 70 and 75 (nor do we all age at the same rate; as Indiana Jones said, “it’s not the years, it’s the mileage”). Furthermore, during cardiopulmonary bypass (CPB), every minute past 100 carries a much greater risk than the preceding 100 minutes. Why? Because an extended bypass time generally indicates that some complication, usually bleeding, has occurred; it is not the time on bypass per se but what the time signifies, eg, some anomaly. Most surgeons have an average time to perform a three-vessel coronary artery bypass graft (CABG) with a relatively small standard deviation. The outliers are usually problem cases and skew the curve with an extended tail to the right. Therefore, age and CPB time are not continuous variables but categorical (young-old, long-short) and can be analyzed using a strategy like a 2 × 2 or 3 × 3 χ2 as in Table 1.
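The binning behind Table 1 can be sketched in a few lines of code. The sketch below groups hypothetical patient records by the cut-offs named in the text (65 years, 100 minutes of CPB) and tabulates the injury rate per cell; the records themselves are invented for illustration, not data from any study:

```python
# Tabulate neurologic injury by dichotomized age and CPB time.
# Cut-offs (65 years, 100 minutes) follow the text; the records are hypothetical.
records = [  # (age in years, CPB minutes, neurologic injury observed)
    (45,  80, False), (72, 130, True),  (68,  95, False), (77, 140, True),
    (51, 105, False), (70,  88, False), (81, 125, True),  (59,  70, False),
]

cells = {}  # (age > 65, CPB > 100 min) -> (injured count, total count)
for age, cpb, injured in records:
    key = (age > 65, cpb > 100)
    inj, total = cells.get(key, (0, 0))
    cells[key] = (inj + injured, total + 1)

for (old, long_cpb), (inj, total) in sorted(cells.items()):
    label = f"age {'>65' if old else '<=65'} y, CPB {'>100' if long_cpb else '<=100'} min"
    print(f"{label}: {inj}/{total} with neurologic injury")
```

The resulting counts are exactly the cells that a χ2 analysis of the Table 1 categories would operate on.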
Table 1. The Use of Age and Bypass Time as Categorical (vs Continuous) Variables in Reporting Neurologic Injury

                           Age less than 65 years      Age greater than 65 years
Less than 100-minute CPB   % with neurologic injury    % with neurologic injury
More than 100-minute CPB   % with neurologic injury    % with neurologic injury

In the past, we have seen many instances where data displayed as a continuous variable have shown no relationship with clinical reality, usually because the question was “phrased” improperly. For example, the question: “Is age a predictor of negative outcome after CPB?” is different from: “Do older people do worse than younger people after CPB?” Why? Because in one instance you are using the person’s age, ie, a continuous number, as opposed to grouping by category and using a cut-off score (older, over 65; risk factors such as postmenopausal, diabetic, hypertensive, retired, etc) (Fig 1). Unfortunately, or fortunately, depending on one’s viewpoint, there is no consensus among either researchers or statisticians on how to deal with the data in the field of neurobehavioral outcomes associated with cardiac surgery. One should remember that there is no one sensitive variable that adequately describes “normal” brain function [1: Stump DA. Selection and clinical significance of neuropsychologic tests. Ann Thorac Surg. 1995;59:1331-1335]. The result is that any reviewer can orchestrate a field of articles to support whatever conclusion they want (ie, warm bypass is good for you, pH-stat and dysautoregulation do not cause harm, etc) [2: Gill R, Murkin JM. Neuropsychologic dysfunction after cardiac surgery: what is the problem? J Cardiothorac Vasc Anesth. 1996;10:91-98]. How do we sort out truth? Look at the data.

In a typical drug trial, we expect everyone to have more or less the same response to the drug.
In other words, after treatment, every score shifts in the same direction. The reason there is a mean and standard deviation (SD) is that some people have a greater reaction than others, but they should all respond. This becomes a significant response if the overall curve shifts far enough (Fig 2).

Fig 2. The effect of the treatment in shifting the mean test scores (and the entire distribution) by 20 test score points in comparison with the control group.

This is not necessarily the case with cardiac surgery. A case in point is brain injury after cardiac surgery. If CPB caused significant brain injury in and of itself, then we would expect to see a consistent pattern of dysfunction, a “syndrome” similar, for example, to what we see after cardiac arrest. However, what we find after CPB is inconsistent. Some patients may have a right-sided motor dysfunction, others a verbal memory or visual abnormality, and yet others a speech disorder. Also, the “severity” of the lesion, ie, the social significance of the deficit, is not proportional to the volume of the injury. A large right frontal lobe infarction may go undetected, whereas a very small capsular lesion resulting in a right arm paralysis would be considered a catastrophic stroke. What this suggests is that something has happened to a subset of patients that has not happened to the others. For example, in Figure 3, a simulation, we see that after cardiac surgery most people exhibit one of three behaviors on a given test: improve, stay the same, or get worse. Because the brain is a heterogeneous organ, on any given test of a specific behavior, most patients (eg, 85%) will exhibit a practice effect due to learning and show a slight improvement in test score. This increase in the group mean performance score is offset by those patients (eg, 15%) who experience a brain injury, as evidenced by a decline in cognitive function.
What results is a small change in the mean and a large increase in the SD.

Fig 3. Simulated distributions of preoperative and 1-month postoperative test scores in both the drug treatment and placebo groups. The preoperative distribution is assumed to be normal (100 ± 20). Both the postoperative treatment and placebo groups, however, are assumed to be a mixture of two normal distributions: one subgroup experiencing a postoperative decline (−40 ± 10%) in test scores and the other subgroup experiencing an improvement (+10 ± 10%) in test scores. The overall distributions resulting from a mix of these two normal distributions are shown, with a higher proportion of the placebo group than the treatment group experiencing the decline in test scores.

It has been suggested that “regression to the mean” explains many of the neurobehavioral deficits exhibited after surgery [3: Browne SM, Halligan PW, Wade DT, Taggart DP. Cognitive performance after cardiac operation: implications of regression toward the mean. J Thorac Cardiovasc Surg. 1999;117:481-485; 4: Mee RW, Chua TC. Regression towards the mean and the paired sample t test. The American Statistician. 1991;45:39-41]. Regression to the mean is an important factor in physiology and anatomy; for example, very tall parents can expect to have relatively shorter children. For cognitive testing, regression to the mean would suggest that high-scoring individuals will do worse on tests the second time and that lower-scoring people would do better. What we see in reality is a greater improvement in the brighter group.

A psychological test is designed for maximum test-retest reliability, or repeatability. The idea is to minimize or control for practice effects.
Most individuals will get better after repeating a task; surgeons, for example, fortunately do not regress to an average level of performance with practice. So if a typical neurobehavioral test were given to normal subjects, most would show a modest improvement in performance. In fact, in order to reliably determine the influence of practice effects in a given study protocol, it is important to incorporate into the design of the study an age-, gender-, and education-matched control group, and to administer the same test battery at a similar interval to the study population. This was discussed at length in the first Consensus Statement [5: Murkin JM, Newman S, Stump DA, Blumenthal J. Statement of consensus on assessment of neurobehavioral outcomes after cardiac surgery. Ann Thorac Surg. 1995;59:1289-1295].

After cardiac surgery, we see a pattern where the variability in the group scores increases because some patients are uninjured and show improved scores due to practice effects, other patients’ scores stay the same, and a further subset shows marked deterioration in test scores. The result is that the overall group mean performance changes little. The “up goers” and the “down goers” offset each other, so that the mean stays the same but the SD increases due to the greater dispersion of scores (variance). This increase in the SD decreases our ability to detect group differences using parametric statistics (t tests). For example, if we were to compare two groups, “treated” and “untreated,” we may find no significant difference between the mean performance of the two groups when comparing their pre- to postoperative performance (change score) either between or within groups. However, there may actually be a difference between the groups. In this example of performance on a single test, 25% of the untreated group actually showed a greater than 20% decline in performance, something never seen in normal controls.
Furthermore, only 15% of the treated group showed a decline of 20% in performance (Fig 3). What that means is that 50 of 200 people were impaired in one group versus 30 of 200 in the other. In order for a t test to show a difference in the overall group score, the performance of the additional 20 impaired subjects would have to be bad enough to pull down the scores of the other 150 subjects. Furthermore, all one could say is that the group mean performance was better in the treated group (ie, the use of drug X resulted in an 8% lesser decline in trail-making A scores than in the control group), raising the important issue of clinical relevance.

However, there is another way to look at these numbers, eg, using χ2 analysis. The results could be described as follows: the treatment resulted in a 40% reduction in the number of patients with evidence of brain injury as exhibited by performance on this test (p = 0.02). One way to phrase a hypothesis would be: “Does the treatment have an effect on the overall cognitive test performance of a group of patients undergoing cardiac surgery?” (ie, group mean analysis). Another might be: “Does the treatment result in significantly fewer patients suffering brain injury due to cardiac surgery?” (ie, incidence analysis). Same data, different question, different statistic, different answer. Incidence analysis enables us to examine which specific patients are at risk and which factors are associated with that risk. For further reference, the issue of group mean versus incidence analysis is discussed at length in the second Consensus Statement [6: Murkin JM, Stump DA, Blumenthal JA, McKhann G. Defining dysfunction: group means versus incidence analysis—a statement of consensus. Ann Thorac Surg. 1997;64:904-905].

Which approach best serves our patients and the surgical team [7: Stump DA, Rogers AT, Hammon JW. Neurobehavioral tests are monitoring tools used to improve cardiac surgery outcome. Ann Thorac Surg. 1996;61:1295-1296]? As we have tried to demonstrate, the answer is clearly determined by the nature of the question to be answered. Our purpose is to improve outcome after cardiac surgery by making a safe operation safer. We do this by providing feedback to the team about the association between the patients’ risk factors and the modifiable risks that are associated with the circuits, anesthetics, and surgical techniques. We can serve our patients best by asking the right questions.

Addendum

Given the heterogeneous nature of brain anatomy and function, no single test adequately describes normal function, which is why a typical assessment battery includes 10 or more tests [1: Stump DA. Selection and clinical significance of neuropsychologic tests. Ann Thorac Surg. 1995;59:1331-1335; 7: Stump DA, Rogers AT, Hammon JW. Neurobehavioral tests are monitoring tools used to improve cardiac surgery outcome. Ann Thorac Surg. 1996;61:1295-1296]. Every effort should be made to maximize individual test independence, to minimize overlap between cognitive domains [5: Murkin JM, Newman S, Stump DA, Blumenthal J. Statement of consensus on assessment of neurobehavioral outcomes after cardiac surgery. Ann Thorac Surg. 1995;59:1289-1295]. Statistically, a categorical approach (deficit/no-deficit) using a nonparametric analysis is more powerful.

There are several ways neurobehavioral deficit studies can be analyzed.
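The incidence figures quoted earlier (50 of 200 untreated vs 30 of 200 treated patients impaired) illustrate the χ2 approach concretely. Below is a minimal check in Python, using the standard Yates-corrected 2 × 2 χ2 statistic and its 1-df tail probability; the counts come from the worked example above, and nothing else is assumed:

```python
import math

def chi2_yates(a, b, c, d):
    """Yates-corrected chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    num = max(abs(a * d - b * c) - n / 2, 0)
    return n * num ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_1df(x):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

# Untreated: 50 of 200 impaired; treated: 30 of 200 impaired.
stat = chi2_yates(50, 150, 30, 170)
p = p_value_1df(stat)
reduction = (50 / 200 - 30 / 200) / (50 / 200)

print(f"chi2 = {stat:.2f}, p = {p:.3f}, relative reduction = {reduction:.0%}")
```

The statistic comes out near 5.6 with p just under 0.02, matching the 40% reduction and p = 0.02 quoted in the text, whereas a group-mean t test on the same scores can easily miss the difference.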
The choice of statistical test determines what question is being answered. Analysis of variance designs, including t tests, repeated-measures analyses, etc, test for shifts in the mean effect between groups. In contrast, χ2 2 × 2 tables can test for a difference in the proportion of patients experiencing substantial deficits (ie, >20% decline from preoperative scores).

A power analysis was performed using simulated data to compare the ability of a t test (the simplest of the analysis of variance designs) versus a 2 × 2 χ2 test to detect pre- to postoperative changes in total test scores. Total test scores were the sum of 10 different independent tests. For the 2 × 2 χ2 tests, a patient was defined as having an overall deficit if he scored a 20% decline in test score in two or more tests [7: Stump DA, Rogers AT, Hammon JW. Neurobehavioral tests are monitoring tools used to improve cardiac surgery outcome. Ann Thorac Surg. 1996;61:1295-1296]. To simplify the power calculations, these tests were assumed independent in that none of the brain regions evaluated by the several tests overlapped. Thus, the effects of any specific neurologic lesion would be detectable on only one of the tests. Although this requirement of test independence is perhaps experimentally unrealistic, it was necessary to avoid having to make elaborate assumptions about correlations between each pair of tests.

The power analysis was based upon the following simplifying assumptions. (1) Each of the 10 tests was independent (described above). (2) The preoperative test scores were normally distributed (mean ± SD: 100 ± 20). (3) All postoperative test scores improved 10 ± 10% due to test learning. (4) In the placebo group, for each test taken, there was an additional 9.64% probability of a test score deficit of −40 ± 10%, giving a net individual test score decrement of −30% (10% learning minus 40% deficit).
The 9.64% probability of a test score deficit for each of the 10 different tests results in a 25% probability that a patient will be classified as having an overall deficit (ie, two or more of the test scores will simultaneously show a 20% decline, based upon the binomial distribution). (5) Likewise, in the treatment group, there was a 6.95% probability of an individual test score deficit (−40 ± 10% test score decrease), giving a 15% probability of being classified as having an overall neurobehavioral deficit. (6) Sample size: 200 patients receiving the placebo and 200 patients receiving the treatment.

Based on 2,000 data simulations, the above χ2 analysis found significant treatment effects compared with placebo 66% of the time, but the t test found significance only 6.7% of the time.
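The binomial arithmetic behind assumptions (4) and (5) is easy to verify: given a per-test deficit probability p across 10 independent tests, the chance of being classified with an overall deficit (two or more affected tests) is the upper tail of a binomial distribution. A quick check in Python:

```python
from math import comb

def overall_deficit_prob(p, n_tests=10, threshold=2):
    """P(at least `threshold` of `n_tests` independent tests show a deficit),
    for per-test deficit probability p (binomial upper tail)."""
    return sum(comb(n_tests, k) * p ** k * (1 - p) ** (n_tests - k)
               for k in range(threshold, n_tests + 1))

print(f"placebo:   {overall_deficit_prob(0.0964):.3f}")   # ~0.250
print(f"treatment: {overall_deficit_prob(0.0695):.3f}")   # ~0.150
```

The per-test probabilities of 9.64% and 6.95% reproduce the stated 25% and 15% overall-deficit rates, which in turn drive the large power advantage of the incidence (χ2) analysis over the group-mean t test in the simulations.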
