Designing and interpreting HIV noninferiority trials in naive and experienced patients
2008; Lippincott Williams & Wilkins; Volume 22; Issue 8; Language: English
DOI: 10.1097/qad.0b013e3282f5556d
ISSN: 1473-5571
Topic(s): HIV/AIDS Research and Interventions
Abstract

Introduction

The efficacy of HAART has improved over the past 10 years, with the introduction of more potent drugs with improved safety profiles and lower pill counts [1]. In the most recent clinical trials of naive patients, 48-week HIV RNA suppression rates below 50 copies/ml of over 80% have been recorded [2,3]. Similar improvements in viral suppression rates have been reported from clinic cohorts [4]. Where combinations of novel agents have been used, short-term efficacy rates for treatment-experienced patients have now risen to levels similar to those seen in naive patients [5]. Studies of treatment-naive patients suggest an upper limit of virological efficacy, beyond which the inclusion of new or additional drugs may provide only small incremental benefits. For example, in the ACTG 5095 trial, there was no additional benefit for naive patients given zidovudine/abacavir/lamivudine/efavirenz versus the control arm of zidovudine/lamivudine/efavirenz [6]. Given these developments, any superiority trial designed to demonstrate improved efficacy with a new drug combination would have to be large. Consequently, the ‘noninferiority’ trial is emerging as a new standard design for HIV drug development among antiretroviral-naive individuals. These trials are designed to demonstrate that a new treatment shows efficacy not substantially worse than the current standard, within a prespecified margin (also called the noninferiority margin, or ‘delta’). The new treatment may offer other benefits, such as simplified dosing, an improved safety profile or a lower incidence of drug resistance at treatment failure [7,8]. This outcome is best illustrated with an example from the EPV20001 trial (results shown in Table 1), in which efficacy rates of 64% for lamivudine once daily and 63% for lamivudine twice daily (both in combination with zidovudine and efavirenz) were observed.
This result is expressed as a point estimate (a 1% benefit for the lamivudine once daily arm) and a 95% confidence interval (between a 7.1% disadvantage and an 8.9% advantage for the lamivudine once daily arm). In this case, the lower 95% confidence limit of a 7.1% disadvantage for the once-daily arm is smaller than the 12% delta set in the original protocol, so lamivudine once daily was declared noninferior to the then standard treatment of lamivudine twice daily (Table 1). The benefit of lamivudine once daily was thus more convenient dosing with no substantial loss of efficacy compared with the control arm of twice-daily lamivudine dosing. This outcome is also shown as scenario C in Fig. 1.

Table 1: Summary of phase 3 company-sponsored noninferiority trials (2000–2007): statistical design and 48-week efficacy [intent-to-treat (ITT) – time to loss of virological response (TLOVR) analysis].
Fig. 1: Examples of the four main outcomes from noninferiority trials.

There is, however, much inconsistency in the design and interpretation of HIV noninferiority trials [7]. In this review, we describe the design of these studies and their interpretation, and discuss the implications of this design for the choice of endpoints and sample size calculations. Our aim is both to educate and to review the use of such trials in the HIV setting.

Review methods

The present review includes data from company-sponsored Phase III noninferiority trials conducted between 2000 and 2007, defined as Phase III/IV trials conducted with company sponsorship for both trial conduct and study drugs. Only the company-sponsored trials were included, as they normally follow US Food and Drug Administration (FDA) guidelines on design and reporting of HIV RNA endpoints [9], and could therefore be interpreted in a standardized way. We used a MEDLINE search with the search terms of each antiretroviral, followed by ‘clinical trial’ (e.g. ‘lamivudine clinical trial’).
In addition, we searched the FDA product labels for registrational trials of each approved antiretroviral, and searched for abstracts on clinical trials presented at the following conferences: Annual Conference on Retroviruses and Opportunistic Infections, International Conference on Antimicrobial Agents and Chemotherapy (ICAAC), European AIDS Clinical Society, International AIDS Conference (including the IAS Pathogenesis Conference) and International Conference on Drug Therapy in HIV Infection. The search identified 17 randomized trials with a noninferiority design that used an endpoint of HIV RNA suppression below either 400 or 50 copies/ml (the design of these trials and summary efficacy data are shown in Table 1). In two cases [10,11], the trials were powered to show equivalence but can be interpreted in terms of noninferiority. All but three of the trials [11–13] were conducted in treatment-naive patients. The BMS-034 trial (atazanavir versus efavirenz in naive patients) was excluded from this analysis owing to problems in validation of the HIV RNA assays used [14]. There were four additional trials that were powered on noninferiority but used continuous log reduction as the primary endpoint [15–18].

The design of noninferiority trials

Standard superiority trials are designed to ensure that the smallest true difference between the new and standard treatments thought to be clinically relevant has a high chance of being detected as statistically significant. For clinical trials of antiretrovirals, the most commonly used endpoint is HIV RNA suppression below 50 copies/ml without treatment discontinuation. Decisions about the likely efficacy of the new treatment are generally based on a standard hypothesis test and the resulting p-value (with confidence intervals provided to aid the clinical interpretation of the findings). In contrast, noninferiority trials are designed to show that a new treatment is not substantially inferior to the current standard.
Instead of focussing on the results of a statistical test, emphasis is placed on ensuring that the lower limit of the confidence interval for the observed difference in outcomes between the two regimens does not cross the prespecified ‘delta’ [19]. Table 2 shows sample sizes required to show noninferiority for an experimental treatment versus control, given percentage response rates in the control arm ranging from 50 to 90%, and a delta of either 10 or 12%, consistent with current FDA and European guidelines [8,9,19]. These calculations assume that the true response rate is equal in the experimental and control groups. In some trials, the response rate is assumed to be slightly lower in the experimental arm, which leads to larger sample sizes – this design was used in the KLEAN trial [20].

Table 2: Sample sizes per arm for noninferiority trials, by power, delta and expected response rate in the control arm; the efficacy of the new drug is assumed to be equivalent for the purposes of calculating sample sizes.

The choice of an appropriate delta may be problematic: delta is usually chosen to reflect the largest difference in outcomes between the arms that could reasonably be assumed to be clinically equivalent [19]. A few trials have been designed with a high delta – for example, the BI 1182.33 trial was designed to show that tipranavir/ritonavir was no more than 15% worse than lopinavir/ritonavir [21]; the design of the Abbott 418 trial of once versus twice daily lopinavir/ritonavir also included a delta of 15% [22] (Table 1). Trials powered with a delta of this size may not, however, be able to exclude the possibility that a true difference exists between the arms that would be considered clinically significant.
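Sample sizes of the kind shown in Table 2 follow from a standard normal-approximation formula for comparing two proportions. The sketch below assumes equal true response rates in both arms and a one-sided alpha of 0.025 (matching a two-sided 95% confidence interval); the exact method behind Table 2 is not specified in this review, so published values may differ slightly.

```python
from math import ceil
from statistics import NormalDist

def noninferiority_n_per_arm(p_response, delta, power, alpha=0.025):
    """Approximate patients per arm for a noninferiority trial on a
    binary endpoint, assuming the true response rate is the same in
    the experimental and control arms. alpha is one-sided."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    variance = 2 * p_response * (1 - p_response)  # variance term for the difference
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# For instance, with an expected response rate of 80% and a delta of 5%:
print(noninferiority_n_per_arm(0.80, 0.05, power=0.80))  # 1005
print(noninferiority_n_per_arm(0.80, 0.05, power=0.90))  # 1345
```

Note how quickly the required size grows as delta shrinks: halving delta roughly quadruples the sample size, which is the economic constraint discussed in the text.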
For example, if a noninferiority trial showed that the efficacy of first-line tenofovir/emtricitabine/efavirenz was 80% but that the true efficacy of a new combination treatment may be as low as 65%, would this convince clinicians to choose the new combination over the current standard, even if it were deemed to be noninferior? In addition, the noninferiority margin (delta) should be smaller if the control arm is already highly efficacious (i.e. with response rates above 90%). In future, differences between treatment arms of less than 10% may be considered clinically significant, and therefore a delta of 10–12% could be considered too large. For example, in the Gilead 934 trial, there was a 7% advantage in efficacy of tenofovir/emtricitabine/efavirenz over the control arm of zidovudine/lamivudine/efavirenz at week 48 [2], driven mainly by higher rates of anaemia and gastrointestinal toxicities in the zidovudine arm. As a result, zidovudine is no longer recommended for first-line use in Europe [23]. Once the delta falls below 10%, however, Phase III trial sample sizes could rise to levels where the economics of HIV drug development become unsustainable. For example, if a new experimental drug were compared with tenofovir/emtricitabine/efavirenz with a predicted success rate of 80%, and a delta of 5%, the trial sample size would be 1005 patients per arm for a power of 80%, and 1345 patients per arm for a power of 90%.

Intent-to-treat and per protocol analysis

When performing a superiority trial, the primary analysis uses the intent-to-treat population, including all patients randomized, irrespective of whether they have taken their study medication as randomized. Such an approach tends to bias the results towards the null hypothesis (i.e. no difference in outcome between the treatment arms).
Thus, if there is still a difference in outcome when the trial is analysed in this way, it is likely that the real difference (if all patients were able to take the drugs as planned) would be greater. Unlike a superiority trial, noninferiority trials usually favour a ‘per protocol’ analysis. This analysis excludes patients with major protocol violations, such as not receiving at least one dose of study drug or using a disallowed medication in the background regimen [8]. By excluding these patients (who would be expected to make the two groups more alike), it is thought that analysis of the per protocol population may be more likely to show differences between treatments. For noninferiority trials, demonstration that the new treatment is noninferior in both the intention-to-treat and per protocol populations is usually required. There is, however, no standard predefined list of the exclusions for a per protocol analysis. Some per protocol analyses only exclude patients with the strongest protocol violations, such as not taking at least one dose of randomized treatment [12,24], whereas other analyses exclude from the analysis all individuals experiencing a nonvirological endpoint [13]. In addition, in order for the results from the per protocol analysis to be of value, it is important that a large proportion of the patients randomized in the trial fall into the per protocol population – demonstration of noninferiority in only a small minority of randomized patients is unlikely to convince many clinicians that the new regimen is genuinely noninferior. Furthermore, poorly conducted trials also tend to obtain results that are biased towards ‘no effect’. Therefore, both careful support for patient adherence and attendance, including documentation to support this, and evidence of strict adherence to the study protocol, are of utmost importance in a noninferiority trial.
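The relationship between the two analysis populations can be sketched as a simple filter over the randomized patients; the field names below are illustrative only, not taken from any trial's actual data dictionary.

```python
from dataclasses import dataclass

@dataclass
class Patient:
    took_any_study_dose: bool         # received at least one dose of randomized drug
    used_disallowed_medication: bool  # prohibited drug in the background regimen

def per_protocol_population(itt_population):
    """Derive the per protocol population from the intent-to-treat
    population by removing patients with the major protocol
    violations described above."""
    return [p for p in itt_population
            if p.took_any_study_dose and not p.used_disallowed_medication]

# Three randomized (ITT) patients; only the first has no major violation
itt = [Patient(True, False), Patient(False, False), Patient(True, True)]
print(len(itt), len(per_protocol_population(itt)))  # 3 1
```

In a well-conducted trial the second number should stay close to the first; a per protocol population much smaller than the ITT population undermines the noninferiority conclusion, as the text notes.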
The choice of endpoints for noninferiority trials

HIV RNA suppression <50 copies/ml has been adopted as the primary objective of antiretroviral treatment in the most recent guidelines, reflecting the sensitivity of most currently available commercial assays. The FDA TLOVR (time to loss of virological response) algorithm has been used to analyse HIV RNA data from registrational trials [9]. This algorithm classifies patients either as virological successes while taking randomized treatment (with HIV RNA below the detection limit on two consecutive study visits around the 48-week timepoint), or as treatment failures, divided into three categories: (i) virological failure – either failure to suppress HIV RNA, or virological rebound after initial suppression; (ii) discontinuation of randomized study treatment for adverse events or death; and (iii) discontinuation of randomized study treatment for other reasons (for example, withdrawal of consent or loss to follow-up). When using composite endpoints such as this, it is assumed that all components of the endpoint are equally detrimental – yet it can be argued that discontinuation of treatment owing to withdrawal of consent may have less clinical relevance for future virological suppression than a virological failure. One particular limitation of composite endpoints such as these is that it can be difficult to interpret intent-to-treat analyses where the virological and nonvirological endpoints are imbalanced across treatment arms. For example, in the MERIT trial the experimental treatment of zidovudine/lamivudine/maraviroc showed an excess of virological failure endpoints, whereas the zidovudine/lamivudine/efavirenz arm showed an excess of discontinuations for adverse events [24]. In a recent survey, only 27% of endpoints in trials of naive patients were due to virological failure, with the remaining 73% due to discontinuation of study medication (32% for adverse events and 41% for loss to follow-up) [25].
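The composite classification just described can be sketched as follows. This is a deliberate simplification of the FDA TLOVR algorithm, which additionally requires confirmation at consecutive visits and handles rebound after initial suppression.

```python
from enum import Enum

class Tlovr(Enum):
    SUCCESS = "virological success on randomized treatment"
    VIROLOGICAL_FAILURE = "virological failure"
    DISCONTINUED_AE = "discontinued for adverse event or death"
    DISCONTINUED_OTHER = "discontinued for other reasons"

def classify(on_randomized_treatment, suppressed_at_week48, stopped_for_ae_or_death):
    """Simplified TLOVR-style classification of a single patient:
    discontinuation (for any reason) counts as failure; patients
    still on randomized treatment are split by suppression status."""
    if not on_randomized_treatment:
        return Tlovr.DISCONTINUED_AE if stopped_for_ae_or_death else Tlovr.DISCONTINUED_OTHER
    return Tlovr.SUCCESS if suppressed_at_week48 else Tlovr.VIROLOGICAL_FAILURE

print(classify(True, True, False))   # Tlovr.SUCCESS
print(classify(False, False, True))  # Tlovr.DISCONTINUED_AE
```

The sketch makes the composite nature of the endpoint explicit: three of the four branches count against the regimen, which is why imbalances between virological and nonvirological reasons for failure can complicate interpretation.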
Given that trial outcomes can be dominated by nonvirological endpoints when analysed by the FDA TLOVR algorithm, it is important for HIV clinical trials to be re-analysed including only virological endpoints. The FDA guidelines state that, in addition to analyses using the TLOVR algorithm, an analysis comparing only the documented virological failures should be presented, and any inconsistencies between the different analyses should be explored [9]. This is also called a ‘nonvirological failures censored’ analysis: data from patients are censored after discontinuation for reasons other than virological failure.

Interpreting the results of noninferiority trials

Figure 1 shows the four most common outcomes of a noninferiority trial. This figure shows the difference in efficacy between the experimental and control arms, with 95% confidence intervals. Examples of these results for HIV trials are also shown in Table 1. Inferiority shown despite a noninferiority design (scenario A in Fig. 1). One example of this outcome was the CONTEXT trial of fosamprenavir/ritonavir versus lopinavir/ritonavir [15]. Although the trial had been powered (using an endpoint of continuous log reduction in HIV RNA) to demonstrate noninferiority of the fosamprenavir arm, the 95% confidence interval for the true difference between the two arms did not overlap zero, suggesting that lopinavir actually out-performed fosamprenavir; a formal statistical test confirmed the inferiority of fosamprenavir in this trial. In situations such as these, the same statistical procedures as for unexpected superiority can be used. Failure to show noninferiority (scenario B). This result occurs when the 95% confidence interval for the true difference between the treatment arms overlaps zero, but the lower limit of this confidence interval falls below the prespecified delta value.
This result was seen for tenofovir versus stavudine in the Gilead 903 trial [26], nevirapine versus efavirenz in the 2NN trial [27], and maraviroc versus efavirenz in the MERIT trial (which was reported with a one-sided confidence interval) [24]. A common misinterpretation of the results from these trials is that, because the confidence interval for the difference overlaps zero, the arms show similar efficacy. The correct interpretation is that the new treatment failed to show noninferiority, and so it should not be accepted as an alternative to the current standard of care. Demonstration of noninferiority (scenario C). This result occurs when the 95% confidence interval for the true difference in outcomes between treatment groups overlaps zero, and the lower limit of this confidence interval does not extend beyond the predefined delta value (a disadvantage of 10 or 12%). Note that the observed efficacy of the experimental arm needs to be very close to that of the control arm (typically within 2–3%) for noninferiority to be demonstrated with these levels of power and noninferiority margin. Also, a new treatment may not be accepted even if noninferiority is shown. For example, in the BI 1182.33 trial, tipranavir/ritonavir 500/200 mg twice daily was shown to be noninferior to lopinavir/ritonavir, but the tipranavir arm was closed down by the Data Safety Monitoring Board owing to excess elevations in liver enzymes [21]. Superiority shown despite a noninferiority design (scenario D). This result has been seen in the Gilead 934 trial of tenofovir/emtricitabine/efavirenz versus zidovudine/lamivudine/efavirenz [2], and also in the TITAN trial of darunavir/ritonavir versus lopinavir/ritonavir [12]. In each case, the trials had been designed to show noninferiority, but the 95% confidence intervals for the true difference between arms did not overlap zero, suggesting superiority.
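The four scenarios can be expressed directly in terms of the 95% confidence interval for the difference in response rates (experimental minus control) and the margin delta. The sketch below uses an unadjusted Wald interval, whereas published trials may use stratified or exact methods; the illustrative counts are invented to resemble the EPV20001 result, not taken from it.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(x_new, n_new, x_ctrl, n_ctrl, level=0.95):
    """Unadjusted Wald confidence interval for the difference in
    response rates, experimental minus control."""
    p1, p2 = x_new / n_new, x_ctrl / n_ctrl
    se = sqrt(p1 * (1 - p1) / n_new + p2 * (1 - p2) / n_ctrl)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return (p1 - p2) - z * se, (p1 - p2) + z * se

def classify_outcome(lower, upper, delta):
    """Map a confidence interval to the scenarios of Fig. 1."""
    if lower > 0:
        return "D: superiority shown"
    if upper < 0 and lower <= -delta:
        return "A: inferiority shown"
    if upper < 0:
        return "significantly worse, yet within the noninferiority margin"
    return ("C: noninferiority shown" if lower > -delta
            else "B: noninferiority not shown")

lower, upper = wald_ci(179, 280, 176, 280)  # hypothetical counts
print(classify_outcome(lower, upper, delta=0.12))  # C: noninferiority shown
```

The third branch covers the rarer outcome in which the interval lies entirely below zero yet within the margin, an outcome seen in bioequivalence studies rather than Phase III HIV trials.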
The European guidelines on noninferiority trials state that it is acceptable to calculate the P-value associated with a test of superiority within a noninferiority trial, provided that the superiority finding is not simply a consequence of a lower rate of discontinuations for adverse events in the experimental arm [28]. It is important that the test of superiority be performed on the intention-to-treat population. For cases like these, the treatment arms are first compared for noninferiority, and then for superiority, in two separate statistical tests [8,28]. Another potential outcome is that a treatment is shown to be significantly worse than the control, yet the confidence interval for the difference still falls within the limits for noninferiority (i.e. the interval lies entirely below zero but does not cross the noninferiority margin). We are not aware of an example of this in Phase III HIV efficacy trials, although this outcome has been seen in bioequivalence studies [29]. The outcomes described above may arise when a trial has been designed to show noninferiority. A further possible outcome, where noninferiority is demonstrated in a study originally designed to show superiority, is not recommended by European guidelines [8,28].

Other issues for noninferiority studies

The validity of the control arm

There need to be well controlled data to show that the comparator arm is in itself an effective treatment, with proven benefits over placebo or the current standard of care. In addition, the performance of the comparator arm needs to be similar to that seen in previous reference trials [8,28]. This is to minimize the chance of the ‘outcome drift’ that may be seen if successive noninferiority trials are performed that each use as a comparator the regimen shown to be noninferior in the previous trial. For the treatment of naive patients, a recent meta-analysis has suggested gradual improvements in the efficacy of HAART over the past 10 years [1].
It is important to compare new treatments with control arms that have shown the strongest efficacy and have not been superseded. For example, once lopinavir/ritonavir showed efficacy benefits over nelfinavir in the Abbott 863 trial [30], nelfinavir was no longer used as a control arm. In treatment-experienced patients, new treatments with proven efficacy benefits are added to the optimized background regimen, which should lead to incremental improvements in the efficacy of control arms. The efficacy of control arms improved between the TORO, POWER and DUET trials, as the use of first enfuvirtide and then darunavir/ritonavir was allowed in the control arms [31–34]. There are examples of noninferiority trials with control arms that are either not approved by regulatory authorities or no longer recommended in international treatment guideline documents. For example, the SOLO trial [35] was powered to show noninferiority of fosamprenavir/ritonavir versus a control arm of nelfinavir, which is no longer recommended for first-line treatment. When trials are designed, it is difficult to predict the future standard of care at the time the results will be presented. In situations like this, it may be necessary to conduct follow-up trials to re-evaluate drugs against new standards of care. The KLEAN trial [20] re-assessed the efficacy of fosamprenavir/ritonavir against a more reliable control, lopinavir/ritonavir, after the results of the SOLO trial were assessed. The TITAN trial compared darunavir/ritonavir 600/100 mg twice daily versus lopinavir/ritonavir in treatment-experienced patients [12]. Figure 2 presents a systematic review of reference trials, showing that the control arm of the TITAN trial, lopinavir/ritonavir, performed in a similar way to reference trials in experienced patients. This approach can be used to demonstrate the validity of the efficacy seen in the control arm.
Lopinavir/ritonavir showed significantly higher efficacy than nelfinavir for naive patients in the Abbott 863 trial [30], and significantly higher efficacy than control protease inhibitors for experienced patients in the Abbott 888 trial [36]. The efficacy of the lopinavir/ritonavir arm of the TITAN trial appears similar to the efficacy of lopinavir/ritonavir in other trials of experienced patients [15–17,37,38], which supports the noninferiority conclusion for darunavir/ritonavir in the TITAN trial.

Fig. 2: HIV RNA <50 copies/ml at week 48 in the TITAN trial compared with historical controls. DRV/rtv, darunavir plus ritonavir; ITT, intent-to-treat; LPV/rtv, lopinavir plus ritonavir; VL, viral load; TLOVR, time to loss of virological response.

Noninferiority trials in treatment-experienced patients

The standard design for trials in highly experienced patients has been to use an ‘optimized background’ arm for all patients and then to randomize to use or not use a new experimental drug. These trials have been designed to show an efficacy benefit for the new treatment. This model was used for the TORO trials of enfuvirtide [34], the MOTIVATE trials of maraviroc [39,40], the BENCHMRK trials of raltegravir [5,41] and the DUET trials of etravirine [31,32]. The POWER and RESIST trials of darunavir and tipranavir included investigator-selected protease inhibitors in the control arms, but baseline resistance testing predicted marginal efficacy for the control PIs chosen [33,42]. These trial designs have recently been criticized, given the excess risk of virological failure in the control arm and the subsequent risk of developing drug resistance, which could compromise future treatment options [43,44]. Combinations of new antiretrovirals with a low potential for cross-resistance are likely to lead to full suppression of HIV RNA in the majority of treatment-experienced patients.
For example, 16-week data from the BENCHMRK trials of the integrase inhibitor raltegravir showed at least 90% of patients with HIV RNA <400 copies/ml when raltegravir was initiated with either enfuvirtide, darunavir or both drugs [5]. Given these new developments, there may be ethical issues in conducting new clinical trials in which a control arm is expected to underperform virologically, except in the most treatment-experienced patients. In future, noninferiority trials could be used to identify drug combinations that achieve a consistently high efficacy rate in treatment-experienced patients, but with a lower pill burden, fewer drugs required and fewer adverse events. For example, patients with current suppression on more complex combinations (such as multiple nucleoside reverse transcriptase inhibitors, dual boosted protease inhibitors, or enfuvirtide) could be randomized either to continue their current treatment or to transfer onto a simpler combination of new drugs. Alternatively, patients could be transferred onto combinations of new drugs, and randomized to either continue or stop parts of their optimized background regimen (such as nucleoside reverse transcriptase inhibitors). These noninferiority trial designs are less likely to lead to excess virological failures in control arms, and could allow access to new treatments for the majority of trial participants. If the efficacy of single drugs is compared in experienced patients also given an optimized background of other active drugs, it is important to assess the extent to which the optimized background dominates the efficacy profile, and to what extent the randomized treatment component contributes to the overall efficacy.

Summary and recommendations for trial design, analysis and interpretation

Noninferiority studies are designed to show similar efficacy in the experimental and control arms.
Therefore, wherever possible, this type of design should be adopted for new trials in treatment-experienced patients, in preference to earlier designs powered to show differences versus the control arm. Difference trials have the disadvantage of potentially exposing the control group to a higher risk of virological failure, which may no longer be necessary given the range of new treatment options available. The HIV RNA 50 copies/ml endpoint should be used as the primary endpoint wherever possible. The primary objectives and endpoints of a noninferiority trial need to be clearly stated in the protocol, together with a justification for the control arm used and the statistical procedures to be used in case unexpected superiority is seen for the experimental versus the control arm. A standardized ‘per protocol’ analysis needs to be defined, with similar exclusions across clinical trials. As a minimum, the following patients should be excluded from the per protocol analysis: (i) patients who do not take at least one dose of randomized medication; (ii) patients who receive the wrong randomized treatment; (iii) patients taking an inadequate optimized background treatment; and (iv) patients who take unauthorized experimental drugs during the trial (from a prospectively defined list). The results from the per protocol and intent-to-treat analyses should be consistent in order to conclude noninferiority. Both the intent-to-treat and per protocol analyses of noninferiority trials may be dominated by nonvirological endpoints. A ‘nonvirological failures censored’ analysis should be conducted, including all virological failures while patients are receiving randomized treatment, and censoring data after treatment discontinuation for nonvirological reasons. These ‘nonvirological failures censored’ analyses could identify treatments that are virologically inferior but better tolerated than the control arm.
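The censoring rule recommended above can be sketched as a mapping over TLOVR-style outcome categories. This is a sketch only: a real analysis censors at the date of discontinuation within a time-to-event framework rather than relabelling categories.

```python
def censored_view(tlovr_category):
    """'Nonvirological failures censored' analysis: keep virological
    successes and failures observed on randomized treatment, and
    censor patients who discontinued for nonvirological reasons."""
    if tlovr_category in ("success", "virological failure"):
        return tlovr_category
    return "censored"  # adverse event, death, consent withdrawal, lost to follow-up

outcomes = ["success", "virological failure",
            "discontinued for adverse event", "lost to follow-up"]
print([censored_view(o) for o in outcomes])
# ['success', 'virological failure', 'censored', 'censored']
```

Comparing this view with the composite TLOVR analysis is what separates a regimen that fails virologically from one that is merely less well tolerated.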
Results of new randomized trials designed to show noninferiority need to be carefully communicated, making clear whether noninferiority has or has not been demonstrated. The design and delta should be stated in the Methods section. Where the 95% confidence interval for the true treatment effect overlaps zero but noninferiority was not shown, the study results should not be interpreted as demonstrating ‘no difference’ or equivalence. Furthermore, the conclusions should be stated in the context of the population recruited, including baseline resistance profiles. Proving noninferiority in naive patients does not necessarily mean the same result would occur in experienced patients, and vice versa.