NCDEU II: New Research Methodology Leads Issues

October 1, 1998

New methods of conducting and evaluating research were as intriguing as their results at the National Institute of Mental Health (NIMH)-sponsored New Clinical Drug Evaluation Unit Program's (NCDEU) 38th annual meeting in Boca Raton, Fla., June 10-13. The meeting has grown from a forum of NIMH-funded researchers reporting on their progress into a convention of approximately 1,000 clinicians, industry and regulatory personnel, and investigators marking the progress in psychopharmacology. However, the meeting continues to include assessments of research methodology.

To answer the question posed in their presentation title, "Do clinical trials reflect drug potential?" Mary Hooper, B.A., and Jay Amsterdam, M.D., of the Depression Research Unit at the University of Pennsylvania, invoked the Freedom of Information Act to obtain and examine the U.S. Food and Drug Administration reviews of New Drug Applications (NDAs) for paroxetine (Paxil), venlafaxine (Effexor), nefazodone (Serzone), mirtazapine (Remeron) and sertraline (Zoloft). Although the efficacy of these agents is established in each NDA by statistical differentiation from placebo, Hooper and Amsterdam suspect that a heightened placebo effect from such factors as a high dropout rate, poor site selection or poor protocol design can mask the potential of the active drugs.

Hooper reported on their review of 37 clinical trials, all of which had used the Hamilton Depression Scale (HAM-D), the HAM-D "depressed mood" item, the Clinical Global Impression scale (CGI) severity and CGI improvement scores to measure improvement from baseline at the end of week 6 and week 8 of the acute treatment phase. They found that efficacy was not demonstrated on any of the four measures in 13 (35%) of the trials. Paroxetine efficacy was demonstrated on all four measures in three of six trials (50%). Venlafaxine studies did not employ CGI, but in two of six trials (33%), success was found on each of the other three measures. Nefazodone demonstrated significant improvement in all four measures in one of nine trials (11%). Mirtazapine was found effective in all measures in four trials, but failed efficacy criteria in four other trials. Sertraline demonstrated efficacy in three of six trials (50%).

"These data," the investigators related, "emphasize the importance of controlling for confounding variables in the development of new antidepressant compounds."

Douglas Feltner, M.D., and colleagues at Solvay Pharmaceuticals sought to confirm that factors other than the spontaneous improvement of depressive symptoms contribute to the placebo effect in antidepressant trials. The group made retrospective calculations of between-visit changes in 17-item Hamilton Depression Scale (HAM-D-17) total scores for patients receiving placebo in several similarly designed antidepressant trials. They found that the mean HAM-D-17 scores in the single-blind phase of the trial, from screening to baseline visit, were not reduced (mean change was an increase of 0.18 ± 1.79); but a reduced mean score, indicating symptom improvement, occurred in patients with high, medium and low baseline HAM-D-17 scores from the first postrandomization week to the third.

"These data suggest," Feltner and colleagues concluded, "that in placebo-controlled depression trials, the placebo effect is more prominent during the double-blind phase than during the single-blind placebo run-in phase." The investigators emphasized that this difference is unlikely to be due to different rates of spontaneously improved depression between the two trial phases.
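The between-visit calculation that Feltner's group described can be illustrated with a minimal Python sketch. The function name and all scores below are hypothetical, chosen only to show the arithmetic of a mean change and its standard deviation; they are not data from the Solvay trials.

```python
from statistics import mean, stdev


def between_visit_change(earlier, later):
    """Return (mean, SD) of per-patient score change from one visit to the next.

    A positive mean is an increase in HAM-D-17 total score (worsening);
    a negative mean is a decrease (improvement).
    """
    if len(earlier) != len(later):
        raise ValueError("each patient needs a score at both visits")
    changes = [b - a for a, b in zip(earlier, later)]
    return mean(changes), stdev(changes)


# Hypothetical HAM-D-17 totals for five placebo patients (illustration only).
screening = [24, 21, 26, 19, 23]
baseline = [25, 21, 25, 20, 23]
week_1 = [23, 20, 24, 19, 22]
week_3 = [19, 17, 21, 16, 18]

# Single-blind run-in: screening to baseline.
run_in_mean, run_in_sd = between_visit_change(screening, baseline)
# Double-blind phase: first to third postrandomization week.
double_blind_mean, double_blind_sd = between_visit_change(week_1, week_3)

print(f"run-in change:       {run_in_mean:+.2f} (SD {run_in_sd:.2f})")
print(f"double-blind change: {double_blind_mean:+.2f} (SD {double_blind_sd:.2f})")
```

With these invented numbers the run-in change is slightly positive and the double-blind change is clearly negative, the same pattern the investigators reported.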

In addition to validating a new agent against placebo, the studies in an NDA customarily employ fixed-dosage comparisons to establish a minimal effective dose for the new agent and, perhaps, to discourage subsequent use of excessively high doses with associated heightened side effects. However, Jeffrey Mattes, M.D., of the Psychopharmacology Research Association of Princeton, N.J., suggested these goals might not be achieved by fixed-dose studies, which may actually obscure the response of some patients to unusually low doses.

"Thus, a finding in a fixed-dose study that only patients receiving a specified high dose improve significantly does not necessarily mean that lower doses are not beneficial for some patients," Mattes explained.

Mattes offered as an example a recent study of paroxetine efficacy in panic disorder, which indicated that a dosage of 40 mg/day was effective while 10 mg/day and 20 mg/day were not. "Results of this type might lead to some patients receiving a higher dose than they need, the very outcome that fixed-dosage studies were intended to avoid," he warned.

Mattes proposed that variable-dose rather than fixed-dose designs could often serve the same purpose, and would more closely resemble clinical treatment. Variable-dose design can also identify minimal effective dose, Mattes indicated, with slow dose titration to therapeutic effect and tapering following stabilization.

"Variable-dose studies are also more cost-effective when attempting to demonstrate efficacy," Mattes argued, "since fixed-dose studies require larger sample sizes, anticipating that some patients will receive either too low or too high a dose."

The possibility that the dosages approved by the FDA from fixed-dose studies are not the optimal dose range was raised by Robert Hamer, Ph.D., of the Robert Wood Johnson Medical School, Piscataway, N.J., and Pippa Simpson, Ph.D., of the University of Arkansas for Medical Sciences, Little Rock, Ark. Hamer and Simpson went beyond ascribing this possible disparity to the deficiencies in fixed-dose designs to suggest, "It may be in the sponsor's interest to design the clinical trials in such a way as to minimize the finding of a dose-response effect."

In addition, Hamer and Simpson questioned the adequacy of the statistical methods used to account for subjects who drop out in clinical trials. "These techniques themselves often make untenable assumptions, and frequently have difficulty with the estimation process," they noted. They warned that even if a dose-response effect exists, it may not be discerned in the face of confounding factors in the trial design, dropouts, causes of dropouts and the chosen methods of statistical analysis.

Michael Borenstein, Ph.D., of the Hillside Hospital, Teaneck, N.J., described two statistical analysis computer programs he developed to help ensure accurate meta-analysis of multiple studies and power analysis in survival studies. "Meta-analysis can be used," Borenstein explained, "to set policy by allowing us to develop a clear picture from existing data. It can also be used to help plan future research by pinpointing areas for which the existing information is not sufficient." His meta-analysis program aids this research by facilitating the establishment of a detailed database for individual studies.
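For readers unfamiliar with how a meta-analysis combines studies, the following Python sketch shows one common approach, fixed-effect inverse-variance pooling. It is not Borenstein's program, whose internals the presentation did not detail; the function name, the choice of pooling method and the numbers are all assumptions made for illustration.

```python
import math


def fixed_effect_pool(effects, std_errors):
    """Pool per-study effect estimates with inverse-variance weights.

    Returns the pooled estimate, its standard error and a 95% confidence
    interval (normal approximation).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    pooled_se = math.sqrt(1.0 / total)
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci


# Hypothetical standardized mean differences and standard errors
# for three trials (illustration only, not real studies).
effects = [0.30, 0.45, 0.10]
std_errors = [0.15, 0.20, 0.25]

pooled, pooled_se, ci = fixed_effect_pool(effects, std_errors)
print(f"pooled effect {pooled:.3f}, 95% CI {ci[0]:.3f} to {ci[1]:.3f}")
```

The pooled standard error is smaller than that of any single study, which is the sense in which combining existing data can give a clearer picture than each trial alone.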

Borenstein's computerized survival analysis performs computations similar to those that determine the power of a study with a single time-point, but can incorporate the complexity of the parameters over time. "Survival analysis," Borenstein explained, "provides a clear picture of outcome that is not available at any single time point." Such analysis, for example, distinguishes between patients who remit at different times during a trial, or identifies those who fail to remit by a particular time point and require different follow-up.
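The idea that outcome over time carries more information than a single time point can be sketched with a Kaplan-Meier estimate, a standard survival method. This is an illustrative choice, not a description of Borenstein's power-analysis software; the function name and the patient data are hypothetical.

```python
def kaplan_meier(times, remitted):
    """Estimate the proportion of patients not yet remitted over time.

    times    -- week of remission, or of last observation if censored
    remitted -- True if the patient remitted at that week, False if censored
    Returns a list of (week, proportion not yet remitted), one entry per
    week in which at least one remission occurred.
    """
    at_risk = len(times)
    surviving = 1.0
    curve = []
    for t in sorted(set(times)):
        events = sum(1 for ti, r in zip(times, remitted) if ti == t and r)
        leaving = sum(1 for ti in times if ti == t)
        if events:
            surviving *= 1.0 - events / at_risk
            curve.append((t, surviving))
        # Remitted and censored patients both leave the risk set.
        at_risk -= leaving
    return curve


# Hypothetical 8-week trial arm with ten patients (illustration only).
weeks = [2, 3, 3, 4, 5, 6, 6, 8, 8, 8]
remit = [True, True, False, True, True, True, False, False, False, False]

curve = kaplan_meier(weeks, remit)
for week, proportion in curve:
    print(f"week {week}: {proportion:.3f} not yet remitted")
```

Patients who remit at week 2 and at week 6 contribute differently to the curve, and those still unremitted at week 8 remain visible as censored observations rather than being collapsed into a single end-point count.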

"The program allows the researcher to plan for precision [i.e., the confidence interval width] as well as power," Borenstein indicated. "In some cases the study goal is not only to test the null hypothesis of no effect, but also to estimate the size of the treatment effect."

Rating the Rating Scales

Antidepressant effectiveness is gauged in clinical trials with validated rating scale measures of depressive symptom improvement. Researchers at Duke University point out, however, that the choice of a particular scale is often arbitrary, and that depression symptom scales vary in their psychometric properties, conceptual focus, response burden and discriminating power.

Bernard Carroll, M.D., Ph.D., and colleagues at Duke University Medical Center, Durham, N.C., conducted a community control comparison of four scales in 688 subjects (559 patients with major depression and 129 normal volunteers). They employed the Montgomery-Asberg Depression Rating Scale (MADRS) and the Center for Epidemiologic Studies of Depression Scale (CES-D), as well as the Carroll Depression Scale (CDS, 52 items) and the Brief Carroll Depression Scale (BCDS, 12 items), developed by Bernard Carroll approximately 17 and 12 years ago, respectively.

In the comparison, Carroll's group assessed the sensitivity and specificity of the scales, the effect size (as the difference between group means divided by the pooled standard deviation), the case/control ratio (as the mean score of patients divided by the mean score of controls) and a computed "D-score" (Blakely et al., 1995). Carroll indicated that his was the first group in psychiatry to adopt this last statistic, which is, he explained, "a standardized measure of test effectiveness that allows tests to be compared in relative and absolute terms."
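The effect size and case/control ratio defined above, along with sensitivity and specificity, reduce to a few lines of arithmetic. The Python sketch below uses invented scores and a hypothetical cutoff of 10; the function names are illustrative, and the D-score is omitted because its formula was not given in the presentation.

```python
import math
from statistics import mean, variance


def effect_size(patients, controls):
    """Difference between group means divided by the pooled standard deviation."""
    n1, n2 = len(patients), len(controls)
    pooled_var = (
        (n1 - 1) * variance(patients) + (n2 - 1) * variance(controls)
    ) / (n1 + n2 - 2)
    return (mean(patients) - mean(controls)) / math.sqrt(pooled_var)


def case_control_ratio(patients, controls):
    """Mean score of patients divided by mean score of controls."""
    return mean(patients) / mean(controls)


def sensitivity_specificity(patients, controls, cutoff):
    """Treat a score at or above the cutoff as a positive screen."""
    sensitivity = sum(s >= cutoff for s in patients) / len(patients)
    specificity = sum(s < cutoff for s in controls) / len(controls)
    return sensitivity, specificity


# Hypothetical scale scores (illustration only, not the Duke data).
patients = [22, 28, 8, 31, 26]
controls = [4, 7, 3, 6, 5]

d = effect_size(patients, controls)
ratio = case_control_ratio(patients, controls)
sens, spec = sensitivity_specificity(patients, controls, cutoff=10)
print(f"effect size {d:.2f}, case/control ratio {ratio:.1f}")
print(f"sensitivity {sens:.2f}, specificity {spec:.2f}")
```

In this toy sample one patient scores below the cutoff, so sensitivity falls to 0.80 while specificity stays at 1.00, showing how the measures can diverge for the same scale.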

The investigators also performed a meta-analysis of the scale performances in primary care, and assessed the response burden of self-rated scales in terms of the number of cognitive decisions required. Carroll reported findings that the four scales intercorrelated highly, diagnostic specificity was high and the mean scores of patients were significantly greater than those of the normal volunteers.

The D-score calculations were very high (above 4.0) for the CDS and BCDS, but were not calculated for the MADRS and CES-D because, Carroll indicated, the depressed patient sample did not meet criteria to do so. For comparison, however, the investigators calculated the D-score for several scales from previously published data. This included the Zung SDS Depression Scale, used in a similar community control comparison, which was determined to have a lower D-score of 2.2. In their meta-analysis of data from primary care, Carroll's group found the computed D-scores of several depression symptom scales, including that of the BCDS, were uniformly lower.

Carroll concluded that while validated depression scales effectively separate clinically depressed patients from community control subjects, they vary in the cognitive burden placed on the patient and, on one effectiveness measure, appear less robust when used in primary care.

The sensitivities of the HAM-D-17, used extensively in depression research, and the six-item version (HAM-D-6) more applicable to clinical practice were compared by Cynthia Hooper, M.A., and David Bakish, M.D., of the Royal Ottawa Hospital, Ottawa, Canada. The scales were found comparable in patients with major depressive disorder in an earlier study, and this investigation extended those results to a population that included patients with concurrent dysthymic disorder or with melancholic features.

Hooper and Bakish found the two scales to be strongly correlated at both baseline and termination of treatment. Further, the HAM-D-6 was found to be effective in distinguishing symptom change over the treatment period for all patient subgroups. While characterizing these results as preliminary, the researchers indicated that the HAM-D-6 appeared to be widely applicable. "The ability to use a shorter version of the HAM-D may allow clinicians to incorporate the scale into their practice and develop objective databases. It may also prove useful in the current practice of using more intense, high-volume clinical trials to test new antidepressants."

Kenneth Kobak, Ph.D., and colleagues at the Dean Foundation, Middleton, Wis., studied the feasibility of obtaining computerized HAM-D assessments directly from patients using interactive voice response (IVR) technology over a phone line.

"IVR technology enables remote evaluation of treatment response 24 hours a day from any touch-tone telephone," Kobak explained. "Such accessibility allows a more frequent monitoring of the speed of onset of medication in clinical trials."

In Kobak's validation study, 76 subjects with affective or anxiety disorders and 37 community controls were given IVR HAM-D and clinician-administered HAM-D in a counter-balanced order. Subjects answered questions on the IVR HAM-D by pressing number keys on a touch-tone telephone. Kobak reported a high degree of convergent validity in the two methods, with the IVR successfully discriminating between subjects with a diagnosis of major depression, subjects with a diagnosis of dysthymia and the nonpsychiatric controls.
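Agreement between two ways of administering the same scale is often summarized with a correlation coefficient. The Python sketch below computes a Pearson correlation on invented paired scores; the report did not state which statistic Kobak's group used, so the choice of Pearson's r, the function name and the data are assumptions for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between paired scores from two administration methods."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired scores")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical HAM-D totals for six subjects rated both ways (illustration only).
clinician_scores = [24, 18, 9, 30, 5, 14]
ivr_scores = [26, 17, 11, 28, 6, 15]

r = pearson_r(clinician_scores, ivr_scores)
print(f"correlation between clinician and IVR ratings: {r:.3f}")
```

A value near 1 would indicate that the two methods rank subjects almost identically, though a high correlation alone does not show that the absolute scores agree.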

In the subsequent multisite feasibility study with 449 subjects, Kobak's group used the IVR HAM-D to evaluate speed of improvement onset. The scale was administered by a clinician at the study site at baseline, and on days 2, 7, 14, 21 and 28; and by IVR from the patient's home on days 1, 2, 4 and 11. In this study, the wording was modified to evaluate the time interval since the subject's last evaluation. In addition, the system contained a paging feature that automatically paged the study site if a patient's suicide score from home was elevated.

The subjects found the IVR HAM-D easy to understand and complete, and Kobak indicated the results support the feasibility of using IVR technology in clinical trials.




Blakely DB, Oddone EZ, Hasselblad V et al. (1995), Noninvasive carotid artery testing. A meta-analytic review. Ann Intern Med 122:360-367.