Throwing Out the Gold? Reconsidering the HAM-D

April 1, 2005

Is the HAM-D still viable? What are its flaws and is it possible to compensate for them?

Psychiatric Times

April 2005


Issue 4

The most commonly used measure of depression is under increasing scrutiny as some antidepressant studies fail to distinguish active drug from placebo or ascertain greater benefit than risk. A recently published review of studies evaluating the Hamilton Rating Scale for Depression (HAM-D) posed the question, "Has the gold standard become a lead weight?" (Bagby et al., 2004).

Michael Bagby, Ph.D., and colleagues reviewed the studies evaluating the HAM-D that have been published since the last major review in 1979 (Hedlund and Vieweg, 1979). Bagby and colleagues (2004) concluded that the instrument, which is based on a 45-year-old construction, was "psychometrically and conceptually flawed" and, in their estimate, beyond repair.

"The breadth and severity of the problems militate against efforts to revise the current instrument," Bagby and colleagues wrote. "After more than 40 years, it is time to embrace a new gold standard for assessment of depression."

The HAM-D was also criticized during several presentations at the 2004 National Institute of Mental Health-sponsored New Clinical Drug Evaluation Unit (NCDEU) conference (see related story, p68 of the print edition--Ed.). Most of the criticism, however, has been directed at current versions of the HAM-D, rather than the scale Hamilton (1960) originally validated, according to Leon Rosenberg, M.D., who spoke at the NCDEU meeting. Scoring guidelines and anchors have been changed from the original, Rosenberg noted, and Hamilton's explicit direction that the instrument be applied by two raters has not been followed.

Bagby and colleagues (2004) did encounter at least 20 different published forms of the HAM-D, including both longer and shorter versions, but limited their review to studies of the 17-item version. The HAM-D-17 was the most studied, and the 17 items were contained in most other versions.

One proposal to fix the HAM-D was presented at the NCDEU by Amir Kalali, M.D., and colleagues of the International Society for CNS Drug Development (ISCDD) Depression Rating Scale Standardization Team (DRSST). Their GRID-HAMD, which is available online, increases standardization of the scoring and provides scoring guidelines and structured interview prompts. Kalali noted that the GRID-HAMD is already in use in several NIMH-funded studies.

"Our goal for the DRSST process was not to change the items but to improve the current scale as much as possible by standardization, operationalization of concepts, better anchors and structured questions, as well as conventions," Kalali told Psychiatric Times.

Bagby and colleagues (2004) considered the GRID-HAMD, but found it neither sufficiently improved nor a suitable replacement, as it retains the original 17 items. "The ... Team failed to address many of the flaws of the original instrument," they asserted. "Most of the items still measure multiple constructs; items that have consistently been shown to be ineffective have been retained, and the scoring system still includes differential weighting of items."

Kalali explained to PT that it was not the intention of the DRSST to replace the HAM-D with an entirely different scale, but to offer an improvement until such a replacement could be developed. "We agree that the Hamilton Depression Rating Scale is flawed," he remarked. "However, while we await other scales to be developed and widely accepted, it is a practical reality that the Hamilton Depression Rating Scale will continue to be widely used in both regulatory and academic trials for at least a few years."

Kalali characterized the DRSST effort as a demonstration project for the ISCDD collaboration between pharmaceutical manufacturers and academic centers. He anticipates that a new ISCDD-funded initiative, the Depression Inventory Development Team, which will include Bagby, will produce a widely accepted depression scale that incorporates appropriate, data-driven items, shows consistent item response and is sensitive to contemporary depression management interventions.

Bagby and colleagues (2004) derived their unfavorable assessment of the HAM-D-17 from 70 studies that examined one or more of the psychometric properties of reliability, item response and validity. Reliability was evaluated in three forms: internal reliability among the instrument's items, retest reliability and inter-rater reliability when the scale was applied by different raters. Item response analysis ascertained sensitivity to different levels of, and changes in, symptom severity.
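To make the idea of internal reliability concrete, the following is a minimal sketch of one commonly used statistic, Cronbach's alpha. The article does not say which statistic the reviewed studies used, so the choice of alpha is an assumption here, and the function name and the ratings are hypothetical illustrations rather than HAM-D data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a table of ratings.

    rows: one list of item scores per patient, all the same length.
    Uses sample variances throughout.
    """
    k = len(rows[0])  # number of items
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two patients")
    # Variance of each item across patients
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Variance of each patient's total score
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: four patients scored on three items
ratings = [
    [0, 1, 0],
    [1, 1, 2],
    [2, 3, 2],
    [4, 3, 4],
]
print(round(cronbach_alpha(ratings), 2))
```

Alpha approaches 1 when the items rise and fall together across patients and falls toward 0 when they vary independently of one another.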

Studies examined content, convergent, discriminant, factorial and predictive validity. Content validity reflects how well the scale items correspond with known factors of depression. Convergent validity is the correlation with other measures of depression. Discriminant validity is the ability to distinguish between groups with and without depression. Factorial validity is derived from factor analysis of the empirical structure of the scale, ascertaining whether each item loads on the factor for which it was designed. Predictive validity is the ability to predict change in symptom severity with treatment.

Bagby and colleagues (2004) found agreement in studies that the majority of HAM-D items have adequate internal reliability, although loss of insight had the most variable rankings. Retest reliability at the item level was negligible for some items, but this varied considerably across studies and was enhanced with use of structured interview guides.

The inter-rater reliability at the individual item level was deemed poor. In one study in which the scale was applied at baseline and after 16 weeks of treatment, only early insomnia was adequately reliable before treatment and only depressed mood was reliable after treatment (Cicchetti and Prusoff, 1983). In two samples of another study, loss of insight was found to have the lowest inter-rater agreement (Rehm and O'Hara, 1985). Inter-rater reliability was increased in a different study, however, when the scale was administered with interview guidelines (Moberg et al., 2001).

Validity studies examining item content and scaling found that many items failed to measure a single symptom or to contain response options corresponding to degrees of severity. "The problems inherent in the heterogeneity of these rating descriptors reduce the potential meaningfulness of these items," Bagby et al. (2004) remarked, "a problem exacerbated if the different components of an item actually measure multiple constructs and thus measure different effects."

While Bagby and colleagues acknowledged that most items are scaled so that increasing scores reflect increasing severity, they noted, "It is less clear whether the anchors used for different scores on certain items actually assess the same underlying construct/syndrome."

They pointed out that it is standard psychometric practice for each item to contribute equally to the total score or to have evidence supporting differential weighting. In contrast, Hamilton originally included both three-point and five-point items, not with justified differential weighting but on the assumption that certain items would be difficult to anchor dimensionally and so should have fewer response options.

As an illustration of how validity can be compromised when certain items contribute more to the total score than others, Bagby and colleagues offered:

A severe manifestation of ... [psychomotor retardation] contributes 4 points, whereas an equally severe manifestation of ... [psychomotor agitation] contributes 2 points. Similarly, someone who weeps all the time can contribute 3 or 4 points on depressed mood, whereas someone who feels tired all the time can contribute only 2 points, on the general somatic symptoms response.
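The arithmetic behind this illustration can be sketched in a few lines. The maximum item scores below are taken from the ranges in the quotation (with depressed mood set to its upper value of 4); the item keys, function names and example patients are hypothetical, and only the four items named in the quotation are included rather than the full 17-item scale.

```python
# Maximum score each item can contribute, per the quotation above.
# Only the four quoted items are listed; this is not the full HAM-D-17.
MAX_ITEM_SCORE = {
    "depressed_mood": 4,
    "psychomotor_retardation": 4,
    "psychomotor_agitation": 2,
    "general_somatic_symptoms": 2,
}


def total_score(ratings):
    """Sum the item ratings after checking each against its item's range."""
    total = 0
    for item, score in ratings.items():
        if not 0 <= score <= MAX_ITEM_SCORE[item]:
            raise ValueError(f"{item}: score {score} is out of range")
        total += score
    return total


def severity_fraction(item, score):
    """Express a rating as a fraction of the item's maximum."""
    return score / MAX_ITEM_SCORE[item]


# Two hypothetical patients, each with one symptom at maximum severity
retarded = {"psychomotor_retardation": 4}
agitated = {"psychomotor_agitation": 2}

print(total_score(retarded), total_score(agitated))  # 4 2
print(severity_fraction("psychomotor_retardation", 4),
      severity_fraction("psychomotor_agitation", 2))  # 1.0 1.0
```

Both patients are at the ceiling of their respective items, yet one adds twice as many points to the total as the other, which is the unequal weighting the reviewers describe.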

Although item-response theory was not available at the time the HAM-D was developed, the scale has since been tested with item-response methods in several studies. Most HAM-D items were found to be sensitive to depression severity. In one study of individual items in a combined sample of primary care and depressed patients from the NIMH Treatment of Depression Collaborative Research Program, however, 12 of the 17 items had at least one problematic response option (Santor and Coyne, 2001).

"These findings confirm that the rating scheme is not ideal for many items on the Hamilton depression scale," Bagby and colleagues (2004) explained, "with the unfortunate effect of decreasing the capacity of the Hamilton depression scale to detect change."

In examining content validity, they noted that most HAM-D versions contain core items that have remained unchanged for more than 40 years, while the DSM diagnostic criteria have been revised three times in response to developments in clinical trial research and expert consensus. Several symptoms in the HAM-D are not among the current DSM-IV criteria and several key DSM-IV depression features are "buried within more complex [HAM-D] items and sometimes are not captured at all."

There has been generally good convergent validity with other measures, although not with the major depression section of the Structured Clinical Interview for DSM-IV. This is further indication of the divergence of the HAM-D from current diagnostic criteria. In one evaluation of discriminant validity, four items (psychomotor agitation, gastrointestinal symptoms, loss of insight and weight loss) failed to differentiate depressed from healthy patients (Rehm and O'Hara, 1985).

Factorial validity was evaluated with factor analysis in 15 studies with 17 samples. There was not a single unifying structure replicated across studies, according to Bagby and colleagues. They noted that while the HAM-D is clearly not unidimensional, with separate sets of items reliably representing general depression and insomnia factors, "the exact structure of the Hamilton Depression Scale's multidimensionality remains unclear."

The HAM-D has shown predictive validity and has been found more sensitive to change than the self-reporting Beck Depression Inventory (BDI) or Zung Self-Rating Depression Scale (ZSDS). Bagby and colleagues (2004) qualified the apparent predictive validity of the HAM-D, however.

"One disadvantage of a multidimensional instrument such as the Hamilton Depression Scale in detecting change is that specific treatments may affect only a single dimension," they stated. "If the total score includes somatic symptoms that actually reflect treatment side effects, estimates of treatment response will be spuriously low."

Calling for change, Bagby and colleagues contrasted the progress and sophisticated methods in developing new antidepressant medications with continued reliance on antiquated methods to assess change in depression severity and response to treatment.

"Effort in both areas is critical to the accessibility of new medications for patients with depression," the researchers concluded. "The field needs to move forward and embrace a new gold standard that incorporates modern psychometric methods and contemporary definitions of depression."

Kalali concurred with this assessment and is optimistic that the newly formed Depression Inventory Development Team will produce that instrument. "We are well aware of the need for a new depression rating scale based on our more recent understanding of depression and modern psychometric methodology," Kalali commented. "This effort is well underway with item development and field testing already in progress."


Bagby RM, Ryder AG, Schuller DR, Marshall MB (2004), The Hamilton Depression Rating Scale: has the gold standard become a lead weight? Am J Psychiatry 161(12):2163-2177.

Cicchetti DV, Prusoff BA (1983), Reliability of depression and associated clinical symptoms. Arch Gen Psychiatry 40(9):987-990.

Hamilton M (1960), A rating scale for depression. J Neurol Neurosurg Psychiatry 23(1):56-62.

Hedlund JL, Vieweg BW (1979), The Hamilton Rating Scale for Depression: a comprehensive review. J Operational Psychiatry 10:149-165.

Moberg PJ, Lazarus LW, Mesholam RI et al. (2001), Comparison of the standard and structured interview guide for the Hamilton Depression Rating Scale in depressed geriatric inpatients. Am J Geriatr Psychiatry 9(1):35-40.

Rehm LP, O'Hara MW (1985), Item characteristics of the Hamilton Rating Scale for Depression. J Psychiatr Res 19(1):31-41.

Santor DA, Coyne JC (2001), Examining symptom expression as a function of symptom severity: item performance on the Hamilton Rating Scale for Depression. Psychol Assess 13(1):127-139.