Inter-rater reliability at the level of individual items was found to be poor. In one study in which the scale was administered at baseline and after 16 weeks of treatment, only early insomnia was adequately reliable before treatment and only depressed mood was reliable after treatment (Cicchetti and Prusoff, 1983). In two samples from another study, loss of insight had the lowest inter-rater agreement (Rehm and O'Hara, 1985). In a different study, however, inter-rater reliability improved when the scale was administered with interview guidelines (Moberg et al., 2001).
Validity studies examining item content and scaling found that many items fail to measure a single symptom or lack response options that correspond to degrees of severity. "The problems inherent in the heterogeneity of these rating descriptors reduce the potential meaningfulness of these items," Bagby et al. (2004) remarked, "a problem exacerbated if the different components of an item actually measure multiple constructs and thus measure different effects."
While Bagby and colleagues acknowledged that most items are scaled so that increasing scores reflect increasing severity, they noted, "It is less clear whether the anchors used for different scores on certain items actually assess the same underlying construct/syndrome."
They pointed out that, under standard psychometric practice, each item should either contribute equally to the total score or have evidence supporting its differential weighting. Hamilton, in contrast, originally included both three-point and five-point items, not because differential weighting was justified, but on the assumption that certain items would be difficult to anchor dimensionally and so should have fewer response options.
As an illustration of how validity can be compromised when certain items contribute more to the total score than others, Bagby and colleagues offered:
A severe manifestation of ... [psychomotor retardation] contributes 4 points, whereas an equally severe manifestation of ... [psychomotor agitation] contributes 2 points. Similarly, someone who weeps all the time can contribute 3 or 4 points on depressed mood, whereas someone who feels tired all the time can contribute only 2 points, on the general somatic symptoms response.
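The arithmetic behind this illustration can be sketched in a few lines of Python. The maximum points are the ones quoted above; the item labels, the function names, and the rescaling step are assumptions introduced here for illustration only, and rescaling is not a remedy proposed by Bagby and colleagues.

```python
# Maximum points per item, as quoted in the passage above.
# The dictionary keys are illustrative labels, not official item names.
MAX_POINTS = {
    "psychomotor_retardation": 4,
    "psychomotor_agitation": 2,
    "depressed_mood": 4,
    "general_somatic_symptoms": 2,
}


def share_of_total(item_scores):
    """Return each item's fraction of the summed score."""
    total = sum(item_scores.values())
    if total == 0:
        return {name: 0.0 for name in item_scores}
    return {name: score / total for name, score in item_scores.items()}


def rescale_to_unit(item_scores, max_points=MAX_POINTS):
    """Hypothetical equal weighting: express each score as a fraction of its own maximum."""
    return {name: score / max_points[name] for name, score in item_scores.items()}


# A patient rated at the maximum on all four items.
severe = dict(MAX_POINTS)
raw_shares = share_of_total(severe)
equal_weighted = rescale_to_unit(severe)

for name in severe:
    print(f"{name}: raw share {raw_shares[name]:.2f}, rescaled {equal_weighted[name]:.2f}")
```

With raw points, maximal retardation accounts for twice as much of the summed score as maximal agitation, even though both ratings describe the most severe level of their item. After the hypothetical rescaling, every maximal rating counts the same.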
Although item-response theory was not available when the HAM-D was developed, the scale has since been evaluated with item-response methods in several studies. Most HAM-D items were found to be sensitive to depression severity. However, in one study of individual items in a combined sample of primary care and depressed patients from the NIMH Treatment of Depression Collaborative Research Program, 12 of the 17 items had at least one problematic response option (Santor and Coyne, 2001).
"These findings confirm that the rating scheme is not ideal for many items on the Hamilton depression scale," Bagby and colleagues (2004) explained, "with the unfortunate effect of decreasing the capacity of the Hamilton depression scale to detect change."