Inter-Rater Reliability in Psychiatric Diagnosis
Nearly 50 years ago, psychiatric diagnosis was described as the “soft underbelly” of psychiatry because of poor inter-rater agreement.1 Historically, psychiatric diagnoses were interpretive and somewhat subjective. Relative to medicine as a whole, psychiatry has few laboratory markers or imaging data to confirm a preliminary diagnosis. However, subsequent iterations of DSM have aspired to more accurate, reliable diagnoses. Despite this, the inter-rater reliability of psychiatric diagnosis remains a challenge as American psychiatry prepares to welcome DSM-5.
Inter-rater (or intercoder) reliability is a measure of how often 2 or more people arrive at the same diagnosis given an identical set of data. While diagnostic criteria help establish reliable diagnoses, the methods of gathering and interpreting patient data have a tremendous effect on how likely it is that 2 examiners will come to the same diagnostic conclusion for a given patient. Here we review inter-rater reliability research, identify factors that threaten diagnostic reliability, and offer examples of assessment strategies that may provide increased diagnostic reliability in general clinical settings.
Reliability of psychiatric diagnoses today
In diagnostic assessment, perfect inter-rater reliability would occur if psychiatric practitioners always arrived at the same diagnosis for a given patient. Inter-rater reliability is usually reported with the kappa statistic, which measures agreement beyond what chance alone would produce; the higher the kappa value, the stronger the agreement between raters. No field of medicine achieves perfect agreement; diagnosticians in other specialties also vary to some degree.2
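To make the kappa statistic concrete, here is a minimal sketch of how it can be computed for 2 raters. It assumes the common Cohen's kappa formulation, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance; the studies cited in this article may have used other variants. The function name, the diagnosis labels, and the 10 patient ratings are hypothetical and serve only as an illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each diagnosed the same patients."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty lists of ratings")
    n = len(rater_a)

    # Observed agreement: share of patients given the same diagnosis by both.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each diagnosis, the product of the two raters'
    # marginal proportions, summed over all diagnoses.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings for 10 patients (illustrative data, not from any study).
rater_a = ["MDD"] * 6 + ["bipolar"] * 4
rater_b = ["MDD"] * 5 + ["bipolar"] * 4 + ["MDD"]

kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.58"
```

In this example the raters agree on 8 of 10 patients (80%), but because each assigns the more common diagnosis 60% of the time, 52% agreement would be expected by chance, so kappa is only about 0.58. This is why a kappa value is generally lower than the raw percentage of agreement for the same ratings.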
We are not saying that psychiatric diagnoses wholly lack objectivity, nor do we dismiss the “art” of psychiatry. Depending on the skill of the rater and the category of disorder, the inter-rater reliability of psychiatric diagnoses can be quite high and comparable to the kappas obtained for several diagnoses in other medical specialties. For example, Ruskin and colleagues3 demonstrated a kappa of 0.83 for 4 common psychiatric diagnoses. However, lacking some of the more objective measures enjoyed by other specialties, psychiatry faces additional challenges with inter-rater reliability.
As in other fields of medicine, a central goal of psychiatric nosology is diagnostic reliability and consistency. Despite this goal, diagnostic reliability for some psychiatric disorders remains low. For example, studies have revealed substantial discordance in the diagnosis of psychotic disorders. McGorry and colleagues4 documented 66% to 76% inter-rater reliability for this class of disorders and a corresponding discordance of 24% to 34%. Likewise, Maj and colleagues5 demonstrated a low kappa of 0.22 for the diagnosis of schizoaffective disorder.
Despite the revision of diagnostic criteria with each iteration of DSM, clinicians report concerns that diagnostic reliability generally remains poor. In a small survey study in 2007, Aboraya6 asked how clinicians’ attitudes and beliefs might explain low concordance in psychiatric diagnosis. In contrast to a similar study done in 1962, clinicians in the 2007 study attributed low inter-rater reliability more to individual clinician factors than to problems with nosology.7 Clinician-related factors that affect psychiatric diagnosis are heterogeneous, and the following is by no means a complete list.
Differences in clinical experience may affect knowledge about psychiatric phenomenology or alter the landscape of the clinical interview. For example, comfort and familiarity with diagnostic criteria, confidence in interviewing, and time management can each affect diagnostic conclusions. The diagnostic criteria may effectively be loosened if the clinician asks about only some of them. Availability of data (eg, collateral information) may also affect the ultimate diagnosis. Other important sources of inter-rater divergence are external pressures, such as different reimbursement practices by insurance companies and exposure to popular diagnoses through the media or pharmaceutical marketing, as well as the cognitive biases and heuristics that affect the practice of all medical specialties.8
Clearly, interview styles vary widely and no single approach is the right one. Clinical interviews are often informal with variable structure. The aid of rating scales or structured instruments, therefore, can limit omissions or biases during history taking. While some validated structured instruments (eg, the Structured Clinical Interview for DSM Disorders-Clinician Version [SCID-CV]) require too much time for administration to be practical in everyday clinical work, others (eg, the Diagnostic Interview Schedule for Children [DISC]) may be successfully administered by lay providers or by computer, allowing the clinician to subsequently focus on one or more identified areas of concern.9,10 Similarly, systematic screening instruments are available for all major categories of psychiatric illness.
In light of these data, would an approach that offers more structure and objectivity increase the reliability of diagnoses? We offer 3 examples of approaches that may improve inter-rater reliability:
• A structured approach to collateral data
• The use of validated instruments to collect patient information
• A cognitive approach that anticipates critical challenges to diagnosis