Levels of Evidence

The term “evidence” has become about as controversial as the word “unconscious” had been in its Freudian heyday, or as the term “proletariat” was in another arena.

It means many things to many people, and for some, it elicits reverent awe or reflexive aversion. This is because, like those other terms, it is linked to a movement, in this case evidence-based medicine (EBM), which is currently quite influential, with both supporters and critics.

A clinical controversy

In the previous article in this series, “Why You Cannot Believe Your Eyes,” I described a study in which antidepressants were judged effective in the treatment of bipolar depression, based on observational data that indicated that stopping antidepressants would lead to depression relapse.² I said that we should not take those results at face value because they do not pass the hurdle of bias, specifically confounding by indication.

A single article never stands alone, however; the context of the whole literature is relevant. A critic might object that other observational studies have also “replicated” that initial study.³ I put replicated in quotes, because observational studies can be expected to replicate other observational studies if they have the same sources of bias. Recall that bias is systematic error; therefore, the same mistakes will be made systematically.

My research group conducted the same study as those observational trials, except our study was randomized.⁴ We partially confirmed the earlier findings but with a much smaller amount of benefit (smaller effect size): there was mild, but not marked, depressive symptom benefit with antidepressant continuation; there was a small (not large) delay to depressive relapse with antidepressant continuation. And, notably, we found that there was worsening of depressive episodes in those with rapid-cycling bipolar disorder who continued antidepressant treatment. Which results should we believe: the observational data that show more benefits with antidepressants or the randomized data that show less (and some harm)? The answer is the latter, and this is not a matter of opinion but is based on an understanding of the concept of levels of evidence.

Origins of EBM

It may be worthwhile to note that the originators of the EBM movement in Canada (such as David Sackett⁵) toyed with different names for what they wanted to do; they initially thought to use “science-based medicine” but opted for “evidence-based medicine” instead. This is perhaps unfortunate, since “science” tends to engender respect, while “evidence” seems more vague.⁶ Hence, we often see proponents of EBM (mistakenly, in my view) saying things like: “that opinion is not evidence-based” or “those articles are not evidence-based.” The folly of this kind of language is evident if we use the term “science” instead. Once we use the term “science,” it becomes clear that such statements raise the question of what “science” means.

In my reading of EBM, the basic idea is that we need to understand what kinds of evidence we use, and we need to use the best: this is the concept of levels of evidence. EBM is not about an opposition between having evidence or not having evidence; since we always have some kind of evidence or another, it is about ranking different kinds of evidence.

With a somewhat ready assumption of cause and effect and, equally, a neglect of the laws of chance, the literature becomes filled with conflicting cries and claims, assertions and counterassertions.

Austin Bradford Hill¹(p4)

Specific levels of evidence

The EBM literature has various definitions for levels of evidence. The main EBM text uses letters (A through D).⁵ I prefer using 1 through 5, and I think the specific content of the levels should vary depending on the field of study. The basic idea is that randomized studies are higher levels of evidence than nonrandomized studies (because of bias, as discussed in the previous article), and that the lowest level of evidence consists of case reports, expert opinion, or the consensus of the opinion of clinicians or investigators.

Levels of evidence provide clinicians and researchers with a road map that allows consistent and justified comparison of different studies so as to adequately compare and contrast their findings. Various disciplines have applied the concept of levels of evidence in slightly different ways and, in psychiatry, no consensus definition exists. In my view, in mental health, there are 5 levels of evidence that best apply (Table), ranked from 1 (highest) to 5 (lowest).⁷

The key feature of levels of evidence to keep in mind is that each level has its own strengths and weaknesses, and as a result, no single level is completely useful or useless. All other things being equal, however, as one moves from level 5 to level 1, rigor and probable scientific accuracy increase.

Level 5 refers to a case report or a case series (a few case reports strung together), or an expert’s opinion, or the consensus of experts or clinicians’ or investigators’ opinions (such as in treatment algorithms), or the personal clinical experience of clinicians, or the words of wise men (eg, Sigmund Freud, Emil Kraepelin, Karl Marx, Adam Smith). All of this is the same lowest level of evidence. This does not mean that such evidence is wrong, nor does it mean that it is not evidence; it is a kind of evidence, just a weak kind. It could turn out that a case report is correct, and a randomized study wrong, but, in general, randomized studies are much more likely to be correct than case reports. We simply cannot know when a case report, or an expert opinion, or a saying of Freud or Marx, is right, and when it is wrong. Authority is not, as with Rome, the last word.

All of medicine functioned on level 5 until the revolutionary work of Pierre Louis in the 1840s, whose numerical method introduced level 4: the small observational study.⁸ Observational studies are not randomized, and they are typically open (unblinded). Level 3 is the large observational study, such as the cohort study, the staple of epidemiology. Here we would place such large and highly informative studies as the Framingham Heart Study and the Nurses' Health Study. Such observational studies (at level 3 as well as at level 4) can be prospective or retrospective; prospective studies are considered more valid because of a priori specification of outcomes as well as the usual careful rating and assessment of outcomes.

Levels 2 and 1 take us to the highest levels of evidence because of randomization, which is the best tool to minimize or remove confounding bias. Level 2 represents open randomized clinical trials, and level 1 represents double-blind randomized clinical trials.
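As a rough illustration, the five levels described above can be written as a small lookup. This is only a sketch under stated assumptions, not part of any published scheme: the names Study and evidence_level are hypothetical, and because no numerical cutoff is given here for a "large" observational study, the large flag is left for the reader to set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    """Hypothetical description of a study, for illustration only."""
    design: str                 # "randomized", "observational", or "case_or_opinion"
    double_blind: bool = False  # only meaningful for randomized designs
    large: bool = False         # only meaningful for observational designs;
                                # the cutoff for "large" is the reader's judgment


def evidence_level(study: Study) -> int:
    """Return the level of evidence, from 1 (highest) to 5 (lowest)."""
    if study.design == "randomized":
        # Level 1: double-blind randomized trial; level 2: open randomized trial.
        return 1 if study.double_blind else 2
    if study.design == "observational":
        # Level 3: large observational (eg, cohort) study; level 4: small one.
        return 3 if study.large else 4
    if study.design == "case_or_opinion":
        # Level 5: case reports, case series, expert opinion, consensus.
        return 5
    raise ValueError(f"unknown design: {study.design!r}")


if __name__ == "__main__":
    print(evidence_level(Study("randomized", double_blind=True)))
    print(evidence_level(Study("observational", large=True)))
    print(evidence_level(Study("case_or_opinion")))
```

The lookup captures only the ranking itself; it says nothing about the methodological quality of an individual study within a level.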

Judging between conflicting evidence

The recognition of levels of evidence allows one to have a guiding principle by which to assess medical literature.⁷ Basic rules are (1) all other things being equal, a study at a higher level of evidence provides more valid (or powerful) results than one at a lower level; (2) judgments should be based as much as possible on the highest levels of evidence; (3) levels 2 and 3 are often the highest level of evidence attainable for complex conditions and are to be valued in those circumstances; (4) higher levels of evidence do not guarantee certainty (one study can be wrong), thus look for replicability; and (5) within any level of evidence, studies may conflict based on other methodological issues not captured by the parameters used to provide the general outlines of levels of evidence.
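Rules 1 and 2 can be sketched in a few lines: among the studies at hand, give precedence to those at the highest level of evidence available. The function name best_available and the example study labels are hypothetical, chosen only for illustration. The sketch deliberately does nothing about rules 4 and 5; when several studies share the top level, or when only one study sits there, replication and methodological detail still have to be weighed by the reader.

```python
def best_available(studies: dict[str, int]) -> list[str]:
    """Return the studies at the highest available level of evidence.

    `studies` maps a study label to its level, where 1 is the highest
    level and 5 the lowest. More than one study may be returned; ties
    are not resolved here (see rules 4 and 5).
    """
    if not studies:
        return []
    top_level = min(studies.values())  # a lower number means a higher level
    return sorted(name for name, level in studies.items() if level == top_level)


if __name__ == "__main__":
    # Hypothetical literature on a single clinical question.
    literature = {
        "open randomized trial": 2,
        "large cohort study": 3,
        "case series": 5,
    }
    print(best_available(literature))
```

Note that the function returns the best evidence available, not only level 1 evidence, which is in keeping with rule 3: for many complex conditions, level 2 or level 3 studies are the top of what can be attained.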

Regarding the clinical controversy discussed at the beginning of this article, the randomized data that showed less benefit with antidepressants are likely to be more valid than the data from the observational study that found otherwise.

A caveat: The levels of evidence approach does not mean that observational data are worthless; it only means that when randomized data are available, they should be given precedence. But when randomized data are not available, lower levels of evidence, if they are the best available, become important. In other words, there is not a huge leap between double-blind, placebo-controlled studies and other, less rigorous studies. Many academics imagine that all studies that are not double-blind randomized clinical trials are equivalent in terms of rigor, accuracy, reliability, and information. This is ivory-tower EBM. In reality, there are many intermediate levels of evidence, each with particular strengths as well as limits. Open randomized studies and large observational studies, in particular, can be extremely informative and sometimes as accurate as double-blind randomized clinical trials. The concept of levels of evidence can also help clinicians who are loath to rely on level 1 controlled clinical trials, especially if those results contradict their own level 5 clinical experiences. While the advantages to level 5 data mainly revolve around hypothesis generation, to devalue higher levels of evidence is unscientific and dangerous.

In my view, the concept of levels of evidence is the key concept of EBM. With it, EBM is valuable; without it, EBM is misunderstood.

References

1. Hill AB. Statistical Methods in Clinical and Preventive Medicine. New York: Oxford University Press; 1962.
2. Altshuler L, Suppes T, Black D, et al. Impact of antidepressant discontinuation after acute bipolar depression remission on rates of depressive relapse at 1-year follow-up. Am J Psychiatry. 2003;160:1252-1262.
3. Joffe RT, MacQueen GM, Marriott M, Young LT. One-year outcome with antidepressant treatment of bipolar depression. Acta Psychiatr Scand. 2005;112:105-109.
4. Ghaemi SN, El-Mallakh RS, Baldassano CF, et al. A randomized clinical trial of efficacy and safety of long-term antidepressant use in bipolar disorder. Presented at: the annual meeting of the American Psychiatric Association; May 2008; Washington, DC.
5. Sackett D, Straus S, Richardson W, et al. Evidence-Based Medicine: How to Teach and Practice EBM. Edinburgh: Churchill Livingstone; 2000.
6. Silverman W. Where’s the Evidence? Debates in Modern Medicine. New York: Oxford University Press; 1998.
7. Soldani F, Ghaemi SN, Baldessarini R. Research methods in psychiatric treatment studies. Critique and proposals. Acta Psychiatr Scand. 2005;112:1-3.
8. Salsburg D. The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. New York: WH Freeman; 2001.