OUTCOMES & ADVANCES
When we read studies and meta-analyses, we often see the terms statistical significance and effect size. Most of us are familiar with the concept of statistical significance. A finding is said to be statistically significant if it is unlikely to have occurred by chance. Most often, experimenters set a level of statistical significance of 0.05, or P = .05. This means that we accept the finding as statistically significant only if there is a less than 5 in 100, or 5%, probability of the finding occurring by chance. Whether a finding is statistically significant depends in part on how many participants are in the study: for an effect of a given size, larger studies are more likely than smaller ones to reach statistical significance.
Medical researchers have long used the concept of statistical significance. However, in recent years, many researchers have argued that we overvalue statistical significance and that we need to pay more attention to effect size.
An effect size is a measure of (1) the amount of change in a sample of patients who undergo a treatment or, more commonly, (2) the amount of change in a sample of patients who undergo a treatment compared to a control group. An effect size tells us how much difference a treatment makes.
A treatment effect can be statistically significant but so small as to be clinically meaningless. For example, a meta-analysis evaluated the effect of treating patients with Alzheimer disease with the combination of a cholinesterase inhibitor and memantine.1 In terms of both cognitive functions and behavioral disturbance, the effect was statistically significant but small enough to be clinically meaningless.
There are many ways to measure effect size. We could measure the change in a biological parameter such as weight lost in kilograms when a patient takes semaglutide versus placebo. We could measure the percentage of weight lost or the percentage of body fat lost.
We could measure the change in a rating scale such as the Patient Health Questionnaire–9 (PHQ-9), used to rate the severity of depression, when we treat patients with depression with psychotherapy versus a waiting-list control. Scores on the PHQ-9 range from 0 to 27, with higher scores indicating more severe depression. We could also measure the percentage of patients who meet a defined end point. For example, we could determine the percentage of patients taking an antidepressant who have a 50% reduction in PHQ-9 score. A 50% reduction would be considered a response. On the PHQ-9, a reduction to a score of 4 or less would be considered remission.
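For readers who want to compute these end points, the response and remission definitions above can be sketched in Python (the function name and return format are illustrative, not from the original article):

```python
def phq9_outcome(baseline: int, follow_up: int) -> dict:
    """Classify a PHQ-9 change using the definitions in the text:
    response = at least a 50% reduction from baseline;
    remission = a follow-up score of 4 or less."""
    reduction = (baseline - follow_up) / baseline  # fractional improvement
    return {
        "response": reduction >= 0.50,
        "remission": follow_up <= 4,
    }

# A patient whose score drops from 19 to 8 has responded (>50% reduction)
# but is not in remission (follow-up score above 4).
print(phq9_outcome(19, 8))
```

Note that the two end points can disagree: a severely depressed patient may respond without remitting, and a mildly depressed patient may remit without a 50% reduction.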
We could also assess how many patients we would have to treat with a medication rather than placebo so that 1 additional active-treatment patient would meet a predetermined end point, such as remission. This is called the number needed to treat (NNT). For example, if 40% of patients taking an antidepressant go into remission and 20% of placebo-treated patients go into remission, the NNT is calculated as:

NNT = 1 / (0.40 - 0.20) = 1 / 0.20 = 5
This means we would have to treat 5 individuals with the antidepressant to achieve 1 remission attributable to the antidepressant. Typically, an NNT of 3 would be considered a strong effect, 5 a medium effect, and 7 a small effect.
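The NNT arithmetic above amounts to taking the reciprocal of the difference in event rates, which can be written as a short Python sketch (the function name is illustrative):

```python
def number_needed_to_treat(p_treatment: float, p_control: float) -> float:
    """NNT = 1 / absolute risk reduction,
    where the absolute risk reduction is the difference in event rates
    between the treatment and control groups."""
    absolute_risk_reduction = p_treatment - p_control
    return 1 / absolute_risk_reduction

# The example from the text: 40% remission on drug vs 20% on placebo.
print(number_needed_to_treat(0.40, 0.20))  # 5.0
```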
Researchers sometimes report a risk ratio. Despite the word risk, a high risk ratio is not always bad. Imagine, for example, a group of patients taking an antidepressant for depression and another group taking a placebo. If 40% of the patients taking the antidepressant achieved remission compared with 20% of the patients taking the placebo, then the risk ratio is:

risk ratio = 0.40 / 0.20 = 2
The risk ratio of 2 says that the patients treated with the antidepressant were twice as likely to go into remission as the patients who took the placebo.
Experimenters sometimes report odds ratios. Odds ratios are mathematically more complicated than risk ratios; they are nonlinear and not intuitive. It is difficult to make much sense of odds ratios in deciding whether to offer a particular treatment.
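To see concretely how an odds ratio diverges from a risk ratio, here is a small Python sketch using the remission example above (the function names are illustrative):

```python
def risk_ratio(p_treatment: float, p_control: float) -> float:
    """Ratio of the event probabilities in the two groups."""
    return p_treatment / p_control

def odds_ratio(p_treatment: float, p_control: float) -> float:
    """Odds = probability of the event divided by probability of no event;
    the odds ratio compares the odds in the two groups."""
    odds_treatment = p_treatment / (1 - p_treatment)
    odds_control = p_control / (1 - p_control)
    return odds_treatment / odds_control

# 40% remission on drug vs 20% on placebo:
print(risk_ratio(0.40, 0.20))   # 2.0
print(odds_ratio(0.40, 0.20))   # ~2.67 -- larger than the risk ratio
```

With the same data, the odds ratio (about 2.67) is larger than the risk ratio (2.0), and the gap widens as events become more common, which is one reason odds ratios are hard to interpret clinically.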
Many studies and meta-analyses report effect sizes as standardized mean differences (SMDs). Doing this makes the most sense in meta-analyses. A meta-analysis combines the effect sizes from 2 or more trials. Sometimes different trials use different rating scales. One trial involving a particular antidepressant might use the Montgomery-Åsberg Depression Rating Scale (MADRS), and another might use the Hamilton Depression Rating Scale (HDRS). The highest score on the MADRS is 60, and the highest on the HDRS is 53. For this and other reasons, we cannot simply synthesize the numbers from the 2 scales. To combine the results from different scales, we convert the results from each into an SMD and then we can combine the results.
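As a rough illustration of this conversion, here is a Python sketch in which two hypothetical trials on different scales are converted to SMDs and then combined. The mean differences and standard deviations are invented for illustration, and a real meta-analysis would weight trials by their precision rather than averaging them equally:

```python
def smd(mean_difference: float, pooled_sd: float) -> float:
    """Standardized mean difference: the between-group difference
    expressed as a fraction of the pooled standard deviation."""
    return mean_difference / pooled_sd

# Trial A used the MADRS (hypothetical: 6-point drug-placebo difference, SD 12).
# Trial B used the HDRS (hypothetical: 4-point difference, SD 8).
smd_a = smd(6, 12)   # 0.5
smd_b = smd(4, 8)    # 0.5

# Once both trials are on the common SMD scale, their results can be
# combined (here with a simple unweighted average, for clarity):
combined = (smd_a + smd_b) / 2
print(combined)  # 0.5
```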
An SMD is a difference measured as a proportion of a standard deviation. As an example, consider a single trial using 1 scale: the PHQ-9. Imagine a sample of patients with depression who have an average PHQ-9 score of 19. Some patients are more depressed than others, and the sample is distributed on the bell curve in black, as shown in Figure 1.
If we treat some patients with an antidepressant and some with placebo, the sample gets better overall. The bell curve shifts to the left, as illustrated in yellow in Figure 2. The average score is now 12.
In fact, the yellow curve includes the patients treated with the active drug and those treated with placebo. We need to separate those 2 groups for comparison. Assume the drug-treated patients improved more, so their curve is illustrated in green in Figure 3. The red curve depicts the placebo patients.
Now assume the standard deviation of the yellow curve is 4. The peaks of the curves are at their average values. The peak of the red curve is at 13.5 and the peak of the green curve is at 10.5, so the green, active-treatment group improved 3 points more than did the red placebo group. The SMD between the active-treatment and placebo groups then is:

SMD = (13.5 - 10.5) / 4 = 3 / 4 = 0.75
The statistician Jacob Cohen suggested that an SMD of 0.2 represents a small effect, 0.5 a medium effect, and 0.8 a large effect. However, to be clinically meaningful, an effect size expressed as an SMD should usually at least approach 0.5.
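Putting the worked example and Cohen's conventional labels together, a Python sketch might look like this (values below 0.2 are labeled "trivial" here for convenience; that is not Cohen's term):

```python
def standardized_mean_difference(treatment_mean: float,
                                 control_mean: float,
                                 sd: float) -> float:
    """SMD for an outcome where lower scores are better (as on the PHQ-9):
    how much lower the treatment group's mean is, in SD units."""
    return (control_mean - treatment_mean) / sd

def cohen_label(value: float) -> str:
    # Cohen's conventional cutoffs: 0.2 small, 0.5 medium, 0.8 large.
    if value >= 0.8:
        return "large"
    if value >= 0.5:
        return "medium"
    if value >= 0.2:
        return "small"
    return "trivial"

# The Figure 3 example: placebo mean 13.5, drug mean 10.5, SD 4.
effect = standardized_mean_difference(10.5, 13.5, 4)
print(effect, cohen_label(effect))  # 0.75 medium
```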
Remember the earlier discussion of the effect of the combination of a cholinesterase inhibitor and memantine in Alzheimer disease? For both cognition and behavior, the SMD was 0.13.1 Although the effects were statistically significant, the effect size was too small to be clinically meaningful on average. Of course, some patients will improve more than others, and a small percentage might achieve a clinically meaningful benefit. There are other ways to measure effect sizes, but we will not discuss those here.
We could discuss many other important factors that matter when we think about applying the results of a study or meta-analysis to our patients. We should not overemphasize statistical significance. It is more important to know whether the patients in a study are similar to the patient for whom you must make a treatment decision. If so, then you want to know whether, on average, the treatment produces a clinically meaningful effect size.
In addition to knowing the average effect size, you would really like to know the prediction interval. With a significance level of 0.05, we would have a 95% prediction interval. That tells us the range within which we would expect the treatment effect to fall for 95% of comparable patients or settings. For example, if the 95% prediction interval was 0.4 to 0.8 SMDs, we would expect the effect in such patients to fall between 0.4 and 0.8. Unfortunately, randomized clinical trials rarely report prediction intervals.
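For context, the prediction interval reported in random-effects meta-analyses is computed roughly as follows. This Python sketch uses hypothetical numbers and a normal quantile in place of the t distribution that textbook formulas prescribe:

```python
from math import sqrt
from statistics import NormalDist

def prediction_interval(pooled_effect: float, se_pooled: float,
                        tau_squared: float, level: float = 0.95):
    """Approximate prediction interval for the true effect in a new,
    comparable setting: pooled effect +/- z * sqrt(tau^2 + SE^2),
    where tau^2 is the between-study variance.
    (Normal approximation; textbook formulas use a t distribution.)"""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(tau_squared + se_pooled ** 2)
    return pooled_effect - half_width, pooled_effect + half_width

# Hypothetical meta-analysis: pooled SMD 0.6, standard error 0.05,
# between-study variance tau^2 = 0.01.
low, high = prediction_interval(0.6, 0.05, 0.01)
print(round(low, 2), round(high, 2))  # 0.38 0.82
```

Notice that the prediction interval is wider than a confidence interval on the pooled effect would be, because it also carries the between-study variation.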
Dr Moore is clinical associate professor of psychiatry at Texas A&M University College of Medicine and is affiliated with Baylor Scott & White Health Mental Health Clinic.
1. Matsunaga S, Kishi T, Iwata N. Memantine monotherapy for Alzheimer’s disease: a systematic review and meta-analysis. PLoS One. 2015;10(4):e0123289.