How To Analyze a Randomized Controlled Trial
Key Takeaways
- A clear primary hypothesis and avoidance of HARK-ing are crucial for RCT reliability. Implausible hypotheses warrant skepticism.
- Triple blinding, valid randomization, and meaningful follow-up enhance RCT quality. Larger sample sizes and low dropout rates improve reliability.
Learn essential steps to analyze randomized controlled trials (RCTs) for reliable clinical decision-making and improve patient treatment outcomes.
A good clinician must know a mountain of facts. Yet, the tectonic plates of research drift and sometimes earthquakes shake the mountains. The facts change, and to stay current a clinician needs to study the gold standard of evidence: randomized controlled trials (RCTs). However, RCTs vary greatly in quality. Before we trust the results of an RCT, we need to know how to analyze RCTs to determine which ones are trustworthy and which are not.
This article describes steps for analysis of RCTs. For a worksheet you can use for your own benefit and to teach trainees, see below.
A good RCT should state a clear primary hypothesis. The study should clearly state, “Our primary hypothesis was…”, but unfortunately, many RCTs fail to do this. Such failure increases the risk of hypothesizing after the results are known (often called HARK-ing): researchers conduct a study, collect the results, and then emphasize whichever findings make the study seem to have produced positive and important results.
Some studies may also investigate implausible hypotheses. If a hypothesis seems implausible, you should be very skeptical of any evidence the study provides in favor of the hypothesis.
Psychiatric studies often rely on rating scales as well; any rating scales used should be well validated.
Good RCTs describe and contain a table showing the baseline characteristics of the patients. If the patients in the study are not like the patient for whom you must make a decision, then the study provides little guidance in treating that patient. Even a well-done RCT may have enrolled patients who are not similar to the patient you are treating, and in that case it, too, may provide little guidance.
The best RCTs are triple blinded: the patients, the treating clinicians, and those who measure the outcomes are all blind to the assigned treatment. Because of the effects of the treatment, including adverse effects, some RCTs are hard to blind (eg, RCTs of psychedelics). An RCT should randomly assign patients according to a valid method; modern studies should be randomized by an appropriate computer program. We hope that after randomization, the patients in the placebo and active treatment groups will be largely similar. Sometimes, by chance, the groups differ in a way that could have affected the response to treatment; for example, one group might have a stronger history of treatment resistance.
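To make the randomization step concrete, here is a minimal sketch (not from this article) of permuted-block randomization, one common computer-based method; the function name, block size, and seed are illustrative assumptions:

```python
import random

def permuted_block_randomization(n_patients, block_size=4, seed=42):
    # Hypothetical helper: assign patients to "active" or "placebo"
    # in shuffled blocks so group sizes stay balanced during enrollment.
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_patients:
        block = ["active"] * (block_size // 2) + ["placebo"] * (block_size // 2)
        rng.shuffle(block)  # randomize order within each block
        assignments.extend(block)
    return assignments[:n_patients]

print(permuted_block_randomization(10))
```

Blocking is one of several valid designs; simple (unblocked) randomization is also legitimate but can leave the groups unevenly sized in small trials.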
A good RCT follows patients for a meaningful length of time. Sometimes, for example in studies of psychotherapy, follow-up should continue well after the active treatment ends. Unfortunately, many RCTs are unrealistically short. A positive result at the end of a short period of treatment does not guarantee a continued positive result even a few months later.
In general, larger sample sizes are better than smaller ones. However, some treatments, such as psychotherapy, are very labor intensive and we may have to accept a smaller sample size.
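As a rough illustration of why sample size matters, the sketch below (an assumption for illustration, not part of this article) runs a standard power calculation: detecting a medium standardized effect with 80% power at a 2-sided significance level of .05 requires roughly 64 patients per arm.

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical power calculation for a 2-arm trial: how many patients
# per arm are needed to detect a standardized effect size (Cohen's d)
# of 0.5 with 80% power at a 2-sided alpha of .05?
n_per_arm = TTestIndPower().solve_power(
    effect_size=0.5, power=0.80, alpha=0.05, alternative="two-sided"
)
print(f"about {n_per_arm:.0f} patients per arm")  # roughly 64
```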
A high drop-out rate also undermines our faith in a study’s results. When patients drop out, studies often “impute” a measurement, that is, assign outcomes for the patients who dropped out. There are many methods of imputation, but all methods depend on untestable assumptions. Many studies impute the last observation for each patient who dropped out; this is termed the last observation carried forward (LOCF). LOCF is a conservative method of imputation for positive results and often underestimates the effect that would have been observed if the patients who dropped out had completed the study. However, LOCF may also underestimate adverse effects that might have developed later in the course of treatment. The best RCTs employ various procedures to try to obtain measurements at the end of the study for every patient, including those who dropped out.
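For readers who want to see what LOCF does mechanically, here is a minimal sketch using pandas; the patients, visits, and scores are invented for illustration:

```python
import pandas as pd

# Hypothetical long-format trial data: one row per patient per visit.
# A missing score marks a visit missed after dropout.
df = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2, 2],
    "visit":   [1, 2, 3, 1, 2, 3],
    "score":   [30, 24, 20, 32, 28, None],
})

# Last observation carried forward: within each patient, fill a
# missing visit with that patient's most recent observed score.
df["score_locf"] = df.groupby("patient")["score"].ffill()
print(df)
```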
A randomized controlled trial should explicitly state the numerical level of statistical significance (usually P = 0.05) and whether the P-value is 1-sided or 2-sided. The investigators should have specified this threshold before starting their trial.
The P-value should be applied to the primary outcome defined by the primary hypothesis and should be made more stringent for secondary outcomes. There are various mathematical methods to adjust the P-value; the most common is the Bonferroni correction. The formula is: adjusted P-value = original P-value ÷ number of comparisons.
For example, if the original P-value was 0.05 and there were 5 secondary outcomes, the adjusted P-value would be: 0.05/5 = 0.01
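In code, the correction is a single division; this sketch simply restates the arithmetic above:

```python
def bonferroni_threshold(alpha, n_comparisons):
    # Divide the overall significance level by the number of
    # comparisons being tested.
    return alpha / n_comparisons

print(bonferroni_threshold(0.05, 5))  # 0.01, matching the example above
```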
However, the adjusted P-value should be applied only to independent events. For example, in a study of response to treatment for depression, it is not legitimate to report statistical significance for both of 2 different rating scales for depression. That would be comparable to studying a drug for weight loss and testing for statistical significance measured in pounds and then also in kilograms. Unfortunately, many RCTs report secondary P-values for outcomes that are not independent.
One should not overvalue statistical significance, which can also be reported as a confidence interval. All a confidence interval tells us is how well we have estimated the mean. That is, if P = 0.05, we can report a 95% confidence interval: a range constructed so that, if the study were repeated many times, 95% of such ranges would contain the true mean. The confidence interval is only an estimate determined by a mathematical formula that depends on the size of the study; a larger study is more likely to find a statistically significant result.
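For readers who want to see the arithmetic, here is a minimal sketch of a 95% confidence interval for a mean difference; the numbers are invented for illustration only:

```python
import numpy as np
from scipy import stats

# Hypothetical per-patient score improvements, used only to show
# the arithmetic of a 95% confidence interval for a mean.
diffs = np.array([2.1, 3.4, -0.5, 1.8, 2.9, 0.7, 1.2, 2.3])
mean = diffs.mean()
sem = stats.sem(diffs)  # standard error of the mean

# t-based interval, appropriate for a small sample
ci_low, ci_high = stats.t.interval(0.95, df=len(diffs) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

Note how the interval narrows as the sample grows: the standard error shrinks with the square root of the sample size, which is why larger studies more easily reach statistical significance.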
Far more important than statistical significance is the effect size: a measure of how much difference the treatment made. There are many ways to report the effect size; for more information, see Dr Moore’s previous piece.
The effect size should be clinically meaningful over a reasonable period of time. Many researchers seem to believe that if an effect is statistically significant, the effect must be clinically meaningful; this is not at all true. Remember, statistical significance refers only to how well we have estimated the mean effect. A mean effect can easily be statistically significant yet not clinically meaningful.
In addition to knowing the mean effect, we would like to know the prediction interval, which tells us how spread out results are. A 95% prediction interval tells us that 95% of the results fall between the lower and upper limits of the prediction interval. We can use this measurement to see how likely it is that a patient will have a clinically meaningful outcome. If the prediction interval has been calculated as the difference between active treatment and placebo and if the prediction interval is normally distributed, then we can compute the number needed to treat (NNT).
Let’s compare the confidence interval and the prediction interval to the clinically meaningful effect size. In the example below, the results are normally distributed. In the graph, “M” designates the mean result of the treatment and “CM” represents the clinically meaningful effect size. Note that the mean result is not clinically meaningful; this is common for many medical treatments.
The green curve represents the distribution of all results of the difference between active treatment and placebo. Let’s eliminate the bottom 2.5% and the top 2.5% under the curve. I have illustrated this by marking in red the bottom 2.5% and the top 2.5% under the curve. Compared with the results with placebo, 95% of the subjects achieved a result between the lower red blot and the upper red blot. This interval is called the 95% prediction interval.
The blue line represents the confidence interval. A mathematical formula estimates that it is 95% likely that the mean result is between the left end and the right end of the blue line. Note that all the blue line is below the clinically meaningful effect size. If we paid attention only to the mean result and the confidence interval, we would decide that the treatment is not worth providing because it would seem that no subjects achieved a clinically meaningful result. However, if we study the prediction interval, we see that an important number of patients did achieve a clinically meaningful result.
I have drawn the graph below such that 20% of the area under the curve lies at and to the right of the clinically meaningful effect size. This means 20% of patients achieved a clinically meaningful result, compared with placebo. So, we would have to treat 5 patients to have 1 achieve a clinically meaningful result. Thus, the NNT would be 5.
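As a sketch of this arithmetic (the mean and spread below are invented so that 20% of the area falls beyond the cutoff, mirroring the figure), one can compute both the prediction interval and the NNT from a normal distribution of treatment-minus-placebo differences:

```python
from scipy.stats import norm

# Hypothetical numbers chosen to mirror the figure: a normal
# distribution of treatment-minus-placebo differences with 20%
# of its area at or beyond the clinically meaningful cutoff.
mean_diff = 5.0   # assumed mean benefit over placebo
sd_diff = 6.0     # assumed spread of individual differences
cm_cutoff = mean_diff + norm.ppf(0.80) * sd_diff  # cutoff with 20% above it

# 95% prediction interval: the middle 95% of individual results
pi_low, pi_high = norm.interval(0.95, loc=mean_diff, scale=sd_diff)

fraction_meaningful = 1 - norm.cdf(cm_cutoff, loc=mean_diff, scale=sd_diff)
nnt = 1 / fraction_meaningful
print(f"95% prediction interval: ({pi_low:.1f}, {pi_high:.1f})")
print(f"fraction with a clinically meaningful result: {fraction_meaningful:.2f}")
print(f"NNT = {nnt:.1f}")  # 1 / 0.20 = 5
```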
Figure 1. Comparing the Confidence Interval and the Prediction Interval
If the prediction interval is narrow, that tells us that patients achieved largely similar results and we can predict that our patient might have a similar result. If the prediction interval is wide, we must question why the results were so inconsistent, and we are less certain how well our patient will respond.
Sadly, very few studies report the prediction interval. Usually, the best we can hope for is the NNT, but this tells us only how likely it is that a patient will obtain a certain effect and does not tell us how spread out (inconsistent) the results are.
There is a lot more to learn about analyzing RCTs, but this article is a starting point. Repeatedly practice using the worksheet and you will grow in your ability to determine if an RCT is reasonably trustworthy and whether it would be appropriate to apply the RCT in deciding whether to offer a particular treatment to a particular patient.
Below are examples of completed worksheets on analyzing RCTs.
Dr Moore is a clinical professor of psychiatry at the Baylor College of Medicine, Temple.
Dr White is an assistant professor of psychiatry at Texas Tech University Health Science Center College of Medicine.