Commentary|Articles|June 30, 2026

Signal in Noise: How Can Artificial Intelligence Benefit Psychiatric Practice?

Brain Trust: Conversations in Psychopharmacology

AI may assist psychopharmacology by aiding suicide risk, digital phenotyping, and TRD care—while exposing EHR data limits and the need for smarter integration.

Joseph F. Goldberg, MD, sat down with Roy Perlis, MD, to discuss the role of artificial intelligence (AI) in clinical psychopharmacology, including its applications in treatment-resistant depression (TRD), suicide risk stratification, digital phenotyping, and the challenges of integrating AI tools into everyday psychiatric practice. Watch the full video podcast here!

Joe Goldberg, MD: Welcome to today's edition of “Brain Trust: Conversations in Clinical Psychopharmacology.” I'm Dr Joe Goldberg, clinical professor of psychiatry at the Icahn School of Medicine at Mount Sinai in New York. It's a real pleasure for me to be joined today by a longtime friend and colleague, Dr Roy Perlis. Our topic for today is the role of artificial intelligence in clinical psychopharmacology and what that means for all of us. I think Roy speaks the language of both science and technology, but also clinical applicability. Roy, thank you for joining us today.

Roy Perlis, MD: It is my pleasure, Joe.

Goldberg: One of the reasons I became a psychiatrist was that at some time in my youth I had the idea that if I was a really good psychiatrist, I could make predictions about the future, about what's going to happen to people. I had this idea in my head that if you really understand everything there is to understand psychiatrically, you'd have some prescient knowledge of what is going to happen. So my first question to you: does the knowledge of artificial intelligence and machine learning give us a chance that if we could really comprehend big data, we would be able to make better predictions about what's going on with the patient diagnostically or with treatment?

Perlis: I would say the last 2 decades have been about watching the technology mature, watching the data mature from an era when we had data on 20 people or 100 people or 200 people, to the STAR*D era when all of a sudden we had several thousand people, to now when we can do these studies across hundreds of thousands or even millions of people with depression. I think we still struggle to find things with AI that, if we thought deeply as clinicians, we could not figure out on our own.

I do think AI is going to change how we practice medicine, but I am also something of a realist. And I know that if I asked you when you see someone in your office, “how do you decide how they're going to respond or fare,” you probably would do about as well as any of the models that we might build. But the question is then so why bother building the models? And I am going to give you 2 answers. One is that even the act of aggregating things we already know can be useful. So you might make predictions based on simple things: how many past medications have they tried or do they have anxiety? Things like that. But it may be harder to quantify. What the models might do for you is put a number on it, which can be helpful.

Goldberg: Like a weighted coefficient? I put a little more emphasis on severity, a little less emphasis on age at onset, etc.

Perlis: Exactly. And the other thing to understand is not everybody can be Joe Goldberg or one of the other experts who we work with. I suspect that one of the other major benefits of these models is to kind of democratize that knowledge so that at least everyone has that knowledge base in treatment-resistant depression, for example. That right now is sort of hard to access. I don't know that AI can do better than a clinician at this point, but we can make more people maybe do as well in terms of predictions.
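
To make the idea of a weighted coefficient that "puts a number on it" concrete, here is a minimal sketch in Python of a logistic-style score. Everything in it is an assumption for illustration: the feature names, the weights, and the intercept are hypothetical and are not taken from any model the speakers describe.

```python
import math

# Hypothetical weights chosen only for illustration; they are not estimates
# from any study or from the models discussed in this conversation.
WEIGHTS = {
    "prior_medication_trials": 0.35,  # count of past medication trials
    "comorbid_anxiety": 0.60,         # 1 if present, 0 if absent
    "severity_z": 0.50,               # standardized symptom severity
}
INTERCEPT = -2.0


def risk_score(features):
    """Combine weighted features into a score between 0 and 1.

    Missing features are treated as 0, which is itself a simplifying
    assumption rather than a recommended way to handle missing data.
    """
    linear = INTERCEPT + sum(
        weight * features.get(name, 0.0) for name, weight in WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


if __name__ == "__main__":
    patient = {"prior_medication_trials": 3, "comorbid_anxiety": 1, "severity_z": 1.2}
    print(round(risk_score(patient), 3))
```

The sketch only aggregates things a clinician might already weigh informally; the weights are where "a little more emphasis on severity, a little less on age at onset" would be encoded.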

Goldberg: Now accolades for anybody aside, artificial intelligence involves training or teaching a system to learn from what's inputted into it, right? So big databases could be taught by who we would say are experts. What do they do and how do we imitate what they do? How do we learn nuances, while big data could also come from zillions of data points that may or may not have accuracy?

I wanted to ask you when you have done studies, for instance, looking at natural language processing based on, say, chart notes—is that as good as the accuracy with which the note taker is detecting something? Does that mean that if you've got a gazillion people, the results will come out pretty much in the wash because you got enough power and descriptors that you can make those generalizations—or are we really at the mercy of how accurate our observations are from which the system is learning?

Perlis: I think you put your finger on the key issue. You know people used to say in the computer world “garbage in, garbage out,” but our notes are what they are. I would argue, and often do find myself having to argue, that they still have face validity, right? A clinician evaluated this patient and the clinician thought they were psychotic. And absent the kinds of very systematic assessments which are still pretty imperfect, I'm inclined to go with face validity—I am inclined to trust that the person seeing this individual thought they were psychotic. But I think it does place something of a ceiling on our ability to make predictions.

That's something I think even people trying to build these models tend to forget, that if it's not in the note it cannot be used; these models are good, but they are not magic. If the clinician doesn't write something down, the model can't necessarily guess that it is there. For example, Tom McCoy, one of my collaborators, and I are building models using narrative notes to predict suicide or suicide risk.1 And the issue is we do not have anything other than what the clinician thought was important to put in the note. So all of the things that you might think about in evaluating someone with depression or anxiety should be in the note. But if it turns out there is some other predictor that we never thought of, if it's not in the note, we're not going to find it. We can only find the things that people write down, and what they tend to write down are the things that someone taught them are important to document in mood disorders.

Goldberg: Does this also get into the power issue, that there will be excellent people and not so excellent people but if you’ve got enough eyes (and in big data we are talking astronomical numbers or at least far more than a few thousand), does that make AI invaluable in the sense that you are getting closer to the population as opposed to the sample? How much does that sort of sample size, so to speak, make AI indispensable to these questions?

Perlis: I think AI lets us make better use of the data that we have. So the predictive models that we use are not all that much better in general, when applied to data at the scale of a few hundred thousand people, than logistic regression or linear regression. And I always sort of chuckle as an editor when I get submissions that now talk about a machine learning approach to X or an AI approach to X—and it is regression. Regression works wonderfully well for a lot of things.

I don't think at this point it is a power question. I think we are coming up against the limits of documentation. For example, we have done some work on using health records to predict treatment-resistant depression, and we don't do very well. It really looks like we are coming up against a ceiling when we look at what's in the health record. It is still better than chance; there is some signal there. But something I hope we can talk about is that instead of pursuing these perfectly predictive models, there is a ton we could do with models that are better than chance in terms of targeting resources. So chasing an AUC of 0.8 for treatment-resistant depression, it isn’t going to happen. On the other hand, to tell me who is high risk or low risk and therefore who should go see the specialist versus who can stay in primary care, I think those are the kinds of tools we could have right now.
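
The distinction between chasing a high AUC and simply sorting patients into higher and lower risk can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the speakers' method: the function names, the toy scores, and the 0.5 referral threshold are all hypothetical, and the AUC function assumes both outcomes are present in the data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (ties count as half).
    Assumes at least one positive and one negative label.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def triage(scores, threshold):
    """Route each patient by a hypothetical risk threshold."""
    return ["specialist" if s >= threshold else "primary care" for s in scores]


if __name__ == "__main__":
    # Toy scores and outcomes invented for illustration only.
    scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
    labels = [1, 0, 1, 1, 0, 0]
    print(auc(scores, labels))
    print(triage(scores, threshold=0.5))
```

An imperfect ranking can still be useful for routing: the triage step needs only a defensible cutoff, not a near-perfect classifier.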

As far as this TRD collaboration goes, I think the problem is the data—whatever the predictors are that would let us do really well, it's not coded clinical data, and it is probably not in the narrative notes. Maybe it's a blood test, maybe it's EEG (although I doubt it is EEG, with apologies to some colleagues). I think that there are probably other biomarkers out there or even other passive measures that would be useful to incorporate that we just don't collect routinely. I do not think the answer is to give me a million electronic health records and I'll build you a better predictor.

Goldberg: Let me throw another wrinkle into this issue of the substrate from which information is extracted. Depending on whose notes and what context one is looking at, I often think about the extent to which notes are fair and accurate portrayals of what the clinician thinks, as opposed to written from the standpoint of things like medical necessity or procuring ongoing support, say, to keep somebody in a certain level of care. Back in the days when I would do inpatient work, I always marveled that the patient continues to deny suicidal intent but we are not quite sure and they're still at risk, but then they’re getting released and the SI is gone from their chart. Or if there's a medication I think could be very beneficial for somebody, but it's not reimbursable under a particular diagnosis or procedure code; you get my sense as to how much notes get written for reasons other than clinical data.

Perlis: So, remember that Epic was billing software. Epic evolved from billing software that was intended to be sure that hospitals could collect as much as possible from a given service. So it follows that the notes in many ways are serving multiple purposes, only one of which is clinical communication. I would like to think there is still signal in there. But yes, between medical legal and billing and all the other things that these notes are expected to do, the amount of signal in all that noise is getting smaller.

Since we're talking about AI, a preview of coming attractions is that with ambient scribes increasingly part of clinical care (where an AI scribe records the visit, transcribes the visit, and generates the note), something we're seeing is an incredible amount of note bloat, where the length of notes in the electronic health record is growing, but more and more of it is the kind of writing that you will be familiar with from ChatGPT, where it's sort of empty or not especially specific. We may soon be at a point where you actually need AI to read the AI-generated notes and pull that signal out.

Goldberg: It may not be necessarily all the face value documented notes and it may not be an EEG or a traditional biomarker, but if we're thinking big in AI and machine learning, what about digital phenotyping? In other words, could AI help refine what is actually going on in ways that the observer may not even realize, but something like a smartphone could help?

Perlis: I am a tremendous believer in these easily collected passive measures. I think when someone suddenly stops using their phone or starts using their phone more or starts using their phone at 2 in the morning, that's telling you something. I think the key insight over the last few years is that people have finally shifted from “we're going to build a model that looks at your phone data and says you are depressed” to “everyone is different, what we really need to know is: is this different for you?” In other words, I think it's hard to look at somebody's phone use and say they are manic because some of us use our phones too much, some of us use them very little. I think it is way easier and probably more valuable to say: I notice you usually use your phone for about 15 minutes a day, but you are using your phone for 5 hours a day now. That doesn't necessarily tell me that you are manic, but it sure tells me something is going on. You do not need fancy AI to make those kinds of predictions, right? Actually, we were talking about linear regression or logistic regression, and simple linear classifiers would actually do pretty well in saying something's going on.
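
The "is this different for you?" framing can be sketched without any elaborate model. The Python below compares today's phone use with the same person's own history, using a simple z-score as a stand-in for whatever method a real system would use. The function name, the cutoff of 3 standard deviations, and the sample numbers are assumptions made for illustration.

```python
from statistics import mean, stdev


def deviates_from_baseline(history_minutes, today_minutes, z_cutoff=3.0):
    """Flag a day that is unusual relative to this person's own history.

    history_minutes needs at least two values so a spread can be computed.
    The cutoff is a hypothetical choice, not a validated clinical threshold.
    """
    baseline = mean(history_minutes)
    spread = stdev(history_minutes)
    if spread == 0:
        # No variation in the history: any change at all is a departure.
        return today_minutes != baseline
    z = (today_minutes - baseline) / spread
    return abs(z) >= z_cutoff


if __name__ == "__main__":
    usual_week = [14, 15, 16, 15, 13, 17, 15]  # about 15 minutes a day
    print(deviates_from_baseline(usual_week, 300))  # 5 hours today
    print(deviates_from_baseline(usual_week, 16))   # an ordinary day
```

The flag says only that something has changed for this person; it does not say what, which matches the point that a change in use is a signal rather than a diagnosis.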

I absolutely believe in the passive data. The trick with passive data is linking it to the rest of our health system in a way that is actually useful to clinicians and patients. Years ago, people started bringing in their Fitbit data or their sleep data and they hand me these printouts and so on. And it's great that people are taking that much responsibility, but I think the real breakthrough will be when I sit down at a visit, I open whatever the health record is and there's a red light because someone's phone use has changed substantially or someone's number of steps per day has changed substantially.

In other words, the problem is integration. We have the data. We have the tools to process it. It is how do you get it into the record in a way that is helpful to patients, preserves their privacy, and does not create an undue burden on the system to respond to all sorts of false alarms.

Goldberg: You make me think also about if there's a way to know erratic adherence. If a patient did not fill their prescription last month, could this be withdrawal that you're identifying as agitation versus something else? Ideally, it would be nice to have some way to capture all these elements and be able to make inferences.

Perlis: Years ago, I was giving a talk on some of our first natural language processing stuff at the Broad and a fairly senior scientist stood up and said, “But Dr Perlis, is this really science or is this engineering?” And the reality is at this point these are really important questions of engineering, in the sense that the data are there, the technology is there, the integration with clinical practice is not there. That I think is where the real gap is.

Goldberg: Now, let me ask you a question about what people like you and I would call hypothesis generating vs hypothesis testing kinds of questions. There's been much in the news lately about the FDA Adverse Event Reporting System (FAERS), this big database where anything that gets reported in pharmacovigilance gets identified, and there has been an uptick in people kind of using that database to report associations of things that may or may not have meaningful connectedness.2 So, if I discover patients that are taking VMAT2 inhibitors seem to have a lot of involuntary movements, I think maybe there's a causal effect, as opposed to seeing the treatment is failing to remedy the problem.

My question is how much does having a hypothesis a priori in mind need to be on the minds of anyone who's reading the literature or dabbling in the AI world? They might read something somewhere about an association between this drug and this consequence but it is not really a consequence, and some big database says patients that take drug X seem to have more of Y consequence—one could be misled into a causal connection as opposed to an association.

Perlis: The first point I would make is that, to their great credit, our colleagues at FDA have been very aware of the limitations of FAERS for a long time and have encouraged efforts to develop ways of doing similar work using electronic health records that are a lot less susceptible to certain kinds of reporting biases. In a sense we're collecting much of this data in real time every time we start someone on a new medicine. So I think FAERS has its place, but as you point out there are a lot of challenges in interpreting the data in FAERS. I think, though, that the underlying difficulty of confounding by indication is still a problem with electronic health records. We have a little bit more control over it because we know more about people, so we can adjust for the indication for the medicine.

The example that comes to mind is you look at the adverse event profile for psychiatric medications and they have a lot of psychiatric adverse events associated with them. That's because perhaps the symptoms that are being treated are what is being reported, or we're looking at them in a population that is at greater risk because of their illness for certain kinds of outcomes. So I think confounding by indication is a tremendous problem. I think it's important to consider the plausibility when you are doing pharmacovigilance work and ideally to have a hypothesis.

I think people get a little bit confused about why we do pharmacovigilance. One reason to do pharmacovigilance is that when a drug is marketed, we don't necessarily know everything that we're going to come to know about it. So, we need some way to continue to learn. There is a need for unbiased or non-hypothesis-driven surveillance. That said, I think in electronic health records in particular, I can do an analysis one way and show you that this drug increases risk and then I can turn some knobs and all of a sudden it decreases risk. People do not necessarily understand how susceptible these approaches are to bias. We need both, but we need to be really careful about what we're doing. And for the surveillance part, we need an appropriate level of skepticism, which unfortunately seems to be lacking lately.
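
The way an analysis can show increased risk one way and decreased risk another can be sketched with a toy example of confounding by indication. The counts below are invented purely for illustration and describe no real drug or dataset; the only "knob" turned here is whether the comparison is made within levels of illness severity.

```python
# Invented counts for illustration only: (events, patients) in each group.
# The drug is given mostly to severely ill patients, who have more events
# regardless of treatment.
COUNTS = {
    "severe": {"treated": (24, 80), "untreated": (8, 20)},
    "mild": {"treated": (1, 20), "untreated": (8, 80)},
}


def risk_ratio(treated, untreated):
    """Risk in the treated group divided by risk in the untreated group."""
    treated_events, treated_n = treated
    untreated_events, untreated_n = untreated
    return (treated_events / treated_n) / (untreated_events / untreated_n)


def crude_risk_ratio(counts):
    """Ignore severity and pool everyone together."""
    treated = tuple(sum(s["treated"][i] for s in counts.values()) for i in (0, 1))
    untreated = tuple(sum(s["untreated"][i] for s in counts.values()) for i in (0, 1))
    return risk_ratio(treated, untreated)


def stratified_risk_ratios(counts):
    """Compare treated and untreated patients within each severity level."""
    return {name: risk_ratio(s["treated"], s["untreated"]) for name, s in counts.items()}


if __name__ == "__main__":
    print(crude_risk_ratio(COUNTS))        # above 1: drug looks harmful
    print(stratified_risk_ratios(COUNTS))  # below 1 in both strata
```

In this toy setup the pooled comparison makes the drug look harmful, while the comparison within each severity level points the other way, because severity drives both who gets treated and who has events.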

Goldberg: So, let's take a step back. You have been working in this space and contributing to it long enough to tell us something about are there viable candidate markers to predict, let us say, antidepressant response—or if I was going to ask that question of ChatGPT, what are the known predictors of response—is there an algorithm that's going to be out there at some point that the average practitioner might leverage AI to help them think about anticipating responsivity in a given patient?

Perlis: Where I think it can be helpful at this point is just making sure we haven't missed anything; is my diagnosis right? From all the years treating people with TRD, sometimes it is treatment-resistant because that's not the right diagnosis. Likewise with bipolar: sometimes it's treatment-resistant because we are treating the wrong thing. I think what AI can be helpful for is making sure we haven't missed something. Did we look for symptoms of hypomania? Did we look for substance use? What's going on with their thyroid?

I think AI is a little bit like an autopilot on a plane, where hopefully the person flying the plane does know how to fly the plane, but the autopilot makes sure that you do not do something dumb or inadvertently try to take off without the right settings in place. To me, that's a lot of the initial value, less so than the prediction. I do think there are maybe opportunities to identify who's much less likely to come back for a follow-up visit or much less likely to be adherent to treatment. So in our work we have actually sort of moved away from truly predicting response as a biological outcome and more trying to predict the aspects of care that will contribute to response.

One of the things I wanted to mention earlier is that when we started doing treatment-resistant depression work in health records, naively I thought that people in the real world might look like they do in clinical trials, where you come in, you get started on a medicine or some other intervention, and 8 weeks later you're either better or you're still depressed. In health records and the real world, that's not how it looks. What we found when we started doing that work was most people ended up somewhere in between. That makes it hard to look for predictors because I thought we would build a classifier of either remitted or still depressed, but really we end up with a few remitters, a few still depressed, and everybody else kind of in the middle. That's not ideal when you are trying to build a classifier, especially because we don't really do a great job of documenting ongoing symptoms, to your earlier point. The point is it's often hard to build things that predict that middle status, but that is where a lot of patients seem to end up.

Continue reading: see part 2 of this discussion here.

Dr Goldberg is a clinical professor of psychiatry at The Icahn School of Medicine at Mount Sinai in New York, NY and the immediate-past president of the American Society of Clinical Psychopharmacology.

Dr Perlis is chair and professor of psychiatry at Harvard Medical School and Massachusetts General Hospital, director of the Center for Quantitative Health at Massachusetts General Hospital, and editor in chief for artificial intelligence at JAMA Network Open.

References

1. McCoy TH, Perlis RH. Reasoning language models for more transparent prediction of suicide risk. BMJ Ment Health. 2025;28(1):e301654.

2. Al-Faruque F. FDA consolidates adverse events reporting systems. Regulatory Affairs Professionals Society. March 11, 2026. Accessed June 30, 2026. https://www.raps.org/resource/fda-consolidates-adverse-events-reporting-systems.html