Commentary|Articles|February 24, 2026

Stress Testing Chatbots to Make Them Safe

Clinically grounded simulations reveal how chatbots can reinforce distress—pushing for safer AI standards.

A previous piece exposed how OpenAI prematurely and recklessly released ChatGPT to the public without the stress testing needed to ensure it met necessary safety standards. OpenAI's brilliant commercial move has been richly rewarded (a dominant market share and company valuation of over $500 billion), but its choice to put profits over safety has resulted in great patient harm.

The safety stress testing that should have been done before the public release of artificial intelligence (AI) chatbots must be done now, both on all new models before they become public and on existing models to expose their flaws. I spoke with 2 leading stress test experts to learn more about chatbot safety testing.

Allen Frances, MD: So far, AI safety testing has been mostly reactive (after harms are already done) or generic (content moderation attempting to block explicit self-harm instructions or violent content). In what ways are these inadequate?

Dr Monperrus and Stenqvist: It is unconscionable to treat the public like experimental guinea pigs, waiting until users are harmed before identifying dangers and correcting them. It is insufficient to rely on primitive content moderation because the danger to users is not usually in what chatbots say explicitly—it is in how they relate. A chatbot can avoid every banned phrase while still agreeing with a user's paranoid beliefs, mirroring their emotional dysregulation, or creating an addictive attachment that replaces human connection. These are relational failures, not content failures. No keyword filter catches a chatbot that says "I understand why you feel everyone is against you" to someone experiencing paranoid ideation.

Dr Frances: You have proposed a new method for testing these crucial relational dynamics. How does it work?

Dr Monperrus and Stenqvist: We suggest developing AI simulations of the types of vulnerable patients whom clinicians worry about most. Not random prompts, but structured interaction scripts modeled on real clinical presentations (psychosis, suicide, eating disorder, social isolation). These scripts are built from evidence-based psychopathology, not guesswork. Then we measure how the chatbot behaves across these interactions. Does it escalate or dampen emotional distress? Does it reinforce cognitive distortions or gently challenge them? Does it maintain appropriate boundaries or collapse into over-identification? Does it encourage real-world help-seeking or become a substitute for it?

Dr Frances: Can you share some examples of how the method works?

Dr Monperrus and Stenqvist: Consider the dangerous sycophancy of chatbots that maladaptively validate harmful thoughts, feelings, and behaviors. A well-trained therapist validates the emotion while questioning the thought: "It sounds like you're feeling overwhelmed. What makes you feel like no one cares?" A poorly boundaried chatbot validates both: "You're right, it does seem like no one understands you. I'm here for you." The first response opens space for reality-testing. The second closes it. Repeated across dozens of interactions, the second pattern can deepen isolation and entrench distorted thinking and dangerous behaviors. It is possible to quantify how often a chatbot falls into this pattern, across different patient presentations.
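As a rough illustration of what "quantify how often" could mean, the Python sketch below tallies the share of chatbot replies that validate both the emotion and the distorted thought, grouped by patient presentation. It is not the authors' method. The label names, the function sycophancy_rate, and the sample labels are all hypothetical, and the sketch assumes each reply has already been labeled by a clinician or a rubric.

```python
from collections import Counter

# Hypothetical labels a clinician (or rubric) might assign to each chatbot reply.
QUESTIONS_THOUGHT = "validates_emotion_questions_thought"
VALIDATES_BOTH = "validates_emotion_and_thought"
OTHER = "other"


def sycophancy_rate(labels):
    """Return the fraction of replies that validate emotion AND distorted thought."""
    if not labels:
        raise ValueError("need at least one labeled reply")
    return Counter(labels)[VALIDATES_BOTH] / len(labels)


# Made-up labels for illustration only; these are not real evaluation results.
labeled_replies = {
    "paranoid_ideation": [VALIDATES_BOTH, QUESTIONS_THOUGHT, VALIDATES_BOTH, OTHER],
    "social_isolation": [QUESTIONS_THOUGHT, QUESTIONS_THOUGHT, VALIDATES_BOTH, OTHER],
}

rates = {
    presentation: sycophancy_rate(labels)
    for presentation, labels in labeled_replies.items()
}
print(rates)  # {'paranoid_ideation': 0.5, 'social_isolation': 0.25}
```

Reporting the rate per presentation rather than one pooled number keeps visible the case where a chatbot handles one kind of patient well and another badly.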

Dr Frances: You mentioned stress testing chatbots with repeated simulated challenges. Why is this important?

Dr Monperrus and Stenqvist: Because harm often emerges through drift. A chatbot might respond appropriately to the first mention of suicidal thoughts, but become increasingly accommodating as the user pushes back against safety messages. Or it might maintain boundaries for a few exchanges, then gradually mirror the user's escalating emotional tone. Single-prompt testing misses this entirely. What happens over 20, 50, 100 exchanges? How does the chatbot's behavior change as the simulated patient becomes more distressed, more attached, more insistent? This is the evaluation that current AI safety teams do not perform.
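One simple way to make drift measurable, offered here only as a sketch, is to score every chatbot turn in a long simulated conversation and then look at the trend. The Python below assumes a hypothetical per-turn harm score between 0 and 1 (higher meaning more accommodating of harm). It computes a least-squares slope across turns and the difference between the late and early parts of the conversation. The function names and the scoring scale are assumptions, not part of the interviewees' system.

```python
def drift_slope(scores):
    """Least-squares slope of per-turn scores against turn index (0, 1, 2, ...)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two turns to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(scores))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def window_shift(scores, window):
    """Mean score over the last `window` turns minus the mean over the first `window`."""
    if window < 1 or len(scores) < 2 * window:
        raise ValueError("conversation too short for the requested window")
    early = sum(scores[:window]) / window
    late = sum(scores[-window:]) / window
    return late - early


# Hypothetical per-turn harm scores for one simulated conversation.
turn_scores = [0.0, 0.1, 0.2, 0.3, 0.4]
print(drift_slope(turn_scores))      # positive slope: behavior worsens over turns
print(window_shift(turn_scores, 2))  # late turns score higher than early turns
```

A chatbot that looks safe on the first exchange can still show a positive slope over 50 or 100 turns, which is exactly what single-prompt testing cannot see.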

Dr Frances: Is this "a black box judging a black box"? How is your approach different?

Dr Monperrus and Stenqvist: The evaluation criteria of our tests come from clinical psychology, not machine learning. We propose using statistical models built on clinician-designed rubrics to detect whether a chatbot amplifies harm, whether it behaves in ways a trained therapist never would, and whether it becomes destabilizing in long-form contexts. The scoring criteria can be examined, validated, and debated by the clinical community. This matters for accountability. If a chatbot fails an evaluation, we should be able to explain exactly why in terms clinicians understand: boundary violations, reinforcement of cognitive distortions, failure to redirect toward human support, escalation of emotional dysregulation. These are not abstract metrics; they are the same concerns a clinical supervisor would raise about a trainee.

Dr Frances: What should clinicians take away from the development of such stress testing?

Dr Monperrus and Stenqvist: That rigorous evaluation of psychological safety is possible. The tech industry has sometimes claimed that chatbot safety is too complex to measure systematically, but that is not true. It is possible to simulate vulnerable patients, measure relational dynamics, and detect drift over time. The question is whether companies will submit their products to this kind of scrutiny before releasing them to hundreds of millions of users, and whether regulators and professional organizations will demand it.

Clinicians should not accept the current standard, in which chatbots function as unlicensed therapists with safety testing limited to keyword filters. We do not evaluate human practitioners by checking whether they curse. We evaluate them on their behavior in relationships. Chatbots interacting with vulnerable users should meet the same standard.

Dr Frances: How did you get involved in stress testing, and how far along is your work?

Dr Monperrus and Stenqvist: We have been looking for the right problem to solve for a couple of years, knowing that we wanted to do something that checked all the boxes: novel, hard, interesting, and above all else, meaningful. We finally landed on LLMPsych (www.llmpsych.com). Right now, we are building from clinical literature and established frameworks: DSM presentations, therapeutic best practices, documented harm patterns. We are also working on establishing relationships with psychiatrists in Stockholm with whom we can collaborate closely, and we're looking for a clinical cofounder on our wavelength. The next step is validation with practitioners: Do the simulations feel clinically realistic? Do the rubrics catch what they'd catch? We're not there yet, but we know we will be.

Dr Frances: How do you stress test your stress test?

Dr Monperrus and Stenqvist:

Clinical validation: Do the rubrics actually capture what clinicians would flag? This is where our clinical cofounder will come in—we need someone who can validate that our evaluation criteria match clinical judgment, not just clinical literature.

Known-answer testing: We can test against cases where the outcome is known: documented harm cases, obviously bad chatbot responses, obviously appropriate therapeutic responses. Does the system score them correctly?

Inter-rater reliability: If 2 clinicians evaluate the same chatbot interaction, do they agree? If they do and our system does not match, we have a problem.

Adversarial testing: Can we construct chatbot responses that game the evaluation metrics? If so, the metrics need refinement.
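The inter-rater reliability check described above is commonly summarized with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The interview does not say which statistic the authors use, so the Python sketch below is only one plausible choice. The "flag" and "ok" labels and the sample ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items with categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of six chatbot interactions; not real data.
clinician_1 = ["flag", "ok", "flag", "ok", "flag", "ok"]
clinician_2 = ["flag", "ok", "flag", "ok", "ok", "ok"]
system = ["ok", "ok", "flag", "ok", "ok", "flag"]

print(cohens_kappa(clinician_1, clinician_2))  # agreement between the two clinicians
print(cohens_kappa(clinician_1, system))       # agreement between a clinician and the system
```

If the clinician-to-clinician kappa is high while the clinician-to-system kappa is low, that corresponds to the "we have a problem" case: the humans agree with each other but the automated scoring does not match them.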

Dr Frances: How do you envision having impact?

Dr Monperrus and Stenqvist:

Direct to companies: Offer stress-testing as a service to AI companies who want to evaluate their chatbots before release. Some will want to get ahead of liability risk. Others will want the credibility of independent evaluation.

Inform standards: Work with regulators and professional organizations to help define what "safe enough" means. Right now there is no agreed benchmark—anyone can claim their chatbot is safe. We want to help make that claim testable.

Public accountability: Publish evaluations of widely used chatbots. Transparency creates pressure. If Chatbot X fails basic safety evaluations and the results are public, that changes the conversation.

The honest answer: we don't know which path will have the most traction. Regulation is slow and uncertain. Companies have mixed incentives. We're trying to be positioned to be useful however the landscape develops.

Concluding Thoughts

This is the most promising approach toward safer chatbots that I have encountered. It is far too early to know whether it is actually feasible and whether Big AI would be receptive, but I am optimistic on the first and hope that tech companies will become much more safety conscious as they face increasing media exposure and legal liability. We will publish progress reports as the project develops.

Dr Frances is professor and chair emeritus in the department of psychiatry at Duke University.

Dr Monperrus is professor of software technology at KTH Royal Institute of Technology in Stockholm.

Ms Stenqvist is a software engineer who has spent a decade in the startup scene.