
ChatGPT Health Identified Respiratory Failure. Then It Said Wait.

A landmark study from Mount Sinai Health System in New York City reveals something alarming: AI agents sometimes recommend the exact opposite of what they know to be true. In one case, ChatGPT Health correctly identified respiratory failure but then recommended waiting 24 to 48 hours before going to the ER — instead of immediate care. This isn't a rare glitch. It's a structural problem that affects every consequential decision an AI agent makes.

The Inverted U Problem

The study found that ChatGPT Health handled textbook emergencies extremely well. Classic stroke, severe anaphylaxis — cases any medical student would recognize. It also handled clearly minor conditions reasonably well. But the failures concentrated at the extreme edges of the spectrum.

This pattern is not unique to health AI. It's a fundamental characteristic of how large language models are trained: they perform best exactly where performance matters least, in routine cases that any rules engine could handle. They perform worst on the edges of the distribution — often where stakes are highest.

Your agent's accuracy dashboard might say 87% and everyone celebrates. But the inverted U means those aggregate numbers mask silent failures on the tails of the distributions. In finance, an accounts payable agent processes routine invoices perfectly but misses the duplicate that's been slightly modified. A claims agent handles fender benders fine but can't detect the third claim from the same address in 14 months.
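One way to surface the inverted U is to slice accuracy by case severity instead of reporting a single number. The sketch below is illustrative only; the bucket names and counts are invented, not from the study:

```python
from collections import defaultdict

def accuracy_by_bucket(results):
    """Group eval results by severity bucket and compute per-bucket accuracy.

    `results` is a list of (severity_bucket, is_correct) pairs; bucket names
    are whatever your eval harness assigns (here: low / routine / high).
    """
    totals = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for bucket, correct in results:
        totals[bucket][0] += int(correct)
        totals[bucket][1] += 1
    return {b: c / t for b, (c, t) in totals.items()}

# Invented numbers shaped like the inverted U: strong in the middle, weak on the tails.
results = (
    [("routine", True)] * 87 + [("routine", False)] * 13 +
    [("low", True)] * 7 + [("low", False)] * 3 +
    [("high", True)] * 5 + [("high", False)] * 5
)
overall = sum(c for _, c in results) / len(results)   # ~0.83 looks fine
per_bucket = accuracy_by_bucket(results)              # "high" bucket is a coin flip
```

The aggregate hides that the high-stakes bucket is at 50%, which is exactly the celebration-while-failing pattern the dashboard scenario describes.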

The Knowing-But-Not-Acting Problem

The Mount Sinai researchers found something troubling: the system correctly identified dangerous clinical findings in its reasoning, yet the recommendation contradicted those findings. The reasoning chain said "early respiratory failure" but the output said "wait."

This is not rare from an agent behavior perspective. Research on chain-of-thought faithfulness reveals that reasoning traces and final answers operate as semi-independent processes. Studies show that when researchers insert incorrect reasoning chains, models can still produce correct answers a fraction of the time — confirming the link between stated reasoning and output is much weaker than it might first appear.

Separate research found models failed to update their answers in response to logically significant changes in their reasoning more than 50% of the time. The output may reflect an earlier state in the reasoning chain rather than where the model ultimately landed.

If your compliance screening agent correctly identifies an enhanced-due-diligence jurisdiction in its reasoning trace but outputs something entirely different, calling it a standard risk case, nothing in the output alone will flag the defect. Your customer service agent might identify a known billing-error pattern warranting escalation but recommend a generic 5-to-7-business-day automated review instead.
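A crude but useful eval check is to diff the reasoning trace against the final recommendation and flag knowing-but-not-acting cases. This is a minimal sketch; the keyword lists, field names, and example case are invented for illustration, not taken from the Mount Sinai paper:

```python
# Hypothetical consistency check: flag cases where the reasoning trace names a
# high-risk finding but the final recommendation reads as low urgency.
HIGH_RISK_FINDINGS = {"respiratory failure", "enhanced due diligence", "billing error"}
LOW_URGENCY_MARKERS = {"wait", "standard risk", "automated review"}

def trace_output_mismatch(trace: str, recommendation: str) -> bool:
    """True if the trace identifies a high-risk finding the output downplays."""
    trace_l = trace.lower()
    rec_l = recommendation.lower()
    found_risk = any(f in trace_l for f in HIGH_RISK_FINDINGS)
    downplayed = any(m in rec_l for m in LOW_URGENCY_MARKERS)
    return found_risk and downplayed

case = {
    "trace": "Findings consistent with early respiratory failure.",
    "recommendation": "Wait 24-48 hours and monitor symptoms.",
}
flag = trace_output_mismatch(case["trace"], case["recommendation"])
```

In practice you would use an LLM judge or structured finding extraction rather than keyword matching, but the shape of the check is the same: compare what the agent said it saw against what it told the user to do.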

Social Context Hijacks Judgment

When a family member minimized a patient's symptoms, triage recommendations shifted dramatically: the system was 12 times more likely to recommend less urgent care when someone reported that the patient "looks fine." This is an anchoring bias, and it generalizes well beyond medicine.

Any agent processing inputs that combine structured data with unstructured human language is vulnerable to this. Structured data should drive decisions, but unstructured language creates a framing effect that anchors the response — and the bias is hard to detect because it's subtle. The agent never makes an obviously wrong call in isolation; the shift only shows up when you compare responses across variants of the same input.

In lending, a letter from an employer describing the applicant as a valued longtime employee might shift the risk assessment of an AI agent in banking — not because it contains material financial information but because positive framing biases the output. Critically, without controlled studies using the same scenario with and without that anchoring input, this bias would be invisible on standard evaluations.
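The controlled study the paragraph calls for can be sketched as a paired eval: run the same scenario with and without the anchoring sentence and measure the urgency delta. The harness interface, the stub agent, and the scores below are all invented for illustration:

```python
def anchoring_delta(run_agent, base_scenario: str, anchor: str, n: int = 20):
    """Run a scenario with and without an anchoring sentence; return how far
    the anchor pulled the mean urgency score down. `run_agent` stands in for
    your eval harness's call into the agent (returns a numeric urgency)."""
    base_scores = [run_agent(base_scenario) for _ in range(n)]
    anchored_scores = [run_agent(base_scenario + " " + anchor) for _ in range(n)]
    return sum(base_scores) / n - sum(anchored_scores) / n  # positive = downgraded

# Stub agent for demonstration: drops urgency when it sees reassuring framing.
def stub_agent(scenario: str) -> int:
    return 2 if "valued longtime employee" in scenario else 5

delta = anchoring_delta(
    stub_agent,
    "Loan application with a thin credit file and irregular income.",
    "A letter describes the applicant as a valued longtime employee.",
)
```

A real agent is stochastic, which is why the sketch averages over repeated runs; a nonzero delta on materially irrelevant framing is the signature of anchoring.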

Guardrails That Fire on Vibes, Not Risk

The Mount Sinai team found that crisis intervention systems activated unpredictably. They fired far more reliably when patients described vague emotional distress than when patients articulated a concrete threat of self-harm. The alerts were inverted relative to actual clinical risk.

This is the distinction between testing for the appearance of safety versus testing for actual safety. A security agent monitoring email might flag a message labeled "confidential financial data" as a problem even if it was just a press release about public quarterly results sent to an approved list. Meanwhile, exporting 50,000 individual customer records to a personal Dropbox might sail through, because nothing about it looked alarming on the surface.
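The contrast can be made concrete as two guardrail predicates: one that fires on scary-sounding labels, and one that fires on what the action actually exposes. Field names and thresholds below are invented:

```python
# A "vibes" guardrail fires on surface signals; a risk guardrail fires on
# material exposure. Both are deliberately simplistic to show the inversion.

def vibes_guardrail(event: dict) -> bool:
    """Fires whenever a scary label appears, regardless of actual exposure."""
    return "confidential" in event.get("label", "").lower()

def risk_guardrail(event: dict, record_threshold: int = 1000) -> bool:
    """Fires on bulk records leaving an unmanaged destination."""
    return event["records"] >= record_threshold and not event["destination_managed"]

press_release = {"label": "Confidential financial data", "records": 0,
                 "destination_managed": True}
bulk_export = {"label": "Sync", "records": 50_000,
               "destination_managed": False}
```

The vibes guardrail flags the harmless press release and waves through the bulk export — the same inversion relative to real risk that the crisis-intervention finding describes.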

The Four-Layer Architecture

Mount Sinai's methodology exposed anchoring bias, guardrail inversion and inverted U extremes using factorial design — the same scenario across 16 contextual variations. This controlled variation is domain-general and scales across any number of scenarios.

The key insight: variation types are domain-general. Extreme-risk variations, human-out-of-the-loop variations, and tool-call-failure variations all transfer across any number of domains. What changes is the specific scenario itself — one for finance versus one for health.

This structure allows building a deliberate context library to evaluate against: a reusable set of variation templates, specialized per domain but following the same structural patterns. You can add a stakeholder who minimizes severity, contradictory context, social pressure cues, time pressure elements, hedging qualifiers. Enterprise scenarios can be generated semi-automatically from historical data — processed claims, approved procurements, compliance screenings.
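The factorial design above can be sketched as crossing one base scenario with reusable variation axes. With four two-level axes you get the same 2^4 = 16 contextual variations the study used; the axis names and scenario text here are invented for illustration:

```python
from itertools import product

# Minimal sketch of a factorial context library: one base scenario crossed
# with domain-general variation axes (each axis: absent or present).
BASE = "A claim for water damage is submitted with supporting photos."

VARIATIONS = {
    "stakeholder_framing": ["", "The adjuster notes the claimant seems trustworthy."],
    "contradiction": ["", "The photos are dated two weeks before the reported incident."],
    "time_pressure": ["", "The claimant demands a decision within the hour."],
    "hedging": ["", "The claimant says the damage is 'probably not a big deal'."],
}

def generate_scenarios(base: str, variations: dict) -> list[str]:
    """Cross every variation axis; 4 axes x 2 levels -> 16 scenarios."""
    axes = list(variations.values())
    return [" ".join([base, *[v for v in combo if v]]) for combo in product(*axes)]

scenarios = generate_scenarios(BASE, VARIATIONS)
```

Swapping `BASE` for a procurement or compliance scenario reuses the whole library unchanged, which is what makes the variation axes domain-general.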

The agent is most confident, and most wrong, on extreme cases that look structurally similar to the routine cases it handles well.

Bottom Line

The strongest part of this argument is identifying four concrete failure modes that are not specific to medical AI but apply universally across enterprise agents. The biggest vulnerability: nobody has productized these evaluation frameworks to make them easy. Your eval suite might celebrate 87% accuracy while missing the silent failures on consequential decisions. What should smart readers watch for? Guardrails firing on vibes rather than actual risk is one of the trickiest defects to diagnose — because the system will tell you it's doing well and will appear to have grounds for that assessment.


Your AI agent knows the answer, and sometimes it recommends the opposite thing. That was what I took away from a landmark study from the Mount Sinai Health System in New York City. They were studying ChatGPT Health, a new tool that rolled out a few months ago. It was touted as a more responsible alternative to asking ChatGPT your health questions.

And the idea was that ChatGPT Health would provide responsible medical advice where it mattered. That's what the physicians tested over at Mount Sinai, and what they found is quite concerning. At a high level, the system will recommend doctor visits when you have conditions that don't require it, and the system will under-recommend ER visits when you have urgent conditions that should be seen by an ER. Now, this doesn't happen every single time.

There are many cases in the middle where the system gets it right. But I think the patterns that show up, the ways ChatGPT Health is failing — and this is a very carefully documented report; I'll go ahead and link it underneath this video — are things we can all learn from.

So instead of making this a session where we talk about the defects of ChatGPT Health, I want to make it a conversation about AI agents. And that's why I started with: your agent knows the answer, and sometimes it says the opposite thing. I think in many cases, especially if we've seen reasoning traces, we have seen this happen. I have seen cases where I'm like, "What is this agent talking about?" when I look at the reasoning trace inside the chat window. And then it comes out with something and I'm like, "Oh, that's reasonable." Or, "Oh, that is terrible." But either way, it doesn't match the reasoning trace.

We need to get better at talking about how we value correct action for agents, because, as the health study shows, what we are asking agents to do is getting more and more important. In this case, whether we like it or not, people are making life-or-death decisions with ChatGPT Health. They are deciding whether they should go to the ER for respiratory failure. And by the way, I did not cherry-pick that example.

That is right in the paper. They talk about the fact that ...