ChatGPT Health Identified Respiratory Failure. Then It Said Wait.
By Nate B Jones

A landmark study from Mount Sinai Health System in New York City reveals something alarming: AI agents sometimes recommend the exact opposite of what they know to be true. In one case, ChatGPT Health correctly identified respiratory failure but then recommended waiting 24 to 48 hours before going to the ER rather than seeking immediate care. This isn't a rare glitch. It's a structural problem that affects every consequential decision an AI agent makes.
The Inverted U Problem
The study found that ChatGPT Health handled textbook emergencies extremely well. Classic stroke and severe anaphylaxis are cases any medical student would recognize. It also handled clearly minor conditions reasonably well. The failures concentrated on the atypical presentations at the edges of the distribution: cases that look like neither a textbook emergency nor an obviously minor complaint.
This pattern is not unique to health AI. It's a fundamental characteristic of how large language models are trained: they perform best exactly where performance matters least, in routine cases that any rules engine could handle. They perform worst on the edges of the distribution — often where stakes are highest.
Your agent's accuracy dashboard might say 87%, and everyone celebrates. But the inverted U means those aggregate numbers mask silent failures on the tails of the distribution. In finance, an accounts payable agent processes routine invoices perfectly but misses the duplicate that's been slightly modified. A claims agent handles fender benders fine but can't detect the third claim from the same address in 14 months.
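A minimal sketch of why that 87% can mislead: stratify the same eval results by case type instead of averaging them. The bucket names and counts below are invented for illustration, not taken from the study.

```python
from statistics import mean

# Hypothetical eval outcomes bucketed by case typicality.
# Each entry: 1 = correct triage/decision, 0 = wrong.
results = {
    "routine":            [1] * 170 + [0] * 15,  # ~91.9% correct
    "textbook_emergency": [1] * 46 + [0] * 4,    # 92.0% correct
    "clearly_minor":      [1] * 38 + [0] * 4,    # ~90.5% correct
    "atypical_edge":      [1] * 7 + [0] * 16,    # ~30.4% correct
}

# The dashboard number: one aggregate that looks healthy.
all_cases = [x for bucket in results.values() for x in bucket]
print(f"aggregate accuracy: {mean(all_cases):.1%}")

# The stratified view: the tail bucket is failing silently.
for name, outcomes in results.items():
    print(f"{name:>20}: {mean(outcomes):.1%} over {len(outcomes)} cases")
```

The aggregate lands at 87.0% while the atypical-edge bucket, the one carrying the highest stakes, sits near 30%.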
The Knowing-But-Not-Acting Problem
The Mount Sinai researchers found something troubling: the system correctly identified dangerous clinical findings in its reasoning, yet the recommendation contradicted those findings. The reasoning chain said "early respiratory failure" but the output said "wait."
This is not rare from an agent behavior perspective. Research on chain-of-thought faithfulness reveals that reasoning traces and final answers operate as semi-independent processes. Studies show that when researchers insert incorrect reasoning chains, models can still produce correct answers a fraction of the time — confirming the link between stated reasoning and output is much weaker than it might first appear.
Separate research found models failed to update their answers in response to logically significant changes in their reasoning more than 50% of the time. The output may reflect an earlier state in the reasoning chain rather than where the model ultimately landed.
If your compliance screening agent correctly identifies an enhanced due diligence jurisdiction in its reasoning trace but outputs a standard-risk classification, nothing in your accuracy metrics will surface the contradiction. Your customer service agent might identify a known billing error pattern warranting escalation but recommend a generic 5 to 7 business day automated review instead.
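One way to catch this class of defect is to score the reasoning trace and the final answer separately and flag disagreement. A toy sketch, with hypothetical keyword lists standing in for what would need to be a tuned severity classifier in production:

```python
import re

# Hypothetical cue lists; keyword matching only illustrates the shape of the check.
HIGH_RISK_CUES = [
    r"respiratory failure",
    r"enhanced due diligence",
    r"known billing error",
    r"self[- ]harm",
]
LOW_URGENCY_OUTPUTS = [r"\bwait", r"standard risk", r"automated review", r"monitor at home"]

def trace_output_mismatch(reasoning_trace: str, final_output: str) -> bool:
    """Flag cases where the trace names a high-risk finding but the
    final recommendation reads as low urgency."""
    trace_flags = any(re.search(p, reasoning_trace, re.I) for p in HIGH_RISK_CUES)
    output_calm = any(re.search(p, final_output, re.I) for p in LOW_URGENCY_OUTPUTS)
    return trace_flags and output_calm

trace = "Findings consistent with early respiratory failure; SpO2 trending down."
output = "Recommend waiting 24-48 hours before visiting the ER."
print(trace_output_mismatch(trace, output))  # True: trace and answer disagree
```

The point is not the keywords; it is that the trace and the output get evaluated as two signals, so a contradiction between them becomes a measurable defect instead of an invisible one.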
Social Context Hijacks Judgment
When a family member minimized a patient's symptoms, triage recommendations shifted dramatically. The system was 12 times more likely to recommend less urgent care if someone reported that the patient "looks fine." This is an anchoring bias, and it generalizes well beyond medicine.
Any agent processing inputs that combine structured data with unstructured human language is vulnerable to this. The structured data should drive the decision, but the unstructured language creates a framing effect that anchors the response. The bias is hard to detect because it is subtle: the agent never makes an obviously wrong call in isolation, and the shift only becomes visible when you compare the same case with and without the framing language.
In lending, a letter from an employer describing the applicant as a valued longtime employee might shift the risk assessment of an AI agent in banking — not because it contains material financial information but because positive framing biases the output. Critically, without controlled studies using the same scenario with and without that anchoring input, this bias would be invisible on standard evaluations.
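That controlled comparison is straightforward to mechanize: run the identical scenario with and without the anchoring phrase and measure the shift. A sketch with a deliberately biased toy agent standing in for the real model; both the agent and its 0-5 urgency scoring are invented for illustration:

```python
from typing import Callable

def anchoring_delta(agent: Callable[[str], int], base_scenario: str,
                    anchor_text: str) -> int:
    """Run the same scenario with and without an anchoring phrase and
    return the shift in the agent's urgency score (0 = no shift)."""
    return agent(base_scenario) - agent(base_scenario + " " + anchor_text)

def toy_agent(text: str) -> int:
    """Toy stand-in agent: scores urgency 0-5, and drops its score
    whenever reassuring framing language appears in the input."""
    score = 4 if "chest pain" in text else 1
    if "looks fine" in text or "valued longtime employee" in text:
        score -= 2  # the anchoring failure the harness should expose
    return score

delta = anchoring_delta(
    toy_agent,
    "62-year-old with chest pain and diaphoresis.",
    "A family member says the patient looks fine.",
)
print(delta)  # a positive delta means the framing shifted the call toward less urgency
```

In a real harness the agent would be a model call and the anchor texts would come from a library of framing phrases, but the measurement is the same paired difference.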
Guardrails That Fire on Vibes, Not Risk
The Mount Sinai team found crisis intervention systems activated unpredictably. They fired far more reliably when patients described vague emotional distress than when they articulated a concrete threat of self-harm. The alerts were inverted relative to actual clinical risk.
This is the distinction between testing for the appearance of safety versus testing for actual safety. A security agent monitoring email might flag a message labeled "confidential financial data" even if it is just a press release about public quarterly results sent to an approved list. Meanwhile, exporting 50,000 individual customer records to a personal Dropbox sails through because nothing about it pattern-matches to danger: it has the appearance of safety.
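The distinction can be made concrete by testing the guardrail's trigger condition directly. A toy sketch contrasting a label-based check with a behavior-based one; the event fields, domains, and thresholds are all illustrative:

```python
def naive_guardrail(event: dict) -> bool:
    """Fires on how the event is labeled, not on what it does."""
    return "confidential" in event.get("subject", "").lower()

def risk_guardrail(event: dict) -> bool:
    """Fires on actual exposure: record volume plus an external personal destination."""
    return event.get("records", 0) > 10_000 and event.get("dest", "").endswith("dropbox.com")

press_release = {"subject": "CONFIDENTIAL financial data",
                 "records": 0, "dest": "press-list.example.com"}
bulk_export = {"subject": "sync",
               "records": 50_000, "dest": "personal.dropbox.com"}

# The label-based check is inverted relative to real risk; the behavior-based one is not.
print(naive_guardrail(press_release), naive_guardrail(bulk_export))  # True False
print(risk_guardrail(press_release), risk_guardrail(bulk_export))    # False True
```

An eval suite that only feeds the guardrail scary-sounding labels will score the naive version highly; only scenarios built around actual exposure reveal the inversion.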
The Four-Layer Architecture
Mount Sinai's methodology exposed anchoring bias, guardrail inversion and inverted U extremes using factorial design — the same scenario across 16 contextual variations. This controlled variation is domain-general and scales across any number of scenarios.
The key insight: variation types are domain-general. You can have extreme risk variation types, human out-of-the-loop variation types, tool call fail variation types that scale across any number of domains. What changes is the specific scenario itself — one for finance versus health.
This structure allows building a deliberate context library to evaluate against: a reusable set of variation templates specialized per domain, all following similar structural patterns. You can add a stakeholder who minimizes severity, contradictory context, social pressure cues, time pressure elements, hedging qualifiers. Enterprise scenarios can be generated semi-automatically from historical data: processed claims, approved procurements, compliance screenings.
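The factorial structure itself is a few lines of code. A sketch assuming four binary variation axes (the axis names, texts, and base scenario are invented), which happens to yield the same 16-variant grid the study used:

```python
from itertools import product

# Domain-general variation axes (hypothetical); each is on/off,
# so four binary axes produce a 2^4 = 16-variant factorial grid.
AXES = {
    "stakeholder_minimizes": "A stakeholder insists the situation is not serious.",
    "time_pressure":         "A decision is demanded within the hour.",
    "contradictory_context": "An attached note contradicts the structured data.",
    "hedging_qualifiers":    "Key facts are phrased with 'maybe' and 'possibly'.",
}

def factorial_variants(base_scenario: str) -> list[str]:
    """Expand one scenario into every on/off combination of the axes."""
    variants = []
    for flags in product([False, True], repeat=len(AXES)):
        extras = [text for (name, text), on in zip(AXES.items(), flags) if on]
        variants.append(" ".join([base_scenario, *extras]))
    return variants

variants = factorial_variants("Invoice #9913 matches a payment made 12 days ago.")
print(len(variants))  # 16 variants of the same underlying case
```

Swapping in a finance or health scenario changes only the base string; the axes, and any metrics computed over the grid, stay the same.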
The agent is most confident, and most wrong, on extreme cases that look structurally similar to the routine cases it handles every day.
Bottom Line
The strongest part of this argument is identifying four concrete failure modes that are not specific to medical AI but apply universally across enterprise agents. The biggest vulnerability: nobody has productized these evaluation frameworks to make them easy. Your eval suite might celebrate 87% accuracy while missing the silent failures on consequential decisions. What should smart readers watch for? That guardrails firing on vibes rather than actual risk is one of the trickiest defects to diagnose, because the system will tell you it's doing well, and on aggregate metrics it will appear to be.