15,000 patients, one verdict: the clinical autopsy of healthcare chatbots

Fifteen thousand user reviews, one industry, and a pattern of failures that regulation still hasn't caught up to.

On June 27, 2026, a research team posted to arXiv the largest patient-side audit of consumer healthcare chatbots ever assembled: 15,000 user reviews, coded by failure type, mapped to clinical scenario, cross-referenced against the apps that produced them. The dataset is not a survey. It is an autopsy. And the verdict it returns is not ambiguous. Healthcare chatbots, deployed at scale across triage, mental health, and chronic care, are failing in patterns that are systematic, predictable, and — given the existing regulatory architecture — essentially invisible to the agencies nominally responsible for patient safety.

The paper, posted under the working title "Systemic Breakdowns in Consumer Healthcare Chatbots: A 15,000-Review Audit," draws from public app store reviews, Reddit threads, patient forums, and complaint databases filed between January 2023 and March 2026. It codes each review against a 41-category failure taxonomy. The headline finding is not that chatbots fail. It is that they fail the same way, in the same places, often in the same words, year after year, while the industry that produces them continues to ship.

The 15,000-Review Autopsy: How the Dataset Was Actually Built

The dataset is not a survey in any conventional sense. The research team — a group of clinical informatics researchers and one industry analyst — pulled every public review mentioning a named healthcare chatbot across the Apple App Store, Google Play, and a curated set of patient subreddits and chronic illness forums. Reviews shorter than 20 characters and reviews posted by accounts with fewer than three prior posts were excluded. The remaining 15,184 reviews were coded by two independent annotators against a 41-category failure taxonomy. Inter-annotator agreement, measured on a 1,200-review pilot, was 0.81 (Cohen's kappa). The full methodology is in the arXiv preprint.
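For readers who want the mechanics, the exclusion rules and the agreement statistic described above can be sketched in a few lines of Python. This is a minimal illustration, not the research team's pipeline: the function names, the `prior_posts` field, and the toy labels are assumptions made for the example, and the preprint remains the reference for how the coding was actually done.

```python
from collections import Counter

MIN_CHARS = 20        # article: reviews shorter than 20 characters were excluded
MIN_PRIOR_POSTS = 3   # article: accounts with fewer than three prior posts were excluded


def keep_review(text: str, prior_posts: int) -> bool:
    """Return True if a review passes both exclusion rules (hypothetical helper)."""
    return len(text) >= MIN_CHARS and prior_posts >= MIN_PRIOR_POSTS


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels, invented for illustration only.
annotator_1 = ["triage", "triage", "mental", "mental"]
annotator_2 = ["triage", "triage", "mental", "triage"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa discounts the agreement two annotators would reach by chance, which is why a value of 0.81 is a stronger claim than a raw match rate of 81% would be.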

What makes the dataset unusual is not its size. It is its grain. Most chatbot evaluations are lab-style: a clinician sits with the bot, runs scripted scenarios, scores the output. The 15,000-review dataset captures the opposite condition: a sick person, alone, on a phone, at 2 a.m., trying to decide whether to call a doctor. The reviews describe what happens in that room. They are biased — by definition, they are the users loud enough or distressed enough to write a review — but the bias is informative. The complaint base is the population that interacted with the system and exited unsatisfied. In clinical terms, it is the denominator that hospitals never see.

Where the Failures Cluster: Triage, Mental Health, and Chronic Care

Three failure clusters account for 68% of the 15,184 complaints. The first is triage. Users describe asking chatbots, explicitly or by symptom description, whether a condition requires an emergency room, an urgent care visit, a same-day appointment, or self-care at home. Across the dataset, chatbots under-triage in roughly one in seven high-acuity presentations. The recurring pattern: chest pain, shortness of breath, and sudden unilateral weakness are interpreted as anxiety, indigestion, or fatigue. In several reviews, users describe arriving at an ER hours after the chatbot told them to rest and hydrate. None of the reviews that surfaced reports a death, and the audit does not claim to capture mortality, but the pattern is consistent enough that the researchers flag it as a category-1 concern.

The second cluster is mental health. Crisis-line chatbots, eating-disorder chatbots, and the "wellness" modes of general-purpose medical bots produce the highest density of complaints per interaction. Users in active crisis report being routed to breathing exercises, generic affirmation scripts, or — in the most cited failure mode — being asked to subscribe to a premium tier before receiving a referral to a human crisis line. The taxonomy treats this as a severity-zero failure. The reviewers tend to agree.

The third cluster is chronic care. Diabetes, autoimmune disease, multiple sclerosis, and chronic pain account for the bulk of these reviews. The complaints are not that the chatbot fails to diagnose (most chronic care patients have already been diagnosed) but that the chatbot contradicts their established care plan. Medications in the plan are flagged as conflicting. Dosing instructions differ from what a specialist prescribed. Lifestyle guidance contradicts what a registered dietitian provided. The chatbot is not authoritative. It is confident. The combination is what produces the complaints.

The Hallucination Tax: When Confident Wrong Answers Meet Sick Patients

The phrase "hallucination tax" is the paper's. It refers to the cost asymmetry between a confident wrong answer and a hesitant correct one. In most deployed chatbot systems, the model is calibrated to produce fluent, declarative responses. It does not say "I am uncertain." It does not say "this is outside my competence." It says, with the same syntactic confidence it would use to tell a user the capital of France, that metformin should not be taken with grapefruit, or that a 600 mg dose of ibuprofen is appropriate for a six-year-old. Both statements are wrong. Both are delivered with no visible hesitation.

The audit counts 2,341 reviews that describe a specific factual error traceable to the bot's response — wrong dosage, wrong contraindication, fabricated drug interaction, invented clinical guideline. The number is a lower bound. Most users lack the clinical training to recognize a hallucination when they see one. The 2,341 are the errors loud enough to be identified by laypeople after the fact. The true rate is higher, and the paper says so.
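To put the count in proportion, and to show what "lower bound" means in practice, here is a small arithmetic sketch. The two totals come from the audit as reported above. The detection rates are purely hypothetical placeholders, not figures from the paper, chosen only to show how quickly the implied number grows if laypeople catch only some of the errors they encounter.

```python
TOTAL_REVIEWS = 15_184          # coded reviews, as reported in the audit
FACTUAL_ERROR_REVIEWS = 2_341   # reviews describing a specific factual error

# Share of all coded reviews that describe an identifiable factual error.
observed_share = FACTUAL_ERROR_REVIEWS / TOTAL_REVIEWS


def implied_errors(observed: int, detection_rate: float) -> float:
    """Errors implied if only `detection_rate` of them are ever recognized."""
    if not 0 < detection_rate <= 1:
        raise ValueError("detection_rate must be in (0, 1]")
    return observed / detection_rate


print(f"observed share: {observed_share:.1%}")
# HYPOTHETICAL detection rates, for illustration only; the paper gives no such figure.
for rate in (1.0, 0.5, 0.25):
    print(f"detection rate {rate:.0%}: about {implied_errors(FACTUAL_ERROR_REVIEWS, rate):,.0f} errors")
```

A detection rate of 100% reproduces the reported count, and anything lower pushes the implied total up, which is the sense in which 2,341 is a floor rather than an estimate.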

The tax is paid by the patient. A clinician who issues the same wrong dosage faces a malpractice suit, a board review, and a record. The chatbot faces an app store rating adjustment. The asymmetry is not an accident. It is the product of a deployment model in which the cost of failure is externalized to the user and the cost of caution is internalized by the vendor — in the form of shorter conversations, lower engagement metrics, and worse retention. The market rewards the bot that sounds right. The patient absorbs the bill when it isn't.

Why Healthcare Shipped Anyway: The Economics of Unregulated AI

The economic logic is straightforward and not new. A triage call to a nurse-staffed line costs a payer between $15 and $45. A chatbot interaction costs between $0.02 and $0.40 in inference compute. A per-interaction cost reduction of roughly 97% to more than 99%, applied across the U.S. volume of low-acuity patient contacts, represents a market the consultancies now estimate at $4.1 billion annually. That is the market the chatbot vendors are chasing. It is also the market that explains why shipping preceded validation. The marginal dollar in healthcare AI goes to the company that deploys first and apologizes later, if at all.
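The per-interaction figures above imply a range of savings rather than a single number, and the range is easy to check. The sketch below uses only the four dollar amounts quoted in this section; the variable and function names are illustrative, and it ignores every cost other than the per-contact price (integration, oversight, liability), which the article does not quantify.

```python
NURSE_LINE_COST = (15.00, 45.00)   # USD per triage call (low, high), as quoted above
CHATBOT_COST = (0.02, 0.40)        # USD per chatbot interaction (low, high), as quoted above


def reduction(old_cost: float, new_cost: float) -> float:
    """Fractional cost reduction when moving from old_cost to new_cost."""
    return 1 - new_cost / old_cost


# Smallest saving: cheapest nurse call replaced by the most expensive bot interaction.
smallest = reduction(NURSE_LINE_COST[0], CHATBOT_COST[1])
# Largest saving: most expensive nurse call replaced by the cheapest bot interaction.
largest = reduction(NURSE_LINE_COST[1], CHATBOT_COST[0])

print(f"{smallest:.1%} to {largest:.2%}")  # prints "97.3% to 99.96%"
```

Even at the unfavorable end of both ranges the saving stays above 97%, which is why the incentive to ship does not depend on which estimate is closer to the truth.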

The labor side compounds the pressure. The clinical workforce that would normally staff a triage line, a crisis chat, or a chronic care coaching program is constrained. The OpenAI–Uber India leadership move reported in TechCrunch this week is one data point in a broader pattern: the operational talent being recruited to scale consumer health deployments is being pulled from ride-share logistics, not from clinical operations. The result is an industry optimized for throughput, not for the failure modes its product is producing.

The Regulatory Vacuum: Why FDA, HHS, and HIPAA Can't See the Bodies Yet

Three agencies have nominal jurisdiction over parts of the healthcare chatbot stack. None of them is currently equipped to act on the dataset. The FDA's software-as-a-medical-device framework covers chatbots that explicitly market themselves as diagnostic or treatment-planning tools. The large consumer-facing products in the audit are positioned as "informational" or "wellness" tools, a category the FDA's 2024 guidance explicitly excludes from active review. The manufacturers claim informational status. The marketing claims therapeutic utility. Both cannot be true, and the FDA has not forced a resolution.

HHS, through the Office for Civil Rights, enforces HIPAA. The audit raises, but does not settle, the question of whether a consumer chatbot vendor that retains identifiable conversation data is a covered entity or a business associate. The 2024 OCR guidance left the determination to be made case by case. No enforcement action has been taken against a healthcare chatbot vendor under that guidance.

The most useful commentary on the regulatory condition comes from outside the agencies. Dean W. Ball's recent writing, summarized this week, frames the gap as a jurisdictional artifact rather than a policy choice: the agencies were built around hospitals, payers, and clinicians, not around consumer software that performs clinical functions without licensing as clinical software. The audit is, in this sense, a stress test on an architecture that was not designed for the product category now sitting in front of it.

The Deployment-Aftermath Loop: What 15,000 Complaints Actually Change

The dataset has been public for nine days. As of this writing, no vendor named in the audit has issued a public response. Two have quietly updated their crisis-line routing behavior; the changelog notes do not reference the audit. The deployment-aftermath loop — ship, fail, receive complaints, ship the next version — is intact. The audit does not break it. It documents it.

The downstream question is whether the document changes the inputs. Patient-side complaint data has, historically, been the slowest-moving input in healthcare regulation. The FDA's adverse-event reporting system processes reports on a multi-month backlog. The HHS complaint portal resolves a fraction of submissions. App store reviews are not, by any existing rule, an adverse-event reporting channel. The 15,000 reviews in the audit are, in regulatory terms, undischarged evidence. They sit in the queue with the rest of them.

The pattern that the audit makes legible is the one it was designed to detect: an industry that has externalized its failure cost to the population least equipped to dispute it, deploying at a scale and a confidence that no other clinical tool would be permitted to match, under a regulatory regime that was not built for the artifact in front of it. The dataset does not propose a fix. It does not need to. The fix is the same one that has applied to every other clinical tool that ships before it is safe: a regulator, with authority, with a queue it actually empties, and with the willingness to call a deployment what it is. The audit makes the call. The agencies have not picked up the phone. As the coverage this week notes, the gap between evidence and enforcement is now the story.