1 History of AI in Medicine
The idea that machines might assist physicians in diagnosis and treatment is not new. It predates the personal computer, the internet, and the smartphone. For over seven decades, researchers have pursued the dream of artificial intelligence in medicine—and for most of that time, the dream remained tantalizingly out of reach.
Understanding this history matters. When you see a headline claiming AI will “revolutionize medicine” or “replace doctors,” you’re witnessing the latest iteration of a prediction that has been made, in nearly identical terms, since the 1970s. The technologies have changed; the hype cycle has not. By understanding what came before—what worked, what failed, and why—we can approach today’s AI tools with appropriate skepticism and excitement.
This chapter traces the arc of medical AI from its origins to the present day. We’ll meet systems that performed brilliantly in controlled tests but never touched a patient. We’ll examine billion-dollar failures that remind us technical capability alone doesn’t guarantee clinical impact. And we’ll see how the current moment, while genuinely transformative, fits into a longer story of incremental progress punctuated by periods of inflated expectations and disappointment.
1.1 We’ve Been Here Before
Clinical Context: In 2023, when GPT-4 passed the United States Medical Licensing Examination, headlines proclaimed a new era of AI in medicine. But this wasn’t the first time a machine had outperformed medical students on standardized tests—or the first time such achievements had been heralded as transformative.
The dream of automated medical reasoning is older than modern computing itself. In 1959, just three years after the term “artificial intelligence” was coined, researchers were already speculating about machines that could diagnose disease. The appeal was obvious: medicine involves pattern recognition, probability assessment, and the synthesis of vast amounts of information—exactly the tasks where computers should excel.
Yet decade after decade, the breakthrough remained perpetually five to ten years away. Expert systems in the 1980s could match specialists in narrow domains but couldn’t handle real-world complexity. Neural networks showed promise in the 1990s before entering a long winter. Each wave brought genuine advances, but also inflated expectations that inevitably deflated.
What’s different now? Perhaps nothing—we may be in another hype cycle that will end in disappointment. Or perhaps everything—deep learning and large language models may finally provide the general-purpose intelligence that earlier approaches lacked. The honest answer is that we don’t yet know. What we do know is that understanding the previous attempts helps us ask better questions about the current one.
The history of medical AI teaches several recurring lessons:
Technical success doesn’t guarantee clinical deployment. Systems that perform brilliantly in research settings often fail to survive contact with real clinical workflows.
Integration matters more than accuracy. A slightly less accurate system that fits seamlessly into existing workflows will beat a more accurate system that disrupts them.
The human factors haven’t changed. Questions of trust, liability, and physician acceptance that challenged expert systems in the 1980s challenge today’s AI deployments.
Hype cycles are real. Each era has seen predictions that AI would replace doctors. The technology has changed; the prediction remains unfulfilled.
With these themes in mind, let’s trace the journey from the earliest dreams to today’s reality.
1.2 The Dawn of Medical AI (1950s-1960s)
Clinical Context: In an era when a “computer” often meant a room full of vacuum tubes, a few visionaries began asking whether machines might someday reason about disease.
The field of artificial intelligence was formally born at the Dartmouth Summer Research Project in 1956, where John McCarthy, Marvin Minsky, and colleagues gathered to explore whether “every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.” Medicine, with its combination of pattern recognition, probabilistic reasoning, and life-or-death stakes, immediately attracted attention.
1.2.1 Early Pioneers
Among the first to apply computational methods to medical diagnosis was Homer Warner at the University of Utah. In the late 1950s and early 1960s, Warner developed a system for diagnosing congenital heart disease based on Bayesian probability. Given a set of symptoms and test results, the system calculated the probability of various cardiac malformations. It worked—sometimes outperforming physicians in controlled comparisons.
Warner’s approach was mathematically principled: assign prior probabilities to diseases, update based on evidence using Bayes’ theorem, and report the posterior probabilities. This framework would influence medical decision support for decades. But it also revealed the first of many challenges: where do the probabilities come from? Warner laboriously extracted them from case records, a process that didn’t scale.
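To make Warner's approach concrete, here is a minimal sketch of Bayesian updating for a single finding. The disease categories, prior probabilities, and likelihoods below are invented purely for illustration; Warner's actual tables were laboriously derived from case records.

```python
# Minimal sketch of Bayesian diagnosis in the spirit of Warner's system.
# All priors and likelihoods are hypothetical, for illustration only.

priors = {
    "ventricular_septal_defect": 0.30,
    "atrial_septal_defect": 0.25,
    "tetralogy_of_fallot": 0.10,
    "no_malformation": 0.35,
}

# P(finding | disease) for one finding, e.g. a loud systolic murmur.
likelihood_systolic_murmur = {
    "ventricular_septal_defect": 0.90,
    "atrial_septal_defect": 0.60,
    "tetralogy_of_fallot": 0.85,
    "no_malformation": 0.05,
}

def posterior(priors, likelihoods):
    """Apply Bayes' theorem: P(disease | finding) is proportional to
    P(finding | disease) * P(disease), then normalize."""
    unnormalized = {d: priors[d] * likelihoods[d] for d in priors}
    total = sum(unnormalized.values())
    return {d: p / total for d, p in unnormalized.items()}

print(posterior(priors, likelihood_systolic_murmur))
```

Each additional finding is just another multiplication and renormalization. The mathematics was the easy part; as Warner discovered, obtaining trustworthy probabilities was not.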
Across the Atlantic, F.T. de Dombal at the University of Leeds tackled acute abdominal pain using similar Bayesian methods. His system, developed in the late 1960s, could distinguish between appendicitis, small bowel obstruction, perforated ulcer, and other surgical emergencies. In prospective trials, it achieved diagnostic accuracy exceeding that of senior clinicians.
De Dombal’s work demonstrated something important: computers could match or exceed physician performance on well-defined diagnostic tasks when given appropriate data. But it also demonstrated the limits of this approach. The system required structured input—specific symptoms and findings in specific formats. It couldn’t read a clinical note or conduct a patient interview. It was a calculator, not a colleague.
1.2.2 The Knowledge Problem
The fundamental challenge of this era was what would later be called the “knowledge acquisition bottleneck.” Creating a diagnostic system required explicitly encoding medical knowledge as probabilities, rules, or logical relationships. This knowledge lived in textbooks, journal articles, and—most importantly—the heads of experienced physicians. Extracting it was slow, expensive, and incomplete.
Moreover, medical knowledge wasn’t static. New diseases emerged, treatments evolved, and understanding of pathophysiology deepened. Any system based on encoded knowledge would require continuous updating—a maintenance burden that proved prohibitive for most early systems.
The pioneers of this era proved that computational diagnosis was possible in principle. They did not solve the practical challenges of building systems that could work in real clinical environments. Those challenges would occupy the next generation of researchers.
1.3 The Expert Systems Era (1970s-1980s)
Clinical Context: By the 1970s, AI researchers believed they had found the path to machine intelligence: capture human expertise in explicit rules. In medicine, this approach produced systems of remarkable sophistication—and instructive failure.
The 1970s and 1980s were the golden age of “expert systems”—programs that encoded human expertise as collections of if-then rules. The approach seemed ideally suited to medicine, where experienced clinicians appeared to follow recognizable patterns of reasoning: “If the patient has fever and productive cough and consolidation on chest X-ray, then consider pneumonia.”
1.3.1 MYCIN: The Canonical Medical Expert System
No discussion of medical AI history is complete without MYCIN. Developed at Stanford University by Edward Shortliffe and colleagues in the mid-1970s, MYCIN was designed to recommend antibiotic therapy for patients with bacterial infections, particularly bacteremia and meningitis.
MYCIN’s architecture became the template for medical expert systems. It contained approximately 600 “production rules”—if-then statements encoding infectious disease expertise. For example:
IF: the stain of the organism is gram-positive,
    AND the morphology is coccus,
    AND the growth conformation is chains
THEN: there is suggestive evidence (0.7) that the identity of the organism is streptococcus
The numbers represented “certainty factors”—a scheme Shortliffe developed to handle the uncertainty inherent in medical reasoning. Rules could increase or decrease the certainty of conclusions, with the system tracking and combining certainties across multiple rules.
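As a simplified illustration of how certainty factors behaved, the sketch below combines two rules that both support the same conclusion. This is the standard combination formula for two positive certainty factors; the full MYCIN scheme also handled disconfirming and mixed evidence.

```python
# Simplified sketch of MYCIN-style certainty-factor combination for two rules
# that both support the same conclusion (positive CFs only; illustration only).

def combine_positive_cfs(cf1: float, cf2: float) -> float:
    """Combine two supporting certainty factors: CF = CF1 + CF2 * (1 - CF1)."""
    return cf1 + cf2 * (1 - cf1)

# One rule suggests streptococcus with CF 0.7; a second, independent rule adds 0.4.
print(combine_positive_cfs(0.7, 0.4))  # 0.82 -- more certain than either rule alone
```

The appeal was that evidence accumulated gracefully: each supporting rule nudged the conclusion closer to certainty without ever exceeding 1.0.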
What made MYCIN remarkable was its performance. In a famous 1979 evaluation, MYCIN’s recommendations were compared to those of Stanford infectious disease experts on a set of meningitis cases. Outside evaluators, who didn’t know which recommendations came from the computer, rated MYCIN’s therapy as acceptable in 65% of cases—compared to 62.5% for the best human expert. The machine had matched or exceeded the specialists.
Yet MYCIN was never deployed clinically. Not once. Why?
1.3.2 Why Expert Systems Failed
MYCIN’s fate illustrates the gap between technical capability and clinical utility. Several factors prevented deployment:
Integration challenges: MYCIN was a standalone system in an era before electronic health records. Using it required manually entering patient data—laboratory values, culture results, clinical findings—into the computer. In a busy hospital, this was impractical. The system answered questions no one had time to ask it.
The closed-world assumption: MYCIN knew only what it knew. Its 600 rules covered specific infections with specific organisms. Present it with a case outside its domain—a viral infection, an unusual organism, a complex patient with multiple problems—and it had nothing to offer. Real clinical practice is full of cases that don’t fit neat categories.
Explanation and trust: MYCIN could explain its reasoning by listing the rules it had fired. But physicians found these explanations unsatisfying. Listing rules isn’t the same as providing the kind of clinical reasoning physicians were trained to expect.
The maintenance burden: Medical knowledge evolves. New antibiotics are introduced, resistance patterns change, guidelines are updated. Keeping an expert system current required continuous attention from both domain experts and knowledge engineers—a commitment that rarely survived the transition from research project to deployed system.
Liability and responsibility: Who was responsible if MYCIN’s recommendation harmed a patient? The physicians who followed it? The researchers who built it? The hospital that deployed it? These questions were never satisfactorily answered.
1.3.3 Other Expert Systems
MYCIN was the most famous, but not the only, medical expert system. INTERNIST-1, developed at the University of Pittsburgh by Harry Pople and Jack Myers, took on the vastly more ambitious task of diagnosing diseases across all of internal medicine. It contained knowledge about over 500 diseases and 3,500 manifestations. INTERNIST-1 could engage in diagnostic reasoning of impressive sophistication, considering and ruling out multiple competing hypotheses.
Like MYCIN, INTERNIST-1 performed well in evaluations. And like MYCIN, it was never widely deployed. The system required extensive manual data entry, couldn’t handle the ambiguity and incompleteness of real clinical data, and demanded more time than busy physicians could spare. A later version, QMR (Quick Medical Reference), was commercialized but never achieved widespread adoption.
DXplain, developed at Massachusetts General Hospital, took a different approach—functioning more as a clinical decision support tool than an autonomous diagnostician. Given a set of findings, it generated a ranked list of possible diagnoses. DXplain survived longer than most expert systems, remaining in use at some institutions into the 2000s, precisely because it positioned itself as an aid to human reasoning rather than a replacement for it.
1.3.4 The AI Winter
By the late 1980s, enthusiasm for expert systems had cooled. The fundamental limitations—the knowledge acquisition bottleneck, brittleness outside narrow domains, integration difficulties—proved insurmountable with the technology of the time. Funding dried up, startups failed, and the field entered what historians call the “AI winter.”
The expert systems era left important lessons:
Domain expertise isn’t enough. Even when systems perfectly captured expert knowledge, they failed to integrate into clinical practice.
Medicine is messier than rules. Real patients don’t present with textbook findings. Real data is incomplete, inconsistent, and ambiguous. Systems that required perfect input were useless in practice.
The last mile is the hardest. Technical achievement in the laboratory is only the beginning. Deployment requires solving integration, workflow, trust, and maintenance problems that are harder than the core AI challenge.
These lessons would be relearned by each subsequent generation of medical AI researchers.
1.4 The Machine Learning Revolution (1990s-2000s)
Clinical Context: As expert systems faded, a new paradigm emerged: instead of encoding human knowledge explicitly, let machines learn patterns directly from data. The electronic health record would finally provide data at scale—though not without new challenges.
The limitations of expert systems pointed toward a different approach. Rather than having humans articulate rules, what if machines could discover patterns themselves? This insight powered the machine learning revolution.
1.4.1 The Shift to Learning from Data
Machine learning inverted the expert systems approach. Instead of “knowledge engineering”—laboriously extracting rules from experts—machine learning algorithms found patterns in data automatically. Provide enough examples of patients with and without a disease, and the algorithm would learn to distinguish them.
The 1990s and 2000s saw an explosion of machine learning methods: support vector machines, random forests, boosted decision trees, and renewed interest in neural networks. Each found applications in medicine, from predicting hospital readmission to identifying high-risk patients to analyzing medical images.
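To illustrate the shift in paradigm, the sketch below trains a classifier on synthetic, labeled examples rather than hand-written rules. The features and outcome are hypothetical, and the model is not a validated clinical tool; the point is only that the pattern is learned from data.

```python
# Minimal sketch of the learning-from-data paradigm (synthetic data only;
# not a validated clinical model).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical features (e.g. age, systolic BP, HbA1c) and a hypothetical
# binary outcome (e.g. readmission within 30 days).
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No knowledge engineering: the model discovers the pattern from labeled examples.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```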
But machine learning required something expert systems didn’t: large amounts of labeled training data. And here, medicine faced a fundamental challenge.
1.4.2 The EHR Revolution
The widespread adoption of electronic health records (EHRs) beginning in the 2000s, accelerated by the HITECH Act of 2009, transformed the data landscape. Suddenly, hospitals had digital records of millions of patient encounters—diagnoses, medications, laboratory values, clinical notes.
This data enabled machine learning at scale. Researchers could train models on hundreds of thousands of patients rather than the dozens or hundreds that characterized earlier studies. The field of “clinical informatics” emerged to extract value from this digital treasure trove.
Yet EHR data proved messier than anticipated. It was collected for billing and documentation, not research. Diagnoses reflected coding practices as much as patient reality. Notes contained crucial information but in unstructured free text that was difficult for machines to interpret. Missing data was the rule, not the exception.
1.4.3 Clinical Prediction Models
Some of the most successful medical algorithms from this era weren’t called “AI” at all. The Framingham Risk Score for cardiovascular disease, the APACHE scores for ICU mortality, the Wells criteria for pulmonary embolism—these prediction models, often based on logistic regression, became standard clinical tools.
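As a worked illustration of this style of model, the sketch below turns a handful of risk factors into a probability via the logistic function. The intercept and coefficients are invented for illustration; they are not the published Framingham, APACHE, or Wells parameters.

```python
# Minimal sketch of a logistic-regression-style risk score.
# Coefficients are hypothetical and NOT the actual Framingham/APACHE/Wells values.
import math

def ten_year_risk(age: int, smoker: bool, diabetic: bool, systolic_bp: float) -> float:
    """Return a hypothetical event probability via the logistic function."""
    log_odds = (
        -7.0                       # intercept (hypothetical)
        + 0.06 * age               # each year of age raises the log-odds
        + 0.70 * smoker            # binary risk factors add fixed increments
        + 0.55 * diabetic
        + 0.02 * (systolic_bp - 120)
    )
    return 1 / (1 + math.exp(-log_odds))

# Example: a 60-year-old smoker with diabetes and a systolic BP of 150 mmHg.
print(f"{ten_year_risk(60, True, True, 150.0):.0%}")
```

The transparency of this form, in which every input and its weight is visible, is a large part of why such scores earned clinical trust.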
What distinguished these successful tools?
Clear clinical utility: Each addressed a decision clinicians actually faced. “Should I anticoagulate this patient?” is more actionable than “What is this patient’s risk score?”
Transparency: Physicians could understand the inputs and their weights. A patient with diabetes, hypertension, and smoking history had higher cardiovascular risk—this made clinical sense.
Integration into workflow: These scores could be calculated quickly, sometimes on paper, and fit naturally into clinical decision-making.
Validation across populations: The successful scores were tested extensively across different hospitals and patient populations, building confidence in their generalizability.
Most machine learning research from this era didn’t achieve this kind of adoption. Models were developed, published, and forgotten. The “last mile” problem persisted.
1.4.4 Neural Networks: The First Wave
Neural networks—computing systems loosely inspired by biological neurons—had been around since the 1950s. The 1980s saw renewed interest with the development of the backpropagation algorithm for training multi-layer networks. By the 1990s, neural networks were being applied to medical problems, particularly electrocardiogram (ECG) interpretation.
The results were promising. Neural networks could learn to recognize arrhythmias from ECG signals, sometimes outperforming traditional rule-based systems. A few products reached the market, including PAPNET, a neural-network-based system for rescreening Pap smears that received FDA approval in 1995, and early computer-aided detection tools for mammography.
But neural networks of this era had significant limitations. They required careful feature engineering—human experts had to decide which aspects of the data to feed the network. They were prone to overfitting, memorizing training data rather than learning generalizable patterns. And they were “black boxes” whose reasoning couldn’t be easily explained.
By the late 1990s, interest in neural networks had waned again. Newer methods such as support vector machines, and soon afterward random forests, often performed as well with less tuning. Neural networks would have to wait for more data, more computing power, and a new architecture to realize their potential.
1.5 Deep Learning Transforms Medical Imaging (2010s)
Clinical Context: In 2012, a neural network called AlexNet won the ImageNet competition by a stunning margin, recognizing objects in photographs far better than any previous algorithm. Within five years, similar networks were reading medical images at “physician level”—and the FDA began approving them.
Everything changed with deep learning. The same basic neural network idea, combined with much deeper architectures, vastly more data, and modern GPU computing, produced systems of unprecedented capability. And nowhere was this transformation more dramatic than in medical imaging.
1.5.1 The ImageNet Moment
The 2012 ImageNet Large Scale Visual Recognition Challenge marked a turning point. AlexNet, a deep convolutional neural network developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, achieved a top-5 error rate of 15.3%—more than ten percentage points better than the second-place entry using traditional computer vision methods.
The implications were immediately apparent to medical imaging researchers. If neural networks could recognize thousands of object categories in natural photographs, they could surely learn to recognize patterns in X-rays, CT scans, and pathology slides.
The key innovation enabling deep learning was end-to-end learning. Earlier neural network approaches required human experts to define relevant features—edges, textures, shapes—before feeding them to the network. Deep convolutional networks learned features automatically from raw pixels. This eliminated the feature engineering bottleneck and allowed networks to discover patterns humans might never have thought to look for.
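The sketch below shows what end-to-end learning looks like in code, using PyTorch as an example framework. The tiny architecture is illustrative only; it is not AlexNet or any published medical imaging model, but it makes the key point visible: the convolutional layers learn their own features directly from pixels.

```python
# Minimal sketch of end-to-end learning from raw pixels (illustrative only;
# not AlexNet or any specific medical imaging model).
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Convolutional layers learn their own features from raw pixels,
        # replacing hand-engineered edge/texture descriptors.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x):
        x = self.features(x)           # learned feature maps
        x = x.flatten(start_dim=1)     # flatten for the linear head
        return self.classifier(x)      # class scores, e.g. disease vs. no disease

# A 224x224 grayscale image (say, a chest X-ray) enters as a (1, 1, 224, 224) tensor.
model = TinyConvNet()
logits = model(torch.randn(1, 1, 224, 224))
```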
1.5.2 Landmark Studies
A wave of high-profile studies demonstrated deep learning’s potential in medical imaging:
Diabetic retinopathy (2016): A team at Google led by Varun Gulshan trained a deep neural network on 128,000 retinal images to detect diabetic retinopathy. The network achieved sensitivity and specificity exceeding that of ophthalmologists. This study, published in JAMA, captured widespread attention and suggested AI could address the global shortage of specialists for screening.
Dermatology (2017): Andre Esteva and colleagues at Stanford trained a network on 129,450 clinical skin images spanning 2,032 diseases. They demonstrated performance on par with dermatologists in distinguishing malignant from benign lesions. The study, published in Nature, included the memorable claim that the network performed at “dermatologist-level.”
Chest X-ray (2017): The Stanford Machine Learning Group released CheXNet, a deep network trained on 112,120 frontal chest X-rays to detect 14 different pathologies. For pneumonia detection, the network exceeded the average performance of four radiologists.
These studies followed a common pattern: train on a large dataset, achieve impressive performance on held-out test data, compare favorably to physicians. The AI had arrived—or so the headlines suggested.
1.5.3 FDA Enters the Arena
The U.S. Food and Drug Administration began adapting its regulatory framework to AI medical devices. In 2018, the FDA cleared the first autonomous AI diagnostic system—IDx-DR, which could diagnose diabetic retinopathy without physician oversight. Other clearances followed for systems detecting wrist fractures, stroke, pulmonary nodules, and more.
The regulatory pathway itself became a subject of innovation. The FDA developed new frameworks for “software as a medical device” and grappled with novel questions: How do you regulate a system that learns and changes? How do you validate performance across different populations and settings?
1.5.4 The Deployment Gap
Yet adoption lagged far behind the publications. Despite dozens of studies claiming human-level performance, most hospitals weren’t using AI to read images. Why?
Validation concerns: Performance in a curated research dataset doesn’t guarantee performance in the messy real world. Studies often used high-quality images from academic centers, while clinical practice included patients with motion artifacts, unusual anatomy, and concurrent pathology.
Generalization failures: Networks trained on data from one hospital often performed poorly on data from another. Different scanner manufacturers, imaging protocols, and patient populations all affected performance.
Workflow integration: A stand-alone AI that requires images to be manually uploaded and results to be retrieved separately won’t be used. Integration into radiology PACS (picture archiving and communication systems) was technically challenging and expensive.
The “physician-level” problem: Even when AI matched average physician performance, it often failed on the cases physicians found difficult—precisely the cases where assistance would be most valuable. And it made different kinds of errors than humans, sometimes obvious mistakes that any physician would catch.
Liability questions: If an AI misses a cancer, who is responsible? The radiologist who reviewed its output? The hospital that deployed it? The company that built it? Clear answers remained elusive.
By the end of the 2010s, medical imaging AI had proved what was possible. It had not yet proved what was practical.
1.6 The Rise and Fall of IBM Watson Health
Clinical Context: No discussion of medical AI history would be complete without Watson—a cautionary tale of hype, hubris, and the gap between demonstration and deployment.
In 2011, IBM’s Watson system defeated human champions on the quiz show Jeopardy!, demonstrating remarkable natural language processing and knowledge retrieval capabilities. IBM quickly pivoted toward healthcare, envisioning Watson as a diagnostic and treatment recommendation engine that would transform medicine.
1.6.1 The Promise
IBM’s ambitions were vast. Watson would read the medical literature—all of it—and stay current with new publications. It would analyze patient data, including clinical notes, laboratory values, and genomic information. It would recommend evidence-based treatments, personalized to each patient. It would become, in IBM’s marketing, “the best doctor in the world.”
Partnerships followed with prestigious cancer centers, including MD Anderson and Memorial Sloan Kettering. IBM announced plans for Watson to tackle cancer treatment, drug discovery, and population health management. The media amplified the hype: Watson would revolutionize medicine.
1.6.2 The Reality
The reality proved far more modest. Several high-profile projects failed:
MD Anderson Cancer Center terminated its Watson partnership in 2017 after spending $62 million without achieving its goals. An audit found the project had delivered limited clinical value and was poorly managed.
Watson for Oncology, despite deployment at cancer centers around the world, drew criticism for recommending treatments that were “unsafe and incorrect” in some cases. Internal documents revealed that the system had been trained primarily on hypothetical patients created by Memorial Sloan Kettering physicians rather than on real patient data.
Watson Health was quietly sold to a private equity firm in 2022, ending IBM’s healthcare ambitions.
1.6.3 What Went Wrong?
Watson’s failure offers lessons that remain relevant:
Natural language understanding isn’t enough. Watson could parse questions and retrieve information, but medicine requires reasoning about uncertainty, weighing competing evidence, and understanding clinical context. Pattern matching over text, however sophisticated, isn’t the same as medical judgment.
Training data matters. Systems trained on curated, academic examples may not perform well on real-world cases with incomplete information, atypical presentations, and messy data.
Deployment is hard. Even if Watson’s recommendations had been perfect, integrating them into clinical workflow posed enormous challenges. Physicians didn’t want to copy-paste patient information into a separate system and wait for recommendations.
Hype creates expectations that can’t be met. IBM’s marketing promised transformation; the technology delivered incremental assistance at best. The gap damaged trust in both Watson and medical AI more broadly.
Watson’s story is not one of technical impossibility but of premature deployment, inflated expectations, and insufficient attention to the practical challenges of clinical implementation.
1.7 The Large Language Model Era (2020s)
Clinical Context: When GPT-4 passed the USMLE with a score that would place it in the top tier of human examinees, it demonstrated a kind of medical reasoning that previous systems couldn’t approach. But passing an exam isn’t the same as practicing medicine.
The release of ChatGPT in November 2022 and GPT-4 in March 2023 marked a new chapter in medical AI. These large language models (LLMs), trained on vast amounts of text from the internet, exhibited capabilities that surprised even their creators—including substantial medical knowledge.
1.7.1 Passing Medical Exams
GPT-4’s performance on medical licensing examinations drew immediate attention. It scored above the passing threshold on the USMLE—not marginally, but with performance that would rank among successful medical students. Similar results followed on medical specialty examinations and international licensing tests.
But exam performance has always been a limited measure of clinical capability. Medical examinations test knowledge and reasoning on carefully constructed questions with unambiguous correct answers. Real patients don’t come with multiple-choice options.
1.7.2 Emergent Medical Capabilities
More impressive than exam scores was the flexible, general-purpose nature of LLM capabilities. Earlier systems could diagnose or recommend treatments; they couldn’t explain medical concepts to patients, summarize research papers, or draft clinical documentation. LLMs could do all of these—sometimes with remarkable facility.
This flexibility arose from the training process. Unlike expert systems with hand-crafted rules or earlier ML models trained on specific tasks, LLMs learned from essentially the entire internet, including medical textbooks, research papers, clinical guidelines, and patient forums. They absorbed medical knowledge as part of learning language itself.
1.7.3 The First Mass Deployments
The 2020s saw something genuinely new: mass deployment of medical AI systems. Ambient AI scribes—systems that listen to clinical encounters and generate documentation—spread rapidly through healthcare settings. Companies like Nuance (now part of Microsoft), Abridge, and Suki deployed to thousands of clinicians.
These systems worked because they solved a genuine problem (documentation burden), integrated into existing workflows, and kept physicians in control. The AI drafted notes; physicians reviewed and signed them. The human remained responsible.
Other deployments followed: AI-assisted responses to patient portal messages, clinical summarization tools, and documentation assistants. Each succeeded to the extent it respected the same principles: clear clinical value, workflow integration, human oversight.
1.7.4 What’s Different This Time?
Is the current wave of AI different from previous hype cycles? Several factors suggest it might be:
General-purpose capability. Unlike expert systems that could only do what they were explicitly programmed to do, LLMs exhibit flexible, general-purpose intelligence. They can handle novel tasks without specific training.
Natural language interface. Previous systems required structured input in specific formats. LLMs understand natural language, reducing integration barriers.
Rapid improvement. LLM capabilities have improved dramatically year over year. The gap between GPT-3 and GPT-4 was substantial; further improvements continue.
Scale of investment. The resources flowing into AI development dwarf previous eras. This sustained investment may push through barriers that stopped earlier efforts.
But skepticism remains warranted. The fundamental challenges—integration, validation, trust, liability—haven’t disappeared. Technical capability still doesn’t guarantee clinical utility. And we remain early enough in this cycle that the eventual equilibrium is far from clear.
1.8 Lessons from History
Clinical Context: After seven decades of AI in medicine, certain patterns recur so reliably that they deserve explicit acknowledgment. Recognizing these patterns can help us navigate the current moment with appropriate expectations.
1.8.1 The Hype Cycle Is Real
Every era of medical AI has been accompanied by predictions that the technology would transform medicine and threaten physician jobs. In 1970, researchers predicted computers would soon match physicians in diagnosis. In 1980, expert systems were supposed to democratize medical expertise. In 2016, Geoffrey Hinton famously suggested radiologists should stop training because AI would soon make them obsolete.
None of these predictions materialized on the expected timescales. The pattern suggests we should be skeptical of current predictions—while remaining open to the possibility that this time really is different.
1.8.2 Technical Success Doesn’t Guarantee Clinical Impact
MYCIN matched experts. Watson could process medical literature. Deep learning networks achieved “physician-level” performance. Yet clinical impact remained limited. The bottleneck was rarely the core technology—it was everything else: integration, validation, workflow, trust.
This suggests that evaluating medical AI requires looking beyond accuracy metrics. How does it fit into clinical practice? Who maintains it as medical knowledge evolves? What happens when it makes mistakes?
1.8.3 Integration Matters More Than Accuracy
The AI systems that achieved clinical adoption—clinical prediction scores, some imaging AI, ambient scribes—succeeded because they fit into existing workflows. Systems that required physicians to change how they worked, even if technically superior, generally failed.
This implies that successful medical AI must be designed around clinical workflows, not the other way around. The best AI is invisible, augmenting what physicians already do rather than demanding they do something different.
1.8.4 The Human Factors Haven’t Changed
Questions that challenged MYCIN in the 1980s remain relevant today: Who is responsible when AI makes mistakes? How much should physicians trust algorithmic recommendations? How do we prevent automation bias—the tendency to accept AI outputs uncritically?
These aren’t technical questions with technical solutions. They require new norms, policies, and professional practices. Progress on human factors has been slower than progress on algorithms.
1.8.5 Medicine Is Harder Than It Looks
Each generation of AI researchers has underestimated the complexity of clinical practice. Medicine isn’t just pattern recognition—it’s managing uncertainty, communicating with patients, navigating healthcare systems, and making decisions under constraints of time, resources, and incomplete information.
AI systems that perform well on carefully curated datasets often struggle with the messiness of real clinical data. Models that excel at diagnosis may offer little help with the harder question of what to do about it. Technical benchmarks capture only a fraction of what clinical competence requires.
1.9 Looking Ahead
Clinical Context: As you begin this book, you’re entering medicine at a unique moment. The tools available to you will be dramatically different from those your teachers used—and the skills you develop now will shape how those tools are used.
We are somewhere in the middle of a transformation whose endpoint we cannot yet see. LLMs have demonstrated capabilities that seemed impossible five years ago. Adoption of AI tools in clinical practice is accelerating. The question is no longer whether AI will play a role in medicine, but what role it will play and how we will manage the transition.
This book aims to prepare you for that future by providing foundations that won’t become obsolete as specific technologies evolve:
Technical literacy: Understanding how AI systems work—their capabilities, limitations, and failure modes—enables critical evaluation of both current tools and future developments.
Practical skills: Knowing how to work with AI systems, from prompt engineering to model evaluation, will be increasingly valuable as these tools become ubiquitous.
Critical perspective: The history we’ve reviewed suggests that hype often exceeds reality, that deployment is harder than development, and that human factors matter as much as technical performance. This perspective helps separate genuine advances from marketing.
The physicians, researchers, and health informaticists who will shape medical AI’s impact are being trained now. Some are reading this book. The opportunity—and responsibility—is to ensure that these powerful tools serve patients, respect clinicians, and improve health outcomes.
The history of AI in medicine is still being written. What comes next depends, in part, on you.