R2D2

Bedside Rounds

Clinicians are often expected to utilize clinical decision support systems. However, they receive little to no training in the actual use of such systems.

In this first of a two-part episode (R2D2 and C3PO), Bedside Rounds explores the first effective clinical decision support system, AAPHELP, and the impact it has made on clinical care. It also discusses the implications of integrating artificial intelligence into medicine. You’re invited to join the host, Dr. Adam Rodman, and his guest, Dr. Shoshana Herzig, as they discuss the development of clinical decision support.

First, listen to the podcast. After listening, ACP members can take the CME/MOC quiz for free.

CME/MOC:

Up to 1 AMA PRA Category 1 Credit™ and MOC Points
Expires September 4, 2026

Cost:

Free to Members

Format:

Podcasts and Audio Content

Product:

Bedside Rounds


Bedside Rounds is a medical podcast by Adam Rodman, MD, about fascinating stories in clinical medicine rooted in history. ACP has teamed up with Adam to offer continuing medical education related to his podcasts, available exclusively to ACP members by completing the CME/MOC quiz.

This is Adam Rodman, and you are listening to Bedside Rounds, a monthly podcast on the weird, wonderful, and intensely human stories that have shaped modern medicine, brought to you in partnership with the American College of Physicians. This episode is called R2D2, and it’s the second part on the development of clinical decision support. Last time we talked about the “acute abdomen” and a really important, well, what we might today call a “use case” for decision support – when there are limited surgeons, or God forbid the only surgeon available might be the one who needs his abdomen opened. It was actually this exact situation – the danger and drama of the acute abdomen – that ushered in the development of clinical decision support – and in many ways its lessons teach us a lot about the precipice of diagnostic artificial intelligence we find ourselves in.

And who do I turn to when I’m trying to suss out the impact of technology on the nature of diagnosis? Of course, it’s friend-of-the-show and smartest-person-I-know, Dr. Shani Herzig.

Shani Herzig (01:06):

Yeah. I am Shani Herzig. I'm a clinician investigator at Beth Israel Deaconess Medical Center. I do pharmacoepidemiology and I love studying hospitalized patients. It's an area that hasn't been well investigated up until, I don't know. Well, that's not really true. I guess the Institute of Medicine. You can cut

Adam Rodman (01:27):

All this out. I'm gonna cut it all out. Are (laugh) why are we talking today?

Shani Herzig (01:33):

I don't know. You don't know. You keep asking. You keep asking me to do these things and I keep I keep saying, oh, I don't know Adam. I don't know if I have anything

Adam Rodman (01:41):

To add. Why do you think we're talking today?

Shani Herzig (01:43):

I think we're talking today because we, anytime math comes up, you, you seem to associate me with math and, and Bayes' theorem comes up a lot. I don't know. I, so I think that we're talking today because we are moving very quickly towards computer involvement. AI, I should say, AI involvement in the practice of medicine and clinical diagnosis, and really incorporation of AI into every single facet of medicine, including medical publication, for example. Which has come up at my editorial board meetings lately. I bet it has. And so how, how this is gonna change the face of medicine. How how AI is being applied and, and kind of what my thoughts are as a, as a researcher, an epidemiologist on, on top of all

And Shani of course is right. This episode is coming at a time of profound change – and one that I’m caught up in myself. But I’ll get into that more later. Let’s start by “checking in” in the 1960s with our understanding of diagnostic thinking. Take yourself back to the episode with Gurpreet, referencing the famous 1959 paper by Ledley and Lusted. I have the paper linked in the shownotes, and if you are interested in this topic at all – and I assume you are, if you’re listening to Bedside Rounds – you need to read it. I mean, it’s a difficult read in many ways, and relies on symbolic methods of talking about Bayesian reasoning that we largely don’t use anymore. But it’s in many ways the foundational paper of both clinical reasoning and diagnostic artificial intelligence.

My TL;DR is this: Ledley and Lusted argue that current human diagnostic processes are largely intuitive and error prone, and they recommend reconceiving diagnosis on a fundamentally probabilistic – that is, Bayesian – basis. And to be clear, they do NOT argue that human minds should work like this, even if you were taught these ideas in an evidence-based diagnosis course.  They are explicit – this type of analysis is far better suited to a computer, and they describe a theoretical process on punch cards by which this could actually be carried out.

So we need to talk about Bayes’ theorem. If you’ve listened to Bedside Rounds for a while, you know it comes up quite a bit when talking about the history of diagnosis. The history of the theorem itself is actually quite fascinating – if you’re really looking for a history of math book, I highly recommend McGrayne’s “The Theory That Would Not Die.” The big point here is that Bayes’ theorem was still relatively avant garde when Ledley and Lusted were writing, despite already being hundreds of years old. The theorem basically describes how probabilities change based on knowledge of prior conditions. So basically Ledley and Lusted imagined how the probability of a disease on a differential diagnosis might change with different patient characteristics or diagnostic tests – if this could all be mapped out on a scientific basis, you need only input the right data and the computer would tell you the probability of different diseases.
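
To make that concrete, here is a minimal sketch of the kind of update Ledley and Lusted had in mind, written in the odds and likelihood-ratio form that clinicians are usually taught today. The disease, the finding, and every number are invented for illustration; nothing here is taken from their paper.

```python
# A minimal sketch of a Bayesian diagnostic update, in the spirit of
# Ledley and Lusted. All numbers are invented for illustration.

def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

# Hypothetical disease with a 10% pre-test probability, and a finding that is
# four times as common in patients with the disease as without it (LR+ = 4).
print(post_test_probability(0.10, 4.0))  # ~0.31
```

The arithmetic is trivial; the point is that every step is explicit and repeatable, which is exactly what Ledley and Lusted argued a computer could do more reliably than an intuiting clinician.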

The other famous paper that I’ve referenced previously is the work of Yerushalmy about the development of signal detection theory and the reliability of the interpretation of chest radiography for pulmonary tuberculosis, and the clinical use of sensitivity and specificity. By the 1960s, forward thinking physicians had realized what Yerushalmy had grokked a decade and a half before – that all diagnoses – and likely anything mediated by a human – were subject to inherent variability and disagreement. And by extension, if computers were going to get involved in diagnosis – and I have no idea what Yerushalmy thought of that, but as a skeptic of the cognition of doctors, he was probably quite happy about it – these were problems to be overcome.

And of course these papers were not published in a vacuum. All of this is happening in the setting of the field known as cybernetics, a transdisciplinary intellectual movement about self-regulating systems that greatly influenced, among many others, computer science, cognitive psychology, sociology – and, of course, medicine. Most of the doctors and researchers that I’ve spoken about were affiliated with and cross-pollinated with leaders of the cybernetics movement. Even the term “artificial intelligence” dates to this period, in the 1950s. In medicine, this group of physicians and researchers was organizing itself into a movement that would soon be named clinical informatics.

I’m going to focus on the work of just one person – Tim de Dombal, a surgeon and researcher at St James Hospital in Leeds, as well as a jazz pianist and amateur racecar driver. It’s unclear when De Dombal started to become interested in computers, but now having read through dozens of his papers – he wrote over 450 total in his lifetime, so that’s only a small fraction – it seems to be shortly after he obtained his FRCS in 1967. I’ll go ahead and say it – De Dombal is one of those figures who just deserves more research; he made such an impact, basically launched an entire field, and there’s almost no scholarship on him – so if any budding historians are looking for a project, he’s the ideal target!

De Dombal first became interested in computers for surgical education. In particular, he was frustrated that teaching didn’t focus on the process of diagnosis, which he felt was still largely taught on an intuitive basis. Using an Elliott 903C that he borrowed from the Leeds Electronic Computing department, he wrote a program called CAL. It technically stood for “computer assisted learning,” but given the many references he would make to HAL 9000 from 2001 later in life, it’s clearly meant to be tongue in cheek. And if you will recall, HAL turns a little bit murderous towards his human colleagues by the end of the movie, so I think it’s also clear that the editors of BMJ were not keeping up with their science fiction; 2001 had only come out the year before.

CAL walked students through the surgical diagnostic process in acute abdominal diseases, and would correct their errors as they went through each case – 18 of them in total. While medical students found it very valuable as a teaching tool, he ran further studies where he put residents and attendings through the cases. “Our results were far from encouraging, in that the clinicians behaved in a rigid inflexible manner, asking the same questions in every simulation, irrespective of their relevance to the particular case.” De Dombal was uncertain if this was merely frustration with using a computer program – or if this represented actual diagnostic deficiencies in his colleagues, which, you can imagine, greatly concerned him.

This turned out to be his gateway drug, switching his research focus from ulcerative colitis to clinical computing. He decided early on that his goal would be to use computers to build a better surgeon. Just as Yerushalmy had shown that pulmonologists and radiologists couldn’t agree on chest x-ray interpretation, De Dombal turned his attention to physical exam signs in his patients with severe ulcerative colitis. Whenever one of his patients presented in the emergency room, he would have multiple surgeons of different training levels evaluate the patient for a variety of “classic” findings. Medicine was a “semi-exact scientific discipline,” he wrote, quoting Norbert Wiener, the founder of cybernetics. He realized that he needed to find the “more exact” pieces that would allow a computer to operate. He did this by calculating simple concordance – which exam findings had the highest degree of agreement. It’s basically the same methodology we use today, except we usually calculate something called a kappa score, which subtracts out the agreement expected by chance.
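
For listeners who haven’t run one of these analyses, here is a small sketch of Cohen’s kappa, the modern version of what De Dombal was approximating with raw concordance. The two surgeons, their ratings, and the finding are all invented for illustration.

```python
# Cohen's kappa for two observers rating the same exam finding as
# present (1) or absent (0). All ratings below are invented.

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance: both say "present" or both say "absent".
    p_a_present = sum(rater_a) / n
    p_b_present = sum(rater_b) / n
    expected = p_a_present * p_b_present + (1 - p_a_present) * (1 - p_b_present)
    return (observed - expected) / (1 - expected)

# Two hypothetical surgeons assessing rebound tenderness in ten patients:
surgeon_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
surgeon_2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
print(cohens_kappa(surgeon_1, surgeon_2))  # ~0.6
```

In this made-up example the raw agreement is 80%, but once chance agreement is subtracted the kappa is only about 0.6 – exactly the kind of gap that made “classic” findings look shakier than tradition suggested.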

Unsurprising to us in the 21st century, or to any medical student who has dutifully nodded when their attending hands them their stethoscope and tells them that in the right middle lobe there’s a faint musical wheeze, the surgeons didn’t actually agree on many of the findings. The most reliable ones were the presence of tenderness, distension, and rebound tenderness – that’s when the pain is more pronounced after you remove your hand, or “rebound”. Other so-called classic findings – especially guarding and rigidity – were far less reliable. He concludes: “It therefore remains an open question whether the alarming magnitude of observer variation we have recorded in severe UC reflects specific difficulties in the examination of such patients, or whether it merely highlights a general problem of which the clinician may not be fully aware… as several authors have pointed out, the true importance of observer error is neither its presence nor its magnitude, but the effect which it has on the diagnostic process and on the physician engaged in that process. With the advent of computer-based systems for processing clinical information the degree of observer variation in the “traditional” physical examination assumes an added importance.”

And it wasn’t just examination signs; in a remarkably post-structuralist turn, even the language of medicine was in doubt.  In interviews with 19 surgeons, there was no agreement on the definition of an acute abdomen, and in a literature review he found 20 unique definitions of dyspepsia.

De Dombal had always been a perfectionist by nature, which was necessary for the task ahead. He realized that he could not rely on “tradition” or even experience when it came to diagnosis – he would actually have to perform a detailed study of patient characteristics and figure out what was ACTUALLY predictive. While he was interested in diagnosis as a whole, he decided that he would limit himself only to acute abdominal pain. Why? Well, first of all, he was a surgeon, so this was something with which he was intimately familiar. But acute abdominal pain also had a very good “gold standard”. Remember all the way back to Yerushalmy in the episode A Vicious Circle – most diseases, such as pulmonary tuberculosis, do not have an ideal “gold standard” with which to compare different diagnostic tests; this introduces a certain unmeasured epistemic uncertainty into our diagnostic tests, and makes them look like they perform better than they actually do. Acute abdominal pain avoided some of those problems, since cases would undergo surgery and then have an actual pathological diagnosis – about as good as you could hope for from a gold standard. And if you didn’t get the gold standard? Well, it meant they got better, also a helpful category. It was also clear that for the sake of his study he would need to clearly define an acute abdomen, so his consensus definition was abdominal pain that required presentation to the hospital within seven days.

So he and his team decided to develop a “data-base” (still in quotation marks, and with a hyphen, given the newness of the term) of actual patients presenting to the General Infirmary at Leeds or to St. James’ University Hospital with acute abdominal pain, 600 in total. They decided to look at 42 clinical attributes – doing the math, that’s over 25,000 items. And this massive amount of data was initially analyzed by hand, though fortunately the team soon got a desktop computer to help.

The database allowed De Dombal to make predictions in a probabilistic fashion. For example, 16% of the patients who presented with acute abdominal pain had a final pathological diagnosis of appendicitis. But if this pain was initially in the right lower quadrant, the diagnosis was appendicitis 60% of the time.
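
That kind of estimate requires surprisingly little machinery – it is just counting rows. Here is a toy illustration with an invented, five-patient stand-in for the Leeds database; the field names and cases are hypothetical, and the real database tracked 42 attributes across 600 patients.

```python
# Estimating P(diagnosis | finding) by counting cases, using a tiny,
# invented stand-in for the Leeds "data-base". Rows are hypothetical.

cases = [
    {"pain_site": "right lower quadrant", "diagnosis": "appendicitis"},
    {"pain_site": "right lower quadrant", "diagnosis": "appendicitis"},
    {"pain_site": "right lower quadrant", "diagnosis": "nonspecific pain"},
    {"pain_site": "epigastric",           "diagnosis": "perforated ulcer"},
    {"pain_site": "generalized",          "diagnosis": "nonspecific pain"},
]

def p_diagnosis_given_site(diagnosis: str, site: str) -> float:
    with_site = [c for c in cases if c["pain_site"] == site]
    matching = [c for c in with_site if c["diagnosis"] == diagnosis]
    return len(matching) / len(with_site)

print(p_diagnosis_given_site("appendicitis", "right lower quadrant"))  # ~0.67
```

The hard part, as De Dombal found, was not the arithmetic but collecting clean, consistently defined clinical data in the first place.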

The value of such a database seems obvious – after all, it is basically Ledley and Lusted’s vision of a diagnostic machine, based on real patient data (and capable of being easily updated with future patients). But like so many of this generation of informaticists, De Dombal was a little bit burned by the over-enthusiastic predictions of computerized doctors a generation before, and he tempered his enthusiasm as much as possible: “It should be emphasized that the chief role in this sphere belongs overwhelmingly to the clinician, and many practical difficulties must still be overcome before even the most cautious introduction of such probability-based systems into routine clinical practice.”

He clearly wasn’t too discouraged, because the next thing he did was actually BUILD such a computer program. He decided to ditch the “murderous space artificial intelligence” theme and go for something a little less ominous, which I think is very unfortunate – AAPHELP. In most scholarship, it’s called the Leeds Abdominal Pain System, and I’ll use both interchangeably. The program was written in Fortran on an English Electric KDF9 computer – a 4.7 ton machine located in the electronic computing department, just 800 meters from the surgery department; the computer could be accessed via a teletype terminal in the surgery department itself.

Let’s talk about what AAPHELP actually looked like from the perspective of a user – which would be the surgeon stationed in the A+E, often a trainee. Many of my younger listeners may be shocked to hear that emergency medicine is a relatively new field – the first program was only started in 1971, and large emergency rooms were usually staffed by a combination of a surgical resident and a nonsurgical (often medicine) resident. So imagine that a patient comes in with acute, severe abdominal pain – either by an ambulance, or walking in. The surgeon on service would come and evaluate the patient in their usual manner. Immediately after their interview with the patient, they would complete a standardized form (and later in the study, input the information directly into the teletype machine). The form, validated through De Dombal’s prior studies and the database, had items such as the age, sex, duration of symptoms, quality of pain, relieving factors, and location of the pain. AAPHELP would then run a Bayesian analysis – depending on other computing demands, a single patient took between 30 seconds and 15 minutes. The computer would then print out a list of possible diagnoses with their probabilities, along with recommended follow-up tests. For example, in a 47 year-old woman with 12-24 hours of intermittent generalized abdominal pain, decreased appetite, jaundice, a history of previous pain, and no guarding or rebound, the computer would print out a 93.2% chance of pancreatitis, a 3.1% chance of an SBO, a 2.7% chance of a perforated duodenal ulcer, and a 0% chance of the other diagnoses. Per the study protocol, the surgeon would first determine their treatment plan before looking at what the computer recommended.
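
I haven’t seen AAPHELP’s Fortran source, but the calculation it is described as performing – combining the database’s conditional probabilities across the form’s attributes, assuming independence, and ranking a fixed differential – can be sketched in a few lines. Every prior, likelihood, and attribute below is an invented placeholder, not a number from the Leeds database.

```python
# A naive-Bayes-style sketch of how an AAPHELP-like system could rank a
# fixed differential from per-attribute conditional probabilities.
# All numbers and attributes are invented placeholders.

PRIORS = {"appendicitis": 0.16, "cholecystitis": 0.05, "nonspecific pain": 0.45}

# P(finding | diagnosis), one small table per attribute on the form.
LIKELIHOODS = {
    "pain_site=right lower quadrant": {"appendicitis": 0.70, "cholecystitis": 0.05, "nonspecific pain": 0.20},
    "rebound=present":                {"appendicitis": 0.60, "cholecystitis": 0.30, "nonspecific pain": 0.10},
}

def rank_differential(findings: list[str]) -> dict[str, float]:
    # Multiply each diagnosis's prior by the likelihood of every recorded
    # finding (independence assumption), then normalize over the differential.
    scores = {}
    for diagnosis, prior in PRIORS.items():
        score = prior
        for finding in findings:
            score *= LIKELIHOODS[finding][diagnosis]
        scores[diagnosis] = score
    total = sum(scores.values())
    return {diagnosis: score / total for diagnosis, score in scores.items()}

print(rank_differential(["pain_site=right lower quadrant", "rebound=present"]))
# In this made-up example, appendicitis comes out on top by a wide margin.
```

The output has the same shape as the teletype printout described above: a ranked list of probabilities over a fixed, pre-specified differential.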

De Dombal piloted AAPHELP for 18 months, from January 1971 through July 31, 1972. In total, 552 patients were enrolled. And the results were, for lack of a better word, staggering. First, let’s just talk about accuracy. The diagnostic accuracy of the computer was 91.8%. Senior surgeons, by comparison, were only 79.6%; the residents, 77%; and the interns, 72%.

And accuracy brought tangible benefits: the patients truly benefited from the presence of the computer. De Dombal tracked outcomes from prior to the intervention through the study period – not randomized, of course, but still a reasonable test, since the makeup of the surgeons was effectively the same. One outcome was the timeliness of appendectomies in patients who had appendicitis – looking at what percentage had abscess or perforation when they finally received a laparotomy. In 1969-1970, 40% of appendectomies met those criteria. But during the study period, the rate fell to 4-5%. There was a similar finding with so-called negative laparotomy – surgeries performed on people who did not need them. Prior to the trial, the negative laparotomy rate was 25%; during the trial it fell to 6-7%. So that’s an increase in people who needed surgery getting it earlier, and a decrease in people who didn’t need surgery getting it at all.

AAPHELP had been a success – but why? And what lessons did this offer for future attempts at building a diagnostic machine, something that might demonstrate “artificial intelligence”?

De Dombal proffered two theories:

Theory #1 – human intuition was really bad, and AAPHELP was just better at making predictions.

Theory #2 – human intuition is actually pretty good, but in an emergency situation, there were just too many factors to weigh; after all, De Dombal’s database had 42 different characteristics.

The consummate scientist, De Dombal devised a clever experiment to see which of these two explanations might be right.

His team created booklets with pre-prepared tables with a variety of factors such as age, pain location and duration, etc, and provided them to six residents and six attendings. They were instructed to fill out how much a given factor increased the chance of appendicitis. Remember, there were 42 characteristics, so this was in a very real fashion a data dump – De Dombal had no doubt that physicians would be overwhelmed.

After these booklets were completed, De Dombal and co. prepared a list of the most discordant physician opinions when compared with his actual clinical database. The surgeons were then shown a list of estimates of 5 data points from their booklets, and 5 from the database. Despite being blinded, most still chose their original estimates. They were then unblinded, and given an opportunity to stick with their original estimate, the database estimate, or a number in between. The majority still chose their original estimate (and less than 5% of the time went with the epidemiological estimate that the computer used).

De Dombal then ran AAPHELP again on the 552 patients using these physician-derived probabilities rather than epidemiological data. Why? He was testing a very active area of debate – and something that is STILL taught today. Even today, teachers in evidence-based diagnosis classes will ask learners for their “gestalt” of the pre-test probability of a disease. If that approach ACTUALLY worked, there would be absolutely no need for exhaustive and expensive epidemiological studies! There might even be less resistance to using a computer – after all, the machine would truly be assisting the physician, not merely replacing their cognition.

You can imagine the results. This did not work out. Not even close. Using the surgical gold standards, epidemiological estimates were far more reliable than clinician gestalt. Ironically, the more common the diagnosis, the MORE pronounced was the difference – which at least would be the opposite of my intuition.

Back to De Dombal’s two theories to explain why the computer operated better than people. Theory #1 was in fact correct. Physicians are terrible at pre-test probability. “The implications for computer-aided diagnosis are clear. The computer should use real-life data from large-scale surveys and not merely estimates from clinicians.”

There was another possible explanation that De Dombal wanted to address. Could the computer actually MAKE the doctors better somehow? Perhaps by causing them to pay more attention? Or as a worthy opponent that the human had to best, like some sort of bizarre surgeon John Henry operating against a two ton computer instead of a steam drill?

The evidence, unfortunately, suggested that the mere presence of a computer wasn’t what was driving the differences. This is often called the Hawthorne effect, in which people change their behavior merely BECAUSE they know they are being observed. It was first described in the 1920s in studies of lighting on assembly floors – both turning on AND turning off extra lighting appeared to have similar effects. And unfortunately, Hawthorne effects appear to be a problem in all observational studies on people, though they can be minimized with careful study design (and to be clear – there are many things that you probably believe strongly that are mostly Hawthorne effect).

First of all, they didn’t get the computer’s recommendation until after they had already made their initial assessment. And De Dombal continued to collect data AFTER the trial – that is, after the terminal was removed. Error rates, such as appendicitis complications or negative laparotomies, began their predictable rise.

The improvement in performance must have come from the algorithm, from AAPHELP itself. And the algorithm was hardly an “intelligence.” If anything, it constrained diagnostic thought by limiting evaluations to things that were easily measured with good reliability. It was capable of explaining its reasoning; its conclusions were testable, and fed back into its initial assumptions. That is – it worked because it was a computer, as contrasted with messy human cognition.

With much credit to De Dombal, he realized that the dream of a generalized diagnostic machine was far off. “It has long been an ambition of those working in the field to feed in clinical data and allow the computer to select from its files the most appropriate diagnosis from the whole spectrum of recognized clinical ailments. Unfortunately, this is not currently possible.”

Problem one was that AAPHELP started to break down when it moved outside a limited selection of abdominal conditions – with burst ovarian cysts, for example. And when it was broadened to more general causes of abdominal pain (many of which, like dyspepsia, didn’t have a surgical gold standard), it performed with far poorer reliability than human physicians.

De Dombal simultaneously understood the gravity of what he had done – and the limitations. AAPHELP was not about to put physicians out of a job. If anything, their role was even more important.

“The system is quite incapable of reliable operation unless a clinician first elicits reliable data from the patient – a curiously “old-fashioned” re-emphasis on the traditional values of accurate history-taking and careful clinical examination…. No one speaks of a stethoscope making a diagnosis; and it seems to us meaningless to speak of the computer in terms which imply that this sort of machine system usurps the clinician’s traditional role, even if, when the computer indicates its probabilities, we speak of the most likely complaint as being the “computer’s diagnosis”.”

This is a good time for Shani to step in. In De Dombal’s time it was pretty clear that human cognition was necessary to collect data, even if some of it could later be sorted by a computer. But clinical computing has grown in leaps and bounds in the time since, and can make inferences that no human ever could.

Shani Herzig (11:10):

Feel any better, (laugh), this was back in 2007 when I was taking a class at the School of Public Health and I was in a pharmacoepidemiology class. And, you know, part of risk prediction and research is the standard thing teaching has been, you know, you wanna choose your predictors, your hypothesized predictors, a priori, based on your clinical knowledge. Like get out your doctor hat and say, these are the things that I think predict something. But around the time that pharmacoepidemiology was really blooming and propensity scores were coming out, we realized that you don't actually have to be parsimonious with the predictors that you choose, and that's because propensity scoring gives you much more power if you have, have a, a common exposure. But not only that, but it's better if you're not parsimonious. Cuz what we found is if you tell a computer mine through this data and figure out which of these variables are gonna be the most predictive of a certain exposure, the computer will identify things that you never would've thought of. Like in one of the times that this was done in the development of, you know, these high-dimensional propensity scoring approaches where they select from like all data points that are available within a medical record system, things were coming out that we never would've thought of as humans to use as predictors.

Adam Rodman (12:28):

Can you give an example?

Shani Herzig (12:29):

I I will, I can and I'm going to. Oh, great. So the example was, you know and if I remember correctly, this was a study where we were trying to predict use of statins on the part of a physician, and it turned out that it wasn't, you know, high. It wasn't just high cholesterol, high blood pressure, blah, blah, blah. It was things like looking at how many visits has the patient had in the prior year, how many healthcare encounters mm-hmm. Have they had mm-hmm. <Affirmative>. So actual, actual measures of engagement with the healthcare system were were just as good if not better than these predictors that any physician would've given you off the top of your head. And like, it's, it's making use of all of the information in a way that humans would not have even thought of, but that turned out to like massively increase your predictive ability.
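
To make Shani’s point tangible, here is a toy version of the idea: simulate some records, then let a crude data-driven screen rank candidate predictors of the exposure instead of hand-picking them. This is not the published high-dimensional propensity score algorithm – just an invented illustration of letting the data nominate features like healthcare-engagement measures.

```python
# A toy, simulated illustration of data-driven predictor screening for an
# exposure (statin use). Not the published hdPS algorithm; everything here
# is invented for illustration.

import random

random.seed(0)
patients = []
for _ in range(5000):
    visits = random.randint(0, 20)                 # healthcare engagement
    high_cholesterol = random.random() < 0.3       # the "obvious" predictor
    noise_feature = random.random() < 0.5          # an irrelevant variable
    # Simulated truth: statin use depends on both cholesterol AND engagement.
    statin = random.random() < (0.05 + 0.02 * visits + 0.2 * high_cholesterol)
    patients.append({"visits": visits, "high_cholesterol": high_cholesterol,
                     "noise_feature": noise_feature, "statin": statin})

def exposure_rate_gap(feature: str) -> float:
    # Crude screen: difference in statin rate between patients above vs. at or
    # below the mean value of the feature.
    values = [float(p[feature]) for p in patients]
    avg = sum(values) / len(values)
    above = [p["statin"] for p in patients if float(p[feature]) > avg]
    below = [p["statin"] for p in patients if float(p[feature]) <= avg]
    return abs(sum(above) / len(above) - sum(below) / len(below))

for feature in ["visits", "high_cholesterol", "noise_feature"]:
    print(feature, round(exposure_rate_gap(feature), 3))
# "visits" screens in alongside cholesterol; the noise feature does not.
```

The interesting output is that the engagement measure (“visits”) ranks as a strong predictor even though no clinician put it on the list – which is exactly the behavior Shani describes.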

AAPHELP had such a positive impact on patients that of course it was coming back. In April 1974, AAPHELP was reintroduced at Leeds, and the accuracy rate became even higher than in the first trial, reaching 95%. AAPHELP was then rolled out at two more hospitals across the UK, then 8 more centers. In every study, diagnostic accuracy increased, though not as much as at Leeds. Studies were even done in low-income countries across the world where there was less access to surgeons. In every case, it worked better than humans – but its efficacy seemed to decrease the further it got from Leeds.

It turns out that De Dombal had discovered one of the most potent challenges in decision support – spectrum bias.

Shani Herzig (06:46):

That's probably the one that I use the most. Wells I use sometimes. But it, you know, and this, this starts getting into some of the problems with y you know, Bayes' theorem and trying to develop clinical diagnosis algorithms or, or programs is the prior probability of disease or the prevalence of disease in any given population really depends on the population in very nuanced ways. Right? So, you know, from a tr from a study, you know, Wells, you know what the average risk is for a patient of that profile who's presenting to the emergency department, for example, for, for, you know, there's now Wells for hospitalized patients. But the point is that not all of these these risk stratification tools meant to support clinician decision making can be applied right to the setting in which the physician is applying it. You know, the, the prior probabilities change vastly, depending on the patient population you're actually applying it

Adam Rodman (07:43):

To. And that would be spectrum bias if you wanted to put the, the diagnosis, the diagnostic word

Shani Herzig (07:48):

On that. Yeah. Or as De Dombal mm-hmm. <Affirmative>, as De Dombal said, I think he called it, he had a nice term for it, geographic portability or something. Which Yeah, that's exactly right.

Adam Rodman (08:01):

And it's the same idea, I mean, De Dombal noted that even like neighboring hospitals worked better, but the further that he got even in in England, it worked less good. Less well and less well.

Shani Herzig (08:11):

Yeah. Well, and you know, he, he, there are two interesting issues embedded in that. One is the idea that the, the patient presentation, patients with the same disease can present very differently depending on where they are. There are cultural differences in expression, depending

Adam Rodman (08:27):

On the referral.

Shani Herzig (08:28):

Right. Pain, you know, there, there are a million reasons that patients with the same disease can present differently depending on geography or various things. And then there's the idea that different diseases have different prevalences in different geographic areas. So you've got, so, so even if patients are presenting with exactly the same constellation of symptoms, if the prevalence of disease is really different in one pl place than another, then that, you know, that information that you gain from some diagnostic test actually doesn't change your likelihood that that patient has that disease. Pro or con, if the prevalence is super low, you know, like mm-hmm. <Affirmative>, like, so it's kind of like

Shani Herzig (09:27):

Time. That's right. That's, that's totally right. And so you'd have to be constantly changing the inputs, like unless, which is what we're gonna talk about today more, which is unless you have a system that instantaneously in the moment is constantly updating its own information, right. And, and you know, that I think is the biggest difference in what would've blown De Dombal's mind is you read his work and it's, it's operating under the paradigm of the time, which is that in order to get a computer to do these things, you have to know how humans think. And then you have to teach the computer how to think like humans think. When in reality what's happening with AI is computers are almost now understanding how we think better than we could even tell you how we think. Like they're just using so much information to come to an answer in a way that we might not have ever even thought to come to that answer. Because they're able to recognize this is an important piece of information, even if we don't recognize that it's an important piece of information.
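
To put a rough number on the portability problem Shani is describing: hold a test’s characteristics fixed and change only the local prevalence, and the post-test probability moves dramatically. The sensitivity, specificity, and prevalences below are invented purely for illustration.

```python
# How the same test characteristics behave when only the local prevalence
# changes -- the heart of "geographic portability". All numbers invented.

def post_test_probability(prevalence: float, sensitivity: float, specificity: float) -> float:
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

SENSITIVITY, SPECIFICITY = 0.85, 0.80
for hospital, prevalence in [("Leeds-like center", 0.25), ("distant center", 0.05)]:
    print(hospital, round(post_test_probability(prevalence, SENSITIVITY, SPECIFICITY), 2))
# Leeds-like center 0.59
# distant center 0.18
```

Same “test,” same findings, very different answer – which is one reason a system tuned on one hospital’s case mix loses accuracy the further it travels.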

So let’s go all the way back to episode one and explain why I spent an episode talking about autoappendectomies and submarines. Because AAPHELP would actually become an essential part of the US Navy’s defense systems. In 1978, physicians were still not stationed on submarines. With reference to the Seadragon, the US Naval Submarine Medical Research Laboratory conducted a study in 1978 to see if they could successfully triage abdominal pain. There was a Cold War rationale to these decisions. Sure, evacuating a sailor was expensive and potentially dangerous. But on a “fleet ballistic missile” submarine – the submarines capable of launching nuclear weapons against the Soviets – evacuating a seaman might threaten nuclear deterrence. A prospective study was launched at the Balboa Naval Hospital in San Diego. The goals of this study were slightly different than at Leeds – to successfully identify “nonspecific abdominal pain” – that is, a situation that would not need an evacuation. The study showed that corpsmen using AAPHELP performed with a similar (though not quite as high) diagnostic accuracy to emergency room physicians – and, for the purposes of the Navy, with a low rate of patients inappropriately evacuated or inappropriately kept aboard – a false negative rate of only 3-5%, which was in line with the original study. The reason for the decreased overall diagnostic accuracy was thought to be the fact that the data was collected by corpsmen, and not physicians – yet another example of spectrum bias.

AAPHELP was launched on submarines across the US Fleet – I found an article going over a training video for sailors from 1980, and a reference to a corpsman handbook from the 1980s. If anyone has any memories of using AAPHELP, or a sense of when its use was stopped, please reach out to me – I’m very curious!

What did AAPHELP accomplish? Was it, in fact, an artificial intelligence? De Dombal at least thought no. Remember how I talked about Keeve Brodman and the MDS? At that point – you know, the 1950s and 60s – informaticians and cyberneticists essentially thought that diagnosis was made by pattern fitting. Things had started to shift by the 1970s, and cognitive psychology had started to shed a lot more light on the nature of physician thought. Were doctors algorithmic, following a pre-determined set of rules to get to diagnoses? Or were they heuristic, following the hypothetico-deductive model where individual hypotheses are sequentially made and then tested? Or were they outcomes-based, fundamentally weighing the probability of diagnoses with the expected benefits and harms of treatments?

By De Dombal’s time, it was clear that all of these models were right SOME of the time, and that none of them were right all of the time. Diagnosis was a far more complicated process than doctors even a decade before had foreseen, and De Dombal was more than happy to throw them under the bus:

“Individuals in the 1950s and 1960s, waving computer-printout around like some electronic papal encyclical, enthusiastically advocating systems that promised the moon, cost the earth, and delivered nothing, did much to set back the progress of medical computing.”

De Dombal had actually returned to a far older view of diagnosis that harkened back to the 19th century, with its focus on the varying cognitive strategies that went into data collection and interpretation. This is the topic that I’ve been giving grand rounds on this year – The Two-Faced God: The Facts of Disease, the Nature of Clinical Reasoning, and the Cognitive Conflict that has Defined Modern Medicine. I will eventually turn it into a podcast, but if you’ve listened to The History and The Database, both with Gurpreet Dhaliwal, you get the idea.

Da Costa presents a three-step view of diagnosis as a basic outline: one is collecting the facts of the disease; two is recognizing signs; and three is using reasoning faculties to fit this into a diagnosis. Of course, these three simple steps are far more complicated than they seem on the surface, but it’s still the basic framing of the process.

In recognizing how badly the previous generation had messed up diagnostic AI, De Dombal returns to the classics and gives a version of diagnosis that is basically the same as our current understanding: step 1 is the acquisition of “data” from the patient; step 2 is the analysis of data, which includes questioning its validity, identifying the problem, and identifying an etiologic agent; step 3 is the making of a therapeutic decision, what we today call management reasoning.

He, of course, mapped out everything in a logic tree. But in De Dombal’s research, there was one thing that was clear – physicians performed poorly at every step of these diagnostic processes. “They ask a large number of irrelevant questions; they fail to ask questions that are relevant; they fail to record the data in a way that is easy to follow; they ignore obvious clues in the information available, or they obtain masses of relatively useless biochemical data, utilizing less than 5 percent of it.”

His complaints have been major themes throughout this series on diagnostic reasoning going back to the 1850s – and I love this quote in particular because it’s so similar to 21st century complaints about the iPatient and our overreliance on diagnostic tests rather than spending time with patients. Sorry everybody – humans have always been humans, and there was no golden age in medicine.

To Shani, this tension dating back almost two centuries – doctors being essential to the “facts of disease” but also being kind of terrible at it – is essential to understanding current developments in artificial intelligence. Or more broadly, ALL developments in artificial intelligence.

Shani Herzig (13:51):

That's totally right. You know, the other thing that jumped out at me that I think, you know, in De Dombal's work, he, he was talking about that that, you know, the steps to diagnosis, there's information acquisition. Yes. And then there's information interpretation and then there's application of that interpretation to the care of the patient. And he very quickly dismisses the computers as ever having a role in the information acquisition component. But we're already seeing a million examples of that. Right. So, you know, starting from, you know, years back with the emergence of telemedicine and the fact that you can have a stethoscope that's on a patient's chest and a physician across the country can be, you know, listening to that. So that's like a very primitive example of it. But recently, this is like the coolest thing that's happened to me lately. So I wear contact lenses and it's so annoying having to get these contact lens exams for the purposes of, of ordering contact lenses when I'm like a young fairly health, well young is debatable, but fairly healthy.

Adam Rodman (14:57):

I think you're young as someone gaining an age income

Shani Herzig (14:59):

(Laugh) individual without ever having had any eye problems, why do I need to go in, pay $150, have a contact lens appointment? And so a company that I won't use the name of is now doing contact lens appointments through your phone. Yeah. And you set your phone up at a certain distance from yourself and provided you're low risk, which they have ways to calculate based on prior history, et cetera, et cetera. You can use your phone and it snapshots your eye and it's, it's, and there's not a doctor on the other end assessing it. Your own phone then assesses whether your eyes look okay, whether they see any gross abnormalities, and you use the, the phone itself. Taking EKGs from your Apple Watch. You know, like the number of ways that computers now actually collect data in the absence of human hands is just incredible. So I really think that computers are gonna be involved in every single aspect of those three facets of clinical diagnosis that De Dombal spoke about.

Shani Herzig (28:18):

I'll tell you the other thing that comes up into dole's work that I think is going to be a hard thing to tackle with just a computer is interpreting what a patient means when they say certain things. Uhhuh <affirmative>. Right. Uhhuh <affirmative>. So it, it is really hard sometimes to decipher what it is a patient is actually saying. And I mean, I guess if a human can do it, we are using some, we're using cues and maybe we aren't able to describe what cues those are that we're using. So by the virtue of that, maybe computers could do the same thing. But I just think that there's so much in that, in that, in kind of the way that we interpret the information that we're receiving, there's so much inherent like subjectivity or there's

Shani Herzig (29:25):

But sometimes your knowledge, your interpretation is based on your years of experience with that patient. Right? Yeah. It's cuz I know this person, I know he's an under reporter of pain or I know she's an over reporter of something. You know, like it's just,

I’m not so certain De Dombal would be surprised about this moment. He was pretty clear eyed about the future – computers were here – and they were only going to get better. “There is very little point in talking about when computers and information science will arrive in relation to surgical decision making. Both are already here, for good or ill.”

Of course, he had recognized the failures of previous generations, which tempered his enthusiasm. There were fundamental epistemological problems – especially the problem of syntax and the variability and disagreement in clinical data – that would limit the utility of diagnostic machines, at least for the foreseeable future. Therefore, he introduced a new term into medicine – decision support systems, which we still use today.

“This is not a competition between computer and doctor. Where competitive systems have been tried, they have failed. A well-designed diagnostic support system should do for the doctor what a well-designed golf course does for the golfer. It should flatter his strengths, reward his good efforts, and – instead of harshly penalizing his errors – so motivate him that he himself seeks to improve his own performance. In short, it should play like Augusta, rather than Pine Valley or Oakmont.” I will be honest, I don’t understand any of those references, but I gather it’s better to play at Augusta.

I would likely need to read De Dombal’s correspondence to truly know this, but I think he was reacting against the previous generation and their over-optimistic and simplistic predictions about the future of AI. And at this point, the collection and collation of data might be what human doctors best bring to the diagnostic process.

Shani Herzig (16:47):

Yes. Yeah, he did. And, and it is scary to think about and you know, the idea that are are, is there ever going to be a time where we are superfluous? And I don't know the answer to that. It, I think it's, it's

Adam Rodman (17:03):

Possible. Can I this is what I would say. I would say that until three months ago, I don't think there was a point entertaining it because of the, the nature of the hypothetical deductive process. Right. But now we need to actually test it. Right? Yeah. I think, I don't know the answer, but it's something that we need to investigate.

Shani Herzig (17:20):

Yeah, I totally agree. I I could see that happening and I agree with you. My thinking on that has evolved. Like, you know, five years ago I would've said, no, there will never be a time, it's always going to be computers and AI augmenting our human decision making, clinical decision making. But, but I could now, it doesn't seem crazy to me for it not to be augmentation. At some point

Shani Herzig (18:50):

Yeah. Yeah. And you, you have to, and that type of model still relies on physicians recognizing what the inputs need to be. Whereas the beauty of AI is it will pick up on

Adam Rodman (19:06):

Things that, that

Shani Herzig (19:06):

Are more predictive that we didn't know. And that's the other really interesting thing. And one of the, one of the things that makes people scared about using AI for research is that it'll come up with things that, that turn out to be predictors or things that fly in the face of what we know. Ice cream. And we don't know. You saw that? No.

Adam Rodman (19:25):

Read the Atlantic headline. Now it's about ice cream being protective against diabetes.

Shani Herzig (19:31):

It's an amazing article. But, but you don't know why that is. So like, you know, you don't know if that's just colinear with something else, if that's a proxy for something else, or whether, whether it actually is Right. Because you don't know. Like, I mean,

Adam Rodman (19:46):

Isn't that true about humans too though? It

Shani Herzig (19:48):

Is. It is. It is. But at least with humans you can at least, you can kind of like unpack things a bit more. A lot of times with AI you can't really unpack how they got to that.

I talk a lot about metaphor on Bedside Rounds. I mean, of course I do. We humans love to explain things with stories. I recently gave a talk on thanatology, death systems, and the medicalization of palliative care where I forced everyone to watch 5 minutes of one of the best Star Trek: TNG episodes of all time, Darmok, where Picard learns through struggle to communicate with a species who can ONLY communicate through metaphor. The subtext, of course, is that humans are not too different. And, I’ll just say, if you’ve ever thought, Star Trek is just overly dramatized people in pajamas and cheesy makeup, well, that’s true, but if you just watch this one episode you’ll get why I love it so much and why it’s made such an impact in my life – I’ll put a link in the shownotes.

All of this is to say, De Dombal leaned deep into science fiction metaphors, just as he had when he developed his CAL.

“Most people draw fictional parallels with computers, like the Big Brother situation in 1984. The best fictional comparison from the realms of fantasy is neither Orwellian, as in 1984, nor Clarkeian, as in 2001, but R2D2 in Star Wars. R2D2’s automated system’s function was merely to provide the right information at the right time to enable decisions to be made which subsequently proved to be correct. What we are looking at now, therefore, are the descendants of R2D2 rather than of HAL or Big Brother.”

That quote is from 1978, so probably written months after Star Wars came out. De Dombal closes with a warning – one that seems remarkably prescient from the vantage point of 2023.

R2D2 was a best case scenario. Big Brother was still possible. “In this sense, 1984 will only come about if clinicians ignore the potential of current decision-making support systems, in which case, administrators will play an ever more prominent role in their use. It therefore behooves clinicians to evaluate very carefully this new development in medical thought.”

A brief history of informatics clearly shows the many ways in which De Dombal was correct, and I doubt that many physicians today have a great appreciation for all the time they spend on the computer.

I think it’s fair to say that the R2D2-like clinical decision support that cheerfully supports the human physician is his legacy, a legacy that has certainly not been without problems – ahem, the Epic sepsis score – but has changed, in a meaningful way, the way that we think about and practice medicine. But what about in the year 2023?

Shani Herzig (24:09):

His legacy, which I think is actually kind of what we're talking about as potentially being challenged now. I, I felt the biggest legacy was the idea that we can and should work together with computers to deliver better medicine. When I say that, I think that that is potentially going to be challenged. It's the, should we even be part of that equation? Part of that that I think, you know, maybe not even that long down the road, there's gonna be some question about like, you know, as we were just discussing it, we could hit a point where humans are not necessary in that. I could see that happening. And maybe not in all of medicine, but in certain parts of medicine.

Shani Herzig (30:16):

I think that as he conceived of them, I think those days are approaching their end. Well, alright. So I do, I do think that there will always be bayesian type aspects to clinical reasoning. And I'd have to think a little bit about specifically where that will continue to be so important.

Adam Rodman (30:41):

Diagnostic tests I imagine isn't that kind of gonna be the classic area?

Shani Herzig (30:45):

Yes. And so, you know, I think where those things fall apart is, you know, you can know the sensitivity and specificity of a test and as we've talked about on prior episodes that you've done, even that is actually not an inherent property of, you know, it's not a fixed property of

Adam Rodman (31:03):

This house. Can you, can you teach medical students that? Because whenever I tell them that, I get in fights and it always goes back to what their textbook

Shani Herzig (31:08):

Says. But, but even assuming that we can capture sensitivity, specificity, the estimate of the prior probability, which is the, the prevalence of disease is so dependent on a million aspects of that patient sitting in front of you. So like you, you can know the prevalence among a population. So in the US what's the prevalence of X, Y, or Z? But that doesn't tell you the prevalence of that disease in a male patient in their sixties. Yeah. With Ashkenazi Jewish background, who also happens to smoke and who, so like, you know, you, you never truly know the prevalence of disease in the, you can go narrower or narrower. Narrower. So that's where I think computers could, you know, they can do a million different permutations of what the prevalence of disease is in tinier and tinier subgroups of patient populations and really refining that prior probability. They could also tell you what the test performance characteristics are, how sensitivity and specificity

Adam Rodman (32:15):

Varies and how they work at your institution. If they can talk to each other, they can tell you how they work on a national level. You

Shani Herzig (32:21):

Know, we're always psyched when we, like I was psyched when I found out we had a hospital antibiogram, right? So I knew exactly what the prevalence of different organisms was at our own specific hospital. Like, you know, so having so computers will facilitate more accurate inputs for doing the same types of calculations that De Dombal was using in his early versions of these diagnostic reasoning tools.

Shani Herzig (41:25):

You know, did, so back to your question about what De Dombal's legacy is? I think the idea that having the computer provide their thought, their, their their thinking around something and then providing that information to the clinician and allowing them to take that information into account, that I think is fundamentally a great concept. Right? Because I do think that there need to be checks and balances in any system and you know, we get that to some extent, like on teaching services, you know, in, in hospitals, systems are set up so that there's never just one person involved in the care of a patient, right? There's pharmacists who are making sure your dosage is right. There's nurses who are making sure that, you know grossly oversimplifying roles. But I just think that you never wanna solely rely on one type of processor.

Adam Rodman (42:20):

And we've talked about this before and that's a problem with diagnosis as it is, there's usually no one to check us. Right. Or challenge

Shani Herzig (42:26):

Us. And so I, I think if I have to, if I had to answer your question, come to come up with an answer to your question about what his legacy is, I think it's, it's just the idea that, that humans and computers could work together to achieve a better diagnostic process and that computers can actually help to make humans better at what they do. So I really loved that in his work. He found that physicians were able to improve with that information and that when they shut that

Adam Rodman (42:59):

They, they got worse again.

Shani Herzig (43:00):

They, when they shut that off, they got worse again. And then they turned it back on and they got better. And so I think that there are definitely, for as long as humans are involved in delivery of healthcare and, and medical care, I think there are certainly going to be ways where computers can help us do the roles that we're still doing better. Okay.

Shani is somehow not nearly as big a nerd as I am – or rather, she’s a different type of nerd. But she actually had an amazing insight about De Dombal’s droid analogy. Rather than referring to the computer as the droid, it might work better for US – the role of human doctors might shift to being C3PO, who, if you will recall from the original Star Wars, is a protocol droid, and whose primary function is to communicate with humans – or, okay, sapient beings – and translate between them and machines.

Shani Herzig (34:01):

So I think we've already alluded to this a bit, which is I think that there are nuances of, of information acquisition mm-hmm. <Affirmative> that I think humans, it will behoove us to have humans involved in. Although you could argue at, at some point maybe the patient will interact directly with the computer and, you know, at the end of the day it's, it's, we gotta go on whatever the information is that the patient is giving. So,

Adam Rodman (34:36):

So information acquisition

Shani Herzig (34:37):

In information acquisition I think is one area where we, we may need to continue to be involved for maybe longer than some other parts.

Adam Rodman (34:46):

And by information acquisition you mean talking to the patient? Mostly

Shani Herzig (34:49):

That is, yes. Yes. But you know, scientists, we like to say fancy words for very simple concepts. <Laugh> Yes. Talking to the patient. I, I I think that counseling and delivery of of information, I feel like that's always going to be, it's always going to feel better getting informa hard information to hear from someone that you have a long-term relationship with who can really sit down and, and explain it to you. And I don't know, I mean I, I do know that ChatGPT is already able to give pretty wonderful explanations of

Shani Herzig (35:43):

Do, you know, what I think is very interesting too, is the intersection of the pandemic with all of this. And what I mean by that is that I never thought that we would get to a point where humans, where it would be like the Jetsons or you know, where you don't leave your house all, you know, you, you, you do everything from, from a computer screen and that there's no need for human interaction. Cuz I was always like, people are always gonna need human interaction. But you know, the pandemic took away so much human act interaction and it hasn't fully come back and we don't know that it ever will. And I think now, you know, people still get interaction, but they get it from the people they specifically want to get it from as opposed to being forced to. And where I'm going with all of this is that I think that the pandemic has made it that we are all much more comfortable relating to other individuals through a computer screen mm-hmm. <Affirmative> by virtue of what has happened during the pandemic, such that that gap between the human element and what a computer can give has been narrowed.

Adam Rodman (36:47):

Right. And this is what I would say. We both have children and my kids are going to grow up interacting with chatbots and because, you know, they're two and four, chatbots that are more powerful than the ones now. That's right. So you and I, I don't think, would ever feel comfortable being counseled by a chatbot. I don't think that's gonna, like society changes, it's changed with technology before. There's no reason to think it won't continue to change. That's

Shani Herzig (37:10):

Right. Yep. There, there's just less, less human presence in general and make, and people are just more and more comfortable with lack thereof,

And that is ending the show on a rather bleak note. This episode has brought us up to 1978, which I think is the most contemporary that Bedside Rounds has ever gotten.

Contributors

Adam Rodman, MD, FACP - Host
Shoshana Herzig, MD, FACP - Guest

Reviewers

Paul Kunnath, MD, FACP
Gabriel Pajares Hurtado, MD

Those named above, unless otherwise indicated, have no relevant financial relationships to disclose with ineligible companies whose primary business is producing, marketing, selling, re-selling, or distributing healthcare products used by or on patients.  All relevant relationships have been mitigated.

Release Date:  September 5, 2023

Expiration Date: September 4, 2026

CME Credit

This activity has been planned and implemented in accordance with the accreditation requirements and policies of the Accreditation Council for Continuing Medical Education (ACCME) through the joint providership of the American College of Physicians and Bedside Rounds.  The American College of Physicians is accredited by the ACCME to provide continuing medical education for physicians.

The American College of Physicians designates this enduring material (podcast) for 1 AMA PRA Category 1 Credit™.  Physicians should claim only the credit commensurate with the extent of their participation in the activity.

ABIM Maintenance of Certification (MOC) Points

Successful completion of this CME activity, which includes participation in the evaluation component, enables the participant to earn up to 1 medical knowledge MOC Point in the American Board of Internal Medicine’s (ABIM) Maintenance of Certification (MOC) program.  Participants will earn MOC points equivalent to the amount of CME credits claimed for the activity. It is the CME activity provider’s responsibility to submit participant completion information to ACCME for the purpose of granting ABIM MOC credit.

How to Claim CME Credit and MOC Points

After listening to the podcast, complete a brief multiple-choice question quiz. To claim CME credit and MOC points you must achieve a minimum passing score of 66%. You may take the quiz multiple times to achieve a passing score.