Future of Medical AI & Search
Over the last two years, “search,” as a category, has begun to shift tectonically. At SPC, we’ve been wondering: how will it change medicine?
Earlier this year, we sat down with Vivek Natarajan (lead researcher for Google’s Med-PaLM project and SPC alum), Heejin Jeong (ex-researcher at Together AI and SPC member), Eric Lehman (Head of Clinical NLP at OpenEvidence), and Sumanth Kaja (Faculty & Attending Physician at Columbia University Medical Center and SPC alum) to explore that question.
Heejin: To get us started: I studied reinforcement learning in my PhD at UPenn, then worked at Waymo as an ML engineer and tech lead. After three years there, I joined Together AI, first as a researcher, then moved into product, working on AI infrastructure.
Through personal experience, I started wondering where the variability in medical practice comes from. We don’t always experience it, but it becomes more apparent in complex and rare cases. So, I started talking to a lot of physicians to understand their thought process, decision-making, and how they keep up with evolving medical practices. One of the most common behaviors I noticed was search.
We’ve seen a lot of progress and innovation in how we do search, from simple keyword search to more AI-driven approaches. Also, many doctors have started using ChatGPT, not just in the U.S., but in other countries as well. That made me want to dig deeper into how doctors, health professionals, and even patients acquire medical information—especially through search.
Eric: I’m Eric — nice to meet you all. I just graduated last May with my PhD, working with Peter Szolovits at MIT. I’ve been very interested in AI applications for healthcare and obsessed with how we can get AI tools into physicians’ or patients’ hands in a way that makes sense in current workflows.
Over the seven or eight years I’ve worked in this space, it’s evolved from tiny applications that make doctors’ lives slightly easier to incredible tools that can transform everyday work. The question is how best to implement them in clinical practice.
I joined Open Evidence about three years ago, and it’s been a huge journey. The technology has changed a lot, so the mission has changed too. We’re always thinking about how to use state-of-the-art tools to make literature search easier or to streamline tasks like writing prior authorization letters, and so on.
Vivek: I’m a research scientist at Google, working at the intersection of AI, biology, and medicine. I used to be at SPC right after the COVID pandemic, dabbling in that same intersection. Back then, it was still early days for tech-bio and AI+biology, but now it’s a huge field, with hundreds of millions of dollars going into startups. A lot of the connections I made at SPC have become close collaborators, so I’m excited for this session.
Sumanth: I’m currently an emergency room physician and faculty at Columbia. Over the last 10–15 years, I’ve focused on expanding care access. Back around 2015, I joined the early team at Human Dx, which first got me thinking about how AI can be used for diagnostic reasoning and to expand access to care.
Later, I trained at NYU and Bellevue for my emergency medicine residency, including during COVID. Then I went to Columbia. I love my clinical job—it’s some of the most important work I’ll ever do, but it’s also extremely hard. We talk a lot about physician burnout, which comes from a number of places: administrative tasks, documentation, etc. But among my colleagues in emergency rooms, one of the biggest sources of burnout is not having the time or resources to help people the way we wish we could. Another reality is that most medical care isn’t delivered within hospital walls; it happens at home.
While we understand a lot, we still don’t understand most of medicine, whether we’re clinicians or patients. My biggest question is how we make sure that both I, as a clinician, and my patients can understand their health and quickly find the most appropriate answers to our questions. And perhaps more importantly, know how to ask the right questions.
Gopal: Over the last decade, people have become familiar with WebMD and have trusted Google with many a medical query. Help us understand the actual landscape. Do you think of it as consumer search (WebMD) versus professional medical search (for doctors)? Or do you use a different axis? How do you categorize the world of medical search from, say, 2014 to 2024, before we jump to 2025 and beyond?
Sumanth: I do like the traditional framing of patient search versus clinician or medical search. There’s a difference in the information delivered to a patient versus the information delivered to a clinician, and a lot of that is intentional. A lot revolves around nuance, safety, how to deliver information properly, and so on.
I think one thing that’s drastically changing now, though, with LLMs, is the ability to close that gap and merge the two channels into one. It becomes more of a design question: it’s the same information, but how do we tailor it to different users?
Heejin: I totally agree. One big reason we’ve had very separate channels for health professionals and patients is the knowledge bottleneck. Another factor is cost structures: for example, UpToDate costs money, whereas Google is free. Now that AI can parse medical sources in different “languages” for different audiences, I wonder if the future of medical search might unify where the information comes from.
Sumanth: I’m curious how you think about that, Eric, because at Open Evidence, you’re building a tool that’s great for literature search but can be used by different audiences. How do you all think about design for different audiences when it comes to information access and safety?
Eric: I think it has to be different. At a bare minimum, the writing has to differ for patients versus physicians, even if you’re starting from the same underlying studies. Doctors want primary literature; that’s helpful for them. Patients might just want a Harvard Health website, which they find great, whereas a doctor would say, “Really? That’s it?”
There’s also a different volume of information. Doing patient search is a lot easier, in my view, because there’s less nuance. As a patient myself, I can imagine exactly what I’d want to see. But designing something for doctors requires talking to many doctors. Doctors might say, “Yes, guidelines are great, but they can be out of date, and we need to look at specific studies.” Patients won’t question guidelines the same way. There are more edge cases with physicians.
Gopal: I find that translation aspect interesting. You’re fundamentally “translating” primary, secondary, tertiary evidence for different user groups. One big early (and still massive) application of large language models is translating between spoken languages. Is translating for doctors vs. patients the same shape of problem, or is it more complex than a simple translation?
Heejin: Can I add another layer? It’s not just a translation question. With LLM-based systems, you might “grab” a source and augment it, but you could also interpret the source differently depending on whether the audience is a doctor or a patient. You can highlight which part is “research” and so on. So maybe the question is: do we still need different sources? Or can a single AI interpret one source and deliver it appropriately to different audiences?
Eric: I still think if you could do a physician search perfectly, turning it into patient-friendly information is just a matter of good translation—“taking out the nuance,” basically. But it doesn’t go the other way around.
Sumanth: Beyond just the reading level, the questions each audience asks are different. Also, each user has different needs: quick facts vs. in-depth nuance vs. actionability. Translation, while obviously important for language and culture, is equally important across education and individual needs.
Gopal: Sumanth, you mentioned earlier that the biggest source of burnout for you and your colleagues isn’t just documentation, it’s being unable to help patients in the best possible way. If we project forward to a “doctor of the future” who somehow does that better, what does that person do that most doctors today don’t do?
Sumanth: Probably two or three big things. First, they’re able to easily keep learning as medicine changes. My practice has changed drastically even in the last five years. Research is published at an exponential rate, so staying on top of it is huge. People who weigh and evaluate research well tend to be some of the best clinicians. They can read, for example, 35 new papers and realize, “This one paper is truly practice-changing for this particular subset of patients in this specific way,” so they remember that. Filtering the noise is key, and that’s where tools like Open Evidence could be really powerful.
The second big thing is empathy. That’s crucial. In my first couple of years of training, I spent more time keeping up with research, but as I mature, I understand that listening and taking the time to explain things to patients can be just as helpful, if not more so. Many treatment failures happen because patients don’t understand the instructions or the next steps, or they can’t get to follow-up care (or because we don’t always have the time to explain properly). The biggest challenge around diagnosis isn’t always that we have the “wrong” answer; it’s often that we don’t have enough information or time. In the real world, diagnosis and treatment are iterative, with no single clear right or wrong answer. They unfold in parallel rather than as neat, sequential steps. Unfortunately, in our current state of what I hope we’ll one day look back on as caveman medicine, we often have to guess, check, and adjust.
Gopal: Eric, on the research side, how do you think about reactivity versus proactivity? For instance, Open Evidence is often used in a “question → answer” workflow. But doctors also do a lot of proactive learning—thumbing through journals, etc. Can you talk about the push vs. pull dynamic?
Eric: From a machine-learning standpoint, reactivity is easier. If a doctor knows exactly what they want, it’s straightforward to retrieve it. Vague queries are trickier because it’s not always clear what they actually need.
Google does an incredible job of responding to vague queries with knowledge panels and so on. I don’t think we’re there yet. We can give a summary, but we don’t have that rich “overview” experience. Also, offering push updates can be complicated. Right now, if a user asks a question and a new paper emerges five months later that changes the answer, we can email them if they want that feature. But giving doctors broad “overview” updates that say, “Here are three new papers that change your practice,” verges on creating clinical guidelines from scratch, which is very challenging.
Sumanth: I do think we’ll get there eventually. My job needs both: active and passive information retrieval. Every week, I block out time to keep up with research—starting with curated blog posts or summaries from academic experts who have helped me filter the top of the funnel, and narrowing to a few papers I’ll read in depth. But on my shifts, I look up a lot of specific questions—some are recall questions like dosing checks, others are extremely nuanced. It’s rarely “Patient X has these symptoms—what’s the diagnosis?” It’s more “Does this biomarker hold true in a patient with advanced Crohn’s disease? Does that hold if the patient also has liver failure?” That nuance is rarely in UpToDate or traditional sources, so I rely on colleagues or deeper research after my shift. That’s where advanced research filtering, retrieval, and synthesis tools will help.
Gopal: Heejin, I’m curious—coming from your background in reinforcement learning at Waymo, then AI infra at Together, do you see through-lines between those experiences and the medical search space?
Heejin: A couple of things come to mind. One is the debate around end-to-end vs. modular approaches for building an AI system. Waymo leaned more modular, with multiple ML models working together, which led to better interpretability. Now, in LLM applications, especially where safety and accuracy matter, we’re also seeing an emphasis on interpretability.
Another interesting parallel is the “long tail.” In self-driving, the long tail is a crucial problem: you can’t deploy a model at 85–90% accuracy and call it done. Similarly, in medicine, the bar for safety and accuracy is very high. How do we address long tail cases or evaluate them properly? That’s a big question.
Vivek: [The long tail] is probably the most important problem to solve if we want to deploy AI at scale in healthcare. It’s not too different from self-driving’s evolution: you combine real-world miles (with safety drivers) and simulation.
Much of our data comes from creating “diagnostic dialogue” learning environments—sometimes multimodal—to simulate diverse patient encounters and see how the AI doctor/agent behaves. We get feedback from these simulations, at a massive scale, which is otherwise impossible with real-world data alone.
One key is that your feedback mechanisms must be specific and granular—if it’s too broad, it’s not actionable. It’s not as straightforward as in Go or StarCraft, where “winning” is the clear objective, or in self-driving, where “don’t crash and reach your destination” is key. In medical consultations, good care depends on eliciting the right history, communicating well, building rapport, and more. We track 60 different axes—some clinical, some not—and generate auto-eval metrics for each axis in simulation rollouts that can hit hundreds of millions of encounters.
This helps the AI become what you might call “the world’s most experienced doctor,” because in simulation you can systematically vary symptoms, backgrounds, rare diseases—really anything. In real life, a rare disease is truly rare, but in simulation, you can present it frequently so the model learns. Of course, we also focus on real-world miles, but simulation provides that combinatorial breadth to handle the long tail.
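To make the shape of that setup a bit more concrete, here is a minimal, hypothetical sketch of a simulated diagnostic-dialogue rollout scored with per-axis auto-eval metrics. The patient simulator, axis names, and scoring heuristics are all illustrative assumptions, not Google’s actual pipeline.

```python
# Hypothetical sketch: simulated diagnostic dialogues scored along multiple
# axes, with rare conditions oversampled. Names and scoring rules are
# illustrative only, not the real Med-PaLM training or evaluation setup.
import random
from dataclasses import dataclass

AXES = ["history_taking", "diagnostic_accuracy", "communication", "rapport"]

@dataclass
class SimulatedPatient:
    condition: str       # ground-truth diagnosis for this rollout
    symptoms: list[str]  # findings the agent should elicit

def sample_patient() -> SimulatedPatient:
    # In simulation, rare conditions can be presented as often as common ones,
    # which is what gives the model combinatorial breadth on the long tail.
    cases = {
        "common_cold": ["cough", "runny nose"],
        "rare_disease_x": ["night sweats", "joint pain", "rash"],
    }
    condition = random.choice(list(cases))
    return SimulatedPatient(condition, cases[condition])

def run_dialogue(patient: SimulatedPatient) -> dict:
    # Placeholder for an LLM-driven multi-turn consultation: the "agent"
    # elicits some subset of the findings and then commits to a diagnosis.
    elicited = [s for s in patient.symptoms if random.random() > 0.3]
    diagnosis = patient.condition if len(elicited) == len(patient.symptoms) else "unknown"
    return {"elicited": elicited, "diagnosis": diagnosis}

def auto_eval(patient: SimulatedPatient, transcript: dict) -> dict[str, float]:
    # One granular score per axis; a real grader would be rubric- or LLM-based
    # rather than these toy heuristics.
    return {
        "history_taking": len(transcript["elicited"]) / len(patient.symptoms),
        "diagnostic_accuracy": float(transcript["diagnosis"] == patient.condition),
        "communication": random.uniform(0.7, 1.0),  # stand-in for a rubric grade
        "rapport": random.uniform(0.7, 1.0),        # stand-in for a rubric grade
    }

if __name__ == "__main__":
    totals = {axis: 0.0 for axis in AXES}
    n_rollouts = 1000  # real systems run many orders of magnitude more
    for _ in range(n_rollouts):
        patient = sample_patient()
        scores = auto_eval(patient, run_dialogue(patient))
        for axis in AXES:
            totals[axis] += scores[axis]
    for axis in AXES:
        print(f"{axis}: {totals[axis] / n_rollouts:.3f}")
```

The point of the sketch is the structure: granular, per-axis feedback from each simulated encounter, aggregated over enormous numbers of rollouts, with the case mix under your control rather than dictated by real-world prevalence.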
Gopal: Pulling on the self-driving thread – we can imagine two buckets: autopilot and copilot. If we look a few years ahead: maybe in three years we’ll have, say, Med-PaLM 4 or 5 acting as a high-fidelity “junior doctor” in the palm of billions of people worldwide (i.e. autopilot). Meanwhile, Open Evidence might be the “best copilot” for the best doctors globally. Those feel like two different ways to build products for the future of medicine.
I’m curious, Eric—when you think about Open Evidence’s top priorities, how closely do they align with what Vivek described? Or do you view it as a different design space, since you’re assuming there’s a human doctor behind the wheel?
Eric: I think it makes our problem easier, and it lets me sleep better at night. Even if the system isn’t 99.999% accurate, you still have some of the most capable people in the world—doctors—using it and applying their own reasoning. Also, the task of retrieval and research is more objective than many realize. It’s not just random guesswork on what paper is relevant; it’s fairly straightforward to say, “This newer paper is more relevant than that old paper.” There’s usually decent agreement among people about which studies matter most.
But when you start talking about medical diagnosis, next steps, and so on, it gets really hard. The label space is huge, and different clinicians have different opinions. If you’re trying to build general medical intelligence, that’s a taller order than what we do. We just want to surface the best literature so doctors can decide. That’s something we might be able to do extremely well in the next year or two—maybe near-perfect.
Meanwhile, I hope we see general intelligence for medicine keep improving over the next 10 or 20 years.
Vivek: I don’t think these use cases require fundamentally different technology. The same system that can perform a simulated consultation as well as a doctor can also help augment clinicians with, say, complex specialty diagnoses.
They’re not two totally different architectures. The technology at the frontier is so general that you can reconfigure it for different roles, like “assistant” vs. “fully automated.”
Heejin: I imagine one big challenge is the gap between when a diagnosis or treatment happens and when its consequences show up. That feedback loop can be quite delayed. Another is all the confounding factors that happen in between. So it’s hard to pinpoint whether a certain AI output caused a certain outcome.
How do you handle that complexity when trying to build a truly great system?
Eric: It does make this hard. I remember looking at reinforcement learning in healthcare a while back, and it always felt magical—sometimes it works, sometimes it doesn’t. At least five years ago, RL in healthcare didn’t produce amazing results, because patient cases are so complex. Trying to build counterfactuals from real hospital data didn’t yield huge gains. I’m curious if Vivek has seen bigger improvements recently.
Vivek: I’d say the biggest challenge in healthcare is data. Even if you took all the EHR data out there, it’s not that helpful. EHRs are messy, built for billing and fee-for-service, not for training advanced AI. If you train on that data, you might just reproduce fee-for-service behaviors, not better outcomes.
I think we need to build a new system—something like a collaborative case management record—where the patient’s journey and the clinician’s workflow are captured meaningfully. That’s not what EHRs do now. They’re messy, and it’s very hard to get the crucial interventions and outcomes.
Sumanth: I agree completely. We don’t have great datasets to measure these things properly, nor have we defined the right metrics. The other issue is that clinical data in charts is incomplete and must be used in context. Data capture in EHRs, while it records some of the general clinical story, was designed to optimize for billing, coding, and liability. EHRs capture only tiny segments in time, biased toward what gets entered, and only within a medical encounter. So the patient picture is grossly incomplete. We need multimodal approaches—video, audio, text, and more—and we need in-context data.
It’s very hard to re-architect EHRs from the inside because they were designed for billing. On the patient side, too, we need ways to capture everything relevant: the questions they ask, the concerns they have, the advice from their uncle, etc. Ambient listening is a first step, maybe. But we need truly collaborative interfaces that record the reality of a patient’s experience. That data could help an AI system produce more accurate reasoning.
Gopal: Yeah, and that implies a different care journey, right? If you’re a patient having these high-frequency, high-fidelity interactions with your provider—whether it’s a human or a system or both—that’s a radically different world from now. Heejin, you’ve been digging into this space, searching for an interesting problem to solve that could help bring us closer to that future. What has surprised you about the industry or the people you’ve met? And for others, maybe over the last 3, 6, or 12 months, what’s something that’s really changed your perspective?
Heejin: The first thing that surprised me was how much doctors actually search. The methods and the amount of searching vary by domain, but it was surprising to see how often they’re collecting information and trying to access the newest knowledge—very similar to what we do outside of medicine as laypeople. That’s part of why I wanted to learn more about how search will evolve with AI in the medical space.
I believe there’s a lot more we can do with search and intelligence to make information more personalized. Doctors have to consider a lot of complex patient and medical contexts. They read all these guidelines and journals, which are standardized, and figure out what works best for the patient’s specific case. Similar trends exist in other AI domains: moving from standardized, generalized intelligence to more personalized and detailed nuance. It’s exciting to see if that will change the medical field.
Eric: I always assumed ML in medicine would be a 30- or 40-year journey, but it’s been shocking how quickly large language models have been adopted. I didn’t think physicians would use tools that can make mistakes, but many are asking ChatGPT to summarize patient notes and trusting it to be reasonable. The risk tolerance among physicians has been surprising.
Sumanth: I’ve found that clinicians aren’t hesitant to use new tools; they just have a very high threshold for quality, and the realities of our care are often mismatched with the products designed for us, even those with some of the highest new adoption. Now, for the first time, we’re seeing tools that match or outperform the legacy tools we already use; we understand our limitations (mostly), and we want them.
Also, how quickly things have moved in just one year is incredible. I’m starting to see limitations that come less from the models and more from the data. If we capture better data from both patients and clinicians, we can build AI that fundamentally changes healthcare. While true expertise and new information will not be replaceable, I don’t think we will always need deep human expert understanding and mental models in every situation if we do this right.
Vivek: I agree with that. A lot of medicine has been one-size-fits-all, often based on studies of mostly Western populations. That’s suboptimal, and it hurts people. The big opportunity with AI is to scale personalized healthcare to everyone. If we build the right infrastructure and learning environments, we can give interventions tailored to individuals, including more nuanced data—what they eat, where they live, wearable data, even molecular measurements. We can provide truly personalized insights. This no longer feels like science fiction; we can actually get there.
Gopal: Today, medicine is usually discrete conversations, whether that’s with Google or an actual clinician. Tell me more about how multimodality changes that. Most people might not think in terms of “multimodality,” but it’s a different way of interacting with your own health.
Vivek: Yes, we’re measuring at much more detailed resolutions—genetic data, cellular data—and no human expert can manually process terabytes of daily data. But AI systems can. They can make fine-grained predictions and see the immediate outcomes of interventions, maybe within an hour.
We’re moving to ingestible bioelectronics, more data about the microbiome, neural interfaces. We’ll be something slightly beyond just “humans.”
That’s good, because we’ll have proactive interventions and a healthcare system fundamentally different from today—thanks to AI’s ability to interpret multimodal data.
Sumanth: On that same thread, we’ve been working toward personalized medicine for years, and now we have an inflection point to move much faster. It’s funny, but even the constant, albeit sometimes slow, shift from treating buckets of symptoms to treating underlying pathophysiology has been one of the largest steps toward personalization. One example is psychiatry and neuropsychiatry: we’ve made tremendous leaps in research in recent years, moving from throwing meds at DSM-bucketed criteria toward learning how to differentiate DSM-defined diseases with much higher resolution. Research is moving quickly, but it often takes years for even good research to permeate practice.
Another challenge in developing much better personalized medicine is in areas where we have fewer of the very concrete metrics we are learning to measure in, say, cancer or other genomics-led care. Medicine has a huge observability problem. As we advance in how we sense and capture data, maybe we’ll move from diagnosing someone with “antiphospholipid antibody syndrome,” or so many other diseases that we diagnose and treat in a “gray” area, to better understanding the individual biochemical and genetic underpinnings of disease subclasses. Disease names are artificial ways for us to capture and explain. Hopefully, instead we’ll be able to estimate a personal probability of clotting in a given scenario, regardless of the label. Data and sensors are the key.
Gopal: I’m curious: what do each of you see as the hardest question you haven’t figured out how to answer for the next phase of what you’re building?
Vivek: From my perspective, it’s no longer purely a technical challenge. Sure, there are a few unknowns, but I’m pretty confident we’ll figure those out. We’ll have, in some broad sense, a “medical superintelligence” that’s generally available. The real question is about incentives, societal adoption, and making sure this future is evenly distributed. We don’t want a scenario where only certain sections of society or certain geographies have access while large parts of the world do not.
Another question is how to prevent “the perfect” from becoming the enemy of “the good.” Right now, there’s a certain bar for humans to practice medicine, and I believe AI should meet roughly the same bar. Unfortunately, many people expect perfection from AI. If it makes a single mistake, it can cause huge controversy or harm, potentially setting back the field by years. How do we accept that these are probabilistic machines—that they will make mistakes, just in different ways than humans—but still see the overall benefit?
Ultimately, we need rigorous evidence showing these technologies are beneficial to society as a whole. We need to use the scientific method and the gold standards of medicine to generate that evidence. Otherwise, if we deploy unproven technology and it causes harm, it could set the whole field back. Yet these systems can be extremely helpful if handled properly.
Sumanth: We see that a lot in discussions of self-driving. Now we’re talking about people’s lives in healthcare. The five of us probably live in a bit of a bubble in how we think about probabilities and adoption. But I wonder if we’ll see a broad societal shift. As a doctor, I’m probably also a probabilistic machine—I’m not perfect either, and I miss things.
Heejin: In self-driving, AI would handle a scenario correctly but in a way that feels different from how a human driver would. So even if an AI system is 98% accurate and a human is 98% accurate, the distribution of when they’re right or wrong might be very different. I’m curious how that affects society’s acceptance. Should we develop AI that thinks more like a human, or should we only optimize for the best outcome and assume society will adapt?
Vivek: It’s a great question. My bias is we should optimize for the best outcomes, not necessarily to mimic human reasoning. For instance, I’m using a laptop right now, but I don’t really understand how the transistors work—it’s still magical to me. Similarly, if we build enough evidence that an AI is making the right decisions most of the time, people will adopt it.
Look at self-driving. Initially, if you told someone, “Go ride in a driverless car,” they’d say no. But once people experience it in San Francisco, most say it’s magical. So it’s often about that first encounter that shows the system doing something previously impossible—whether real-time access, better diagnoses, long-term engagement, or something else. That’s how you get believers. Of course, any new technology has people on both extremes, but I think the majority will follow once they see tangible benefits.
Sumanth: Medicine is all about trust. It’s high-stakes, with big information asymmetry. But oddly, it might be easier to adopt in healthcare than in self-driving, because for better or worse, many people, both domestically and around the world, don’t have good access to care right now, and many don’t trust the existing system or clinicians. If AI can solve that access problem, I think patient adoption could be quite fast, even if some clinicians hold out longer.
Gopal: Eric, what about you? What boulders are you gearing up to push up the hill?
Eric: I’m especially interested in AI systems that act almost like a real-time agent. Within five or ten seconds, I want a comprehensive answer at an expert level—essentially real-time research. If I ask, “Should I give this treatment to this specific patient?” the AI would say, “Well, that’s nuanced because of X, Y, and Z. The guidelines say this. Here’s a relevant trial—though not exactly your patient’s population—and so on.” That would be like having an expert physician who can instantly synthesize all available information.
Then there’s the broader question of how we integrate AI into medicine. We still don’t have much guidance from the FDA, which could shift things dramatically. AI in medicine might continue to grow quickly, or we might be forced to wait for 10-year studies. We might only get to use Med-PaLM 2 once Med-PaLM 10 is already out. So, my dream scenario is to start with lower-level applications and then move toward true decision support down the line.
Gopal: Building on this vision for the future of medicine, let’s fast-forward to 2030. Vivek, what do you see that’s fundamentally different about how we experience care?
Vivek: I hope by 2030, waiting to see a primary care doctor or specialist is practically alien. People should have access to world-class expertise wherever they are, on whatever device is standard in 2030.
I also hope healthcare shifts from “do more encounters for billing” to “look after you as a person.” That means not waiting until you’re really sick, but detecting issues years earlier and intervening earlier, which also saves resources. It’s proactive rather than reactive. And that frees up doctor time for the more complex or urgent cases. So overall, you’d spend less time in the clinic and more time just living your life.