As the general public has embraced large language models (LLMs) such as ChatGPT, Claude and Gemini, scientists have been exploring how these artificial intelligence (AI) tools could enhance medical research.
Some argue that LLMs could dramatically boost researchers’ efficiency in completing certain types of medical studies, and research published in February in the journal Cell Reports Medicine exemplifies that vision for the technology.
The study used massive datasets of patient biomedical information to predict the risk of preterm birth in a given pregnancy. These types of predictions have been a powerful AI use case for years and were possible with more traditional machine learning methods, well before LLMs. But this study was notable in that LLMs enabled junior researchers — a graduate student and a high school student — to efficiently generate highly accurate code.
That code predicted a baby’s gestational age at birth and the likelihood of preterm birth. The AI’s output matched and, in one case, even beat analyses from expert teams who had used human-generated code to crunch the same data.
“What I saw with junior scientists here and how effective they could be truly inspired and amazed me,” said study co-author Marina Sirota, interim director of the Baker Computational Health Sciences Institute at the University of California, San Francisco.
One big promise of LLMs is to lower the barrier for researchers to produce code and conduct complex analyses — but it comes with risks. As AI quickly improves, researchers must grapple with myriad questions. What guardrails need to be established to ensure AI’s accuracy? How do we measure its output? And how will the role of human researchers evolve as these systems gain prominence?
How AI prediction works
Sirota’s team drew on data used in the Dialogue for Reverse Engineering Assessments and Methods (DREAM) Challenges, international competitions in which teams of scientists tackle complex biomedical problems using shared datasets.
The open-source datasets included blood transcriptomics, which looks at RNA, a molecule that reflects which genes are active in the body; epigenetic information from placental cells, describing chemical tags that sit “on top of” DNA and control which genes can be switched on; and microbiome data describing the bacteria present in vaginal fluid samples.
These data points were flagged with the type of sample they came from — blood, placental tissue or vaginal fluid — and labeled with outcomes of interest, namely gestational age and preterm birth. Machine learning algorithms can then be trained to spot links between a sample’s source and its label. For example, they may reveal that microbiome samples with certain mixes of bacteria often come from people who have given birth early.
Once trained on a subset of data, the algorithm can be tested on samples that lack labels, to see if it can predict the label that should be there. For instance, it should flag samples with bacterial mixes similar to those in the training data linked to a higher risk of preterm birth.
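The train-then-predict loop described above can be sketched in a few lines. This is a minimal illustration, not the study's code: the features are hypothetical bacterial-abundance fractions, and the classifier is a simple nearest-centroid rule standing in for the far richer models used in practice.

```python
# Minimal sketch: train on labeled samples, predict labels for new ones.
# Feature vectors are hypothetical bacterial-abundance fractions.

def centroid(samples):
    """Average each feature across a list of samples."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Training data: (abundance vector, outcome label)
train = [
    ([0.7, 0.2, 0.1], "preterm"),
    ([0.6, 0.3, 0.1], "preterm"),
    ([0.1, 0.2, 0.7], "full_term"),
    ([0.2, 0.1, 0.7], "full_term"),
]

# "Training" here just means computing one centroid per label.
centroids = {
    label: centroid([x for x, y in train if y == label])
    for label in {"preterm", "full_term"}
}

def predict(sample):
    """Assign the label whose class centroid is closest to the sample."""
    return min(centroids, key=lambda lab: distance(sample, centroids[lab]))

print(predict([0.65, 0.25, 0.10]))  # prints "preterm"
```

An unlabeled sample whose bacterial mix resembles the preterm training samples is assigned the "preterm" label, just as the article describes.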
The final step is to evaluate the models’ accuracy and compare them. “Accuracy” in the context of machine learning has a specific definition: the number of correct predictions divided by the total number of predictions.
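That definition translates directly into code. The labels below are toy values for illustration only.

```python
# "Accuracy" as defined in machine learning:
# correct predictions divided by total predictions.

def accuracy(predicted, actual):
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

predictions = ["preterm", "full_term", "preterm", "full_term"]
truth       = ["preterm", "full_term", "full_term", "full_term"]
print(accuracy(predictions, truth))  # 3 of 4 correct -> 0.75
```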
Human- vs. AI-generated code
The DREAM Challenge was aimed at uncovering links between these medical metrics and the risk of preterm birth. Some risk factors, including having infections during pregnancy, are already well known. But the DREAM Challenge wanted to see what signals might be gleaned from clinical samples, like blood.
It’s the kind of work that normally demands months of effort from trained bioinformaticians. But instead of writing the analysis code themselves, the junior researchers in the recent study gave each of eight LLMs a single prompt describing the data available and the labeling task at hand: predicting gestational age or preterm birth.
LLMs tested
- ChatGPT o3-mini-high
- ChatGPT 4o
- DeepSeek R1
- Gemini 2.0 FlashExpThink
- Qwen 2.5 Coder
- Llama 3.2
- Phi-4
- DeepSeek-R1-Distill-Qwen
With this simple prompting, four of the eight models — DeepSeek R1, Gemini, and ChatGPT’s o3-mini-high and 4o — produced code that ran successfully. The best performer, OpenAI’s o3-mini-high, was as accurate as the original human DREAM Challenge teams. For one task, which involved estimating gestational age from epigenetic data, it was more accurate than the humans had been.
What’s more, the junior researchers generated results in about three months and submitted a manuscript describing them within six months, whereas the same process took the original DREAM Challenge teams years.
“We got lucky with the review process here, but six months to generate the results and write the paper is pretty incredible, especially for a junior scientist,” Sirota told Live Science.
Preterm birth, defined as delivery before 37 completed weeks of pregnancy, affects roughly 11% of infants worldwide. Babies born too early are at higher risk than full-term babies for a host of health troubles, including problems affecting their brains, eyes and digestive systems. Being able to predict which pregnant patients are more likely to give birth early could mean closer monitoring and treatments to protect the baby and make full-term birth more likely, experts say.
Beyond writing code
The data used in the Cell Reports Medicine paper started “in good shape,” Sirota noted, in tables that AI could easily read. “But we can speed that up as well — the cleaning part and normalization of data — with generative AI,” she said.
Sirota’s team is now exploring other LLM applications, including a new tool called Chat PTB (short for “preterm birth”) that they’ve developed. The ChatGPT-based tool is embedded in papers published by the March of Dimes research network, part of a nonprofit aimed at improving maternal and infant health. Instead of manually combing through this literature, researchers can now query Chat PTB and get synthesized answers with references — a task that used to take hours, compressed into seconds.
But tools like Chat PTB and the code-writing approach in Sirota’s study represent only the first wave. AI-enhanced medical research is moving toward “agentic” AI, meaning systems that don’t just respond to a single prompt but instead carry out multistep research workflows with increasing autonomy.

Instead of responding only with text, an agentic system can check and iterate on its own work until it reaches its objective. It can also take action on a user’s behalf, such as searching the internet and running code rather than just writing it.
That shift toward greater AI autonomy and less human oversight brings both enormous potential and serious risk. In a January study published in the journal Nature Biomedical Engineering, researchers evaluated LLMs on 293 coding tasks drawn from 39 published biomedical studies, initially allowing the LLMs to come up with workflows on their own. They found that the overall accuracy came in below 40%.
Their solution was to separate planning from execution: They had the AI produce a step-by-step analysis plan that a human researcher reviewed before any code got written. The approach boosted the accuracy to 74%.
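The planning-versus-execution split can be sketched as a simple control flow. This is a hedged illustration of the general pattern, not the paper's framework: `ask_model_for_plan`, `generate_code` and `human_review` are hypothetical stand-ins for the model calls and the human checkpoint.

```python
# Sketch of separating planning from execution: the model drafts a
# step-by-step plan, a human reviews every step, and code is generated
# only after the whole plan is approved. All callables are stand-ins.

def run_analysis(task, ask_model_for_plan, generate_code, human_review):
    plan = ask_model_for_plan(task)                   # model proposes steps
    if not all(human_review(step) for step in plan):  # human checkpoint
        raise ValueError("Plan rejected; revise before writing any code")
    return [generate_code(step) for step in plan]     # execution comes last

# Toy usage with placeholder implementations:
scripts = run_analysis(
    "predict gestational age from epigenetic data",
    ask_model_for_plan=lambda task: ["load data", "normalize", "fit model"],
    generate_code=lambda step: f"# code for: {step}",
    human_review=lambda step: True,
)
print(len(scripts))  # prints 3
```

The key design choice, mirroring the study's finding, is that no code is generated until every planned step has passed human review.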
“The goal is not to ask researchers to blindly trust an AI system,” study co-author Zifeng Wang, who was a doctoral student at the University of Illinois Urbana-Champaign at the time of the study, told Live Science in an email.
Instead, the goal is to “design frameworks where the reasoning, planning, and intermediate steps are visible enough that researchers can supervise and validate the process,” said Wang, who is a co-founder of Keiji AI.
Why safeguards matter
These risks don’t mean researchers should shy away from AI, but they do need to apply the same rigor to AI-generated work that they would to any other collaborator’s output, scientists caution.
“The question is not whether LLMs accelerate science or create ‘AI slop,'” Ian McCulloh, a professor of computer science at Johns Hopkins University’s Whiting School of Engineering, told Live Science in an email. “The question is how we leverage this powerful technology within the scientific method.”
But McCulloh also cautioned against holding AI to an impossible standard. People tend to assume AI is error-prone and downplay human error, he said, when, in reality, both humans and machines make mistakes. He anecdotally described a consulting client who lamented AI’s 15% miss rate on a certain task, not realizing his human employees’ miss rate was 25%.
“The goal of AI is not perfection,” McCulloh said, “but to do better than people.”
That effort will involve agreeing on how to measure AI’s success. Dr. Ethan Goh, a physician-researcher at Stanford University, pointed out that health care still lacks standardized benchmarks for evaluating AI’s performance. Goh recently published a randomized trial in JAMA Network Open that studied how LLMs influence doctors’ reasoning in determining diagnoses.
Because LLMs are trained on such a vast amount of data, “benchmarks are so expensive to produce,” Goh told Live Science. What’s more, he said, AI improves so quickly that most commercial models start beating the few benchmarks that exist and rapidly render them useless. Amid these challenges, Goh’s team at Stanford’s AI Research and Science Evaluation (ARISE) Healthcare Network is working to develop such standards by the end of this year.
For all the uncertainty around standards and safeguards, the researchers who spoke with Live Science shared a common conviction: AI belongs in the lab, but not unsupervised.
“We have to be careful not to forget what we know in terms of the scientific process,” Sirota said. “But I think the opportunity is tremendous.”