Boosting Oncology Trials: LLM Accuracy Meets Unprecedented Speed

Fact checked by Sabrina Serani

AI streamlines patient identification for clinical trials, significantly reducing review time while maintaining high accuracy and transforming oncology workflows.

Cancer cell image by Gwendolyn Salas / MJH Life Sciences using AI

Optimizing workflows and patient identification for clinical trials remains a critical challenge in oncology. Manual chart review, while thorough, is an incredibly time-intensive process that can significantly delay patient enrollment and trial initiation. To address this, a new study explored the utility of large language models (LLMs) in streamlining this crucial step.

Christine Vecchio, research nurse coordinator at Cleveland Clinic Taussig Cancer Center, and Eirini Schlosser, chief executive officer and founder of Dyania Health, discussed their recent abstract presented at the 2025 American Society of Clinical Oncology Annual Meeting, which compared a medically specialized LLM against human manual review for identifying patients eligible for clinical trials, with a focus on both accuracy and completion time.

The LLM demonstrated comparable accuracy to a specialized melanoma research nurse (95.73% vs 95.11%) yet completed the task in 2.5 minutes, compared with 7 hours for the human reviewer. The efficiency gain was even more pronounced when the LLM was pitted against a generalized research nurse: the model maintained its high accuracy (95.73% vs 88.09%) while dramatically outperforming in speed (2.5 minutes vs 9 hours).

In an interview with Targeted Oncology™, Vecchio and Schlosser delved into the methodology, practical adoption strategies for community oncology practices, and the broader implications of AI's expanding role in shaping future oncology workflows.

Targeted Oncology™: Can you provide some background on this abstract and what was looked at?

Vecchio: This trial compared an LLM vs human manual review of medical records and then compared the accuracy and length of time between the 2 methods. The LLM and the melanoma research nurse had comparable accuracy. It was 95.73% vs 95.11%, but obviously, there was a significant difference in the time of completion. The LLM took about 2 and a half minutes, and I took 7 hours. But when we compared the LLM with a generalized research nurse, we had both a significant difference in accuracy and in the time for completion. The accuracy was 95.73% for the LLM and 88.09% for the research nurse, and the time for completion was, again, 2 and a half minutes vs 9 hours. Overall, it demonstrated how useful medically specialized LLMs can be in reducing clinical time and finding eligible patients for trials.

Can you briefly describe the methodology used to evaluate both approaches?

Schlosser: Firstly, with regards to benchmarking any large language model, we try to map the benchmark closely to the exact task it is being asked to perform. Effectively, here we looked at the ground truth, which is defined as what a consensus of clinicians would determine to be the gold-standard, accurate result. I say consensus because Christine and the other research nurse were solving this task at the individual level. There is a difference, in general, between that and a consensus, where many different people with the clinical expertise to assess those conclusions all agree that this is the right answer; that agreed-upon answer gets defined as the 100% score.

Then we assessed the performance of Synopsis AI's large language model, as well as the research nurses, both Christine, who specializes in melanoma, and the generalist research nurse, against that benchmark. We also did this with 2 different cohorts, so it was a total of over 1000 question sets with the pairs of notes. Effectively, you can imagine that would inevitably take any human around 20 minutes per note set at a minimum. Christine was obviously, I think, faster and ahead of the curve in general. But overall, the LLM will go as fast as the computing resources you designate to support it. With more GPUs, it would go even faster; with fewer, it would be slower. The 2 and a half minutes was on 2 GPUs.
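In rough terms, the benchmarking Schlosser describes amounts to scoring each reviewer's answers against the consensus labels. The sketch below is an illustration only, not the study's actual pipeline; the question names and labels are hypothetical.

```python
# Minimal sketch (not the study's actual code): scoring model and nurse
# answers against a consensus "gold standard" for each eligibility question.

def accuracy(answers: dict[str, str], consensus: dict[str, str]) -> float:
    """Fraction of questions whose answer matches the consensus label."""
    matched = sum(1 for qid, label in consensus.items() if answers.get(qid) == label)
    return matched / len(consensus)

# Hypothetical example: three eligibility questions for one patient.
consensus = {"stage": "III", "braf_v600e": "positive", "prior_immunotherapy": "no"}
llm_answers = {"stage": "III", "braf_v600e": "positive", "prior_immunotherapy": "no"}
nurse_answers = {"stage": "III", "braf_v600e": "positive", "prior_immunotherapy": "yes"}

print(f"LLM accuracy:   {accuracy(llm_answers, consensus):.2%}")    # 100.00%
print(f"Nurse accuracy: {accuracy(nurse_answers, consensus):.2%}")  # 66.67%
```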

How accurate was the LLM-based system in identifying specifically melanoma patients for trials?

Schlosser: 95.71% accurate.

What types of data or clinical criteria posed the greatest challenge for each method?

Vecchio: Staging can change across a patient's timeline. For me as a human, the challenge is making sure I find the most accurate staging... I mean, technically, they stage it once, but clinically, how you treat it changes across time.

Schlosser: In general, the large language model is not designated as a medical device. If information is missing, we explicitly instruct it not to give an assessment. For example, if no one has put in a tumor stage, we would be able to determine the thickness and size of the tumor, the lymphatic spread, and the metastasis, but we would not instruct it to deduce a stage, even though we could. That would fall into the medical device category, so we ultimately leave that deduction to the clinicians; the final responsibility stays with a human.

That is critical because the situation I just described is actually a simple use case where you would say, "Well, it clearly maps to whatever that staging approach looks like." In a real-world environment, you're often looking at information that could be ambiguous, vague, or incorrectly recorded, or where there is confusion or conflict between what might be in a pathology report vs what came from a patient's previous provider, etc. So, the flagging of ambiguity is actually the real value that the LLM can provide, in the sense that we can say, "Okay, here is the attributable source that led to this conclusion, for, let's say, the size of the tumor." That allows an auditable trail for the humans in the loop to go back, check with the provider, and verify any other information; or maybe a patient was not tested, and they can get a genetic panel, for example.

I think it is an interesting perspective because of the way that we set up this particular study. We wanted to make sure that we were putting it in a test tube environment, basically, so that the LLM was asked to answer questions, and those questions were exactly the ones posed to the humans in comparison. So, staging and points where there is missing information are not necessarily more difficult; the accurate answer would be that the result is inconclusive if there is nothing listed. That is where you move from "Here are the actual results of the AI's performance" to "Here is what needs to be done in response to those answers." We view the clinical trial criteria as mirroring what would be asked as a question to the AI system.

What role did human reviewers still play in the AI-assisted process?

Schlosser: I think if you look at this outside the lens of AI vs human performance at chart review, chart review is only a very small sliver of the process of getting patients into clinical trials, having them participate, doing day-of screening, contacting the patient, etc. Clinical trials broadly are a very human process. But in the particular clinical trial we were working on in parallel with this abstract, the setup was that we were automating the review of about 1500 [patients with] melanoma every day. Those patients have changing characteristics over time. If a human were to sit down today and say, "I am going to read 1500 patients end-to-end, all of their [electronic medical record (EMR)] notes and histories for the past 3 years that might be relevant," it would take them several years to get through the 1500 just once. If those patients are changing daily, the comparable process is not actually humanly possible.

I think the main message here is being able to deploy AI on tasks that were humanly impossible anyway, and then have the humans be much more focused on where they can drive value, like communicating with the patients and getting them on the actual studies. That is the majority of the work. The AI is processing the data to empower that work. So, it is basically quite complementary. And we are of the opinion that you could never really run a clinical trial, at least in this decade, without humans driving it.

How can community oncology practices realistically adopt LLM tools for trial screening?

Schlosser: Firstly, within the context of this abstract, we have been deployed at Cleveland Clinic. As a partner of Cleveland Clinic, which has invested in us, we are processing the data on its behalf without it leaving the firewall. [This means that] if there is a community or regional site within Cleveland Clinic, the data is processed within the bounds of that database. A new site, or, let's say, a community oncology site that is not part of Cleveland Clinic, would be a separate relationship with us.

I would say that broadly, this is not something that you can just throw an LLM at because, for example, ChatGPT has a specific context window, meaning it will read up to a certain amount of text and answer questions about that text. It's not particularly capable of reasoning. Effectively, a large language model is one part of a system. We have 8 different models that do different things.

One model, for example, will determine whether a note is relevant, paired with the type of question we're asking. If we're looking for patient-reported outcomes, the call transcripts might be relevant, or maybe the patient was just calling about their appointment time and the note is completely irrelevant. It would be a waste of computing resources to have the AI read all 800 notes that might exist for a patient. In addition to that, there is the multi-note approach, which is extremely common: the entirety of medicine is built on the ability to read and deduce what is often a singular conclusion from information spread across dozens of notes and labs. That is essentially the opposite of the task that general-purpose large language models are built to do.
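As an illustration of the relevance-gating idea, the sketch below uses a simple rule table as a stand-in for the dedicated relevance model Schlosser describes; the note types and question categories are hypothetical.

```python
# Hypothetical sketch of a note-relevance gate: a cheap filter decides which
# notes are worth reading before the expensive model sees them. A rule table
# stands in here for what the interview describes as a dedicated model.

RELEVANT_NOTE_TYPES = {
    "patient_reported_outcomes": {"call_transcript", "progress_note"},
    "tumor_staging": {"pathology_report", "oncology_note", "imaging_report"},
}

def select_notes(notes: list[dict], question_type: str) -> list[dict]:
    """Keep only the note types plausibly relevant to the question being asked."""
    allowed = RELEVANT_NOTE_TYPES.get(question_type, set())
    return [note for note in notes if note["type"] in allowed]

notes = [
    {"id": 1, "type": "call_transcript", "text": "Patient asked to move an appointment."},
    {"id": 2, "type": "pathology_report", "text": "Breslow thickness 2.1 mm ..."},
]

print([note["id"] for note in select_notes(notes, "tumor_staging")])  # [2]
```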

I would say that for any new community cancer centers, it needs to be quite an intentional focus to partner with a group that does this. It is not something where they can just say, "Okay, we're going to have ChatGPT go do this." There is really not an extensive population of AI researchers who have experience doing this, alongside the data engineering acumen to build the plumbing and the systems for a proper end-to-end process to find patients and enroll them in studies. The same thing goes for structured data as well. This is not particular to melanoma, but let's say you're looking for 2 consecutive [prostate-specific antigen (PSA)] level changes in [patients with] prostate cancer; that effectively would be 2 characteristics coming from the labs. You are not looking for a large language model to do that when it is a numerical or codified dataset.
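For a structured-lab criterion like the PSA example, ordinary code over the lab table suffices. The sketch below is an illustration with hypothetical values, not any vendor's implementation.

```python
# Illustrative sketch only (hypothetical values, not any vendor's implementation):
# finding 2 consecutive rising PSA values is a structured-data query, not an LLM task.

from datetime import date

# Hypothetical PSA lab results for one patient, as (collection date, ng/mL) pairs.
psa_results = [
    (date(2024, 1, 10), 4.1),
    (date(2024, 4, 12), 5.0),
    (date(2024, 7, 15), 6.3),
]

def has_consecutive_rises(series: list[tuple[date, float]], n: int = 2) -> bool:
    """True if the date-ordered values contain at least n back-to-back increases."""
    values = [value for _, value in sorted(series)]
    streak = 0
    for earlier, later in zip(values, values[1:]):
        streak = streak + 1 if later > earlier else 0
        if streak >= n:
            return True
    return False

print(has_consecutive_rises(psa_results))  # True: 4.1 -> 5.0 -> 6.3
```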

What larger messages does this study send about AI's evolving role in oncology workflows?

Vecchio: Definitely making sure that you have the right tool for what you need. And then, people are always concerned that it is going to replace humans. As we have all alluded to already, this is just a tool that allows me to spend more time with my patients. So, it is about making sure I have the quality tools that let me do that.

Schlosser: I would echo that. I think the workflows around this are really what make a big difference with regards to how the humans use the tools, and then how it can effectively be a catalyst of empowerment that enables the research nurse coordinators as well as the physicians to spend time on what they need to do in treating patients. That's the big focus here. I think it's not just a buzzword effort to say, "Oh, okay, we're doing AI for something." You can throw AI at anything, but if it's not effectively used as a tool, it's going to be a moot point in really improving clinical trials as well as patient care.

REFERENCE:
Vecchio C, Braley S, Kennedy L, et al. Analysis of a large language model-based system versus manual review in clinical data abstraction and deduction from real-world medical records of patients with melanoma for clinical trial eligibility assessment. J Clin Oncol. 2025;43(suppl 16):1571. doi:10.1200/JCO.2025.43.16_suppl.1571
