A Comparison of a Large Language Model vs Manual Chart Review for the Extraction of Data Elements From the Electronic Health Record
2023; Elsevier BV; Volume: 166; Issue: 4; Language: English
10.1053/j.gastro.2023.12.019
ISSN: 1528-0012
Authors: Jin Ge, Michael Li, Molly Delk, Jennifer C. Lai
Topic(s): Machine Learning in Healthcare
Abstract: Large language models (LLMs) hold tremendous potential for accelerating clinical research and augmenting clinical care. One of the most promising LLM use cases is natural language processing (NLP): the extraction of data elements from unstructured text, for which LLMs may be superior to existing NLP software packages (1. Singhal K, et al. Nature 2023;620:172-180; 2. Bhattarai K, et al. bioRxiv [Preprint, September 29, 2023], https://doi.org/10.1101/2023.09.27.559788). Because of its standardized Liver Imaging Reporting and Data System (LI-RADS) (3. Chernyak V, et al. Radiology 2018;289:816-830), hepatocellular carcinoma (HCC) imaging provides an ideal test case for LLM-enabled NLP data extraction from unstructured text. We sought to assess the performance of a general-purpose LLM, permitted for use with protected health information (PHI), vs human manual chart review in extracting 8 distinct data elements from HCC imaging reports. Detailed methods are provided in the Supplementary Materials.
In brief, we used the Versa generative pretrained transformer (GPT)-4 application programming interface (API), the PHI-compliant version of the Microsoft Azure OpenAI GPT-4 LLM implemented at the University of California, San Francisco (UCSF) (4. Azure OpenAI Service Models, https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models, accessed August 26, 2023). Versa GPT-4 has a limit of 8192 tokens, the unit used to compute text length (4). We manually reviewed 1101 longitudinal pre- and post-locoregional treatment (LRT) computed tomography or magnetic resonance imaging abdominal imaging reports for 753 patients with HCC enrolled in the Functional Assessment in Liver Transplantation (FrAILT) study at UCSF (5. Lai JC, et al. Hepatology 2017;66:564-574). The imaging may have been obtained at any time during the patients' clinical course (before or after HCC diagnosis) and did not necessarily include HCC lesions. We manually tagged reports for 8 distinct data elements: (1) maximum LI-RADS category for HCC lesions, (2) number of HCC lesions, (3) diameter (in centimeters) of the largest lesion, (4) sum of diameters (in centimeters) of all HCC lesions, (5) presence/absence of macrovascular invasion, (6) presence/absence of extrahepatic metastases, (7) presence/absence of previous LRT, and (8) study adequacy. The manual tagging process took about 1 to 2 minutes per record, for an aggregate time of ∼28 hours. All 1101 imaging reports were trimmed to include only findings and impressions. We iteratively developed a few-shot prompt with testing on the first 5 records by alphabetical order of coded identifiers (Supplementary Figure 1) (6. Wang J, et al. arXiv [Preprint, April 28, 2023], https://doi.org/10.48550/arXiv.2304.14670). Because snowballing of the data passed per prompt often led to execution failure from exceeding GPT-4's token limit, we ran 56 loops of the final prompt with 20 records per loop to process all 1101 records. Execution of the loop extraction code took ∼2 hours. We evaluated the performance of Versa GPT-4 vs manual chart review by calculating accuracy, precision, recall/sensitivity, specificity, and F1 score (the harmonic mean of precision and recall) for each of the 8 data elements. For multilevel classifications (maximum LI-RADS, number of HCC lesions, diameter of the largest lesion, and sum of tumor diameters), we calculated weighted-average precision, recall/sensitivity, and F1 (7. Sokolova M, Lapalme G. Inf Process Manag 2009;45:427-437). For binary classifications (macrovascular invasion, extrahepatic metastases, previous LRT, and study adequacy), we defined the presence of these features as positive cases for precision, recall/sensitivity, and F1. We treated a manual chart review value of 0 as the only negative class and calculated binary specificities. We estimated 95% confidence intervals through bootstrapping with 2000 iterations. All analyses were conducted in R, version 4.3.1 "Beagle Scouts" (8. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2013). This study was approved by the UCSF institutional review board (study no. 11-07513). Performance metrics for the 8 data elements extracted by Versa GPT-4 compared with manual chart review are shown in Table 1. Overall accuracy was 0.934 (95% confidence interval, 0.928–0.939), with accuracies for specific data elements ranging from 0.886 for sum of tumor diameters to 0.989 for extrahepatic metastases.
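The looped extraction (56 loops of 20 records each, chosen to stay under the 8192-token limit) amounts to simple fixed-size batching. The authors' pipeline was written in R against the Versa API; the sketch below is a Python analogue of the batching step only, with illustrative record names, not the authors' code:

```python
from math import ceil

def make_batches(records, batch_size=20):
    """Split records into fixed-size batches so each API call stays
    well under the model's token limit."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

# 1101 reports at 20 records per loop -> 56 batches,
# matching the 56 loops described in the text.
reports = [f"report_{i}" for i in range(1101)]
batches = make_batches(reports)
assert len(batches) == ceil(1101 / 20)  # 56 batches; the last holds 1 record
```

Each batch would then be embedded into one copy of the final few-shot prompt, keeping per-call token usage roughly constant rather than letting it snowball across a single session.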
In general, accuracy was higher for classification tasks (maximum LI-RADS, macrovascular invasion, extrahepatic metastases, previous LRT, and study adequacy) than for comparison (maximum tumor diameter) or summation (number of tumors and sum of tumor diameters) tasks. Precision, recall, and F1 statistics for elements with multilevel classifications (maximum LI-RADS, number of tumors, maximum tumor diameter, and sum of tumor diameters) were calculated as weighted-average values; these values may have been biased because accurate predictions of the absence of imaging features (eg, Versa GPT-4 noting 0 tumors when there were no tumors by manual chart review) were included in the statistics. In general, Versa GPT-4 was highly specific, and overall accuracy was higher in normal reports, eg, 0.963 for those with no LR4 or LR5 findings.

Table 1. Performance Evaluation Statistics of Versa GPT-4 vs Manual Chart Review

| Data element | Accuracy | Binary event rate (a) | Precision | Recall/sensitivity | Specificity (b) | F1 |
|---|---|---|---|---|---|---|
| Maximum LI-RADS | 0.944 (0.928–0.955) | — | 0.946 (0.931–0.958) (c) | 0.944 (0.928–0.955) (c) | 0.948 (0.928–0.965) | 0.944 (0.930–0.956) (c) |
| Number of tumors | 0.922 (0.903–0.936) | — | 0.924 (0.906–0.938) (c) | 0.922 (0.905–0.937) (c) | 0.938 (0.917–0.956) | 0.922 (0.904–0.938) (c) |
| Maximum tumor diameter | 0.920 (0.902–0.944) | — | 0.934 (0.914–0.944) (c) | 0.920 (0.902–0.935) (c) | 0.924 (0.902–0.945) | 0.924 (0.907–0.937) (c) |
| Sum of tumor diameters | 0.886 (0.866–0.903) | — | 0.907 (0.881–0.918) (c) | 0.886 (0.866–0.904) (c) | 0.916 (0.894–0.938) | 0.892 (0.871–0.908) (c) |
| Macrovascular invasion (binary) | 0.936 (0.919–0.948) | 0.005 | 0.054 (0.013–0.109) | 0.800 (0.333–1.000) | 0.936 (0.921–0.950) | 0.101 (0.026–0.195) |
| Extrahepatic metastases (binary) | 0.989 (0.980–0.994) | 0.006 | 0.308 (0.071–0.600) | 0.571 (0.143–1.000) | 0.992 (0.986–0.996) | 0.400 (0.125–0.667) |
| Previous LRT (binary) | 0.910 (0.891–0.926) | 0.761 | 0.924 (0.907–0.942) | 0.961 (0.947–0.973) | 0.750 (0.695–0.802) | 0.942 (0.931–0.953) |
| Inadequate study (binary) | 0.965 (0.952–0.974) | 0.052 | 0.655 (0.528–0.780) | 0.667 (0.546–0.788) | 0.981 (0.973–0.989) | 0.661 (0.553–0.755) |
| Overall accuracy | 0.934 (0.928–0.939) | — | — | — | — | — |

NOTE. Values in parentheses are 95% confidence intervals.
(a) If applicable.
(b) Specificity is calculated as a binary statistic with the assumption that values of 0 represent negative cases.
(c) Weighted statistics are given because these were multilevel classifications.

This is one of the first studies to compare the performance of the general-purpose GPT-4 LLM, implemented in a PHI-compliant fashion, vs manual chart review for the extraction of clinical data. We showed high accuracy for simple extraction tasks; accuracy degraded with more complex use cases but remained reasonable (88.6%). For this specific demonstration use case of data extraction from HCC imaging reports, one immediate potential clinical application is automated determination of tumor burden for transplant eligibility.
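The weighted-average statistics reported for the multilevel elements are support-weighted means of per-class precision, recall, and F1. A minimal Python sketch on synthetic labels (not the study data) illustrates the computation and why correct predictions of absent findings raise the averages:

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall, and F1 over all classes in y_true."""
    support = Counter(y_true)
    n = len(y_true)
    w_prec = w_rec = w_f1 = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        pred_pos = sum(1 for p in y_pred if p == cls)
        prec = tp / pred_pos if pred_pos else 0.0
        rec = tp / count
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        w = count / n  # weight each class by its support
        w_prec += w * prec
        w_rec += w * rec
        w_f1 += w * f1
    return w_prec, w_rec, w_f1

# Synthetic example: class 0 plays the role of "no tumors."
prec, rec, f1 = weighted_metrics([0, 0, 1, 1, 2], [0, 1, 1, 1, 2])
assert abs(rec - 0.8) < 1e-9  # weighted recall equals overall accuracy (4/5)
```

Because weighted recall reduces algebraically to overall accuracy, records with no tumors that are correctly labeled 0 contribute to these averages, which is the bias the text notes.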
Potential applications resulting from the broader technology of LLM-enabled NLP via API integration include (1) text-based prediction modeling, (2) information retrieval from clinical records via a question-and-answer format, (3) augmented clinical decision support through active surveillance of clinical documentation, and (4) patient- and provider-facing chatbots. The future integration of LLMs for clinical uses will require the development of governance structures for the safe and ethical use of generative artificial intelligence tools. Moreover, further clarity is needed on liability and on whether these tools would be considered regulated medical devices (9. Benjamens S, et al. NPJ Digit Med 2020;3:118). This study has multiple limitations. First, we used manual chart review as the gold standard comparison, but human data extraction errors could have biased the results. Second, general concerns remain around accuracy and algorithmic bias when using the pretrained GPT-4 model for clinical tasks, despite prompt engineering. Finally, our institution is one of the few with PHI-compliant LLMs implemented at this time. This study could not, and should not, be conducted on publicly available consumer-facing versions of GPT-4 because of patient privacy and security concerns regarding LLM use (10. Wang C, et al. J Med Internet Res 2023;25:e48009). As such, the potential clinical applications that we envision are possible only with the availability of PHI-compliant LLMs. Despite these limitations, our study demonstrated 2 concepts: (1) the feasibility of using general-purpose LLMs for data extraction and (2) the use of an LLM deployed in an isolated protected environment that accommodates PHI (as opposed to ChatGPT) for clinical use cases. Jin Ge had full access to all the data in the study and takes responsibility for the data and the accuracy of the data analysis.
The authors thank the University of California, San Francisco (UCSF) Artificial Intelligence (AI) Tiger Team, Academic Research Services, Research Information Technology, and the Chancellor's Task Force for Generative AI for their software development, analytical, and technical support related to the use of the Versa application programming interface (API) gateway (the UCSF secure implementation of large language models and generative AI via the API gateway), Versa Chat (the chat user interface), and related data assets. A previous version of this manuscript was posted on the medRxiv preprint server as MEDRXIV/2023/294924 (https://doi.org/10.1101/2023.08.31.23294924).

Jin Ge, MD, MBA (Conceptualization: Lead; Formal analysis: Lead; Investigation: Lead; Methodology: Lead; Supervision: Equal; Writing – original draft: Lead). Michael Li, MD, MPH (Data curation: Equal; Formal analysis: Supporting; Validation: Equal; Writing – review & editing: Supporting). Molly B. Delk, MD (Data curation: Equal; Formal analysis: Supporting; Validation: Equal; Writing – review & editing: Supporting). Jennifer C. Lai, MD, MBA (Conceptualization: Supporting; Data curation: Supporting; Formal analysis: Supporting; Investigation: Supporting; Methodology: Supporting; Supervision: Equal; Writing – review & editing: Supporting).

We used the Versa API, the PHI- and intellectual property–compliant deployment of the Microsoft Azure OpenAI GPT-4 LLM at UCSF (e1. Azure OpenAI Service – Advanced Language Models, https://azure.microsoft.com/en-us/products/ai-services/openai-service-b, accessed August 25, 2023). In addition to the Versa API, UCSF has a chat interface deployment using Microsoft Azure OpenAI GPT-35-turbo called Versa Chat. To access Versa, we used RStudio, version 2023.06.0 Build 421 (Posit Software, PBC) (e2. RStudio Team. RStudio: integrated development for R. Boston, MA: RStudio Inc, 2015); R, version 4.3.1 "Beagle Scouts" (e3. R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2013); and the R packages devtools, version 2.4.5 (e4. Tools to Make Developing R Packages Easier, https://cran.r-project.org/web/packages/devtools/index.html, accessed September 17, 2023), and openapi, version 0.3.1 (e5. zhanghao-njmu. openapi. GitHub, https://github.com/zhanghao-njmu/openapi, accessed September 17, 2023). The Versa GPT-4 API, like other GPT-4 implementations, has a limit of 8192 tokens, the unit that OpenAI generative artificial intelligence models use to compute text length. Depending on the specific text tokenization strategy, 1 token approximates about 4 characters or 1 word (e6. Shibata Y, et al. Byte pair encoding: a text compression scheme that accelerates pattern matching. Fukuoka, Japan: Kyushu University, 1999). Token limits are imposed because of computational and memory constraints. Transformer models, such as OpenAI's GPT-3.5-Turbo and GPT-4, rely on attention, a mechanism that models connections between individual words in a text and thereby "understands" the context of word sequences. This mechanism requires the computation and storage of attention scores between each pair of tokens in an input. As the word (and token) sequences grow, the memory requirement and computational complexity of the text completion command increase steeply (quadratically in the sequence length) (e7. Vaswani A, et al. arXiv [Preprint, June 12, 2017], https://doi.org/10.48550/arXiv.1706.03762). Moreover, LLM sessions exhibit a phenomenon called snowballing, in which token use progressively increases with each set of inputs (a user's directions to the LLM) and outputs (LLM responses) in a session, because each new input prompt in a session contains information from all previous prompts and outputs to allow for in-context learning (or memory) (e8. Liu J, et al. arXiv [Preprint, January 17, 2021], https://doi.org/10.48550/arXiv.2101.06804). For example, an initial prompt may require 1000 tokens to execute (both input and output), but the 10th prompt in that session may require 3500 tokens because it carries information from the previous 9 inputs and outputs. The 8192-token limit imposed by GPT-4, therefore, encompasses all user inputs and LLM outputs for the chat session (e9. Azure OpenAI Service Models, https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models, accessed August 26, 2023). To construct the gold standard comparison, we manually reviewed 1101 longitudinal pre- and post-LRT computed tomography or magnetic resonance imaging abdominal imaging reports from 753 patients diagnosed with HCC enrolled in the Functional Assessment in Liver Transplantation (FrAILT) study at UCSF (e10. Lai JC, et al. Hepatology 2017;66:564-574). The cross-sectional imaging may have been obtained at any time during the patients' clinical course and, therefore, may or may not contain evidence of HCC, because diagnosis could have occurred after the date of imaging. We manually tagged the imaging reports for 8 distinct data elements: (1) maximum LI-RADS category for any lesions, (2) number of HCC lesions (defined as LR4, LR5, LR-viable, or LR-equivocal lesions), (3) diameter (in centimeters) of the largest HCC lesion, (4) sum of diameters (in centimeters) of all HCC lesions, (5) presence or absence of macrovascular invasion, (6) presence or absence of extrahepatic metastases, (7) presence or absence of previous LRT, and (8) whether the imaging was determined to be inadequate in the report. This manual tagging process took about 1 to 2 minutes per record, for an aggregate time of approximately 28 hours. All 1101 imaging reports were trimmed through simple string search mechanisms to include only the findings and impressions.
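The snowballing arithmetic described above can be made concrete with the ~4-characters-per-token heuristic. The sketch below is illustrative only; the real tokenizer uses byte pair encoding, so actual counts differ:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4-characters-per-token heuristic."""
    return max(1, len(text) // 4)

def session_tokens(turns, limit=8192):
    """Cumulative token totals in a chat session where every new prompt
    re-sends all prior inputs and outputs (snowballing)."""
    totals, running = [], 0
    for prompt, response in turns:
        running += estimate_tokens(prompt) + estimate_tokens(response)
        totals.append(running)
        if running > limit:
            break  # the call would fail: the limit covers the whole session
    return totals

# Ten turns of ~1000 tokens each: the running total crosses 8192
# by the 9th turn, so a session this long cannot complete.
turns = [("x" * 2000, "y" * 2000)] * 10
print(session_tokens(turns))
```

This is why the extraction loop restarted a fresh session for each batch of 20 records instead of feeding all 1101 reports through one continuous chat.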
Because of GPT-4's token limit as described, we iteratively developed a few-shot prompt with testing on the first 5 records by alphabetical order of coded identifiers. In few-shot prompting, a limited number of example inputs and outputs are provided to guide the LLM and reinforce in-context learning (e11. Wei J, et al. arXiv [Preprint, January 28, 2022], https://doi.org/10.48550/arXiv.2201.11903). This is opposed to zero-shot prompting, in which the LLM is asked to respond to inputs without being supplied any examples of how to carry out the desired task and, therefore, fulfills the prompt instructions with only its pretrained general-purpose knowledge (e12. Wang J, et al. arXiv [Preprint, April 28, 2023], https://doi.org/10.48550/arXiv.2304.14670; e13. Liu P, et al. arXiv [Preprint, July 28, 2021], https://doi.org/10.48550/arXiv.2107.13586; e14. Kojima T, et al. arXiv [Preprint, May 24, 2022], https://doi.org/10.48550/arXiv.2205.11916). Because snowballing of the data passed per prompt often led to execution failure from exceeding the token limit, we ran 56 loops of the final few-shot extraction prompt with 20 records per loop for data extraction. Execution of the loop extraction code in RStudio against the Versa GPT-4 API took approximately 2 hours. We evaluated the accuracy of Versa GPT-4 API data extractions vs manual chart review with each imaging report as a separate record. We calculated performance metrics (accuracy, precision, recall/sensitivity, specificity, and F1 score, the harmonic mean of precision and recall commonly used to evaluate classification in machine learning) for each of the 8 data elements (e15. Sokolova M, Lapalme G. Inf Process Manag 2009;45:427-437). For multilevel classifications (maximum LI-RADS score, number of HCC lesions, diameter of the largest lesion, and sum of tumor diameters), we calculated weighted-average precision, recall/sensitivity, and F1 score statistics. For binary classifications (macrovascular invasion, extrahepatic metastases, previous LRT, and study adequacy), we calculated the binary event rate and defined the presence of these features as a positive case for precision, recall, and F1 score statistics. We treated a manual chart review value of 0 as the only negative class and calculated binary specificities. We estimated 95% confidence intervals for performance metrics, whenever possible, through bootstrapping with 2000 iterations. Statistical analyses were also conducted in RStudio and R using the R packages boot, version 1.3-28.1 (e16. boot: Bootstrap Functions (Originally by Angelo Canty for S), https://cran.r-project.org/web/packages/boot/index.html, accessed January 6, 2023), and caret, version 6.0-94 (e17. Kuhn M. Classification and regression training [R package caret version 6.0-94], 2023, https://cran.r-project.org/web/packages/caret/index.html, accessed January 10, 2024). This study was approved by the UCSF institutional review board (study no. 11-07513).
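The bootstrapped confidence intervals (2000 iterations) follow the standard percentile method. The study used the R boot package; the following is a hypothetical Python equivalent for a single proportion, with illustrative agreement counts rather than the study data:

```python
import random

def bootstrap_ci(outcomes, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for a proportion (eg, a per-element accuracy).
    `outcomes` is a list of 0/1 agreement indicators, one per report."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement n_boot times; record each resample's proportion.
    stats = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Illustrative: 1101 reports with ~93% agreement between LLM and manual review.
agreement = [1] * 1024 + [0] * 77
lo, hi = bootstrap_ci(agreement)
assert lo <= 1024 / 1101 <= hi  # the interval brackets the point estimate
```

With a seeded generator the interval is reproducible; the R boot package's percentile intervals are computed the same way, so results should agree closely for the same resamples.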