Article Open access Peer reviewed

Using fine-tuned large language models to parse clinical notes in musculoskeletal pain disorders

2023; Elsevier BV; Volume: 5; Issue: 12; Language: English

10.1016/s2589-7500(23)00202-9

ISSN

2589-7500

Authors

Akhil Vaid, Isotta Landi, Girish N. Nadkarni, Ismail Nabeel

Topic(s)

Text Readability and Simplification

Abstract

Musculoskeletal disorders such as lower back, knee, and shoulder pain create a substantial health burden in developed countries, affecting function, mobility, and quality of life [1]. These conditions are often multifactorial and require clinicians to thoroughly assess their cause before deciding on appropriate investigative and treatment approaches. However, the electronic health-care record does not explicitly capture pain characteristics within ICD-10 billing codes, making time-consuming and error-prone retrospective chart review necessary. The scarcity of reliable tools to automatically parse unstructured clinical notes for pain discriminants makes point-of-care interventions and quality improvement challenging. These challenges call for a method to extract meaningful data from unstructured text, which can be achieved through natural language processing. However, statistical or rule-based approaches [2] are limited by differences in vocabulary, syntax, and the negation of statements (eg, "the patient does not have…"). As such, these methods might not generalise to another disease process, institution, or style of note taking.

In contrast, deep learning uses neural networks to identify patterns in heterogeneous data without explicit instructions on which patterns to consider. Large language models (LLMs) are neural networks that take their name from the many billions of tunable parameters making up their structure. Such models are typically trained in an unsupervised manner on trillions of words of unstructured text so that they learn relationships between words. An example is the generative pre-trained transformer (GPT) [3] class of models, which work by predicting the next most likely word given the available context. Such models have been used for question answering, contextualisation, and as conversational chatbots in the form of ChatGPT [4]. GPT-based chatbots have attracted interest within the health-care setting [5, 6] given their potential to assist clinicians with a number of their responsibilities. However, such models carry important caveats. Generative models can hallucinate, that is, make up facts that have no basis in reality or are otherwise nonsensical, which could cause harm if clinicians rely on their recommendations verbatim. This is especially true of general-purpose models that might not have the specialised knowledge or nuance required to operate within clinical settings.
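To make the brittleness of rule-based matching concrete, the following minimal Python sketch (illustrative only, not taken from the study) shows how a keyword rule without negation handling flags a note that explicitly denies pain:

```python
notes = [
    "Patient reports chronic lower back pain for the past 6 months.",
    "The patient does not have knee pain today.",
]

def keyword_flag(note: str, term: str) -> bool:
    # A rule-based matcher with no negation handling.
    return term in note.lower()

for note in notes:
    print(keyword_flag(note, "pain"), "->", note)
# Both notes are flagged True, even though the second explicitly denies pain,
# which is one reason such rules transfer poorly across note styles.
```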
Furthermore, larger generative models require substantial financial investments in hardware and training data, making them inaccessible for health-care institutions in resource-limited settings. Additionally, sending protected health information to technology companies might not be an acceptable solution given patient privacy concerns. In this work, we develop locally running, privacy-preserving LLMs capable of following plain-language instructions to extract characteristics of musculoskeletal pain (such as location and acuity) from a heterogeneous collection of unstructured clinical notes.

We gathered 26 551 patient notes between Nov 16, 2016, and May 30, 2019, from five Mount Sinai facilities (Mount Sinai Hospital, Mount Sinai Morningside, Mount Sinai Beth Israel, Mount Sinai Queens, and Mount Sinai Brooklyn), using a simple text match for the word "pain" for initial selection. Expert clinical personnel, including a nurse practitioner and an internal medicine resident, manually labelled 1714 notes from 1155 patients for pain location and acuity. Of these, 1675 notes mentioned both pain location and acuity, 21 mentioned only location, and 18 had no mention of pain. These labels were then individually reviewed and assessed by the senior author of the study (IN) to ensure accuracy. 19% of the final dataset came from the primary care setting, 51% from internal medicine, and 30% from orthopaedics (appendix p 10), with a median note length of 1005 tokens (IQR 1574·75; appendix pp 1, 10). A token is the smallest unit of text that the model reads, generates, or manipulates, often corresponding to a word, subword, or character.

Labels were created for the location (ie, shoulder, lower back, knee, or other pain) and the acuity (ie, acute, chronic, and acute-on-chronic [an existing diagnosis of chronic pain with either an exacerbation or an additional injury leading to pain]) of pain. Manually created labels were converted into short sentences representative of the location and acuity of pain (eg, chronic lower back pain). For notes that did not contain references to either the pain location or acuity, the created sentence contained a corresponding unknown. All derived sentences were paired with the original note text, as well as a sample instruction to "describe the pain based on this note" (appendix pp 2–3, 11–12). These data were used to fine-tune a publicly available foundation language model named LLaMA-7B [7]. We also trained another LLaMA-7B using our note dataset combined with the publicly available Alpaca dataset [8], which contains general instructions and expected responses unrelated to health care. This approach is known as instruction fine-tuning [9] and forms the basis of models such as InstructGPT and ChatGPT (appendix pp 2–3, 11–12). The performance of the derived models was compared against established baseline architectures, namely the open-source BERT and Longformer models.
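As a concrete illustration of how such instruction-response pairs can be assembled, the snippet below builds one training record with an Alpaca-style prompt template; the template wording, field names, and helper function are assumptions for illustration rather than the authors' exact preprocessing code.

```python
# Illustrative sketch of pairing a note with an instruction and a target
# response for instruction fine-tuning (Alpaca-style template assumed).
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{note}\n\n"
    "### Response:\n"
)

def build_example(note_text: str, location: str, acuity: str) -> dict:
    # Labels become a short target sentence, eg "chronic lower back pain";
    # a missing location or acuity would be rendered as "unknown".
    target = f"{acuity} {location} pain"
    return {
        "prompt": PROMPT_TEMPLATE.format(
            instruction="Describe the pain based on this note.",
            note=note_text,
        ),
        "response": target,
    }

example = build_example(
    "Pt reports 2 years of low back pain, worse over the past week ...",
    location="lower back",
    acuity="chronic",
)
print(example["response"])  # chronic lower back pain
```

During instruction fine-tuning, the model is trained to continue each prompt with its paired target response.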
We used group shuffle splitting to partition the data into 75% training, 5% validation, and 20% testing groups. Each note within the testing group was reformulated into a prompt containing an instruction, and the text generated by the model was parsed to quantify the model's ability to capture the location and acuity of the pain. Models were trained on the Minerva high-performance computing cluster at Mount Sinai on a node containing four A100 GPUs, each with 80 GB of memory.

The LLaMA-7B model trained on patient notes alone achieved classification accuracies of 0·89 (95% CI 0·88–0·90) for shoulder pain, 0·94 (0·93–0·94) for lower back pain, 0·90 (0·89–0·91) for knee pain, and 0·98 (0·97–0·99) for other pain locations. The LLaMA-7B model trained with the extended Alpaca dataset showed slightly better accuracy for shoulder pain, at 0·93 (0·92–0·93), but performed similarly or slightly worse in the other categories. Both LLaMA-7B models outperformed the baseline models in terms of sensitivity except for knee pain, for which the Longformer achieved a sensitivity of 0·94 (0·93–0·95) (table; appendix pp 4–6, 13–14). Additional performance metrics can be found in the appendix.

Table: Model performance

                        LLaMA-7B             LLaMA-7B + Alpaca    BERT                 Longformer
Shoulder pain
  Accuracy              0·89 (0·88–0·90)     0·93 (0·92–0·93)     0·93 (0·92–0·93)     0·92 (0·91–0·93)
  Sensitivity           0·93 (0·92–0·95)     0·92 (0·90–0·93)     0·87 (0·85–0·89)     0·88 (0·86–0·89)
  Specificity           0·87 (0·86–0·88)     0·93 (0·92–0·94)     0·96 (0·95–0·97)     0·94 (0·93–0·95)
Lower back pain
  Accuracy              0·94 (0·93–0·94)     0·94 (0·93–0·94)     0·94 (0·93–0·94)     0·91 (0·90–0·92)
  Sensitivity           0·86 (0·84–0·89)     0·91 (0·89–0·93)     0·83 (0·80–0·85)     0·88 (0·86–0·90)
  Specificity           0·95 (0·94–0·96)     0·94 (0·93–0·95)     0·97 (0·96–0·97)     0·92 (0·91–0·93)
Knee pain
  Accuracy              0·90 (0·89–0·91)     0·90 (0·89–0·91)     0·90 (0·89–0·91)     0·87 (0·86–0·89)
  Sensitivity           0·89 (0·88–0·91)     0·93 (0·92–0·94)     0·92 (0·91–0·94)     0·94 (0·93–0·95)
  Specificity           0·90 (0·89–0·92)     0·88 (0·87–0·89)     0·89 (0·88–0·90)     0·84 (0·82–0·85)
Pain in other location
  Accuracy              0·98 (0·97–0·99)     0·93 (0·92–0·94)     0·93 (0·92–0·94)     0·93 (0·92–0·94)
  Sensitivity           0·74 (0·68–0·81)     0·63 (0·57–0·69)     0·26 (0·19–0·32)     0·00 (0·00–0·00)
  Specificity           0·99 (0·99–1·00)     0·99 (0·98–0·99)     0·97 (0·96–0·97)     0·98 (0·98–0·99)
Acute pain
  Accuracy              0·83 (0·82–0·85)     0·80 (0·78–0·81)     0·73 (0·71–0·75)     0·75 (0·74–0·77)
  Sensitivity           0·57 (0·53–0·60)     0·53 (0·49–0·56)     0·63 (0·60–0·66)     0·27 (0·23–0·29)
  Specificity           0·93 (0·92–0·94)     0·89 (0·88–0·91)     0·77 (0·75–0·78)     0·93 (0·92–0·94)
Chronic pain
  Accuracy              0·83 (0·82–0·85)     0·81 (0·79–0·82)     0·77 (0·76–0·79)     0·80 (0·79–0·82)
  Sensitivity           0·70 (0·67–0·73)     0·61 (0·59–0·64)     0·68 (0·65–0·71)     0·67 (0·64–0·69)
  Specificity           0·91 (0·90–0·92)     0·91 (0·90–0·93)     0·83 (0·82–0·85)     0·89 (0·88–0·90)
Acute-on-chronic pain
  Accuracy              0·82 (0·80–0·83)     0·79 (0·77–0·80)     0·76 (0·74–0·78)     0·67 (0·66–0·69)
  Sensitivity           0·92 (0·90–0·94)     0·90 (0·89–0·92)     0·59 (0·56–0·61)     0·82 (0·80–0·84)
  Specificity           0·76 (0·75–0·78)     0·72 (0·70–0·74)     0·86 (0·84–0·87)     0·59 (0·57–0·61)

All values are point estimates with 95% CIs. Recall and sensitivity are the same metric and might therefore be used interchangeably. CI=confidence interval.

Our models also categorised pain acuity as acute, chronic, or acute-on-chronic. The LLaMA-7B model trained on patient notes alone achieved classification accuracies of 0·83 (0·82–0·85) for acute pain, 0·83 (0·82–0·85) for chronic pain, and 0·82 (0·80–0·83) for acute-on-chronic pain.
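Computing these metrics requires mapping the model's free-text response back onto categorical labels. The study states only that generated text was parsed; the snippet below is one assumed, minimal way to do this with keyword rules, not the authors' implementation.

```python
# Illustrative parser mapping a generated response such as
# "chronic lower back pain" onto location and acuity labels.
# Keyword rules and the "other"/"unknown" defaults are assumptions.
LOCATIONS = ["shoulder", "lower back", "knee"]
ACUITIES = ["acute on chronic", "acute-on-chronic", "chronic", "acute"]

def parse_response(text: str) -> tuple[str, str]:
    text = text.lower()
    location = next((loc for loc in LOCATIONS if loc in text), "other")
    acuity = next((ac for ac in ACUITIES if ac in text), "unknown")
    # Normalise the two spellings of acute-on-chronic.
    if acuity == "acute on chronic":
        acuity = "acute-on-chronic"
    return location, acuity

print(parse_response("chronic lower back pain"))     # ('lower back', 'chronic')
print(parse_response("acute-on-chronic knee pain"))  # ('knee', 'acute-on-chronic')
```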
The LLaMA-7B model trained with the extended Alpaca dataset showed generally lower performance, with classification accuracies of 0·80 (0·78–0·81) for acute pain, 0·81 (0·79–0·82) for chronic pain, and 0·79 (0·77–0·80) for acute-on-chronic pain. LLaMA-7B outperformed the baseline models across all metrics, except for the BERT model, which achieved a sensitivity of 0·63 (0·60–0·66) for acute pain (table; appendix pp 7–9, 13–14). Therefore, for parsing clinical notes, both LLaMA-7B models outperformed the baseline models in all but a few cases. The lower performance at acuity detection is possibly because none of the models were given an exact temporal delineation between acute and chronic pain, and because acute and acute-on-chronic notes share much of their vocabulary.

However, the fine-tuned LLaMA models did not perform as expected on notes without pain. We observed instances of hallucination, in which the models inferred or made assumptions about pain acuity or location where none was indicated. In our testing dataset, 21 notes did not mention pain acuity while mentioning location, and an additional 18 notes made no mention of pain at all. The model fine-tuned on notes alone correctly classified only two of these 39 instances as having unknown pain acuity, while the model fine-tuned on the combined notes and Alpaca dataset incorrectly assumed an acuity in all cases where pain was not mentioned. Model performance for pain localisation in the 21 notes with only a mention of pain location was much better: the combination dataset model achieved an accuracy of 0·81 for back pain, 0·76 for shoulder pain, 0·62 for knee pain, and 0·48 for other pain, while the model fine-tuned on notes alone achieved similar metrics except for an accuracy of 0·71 for knee pain.

A definitive benefit of LLMs is a reduction in complexity. The baseline models had to be trained separately for each outcome, whereas a single LLM could attend to both outcomes simultaneously. As such, LLMs are task agnostic while the baselines are task specific. If future requirements entail the extraction of additional information from a clinical note (such as patient disposition), baseline models would require that the outcome first be made categorical and new models trained accordingly. In contrast, an LLM can simply be prompted differently to extract this information from the note and return it as text. This ability also extends to the explainability of predictions: instead of generating complex saliency maps, the model can be prompted to output the words most responsible for its prediction. Owing to the immense financial and technical investment associated with training a foundation language model from scratch, it is far more cost-effective to fine-tune existing models on data relevant to downstream tasks. Such fine-tuning also serves to supplement the information contained within the foundation model, which might not extend to a particular task or specialty depending on its pre-training.
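Because a single fine-tuned model can be redirected by its prompt alone, as described above, switching from extraction to explanation requires only a different instruction. The sketch below is a minimal, assumed inference loop using the Hugging Face transformers API; the model directory, prompt layout, and instruction wording are illustrative and not taken from the study's code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "path/to/fine-tuned-llama-7b"  # placeholder, not a real path

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)

def ask(note: str, instruction: str) -> str:
    # Reuse the same instruction/input/response layout assumed for fine-tuning.
    prompt = (f"### Instruction:\n{instruction}\n\n"
              f"### Input:\n{note}\n\n### Response:\n")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    # Decode only the newly generated tokens.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# One model, different tasks, selected purely by the prompt:
# ask(note, "Describe the pain based on this note.")
# ask(note, "List the words in the note that support this description.")
```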
The LLaMA class of models therefore forms an excellent platform owing to its open-source nature, and our work serves as an analysis of the capabilities of a fine-tuned 7-billion-parameter model when used in a clinical setting. We recognise the value of exploring the versatility of our methods across different datasets, foundation models, and model sizes, and consider this an avenue for future research.

Our work must be considered alongside its limitations. The resource intensity of working with LLMs directly affects the size of the foundation model used for fine-tuning, as well as the length of context that can be delivered to the model through the prompt. As such, our models required a four-GPU configuration (each GPU with 80 GB of video memory) to fine-tune; however, inference was possible using a single GPU. Available techniques now allow such models to be scaled down to the point where they can be run on consumer hardware at the cost of some accuracy [10] (an illustrative sketch of this technique appears after the data and code availability statements below). Both LLMs were prone to hallucinating the acuity and severity of pain in notes without mention of pain. Such hallucination is a known issue and could potentially be mitigated through the provision of additional context or more specific directions.

In conclusion, our findings indicate that pre-trained LLMs can serve as a robust foundation for creating fine-tuned models capable of effectively parsing unstructured clinical notes in a directed manner. Such models could be deployed as specialised conversational agents or chatbots (appendix pp 2–3), helping clinicians swiftly access pertinent patient history, maintain data privacy, and potentially streamline clinical workflows. Further, this work opens up avenues for future research, including considerations of how fine-tuned LLMs could influence care in clinical settings that call for quick decision making, such as emergency departments; the legal and ethical implications of using LLMs to guide patient care; and, finally, the barriers of cost, computation, and overall performance that must be overcome before such models can be fully and widely used.

Declaration of interests: GN reports research funding from the NIH and Renalytix; royalties from Renalytix; consultancy agreements with AstraZeneca, BioVie, GLG Consulting, Pensieve Health, Reata, Renalytix AI, Siemens, GSK Pharma, and Variant Bio; honoraria for lectures from GSK Pharma; serves in an advisory role for CRIC, Renalytix, Pensieve Health, and Neurona Health; and owns equity and stock options in Pensieve Health (as a cofounder), Renalytix, and Verici. IN is part of the Board of Directors at the American College of Occupational and Environmental Medicine in a voluntary capacity. All other authors declare no competing interests.

Contributors: AV conceptualised the study, developed the methodology, and performed the formal analysis and visualisation. IN collected and curated the data. AV, IL, and IN directly accessed and verified the underlying data in the study. AV wrote the original draft of the paper. GN and IN supervised the study. All authors had full access to all the data in the study, provided critical feedback, approved the final draft, and had final responsibility for the decision to submit for publication.

This study was approved by the institutional review board at the Icahn School of Medicine at Mount Sinai.
STUDY-19-00607: Utilizing Natural Language Processing (NLP) and Machine Learning (ML) to Predict Acuity and Return to Work Timeline for Patients with Lower Back, Knee, and Shoulder Pain. The study was exempt from the requirement for individual patient consent owing to the use of retrospective patient data.

This work is supported by grant UL1TR001433 from the US National Center for Advancing Translational Sciences, National Institutes of Health, and grant T42OH008422 from the Pilot Projects Research Training Program of the New York and New Jersey Education and Research Center, National Institute for Occupational Safety and Health. The funders of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the work.

The study uses identified patient notes for training and testing models, which cannot be released given concerns of patient privacy. The foundation LLaMA language models are publicly available for non-commercial use directly from Meta AI research. The instructional Alpaca dataset is also publicly available with a non-restrictive license. Code for this work is available at https://github.com/akhilvaid/MusculoskeletalPainLLaMA under a GPLv3 license.

Supplementary appendix available online (PDF, 2·36 MB).
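The scaling-down technique cited in the limitations above [10] is low-rank adaptation (LoRA). The sketch below is a hedged illustration of attaching LoRA adapters with the Hugging Face peft library; the model path, target modules, and hyperparameters are assumptions for illustration and do not reproduce the study's configuration (the authors' actual code is in the repository linked above).

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")  # placeholder path

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (illustrative choice)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```

Because only the adapter matrices receive gradients, fine-tuning memory drops substantially compared with updating all 7 billion parameters, which is what makes smaller hardware configurations feasible.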

References

1 Bhattacharya A. Costs of occupational musculoskeletal disorders (MSDs) in the United States. Int J Ind Ergon 2014; 44: 448–454.
2 Miotto R, Percha BL, Glicksberg BS, et al. Identifying acute low back pain episodes in primary care practice from clinical notes: observational study. JMIR Med Inform 2020; 8: e16878.
3 Zhang M, Li J. A commentary of GPT-3 in MIT Technology Review 2021. Fundamental Research 2021; 1: 831–833.
4 Leiter C, Zhang R, Chen Y, et al. ChatGPT: a meta-analysis after 2.5 months. arXiv 2023; published online Feb 20 (preprint). https://doi.org/10.48550/arXiv.2302.13795
5 Nature Medicine. Will ChatGPT transform healthcare? Nat Med 2023; 29: 505–506.
6 The Lancet Digital Health. ChatGPT: friend or foe? Lancet Digit Health 2023; 5: e102.
7 Touvron H, Lavril T, Izacard G, et al. LLaMA: open and efficient foundation language models. arXiv 2023; published online Feb 27 (preprint). https://doi.org/10.48550/arXiv.2302.13971
8 Taori R, Gulrajani I, Zhang T, et al. Stanford Alpaca: an instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca (May 30, 2023; accessed March 20, 2023).
9 Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst 2022; 35: 27730–27744.
10 Hu EJ, Shen Y, Wallis P, et al. LoRA: low-rank adaptation of large language models. arXiv 2021; published online Oct 16 (preprint). https://doi.org/10.48550/arXiv.2106.09685