Diagnostic Case-Control versus Diagnostic Cohort Studies for Clinical Validation of Artificial Intelligence Algorithm Performance
2018; Radiological Society of North America; Volume 290, Issue 1; Language: English
10.1148/radiol.2018182294
ISSN 1527-1315
Letters to the Editor

Seong Ho Park
Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, 88 Olympic-ro 43-gil, Songpa-gu, Seoul 05505, South Korea. E-mail: [email protected]
Published online: Dec 4 2018. https://doi.org/10.1148/radiol.2018182294

Editor:

I read with interest the article by Dr Nam and colleagues (1), which was recently published online in Radiology. I commend the authors on their fine study. Neglect of proper clinical validation of artificial intelligence (AI) algorithms intended for medical diagnosis is a concern (2), one aspect of which is the lack of robust external validation. The investigators therefore deserve particular credit for their efforts in performing external validation with data from four different institutions.

Nevertheless, as the Discussion acknowledges to a certain extent, this study provides insufficient evidence that its results generalize to real-world practice. One important related issue not explicitly addressed in the article is the difference between a diagnostic case-control study and a diagnostic cohort study (3,4). The external validation part of this study adopted a diagnostic case-control design; that is, the investigators collected disease-positive (ie, case) and disease-negative (ie, control) subjects.
In contrast, a diagnostic cohort study first defines the clinical setting and the patient population to which the test will be applied in real-world practice (eg, asymptomatic adults aged X–X years with an X pack-year smoking history), and all patients then undergo the diagnostic procedure. The case-control design is prone to spectrum bias, which can inflate estimates of diagnostic performance (3,5). For example, compared with case-control subjects, a real-world cohort may include more patients with disease-simulating conditions, comorbidities that pose diagnostic difficulties, and findings for which a forced binary distinction is inappropriate; such attributes are collectively referred to as the "spectrum." This difference in spectrum can substantially affect diagnostic performance.

The case-control design can also create an unnatural disease prevalence. A difference in prevalence may not directly affect diagnostic performance, but it would require a change in the threshold used to convert the mathematical output of an AI algorithm into disease-positive or disease-negative decisions, creating uncertainty about the algorithm's real-world performance (5). Consequently, diagnostic case-control studies provide weak evidence of diagnostic efficacy (4).

This study would make an excellent methodologic example of early-stage external validation of a diagnostic AI algorithm. Studies that further validate performance in real-world practice with a diagnostic cohort design, and even randomized studies comparing use of the AI tool with conventional care in terms of ultimate patient outcome, should follow.

Disclosures of Conflicts of Interest: disclosed no relevant relationships.

References
1. Nam JG, Park S, Hwang EJ, et al. Development and validation of deep learning–based automatic detection algorithm for malignant pulmonary nodules on chest radiographs. Radiology. https://doi.org/10.1148/radiol.2018180237. Published online September 25, 2018. Accessed November 14, 2018.
2. AI diagnostics need attention. Nature 2018;555(7696):285.
3. Rutjes AW, Reitsma JB, Vandenbroucke JP, Glas AS, Bossuyt PM. Case-control and two-gate designs in diagnostic accuracy studies. Clin Chem 2005;51(8):1335–1341.
4. Pepe MS. Study design and hypothesis testing. In: Pepe MS, ed. The statistical evaluation of medical tests for classification and prediction. Oxford, England: Oxford University Press, 2003; 214–217.
5. Park SH, Han K. Methodologic guide for evaluating clinical performance and effect of artificial intelligence technology for medical diagnosis and prediction. Radiology 2018;286(3):800–809.
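The letter's point about prevalence can be made concrete with a small back-of-the-envelope calculation (a hypothetical sketch illustrating the argument, not part of the published correspondence; the 90%/90% performance figures are assumed, not taken from the study). At fixed sensitivity and specificity, the positive predictive value collapses as prevalence falls, which is why an operating threshold tuned on an artificially balanced case-control sample may not transfer to a low-prevalence screening cohort:

```python
def ppv(sens, spec, prev):
    """Positive predictive value via Bayes' rule.

    sens: sensitivity (true-positive rate)
    spec: specificity (true-negative rate)
    prev: disease prevalence in the target population
    """
    true_pos = sens * prev
    false_pos = (1.0 - spec) * (1.0 - prev)
    return true_pos / (true_pos + false_pos)

# Hypothetical algorithm with 90% sensitivity and 90% specificity.
sens, spec = 0.90, 0.90

# Balanced case-control sample (~50% disease prevalence):
print(f"PPV at 50% prevalence: {ppv(sens, spec, 0.50):.2f}")  # 0.90
# Screening cohort with 1% prevalence:
print(f"PPV at 1% prevalence:  {ppv(sens, spec, 0.01):.2f}")  # 0.08
```

The same test thus yields a 90% positive predictive value in the balanced sample but roughly 8% in the screening setting, illustrating why the decision threshold would need to be re-derived for real-world use.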
Response

Ju Gang Nam, Chang Min Park
Department of Radiology and Institute of Radiation Medicine, Seoul National University and College of Medicine, 101 Daehak-ro, Jongno-gu, Seoul 03080, Republic of Korea. E-mail: [email protected]

First, we truly appreciate Dr Park's interest in and concern regarding our work (1) and regarding how best to validate the results of AI algorithms before their application in real-world clinical practice. We wholeheartedly agree with Dr Park on most of the points he made. In particular, it is true that case-control studies alone may not be sufficient to validate AI algorithms before their application in real clinical practice; we have therefore already begun several studies to further evaluate the performance of our deep learning–based algorithm (2).

However, we selected a case-control study design as the initial validation method for several specific reasons. In case-control studies, as Dr Park mentioned, we can achieve clear-cut reference standards by taking cases from distinct ends of the spectrum.
This makes it convenient to obtain statistical estimates of diagnostic performance, such as sensitivity, specificity, diagnostic accuracy, and area under the receiver operating characteristic curve, and it also makes comparisons between two modalities clearer. It is for this reason that we adopted the case-control study design for our external validation (3).

A cohort study design, on the other hand, has its own inevitable drawbacks related to class imbalance: an algorithm that outputs only negative results may exhibit high diagnostic performance in a cohort with sparse positive cases. Varying disease prevalence across regions or clinical settings may also lead to unstable or inconsistent algorithm performance, and establishing a reference standard may be unclear or sometimes practically impossible with a cohort design. Moreover, an external validation database completely separate from the training data set seemed quite harsh for an initial test of the algorithm. For these reasons, the case-control design has typically been adopted as the initial validation method for deep learning–based algorithms (4,5). In this context, we also believe that a case-control study may be more appropriate as the initial external validation method for our deep learning algorithm.

Nevertheless, further work is surely warranted for robust validation. Before submitting our article, we had therefore already designed a series of validation studies: a case-control study, followed by a cohort study, a clinical outcome study, and, finally, a cost-effectiveness study. The current study is the initial step of this validation process. We again thank Dr Park for his interest, and we hope to receive his attention and valuable critique of our subsequent studies as well.

Disclosures of Conflicts of Interest: J.G.N. Activities related to the present article: institution received a grant.
Activities not related to the present article: disclosed no relevant relationships. Other relationships: disclosed no relevant relationships. C.M.P. Activities related to the present article: received a grant from Seoul National University Hospital, Lunit, and the Seoul Metropolitan Government. Activities not related to the present article: member of the advisory board for, and grants from, GE Healthcare. Other relationships: disclosed no relevant relationships.

References
1. Nam JG, Park S, Hwang EJ, et al. Development and validation of deep learning–based automatic detection algorithm for malignant pulmonary nodules on chest radiographs. Radiology. https://doi.org/10.1148/radiol.2018180237. Published online September 25, 2018.
2. AI diagnostics need attention. Nature 2018;555(7696):285.
3. Rutjes AW, Reitsma JB, Vandenbroucke JP, Glas AS, Bossuyt PM. Case-control and two-gate designs in diagnostic accuracy studies. Clin Chem 2005;51(8):1335–1341.
4. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 2016;316(22):2402–2410.
5. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 2017;542(7639):115–118.

Published online: Dec 4 2018. Published in print: Jan 2019.