All Publications


  • Regional mapping of natural gas compressor stations in the United States and Canada using deep learning on satellite imagery. Journal of environmental management Liu, B., Irvin, J., Omara, M., Wang, C., Kornberg, G., Sheng, H., Gautam, R., Ng, A. Y., Jackson, R. B. 2025; 393: 126728

    Abstract

    A comprehensive, open-access database of oil and gas infrastructure locations is necessary for accurately attributing emissions from satellites and managing pollution impacts on surrounding communities. However, open-access datasets are limited for many infrastructure types, including natural gas compressor stations, which account for approximately one-third of U.S. oil and gas sector methane emissions and are associated with harmful pollution. Here, we developed the first automated deep learning approach for detecting natural gas compressor stations in satellite imagery. We experimented with various neural network architectures trained on different image resolutions and footprints, and found that the best model achieved a precision of 0.81 at 0.95 recall. Incorporating whether a proposed facility is close to an oil and gas pipeline further improved model precision by 0.02. Deploying the best model to identify facilities across a critical 200,000 km2 oil and gas-producing region capturing the Marcellus Shale, we detected 1103 compressor stations that were not previously reported in a large bottom-up oil and gas infrastructure database. Incorporating these new locations revealed that population exposure to potential emitted pollutants may be underestimated by as much as 74 % when relying exclusively on reported data. Our work highlights the utility of machine learning to enhance infrastructure mapping for environmental management and pollution assessment.
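The operating point quoted above (precision 0.81 at 0.95 recall) comes from sweeping a detection threshold. A minimal sketch of that computation, using made-up scores and labels rather than the paper's model outputs:

```python
# Sketch: pick the score threshold at which recall first reaches a target,
# then report the precision at that operating point. The candidate scores and
# labels below are invented for illustration only.

def precision_at_recall(scores, labels, min_recall=0.95):
    """Lower the threshold until recall >= min_recall; return (precision, recall, threshold)."""
    # Rank candidate detections by score, highest first.
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    for score, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        recall = tp / total_pos
        if recall >= min_recall:
            return tp / (tp + fp), recall, score
    return tp / (tp + fp), tp / total_pos, ranked[-1][0]

if __name__ == "__main__":
    # 10 toy candidate facilities: model score, true compressor station or not.
    scores = [0.99, 0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
    labels = [1,    1,    1,    0,    1,    1,    0,    0,    0,    0]
    precision, recall, threshold = precision_at_recall(scores, labels)
    print(precision, recall, threshold)
```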

    View details for DOI 10.1016/j.jenvman.2025.126728

    View details for PubMedID 40884956

  • Deep learning for detecting and characterizing oil and gas well pads in satellite imagery. Nature communications Ramachandran, N., Irvin, J., Omara, M., Gautam, R., Meisenhelder, K., Rostami, E., Sheng, H., Ng, A. Y., Jackson, R. B. 2024; 15 (1): 7036

    Abstract

    Methane emissions from the oil and gas sector are a large contributor to climate change. Robust emission quantification and source attribution are needed for mitigating methane emissions, requiring a transparent, comprehensive, and accurate geospatial database of oil and gas infrastructure. Realizing such a database is hindered by data gaps nationally and globally. To fill these gaps, we present a deep learning approach on freely available, high-resolution satellite imagery for automatically mapping well pads and storage tanks. We validate the results in the Permian and Denver-Julesburg basins, two high-producing basins in the United States. Our approach achieves high performance on expert-curated datasets of well pads (Precision = 0.955, Recall = 0.904) and storage tanks (Precision = 0.962, Recall = 0.968). When deployed across the entire basins, the approach captures a majority of well pads in existing datasets (79.5%) and detects a substantial number (>70,000) of well pads not present in those datasets. Furthermore, we detect storage tanks (>169,000) on well pads, which were not mapped in existing datasets. We identify remaining challenges with the approach, which, when solved, should enable a globally scalable and public framework for mapping well pads, storage tanks, and other oil and gas infrastructure.

    View details for DOI 10.1038/s41467-024-50334-9

    View details for PubMedID 39147770

    View details for PubMedCentralID PMC11327246

  • Automatic deforestation driver attribution using deep learning on satellite imagery GLOBAL ENVIRONMENTAL CHANGE-HUMAN AND POLICY DIMENSIONS Ramachandran, N., Irvin, J., Sheng, H., Johnson-Yu, S., Story, K., Rustowicz, R., Ng, A. Y., Austin, K. 2024; 86
  • VetLLM: Large Language Model for Predicting Diagnosis from Veterinary Notes. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Jiang, Y., Irvin, J. A., Ng, A. Y., Zou, J. 2024; 29: 120-133

    Abstract

    Lack of diagnosis coding is a barrier to leveraging veterinary notes for medical and public health research. Previous work has been limited to developing specialized rule-based or customized supervised learning models to predict diagnosis coding, an approach that is tedious and not easily transferable. In this work, we show that open-source large language models (LLMs) pretrained on general corpora can achieve reasonable performance in a zero-shot setting. Alpaca-7B can achieve a zero-shot F1 of 0.538 on CSU test data and 0.389 on PP test data, two standard benchmarks for coding from veterinary notes. Furthermore, with appropriate fine-tuning, the performance of LLMs can be substantially boosted, exceeding that of strong state-of-the-art supervised models. VetLLM, which is fine-tuned on Alpaca-7B using just 5000 veterinary notes, can achieve an F1 of 0.747 on CSU test data and 0.637 on PP test data. Notably, our fine-tuning is data-efficient: using 200 notes can outperform supervised models trained with more than 100,000 notes. The findings demonstrate the great potential of leveraging LLMs for language processing tasks in medicine, and we advocate this new paradigm for processing clinical text.
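The F1 scores above are typically micro-averaged over the per-note sets of predicted versus gold diagnosis codes. A small sketch of that metric, with invented placeholder codes rather than CSU or PP data:

```python
# Sketch: micro-averaged F1 between predicted and gold diagnosis-code sets.
# The example codes below are fabricated for illustration.

def micro_f1(gold_sets, pred_sets):
    """Micro F1: pool true/false positives and false negatives across notes."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        tp += len(gold & pred)   # codes predicted and correct
        fp += len(pred - gold)   # codes predicted but wrong
        fn += len(gold - pred)   # codes missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    gold = [{"otitis", "dermatitis"}, {"fracture"}]
    pred = [{"otitis"}, {"fracture", "neoplasia"}]
    print(micro_f1(gold, pred))
```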

    View details for PubMedID 38160274

  • A System for Automated Vehicle Damage Localization and Severity Estimation Using Deep Learning IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS Ma, Y., Ghanbari, H., Huang, T., Irvin, J., Brady, O., Zalouk, S., Sheng, H., Ng, A., Rajagopal, R., Narsude, M. 2023
  • Probabilistic Prediction of Laboratory Test Information Yield. AMIA ... Annual Symposium proceedings. AMIA Symposium Jiang, Y., Lee, A. H., Ni, X., Corbin, C. K., Irvin, J. A., Ng, A. Y., Chen, J. H. 2023; 2023: 1007-1016

    Abstract

    Low-yield repetitive laboratory diagnostics burden patients and inflate cost of care. In this study, we assess whether stability in repeated laboratory diagnostic measurements is predictable with uncertainty estimates using electronic health record data available before the diagnostic is ordered. We use probabilistic regression to predict a distribution of plausible values, allowing use-time customization for various definitions of "stability" given dynamic ranges and clinical scenarios. After converting distributions into "stability" scores, the models achieve a sensitivity of 29% for white blood cells, 60% for hemoglobin, 100% for platelets, 54% for potassium, 99% for albumin and 35% for creatinine for predicting stability at 90% precision, suggesting those fractions of repetitive tests could be reduced with low risk of missing important changes. The findings demonstrate the feasibility of using electronic health record data to identify low-yield repetitive tests and offer personalized guidance for better usage of testing while ensuring high quality care.
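One way to turn a predicted distribution into a "stability" score is the probability mass falling inside a clinician-chosen tolerance band around the previous result. The Gaussian predictive distribution below is an assumption of this sketch, not necessarily the paper's model, and the numbers are illustrative:

```python
import math

# Sketch: convert a probabilistic prediction (assumed Gaussian here) into a
# stability score: P(next value lands within +/- tolerance of the last value).
# The hemoglobin numbers below are made up for illustration.

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def stability_score(mu, sigma, last_value, tolerance):
    """P(last_value - tolerance <= next value <= last_value + tolerance)."""
    lo, hi = last_value - tolerance, last_value + tolerance
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)

if __name__ == "__main__":
    # Predicted hemoglobin centered on the previous result: likely "stable"
    # under a +/- 1 g/dL band, so the repeat test is a candidate to defer.
    print(round(stability_score(mu=13.0, sigma=0.5, last_value=13.0, tolerance=1.0), 3))
```

Because the model outputs a full distribution, the band (and hence the stability definition) can be customized at use time without retraining, which is the flexibility the abstract emphasizes.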

    View details for PubMedID 38222438

    View details for PubMedCentralID PMC10785903

  • Paddy rice methane emissions across Monsoon Asia REMOTE SENSING OF ENVIRONMENT Ouyang, Z., Jackson, R. B., McNicol, G., Fluet-Chouinard, E., Runkle, B. R. K., Papale, D., Knox, S. H., Cooley, S., Delwiche, K. B., Feron, S., Irvin, J., Malhotra, A., Muddasir, M., Sabbatini, S., Alberto, M. R., Cescatti, A., Chen, C., Dong, J., Fong, B. N., Guo, H., Hao, L., Iwata, H., Jia, Q., Ju, W., Kang, M., Li, H., Kim, J., Reba, M. L., Nayak, A., Roberti, D., Ryu, Y., Swain, C., Tsuang, B., Xiao, X., Yuan, W., Zhang, G., Zhang, Y. 2023; 284
  • GEO-Bench: Toward Foundation Models for Earth Monitoring Lacoste, A., Lehmann, N., Rodriguez, P., Sherwin, E., Kerner, H., Lutjens, B., Irvin, J., Dao, D., Alemohammad, H., Drouin, A., Gunturkun, M., Huang, G., Vazquez, D., Newman, D., Bengio, Y., Ermon, S., Zhu, X. edited by Oh, A., Neumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2023
  • Marked crosswalks in US transit-oriented station areas, 2007-2020: A computer vision approach using street view imagery ENVIRONMENT AND PLANNING B-URBAN ANALYTICS AND CITY SCIENCE Li, M., Sheng, H., Irvin, J., Chung, H., Ying, A., Sun, T., Ng, A. Y., Rodriguez, D. A. 2022
  • A Pragmatic Stepped-wedge, Cluster-controlled Trial of Real-time Pneumonia Clinical Decision Support. American journal of respiratory and critical care medicine Dean, N. C., Vines, C. G., Carr, J. R., Rubin, J. G., Webb, B. J., Jacobs, J. R., Butler, A. M., Lee, J., Jephson, A. R., Jenson, N., Walker, M., Brown, S. M., Irvin, J. A., Lungren, M. P., Allen, T. L. 2022

    Abstract

    RATIONALE: Care of emergency department patients with pneumonia can be challenging. Clinical decision support may decrease unnecessary variation and improve care. OBJECTIVES: Report patient outcomes and processes of care following deployment of ePNa: comprehensive, open loop, real-time clinical decision support embedded within the electronic health record. METHODS: Pragmatic, stepped-wedge, cluster-controlled trial with deployment at 2-month intervals into 16 community hospitals. ePNa extracts real-time and historical data to guide diagnosis, risk stratification, microbiology studies, site of care and antibiotic therapy. We included all adult emergency department patients with pneumonia over three years identified by ICD-10 discharge coding confirmed by chest imaging. MEASUREMENTS AND MAIN RESULTS: Median age of the 6848 patients was 67 years (interquartile range 50-79), 48% female; 64.8% were hospital admitted. Unadjusted mortality was 8.6% before and 4.8% after deployment. A mixed-effects logistic regression model adjusting for severity of illness with hospital cluster as the random effect showed an adjusted odds ratio of 0.62 (0.49, 0.79, P<0.001) for 30-day all-cause mortality after deployment. Lower mortality was consistent across hospital clusters. ePNa-concordant antibiotic prescribing increased from 83.5 to 90.2% (P<0.001). Mean time from emergency department admission to first antibiotic was 159.4 (156.9, 161.9) minutes at baseline and 150.9 (144.1, 157.8) after deployment (P<0.001). Outpatient disposition from the emergency department increased from 29.2% to 46.9% while 7-day secondary hospital admission was unchanged, 5.2% versus 6.1%. ePNa was utilized by emergency department clinicians in 67% of eligible patients. CONCLUSIONS: ePNa deployment was associated with improved processes of care and lower mortality.

    View details for DOI 10.1164/rccm.202109-2092OC

    View details for PubMedID 35258444

  • Gap-filling eddy covariance methane fluxes: Comparison of machine learning model predictions and uncertainties at FLUXNET-CH4 wetlands AGRICULTURAL AND FOREST METEOROLOGY Irvin, J., Zhou, S., McNicol, G., Lu, F., Liu, V., Fluet-Chouinard, E., Ouyang, Z., Knox, S., Lucas-Moffat, A., Trotta, C., Papale, D., Vitale, D., Mammarella, I., Alekseychik, P., Aurela, M., Avati, A., Baldocchi, D., Bansal, S., Bohrer, G., Campbell, D., Chen, J., Chu, H., Dalmagro, H. J., Delwiche, K. B., Desai, A. R., Euskirchen, E., Feron, S., Goeckede, M., Heimann, M., Helbig, M., Helfter, C., Hemes, K. S., Hirano, T., Iwata, H., Jurasinski, G., Kalhori, A., Kondrich, A., Lai, D. Y. F., Lohila, A., Malhotra, A., Merbold, L., Mitra, B., Ng, A., Nilsson, M. B., Noormets, A., Peichl, M., Rey-Sanchez, A., Richardson, A. D., Runkle, B. R. K., Schafer, K. V. R., Sonnentag, O., Stuart-Haentjens, E., Sturtevant, C., Ueyama, M., Valach, A. C., Vargas, R., Vourlitis, G. L., Ward, E. J., Wong, G., Zona, D., Alberto, M. R., Billesbach, D. P., Celis, G., Dolman, H., Friborg, T., Fuchs, K., Gogo, S., Gondwe, M. J., Goodrich, J. P., Gottschalk, P., Hortnagl, L., Jacotot, A., Koebsch, F., Kasak, K., Maier, R., Morin, T. H., Nemitz, E., Oechel, W. C., Oikawa, P. Y., Ono, K., Sachs, T., Sakabe, A., Schuur, E. A., Shortt, R., Sullivan, R. C., Szutu, D. J., Tuittila, E., Varlagin, A., Verfaillie, J. G., Wille, C., Windham-Myers, L., Poulter, B., Jackson, R. B. 2021; 308
  • CheXED: Comparison of a Deep Learning Model to a Clinical Decision Support System for Pneumonia in the Emergency Department. Journal of thoracic imaging Irvin, J. A., Pareek, A., Long, J., Rajpurkar, P., Eng, D. K., Khandwala, N., Haug, P. J., Jephson, A., Conner, K. E., Gordon, B. H., Rodriguez, F., Ng, A. Y., Lungren, M. P., Dean, N. C. 2021

    Abstract

    PURPOSE: Patients with pneumonia often present to the emergency department (ED) and require prompt diagnosis and treatment. Clinical decision support systems for the diagnosis and management of pneumonia are commonly utilized in EDs to improve patient care. The purpose of this study is to investigate whether a deep learning model for detecting radiographic pneumonia and pleural effusions can improve the functionality of a clinical decision support system (CDSS) for pneumonia management (ePNa) operating in 20 EDs. MATERIALS AND METHODS: In this retrospective cohort study, a dataset of 7434 prior chest radiographic studies from 6551 ED patients was used to develop and validate a deep learning model to identify radiographic pneumonia, pleural effusions, and evidence of multilobar pneumonia. Model performance was evaluated against 3 radiologists' adjudicated interpretation and compared with performance of the natural language processing of radiology reports used by ePNa. RESULTS: The deep learning model achieved an area under the receiver operating characteristic curve of 0.833 (95% confidence interval [CI]: 0.795, 0.868) for detecting radiographic pneumonia, 0.939 (95% CI: 0.911, 0.962) for detecting pleural effusions and 0.847 (95% CI: 0.800, 0.890) for identifying multilobar pneumonia. On all 3 tasks, the model achieved higher agreement with the adjudicated radiologist interpretation compared with ePNa. CONCLUSIONS: A deep learning model demonstrated higher agreement with radiologists than the ePNa CDSS in detecting radiographic pneumonia and related findings. Incorporating deep learning models into pneumonia CDSS could enhance diagnostic performance and improve pneumonia management.

    View details for DOI 10.1097/RTI.0000000000000622

    View details for PubMedID 34561377

  • Decreased mortality with rollout of electronic pneumonia clinical decision support across 16 Utah hospital emergency departments Dean, N., Vines, C., Rubin, J., Webb, B., Jacobs, J., Butler, A., Lee, J., Jephson, A. R., Jenson, N., Walker, M., Irvin, J., Lungren, M., Carr, J., Srivastava, R., Allen, T. EUROPEAN RESPIRATORY SOC JOURNALS LTD. 2020
  • Author Correction: PENet-a scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric CT imaging. NPJ digital medicine Huang, S. C., Kothari, T., Banerjee, I., Chute, C., Ball, R. L., Borus, N., Huang, A., Patel, B. N., Rajpurkar, P., Irvin, J., Dunnmon, J., Bledsoe, J., Shpanskaya, K., Dhaliwal, A., Zamanian, R., Ng, A. Y., Lungren, M. P. 2020; 3 (1): 102

    View details for DOI 10.1038/s41746-020-00310-6

    View details for PubMedID 33594219

  • Evaluation of a Machine Learning Model Based on Pretreatment Symptoms and Electroencephalographic Features to Predict Outcomes of Antidepressant Treatment in Adults With Depression: A Prespecified Secondary Analysis of a Randomized Clinical Trial. JAMA network open Rajpurkar, P., Yang, J., Dass, N., Vale, V., Keller, A. S., Irvin, J., Taylor, Z., Basu, S., Ng, A., Williams, L. M. 2020; 3 (6): e206653

    Abstract

    Importance: Despite the high prevalence and potential outcomes of major depressive disorder, whether and how patients will respond to antidepressant medications is not easily predicted. Objective: To identify the extent to which a machine learning approach, using gradient-boosted decision trees, can predict acute improvement for individual depressive symptoms with antidepressants based on pretreatment symptom scores and electroencephalographic (EEG) measures. Design, Setting, and Participants: This prognostic study analyzed data collected as part of the International Study to Predict Optimized Treatment in Depression, a randomized, prospective open-label trial to identify clinically useful predictors and moderators of response to commonly used first-line antidepressant medications. Data collection was conducted at 20 sites spanning 5 countries and including 518 adult outpatients (18-65 years of age) from primary care or specialty care practices who received a diagnosis of current major depressive disorder between December 1, 2008, and September 30, 2013. Patients were antidepressant medication naive or willing to undergo a 1-week washout period of any nonprotocol antidepressant medication. Statistical analysis was conducted from January 5 to June 30, 2019. Exposures: Participants with major depressive disorder were randomized in a 1:1:1 ratio to undergo 8 weeks of treatment with escitalopram oxalate (n=162), sertraline hydrochloride (n=176), or extended-release venlafaxine hydrochloride (n=180). Main Outcomes and Measures: The primary objective was to predict improvement in individual symptoms, defined as the difference in score for each of the symptoms on the 21-item Hamilton Rating Scale for Depression from baseline to week 8, evaluated using the C index. Results: The resulting data set contained 518 patients (274 women; mean [SD] age, 39.0 [12.6] years; mean [SD] 21-item Hamilton Rating Scale for Depression score improvement, 13.0 [7.0]). With the use of 5-fold cross-validation for evaluation, the machine learning model achieved C index scores of 0.8 or higher on 12 of 21 clinician-rated symptoms, with the highest C index score of 0.963 (95% CI, 0.939-1.000) for loss of insight. The importance of any single EEG feature was higher than 5% for prediction of 7 symptoms, with the most important EEG features being the absolute delta band power at the occipital electrode sites (O1, 18.8%; Oz, 6.7%) for loss of insight. Over and above the use of baseline symptom scores alone, the use of both EEG and baseline symptom features was associated with a significant increase in the C index for improvement in 4 symptoms: loss of insight (C index increase, 0.012 [95% CI, 0.001-0.020]), energy loss (C index increase, 0.035 [95% CI, 0.011-0.059]), appetite changes (C index increase, 0.017 [95% CI, 0.003-0.030]), and psychomotor retardation (C index increase, 0.020 [95% CI, 0.008-0.032]). Conclusions and Relevance: This study suggests that machine learning may be used to identify independent associations of symptoms and EEG features to predict antidepressant-associated improvements in specific symptoms of depression. The approach should next be prospectively validated in clinical trials and settings. Trial Registration: ClinicalTrials.gov Identifier: NCT00693849.
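The C index reported above measures how often the model's predictions order patient outcomes correctly. A brute-force sketch over pairs, using fabricated predicted and observed symptom improvements rather than trial data:

```python
# Sketch: concordance index (C index) computed over all comparable pairs.
# Pairs with tied observed outcomes are skipped; tied predictions count 1/2.
# The example values are made up for illustration.

def c_index(predicted, observed):
    """Fraction of comparable pairs ordered correctly by the predictions."""
    concordant = 0.0
    comparable = 0
    n = len(predicted)
    for i in range(n):
        for j in range(i + 1, n):
            if observed[i] == observed[j]:
                continue  # tied outcomes are not comparable
            comparable += 1
            diff = (predicted[i] - predicted[j]) * (observed[i] - observed[j])
            if diff > 0:
                concordant += 1.0   # same ordering in prediction and outcome
            elif diff == 0:
                concordant += 0.5   # tied prediction: half credit
    return concordant / comparable

if __name__ == "__main__":
    predicted_improvement = [0.1, 0.4, 0.5, 0.8]
    observed_improvement = [0, 1, 0, 2]
    print(c_index(predicted_improvement, observed_improvement))
```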

    View details for DOI 10.1001/jamanetworkopen.2020.6653

    View details for PubMedID 32568399

  • Incorporating machine learning and social determinants of health indicators into prospective risk adjustment for health plan payments. BMC public health Irvin, J. A., Kondrich, A. A., Ko, M., Rajpurkar, P., Haghgoo, B., Landon, B. E., Phillips, R. L., Petterson, S., Ng, A. Y., Basu, S. 2020; 20 (1): 608

    Abstract

    BACKGROUND: Risk adjustment models are employed to prevent adverse selection, anticipate budgetary reserve needs, and offer care management services to high-risk individuals. We aimed to address two unknowns about risk adjustment: whether machine learning (ML) and inclusion of social determinants of health (SDH) indicators improve prospective risk adjustment for health plan payments. METHODS: We employed a 2-by-2 factorial design comparing: (i) linear regression versus ML (gradient boosting) and (ii) demographics and diagnostic codes alone, versus additional ZIP code-level SDH indicators. Healthcare claims from privately-insured US adults (2016-2017) and Census data were used for analysis. Data from 1.02 million adults were used for derivation, and data from 0.26 million to assess performance. Model performance was measured using coefficient of determination (R2), discrimination (C-statistic), and mean absolute error (MAE) for the overall population, and predictive ratio and net compensation for vulnerable subgroups. We provide 95% confidence intervals (CI) around each performance measure. RESULTS: Linear regression without SDH indicators achieved moderate determination (R2 0.327, 95% CI: 0.300, 0.353), error ($6992; 95% CI: $6889, $7094), and discrimination (C-statistic 0.703; 95% CI: 0.701, 0.705). ML without SDH indicators improved all metrics (R2 0.388; 95% CI: 0.357, 0.420; error $6637; 95% CI: $6539, $6735; C-statistic 0.717; 95% CI: 0.715, 0.718), reducing misestimation of cost by $3.5M per 10,000 members. Among people living in areas with high poverty, high wealth inequality, or high prevalence of uninsured, SDH indicators reduced underestimation of cost, improving the predictive ratio by 3% (~$200/person/year). CONCLUSIONS: ML improved risk adjustment models and the incorporation of SDH indicators reduced underpayment in several vulnerable populations.
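The predictive ratio used above for subgroup fairness checks is simply total predicted spending divided by total observed spending for a subgroup; a value below 1.0 means the model underpays for that group. A sketch with illustrative numbers only:

```python
# Sketch: predictive ratio for a subgroup. The cost figures are invented
# placeholders, not claims data from the study.

def predictive_ratio(predicted_costs, actual_costs):
    """Sum of predicted costs over sum of actual costs; < 1.0 = underpayment."""
    return sum(predicted_costs) / sum(actual_costs)

if __name__ == "__main__":
    # Hypothetical subgroup, e.g. members living in high-poverty ZIP codes.
    predicted = [4200.0, 6100.0, 900.0, 12500.0]
    actual = [5000.0, 7000.0, 1000.0, 12000.0]
    print(round(predictive_ratio(predicted, actual), 3))
```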

    View details for DOI 10.1186/s12889-020-08735-0

    View details for PubMedID 32357871

  • PENet-a scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric CT imaging. NPJ digital medicine Huang, S. C., Kothari, T., Banerjee, I., Chute, C., Ball, R. L., Borus, N., Huang, A., Patel, B. N., Rajpurkar, P., Irvin, J., Dunnmon, J., Bledsoe, J., Shpanskaya, K., Dhaliwal, A., Zamanian, R., Ng, A. Y., Lungren, M. P. 2020; 3 (1): 61

    Abstract

    Pulmonary embolism (PE) is a life-threatening clinical problem and computed tomography pulmonary angiography (CTPA) is the gold standard for diagnosis. Prompt diagnosis and immediate treatment are critical to avoid high morbidity and mortality rates, yet PE remains among the diagnoses most frequently missed or delayed. In this study, we developed a deep learning model, PENet, to automatically detect PE on volumetric CTPA scans as an end-to-end solution for this purpose. PENet is a 77-layer 3D convolutional neural network (CNN) pretrained on the Kinetics-600 dataset and fine-tuned on a retrospective CTPA dataset collected from a single academic institution. Model performance in detecting PE was evaluated on data from two different institutions: a hold-out dataset from the same institution as the training data, and a second dataset collected from an external institution to evaluate model generalizability to an unrelated population. PENet achieved an AUROC of 0.84 [0.82-0.87] for detecting PE on the internal hold-out test set and 0.85 [0.81-0.88] on the external dataset, and outperformed current state-of-the-art 3D CNN models. The results represent a successful application of an end-to-end 3D CNN model for the complex task of PE diagnosis without requiring computationally intensive and time-consuming preprocessing, and demonstrate sustained performance on data from an external institution. Our model could be applied as a triage tool to automatically identify clinically important PEs, allowing prioritization for diagnostic radiology interpretation and improved care pathways via more efficient diagnosis.
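Intervals like "0.84 [0.82-0.87]" are commonly obtained by bootstrapping the AUROC. A percentile-bootstrap sketch on toy scores and labels, not CTPA model outputs:

```python
import random

# Sketch: AUROC with a percentile-bootstrap confidence interval, the style of
# interval reported above. All data below are fabricated for illustration.

def auroc(scores, labels):
    """Probability a random positive outscores a random negative (ties = 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Resample cases with replacement and take percentile bounds of the AUROC."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # resample must contain both classes
            stats.append(auroc([scores[i] for i in idx], ys))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

if __name__ == "__main__":
    scores = [0.9, 0.8, 0.75, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
    labels = [1,   1,   0,    1,   1,   0,    0,   1,   0,   0]
    lo, hi = bootstrap_ci(scores, labels, n_boot=200)
    print(round(auroc(scores, labels), 2), round(lo, 2), round(hi, 2))
```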

    View details for DOI 10.1038/s41746-020-0266-y

    View details for PubMedID 33594235

  • AppendiXNet: Deep Learning for Diagnosis of Appendicitis from A Small Dataset of CT Exams Using Video Pretraining. Scientific reports Rajpurkar, P., Park, A., Irvin, J., Chute, C., Bereket, M., Mastrodicasa, D., Langlotz, C. P., Lungren, M. P., Ng, A. Y., Patel, B. N. 2020; 10 (1): 3958

    Abstract

    The development of deep learning algorithms for complex tasks in digital medicine has relied on the availability of large labeled training datasets, usually containing hundreds of thousands of examples. The purpose of this study was to develop a 3D deep learning model, AppendiXNet, to detect appendicitis, one of the most common life-threatening abdominal emergencies, using a small training dataset of less than 500 training CT exams. We explored whether pretraining the model on a large collection of natural videos would improve the performance of the model over training the model from scratch. AppendiXNet was pretrained on a large collection of YouTube videos called Kinetics, consisting of approximately 500,000 video clips and annotated for one of 600 human action classes, and then fine-tuned on a small dataset of 438 CT scans annotated for appendicitis. We found that pretraining the 3D model on natural videos significantly improved the performance of the model from an AUC of 0.724 (95% CI 0.625, 0.823) to 0.810 (95% CI 0.725, 0.895). The application of deep learning to detect abnormalities on CT examinations using video pretraining could generalize effectively to other challenging cross-sectional medical imaging tasks when training data is limited.

    View details for DOI 10.1038/s41598-020-61055-6

    View details for PubMedID 32127625

  • Real-time electronic interpretation of digital chest images using artificial intelligence in emergency department patients suspected of pneumonia Dean, N., Irvin, J. A., Samir, P. S., Jephson, A., Conner, K., Lungren, M. P. EUROPEAN RESPIRATORY SOC JOURNALS LTD. 2019
  • Human-machine partnership with artificial intelligence for chest radiograph diagnosis. NPJ digital medicine Patel, B. N., Rosenberg, L., Willcox, G., Baltaxe, D., Lyons, M., Irvin, J., Rajpurkar, P., Amrhein, T., Gupta, R., Halabi, S., Langlotz, C., Lo, E., Mammarappallil, J., Mariano, A. J., Riley, G., Seekins, J., Shen, L., Zucker, E., Lungren, M. P. 2019; 2: 111

    Abstract

    Human-in-the-loop (HITL) AI may enable an ideal symbiosis of human experts and AI models, harnessing the advantages of both while at the same time overcoming their respective limitations. The purpose of this study was to investigate a novel collective intelligence technology designed to amplify the diagnostic accuracy of networked human groups by forming real-time systems modeled on biological swarms. Using small groups of radiologists, the swarm-based technology was applied to the diagnosis of pneumonia on chest radiographs and compared against human experts alone, as well as two state-of-the-art deep learning AI models. Our work demonstrates that both the swarm-based technology and the deep learning technology achieved higher diagnostic accuracy than the human experts alone, and further that, when used in combination, the two technologies outperformed either method alone. The superior diagnostic accuracy of the combined HITL AI solution compared to radiologists and AI alone has broad implications for the surging clinical AI deployment and implementation strategies in future practice.

    View details for DOI 10.1038/s41746-019-0189-7

    View details for PubMedID 31754637

    View details for PubMedCentralID PMC6861262

  • CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., Seekins, J., Mong, D. A., Halabi, S. S., Sandberg, J. K., Jones, R., Larson, D. B., Langlotz, C. P., Patel, B. N., Lungren, M. P., Ng, A. Y., AAAI ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE. 2019: 590–97
  • Erratum: Author Correction: Human-machine partnership with artificial intelligence for chest radiograph diagnosis. NPJ digital medicine Patel, B. N., Rosenberg, L., Willcox, G., Baltaxe, D., Lyons, M., Irvin, J., Rajpurkar, P., Amrhein, T., Gupta, R., Halabi, S., Langlotz, C., Lo, E., Mammarappallil, J., Mariano, A. J., Riley, G., Seekins, J., Shen, L., Zucker, E., Lungren, M. P. 2019; 2: 129

    Abstract

    [This corrects the article DOI: 10.1038/s41746-019-0189-7.].

    View details for DOI 10.1038/s41746-019-0198-6

    View details for PubMedID 31840097

    View details for PubMedCentralID PMC6904441

  • Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet. PLoS medicine Bien, N., Rajpurkar, P., Ball, R. L., Irvin, J., Park, A., Jones, E., Bereket, M., Patel, B. N., Yeom, K. W., Shpanskaya, K., Halabi, S., Zucker, E., Fanton, G., Amanatullah, D. F., Beaulieu, C. F., Riley, G. M., Stewart, R. J., Blankenberg, F. G., Larson, D. B., Jones, R. H., Langlotz, C. P., Ng, A. Y., Lungren, M. P. 2018; 15 (11): e1002699

    Abstract

    BACKGROUND: Magnetic resonance imaging (MRI) of the knee is the preferred method for diagnosing knee injuries. However, interpretation of knee MRI is time-intensive and subject to diagnostic error and variability. An automated system for interpreting knee MRI could prioritize high-risk patients and assist clinicians in making diagnoses. Deep learning methods, in being able to automatically learn layers of features, are well suited for modeling the complex relationships between medical images and their interpretations. In this study we developed a deep learning model for detecting general abnormalities and specific diagnoses (anterior cruciate ligament [ACL] tears and meniscal tears) on knee MRI exams. We then measured the effect of providing the model's predictions to clinical experts during interpretation. METHODS AND FINDINGS: Our dataset consisted of 1,370 knee MRI exams performed at Stanford University Medical Center between January 1, 2001, and December 31, 2012 (mean age 38.0 years; 569 [41.5%] female patients). The majority vote of 3 musculoskeletal radiologists established reference standard labels on an internal validation set of 120 exams. We developed MRNet, a convolutional neural network for classifying MRI series, and combined predictions from 3 series per exam using logistic regression. In detecting abnormalities, ACL tears, and meniscal tears, this model achieved area under the receiver operating characteristic curve (AUC) values of 0.937 (95% CI 0.895, 0.980), 0.965 (95% CI 0.938, 0.993), and 0.847 (95% CI 0.780, 0.914), respectively, on the internal validation set. We also obtained a public dataset of 917 exams with sagittal T1-weighted series and labels for ACL injury from Clinical Hospital Centre Rijeka, Croatia. On the external validation set of 183 exams, the MRNet trained on Stanford sagittal T2-weighted series achieved an AUC of 0.824 (95% CI 0.757, 0.892) in the detection of ACL injuries with no additional training, while an MRNet trained on the rest of the external data achieved an AUC of 0.911 (95% CI 0.864, 0.958). We additionally measured the specificity, sensitivity, and accuracy of 9 clinical experts (7 board-certified general radiologists and 2 orthopedic surgeons) on the internal validation set both with and without model assistance. Using a 2-sided Pearson's chi-squared test with adjustment for multiple comparisons, we found no significant differences between the performance of the model and that of unassisted general radiologists in detecting abnormalities. General radiologists achieved significantly higher sensitivity in detecting ACL tears (p-value = 0.002; q-value = 0.019) and significantly higher specificity in detecting meniscal tears (p-value = 0.003; q-value = 0.019). Using a 1-tailed t test on the change in performance metrics, we found that providing model predictions significantly increased clinical experts' specificity in identifying ACL tears (p-value < 0.001; q-value = 0.006). The primary limitations of our study include lack of surgical ground truth and the small size of the panel of clinical experts. CONCLUSIONS: Our deep learning model can rapidly generate accurate clinical pathology classifications of knee MRI exams from both internal and external datasets. Moreover, our results support the assertion that deep learning models can improve the performance of clinical experts during medical imaging interpretation. Further research is needed to validate the model prospectively and to determine its utility in the clinical setting.
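    The exam-level aggregation step described in the abstract, combining three per-series CNN probabilities with logistic regression, can be sketched as follows. The weights, bias, and per-series probabilities here are hypothetical placeholders; in the study these parameters are fit on the training set.

    ```python
    import math

    def combine_series(probs, weights, bias):
        """Combine per-series abnormality probabilities into one
        exam-level probability via a logistic regression layer.

        probs   : per-series probabilities, e.g. [axial, coronal, sagittal]
        weights : learned coefficient per series (hypothetical values below)
        bias    : learned intercept
        """
        z = bias + sum(w * p for w, p in zip(weights, probs))
        return 1.0 / (1.0 + math.exp(-z))  # sigmoid

    # Hypothetical learned parameters and one exam's per-series outputs
    weights = [2.1, 1.7, 2.4]
    bias = -3.0
    exam_probs = [0.92, 0.81, 0.88]

    p_exam = combine_series(exam_probs, weights, bias)
    ```

    Combining series this way lets each imaging plane contribute its own learned weight to the final call, rather than averaging the three CNN outputs uniformly.
    
    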

    View details for PubMedID 30481176

  • Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS medicine Rajpurkar, P., Irvin, J., Ball, R. L., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D., Bagul, A., Langlotz, C. P., Patel, B. N., Yeom, K. W., Shpanskaya, K., Blankenberg, F. G., Seekins, J., Amrhein, T. J., Mong, D. A., Halabi, S. S., Zucker, E. J., Ng, A. Y., Lungren, M. P. 2018; 15 (11): e1002686

    Abstract

    BACKGROUND: Chest radiograph interpretation is critical for the detection of thoracic diseases, including tuberculosis and lung cancer, which affect millions of people worldwide each year. This time-consuming task typically requires expert radiologists to read the images, leading to fatigue-based diagnostic error and lack of diagnostic expertise in areas of the world where radiologists are not available. Recently, deep learning approaches have been able to achieve expert-level performance in medical image interpretation tasks, powered by large network architectures and fueled by the emergence of large labeled datasets. The purpose of this study is to investigate the performance of a deep learning algorithm on the detection of pathologies in chest radiographs compared with practicing radiologists. METHODS AND FINDINGS: We developed CheXNeXt, a convolutional neural network to concurrently detect the presence of 14 different pathologies, including pneumonia, pleural effusion, pulmonary masses, and nodules in frontal-view chest radiographs. CheXNeXt was trained and internally validated on the ChestX-ray8 dataset, with a held-out validation set consisting of 420 images, sampled to contain at least 50 cases of each of the original pathology labels. On this validation set, the majority vote of a panel of 3 board-certified cardiothoracic specialist radiologists served as reference standard. We compared CheXNeXt's discriminative performance on the validation set to the performance of 9 radiologists using the area under the receiver operating characteristic curve (AUC). The radiologists included 6 board-certified radiologists (average experience 12 years, range 4-28 years) and 3 senior radiology residents, from 3 academic institutions. We found that CheXNeXt achieved radiologist-level performance on 11 pathologies and did not achieve radiologist-level performance on 3 pathologies. The radiologists achieved statistically significantly higher AUC performance on cardiomegaly, emphysema, and hiatal hernia, with AUCs of 0.888 (95% confidence interval [CI] 0.863-0.910), 0.911 (95% CI 0.866-0.947), and 0.985 (95% CI 0.974-0.991), respectively, whereas CheXNeXt's AUCs were 0.831 (95% CI 0.790-0.870), 0.704 (95% CI 0.567-0.833), and 0.851 (95% CI 0.785-0.909), respectively. CheXNeXt performed better than radiologists in detecting atelectasis, with an AUC of 0.862 (95% CI 0.825-0.895), statistically significantly higher than radiologists' AUC of 0.808 (95% CI 0.777-0.838); there were no statistically significant differences in AUCs for the other 10 pathologies. The average time to interpret the 420 images in the validation set was substantially longer for the radiologists (240 minutes) than for CheXNeXt (1.5 minutes). The main limitations of our study are that neither CheXNeXt nor the radiologists were permitted to use patient history or review prior examinations and that evaluation was limited to a dataset from a single institution. CONCLUSIONS: In this study, we developed and validated a deep learning algorithm that classified clinically important abnormalities in chest radiographs at a performance level comparable to practicing radiologists. Once tested prospectively in clinical settings, the algorithm could have the potential to expand patient access to chest radiograph diagnostics.
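    The AUC metric used throughout this comparison has a simple rank interpretation: the probability that a randomly chosen positive case is scored higher than a randomly chosen negative one. A minimal pure-Python sketch of that formulation, on toy labels and scores (not the study's data):

    ```python
    def auc(labels, scores):
        """Area under the ROC curve via the rank (Mann-Whitney U)
        formulation: count how often a positive outranks a negative,
        with ties counted as half a win."""
        pos = [s for l, s in zip(labels, scores) if l == 1]
        neg = [s for l, s in zip(labels, scores) if l == 0]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    # Toy example: positives scored 0.35 and 0.8, negatives 0.1 and 0.4.
    # Three of the four positive/negative pairs are correctly ranked.
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # -> 0.75
    ```

    Because AUC depends only on the ranking of scores, it lets a model's continuous probabilities be compared against readers without fixing a single decision threshold.
    
    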

    View details for PubMedID 30457988