All Publications

  • Preparing for the next pandemic via transfer learning from existing diseases with hierarchical multi-modal BERT: a study on COVID-19 outcome prediction. Scientific Reports. Agarwal, K., Choudhury, S., Tipirneni, S., Mukherjee, P., Ham, C., Tamang, S., Baker, M., Tang, S., Kocaman, V., Gevaert, O., Rallo, R., Reddy, C. K. 2022; 12 (1): 10748


    Developing prediction models for emerging infectious diseases from relatively small numbers of cases is a critical need for improving pandemic preparedness. Using COVID-19 as an exemplar, we propose a transfer learning methodology for developing predictive models from multi-modal electronic healthcare records by leveraging information from more prevalent diseases with shared clinical characteristics. Our novel hierarchical, multi-modal model ([Formula: see text]) integrates baseline risk factors from the natural language processing of clinical notes at admission, time-series measurements of biomarkers obtained from laboratory tests, and discrete diagnostic, procedure and drug codes. We demonstrate the alignment of [Formula: see text]'s predictions with well-established clinical knowledge about COVID-19 through univariate and multivariate risk factor driven sub-cohort analysis. [Formula: see text]'s superior performance over state-of-the-art methods shows that leveraging patient data across modalities and transferring prior knowledge from similar disorders is critical for accurate prediction of patient outcomes, and this approach may serve as an important tool in the early response to future pandemics.

    View details for DOI 10.1038/s41598-022-13072-w

    View details for PubMedID 35750878

  • Data valuation for medical imaging using Shapley value and application to a large-scale chest X-ray dataset. Scientific Reports. Tang, S., Ghorbani, A., Yamashita, R., Rehman, S., Dunnmon, J. A., Zou, J., Rubin, D. L. 2021; 11 (1): 8366


    The reliability of machine learning models can be compromised when trained on low quality data. Many large-scale medical imaging datasets contain low quality labels extracted from sources such as medical reports. Moreover, images within a dataset may have heterogeneous quality due to artifacts and biases arising from equipment or measurement errors. Therefore, algorithms that can automatically identify low quality data are highly desired. In this study, we used data Shapley, a data valuation metric, to quantify the value of training data to the performance of a pneumonia detection algorithm in a large chest X-ray dataset. We characterized the effectiveness of data Shapley in identifying low quality versus valuable data for pneumonia detection. We found that removing training data with high Shapley values decreased the pneumonia detection performance, whereas removing data with low Shapley values improved the model performance. Furthermore, there were more mislabeled examples in low Shapley value data and more true pneumonia cases in high Shapley value data. Our results suggest that low Shapley value indicates mislabeled or poor quality images, whereas high Shapley value indicates data that are valuable for pneumonia detection. Our method can serve as a framework for using data Shapley to denoise large-scale medical imaging datasets.

    View details for DOI 10.1038/s41598-021-87762-2

    View details for PubMedID 33863957
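
    As a rough illustration of the data Shapley idea in the abstract above, the following sketch computes exact Shapley values for a five-point toy training set by averaging each point's marginal contribution to validation accuracy over every ordering. The data, the 1-nearest-neighbour classifier, the chance-level score assigned to an empty training set, and all names are assumptions made for illustration and are not taken from the study. Exhaustive enumeration is only feasible for tiny sets, so a large dataset would need an approximation instead.

```python
from itertools import permutations

# Toy 1-D training set of (feature, label) pairs. The last point is
# deliberately mislabeled: it sits in the class-0 cluster but has label 1.
TRAIN = [(0.0, 0), (0.6, 0), (9.0, 1), (10.0, 1), (1.0, 1)]
MISLABELED = 4  # index of the mislabeled point (illustrative assumption)

# Held-out points used to score any subset of the training data.
VALID = [(0.9, 0), (1.1, 0), (9.4, 1), (9.6, 1)]


def utility(subset):
    """Validation accuracy of a 1-nearest-neighbour classifier fit on `subset`."""
    if not subset:
        return 0.5  # assumed chance level on a balanced binary task
    correct = 0
    for x, y in VALID:
        _, predicted = min(subset, key=lambda point: abs(point[0] - x))
        correct += predicted == y
    return correct / len(VALID)


def data_shapley(train):
    """Exact Shapley values: mean marginal gain of each point over all orderings."""
    n = len(train)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        chosen = []
        previous = utility(chosen)
        for index in order:
            chosen.append(train[index])
            current = utility(chosen)
            totals[index] += current - previous  # marginal contribution
            previous = current
    return [total / len(orderings) for total in totals]


values = data_shapley(TRAIN)
lowest = min(range(len(TRAIN)), key=lambda i: values[i])
for (x, y), value in zip(TRAIN, values):
    print(f"x={x:5.1f} label={y} shapley={value:+.3f}")
print("lowest-valued index:", lowest)
```

    In this toy setup the deliberately mislabeled point receives the only negative value, which mirrors the abstract's observation that low Shapley values tend to flag mislabeled data.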

  • Comparison of Segmentation-Free and Segmentation-Dependent Computer-Aided Diagnosis of Breast Masses on a Public Mammography Dataset. Journal of Biomedical Informatics. Sawyer Lee, R., Dunnmon, J. A., He, A., Tang, S., Re, C., Rubin, D. L. 2020: 103656


    PURPOSE: To compare machine learning methods for classifying mass lesions on mammography images that use predefined image features computed over lesion segmentations to those that leverage segmentation-free representation learning on a standard, public evaluation dataset.

    METHODS: We apply several classification algorithms to the public Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM), in which each image contains a mass lesion. Segmentation-free representation learning techniques for classifying lesions as benign or malignant include both a Bag-of-Visual-Words (BoVW) method and a Convolutional Neural Network (CNN). We compare classification performance of these techniques to that obtained using two different segmentation-dependent approaches from the literature that rely on specific combinations of end classifiers (e.g., linear discriminant analysis, neural networks) and predefined features computed over the lesion segmentation (e.g., spiculation measure, morphological characteristics, intensity metrics).

    RESULTS: We report area under the receiver operating characteristic curve (AZ) values for malignancy classification on CBIS-DDSM for each technique. We find average AZ values of 0.73 for a segmentation-free BoVW method, 0.86 for a segmentation-free CNN method, 0.75 for a segmentation-dependent linear discriminant analysis of Rubber-Band Straightening Transform features, and 0.58 for a hybrid rule-based neural network classification using a small number of hand-designed features.

    CONCLUSIONS: We find that malignancy classification performance on the CBIS-DDSM dataset using segmentation-free BoVW features is comparable to that of the best segmentation-dependent methods we study, but also observe that a common segmentation-free CNN model substantially and significantly outperforms each of these (p < 0.05). These results reinforce recent findings suggesting that representation learning techniques such as BoVW and CNNs are advantageous for mammogram analysis because they do not require lesion segmentation, the quality and specific characteristics of which can vary substantially across datasets. We further observe that segmentation-dependent methods achieve performance levels on CBIS-DDSM inferior to those achieved on the original evaluation datasets reported in the literature. Each of these findings reinforces the need for standardization of datasets, segmentation techniques, and model implementations in performance assessments of automated classifiers for medical imaging.

    View details for DOI 10.1016/j.jbi.2020.103656

    View details for PubMedID 33309994
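
    The AZ values in the abstract above are areas under the ROC curve. One way to read such a number is as the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign case, with ties counted as half. A minimal sketch of that pairwise computation follows; the scores, labels, and function name are made up for illustration and this is not the study's evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier outputs: 1 = malignant, 0 = benign.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four malignant/benign pairs are ranked correctly
```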

  • Reconciling Dimensional and Categorical Models of Autism Heterogeneity: A Brain Connectomics and Behavioral Study. Biological Psychiatry. Tang, S., Sun, N., Floris, D. L., Zhang, X., Di Martino, A., Yeo, B. T. 2019


    BACKGROUND: Heterogeneity in autism spectrum disorder (ASD) has hindered the development of biomarkers, thus motivating subtyping efforts. Most subtyping studies divide individuals with ASD into nonoverlapping (categorical) subgroups. However, continuous interindividual variation in ASD suggests that there is a need for a dimensional approach.

    METHODS: A Bayesian model was employed to decompose resting-state functional connectivity (RSFC) of individuals with ASD into multiple abnormal RSFC patterns, i.e., categorical subtypes, henceforth referred to as "factors." Importantly, the model allowed each individual to express one or more factors to varying degrees (dimensional subtyping). The model was applied to 306 individuals with ASD (5.2-57 years of age) from two multisite repositories. Post hoc analyses associated factors with symptoms and demographics.

    RESULTS: Analyses yielded three factors with dissociable whole-brain hypo- and hyper-RSFC patterns. Most participants expressed multiple (categorical) factors, suggestive of a mosaic of subtypes within individuals. All factors shared abnormal RSFC involving the default mode network, but the directionality (hypo- or hyper-RSFC) differed across factors. Factor 1 was associated with core ASD symptoms. Factors 1 and 2 were associated with distinct comorbid symptoms. Older male participants preferentially expressed factor 3. Factors were robust across control analyses and were not associated with IQ or head motion.

    CONCLUSIONS: There exist at least three ASD factors with dissociable whole-brain RSFC patterns, behaviors, and demographics. Heterogeneous default mode network hypo- and hyper-RSFC across the factors might explain previously reported inconsistencies. The factors differentiated between core ASD and comorbid symptoms, a less appreciated domain of heterogeneity in ASD. These factors are coexpressed in individuals with ASD to different degrees, thus reconciling categorical and dimensional perspectives of ASD heterogeneity.

    View details for DOI 10.1016/j.biopsych.2019.11.009

    View details for PubMedID 31955916

  • Somatosensory-Motor Dysconnectivity Spans Multiple Transdiagnostic Dimensions of Psychopathology. Biological Psychiatry. Kebets, V., Holmes, A. J., Orban, C., Tang, S., Li, J., Sun, N., Kong, R., Poldrack, R. A., Yeo, B. T. 2019


    BACKGROUND: There is considerable interest in a dimensional transdiagnostic approach to psychiatry. Most transdiagnostic studies have derived factors based only on clinical symptoms, which might miss possible links between psychopathology, cognitive processes, and personality traits. Furthermore, many psychiatric studies focus on higher-order association brain networks, thereby neglecting the potential influence of huge swaths of the brain.

    METHODS: A multivariate data-driven approach (partial least squares) was used to identify latent components linking a large set of clinical, cognitive, and personality measures to whole-brain resting-state functional connectivity patterns across 224 participants. The participants were either healthy (n = 110) or diagnosed with bipolar disorder (n = 40), attention-deficit/hyperactivity disorder (n = 37), schizophrenia (n = 29), or schizoaffective disorder (n = 8). In contrast to traditional case-control analyses, the diagnostic categories were not used in the partial least squares analysis but were helpful for interpreting the components.

    RESULTS: Our analyses revealed three latent components corresponding to general psychopathology, cognitive dysfunction, and impulsivity. Each component was associated with a unique whole-brain resting-state functional connectivity signature and was shared across all participants. The components were robust across multiple control analyses and replicated using independent task functional magnetic resonance imaging data from the same participants. Strikingly, all three components featured connectivity alterations within the somatosensory-motor network and its connectivity with subcortical structures and cortical executive networks.

    CONCLUSIONS: We identified three distinct dimensions with dissociable (but overlapping) whole-brain resting-state functional connectivity signatures across healthy individuals and individuals with psychiatric illness, providing potential intermediate phenotypes that span across diagnostic categories. Our results suggest expanding the focus of psychiatric neuroscience beyond higher-order brain networks.

    View details for DOI 10.1016/j.biopsych.2019.06.013

    View details for PubMedID 31515054