Bio


Shengtian Sang is currently a postdoctoral scholar at the Laboratory of Artificial Intelligence in Medicine and Biomedical Physics in the Department of Radiation Oncology at Stanford University. He received his Ph.D. degree from the College of Computer Science and Technology, Dalian University of Technology, Dalian, China. His current research interests are high-dimensional data mining, medical image computing, and machine learning. During his Ph.D. studies, he worked on biomedical literature-based discovery and data mining.

Stanford Advisors


  • Lei Xing, Postdoctoral Faculty Sponsor

All Publications


  • Leveraging data-driven self-consistency for high-fidelity gene expression recovery. Nature communications Islam, M. T., Wang, J., Ren, H., Li, X., Khuzani, M. B., Sang, S., Yu, L., Shen, L., Zhao, W., Xing, L. 2022; 13 (1): 7142

    Abstract

    Single cell RNA sequencing is a promising technique to determine the states of individual cells and classify novel cell subtypes. In current sequence data analysis, however, genes with low expressions are omitted, which leads to inaccurate gene counts and hinders downstream analysis. Recovering these omitted expression values presents a challenge because of the large size of the data. Here, we introduce a data-driven gene expression recovery framework, referred to as self-consistent expression recovery machine (SERM), to impute the missing expressions. Using a neural network, the technique first learns the underlying data distribution from a subset of the noisy data. It then recovers the overall expression data by imposing a self-consistency on the expression matrix, thus ensuring that the expression levels are similarly distributed in different parts of the matrix. We show that SERM improves the accuracy of gene imputation with orders of magnitude enhancement in computational efficiency in comparison to the state-of-the-art imputation techniques.

    View details for DOI 10.1038/s41467-022-34595-w

    View details for PubMedID 36414658

  • Small-Object Sensitive Segmentation Using Across Feature Map Attention. IEEE transactions on pattern analysis and machine intelligence Sang, S., Zhou, Y., Islam, M. T., Xing, L. 2022; PP

    Abstract

    Semantic segmentation is an important step in understanding the scene for many practical applications such as autonomous driving. Although Deep Convolutional Neural Networks-based methods have significantly improved segmentation accuracy, small/thin objects remain challenging to segment due to convolutional and pooling operations that result in information loss, especially for small objects. This paper presents a novel attention-based method called Across Feature Map Attention (AFMA) to address this challenge. It quantifies the inner-relationship between small and large objects belonging to the same category by utilizing the different feature levels of the original image. The AFMA could compensate for the loss of high-level feature information of small objects and improve the small/thin object segmentation. Our method can be used as an efficient plug-in for a wide range of existing architectures and produces much more interpretable feature representation than former studies. Extensive experiments on eight widely used segmentation methods and other existing small-object segmentation models on CamVid and Cityscapes demonstrate that our method substantially and consistently improves the segmentation of small/thin objects.

    View details for DOI 10.1109/TPAMI.2022.3211171

    View details for PubMedID 36178991

  • A Scalable Embedding Based Neural Network Method for Discovering Knowledge From Biomedical Literature IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Sang, S., Liu, X., Chen, X., Zhao, D. 2022; 19 (3): 1294-1301

    Abstract

    The amount of biomedical literature is growing at an explosive speed, and much useful knowledge in it remains undiscovered. Classical information retrieval techniques allow access to explicit information in a given collection, but are not able to recognize implicit connections. Literature-based discovery (LBD) is characterized by uncovering hidden associations in non-interacting literature. It could significantly support scientific research by identifying new connections between biomedical entities. However, most of the existing approaches to LBD are not scalable and may not be sufficient to detect complex associations in non-directly-connected literature. In this article, we present a model which incorporates a biomedical knowledge graph, graph embedding, and deep learning methods for literature-based discovery. First, the relations between biomedical entities are extracted from biomedical abstracts, and a knowledge graph is constructed from these relations. Second, graph embedding techniques are applied to convert the entities and relations in the knowledge graph into a low-dimensional vector space. Third, a bidirectional Long Short-Term Memory (BLSTM) network is trained on the entity associations represented by the pre-trained graph embeddings. Finally, the learned model is used for open and closed literature-based discovery tasks. The experimental results show that our method could not only effectively discover hidden associations between entities, but also reveal the corresponding mechanism of interactions. This suggests that combining knowledge graphs with deep learning methods is an effective way of capturing the underlying complex associations between entities hidden in the literature.

    View details for DOI 10.1109/TCBB.2020.3003947

    View details for Web of Science ID 000805807200006

    View details for PubMedID 32750871

  • Type 1 Diabetes Management With Technology: Patterns of Utilization and Effects on Glucose Control Using Real-World Evidence. Clinical diabetes : a publication of the American Diabetes Association Sun, R., Banerjee, I., Sang, S., Joseph, J., Schneider, J., Hernandez-Boussard, T. 2021; 39 (3): 284-292

    Abstract

    This retrospective cohort study evaluated diabetes device utilization and the effectiveness of these devices for newly diagnosed type 1 diabetes. Investigators examined the use of continuous glucose monitoring (CGM) systems, self-monitoring of blood glucose (SMBG), continuous subcutaneous insulin infusion (CSII), and multiple daily injection (MDI) insulin regimens and their effects on A1C. The researchers identified 6,250 patients with type 1 diabetes, of whom 32% used CGM and 37.1% used CSII. A higher adoption rate of either CGM or CSII in newly diagnosed type 1 diabetes was noted among White patients and those with private health insurance. CGM users had lower A1C levels than nonusers (P = 0.039), whereas no difference was noted between CSII users and nonusers (P = 0.057). Furthermore, CGM use combined with CSII yielded lower A1C than MDI regimens plus SMBG (P <0.001).

    View details for DOI 10.2337/cd20-0098

    View details for PubMedID 34421204

  • Geometric resistant polar quaternion discrete Fourier transform and its application in color image zero-hiding. ISA transactions Wang, C., Ma, B., Xia, Z., Li, J., Li, Q., Liu, X., Sang, S. 2021

    Abstract

    As a typical frequency-domain analysis method, quaternion discrete Fourier transform (QDFT) has been widely used in information hiding in color images. However, due to the sensitivity of QDFT to geometric attacks, existing QDFT-based information hiding schemes have limited ability in resisting geometric attacks. In this study, a kind of novel geometrically resilient polar QDFT (PQDFT) is constructed and the properties of the proposed PQDFT are analyzed. Subsequently, a PQDFT-based color image zero-hiding scheme robust to geometric attacks is proposed for lossless copyright protection of color images, which experimentally shows reasonable resistance against geometric and common attacks, indicating better robustness compared with the existing QDFT-based information hiding schemes and other leading-edge zero-hiding schemes.

    View details for DOI 10.1016/j.isatra.2021.06.019

    View details for PubMedID 34176603

  • Learning from Past Respiratory Failure Patients to Triage COVID-19 Patient Ventilator Needs: A Multi-Institutional Study. Journal of biomedical informatics Carmichael, H., Coquet, J., Sun, R., Sang, S., Groat, D., Asch, S. M., Bledsoe, J., Peltan, I. D., Jacobs, J. R., Hernandez-Boussard, T. 2021: 103802

    Abstract

    BACKGROUND: Unlike well-established diseases that base clinical care on randomized trials, past experiences, and training, prognosis in COVID-19 relies on a weaker foundation. Knowledge from other respiratory failure diseases may inform clinical decisions in this novel disease. The objective was to predict invasive mechanical ventilation (IMV) within 48 hours in patients hospitalized with COVID-19 using COVID-like diseases (CLD).

    METHODS: This retrospective multicenter study trained machine learning (ML) models on patients hospitalized with CLD to predict IMV within 48 hours in COVID-19 patients. CLD patients were identified using diagnosis codes for bacterial pneumonia, viral pneumonia, influenza, unspecified pneumonia and acute respiratory distress syndrome (ARDS), 2008-2019. A total of 16 cohorts were constructed, including any combinations of the four diseases plus an exploratory ARDS cohort, to determine the most appropriate cohort to use. Candidate predictors included demographic and clinical parameters that were previously associated with poor COVID-19 outcomes. Model development included the implementation of logistic regression and three ensemble tree-based algorithms: decision tree, AdaBoost, and XGBoost. Models were validated in hospitalized COVID-19 patients at two healthcare systems, March 2020-July 2020. ML models were trained on CLD patients at Stanford Hospital Alliance (SHA). Models were validated on hospitalized COVID-19 patients at both SHA and Intermountain Healthcare.

    RESULTS: CLD training data were obtained from SHA (n=14,030), and validation data included 444 adult COVID-19 hospitalized patients from SHA (n=185) and Intermountain (n=259). XGBoost was the top-performing ML model, and among the 16 CLD training cohorts, the best model achieved an area under curve (AUC) of 0.883 in the validation set. In COVID-19 patients, the prediction models exhibited moderate discrimination performance, with the best models achieving an AUC of 0.77 at SHA and 0.65 at Intermountain. The model trained on all pneumonia and influenza cohorts had the best overall performance (SHA: positive predictive value (PPV) 0.29, negative predictive value (NPV) 0.97, positive likelihood ratio (PLR) 10.7; Intermountain: PPV 0.23, NPV 0.97, PLR 10.3). We identified important factors associated with IMV that are not traditionally considered for respiratory diseases.

    CONCLUSIONS: Prediction models derived from CLD for 48-hour IMV in patients hospitalized with COVID-19 demonstrate high specificity and can be used as a triage tool at point of care. Novel predictors of IMV identified in COVID-19 are often overlooked in clinical practice. Lessons learned from our approach may assist other research institutes seeking to build artificial intelligence technologies for novel or rare diseases with limited data for training and validation.

    View details for DOI 10.1016/j.jbi.2021.103802

    View details for PubMedID 33965640

  • Learning from past respiratory infections to predict COVID-19 Outcomes: A retrospective study. Journal of medical Internet research Sang, S., Sun, R., Coquet, J., Carmichael, H., Seto, T., Hernandez-Boussard, T. 2021

    Abstract

    In the clinical care of well-established diseases, randomized trials, literature and research are supplemented by clinical judgment to understand disease prognosis and inform treatment choices. In the void created by a lack of clinical experience with COVID-19, Artificial Intelligence (AI) may be an important tool to bolster clinical judgment and decision making. However, lack of clinical data restricts the design and development of such AI tools, particularly in preparation for an impending crisis or pandemic.

    This study aimed to develop and test the feasibility of a 'patients-like-me' framework to predict COVID-19 patient deterioration using a retrospective cohort of similar respiratory diseases.

    Our framework used COVID-like cohorts to design and train AI models that were then validated on the COVID-19 population. The COVID-like cohorts included patients diagnosed with bacterial pneumonia, viral pneumonia, unspecified pneumonia, influenza, and acute respiratory distress syndrome (ARDS) from an academic medical center, 2008-2019. Fifteen training cohorts were created using different combinations of the COVID-like cohorts, with the ARDS cohort included for exploratory purposes. Two machine learning (ML) models were developed, one to predict invasive mechanical ventilation (IMV) within 48 hours for each hospitalized day, and one to predict all-cause mortality at the time of admission. Model performance was assessed using the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). We established model interpretability by calculating SHapley Additive exPlanations (SHAP) scores to identify important features.

    Compared to the COVID-like cohorts (n=16,509), the COVID-19 hospitalized patients (n=159) were significantly younger, with a higher proportion of Hispanic ethnicity, lower proportion of smoking history and fewer comorbidities (P <0.001). COVID-19 patients had a lower IMV rate (15.1 vs 23.2, P=0.016) and shorter time to IMV (2.9 vs 4.1, P <0.001) compared to the COVID-like patients. In the COVID-like training data, the top models achieved excellent performance (AUC > 0.90). Validating in the COVID-19 cohort, the best performing model for predicting IMV was the XGBoost model (AUC: 0.826) trained on the viral pneumonia cohort. Similarly, the XGBoost model trained on all four COVID-like cohorts without ARDS achieved the best performance (AUC: 0.928) in predicting mortality. Important predictors included demographic information (age), vital signs (oxygen saturation), and laboratory values (white blood count, cardiac troponin, albumin, etc.). Our models suffered from class imbalance, which resulted in high negative predictive values and low positive predictive values.

    We provided a feasible framework for modeling patient deterioration using existing data and AI technology to address data limitations during the onset of a novel, rapidly changing pandemic.

    View details for DOI 10.2196/23026

    View details for PubMedID 33534724

  • Using Alias Sampling Strategy Based on Network Embeddings to Detect Protein Complexes IEEE ACCESS Liu, X., Sang, S., Wang, X. 2020; 8: 211773–83