I received my PhD from Dalian University of Technology (China), where I worked in the DUTIR team (information retrieval, natural language processing, data mining). My research has focused on literature-based discovery: mining new knowledge from biomedical literature. In the Boussard Lab, my research aims to establish novel strategies for analyzing electronic health records to improve clinical decisions. Specifically, I am working on the Cerebrospinal Fluid (CSF) Leak project, whose goal is to improve the diagnosis of CSF leaks through artificial intelligence methodologies.

Stanford Advisors

  • Lei Xing, Postdoctoral Faculty Sponsor

All Publications

  • A Scalable Embedding Based Neural Network Method for Discovering Knowledge From Biomedical Literature IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Sang, S., Liu, X., Chen, X., Zhao, D. 2022; 19 (3): 1294-1301


    Nowadays, the amount of biomedical literature is growing at an explosive speed, and much useful knowledge in it remains undiscovered. Classical information retrieval techniques allow access to explicit information in a given collection, but cannot recognize implicit connections. Literature-based discovery (LBD) is characterized by uncovering hidden associations in non-interacting literature. It could significantly support scientific research by identifying new connections between biomedical entities. However, most existing approaches to LBD are not scalable and may not be sufficient to detect complex associations in non-directly-connected literature. In this article, we present a model which incorporates a biomedical knowledge graph, graph embedding, and deep learning methods for literature-based discovery. First, the relations between biomedical entities are extracted from biomedical abstracts, and a knowledge graph is constructed from these relations. Second, graph embedding techniques are applied to convert the entities and relations in the knowledge graph into a low-dimensional vector space. Third, a bidirectional Long Short-Term Memory (BLSTM) network is trained on the entity associations represented by the pre-trained graph embeddings. Finally, the learned model is used for open and closed literature-based discovery tasks. The experimental results show that our method not only effectively discovers hidden associations between entities, but also reveals the corresponding mechanism of interactions. This suggests that combining knowledge graphs with deep learning methods is an effective way to capture the underlying complex associations between entities hidden in the literature.

    View details for DOI 10.1109/TCBB.2020.3003947

    View details for Web of Science ID 000805807200006

    View details for PubMedID 32750871

  • Type 1 Diabetes Management With Technology: Patterns of Utilization and Effects on Glucose Control Using Real-World Evidence. Clinical diabetes : a publication of the American Diabetes Association Sun, R., Banerjee, I., Sang, S., Joseph, J., Schneider, J., Hernandez-Boussard, T. 2021; 39 (3): 284-292


    This retrospective cohort study evaluated diabetes device utilization and the effectiveness of these devices for newly diagnosed type 1 diabetes. Investigators examined the use of continuous glucose monitoring (CGM) systems, self-monitoring of blood glucose (SMBG), continuous subcutaneous insulin infusion (CSII), and multiple daily injection (MDI) insulin regimens and their effects on A1C. The researchers identified 6,250 patients with type 1 diabetes, of whom 32% used CGM and 37.1% used CSII. A higher adoption rate of either CGM or CSII in newly diagnosed type 1 diabetes was noted among White patients and those with private health insurance. CGM users had lower A1C levels than nonusers (P = 0.039), whereas no difference was noted between CSII users and nonusers (P = 0.057). Furthermore, CGM use combined with CSII yielded lower A1C than MDI regimens plus SMBG (P <0.001).

    View details for DOI 10.2337/cd20-0098

    View details for PubMedID 34421204

  • Geometric resistant polar quaternion discrete Fourier transform and its application in color image zero-hiding. ISA transactions Wang, C., Ma, B., Xia, Z., Li, J., Li, Q., Liu, X., Sang, S. 2021


    As a typical frequency-domain analysis method, quaternion discrete Fourier transform (QDFT) has been widely used in information hiding in color images. However, due to the sensitivity of QDFT to geometric attacks, existing QDFT-based information hiding schemes have limited ability in resisting geometric attacks. In this study, a kind of novel geometrically resilient polar QDFT (PQDFT) is constructed and the properties of the proposed PQDFT are analyzed. Subsequently, a PQDFT-based color image zero-hiding scheme robust to geometric attacks is proposed for lossless copyright protection of color images, which experimentally shows reasonable resistance against geometric and common attacks, indicating better robustness compared with the existing QDFT-based information hiding schemes and other leading-edge zero-hiding schemes.

    View details for DOI 10.1016/j.isatra.2021.06.019

    View details for PubMedID 34176603

  • Learning from Past Respiratory Failure Patients to Triage COVID-19 Patient Ventilator Needs: A Multi-Institutional Study. Journal of biomedical informatics Carmichael, H., Coquet, J., Sun, R., Sang, S., Groat, D., Asch, S. M., Bledsoe, J., Peltan, I. D., Jacobs, J. R., Hernandez-Boussard, T. 2021: 103802


    BACKGROUND: Unlike well-established diseases that base clinical care on randomized trials, past experiences, and training, prognosis in COVID-19 relies on a weaker foundation. Knowledge from other respiratory failure diseases may inform clinical decisions in this novel disease. The objective was to predict invasive mechanical ventilation (IMV) within 48 hours in patients hospitalized with COVID-19 using COVID-like diseases (CLD).

    METHODS: This retrospective multicenter study trained machine learning (ML) models on patients hospitalized with CLD to predict IMV within 48 hours in COVID-19 patients. CLD patients were identified using diagnosis codes for bacterial pneumonia, viral pneumonia, influenza, unspecified pneumonia and acute respiratory distress syndrome (ARDS), 2008-2019. A total of 16 cohorts were constructed, including any combinations of the four diseases plus an exploratory ARDS cohort, to determine the most appropriate cohort to use. Candidate predictors included demographic and clinical parameters that were previously associated with poor COVID-19 outcomes. Model development included the implementation of logistic regression and three ensemble tree-based algorithms: decision tree, AdaBoost, and XGBoost. Models were validated in hospitalized COVID-19 patients at two healthcare systems, March 2020-July 2020. ML models were trained on CLD patients at Stanford Hospital Alliance (SHA). Models were validated on hospitalized COVID-19 patients at both SHA and Intermountain Healthcare.

    RESULTS: CLD training data were obtained from SHA (n=14,030), and validation data included 444 adult COVID-19 hospitalized patients from SHA (n=185) and Intermountain (n=259). XGBoost was the top-performing ML model, and among the 16 CLD training cohorts, the best model achieved an area under curve (AUC) of 0.883 in the validation set. In COVID-19 patients, the prediction models exhibited moderate discrimination performance, with the best models achieving an AUC of 0.77 at SHA and 0.65 at Intermountain. The model trained on all pneumonia and influenza cohorts had the best overall performance (SHA: positive predictive value (PPV) 0.29, negative predictive value (NPV) 0.97, positive likelihood ratio (PLR) 10.7; Intermountain: PPV 0.23, NPV 0.97, PLR 10.3). We identified important factors associated with IMV that are not traditionally considered for respiratory diseases.

    CONCLUSIONS: The performance of prediction models derived from CLD for 48-hour IMV in patients hospitalized with COVID-19 demonstrates high specificity, and the models can be used as a triage tool at point of care. Novel predictors of IMV identified in COVID-19 are often overlooked in clinical practice. Lessons learned from our approach may assist other research institutes seeking to build artificial intelligence technologies for novel or rare diseases with limited data for training and validation.

    View details for DOI 10.1016/j.jbi.2021.103802

    View details for PubMedID 33965640

  • Learning from past respiratory infections to predict COVID-19 Outcomes: A retrospective study. Journal of medical Internet research Sang, S., Sun, R., Coquet, J., Carmichael, H., Seto, T., Hernandez-Boussard, T. 2021


    In the clinical care of well-established diseases, randomized trials, literature and research are supplemented by clinical judgment to understand disease prognosis and inform treatment choices. In the void created by a lack of clinical experience with COVID-19, Artificial Intelligence (AI) may be an important tool to bolster clinical judgment and decision making. However, lack of clinical data restricts the design and development of such AI tools, particularly in preparation for an impending crisis or pandemic. This study aimed to develop and test the feasibility of a 'patients-like-me' framework to predict COVID-19 patient deterioration using a retrospective cohort of similar respiratory diseases. Our framework used COVID-like cohorts to design and train AI models that were then validated on the COVID-19 population. The COVID-like cohorts included patients diagnosed with bacterial pneumonia, viral pneumonia, unspecified pneumonia, influenza, and acute respiratory distress syndrome (ARDS) from an academic medical center, 2008-2019. Fifteen training cohorts were created using different combinations of the COVID-like cohorts with the ARDS cohort for exploratory purposes. Two machine learning (ML) models were developed, one to predict invasive mechanical ventilation (IMV) within 48 hours for each hospitalized day, and one to predict all-cause mortality at the time of admission. Model performance was assessed using the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). We established model interpretability by calculating SHapley Additive exPlanations (SHAP) scores to identify important features. Compared to the COVID-like cohorts (n=16,509), the COVID-19 hospitalized patients (n=159) were significantly younger, with a higher proportion of Hispanic ethnicity, lower proportion of smoking history and fewer comorbidities (P <0.001). COVID-19 patients had a lower IMV rate (15.1 vs 23.2, P=0.016) and shorter time to IMV (2.9 vs 4.1, P <0.001) compared to the COVID-like patients. In the COVID-like training data, the top models achieved excellent performance (AUC > 0.90). Validating in the COVID-19 cohort, the best performing model for predicting IMV was the XGBoost model (AUC: 0.826) trained on the viral pneumonia cohort. Similarly, the XGBoost model trained on all four COVID-like cohorts without ARDS achieved the best performance (AUC: 0.928) in predicting mortality. Important predictors included demographic information (age), vital signs (oxygen saturation), and laboratory values (white blood count, cardiac troponin, albumin, etc.). Our models suffered from class imbalance, which resulted in high negative predictive values and low positive predictive values. We provided a feasible framework for modeling patient deterioration using existing data and AI technology to address data limitations during the onset of a novel, rapidly changing pandemic.

    View details for DOI 10.2196/23026

    View details for PubMedID 33534724

  • Using Alias Sampling Strategy Based on Network Embeddings to Detect Protein Complexes IEEE ACCESS Liu, X., Sang, S., Wang, X. 2020; 8: 211773–83