Feng Xie is currently a postdoctoral scholar at Stanford University School of Medicine. He recently graduated with a joint Ph.D. degree from Duke University and the National University of Singapore, and he obtained his bachelor’s degree from Tsinghua University, Beijing, China, in 2017. His research focuses on developing novel informatics methodologies and applying them to various healthcare domains, including children’s health, critical care, and emergency medicine. He has made extensive use of large-scale multimodal data, including electronic health records (EHR), clinical notes, and medical signal data, to address critical healthcare challenges. During his Ph.D. and postdoctoral training, he developed multiple advanced methods and informatics tools, including AutoScore, the MIMIC-IV-ED benchmark, and NeonatalBERT. Several of these tools are used by researchers globally and have been applied to a wide range of clinical applications, including risk prediction and model benchmarking, resulting in dozens of publications by other users. Specifically, the AutoScore software has been downloaded more than 10,000 times from the R CRAN platform, and the original paper has garnered over 70 official citations in about 2 years.

Over 5 years, he published 8 first-author research papers in high-impact journals in the field, with a total impact factor of over 60. His extensive collaborations with clinicians, engineers, and health service researchers also resulted in 12 co-authored papers.

Boards, Advisory Committees, Professional Organizations

  • Associate Editor, Journal of Medical Internet Research (2024 - Present)
  • Student Editorial Board, Journal of Biomedical Informatics (2022 - Present)
  • Reviewer, EBioMedicine (by The Lancet) (2022 - Present)
  • Reviewer, Patterns (by Cell Press) (2022 - Present)
  • Reviewer, BMC Medical Research Methodology (2022 - Present)
  • Reviewer, International Conference on Health Informatics (ICHI) (2023 - Present)
  • Reviewer, BMC Medical Informatics and Decision Making (2022 - Present)
  • Reviewer, International Journal of Medical Informatics (2022 - Present)
  • Reviewer, Mathematical Biosciences and Engineering (2023 - Present)
  • Reviewer, Scientific Reports (2022 - Present)
  • Reviewer, AMIA Annual Symposium (2020 - 2022)
  • Reviewer, PLOS ONE (2019 - 2022)

Professional Education

  • Bachelor of Science, Tsinghua University (2017)
  • Doctor of Philosophy, National University of Singapore (2022)
  • PhD, National University of Singapore / Duke University (2022)

Research Interests

  • Data Sciences
  • Research Methods

All Publications

  • A universal AutoScore framework to develop interpretable scoring systems for predicting common types of clinical outcomes. STAR protocols Xie, F., Ning, Y., Liu, M., Li, S., Saffari, S. E., Yuan, H., Volovici, V., Ting, D. S., Goldstein, B. A., Ong, M. E., Vaughan, R., Chakraborty, B., Liu, N. 2023; 4 (2): 102302


    The AutoScore framework can automatically generate data-driven clinical scores in various clinical applications. Here, we present a protocol for developing clinical scoring systems for binary, survival, and ordinal outcomes using the open-source AutoScore package. We describe steps for package installation, detailed data processing and checking, and variable ranking. We then explain how to iterate through steps for variable selection, score generation, fine-tuning, and evaluation to generate understandable and explainable scoring systems using data-driven evidence and clinical knowledge. For complete details on the use and execution of this protocol, please refer to Xie et al. (2020), Xie et al. (2022), Saffari et al. (2022), and the online tutorial.

    View details for DOI 10.1016/j.xpro.2023.102302

    View details for PubMedID 37178115

  • Benchmarking emergency department prediction models with machine learning and public electronic health records. Scientific data Xie, F., Zhou, J., Lee, J. W., Tan, M., Li, S., Rajnthern, L. S., Chee, M. L., Chakraborty, B., Wong, A. I., Dagan, A., Ong, M. E., Gao, F., Liu, N. 2022; 9 (1): 658


    The demand for emergency department (ED) services is increasing across the globe, particularly during the current COVID-19 pandemic. Clinical triage and risk assessment have become increasingly challenging due to the shortage of medical resources and the strain on hospital infrastructure caused by the pandemic. As a result of the widespread use of electronic health records (EHRs), we now have access to a vast amount of clinical data, which allows us to develop prediction models and decision support systems to address these challenges. To date, there is no widely accepted clinical prediction benchmark related to the ED based on large-scale public EHRs. An open-source benchmark data platform would streamline research workflows by eliminating cumbersome data preprocessing, and facilitate comparisons among different studies and methodologies. Based on the Medical Information Mart for Intensive Care IV Emergency Department (MIMIC-IV-ED) database, we created a benchmark dataset and proposed three clinical prediction benchmarks. This study provides future researchers with insights, suggestions, and protocols for managing data and developing predictive tools for emergency care.

    View details for DOI 10.1038/s41597-022-01782-9

    View details for PubMedID 36302776

  • Development and validation of an interpretable machine learning scoring tool for estimating time to emergency readmissions ECLINICALMEDICINE Xie, F., Liu, N., Yan, L., Ning, Y., Lim, K., Gong, C., Kwan, Y., Ho, A., Low, L., Chakraborty, B., Ong, M. 2022; 45: 101315


    Emergency readmission poses an additional burden on both patients and healthcare systems. Risk stratification is the first step of transitional care interventions targeted at reducing readmission. To accurately predict the short- and intermediate-term risks of readmission and provide information for further temporal risk stratification, we developed and validated an interpretable machine learning risk scoring system. In this retrospective study, all emergency admission episodes from January 1st 2009 to December 31st 2016 at a tertiary hospital in Singapore were assessed. The primary outcome was time to emergency readmission within 90 days post discharge. The Score for Emergency ReAdmission Prediction (SERAP) tool was derived via an interpretable machine learning-based system for time-to-event outcomes. SERAP is a six-variable survival score, and takes the number of emergency admissions last year, age, history of malignancy, history of renal diseases, serum creatinine level, and serum albumin level during index admission into consideration. A total of 293,589 ED admission episodes were finally included in the whole cohort. Among them, 203,748 episodes were included in the training cohort, 50,937 episodes in the validation cohort, and 38,904 in the testing cohort. Readmission within 90 days was documented in 80,213 (27.3%) episodes, with a median time to emergency readmission of 22 days (Interquartile range: 8-47). For different time points, the readmission rates observed in the whole cohort were 6.7% at 7 days, 10.6% at 14 days, 13.6% at 21 days, 16.4% at 30 days, and 23.0% at 60 days. In the testing cohort, the SERAP achieved an integrated area under the curve of 0.737 (95% confidence interval: 0.730-0.743). For a specific 30-day readmission prediction, SERAP outperformed the LACE index (Length of stay, Acuity of admission, Charlson comorbidity index, and Emergency department visits in past six months) and the HOSPITAL score (Hemoglobin at discharge, discharge from an Oncology service, Sodium level at discharge, Procedure during the index admission, Index Type of admission, number of Admissions during the last 12 months, and Length of stay). Besides 30-day readmission, SERAP can predict readmission rates at any time point during the 90-day period. Better performance in risk prediction was achieved by the SERAP than other existing scores, and accurate information about time to emergency readmission was generated for further temporal risk stratification and clinical decision-making. In the future, external validation studies are needed to evaluate the SERAP at different settings and assess its real-world performance. This study was supported by the Singapore National Medical Research Council under the PULSES Center Grant, and Duke-NUS Medical School.

    View details for DOI 10.1016/j.eclinm.2022.101315

    View details for Web of Science ID 000823395500019

    View details for PubMedID 35284804

    View details for PubMedCentralID PMC8904223

  • Deep learning for temporal data representation in electronic health records: A systematic review of challenges and methodologies JOURNAL OF BIOMEDICAL INFORMATICS Xie, F., Yuan, H., Ning, Y., Ong, M., Feng, M., Hsu, W., Chakraborty, B., Liu, N. 2022; 126: 103980


    Temporal electronic health records (EHRs) contain a wealth of information for secondary uses, such as clinical events prediction and chronic disease management. However, challenges exist for temporal data representation. We therefore sought to identify these challenges and evaluate novel methodologies for addressing them through a systematic examination of deep learning solutions. We searched five databases (PubMed, Embase, the Institute of Electrical and Electronics Engineers [IEEE] Xplore Digital Library, the Association for Computing Machinery [ACM] Digital Library, and Web of Science) complemented with hand-searching in several prestigious computer science conference proceedings. We sought articles that reported deep learning methodologies on temporal data representation in structured EHR data from January 1, 2010, to August 30, 2020. We summarized and analyzed the selected articles from three perspectives: nature of time series, methodology, and model implementation. We included 98 articles related to temporal data representation using deep learning. Four major challenges were identified, including data irregularity, heterogeneity, sparsity, and model opacity. We then studied how deep learning techniques were applied to address these challenges. Finally, we discuss some open challenges arising from deep learning. Temporal EHR data present several major challenges for clinical prediction modeling and data utilization. To some extent, current deep learning solutions can address these challenges. Future studies may consider designing comprehensive and integrated solutions. Moreover, researchers should incorporate clinical domain knowledge into study designs and enhance model interpretability to facilitate clinical implementation.

    View details for DOI 10.1016/j.jbi.2021.103980

    View details for Web of Science ID 000767887400004

    View details for PubMedID 34974189

  • AutoScore-Survival: Developing interpretable machine learning-based time-to-event scores with right-censored survival data JOURNAL OF BIOMEDICAL INFORMATICS Xie, F., Ning, Y., Yuan, H., Goldstein, B., Ong, M., Liu, N., Chakraborty, B. 2022; 125: 103959


    Scoring systems are highly interpretable and widely used to evaluate time-to-event outcomes in healthcare research. However, existing time-to-event scores are predominantly created ad hoc using a few manually selected variables based on clinicians' knowledge, suggesting an unmet need for a robust and efficient generic score-generating method. AutoScore was previously developed as an interpretable machine learning score generator, combining the strong discriminability of machine learning with the accessibility of point-based scores. We have further extended it to time-to-event outcomes and developed AutoScore-Survival for generating time-to-event scores with right-censored survival data. Random survival forest provided an efficient solution for selecting variables, and Cox regression was used for score weighting. We implemented our proposed method as an R package. We illustrated our method in a study of 90-day survival prediction for patients in intensive care units and compared its performance with other survival models, the random survival forest, and two traditional clinical scores. The AutoScore-Survival-derived scoring system was more parsimonious than survival models built using traditional variable selection methods (e.g., penalized likelihood approach and stepwise variable selection), and its performance was comparable to survival models using the same set of variables. Although AutoScore-Survival achieved a comparable integrated area under the curve of 0.782 (95% CI: 0.767-0.794), the integer-valued time-to-event scores generated are favorable in clinical applications because they are easier to compute and interpret. Our proposed AutoScore-Survival provides a robust and easy-to-use machine learning-based clinical score generator for studies of time-to-event outcomes. It gives a systematic guideline to facilitate the future development of time-to-event scores for clinical applications.

    View details for DOI 10.1016/j.jbi.2021.103959

    View details for Web of Science ID 000735573800005

    View details for PubMedID 34826628

  • Development and Assessment of an Interpretable Machine Learning Triage Tool for Estimating Mortality After Emergency Admissions JAMA NETWORK OPEN Xie, F., Ong, M., Liew, J., Tan, K., Ho, A., Nadarajan, G., Low, L., Kwan, Y., Goldstein, B., Matchar, D., Chakraborty, B., Liu, N. 2021; 4 (8): e2118467


    Triage in the emergency department (ED) is a complex clinical judgment based on the tacit understanding of the patient's likelihood of survival, availability of medical resources, and local practices. Although a scoring tool could be valuable in risk stratification, currently available scores have demonstrated limitations. To develop an interpretable machine learning tool based on a parsimonious list of variables available at ED triage; provide a simple, early, and accurate estimate of patients' risk of death; and evaluate the tool's predictive accuracy compared with several established clinical scores. This single-site, retrospective cohort study assessed all ED patients between January 1, 2009, and December 31, 2016, who were subsequently admitted to a tertiary hospital in Singapore. The Score for Emergency Risk Prediction (SERP) tool was derived using a machine learning framework. To estimate mortality outcomes after emergency admissions, SERP was compared with several triage systems, including Patient Acuity Category Scale, Modified Early Warning Score, National Early Warning Score, Cardiac Arrest Risk Triage, Rapid Acute Physiology Score, and Rapid Emergency Medicine Score. The initial analyses were completed in October 2020, and additional analyses were conducted in May 2021. Three SERP scores, namely SERP-2d, SERP-7d, and SERP-30d, were developed using the primary outcomes of interest of 2-, 7-, and 30-day mortality, respectively. Secondary outcomes included 3-day mortality and inpatient mortality. The SERP's predictive power was measured using the area under the curve in the receiver operating characteristic analysis. The study included 224 666 ED episodes in the model training cohort (mean [SD] patient age, 63.60 [16.90] years; 113 426 [50.5%] female), 56 167 episodes in the validation cohort (mean [SD] patient age, 63.58 [16.87] years; 28 427 [50.6%] female), and 42 676 episodes in the testing cohort (mean [SD] patient age, 64.85 [16.80] years; 21 556 [50.5%] female). The mortality rates in the training cohort were 0.8% at 2 days, 2.2% at 7 days, and 5.9% at 30 days. In the testing cohort, the areas under the curve of SERP-30d were 0.821 (95% CI, 0.796-0.847) for 2-day mortality, 0.826 (95% CI, 0.811-0.841) for 7-day mortality, and 0.823 (95% CI, 0.814-0.832) for 30-day mortality and outperformed several benchmark scores. In this retrospective cohort study, SERP had better prediction performance than existing triage scores while maintaining easy implementation and ease of ascertainment in the ED. It has the potential to be widely applied and validated in different circumstances and health care settings.

    View details for DOI 10.1001/jamanetworkopen.2021.18467

    View details for Web of Science ID 000689731500001

    View details for PubMedID 34448870

    View details for PubMedCentralID PMC8397930

  • AutoScore: A Machine Learning-Based Automatic Clinical Score Generator and Its Application to Mortality Prediction Using Electronic Health Records JMIR MEDICAL INFORMATICS Xie, F., Chakraborty, B., Ong, M., Goldstein, B., Liu, N. 2020; 8 (10): e21798


    Risk scores can be useful in clinical risk stratification and accurate allocations of medical resources, helping health providers improve patient care. Point-based scores are more understandable and explainable than other complex models and are now widely used in clinical decision making. However, the development of the risk scoring model is nontrivial and has not yet been systematically presented, with few studies investigating methods of clinical score generation using electronic health records. This study aims to propose AutoScore, a machine learning-based automatic clinical score generator consisting of 6 modules for developing interpretable point-based scores. Future users can employ the AutoScore framework to create clinical scores effortlessly in various clinical applications. We proposed the AutoScore framework comprising 6 modules that included variable ranking, variable transformation, score derivation, model selection, score fine-tuning, and model evaluation. To demonstrate the performance of AutoScore, we used data from the Beth Israel Deaconess Medical Center to build a scoring model for mortality prediction and then compared the data with other baseline models using the receiver operating characteristic analysis. A software package in R 3.5.3 (R Foundation) was also developed to demonstrate the implementation of AutoScore. Implemented on the data set with 44,918 individual admission episodes of intensive care, the AutoScore-created scoring models performed comparably to other standard methods (ie, logistic regression, stepwise regression, least absolute shrinkage and selection operator, and random forest) in terms of predictive accuracy and model calibration but required fewer predictors and presented high interpretability and accessibility. The nine-variable, AutoScore-created, point-based scoring model achieved an area under the curve (AUC) of 0.780 (95% CI 0.764-0.798), whereas the model of logistic regression with 24 variables had an AUC of 0.778 (95% CI 0.760-0.795). Moreover, the AutoScore framework also drives the clinical research continuum and automation with its integration of all necessary modules. We developed an easy-to-use, machine learning-based automatic clinical score generator, AutoScore; systematically presented its structure; and demonstrated its superiority (predictive performance and interpretability) over other conventional methods using a benchmark database. AutoScore will emerge as a potential scoring tool in various medical applications.

    View details for DOI 10.2196/21798

    View details for Web of Science ID 000587474400023

    View details for PubMedID 33084589

    View details for PubMedCentralID PMC7641783

  • Novel model for predicting inpatient mortality after emergency admission to hospital in Singapore: retrospective observational study BMJ OPEN Xie, F., Liu, N., Wu, S., Ang, Y., Low, L., Ho, A., Lam, S., Matchar, D., Ong, M., Chakraborty, B. 2019; 9 (9): e031382


    To identify risk factors for inpatient mortality after patients' emergency admission and to create a novel model predicting inpatient mortality risk. This was a retrospective observational study using data extracted from electronic health records (EHRs). The data were randomly split into a derivation set and a validation set. Stepwise model selection was employed. We compared our model with one of the current clinical scores, the Cardiac Arrest Risk Triage (CART) score. A single tertiary hospital in Singapore. All adult hospitalised patients, admitted via emergency department (ED) from 1 January 2008 to 31 October 2017 (n=433 187 by admission episodes). The primary outcome of interest was inpatient mortality following this admission episode. The area under the curve (AUC) of the receiver operating characteristic curve of the predictive model with sensitivity and specificity for optimised cut-offs. Inpatient mortality was observed in 15 758 (3.64%) of the episodes. 19 variables were observed as significant predictors and were included in our final regression model. Our predictive model outperformed the CART score in terms of predictive power. The AUC of CART score and our final model was 0.705 (95% CI 0.697 to 0.714) and 0.817 (95% CI 0.810 to 0.824), respectively. We developed and validated a model for inpatient mortality using EHR data collected in the ED. Our model was more accurate than the CART score. Implementation of our model in the hospital can potentially predict imminent adverse events and institute appropriate clinical management.

    View details for DOI 10.1136/bmjopen-2019-031382

    View details for Web of Science ID 000497787600368

    View details for PubMedID 31558458

    View details for PubMedCentralID PMC6773418

  • Comprehensive overview of the anesthesiology research landscape: A machine Learning Analysis of 737 NIH-funded anesthesiology primary Investigator's publication trends. Heliyon Ghanem, M., Espinosa, C., Chung, P., Reincke, M., Harrison, N., Phongpreecha, T., Shome, S., Saarunya, G., Berson, E., James, T., Xie, F., Shu, C. H., Hazra, D., Mataraso, S., Kim, Y., Seong, D., Chakraborty, D., Studer, M., Xue, L., Marić, I., Chang, A. L., Tjoa, E., Gaudillière, B., Tawfik, V. L., Mackey, S., Aghaeepour, N. 2024; 10 (7): e29050


    Anesthesiology plays a crucial role in perioperative care, critical care, and pain management, impacting patient experiences and clinical outcomes. However, our understanding of the anesthesiology research landscape is limited. Accordingly, we initiated a data-driven analysis through topic modeling to uncover research trends, enabling informed decision-making and fostering progress within the field. The easyPubMed R package was used to collect 32,300 PubMed abstracts spanning from 2000 to 2022. These abstracts were authored by 737 Anesthesiology Principal Investigators (PIs) who were recipients of National Institutes of Health (NIH) funding from 2010 to 2022. Abstracts were preprocessed, vectorized, and analyzed with the state-of-the-art BERTopic algorithm to identify pillar topics and trending subtopics within anesthesiology research. Temporal trends were assessed using the Mann-Kendall test. The publishing journals with the most abstracts in this dataset were Anesthesia & Analgesia (1133), Anesthesiology (992), and Pain (671). Eight pillar topics were identified and categorized as basic or clinical sciences based on a hierarchical clustering analysis. Amongst the pillar topics, "Cells & Proteomics" had both the highest annual and total number of abstracts. Interestingly, there was an overall upward trend for all topics spanning the years 2000-2022. However, when focusing on the period from 2015 to 2022, topics "Cells & Proteomics" and "Pulmonology" exhibit a downward trajectory. Additionally, various subtopics were identified, with notable increasing trends in "Aneurysms", "Covid 19 Pandemic", and "Artificial intelligence & Machine Learning". Our work offers a comprehensive analysis of the anesthesiology research landscape by providing insights into pillar topics and trending subtopics. These findings contribute to a better understanding of anesthesiology research and can guide future directions.

    View details for DOI 10.1016/j.heliyon.2024.e29050

    View details for PubMedID 38623206

    View details for PubMedCentralID PMC11016610

  • Inter hospital external validation of interpretable machine learning based triage score for the emergency department using common data model. Scientific reports Yu, J. Y., Kim, D., Yoon, S., Kim, T., Heo, S., Chang, H., Han, G. S., Jeong, K. W., Park, R. W., Gwon, J. M., Xie, F., Ong, M. E., Ng, Y. Y., Joo, H. J., Cha, W. C. 2024; 14 (1): 6666


    Emergency departments (EDs) are complex, and triage is a main task in the ED, prioritizing limited medical resources for the patients who need them most. A machine learning (ML)-based ED triage tool, the Score for Emergency Risk Prediction (SERP), was previously developed using an interpretable ML framework with data from a single center. We aimed to develop SERP with 3 Korean multicenter cohorts based on a common data model (CDM) without data sharing and to compare performance using an inter-hospital validation design. This retrospective cohort study included all adult emergency visit patients of 3 hospitals in Korea from 2016 to 2017. We adopted the CDM for standardized multicenter research. The outcome of interest was 2-day mortality after the patients' ED visit. We developed a SERP for each hospital using the interpretable ML framework and validated each score across hospitals. We assessed the performance of each hospital's score using several metrics, taking a data imbalance strategy into account. The study population for each hospital included 87,670, 83,363 and 54,423 ED visits from 2016 to 2017. The 2-day mortality rates were 0.51%, 0.56% and 0.65%. Inter-hospital validation showed accurate performance, with an AUROC of at least 0.899 (0.858-0.940). We developed a multicenter-based interpretable ML model using the CDM for 2-day mortality prediction and performed inter-hospital external validation, which showed sufficiently high accuracy.

    View details for DOI 10.1038/s41598-024-54364-7

    View details for PubMedID 38509133

    View details for PubMedCentralID 7340358

  • Corrigendum to "Development and Asian-wide validation of the Grade for Interpretable Field Triage (GIFT) for predicting mortality in pre-hospital patients using the Pan-Asian Trauma Outcomes Study (PATOS)" [The Lancet Regional Health - Western Pacific 34 (2023) 100733]. The Lancet regional health. Western Pacific Yu, J. Y., Heo, S., Xie, F., Liu, N., Yoon, S. Y., Chang, H. S., Kim, T., Lee, S. U., Hock Ong, M. E., Ng, Y. Y., Do Shin, S., Kajino, K., Chiang, W. C., Cha, W. C. 2024; 44: 100996


    [This corrects the article DOI: 10.1016/j.lanwpc.2023.100733.].

    View details for DOI 10.1016/j.lanwpc.2023.100996

    View details for PubMedID 38532823

    View details for PubMedCentralID PMC10964468

  • FedScore: A Privacy-Preserving Framework for Federated Scoring System Development. Journal of biomedical informatics Li, S., Ning, Y., Eng Hock Ong, M., Chakraborty, B., Hong, C., Xie, F., Yuan, H., Liu, M., Buckland, D. M., Chen, Y., Liu, N. 2023: 104485


    We propose FedScore, a privacy-preserving federated learning framework for scoring system generation across multiple sites to facilitate cross-institutional collaborations. The FedScore framework includes five modules: federated variable ranking, federated variable transformation, federated score derivation, federated model selection and federated model evaluation. To illustrate usage and assess FedScore's performance, we built a hypothetical global scoring system for mortality prediction within 30 days after a visit to an emergency department using 10 simulated sites divided from a tertiary hospital in Singapore. We employed a pre-existing score generator to construct 10 local scoring systems independently at each site and we also developed a scoring system using centralized data for comparison. We compared the acquired FedScore model's performance with that of other scoring models using the receiver operating characteristic (ROC) analysis. The FedScore model achieved an average area under the curve (AUC) value of 0.763 across all sites, with a standard deviation (SD) of 0.020. We also calculated the average AUC values and SDs for each local model, and the FedScore model showed promising accuracy and stability with a high average AUC value which was closest to the one of the pooled model and SD which was lower than that of most local models. This study demonstrates that FedScore is a privacy-preserving scoring system generator with potentially good generalizability.

    View details for DOI 10.1016/j.jbi.2023.104485

    View details for PubMedID 37660960

  • Federated and distributed learning applications for electronic health records and structured medical data: a scoping review. Journal of the American Medical Informatics Association : JAMIA Li, S., Liu, P., Nascimento, G. G., Wang, X., Leite, F. R., Chakraborty, B., Hong, C., Ning, Y., Xie, F., Teo, Z. L., Ting, D. S., Haddadi, H., Ong, M. E., Peres, M. A., Liu, N. 2023


    Federated learning (FL) has gained popularity in clinical research in recent years to facilitate privacy-preserving collaboration. Structured data, one of the most prevalent forms of clinical data, has experienced significant growth in volume concurrently, notably with the widespread adoption of electronic health records in clinical practice. This review examines FL applications on structured medical data, identifies contemporary limitations, and discusses potential innovations. We searched 5 databases, SCOPUS, MEDLINE, Web of Science, Embase, and CINAHL, to identify articles that applied FL to structured medical data and reported results following the PRISMA guidelines. Each selected publication was evaluated from 3 primary perspectives, including data quality, modeling strategies, and FL frameworks. Out of the 1193 papers screened, 34 met the inclusion criteria, with each article consisting of one or more studies that used FL to handle structured clinical/medical data. Of these, 24 utilized data acquired from electronic health records, with clinical predictions and association studies being the most common clinical research tasks that FL was applied to. Only one article exclusively explored the vertical FL setting, while the remaining 33 explored the horizontal FL setting, with only 14 discussing comparisons between single-site (local) and FL (global) analysis. The existing FL applications on structured medical data lack sufficient evaluations of clinically meaningful benefits, particularly when compared to single-site analyses. Therefore, it is crucial for future FL applications to prioritize clinical motivations and develop designs and methodologies that can effectively support and aid clinical practice and research.

    View details for DOI 10.1093/jamia/ocad170

    View details for PubMedID 37639629

  • Handling missing values in healthcare data: A systematic review of deep learning-based imputation techniques. Artificial intelligence in medicine Liu, M., Li, S., Yuan, H., Ong, M. E., Ning, Y., Xie, F., Saffari, S. E., Shang, Y., Volovici, V., Chakraborty, B., Liu, N. 2023; 142: 102587


    The proper handling of missing values is critical to delivering reliable estimates and decisions, especially in high-stakes fields such as clinical research. In response to the increasing diversity and complexity of data, many researchers have developed deep learning (DL)-based imputation techniques. We conducted a systematic review to evaluate the use of these techniques, with a particular focus on the types of data, intending to assist healthcare researchers from various disciplines in dealing with missing data. We searched five databases (MEDLINE, Web of Science, Embase, CINAHL, and Scopus) for articles published prior to February 8, 2023 that described the use of DL-based models for imputation. We examined selected articles from four perspectives: data types, model backbones (i.e., main architectures), imputation strategies, and comparisons with non-DL-based methods. Based on data types, we created an evidence map to illustrate the adoption of DL models. Out of 1822 articles, a total of 111 were included, of which tabular static data (29%, 32/111) and temporal data (40%, 44/111) were the most frequently investigated. Our findings revealed a discernible pattern in the choice of model backbones and data types, for example, the dominance of autoencoder and recurrent neural networks for tabular temporal data. The discrepancy in imputation strategy usage among data types was also observed. The "integrated" imputation strategy, which solves the imputation task simultaneously with downstream tasks, was most popular for tabular temporal data (52%, 23/44) and multi-modal data (56%, 5/9). Moreover, DL-based imputation methods yielded a higher level of imputation accuracy than non-DL methods in most studies. The DL-based imputation models are a family of techniques, with diverse network structures. Their designation in healthcare is usually tailored to data types with different characteristics. Although DL-based imputation models may not be superior to conventional approaches across all datasets, it is highly possible for them to achieve satisfactory results for a particular data type or dataset. There are, however, still issues with regard to portability, interpretability, and fairness associated with current DL-based imputation models.

    View details for DOI 10.1016/j.artmed.2023.102587

    View details for PubMedID 37316097

  • Development and Asian-wide validation of the Grade for Interpretable Field Triage (GIFT) for predicting mortality in pre-hospital patients using the Pan-Asian Trauma Outcomes Study (PATOS). The Lancet regional health. Western Pacific Yu, J. Y., Heo, S., Xie, F., Liu, N., Yoon, S. Y., Chang, H. S., Kim, T., Lee, S. U., Hock Ong, M. E., Ng, Y. Y., Do Shin, S., Kajino, K., Cha, W. C. 2023; 34: 100733


    Background: Field triage is critical in injury patients as the appropriate transport of patients to trauma centers is directly associated with clinical outcomes. Several prehospital triage scores have been developed in Western and European cohorts; however, their validity and applicability in Asia remains unclear. Therefore, we aimed to develop and validate an interpretable field triage scoring systems based on a multinational trauma registry in Asia. Methods: This retrospective and multinational cohort study included all adult transferred injury patients from Korea, Malaysia, Vietnam, and Taiwan between 2016 and 2018. The outcome of interest was a death in the emergency department (ED) after the patients' ED visit. Using these results, we developed the interpretable field triage score with the Korea registry using an interpretable machine learning framework and validated the score externally. The performance of each country's score was assessed using the area under the receiver operating characteristic curve (AUROC). Furthermore, a website for real-world application was developed using R Shiny. Findings: The study population included 26,294, 9404, 673 and 826 transferred injury patients between 2016 and 2018 from Korea, Malaysia, Vietnam, and Taiwan, respectively. The corresponding rates of a death in the ED were 0.30%, 0.60%, 4.0%, and 4.6% respectively. Age and vital sign were found to be the significant variables for predicting mortality. External validation showed the accuracy of the model with an AUROC of 0.756-0.850. Interpretation: The Grade for Interpretable Field Triage (GIFT) score is an interpretable and practical tool to predict mortality in field triage for trauma. Funding: This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (Grant Number: HI19C1328).

    View details for DOI 10.1016/j.lanwpc.2023.100733

    View details for PubMedID 37283981

  • Leveraging electronic health records to identify risk factors for recurrent pregnancy loss across two medical centers: a case-control study. Research square Roger, J., Xie, F., Costello, J., Tang, A., Liu, J., Oskotsky, T., Woldemariam, S., Kosti, I., Le, B., Snyder, M. P., Giudice, L. C., Torgerson, D., Shaw, G. M., Stevenson, D. K., Rajkovic, A., Glymour, M. M., Aghaeepour, N., Cakmak, H., Lathi, R. B., Sirota, M. 2023


    Recurrent pregnancy loss (RPL), defined as 2 or more pregnancy losses, affects 5-6% of ever-pregnant individuals. Approximately half of these cases have no identifiable explanation. To generate hypotheses about RPL etiologies, we implemented a case-control study comparing the history of over 1,600 diagnoses between RPL and live-birth patients, leveraging the University of California San Francisco (UCSF) and Stanford University electronic health record databases. In total, our study included 8,496 RPL (UCSF: 3,840, Stanford: 4,656) and 53,278 Control (UCSF: 17,259, Stanford: 36,019) patients. Menstrual abnormalities and infertility-associated diagnoses were significantly positively associated with RPL in both medical centers. Age-stratified analysis revealed that the majority of RPL-associated diagnoses had higher odds ratios for patients <35 compared with 35+ patients. While Stanford results were sensitive to control for healthcare utilization, UCSF results were stable across analyses with and without utilization. Intersecting significant results between medical centers was an effective filter to identify associations that are robust across center-specific utilization patterns.

    View details for DOI 10.21203/

    View details for PubMedID 36993325

    View details for PubMedCentralID PMC10055527

  • AutoScore-Ordinal: an interpretable machine learning framework for generating scoring models for ordinal outcomes. BMC medical research methodology Saffari, S. E., Ning, Y., Xie, F., Chakraborty, B., Volovici, V., Vaughan, R., Ong, M. E., Liu, N. 2022; 22 (1): 286


    BACKGROUND: Risk prediction models are useful tools in clinical decision-making which help with risk stratification and resource allocations and may lead to a better health care for patients. AutoScore is a machine learning-based automatic clinical score generator for binary outcomes. This study aims to expand the AutoScore framework to provide a tool for interpretable risk prediction for ordinal outcomes. METHODS: The AutoScore-Ordinal framework is generated using the same 6 modules of the original AutoScore algorithm including variable ranking, variable transformation, score derivation (from proportional odds models), model selection, score fine-tuning, and model evaluation. To illustrate the AutoScore-Ordinal performance, the method was conducted on electronic health records data from the emergency department at Singapore General Hospital over 2008 to 2017. The model was trained on 70% of the data, validated on 10% and tested on the remaining 20%. RESULTS: This study included 445,989 inpatient cases, where the distribution of the ordinal outcome was 80.7% alive without 30-day readmission, 12.5% alive with 30-day readmission, and 6.8% died inpatient or by day 30 post discharge. Two point-based risk prediction models were developed using two sets of 8 predictor variables identified by the flexible variable selection procedure. The two models indicated reasonably good performance measured by mean area under the receiver operating characteristic curve (0.758 and 0.793) and generalized c-index (0.737 and 0.760), which were comparable to alternative models. CONCLUSION: AutoScore-Ordinal provides an automated and easy-to-use framework for development and validation of risk prediction models for ordinal outcomes, which can systematically identify potential predictors from high-dimensional data.

    View details for DOI 10.1186/s12874-022-01770-y

    View details for PubMedID 36333672

  • An external validation study of the Score for Emergency Risk Prediction (SERP), an interpretable machine learning-based triage score for the emergency department. Scientific reports Yu, J. Y., Xie, F., Nan, L., Yoon, S., Ong, M. E., Ng, Y. Y., Cha, W. C. 2022; 12 (1): 17466


    Emergency departments (EDs) are experiencing complex demands. An ED triage tool, the Score for Emergency Risk Prediction (SERP), was previously developed using an interpretable machine learning framework. It achieved a good performance in the Singapore population. We aimed to externally validate the SERP in a Korean cohort for all ED patients and compare its performance with Korean triage acuity scale (KTAS). This retrospective cohort study included all adult ED patients of Samsung Medical Center from 2016 to 2020. The outcomes were 30-day and in-hospital mortality after the patients' ED visit. We used the area under the receiver operating characteristic curve (AUROC) to assess the performance of the SERP and other conventional scores, including KTAS. The study population included 285,523 ED visits, of which 53,541 were after the COVID-19 outbreak (2020). In the whole cohort, the in-hospital and 30-day mortality rates were 1.60% and 3.80%, respectively. The SERP achieved an AUROC of 0.821 and 0.803, outperforming KTAS of 0.679 and 0.729 for in-hospital and 30-day mortality, respectively. SERP was superior to other scores for in-hospital and 30-day mortality prediction in an external validation cohort. SERP is a generic, intuitive, and effective triage tool to stratify general patients who present to the emergency department.

    View details for DOI 10.1038/s41598-022-22233-w

    View details for PubMedID 36261457

  • A novel interpretable machine learning system to generate clinical risk scores: An application for predicting early mortality or unplanned readmission in a retrospective cohort study. PLOS digital health Ning, Y., Li, S., Ong, M. E., Xie, F., Chakraborty, B., Ting, D. S., Liu, N. 2022; 1 (6): e0000062


    Risk scores are widely used for clinical decision making and commonly generated from logistic regression models. Machine-learning-based methods may work well for identifying important predictors to create parsimonious scores, but such 'black box' variable selection limits interpretability, and variable importance evaluated from a single model can be biased. We propose a robust and interpretable variable selection approach using the recently developed Shapley variable importance cloud (ShapleyVIC) that accounts for variability in variable importance across models. Our approach evaluates and visualizes overall variable contributions for in-depth inference and transparent variable selection, and filters out non-significant contributors to simplify model building steps. We derive an ensemble variable ranking from variable contributions across models, which is easily integrated with an automated and modularized risk score generator, AutoScore, for convenient implementation. In a study of early death or unplanned readmission after hospital discharge, ShapleyVIC selected 6 variables from 41 candidates to create a well-performing risk score, which had similar performance to a 16-variable model from machine-learning-based ranking. Our work contributes to the recent emphasis on interpretability of prediction models for high-stakes decision making, providing a disciplined solution to detailed assessment of variable importance and transparent development of parsimonious clinical risk scores.

    View details for DOI 10.1371/journal.pdig.0000062

    View details for PubMedID 36812536

  • Development and validation of an interpretable clinical score for early identification of acute kidney injury at the emergency department. Scientific reports Ang, Y., Li, S., Ong, M., Xie, F., Teo, S., Choong, L., Koniman, R., Chakraborty, B., Ho, A., Liu, N. 2022; 12 (1): 7111


    Acute kidney injury (AKI) in hospitalised patients is a common syndrome associated with poorer patient outcomes. Clinical risk scores can be used for the early identification of patients at risk of AKI. We conducted a retrospective study using electronic health records of Singapore General Hospital emergency department patients who were admitted from 2008 to 2016. The primary outcome was inpatient AKI of any stage within 7 days of admission based on the Kidney Disease Improving Global Outcome (KDIGO) 2012 guidelines. A machine learning-based framework AutoScore was used to generate clinical scores from the study sample which was randomly divided into training, validation and testing cohorts. Model performance was evaluated using area under the curve (AUC). Among the 119,468 admissions, 10,693 (9.0%) developed AKI. 8491 were stage 1 (79.4%), 906 stage 2 (8.5%) and 1296 stage 3 (12.1%). The AKI Risk Score (AKI-RiSc) was a summation of the integer scores of 6 variables: serum creatinine, serum bicarbonate, pulse, systolic blood pressure, diastolic blood pressure, and age. AUC of AKI-RiSc was 0.730 (95% CI 0.714-0.747), outperforming an existing AKI Prediction Score model which achieved AUC of 0.665 (95% CI 0.646-0.679) on the testing cohort. At a cut-off of 4 points, AKI-RiSc had a sensitivity of 82.6% and specificity of 46.7%. AKI-RiSc is a simple clinical score that can be easily implemented on the ground for early identification of AKI and potentially be applied in international settings.

    View details for DOI 10.1038/s41598-022-11129-4

    View details for Web of Science ID 000789854100016

    View details for PubMedID 35501411

    View details for PubMedCentralID PMC9061747

  • AutoScore-Imbalance: An interpretable machine learning tool for development of clinical scores with rare events data. Journal of biomedical informatics Yuan, H., Xie, F., Ong, M., Ning, Y., Chee, M., Saffari, S., Abdullah, H., Goldstein, B., Chakraborty, B., Liu, N. 2022; 129: 104072


    Medical decision-making impacts both individual and public health. Clinical scores are commonly used among various decision-making models to determine the degree of disease deterioration at the bedside. AutoScore was proposed as a useful clinical score generator based on machine learning and a generalized linear model. However, its current framework still leaves room for improvement when addressing unbalanced data of rare events. Using machine intelligence approaches, we developed AutoScore-Imbalance, which comprises three components: training dataset optimization, sample weight optimization, and adjusted AutoScore. Baseline techniques for performance comparison included the original AutoScore, full logistic regression, stepwise logistic regression, least absolute shrinkage and selection operator (LASSO), full random forest, and random forest with a reduced number of variables. These models were evaluated based on their area under the curve (AUC) in the receiver operating characteristic analysis and balanced accuracy (i.e., mean value of sensitivity and specificity). By utilizing a publicly accessible dataset from Beth Israel Deaconess Medical Center, we assessed the proposed model and baseline approaches to predict inpatient mortality. AutoScore-Imbalance outperformed baselines in terms of AUC and balanced accuracy. The nine-variable AutoScore-Imbalance sub-model achieved the highest AUC of 0.786 (0.732-0.839), while the eleven-variable original AutoScore obtained an AUC of 0.723 (0.663-0.783), and the logistic regression with 21 variables obtained an AUC of 0.743 (0.685-0.801). The AutoScore-Imbalance sub-model (using a down-sampling algorithm) yielded an AUC of 0.771 (0.718-0.823) with only five variables, demonstrating a good balance between performance and variable sparsity. Furthermore, AutoScore-Imbalance obtained the highest balanced accuracy of 0.757 (0.702-0.805), compared to 0.698 (0.643-0.753) by the original AutoScore and the maximum of 0.720 (0.664-0.769) by other baseline models. We have developed an interpretable tool to handle clinical data imbalance, presented its structure, and demonstrated its superiority over baselines. The AutoScore-Imbalance tool can be applied to highly unbalanced datasets to gain further insight into rare medical events and facilitate real-world clinical decision-making.

    View details for DOI 10.1016/j.jbi.2022.104072

    View details for Web of Science ID 000794840600004

    View details for PubMedID 35421602

  • Leveraging Large-Scale Electronic Health Records and Interpretable Machine Learning for Clinical Decision Making at the Emergency Department: Protocol for System Development and Validation. JMIR research protocols Liu, N., Xie, F., Siddiqui, F., Ho, A., Chakraborty, B., Nadarajan, G., Tan, K., Ong, M. 2022; 11 (3): e34201


    There is a growing demand globally for emergency department (ED) services. An increase in ED visits has resulted in overcrowding and longer waiting times. The triage process plays a crucial role in assessing and stratifying patients' risks and ensuring that the critically ill promptly receive appropriate priority and emergency treatment. A substantial amount of research has been conducted on the use of machine learning tools to construct triage and risk prediction models; however, the black box nature of these models has limited their clinical application and interpretation. In this study, we plan to develop an innovative, dynamic, and interpretable System for Emergency Risk Triage (SERT) for risk stratification in the ED by leveraging large-scale electronic health records (EHRs) and machine learning. To achieve this objective, we will conduct a retrospective, single-center study based on a large, longitudinal data set obtained from the EHRs of the largest tertiary hospital in Singapore. Study outcomes include adverse events experienced by patients, such as the need for an intensive care unit and inpatient death. With preidentified candidate variables drawn from expert opinions and relevant literature, we will apply an interpretable machine learning-based AutoScore to develop 3 SERT scores. These 3 scores can be used at different times in the ED, that is, on arrival, during ED stay, and at admission. Furthermore, we will compare our novel SERT scores with established clinical scores and previously described black box machine learning models as baselines. Receiver operating characteristic analysis will be conducted on the testing cohorts for performance evaluation. The study is currently being conducted. The extracted data indicate approximately 1.8 million ED visits by over 810,000 unique patients. Modelling results are expected to be published in 2022. The SERT scoring system proposed in this study will be unique and innovative because of its dynamic nature and modelling transparency. If successfully validated, our proposed solution will establish a standard for data processing and modelling by taking advantage of large-scale EHRs and interpretable machine learning tools. International Registered Report Identifier (IRRID): DERR1-10.2196/34201.

    View details for DOI 10.2196/34201

    View details for Web of Science ID 000779979500009

    View details for PubMedID 35333179

  • External validation of the Survival After ROSC in Cardiac Arrest (SARICA) score for predicting survival after return of spontaneous circulation using multinational pan-asian cohorts. Frontiers in medicine Rajendram, M. F., Zarisfi, F., Xie, F., Shahidah, N., Pek, P. P., Yeo, J. W., Tan, B. Y., Ma, M., Do Shin, S., Tanaka, H., Ong, M. E., Liu, N., Ho, A. F. 2022; 9: 930226


    Aim: Accurate and timely prognostication of patients with out-of-hospital cardiac arrest (OHCA) who attain return of spontaneous circulation (ROSC) is crucial in clinical decision-making, resource allocation, and communication with family. A clinical decision tool, Survival After ROSC in Cardiac Arrest (SARICA), was recently developed, showing excellent performance on internal validation. We aimed to externally validate SARICA in multinational cohorts within the Pan-Asian Resuscitation Outcomes Study. Materials and methods: This was an international, retrospective cohort study of patients who attained ROSC after OHCA in the Asia Pacific between January 2009 and August 2018. Pediatric (age <18 years) and traumatic arrests were excluded. The SARICA score was calculated for each patient. The primary outcome was survival. We used receiver operating characteristics (ROC) analysis to calculate the model performance of the SARICA score in predicting survival. A calibration belt plot was used to assess calibration. Results: Out of 207,450 cases of OHCA, 24,897 cases from Taiwan, Japan and South Korea were eligible for inclusion. Of this validation cohort, 30.4% survived. The median SARICA score was 4. Area under the ROC curve (AUC) was 0.759 (95% confidence interval, CI 0.753-0.766) for the total population. A higher AUC was observed in subgroups that received bystander CPR (AUC 0.791, 95% CI 0.782-0.801) and of presumed cardiac etiology (AUC 0.790, 95% CI 0.782-0.797). The model was well-calibrated. Conclusion: This external validation study of SARICA demonstrated high model performance in a multinational Pan-Asian cohort. Further modification and validation in other populations can be performed to assess its readiness for clinical translation.

    View details for DOI 10.3389/fmed.2022.930226

    View details for PubMedID 36160129

  • Heart rate n-variability (HRnV) and its application to risk stratification of chest pain patients in the emergency department. BMC cardiovascular disorders Liu, N., Guo, D., Koh, Z. X., Ho, A. F., Xie, F., Tagami, T., Sakamoto, J. T., Pek, P. P., Chakraborty, B., Lim, S. H., Tan, J. W., Ong, M. E. 2020; 20 (1): 168


    Chest pain is one of the most common complaints among patients presenting to the emergency department (ED). Causes of chest pain can be benign or life threatening, making accurate risk stratification a critical issue in the ED. In addition to the use of established clinical scores, prior studies have attempted to create predictive models with heart rate variability (HRV). In this study, we proposed heart rate n-variability (HRnV), an alternative representation of beat-to-beat variation in electrocardiogram (ECG), and investigated its association with major adverse cardiac events (MACE) in ED patients with chest pain. We conducted a retrospective analysis of data collected from the ED of a tertiary hospital in Singapore between September 2010 and July 2015. Patients > 20 years old who presented to the ED with chief complaint of chest pain were conveniently recruited. Five to six-minute single-lead ECGs, demographics, medical history, troponin, and other required variables were collected. We developed the HRnV-Calc software to calculate HRnV parameters. The primary outcome was 30-day MACE, which included all-cause death, acute myocardial infarction, and revascularization. Univariable and multivariable logistic regression analyses were conducted to investigate the association between individual risk factors and the outcome. Receiver operating characteristic (ROC) analysis was performed to compare the HRnV model (based on leave-one-out cross-validation) against other clinical scores in predicting 30-day MACE. A total of 795 patients were included in the analysis, of which 247 (31%) had MACE within 30 days. The MACE group was older, with a higher proportion being male patients. Twenty-one conventional HRV and 115 HRnV parameters were calculated. In univariable analysis, eleven HRV and 48 HRnV parameters were significantly associated with 30-day MACE. The multivariable stepwise logistic regression identified 16 predictors that were strongly associated with MACE outcome; these predictors consisted of one HRV, seven HRnV parameters, troponin, ST segment changes, and several other factors. The HRnV model outperformed several clinical scores in the ROC analysis. The novel HRnV representation demonstrated its value of augmenting HRV and traditional risk factors in designing a robust risk stratification tool for patients with chest pain in the ED.

    View details for DOI 10.1186/s12872-020-01455-8

    View details for PubMedID 32276602