Bio


François Grolleau, MD, MPH, PhD, is a Postdoctoral Scholar at the Stanford Center for Biomedical Informatics Research. His research centers on developing and evaluating computational systems that use large language models and other advanced methods from statistics and machine learning to assist medical decision-making.

François is a certified Anesthesiologist and Critical Care Medicine specialist from France. He holds an MPH degree and a PhD in Biostatistics from Paris Cité University. In 2016/2017, he worked as a research fellow in the Department of Health Research Methods, Evidence, and Impact at McMaster University, Canada (Profs Yannick Le Manach and Gordon Guyatt). During his doctorate with Prof. Raphaël Porcher, he utilized causal inference, personalized medicine methods, and statistical reinforcement learning for medical applications in the ICU.

Professional Education


  • Fellowship, Centre for Research in Epidemiology and Statistics (2023)
  • PhD, Paris Cité University (2023)
  • Board Certification, French Board of Anesthesiology and Critical Care Medicine (2019)
  • Residency, University of Caen Normandy, Critical Care Medicine, Anesthesiology, and Nephrology (2019)
  • MPH, Paris Descartes University (2017)
  • MD, Toulouse III - Paul Sabatier University (2013)

All Publications


  • Retrieval-Augmented Guardrails for AI-Drafted Patient-Portal Messages: Error Taxonomy Construction and Large-Scale Evaluation. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Chen, W., Haredasht, F. N., Black, K. C., Grolleau, F., Alsentzer, E., Chen, J. H., Ma, S. P. 2026; 31: 189-204

    Abstract

    Asynchronous patient-clinician messaging via EHR portals is a growing source of clinician workload, prompting interest in large language models (LLMs) to assist with draft responses. However, LLM outputs may contain clinical inaccuracies, omissions, or tone mismatches, making robust evaluation essential. Our contributions are threefold: (1) we introduce a clinically grounded error ontology comprising 5 domains and 59 granular error codes, developed through inductive coding and expert adjudication; (2) we develop a Retrieval-Augmented Error Checking (RAEC) pipeline that leverages semantically similar historical message-response pairs to improve judgment quality; and (3) we provide a two-stage prompting architecture using DSPy to enable scalable, interpretable, and hierarchical error detection. Our approach assesses the quality of drafts both in isolation and with reference to similar past message-response pairs retrieved from institutional archives. Using a two-stage DSPy pipeline, we compared baseline and reference-enhanced evaluations on over 1,500 patient messages. Retrieval context improved error identification in domains such as clinical completeness and workflow appropriateness. Human validation on 100 messages demonstrated superior agreement (concordance = 50% vs. 33%) and performance (F1 = 0.500 vs. 0.256) of context-enhanced labels vs. baseline, supporting the use of our RAEC pipeline as AI guardrails for patient messaging.

    View details for DOI 10.1142/9789819824755_0014

    View details for PubMedID 41758142

  • A New His-Ventricular Threshold for Myotonic Dystrophy Type 1-Reply. JAMA cardiology Grolleau, F., Porcher, R., Wahbi, K. 2026

    View details for DOI 10.1001/jamacardio.2025.5630

    View details for PubMedID 41739447

  • Holistic evaluation of large language models for medical tasks with MedHELM. Nature medicine Bedi, S., Cui, H., Fuentes, M., Unell, A., Wornow, M., Banda, J. M., Kotecha, N., Keyes, T., Mai, Y., Oez, M., Qiu, H., Jain, S., Schettini, L., Kashyap, M., Fries, J. A., Swaminathan, A., Chung, P., Haredasht, F. N., Lopez, I., Aali, A., Tse, G., Nayak, A., Vedak, S., Jain, S. S., Patel, B., Fayanju, O., Shah, S., Goh, E., Yao, D. H., Soetikno, B., Reis, E., Gatidis, S., Divi, V., Capasso, R., Saralkar, R., Chiang, C. C., Jindal, J., Pham, T., Ghoddusi, F., Lin, S., Chiou, A. S., Hong, H. J., Roy, M., Gensheimer, M. F., Patel, H., Schulman, K., Dash, D., Char, D., Downing, L., Grolleau, F., Black, K., Mieso, B., Zahedivash, A., Yim, W. W., Sharma, H., Lee, T., Kirsch, H., Lee, J., Ambers, N., Lugtu, C., Sharma, A., Mawji, B., Alekseyev, A., Zhou, V., Kakkar, V., Helzer, J., Revri, A., Bannett, Y., Daneshjou, R., Chen, J., Alsentzer, E., Morse, K., Ravi, N., Aghaeepour, N., Kennedy, V., Chaudhari, A., Wang, T., Koyejo, S., Lungren, M. P., Horvitz, E., Liang, P., Pfeffer, M. A., Shah, N. H. 2026

    Abstract

    While large language models (LLMs) achieve near-perfect scores on medical licensing exams, these evaluations inadequately reflect the complexity and diversity of real-world clinical practice. Here we introduce MedHELM, an extensible evaluation framework with three contributions. First, a clinician-validated taxonomy organizing medical AI applications into five categories that mirror real clinical tasks: clinical decision support (diagnostic decisions, treatment planning), clinical note generation (visit documentation, procedure reports), patient communication (education materials, care instructions), medical research (literature analysis, clinical data analysis) and administration (scheduling, workflow coordination). These encompass 22 subcategories and 121 specific tasks reflecting daily medical practice. Second, a comprehensive benchmark suite of 37 evaluations covering all subcategories. Third, systematic comparison of nine frontier LLMs (Claude 3.5 Sonnet, Claude 3.7 Sonnet, DeepSeek R1, Gemini 1.5 Pro, Gemini 2.0 Flash, GPT-4o, GPT-4o mini, Llama 3.3 and o3-mini) using an automated LLM-jury evaluation method. Our LLM-jury uses multiple AI evaluators to assess model outputs against expert-defined criteria. Advanced reasoning models (DeepSeek R1, o3-mini) demonstrated superior performance with win rates of 66%, although Claude 3.5 Sonnet achieved comparable results at 15% lower computational cost. These results not only highlight current model capabilities but also demonstrate how MedHELM could enable evidence-based selection of medical AI systems for healthcare applications.

    View details for DOI 10.1038/s41591-025-04151-2

    View details for PubMedID 41559415

    View details for PubMedCentralID 10916499

  • powerROC: An Interactive Web Tool for Sample Size Calculation in Assessing Models' Discriminative Abilities. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science Grolleau, F., Tibshirani, R., Chen, J. H. 2025; 2025: 196-204

    Abstract

    Rigorous external validation is crucial for assessing the generalizability of prediction models, particularly by evaluating their discrimination (AUROC) on new data. This often involves comparing a new model's AUROC to that of an established reference model. However, many studies rely on arbitrary rules of thumb for sample size calculations, often resulting in underpowered analyses and unreliable conclusions. This paper reviews crucial concepts for accurate sample size determination in AUROC-based external validation studies, making the theory and practice more accessible to researchers and clinicians. We introduce powerROC, an open-source web tool designed to simplify these calculations, enabling both the evaluation of a single model and the comparison of two models. The tool offers guidance on selecting target precision levels and employs flexible approaches, leveraging either pilot data or user-defined probability distributions. We illustrate powerROC's utility through a case study on hospital mortality prediction using the MIMIC database.

    View details for PubMedID 40502274

    View details for PubMedCentralID PMC12150715

  • Systematic Exploration of Hospital Cost Variability: A Conformal Prediction-Based Outlier Detection Method for Electronic Health Records. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science Grolleau, F., Goh, E., Ma, S. P., Masterson, J., Ross, T., Milstein, A., Chen, J. H. 2025; 2025: 187-195

    Abstract

    Marked variability in inpatient hospitalization costs poses significant challenges to healthcare quality, resource allocation, and patient outcomes. Traditional methods like Diagnosis-Related Groups (DRGs) aid in cost management but lack practical solutions for enhancing hospital care value. We introduce a novel methodology for outlier detection in Electronic Health Records (EHRs) using Conformal Prediction. This approach identifies and prioritizes areas for optimizing high-value care processes. Unlike conventional predictive models that neglect uncertainty, our method employs Conformal Quantile Regression (CQR) to generate robust prediction intervals, offering a comprehensive view of cost variability. By integrating Conformal Prediction with machine learning models, healthcare professionals can more accurately pinpoint opportunities for quality and efficiency improvements. Our framework systematically evaluates unexplained hospital cost variations and generates interpretable hypotheses for refining clinical practices associated with atypical costs. This data-driven approach offers a systematic method to generate clinically sound hypotheses that may inform processes to enhance care quality and optimize resource utilization.

    View details for PubMedID 40502259

    View details for PubMedCentralID PMC12150741

  • Monitoring strategies for continuous evaluation of deployed clinical prediction models. Journal of biomedical informatics Kim, G. Y., Corbin, C. K., Grolleau, F., Baiocchi, M., Chen, J. H. 2025: 104854

    Abstract

    OBJECTIVE: As machine learning adoption in clinical practice continues to grow, deployed classifiers must be continuously monitored and updated (retrained) to protect against data drift that stems from inevitable changes, including evolving medical practices and shifting patient populations. However, successful clinical machine learning classifiers will lead to a change in care which may change the distribution of features, labels, and their relationship. For example, "high risk" cases that were correctly identified by the model may ultimately get labeled as "low risk" thanks to an intervention prompted by the model's alert. Classifier surveillance systems naive to such deployment-induced feedback loops will estimate lower model performance and lead to degraded future classifier retrains. The objective of this study is to simulate the impact of these feedback loops, propose feedback-aware monitoring strategies as a solution, and assess the performance of these alternative monitoring strategies through simulations. METHODS: We propose Adherence Weighted and Sampling Weighted Monitoring as two feedback loop-aware surveillance strategies. Through simulation we evaluate their ability to accurately appraise post-deployment model performance and to initiate safe and accurate classifier retraining. RESULTS: Measured across accuracy, area under the receiver operating characteristic curve, average precision, Brier score, expected calibration error, F1, precision, sensitivity, and specificity, in the presence of feedback loops, Adherence Weighted and Sampling Weighted strategies have the highest fidelity to the ground truth classifier performance while standard approaches yield the most inaccurate estimations. Furthermore, in simulations with true data drift, retraining using standard unweighted approaches results in an AUROC score of 0.52 (drop from 0.72). In contrast, retraining based on Adherence Weighted and Sampling Weighted strategies recovers performance to 0.67, which is comparable to what a new model trained from scratch on the existing and shifted data would obtain. CONCLUSION: Compared to standard approaches, Adherence Weighted and Sampling Weighted strategies yield more accurate classifier performance estimates, measured according to the no-treatment potential outcome. Retraining based on these strategies brings stronger performance recovery when tested against data drift and feedback loops than do standard approaches.

    View details for DOI 10.1016/j.jbi.2025.104854

    View details for PubMedID 40482691

  • Right Patient, Right Specialist, Right Time: Retrieval Augmented Generation for Specialty Referral Routing. AMIA ... Annual Symposium proceedings. AMIA Symposium Haredasht, F. N., Goh, E., Ravi, V., Ashtari, P., Jiang, Y., Yuldashev, N., Grolleau, F., Gallo, R. J., Shah, A., Hur, E., Chopra, K., Jee, O., Lee, J. J., Rosengaus, L., Giang, L., Schulman, K., Hom, J., Milstein, A., Ng, A. Y., Chen, J. H. 2024; 2024: 443-450

    Abstract

    We present an embedding-based retrieval system that automatically directs physician clinical questions to the most relevant specialist-curated question template, which is necessary for the specialist to provide a clinically relevant response. The system utilizes MPNet, a transformer-based model, to generate dense vector representations of both clinical queries and 24 predefined clinical templates. Given a clinical question, the system computes cosine similarity between the query and template embeddings to retrieve the most relevant matches. When validated against real-world, retrospective eConsults across five specialties, the system accurately identified the most relevant template in 87% of cases (success@1) and included it in the top three results 99% of the time (success@3). Automating specialty selection and clinical question referrals reduces the administrative burden on physicians, minimizes care delivery delays, and improves specialist responses by providing proper context.

    View details for PubMedID 41726438

    View details for PubMedCentralID PMC12919621
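    The retrieval step described in the abstract above (embed the query and the templates, rank templates by cosine similarity, score with success@k) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors are made-up toy embeddings standing in for MPNet outputs, and the function names `cosine`, `top_k`, and `success_at_k` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, templates, k):
    """Indices of the k templates most similar to the query, best first."""
    ranked = sorted(range(len(templates)),
                    key=lambda i: cosine(query, templates[i]),
                    reverse=True)
    return ranked[:k]

def success_at_k(queries, templates, gold, k):
    """Fraction of queries whose gold template appears in the top k."""
    hits = sum(g in top_k(q, templates, k) for q, g in zip(queries, gold))
    return hits / len(queries)

# Toy embeddings (assumed values; real ones would come from an encoder).
templates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.5, 0.6, 0.0]]
gold = [0, 1, 0]  # index of the correct template for each query

s1 = success_at_k(queries, templates, gold, k=1)
s2 = success_at_k(queries, templates, gold, k=2)
```

    In this toy setup the third query ranks its gold template second, so it counts as a miss for success@1 and a hit for success@2.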

  • Personalizing renal replacement therapy initiation in the intensive care unit: a reinforcement learning-based strategy with external validation on the AKIKI randomized controlled trials JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Grolleau, F., Petit, F., Gaudry, S., Diard, E., Quenot, J., Dreyfuss, D., Tran, V., Porcher, R. 2024; 31 (5): 1074-1083

    Abstract

    The timely initiation of renal replacement therapy (RRT) for acute kidney injury (AKI) requires sequential decision-making tailored to individuals' evolving characteristics. To learn and validate optimal strategies for RRT initiation, we used reinforcement learning on clinical data from routine care and randomized controlled trials. We used the MIMIC-III database for development and AKIKI trials for validation. Participants were adult ICU patients with severe AKI receiving mechanical ventilation or catecholamine infusion. We used a doubly robust estimator to learn when to start RRT after the occurrence of severe AKI for three days in a row. We developed a "crude strategy" maximizing the population-level hospital-free days at day 60 (HFD60) and a "stringent strategy" recommending RRT when there is significant evidence of benefit for an individual. For validation, we evaluated the causal effects of implementing our learned strategies versus following current best practices on HFD60. We included 3748 patients in the development set and 1068 in the validation set. Through external validation, the crude and stringent strategies yielded an average difference of 13.7 [95% CI -5.3 to 35.7] and 14.9 [95% CI -3.2 to 39.2] HFD60, respectively, compared to current best practices. The stringent strategy led to initiating RRT within 3 days in 14% of patients versus 38% under best practices. Implementing our strategies could improve the average number of days that ICU patients spend alive and outside the hospital while sparing RRT for many. We developed and validated a practical and interpretable dynamic decision support system for RRT initiation in the ICU.

    View details for DOI 10.1093/jamia/ocae004

    View details for Web of Science ID 001180151600001

    View details for PubMedID 38452293

    View details for PubMedCentralID PMC11031229

  • Authors' Reply: Predicting Kidney Response to Plasma Exchange in ANCA-Associated Vasculitis: Need for Plausible Models JOURNAL OF THE AMERICAN SOCIETY OF NEPHROLOGY Nezam, D., Porcher, R., Grolleau, F., Terrier, B., French Vasculitis Study Grp 2022; 33 (6): 1224-1225

    View details for DOI 10.1681/ASN.2022030269

    View details for Web of Science ID 000790248800001

    View details for PubMedID 35410877

    View details for PubMedCentralID PMC9161786

  • Continuous renal replacement therapy versus intermittent hemodialysis as first modality for renal replacement therapy in severe acute kidney injury: a secondary analysis of AKIKI and IDEAL-ICU studies CRITICAL CARE Gaudry, S., Grolleau, F., Barbar, S., Martin-Lefevre, L., Pons, B., Boulet, E., Boyer, A., Chevrel, G., Montini, F., Bohe, J., Badie, J., Rigaud, J., Vinsonneau, C., Porcher, R., Quenot, J., Dreyfuss, D. 2022; 26 (1): 93

    Abstract

    Intermittent hemodialysis (IHD) and continuous renal replacement therapy (CRRT) are the two main RRT modalities in patients with severe acute kidney injury (AKI). Meta-analyses conducted more than 10 years ago did not show a survival difference between these two modalities. As the quality of RRT delivery has improved since then, we aimed to reassess whether the choice of IHD or CRRT as first modality affects survival of patients with severe AKI. This is a secondary analysis of two multicenter randomized controlled trials (AKIKI and IDEAL-ICU) that compared an early RRT initiation strategy with a delayed one. We included patients allocated to the early strategy in order to emulate a trial where patients would have been randomized to receive either IHD or CRRT within twelve hours after the documentation of severe AKI. We determined each patient's modality group as the first RRT modality they received. The primary outcome was 60-day overall survival. We used two propensity score methods to balance the differences in baseline characteristics between groups and the primary analysis relied on inverse probability of treatment weighting. A total of 543 patients were included. Continuous RRT was the first modality in 269 patients and IHD in 274. Patients receiving CRRT had higher cardiovascular and total-SOFA scores. Inverse probability weighting adequately balanced groups on all predefined confounders. The weighted Kaplan-Meier death rate at day 60 was 54.4% in the CRRT group and 46.5% in the IHD group (weighted HR 1.26, 95% CI 1.01-1.60). In a complementary analysis of less severely ill patients (SOFA score: 3-10), receiving IHD was associated with better day 60 survival compared to CRRT (weighted HR 1.82, 95% CI 1.01-3.28; p < 0.01). We found no evidence of a survival difference between the two RRT modalities in more severe patients. Compared to IHD, CRRT as first modality seemed to convey no benefit in terms of survival or of kidney recovery and might even have been associated with less favorable outcome in patients with lesser severity of disease. A prospective randomized non-inferiority trial should be implemented to solve the persistent conundrum of the optimal RRT technique.

    View details for DOI 10.1186/s13054-022-03955-9

    View details for Web of Science ID 000778125800002

    View details for PubMedID 35379300

    View details for PubMedCentralID PMC8981658

  • Personalization of renal replacement therapy initiation: a secondary analysis of the AKIKI and IDEAL-ICU trials CRITICAL CARE Grolleau, F., Porcher, R., Barbar, S., Hajage, D., Bourredjem, A., Quenot, J., Dreyfuss, D., Gaudry, S. 2022; 26 (1): 64

    Abstract

    Trials comparing early and delayed strategies of renal replacement therapy in patients with severe acute kidney injury may have missed differences in survival as a result of mixing together patients at heterogeneous levels of risk. Our aim was to evaluate the heterogeneity of treatment effect on 60-day mortality from an early vs a delayed strategy across levels of risk for renal replacement therapy initiation under a delayed strategy. We used data from the AKIKI and IDEAL-ICU randomized controlled trials to develop a multivariable logistic regression model for renal replacement therapy initiation within 48 h after allocation to a delayed strategy. We then used an interaction with spline terms in a Cox model to estimate treatment effects across the predicted risks of RRT initiation. We analyzed data from 1107 patients (619 and 488 in the AKIKI and IDEAL-ICU trials, respectively). In the pooled sample, we found evidence for heterogeneous treatment effects (P = 0.023). Patients at an intermediate-high risk of renal replacement therapy initiation within 48 h may have benefited from an early strategy (absolute risk difference, -14%; 95% confidence interval, -27% to -1%). For other patients, we found no evidence of benefit from an early strategy of renal replacement therapy initiation but a trend for harm (absolute risk difference, 8%; 95% confidence interval, -5% to 21% in patients at intermediate-low risk). We have identified a clinically sound heterogeneity of treatment effect of an early vs a delayed strategy of renal replacement therapy initiation that may reflect varying degrees of kidney demand-capacity mismatch.

    View details for DOI 10.1186/s13054-022-03936-y

    View details for Web of Science ID 000771483500004

    View details for PubMedID 35313942

    View details for PubMedCentralID PMC8939225

  • Kidney Histopathology Can Predict Kidney Function in ANCA-Associated Vasculitides with Acute Kidney Injury Treated with Plasma Exchanges JOURNAL OF THE AMERICAN SOCIETY OF NEPHROLOGY Nezam, D., Porcher, R., Grolleau, F., Morel, P., Titeca-Beauport, D., Faguer, S., Karras, A., Solignac, J., Jourde-Chiche, N., Maurier, F., Sakhi, H., El Karoui, K., Mesbah, R., Carron, P., Audard, V., Ducloux, D., Paule, R., Augusto, J., Aniort, J., Tiple, A., Rafat, C., Beaudreuil, S., Puechal, X., Gobert, P., Massy, Z., Hanrotel, C., Bally, S., Martis, N., Durel, C., Desbuissons, G., Godmer, P., Hummel, A., Perrin, F., Neel, A., De Moreuil, C., Goulenok, T., Guerrot, D., Grange, S., Foucher, A., Deroux, A., Cordonnier, C., Guilbeau-Frugier, C., Modesto-Segonds, A., Nochy, D., Daniel, L., Moktefi, A., Rabant, M., Guillevin, L., Regent, A., Terrier, B., French Vasculitis Study Grp 2022; 33 (3): 628-637

    Abstract

    Data from the PEXIVAS trial challenged the role of plasma exchange (PLEX) in ANCA-associated vasculitides (AAV). We aimed to describe kidney biopsy from patients with AAV treated with PLEX, evaluate whether histopathologic findings could predict kidney function, and identify which patients would most benefit from PLEX. We performed a multicenter, retrospective study on 188 patients with AAV and AKI treated with PLEX and 237 not treated with PLEX. The primary outcome was mortality or KRT at 12 months (M12). No significant benefit of PLEX for the primary outcome was found. To identify patients benefitting from PLEX, we developed a model predicting the average treatment effect of PLEX for an individual depending on covariables. Using the prediction model, 223 patients had a better predicted outcome with PLEX than without PLEX, and 177 of them had >5% increased predicted probability with PLEX compared with without PLEX of being alive and free from KRT at M12, which defined the PLEX-recommended group. Risk difference for death or KRT at M12 was significantly lower with PLEX in the PLEX-recommended group (-15.9%; 95% CI, -29.4 to -2.5) compared with the PLEX not recommended group (-4.8%; 95% CI, -14.9 to 5.3). Microscopic polyangiitis, MPO-ANCA, higher serum creatinine, crescentic and sclerotic classes, and higher Brix score were more frequent in the PLEX-recommended group. An easy-to-use score identified patients who would benefit from PLEX. The average treatment effect of PLEX for those with recommended treatment corresponded to an absolute risk reduction for death or KRT at M12 of 24.6%. PLEX was not associated with a better primary outcome in the whole study population, but we identified a subset of patients who could benefit from PLEX. However, these findings must be validated before being used in clinical decision making.

    View details for DOI 10.1681/ASN.2021060771

    View details for Web of Science ID 000749400000001

    View details for PubMedID 35074934

    View details for PubMedCentralID PMC8975074

  • Delayed Cerebral Ischemia After Subarachnoid Hemorrhage: Is There a Relevant Experimental Model? A Systematic Review of Preclinical Literature FRONTIERS IN CARDIOVASCULAR MEDICINE Goursaud, S., de Lizarrondo, S., Grolleau, F., Chagnot, A., Agin, V., Maubert, E., Gauberti, M., Vivien, D., Ali, C., Gakuba, C. 2021; 8: 752769

    Abstract

    Delayed cerebral ischemia (DCI) is one of the main prognostic factors for disability after aneurysmal subarachnoid hemorrhage (SAH). The lack of a consensual definition for DCI had limited investigation and care in humans until 2010, when a multidisciplinary research expert group proposed to define DCI as the occurrence of cerebral infarction (identified on imaging or histology) associated with clinical deterioration. We performed a systematic review to assess whether preclinical models of SAH meet this definition, focusing on the combination of noninvasive imaging and neurological deficits. To this aim, we searched the PubMed database and included all rodent SAH models that considered cerebral ischemia and/or neurological outcome and/or vasospasm. Seventy-eight publications were included. Eight different methods were used to induce SAH, with blood injection in the cisterna magna being the most widely used (n = 39, 50%). Vasospasm was the most investigated SAH-related complication (n = 52, 67%) compared to cerebral ischemia (n = 30, 38%), which was never investigated with imaging. Neurological deficits were also explored (n = 19, 24%). This systematic review shows that no preclinical SAH model meets the 2010 clinical definition of DCI, highlighting the inconsistencies between preclinical and clinical standards. In order to enhance research and favor translation to humans, pertinent SAH animal models reproducing DCI are urgently needed.

    View details for DOI 10.3389/fcvm.2021.752769

    View details for Web of Science ID 000726140600001

    View details for PubMedID 34869659

    View details for PubMedCentralID PMC8634441

  • Research response to COVID-19 needed better coordination and collaboration: a living mapping of registered trials. Journal of clinical epidemiology Nguyen, V. T., Rivière, P., Ripoll, P., Barnier, J., Vuillemot, R., Ferrand, G., Cohen-Boulkia, S., Ravaud, P., Boutron, I. 2020

    Abstract

    Researchers worldwide are actively engaging in research activities to search for preventive and therapeutic interventions against COVID-19. Our aim was to describe the planning of randomized controlled trials (RCTs) in terms of timing related to the course of the COVID-19 epidemic and research question evaluated. We performed a living mapping of RCTs registered in the WHO International Clinical Trials Registry Platform. We systematically searched the platform every week for all RCTs evaluating preventive interventions and treatments for COVID-19 and created a publicly available interactive mapping tool at https://covid-nma.com to visualize all trials registered. By August 12, 2020, 1,568 trials for COVID-19 were registered worldwide. Overall, the median ([Q1-Q3]; range) delay between the first case recorded in each country and the first RCT registered was 47 days ([33-67]; 15-163). For the 9 countries with the highest number of trials registered, most trials were registered after the peak of the epidemic (from 100% of trials in Italy to 38% in the United States). Most trials evaluated treatments (1,333 trials; 85%); only 223 (14%) evaluated preventive strategies and 12 evaluated post-acute period interventions. A total of 254 trials were planned to assess different regimens of hydroxychloroquine with an expected sample size of 110,883 patients. This living mapping analysis showed that COVID-19 trials have relatively small sample sizes with some redundancy in research questions. Most trials were registered after the first peak of the pandemic had passed.

    View details for DOI 10.1016/j.jclinepi.2020.10.010

    View details for PubMedID 33096223

    View details for PubMedCentralID PMC7575422

  • Fold-stratified cross-validation for unbiased and privacy-preserving federated learning JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Bey, R., Goussault, R., Grolleau, F., Benchoufi, M., Porcher, R. 2020; 27 (8): 1244-1251

    Abstract

    We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs). Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs in folds containing patients with similar characteristics, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in the case of a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records. In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome. Although fold-stratified cross-validation presents low computational overhead, to be efficient it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient's date of birth provides such a covariate. On the contrary, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates. Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.

    View details for DOI 10.1093/jamia/ocaa096

    View details for Web of Science ID 000584507600008

    View details for PubMedID 32620945

    View details for PubMedCentralID PMC7647321

  • The Fragility and Reliability of Conclusions of Anesthesia and Critical Care Randomized Trials With Statistically Significant Findings: A Systematic Review CRITICAL CARE MEDICINE Grolleau, F., Collins, G. S., Smarandache, A., Pirracchio, R., Gakuba, C., Boutron, I., Busse, J. W., Devereaux, P. J., Le Manach, Y. 2019; 47 (3): 456-462

    Abstract

    The Fragility Index, which represents the number of patients responsible for a statistically significant finding, has been suggested as an aid for interpreting the robustness of results from clinical trials. A small Fragility Index indicates that the statistical significance of a trial depends on only a few events. Our objectives were to calculate the Fragility Index of statistically significant results from randomized controlled trials of anesthesia and critical care interventions and to determine the frequency of distorted presentation of results or "spin". We systematically searched MEDLINE from January 01, 2007, to February 22, 2017, to identify randomized controlled trials exploring the effect of critical care medicine or anesthesia interventions. Studies were included if they randomized patients 1:1 into two parallel arms and reported at least one statistically significant (p < 0.05) binary outcome (primary or secondary). Two reviewers independently assessed eligibility and extracted data. The Fragility Index was determined for the chosen outcome. We assessed the level of spin in negative trials and the presence of recommendations for clinical practice in positive trials. We identified 166 eligible randomized controlled trials with a median sample size of 207 patients (interquartile range, 109-497). The median Fragility Index was 3 (interquartile range, 1-7), which means that adding three events to one of the trial's treatment arms eliminated its statistical significance. High spin was identified in 42% (n = 30) of negative randomized controlled trials, whereas 21% (n = 20) of positive randomized controlled trials provided recommendations. Lower levels of spin and recommendations were associated with publication in journals with high impact factors (p < 0.001 for both). Statistically significant results in anesthesia and critical care randomized controlled trials are often fragile, and study conclusions are frequently affected by spin. Routine calculation of the Fragility Index in medical literature may allow for better understanding of trials and therefore enhance the quality of reporting.
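
    The Fragility Index calculation can be sketched as follows. This is a minimal illustration rather than the study's code: it assumes the commonly used procedure of converting non-events to events one at a time in the arm with fewer events and re-running a two-sided Fisher's exact test until p reaches the 0.05 threshold. The function names and the example counts are hypothetical.

```python
from math import comb


def fisher_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are trial arms; columns are (events, non-events). The p-value
    sums the probabilities of all tables with the same margins that are
    no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))  # tolerance for float ties
    return min(1.0, total)


def fragility_index(events_a, n_a, events_b, n_b, alpha=0.05):
    """Number of added events needed to lose statistical significance.

    Returns None when the original result is not significant at `alpha`.
    Events are added to the arm with fewer events (arm A on ties),
    holding each arm's sample size fixed.
    """
    if fisher_two_sided(events_a, n_a - events_a, events_b, n_b - events_b) >= alpha:
        return None
    if events_a <= events_b:
        low_e, low_n, high_e, high_n = events_a, n_a, events_b, n_b
    else:
        low_e, low_n, high_e, high_n = events_b, n_b, events_a, n_a
    added = 0
    while low_e < low_n:
        low_e += 1
        added += 1
        if fisher_two_sided(low_e, low_n - low_e, high_e, high_n - high_e) >= alpha:
            break
    return added


# Hypothetical trial: 1/100 events in one arm versus 15/100 in the other.
fi = fragility_index(1, 100, 15, 100)
```

    Under this definition, a Fragility Index of 3 means the third added event is the first to push the p-value to 0.05 or above.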

    View details for DOI 10.1097/CCM.0000000000003527

    View details for Web of Science ID 000458886600032

    View details for PubMedID 30394920