Stanford Advisees

  • Doctoral Dissertation Reader (AC)
    Saba Eskandarian, Sadjad Fouladi, Yilong Li, Yawen Wang
  • Postdoctoral Faculty Sponsor
    Fiodar Kazhamiaka
  • Doctoral Dissertation Advisor (AC)
    Firas Abuzaid, Deepak Narayanan
  • Master's Program Advisor
    Nikhil Athreya, Danny Cho, Silvia Gong, Dominick Hing, Arjun Kunnasagaran, Advay Pal, Julius Stener, Emily Yang
  • Doctoral Dissertation Co-Advisor (AC)
    Lingjiao Chen, Daniel Kang, Omar Khattab, James Thomas
  • Doctoral (Program)
    Cody Coleman, Trevor Gale, Peter Kraft, Deepak Narayanan, Deepti Raghavan, Keshav Santhanam, Pratiksha Thaker, Gina Yuan

All Publications

  • Machine Learned Cellular Phenotypes Predict Outcome in Ischemic Cardiomyopathy. Circulation research Rogers, A. J., Selvalingam, A., Alhusseini, M. I., Krummen, D. E., Corrado, C., Abuzaid, F., Baykaner, T., Meyer, C., Clopton, P., Giles, W. R., Bailis, P., Niederer, S. A., Wang, P. J., Rappel, W., Zaharia, M., Narayan, S. M. 2020


    RATIONALE: Susceptibility to ventricular arrhythmias (VT/VF) is difficult to predict in patients with ischemic cardiomyopathy either by clinical tools or by attempting to translate cellular mechanisms to the bedside. OBJECTIVE: To develop computational phenotypes of patients with ischemic cardiomyopathy, by training then interpreting machine learning (ML) of ventricular monophasic action potentials (MAPs) to reveal phenotypes that predict long-term outcomes. METHODS AND RESULTS: We recorded 5706 ventricular MAPs in 42 patients with coronary disease (CAD) and left ventricular ejection fraction (LVEF) ≤40% during steady-state pacing. Patients were randomly allocated to independent training and testing cohorts in a 70:30 ratio, repeated K=10 fold. Support vector machines (SVM) and convolutional neural networks (CNN) were trained to 2 endpoints: (i) sustained VT/VF or (ii) mortality at 3 years. SVM provided superior classification. For patient-level predictions, we computed personalized MAP scores as the proportion of MAP beats predicting each endpoint. Patient-level predictions in independent test cohorts yielded c-statistics of 0.90 for sustained VT/VF (95% CI: 0.76-1.00) and 0.91 for mortality (95% CI: 0.83-1.00) and were the most significant multivariate predictors. Interpreting trained SVM revealed MAP morphologies that, using in silico modeling, revealed higher L-type calcium current or sodium calcium exchanger as predominant phenotypes for VT/VF. CONCLUSIONS: Machine learning of action potential recordings in patients revealed novel phenotypes for long-term outcomes in ischemic cardiomyopathy. Such computational phenotypes provide an approach which may reveal cellular mechanisms for clinical outcomes and could be applied to other conditions.

    DOI: 10.1161/CIRCRESAHA.120.317345

    PubMed ID: 33167779

  • DIFF: a relational interface for large-scale data explanation VLDB JOURNAL Abuzaid, F., Kraft, P., Suri, S., Gan, E., Xu, E., Shenoy, A., Ananthanarayan, A., Sheu, J., Meijer, E., Wu, X., Naughton, J., Bailis, P., Zaharia, M. 2020
  • Machine Learning to Classify Intracardiac Electrical Patterns during Atrial Fibrillation. Circulation. Arrhythmia and electrophysiology Alhusseini, M. I., Abuzaid, F., Rogers, A. J., Zaman, J. A., Baykaner, T., Clopton, P., Bailis, P., Zaharia, M., Wang, P. J., Rappel, W., Narayan, S. M. 2020


    Background - Advances in ablation for atrial fibrillation (AF) continue to be hindered by ambiguities in mapping, even between experts. We hypothesized that convolutional neural networks (CNN) may enable objective analysis of intracardiac activation in AF, which could be applied clinically if CNN classifications could also be explained. Methods - We performed panoramic recording of bi-atrial electrical signals in AF. We used the Hilbert transform to produce 175,000 image grids in 35 patients, labeled for rotational activation by experts who showed consistency but with variability (kappa=0.79). In each patient, ablation terminated AF. A CNN was developed and trained on 100,000 AF image grids, validated on 25,000 grids, then tested on a separate 50,000 grids. Results - In the separate test cohort (50,000 grids), CNN reproducibly classified AF image grids into those with/without rotational sites with 95.0% accuracy (CI 94.8-95.2%). This accuracy exceeded that of support vector machines, traditional linear discriminant and k-nearest neighbor statistical analyses. To probe the CNN, we applied Gradient-weighted Class Activation Mapping which revealed that the decision logic closely mimicked rules used by experts (C-statistic 0.96). Conclusions - Convolutional neural networks improved the classification of intracardiac AF maps compared to other analyses, and agreed with expert evaluation. Novel explainability analyses revealed that the CNN operated using a decision logic similar to rules used by experts, even though these rules were not provided in training. We thus describe a scalable platform for robust comparisons of complex AF data from multiple systems, which may provide immediate clinical utility to guide ablation.

    DOI: 10.1161/CIRCEP.119.008160

    PubMed ID: 32631100

  • Approximate Selection with Guarantees using Proxies PROCEEDINGS OF THE VLDB ENDOWMENT Kang, D., Gan, E., Bailis, P., Hashimoto, T., Zaharia, M. 2020; 13 (11): 1990–2003
  • PREDICTING SUDDEN CARDIAC DEATH BY MACHINE LEARNING OF VENTRICULAR ACTION POTENTIALS Selvalingam, A., Alhusseini, M., Rogers, A. J., Krummen, D., Abuzaid, F. M., Baykaner, T., Clopton, P., Bailis, P., Zaharia, M., Wang, P., Narayan, S. ELSEVIER SCIENCE INC. 2020: 427
  • Fleet: A Framework for Massively Parallel Streaming on FPGAs Thomas, J., Hanrahan, P., Zaharia, M. ASSOC COMPUTING MACHINERY. 2020: 639–51
  • BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics PROCEEDINGS OF THE VLDB ENDOWMENT Kang, D., Bailis, P., Zaharia, M. 2019; 13 (4): 533–46
  • To Index or Not to Index: Optimizing Exact Maximum Inner Product Search Abuzaid, F., Sethi, G., Bailis, P., Zaharia, M. IEEE. 2019: 1250–61
  • PipeDream: Generalized Pipeline Parallelism for DNN Training Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., Gibbons, P. B., Zaharia, M. ASSOC COMPUTING MACHINERY. 2019: 1–15
  • TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions Jia, Z., Padon, O., Thomas, J., Warszawski, T., Zaharia, M., Aiken, A. ASSOC COMPUTING MACHINERY. 2019: 47–62
  • Optimizing Data-Intensive Computations in Existing Libraries with Split Annotations Palkar, S., Zaharia, M. ASSOC COMPUTING MACHINERY. 2019: 291–305
  • From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers Fouladi, S., Romero, F., Iter, D., Li, Q., Chatterjee, S., Kozyrakis, C., Zaharia, M., Winstein, K. USENIX ASSOC. 2019: 475–88
  • DIFF: A Relational Interface for Large-Scale Data Explanation PROCEEDINGS OF THE VLDB ENDOWMENT Abuzaid, F., Kraft, P., Suri, S., Gan, E., Xu, E., Shenoy, A., Ananthanarayan, A., Sheu, J., Meijer, E., Wu, X., Naughton, J., Bailis, P., Zaharia, M. 2018; 12 (4): 419–32
  • Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark Armbrust, M., Das, T., Torres, J., Yavuz, B., Zhu, S., Xin, R., Ghodsi, A., Stoica, I., Zaharia, M. Edited by Das, G., Jermaine, C., Bernstein, P., Eldawy, A. ASSOC COMPUTING MACHINERY. 2018: 601–13
  • MISTIQUE: A System to Store and Query Model Intermediates for Model Diagnosis Vartak, M., da Trindade, J. F., Madden, S., Zaharia, M. Edited by Das, G., Jermaine, C., Bernstein, P., Eldawy, A. ASSOC COMPUTING MACHINERY. 2018: 1285–1300
  • NoScope: Optimizing Neural Network Queries over Video at Scale PROCEEDINGS OF THE VLDB ENDOWMENT Kang, D., Emmons, J., Abuzaid, F., Bailis, P., Zaharia, M. 2017; 10 (11): 1586–97
  • Splinter: Practical Private Queries on Public Data Wang, F., Yun, C., Goldwasser, S., Vaikuntanathan, V., Zaharia, M. USENIX ASSOC. 2017: 299–313
  • DIY Hosting for Online Privacy Palkar, S., Zaharia, M. ASSOC COMPUTING MACHINERY. 2017: 1–7
  • Making Caches Work for Graph Analytics Zhang, Y., Kiriansky, V., Mendis, C., Amarasinghe, S., Zaharia, M. Edited by Nie, J. Y., Obradovic, Z., Suzumura, T., Ghosh, R., Nambiar, R., Wang, C., Zang, H., BaezaYates, R., Hu, Kepner, J., Cuzzocrea, A., Tang, J., Toyoda, M. IEEE. 2017: 293–302
  • Apache Spark: A Unified Engine for Big Data Processing COMMUNICATIONS OF THE ACM Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M. J., Ghodsi, A., Gonzalez, J., Shenker, S., Stoica, I. 2016; 59 (11): 56–65

    DOI: 10.1145/2934664

    Web of Science ID: 000387897700022

  • Voodoo - A Vector Algebra for Portable Database Performance on Modern Hardware PROCEEDINGS OF THE VLDB ENDOWMENT Pirk, H., Moll, O., Zaharia, M., Madden, S. 2016; 9 (14): 1707–18
  • MLlib: Machine Learning in Apache Spark JOURNAL OF MACHINE LEARNING RESEARCH Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D. B., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M. J., Zadeh, R., Zaharia, M., Talwalkar, A. 2016; 17
  • GraphFrames: An Integrated API for Mixing Graph and Relational Queries Dave, A., Jindal, A., Li, L., Xin, R., Gonzalez, J., Zaharia, M. ASSOC COMPUTING MACHINERY. 2016
  • FairRide: Near-Optimal, Fair Cache Sharing Pu, Q., Li, H., Zaharia, M., Ghodsi, A., Stoica, I. USENIX ASSOC. 2016: 393–406
  • SparkR: Scaling R Programs with Spark Venkataraman, S., Yang, Z., Liu, D., Liang, E., Falaki, H., Meng, X., Xin, R., Ghodsi, A., Franklin, M., Stoica, I., Zaharia, M. ACM SIGMOD. ASSOC COMPUTING MACHINERY. 2016: 1099–1104
  • Introduction to Spark 2.0 for Database Researchers Armbrust, M., Bateman, D., Xin, R., Zaharia, M. ACM SIGMOD. ASSOC COMPUTING MACHINERY. 2016: 2193–94
  • Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale Abuzaid, F., Bradley, J., Liang, F., Feng, A., Yang, L., Zaharia, M., Talwalkar, A. Edited by Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, Garnett, R. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2016
  • Matrix Computations and Optimization in Apache Spark Zadeh, R., Meng, X., Ulanov, A., Yavuz, B., Pu, L., Venkataraman, S., Sparks, E., Staple, A., Zaharia, M. ASSOC COMPUTING MACHINERY. 2016: 31–38
  • Scaling Spark in the Real World: Performance and Usability PROCEEDINGS OF THE VLDB ENDOWMENT Armbrust, M., Das, T., Davidson, A., Ghodsi, A., Or, A., Rosen, J., Stoica, I., Wendell, P., Xin, R., Zaharia, M. 2015; 8 (12): 1840–43
  • Vuvuzela: Scalable Private Messaging Resistant to Traffic Analysis van den Hooff, J., Lazar, D., Zaharia, M., Zeldovich, N. ASSOC COMPUTING MACHINERY. 2015: 137–52
  • Spark SQL: Relational Data Processing in Spark Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A., Zaharia, M. ACM SIGMOD. ASSOC COMPUTING MACHINERY. 2015: 1383–94
  • Optimally designing games for behavioural research PROCEEDINGS OF THE ROYAL SOCIETY A-MATHEMATICAL PHYSICAL AND ENGINEERING SCIENCES Rafferty, A. N., Zaharia, M., Griffiths, T. L. 2014; 470 (2167): 20130828


    Computer games can be motivating and engaging experiences that facilitate learning, leading to their increasing use in education and behavioural experiments. For these applications, it is often important to make inferences about the knowledge and cognitive processes of players based on their behaviour. However, designing games that provide useful behavioural data is a difficult task that typically requires significant trial and error. We address this issue by creating a new formal framework that extends optimal experiment design, used in statistics, to apply to game design. In this framework, we use Markov decision processes to model players' actions within a game, and then make inferences about the parameters of a cognitive model from these actions. Using a variety of concept learning games, we show that in practice, this method can predict which games will result in better estimates of the parameters of interest. The best games require only half as many players to attain the same level of precision.

    DOI: 10.1098/rspa.2013.0828

    Web of Science ID: 000336184600004

    PubMed ID: 25002821

    PubMed Central ID: PMC4032552

  • A cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples GENOME RESEARCH Naccache, S. N., Federman, S., Veeraraghavan, N., Zaharia, M., Lee, D., Samayoa, E., Bouquet, J., Greninger, A. L., Luk, K., Enge, B., Wadford, D. A., Messenger, S. L., Genrich, G. L., Pellegrino, K., Grard, G., Leroy, E., Schneider, B. S., Fair, J. N., Martinez, M. A., Isa, P., Crump, J. A., DeRisi, J. L., Sittler, T., Hackett, J., Miller, S., Chiu, C. Y. 2014; 24 (7): 1180–92


    Unbiased next-generation sequencing (NGS) approaches enable comprehensive pathogen detection in the clinical microbiology laboratory and have numerous applications for public health surveillance, outbreak investigation, and the diagnosis of infectious diseases. However, practical deployment of the technology is hindered by the bioinformatics challenge of analyzing results accurately and in a clinically relevant timeframe. Here we describe SURPI ("sequence-based ultrarapid pathogen identification"), a computational pipeline for pathogen identification from complex metagenomic NGS data generated from clinical samples, and demonstrate use of the pipeline in the analysis of 237 clinical samples comprising more than 1.1 billion sequences. Deployable on both cloud-based and standalone servers, SURPI leverages two state-of-the-art aligners for accelerated analyses, SNAP and RAPSearch, which are as accurate as existing bioinformatics tools but orders of magnitude faster in performance. In fast mode, SURPI detects viruses and bacteria by scanning data sets of 7-500 million reads in 11 min to 5 h, while in comprehensive mode, all known microorganisms are identified, followed by de novo assembly and protein homology searches for divergent viruses in 50 min to 16 h. SURPI has also directly contributed to real-time microbial diagnosis in acutely ill patients, underscoring its potential key role in the development of unbiased NGS-based clinical assays in infectious diseases that demand rapid turnaround times.

    DOI: 10.1101/gr.171934.113

    Web of Science ID: 000338185000012

    PubMed ID: 24899342

    PubMed Central ID: PMC4079973

  • Multi-Resource Fair Queueing for Packet Processing ACM SIGCOMM COMPUTER COMMUNICATION REVIEW Ghodsi, A., Sekar, V., Zaharia, M., Stoica, I. 2012; 42 (4): 1–12
  • Managing Data Transfers in Computer Clusters with Orchestra ACM SIGCOMM COMPUTER COMMUNICATION REVIEW Chowdhury, M., Zaharia, M., Ma, J., Jordan, M. I., Stoica, I. 2011; 41 (4): 98–109