Christopher (Chris) Ré is an associate professor in the Department of Computer Science at Stanford University. He is in the Stanford AI Lab and is affiliated with the Machine Learning Group and the Center for Research on Foundation Models. His recent work aims to understand how software and hardware systems will change because of machine learning, alongside a continuing, petulant drive to work on math problems. Research from his group has been incorporated into scientific and humanitarian efforts, such as the fight against human trafficking, along with products from technology companies including Apple, Google, YouTube, and more. He has also cofounded companies, including Snorkel, SambaNova, and Together, and a venture firm called Factory.

His family still brags that he received the MacArthur Foundation Fellowship, but his closest friends are confident that it was a mistake. His research contributions have spanned database theory, database systems, and machine learning, and his work has won best paper at a premier venue in each area: PODS 2012, SIGMOD 2014, and ICML 2016, respectively. Due to great collaborators, he received the NeurIPS 2020 test-of-time award and the PODS 2022 test-of-time award. Due to great students, he received best paper at MIDL 2022, best paper runner-up at ICLR 2022 and ICML 2022, and best student-paper runner-up at UAI 2022.

Program Affiliations

  • Stanford SystemX Alliance

Current Research and Scholarly Interests

Algorithms, systems, and theory for the next generation of data processing and data analytics systems.

All Publications

  • Extracting chemical reactions from text using Snorkel. BMC Bioinformatics Mallory, E. K., de Rochemonteix, M., Ratner, A., Acharya, A., Re, C., Bright, R. A., Altman, R. B. 2020; 21 (1): 217


    Enzymatic and chemical reactions are key for understanding biological processes in cells. Curated databases of chemical reactions exist but these databases struggle to keep up with the exponential growth of the biomedical literature. Conventional text mining pipelines provide tools to automatically extract entities and relationships from the scientific literature, and partially replace expert curation, but such machine learning frameworks often require a large amount of labeled training data and thus lack scalability for both larger document corpora and new relationship types. We developed an application of Snorkel, a weakly supervised learning framework, for extracting chemical reaction relationships from biomedical literature abstracts. For this work, we defined a chemical reaction relationship as the transformation of chemical A to chemical B. We built and evaluated our system on small annotated sets of chemical reaction relationships from two corpora: curated bacteria-related abstracts from the MetaCyc database (MetaCyc_Corpus) and a more general set of abstracts annotated with MeSH (Medical Subject Headings) term Bacteria (Bacteria_Corpus; a superset of MetaCyc_Corpus). For the MetaCyc_Corpus, we obtained 84% precision and 41% recall (55% F1 score). Extending to the more general Bacteria_Corpus decreased precision to 62% with only a four-point drop in recall to 37% (46% F1 score). Overall, the Bacteria_Corpus contained two orders of magnitude more candidate chemical reaction relationships (nine million candidates vs 68,000 candidates) and had a larger class imbalance (2.5% positives vs 5% positives) as compared to the MetaCyc_Corpus. In total, we extracted 6871 chemical reaction relationships from nine million candidates in the Bacteria_Corpus. With this work, we built a database of chemical reaction relationships from almost 900,000 scientific abstracts without a large training set of labeled annotations. Further, we showed the generalizability of our initial application built on MetaCyc documents enriched with chemical reactions to a general set of articles related to bacteria.

    View details for DOI 10.1186/s12859-020-03542-1

    View details for PubMedID 32460703

  • Assessment of Convolutional Neural Networks for Automated Classification of Chest Radiographs. Radiology Dunnmon, J. A., Yi, D., Langlotz, C. P., Re, C., Rubin, D. L., Lungren, M. P. 2018: 181422


    Purpose: To assess the ability of convolutional neural networks (CNNs) to enable high-performance automated binary classification of chest radiographs. Materials and Methods: In a retrospective study, 216 431 frontal chest radiographs obtained between 1998 and 2012 were procured, along with associated text reports and a prospective label from the attending radiologist. This data set was used to train CNNs to classify chest radiographs as normal or abnormal before evaluation on a held-out set of 533 images hand-labeled by expert radiologists. The effects of development set size, training set size, initialization strategy, and network architecture on end performance were assessed by using standard binary classification metrics; detailed error analysis, including visualization of CNN activations, was also performed. Results: Average area under the receiver operating characteristic curve (AUC) was 0.96 for a CNN trained with 200 000 images. This AUC value was greater than that observed when the same model was trained with 2000 images (AUC = 0.84, P < .005) but was not significantly different from that observed when the model was trained with 20 000 images (AUC = 0.95, P > .05). Averaging the CNN output score with the binary prospective label yielded the best-performing classifier, with an AUC of 0.98 (P < .005). Analysis of specific radiographs revealed that the model was heavily influenced by clinically relevant spatial regions but did not reliably generalize beyond thoracic disease. Conclusion: CNNs trained with a modestly sized collection of prospectively labeled chest radiographs achieved high diagnostic performance in the classification of chest radiographs as normal or abnormal; this function may be useful for automated prioritization of abnormal chest radiographs. © RSNA, 2018. Online supplemental material is available for this article. See also the editorial by van Ginneken in this issue.

    View details for PubMedID 30422093

  • Snuba: Automating Weak Supervision to Label Training Data PROCEEDINGS OF THE VLDB ENDOWMENT Varma, P., Re, C. 2018; 12 (3): 223–36
  • Research for Practice: Knowledge Base Construction in the Machine Learning Era COMMUNICATIONS OF THE ACM Ratner, A., Re, C. 2018; 61 (11): 95–97

    View details for DOI 10.1145/3233243

    View details for Web of Science ID 000448785200030

  • A Relational Framework for Classifier Engineering Kimelfeld, B., Re, C. ASSOC COMPUTING MACHINERY. 2018

    View details for DOI 10.1145/3268931

    View details for Web of Science ID 000457121900001

  • A Cloud-Based Metabolite and Chemical Prioritization System for the Biology/Disease-Driven Human Proteome Project. Journal of proteome research Yu, K., Lee, T. M., Chen, Y., Re, C., Kou, S. C., Chiang, J., Snyder, M., Kohane, I. S. 2018


    Targeted metabolomics and biochemical studies complement the ongoing investigations led by the Human Proteome Organization (HUPO) Biology/Disease-Driven Human Proteome Project (B/D-HPP). However, it is challenging to identify and prioritize metabolite and chemical targets. Literature-mining-based approaches have been proposed for target proteomics studies, but text mining methods for metabolite and chemical prioritization are hindered by a large number of synonyms and nonstandardized names of each entity. In this study, we developed a cloud-based literature mining and summarization platform that maps metabolites and chemicals in the literature to unique identifiers and summarizes the copublication trends of metabolites/chemicals and B/D-HPP topics using Protein Universal Reference Publication-Originated Search Engine (PURPOSE) scores. We successfully prioritized metabolites and chemicals associated with the B/D-HPP targeted fields and validated the results by checking against expert-curated associations and enrichment analyses. Compared with existing algorithms, our system achieved better precision and recall in retrieving chemicals related to B/D-HPP focused areas. Our cloud-based platform enables queries on all biological terms in multiple species, which will contribute to B/D-HPP and targeted metabolomics/chemical studies.

    View details for PubMedID 30094994

  • Fonduer: Knowledge Base Construction from Richly Formatted Data. Proceedings. ACM-Sigmod International Conference on Management of Data Wu, S., Hsiao, L., Cheng, X., Hancock, B., Rekatsinas, T., Levis, P., Re, C. 2018; 2018: 1301–16


    We focus on knowledge base construction (KBC) from richly formatted data. In contrast to KBC from text or tabular data, KBC from richly formatted data aims to extract relations conveyed jointly via textual, structural, tabular, and visual expressions. We introduce Fonduer, a machine-learning-based KBC system for richly formatted data. Fonduer presents a new data model that accounts for three challenging characteristics of richly formatted data: (1) prevalent document-level relations, (2) multimodality, and (3) data variety. Fonduer uses a new deep-learning model to automatically capture the representation (i.e., features) needed to learn how to extract relations from richly formatted data. Finally, Fonduer provides a new programming model that enables users to convert domain expertise, based on multiple modalities of information, to meaningful signals of supervision for training a KBC system. Fonduer-based KBC systems are in production for a range of use cases, including at a major online retailer. We compare Fonduer against state-of-the-art KBC approaches in four different domains. We show that Fonduer achieves an average improvement of 41 F1 points on the quality of the output knowledge base (and in some cases produces up to 1.87× the number of correct entries) compared to expert-curated public knowledge bases. We also conduct a user study to assess the usability of Fonduer's new programming model. We show that after using Fonduer for only 30 minutes, non-domain experts are able to design KBC systems that achieve on average 23 F1 points higher quality than traditional machine-learning-based KBC approaches.

    View details for PubMedID 29937618

  • It's All a Matter of Degree: Using Degree Information to Optimize Multiway Joins THEORY OF COMPUTING SYSTEMS Joglekar, M., Re, C. 2018; 62 (4): 810–53
  • Systematic Protein Prioritization for Targeted Proteomics Studies through Literature Mining JOURNAL OF PROTEOME RESEARCH Yu, K., Lee, T., Wan, C., Chen, Y., Re, C., Kou, S. C., Chiang, J., Kohane, I. S., Snyder, M. 2018; 17 (4): 1383–96


    There are more than 3.7 million published articles on the biological functions or disease implications of proteins, constituting an important resource of proteomics knowledge. However, it is difficult to summarize the millions of proteomics findings in the literature manually and quantify their relevance to the biology and diseases of interest. We developed a fully automated bioinformatics framework to identify and prioritize proteins associated with any biological entity. We used the 22 targeted areas of the Biology/Disease-driven (B/D)-Human Proteome Project (HPP) as examples, prioritized the relevant proteins through their Protein Universal Reference Publication-Originated Search Engine (PURPOSE) scores, validated the relevance of the score by comparing the protein prioritization results with a curated database, computed the scores of proteins across the topics of B/D-HPP, and characterized the top proteins in the common model organisms. We further extended the bioinformatics workflow to identify the relevant proteins in all organ systems and human diseases and deployed a cloud-based tool to prioritize proteins related to any custom search terms in real time. Our tool can facilitate the prioritization of proteins for any organ system or disease of interest and can contribute to the development of targeted proteomic studies for precision medicine.

    View details for PubMedID 29505266

  • Worst-case Optimal Join Algorithms JOURNAL OF THE ACM Ngo, H. Q., Porat, E., Re, C., Rudra, A. 2018; 65 (3)

    View details for DOI 10.1145/3180143

    View details for Web of Science ID 000433477000005

  • A Relational Framework for Classifier Engineering SIGMOD RECORD Kimelfeld, B., Re, C. 2018; 47 (1): 6–13
  • Weighted SGD for ℓp Regression with Randomized Preconditioning JOURNAL OF MACHINE LEARNING RESEARCH Yang, J., Chow, Y., Re, C., Mahoney, M. W. 2018; 18
  • Software 2.0 and Snorkel: Beyond Hand-Labeled Data Re, C. ASSOC COMPUTING MACHINERY. 2018: 2876
  • Association of Omics Features with Histopathology Patterns in Lung Adenocarcinoma CELL SYSTEMS Yu, K., Berry, G. J., Rubin, D. L., Re, C., Altman, R. B., Snyder, M. 2017; 5 (6): 620-+


    Adenocarcinoma accounts for more than 40% of lung malignancy, and microscopic pathology evaluation is indispensable for its diagnosis. However, how histopathology findings relate to molecular abnormalities remains largely unknown. Here, we obtained H&E-stained whole-slide histopathology images, pathology reports, RNA sequencing, and proteomics data of 538 lung adenocarcinoma patients from The Cancer Genome Atlas and used these to identify molecular pathways associated with histopathology patterns. We report cell-cycle regulation and nucleotide binding pathways underpinning tumor cell dedifferentiation, and we predicted histology grade using transcriptomics and proteomics signatures (area under curve >0.80). We built an integrative histopathology-transcriptomics model to generate better prognostic predictions for stage I patients (p = 0.0182 ± 0.0021) compared with gene expression or histopathology studies alone, and the results were replicated in an independent cohort (p = 0.0220 ± 0.0070). These results motivate the integration of histopathology and omics data to investigate molecular mechanisms of pathology findings and enhance clinical prognostic prediction.

    View details for PubMedID 29153840

    View details for PubMedCentralID PMC5746468

  • Inferring Generative Model Structure with Static Analysis. Advances in neural information processing systems Varma, P., He, B., Bajaj, P., Banerjee, I., Khandwala, N., Rubin, D. L., Re, C. 2017; 30: 239–49


    Obtaining enough labeled data to robustly train complex discriminative models is a major bottleneck in the machine learning pipeline. A popular solution is combining multiple sources of weak supervision using generative models. The structure of these models affects training label quality, but is difficult to learn without any ground truth labels. We instead rely on these weak supervision sources having some structure by virtue of being encoded programmatically. We present Coral, a paradigm that infers generative model structure by statically analyzing the code for these heuristics, thus reducing the data required to learn structure significantly. We prove that Coral's sample complexity scales quasilinearly with the number of heuristics and number of relations found, improving over the standard sample complexity, which is exponential in n for identifying nth degree relations. Experimentally, Coral matches or outperforms traditional structure learning approaches by up to 3.81 F1 points. Using Coral to model dependencies instead of assuming independence results in better performance than a fully supervised model by 3.07 accuracy points when heuristics are used to label radiology data without ground truth labels.

    View details for PubMedID 29391769

  • Gaussian Quadrature for Kernel Features. Advances in neural information processing systems Dao, T., De Sa, C., Re, C. 2017; 30: 6109–19


    Kernel methods have recently attracted resurgent interest, showing performance competitive with deep neural networks in tasks such as speech recognition. The random Fourier features map is a technique commonly used to scale up kernel machines, but employing the randomized feature map means that O(ε⁻²) samples are required to achieve an approximation error of at most ε. We investigate some alternative schemes for constructing feature maps that are deterministic, rather than random, by approximating the kernel in the frequency domain using Gaussian quadrature. We show that deterministic feature maps can be constructed, for any γ > 0, to achieve error ε with O(e^γ + ε^(−1/γ)) samples as ε goes to 0. Our method works particularly well with sparse ANOVA kernels, which are inspired by the convolutional layer of CNNs. We validate our methods on datasets in different domains, such as MNIST and TIMIT, showing that deterministic features are faster to generate and achieve accuracy comparable to the state-of-the-art kernel methods based on random Fourier features.

    View details for PubMedID 29398882

  • Learning to Compose Domain-Specific Transformations for Data Augmentation. Advances in neural information processing systems Ratner, A. J., Ehrenberg, H. R., Hussain, Z., Dunnmon, J., Re, C. 2017; 30: 3239–49


    Data augmentation is a ubiquitous technique for increasing the size of labeled training sets by leveraging task-specific data transformations that preserve class labels. While it is often easy for domain experts to specify individual transformations, constructing and tuning the more sophisticated compositions typically needed to achieve state-of-the-art results is a time-consuming manual task in practice. We propose a method for automating this process by learning a generative sequence model over user-specified transformation functions using a generative adversarial approach. Our method can make use of arbitrary, non-deterministic transformation functions, is robust to misspecified user input, and is trained on unlabeled data. The learned transformation model can then be used to perform data augmentation for any end discriminative model. In our experiments, we show the efficacy of our approach on both image and text datasets, achieving improvements of 4.0 accuracy points on CIFAR-10, 1.4 F1 points on the ACE relation extraction task, and 3.4 accuracy points when using domain-specific transformation operations on a medical imaging dataset as compared to standard heuristic augmentation approaches.

    View details for PubMedID 29375240

  • Snorkel: Rapid Training Data Creation with Weak Supervision PROCEEDINGS OF THE VLDB ENDOWMENT Ratner, A., Bach, S. H., Ehrenberg, H., Fries, J., Wu, S., Re, C. 2017; 11 (3): 269–82


    Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.

    View details for PubMedID 29770249

  • EmptyHeaded: A Relational Engine for Graph Processing Aberger, C. R., Lamb, A., Tu, S., Noetzli, A., Olukotun, K., Re, C. ASSOC COMPUTING MACHINERY. 2017

    View details for DOI 10.1145/3129246

    View details for Web of Science ID 000419302700001

  • Mind the Gap: Bridging Multi-Domain Query Workloads with EmptyHeaded PROCEEDINGS OF THE VLDB ENDOWMENT Aberger, C. R., Lamb, A., Olukotun, K., Re, C. 2017; 10 (12): 1849–52
  • HoloClean: Holistic Data Repairs with Probabilistic Inference PROCEEDINGS OF THE VLDB ENDOWMENT Rekatsinas, T., Chu, X., Ilyas, I. F., Re, C. 2017; 10 (11): 1190–1201
  • Report from the third workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR' 16) SIGMOD RECORD Afrati, F. N., Hidders, J., Re, C., Sroka, J., Ullman, J. 2017; 46 (2): 43–48
  • Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent De Sa, C., Feldman, M., Re, C., Olukotun, K. ASSOC COMPUTING MACHINERY. 2017: 561–74


    Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called Buckwild! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11×. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems.

    View details for PubMedID 29391770

    View details for PubMedCentralID PMC5789782

  • Snorkel: Fast Training Set Generation for Information Extraction Ratner, A. J., Bach, S. H., Ehrenberg, H. R., Re, C. ASSOC COMPUTING MACHINERY. 2017: 1683–86
  • Learning to Compose Domain-Specific Transformations for Data Augmentation Ratner, A. J., Ehrenberg, H. R., Hussain, Z., Dunnmon, J., Re, C. Edited by Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2017
  • Inferring Generative Model Structure with Static Analysis Varma, P., He, B., Bajaj, P., Khandwala, N., Banerjee, I., Rubin, D., Re, C. Edited by Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2017
  • SLiM Fast: Guaranteed Results for Data Fusion and Source Reliability Rekatsinas, T., Joglekar, M., Garcia-Molina, H., Parameswaran, A., Re, C. ASSOC COMPUTING MACHINERY. 2017: 1399–1414
  • Data Programming: Creating Large Training Sets, Quickly. Advances in neural information processing systems Ratner, A., De Sa, C., Wu, S., Selsam, D., Re, C. 2016; 29: 3567–75


    Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict. We show that by explicitly representing this training set labeling process as a generative model, we can "denoise" the generated training set, and establish theoretically that we can recover the parameters of these generative models in a handful of settings. We then show how to modify a discriminative loss function to make it noise-aware, and demonstrate our method over a range of discriminative models including logistic regression and LSTMs. Experimentally, on the 2014 TAC-KBP Slot Filling challenge, we show that data programming would have led to a new winning score, and also show that applying data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points over a state-of-the-art LSTM baseline (and into second place in the competition). Additionally, in initial user studies we observed that data programming may be an easier way for non-experts to create machine learning models when training data is limited or unavailable.

    View details for PubMedID 29872252

  • Joins via Geometric Resolutions: Worst Case and Beyond Khamis, M., Ngo, H. Q., Re, C., Rudra, A. ASSOC COMPUTING MACHINERY. 2016

    View details for DOI 10.1145/2967101

    View details for Web of Science ID 000393183800002

  • Extracting Databases from Dark Data with DeepDive. Proceedings. ACM-Sigmod International Conference on Management of Data Zhang, C., Shin, J., Ré, C., Cafarella, M., Niu, F. 2016; 2016: 847-859


    DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data - scientific papers, Web classified ads, customer service notes, and so on - were instead in a relational database, it would give analysts a massive and valuable new set of "big data." DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontologists, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.

    View details for DOI 10.1145/2882903.2904442

    View details for PubMedID 28316365

  • EmptyHeaded: A Relational Engine for Graph Processing. Proceedings. ACM-Sigmod International Conference on Management of Data Aberger, C. R., Tu, S., Olukotun, K., Ré, C. 2016; 2016: 431-446


    There are two types of high-performance graph processing engines: low- and high-level engines. Low-level engines (Galois, PowerGraph, Snap) provide optimized data structures and computation models but require users to write low-level imperative code, hence ensuring that efficiency is the burden of the user. In high-level engines, users write in query languages like datalog (SociaLite) or SQL (Grail). High-level engines are easier to use but are orders of magnitude slower than the low-level graph engines. We present EmptyHeaded, a high-level engine that supports a rich datalog-like query language and achieves performance comparable to that of low-level engines. At the core of EmptyHeaded's design is a new class of join algorithms that satisfy strong theoretical guarantees but have thus far not achieved performance comparable to that of specialized graph processing engines. To achieve high performance, EmptyHeaded introduces a new join engine architecture, including a novel query optimizer and data layouts that leverage single-instruction multiple data (SIMD) parallelism. With this architecture, EmptyHeaded outperforms high-level approaches by up to three orders of magnitude on graph pattern queries, PageRank, and Single-Source Shortest Paths (SSSP) and is an order of magnitude faster than many low-level baselines. We validate that EmptyHeaded competes with the best-of-breed low-level engine (Galois), achieving comparable performance on PageRank and at most 3× worse performance on SSSP.

    View details for DOI 10.1145/2882903.2915213

    View details for PubMedID 28077912

  • Materialization Optimizations for Feature Selection Workloads ACM TRANSACTIONS ON DATABASE SYSTEMS Zhang, C., Kumar, A., Re, C. 2016; 41 (1)

    View details for DOI 10.1145/2877204

    View details for Web of Science ID 000373901300003

  • DeepDive: Declarative Knowledge Base Construction SIGMOD RECORD De Sa, C., Ratner, A., Re, C., Shin, J., Wang, F., Wu, S., Zhang, C. 2016; 45 (1): 60-67


    The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and pdf reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.

    View details for DOI 10.1145/2949741.2949756

    View details for Web of Science ID 000377814200014

    View details for PubMedCentralID PMC5361060

  • Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nature communications Yu, K., Zhang, C., Berry, G. J., Altman, R. B., Ré, C., Rubin, D. L., Snyder, M. 2016; 7: 12474


    Lung cancer is the most prevalent cancer worldwide, and histopathological assessment is indispensable for its diagnosis. However, human evaluation of pathology slides cannot accurately predict patients' prognoses. In this study, we obtain 2,186 haematoxylin and eosin stained histopathology whole-slide images of lung adenocarcinoma and squamous cell carcinoma patients from The Cancer Genome Atlas (TCGA), and 294 additional images from Stanford Tissue Microarray (TMA) Database. We extract 9,879 quantitative image features and use regularized machine-learning methods to select the top features and to distinguish shorter-term survivors from longer-term survivors with stage I adenocarcinoma (P<0.003) or squamous cell carcinoma (P=0.023) in the TCGA data set. We validate the survival prediction framework with the TMA cohort (P<0.036 for both tumour types). Our results suggest that automatically derived image features can predict the prognosis of lung cancer patients and thereby contribute to precision oncology. Our methods are extensible to histopathology images of other organs.

    View details for DOI 10.1038/ncomms12474

    View details for PubMedID 27527408

  • CYCLADES: Conflict-free Asynchronous Machine Learning Pan, X., Lam, M., Tu, S., Papailiopoulos, D., Zhang, C., Jordan, M. I., Ramchandran, K., Re, C., Recht, B., Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, Garnett, R. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2016
  • Dark Data: Are We Solving the Right Problems? Cafarella, M., Ilyas, I. F., Kornacker, M., Kraska, T., Re, C. IEEE. 2016: 1444–45
  • High Performance Parallel Stochastic Gradient Descent in Shared Memory Sallinen, S., Satish, N., Smelyanskiy, M., Sury, S. S., Re, C. IEEE. 2016: 873–82
  • Asynchrony begets Momentum, with an Application to Deep Learning Mitliagkas, I., Zhang, C., Hadjis, S., Re, C. IEEE. 2016: 997–1004
  • Weighted SGD for ℓp Regression with Randomized Preconditioning. Proceedings of the ... Annual ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Symposium on Discrete Algorithms Yang, J., Chow, Y., Re, C., Mahoney, M. W. 2016; 2016: 558–69


    In recent years, stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms have been applied to many large-scale problems in machine learning and data analysis. SGD methods are easy to implement and applicable to a wide range of convex optimization problems. In contrast, RLA algorithms provide much stronger performance guarantees but are applicable to a narrower class of problems. We aim to bridge the gap between these two methods in solving constrained overdetermined linear regression problems, e.g., ℓ2 and ℓ1 regression problems. We propose a hybrid algorithm named pwSGD that uses RLA techniques for preconditioning and constructing an importance sampling distribution, and then performs an SGD-like iterative process with weighted sampling on the preconditioned system. By rewriting a deterministic ℓp regression problem as a stochastic optimization problem, we connect pwSGD to several existing ℓp solvers including RLA methods with algorithmic leveraging (RLA for short). We prove that pwSGD inherits faster convergence rates that only depend on the lower dimension of the linear system, while maintaining low computation complexity. Such SGD convergence rates are superior to other related SGD algorithms such as the weighted randomized Kaczmarz algorithm. Particularly, when solving ℓ1 regression with size n by d, pwSGD returns an approximate solution with epsilon relative error in the objective value in 𝒪(log n·nnz(A) + poly(d)/epsilon^2) time. This complexity is uniformly better than that of RLA methods in terms of both epsilon and d when the problem is unconstrained. In the presence of constraints, pwSGD only has to solve a sequence of much simpler and smaller optimization problems over the same constraints. In general this is more efficient than solving the constrained subproblem required in RLA. For ℓ2 regression, pwSGD returns an approximate solution with epsilon relative error in the objective value and the solution vector measured in prediction norm in 𝒪(log n·nnz(A) + poly(d) log(1/epsilon)/epsilon) time. We show that for unconstrained ℓ2 regression, this complexity is comparable to that of RLA and is asymptotically better over several state-of-the-art solvers in the regime where the desired accuracy epsilon, high dimension n and low dimension d satisfy d ≥ 1/epsilon and n ≥ d^2/epsilon. We also provide lower bounds on the coreset complexity for more general regression problems, indicating that new ideas will still be needed to extend similar RLA preconditioning ideas to weighted SGD algorithms for more general regression problems. Finally, the effectiveness of such algorithms is illustrated numerically on both synthetic and real datasets, and the results are consistent with our theoretical findings and demonstrate that pwSGD converges to a medium-precision solution, e.g., epsilon = 10^-3, more quickly.

    View details for PubMedID 29782626

  • Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much. Advances in neural information processing systems He, B., De Sa, C., Mitliagkas, I., Ré, C. 2016; 29


    Gibbs sampling is a Markov Chain Monte Carlo sampling technique that iteratively samples variables from their conditional distributions. There are two common scan orders for the variables: random scan and systematic scan. Due to the benefits of locality in hardware, systematic scan is commonly used, even though most statistical guarantees are only for random scan. While it has been conjectured that the mixing times of random scan and systematic scan do not differ by more than a logarithmic factor, we show by counterexample that this is not the case, and we prove that the mixing times do not differ by more than a polynomial factor under mild conditions. To prove these relative bounds, we introduce a method of augmenting the state space to study systematic scan using conductance.

    View details for PubMedID 28344429

  • Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling. JMLR workshop and conference proceedings De Sa, C., Olukotun, K., Ré, C. 2016; 48: 1567-1576


    Gibbs sampling is a Markov chain Monte Carlo technique commonly used for estimating marginal distributions. To speed up Gibbs sampling, there has recently been interest in parallelizing it by executing asynchronously. While empirical results suggest that many models can be efficiently sampled asynchronously, traditional Markov chain analysis does not apply to the asynchronous case, and thus asynchronous Gibbs sampling is poorly understood. In this paper, we derive a better understanding of the two main challenges of asynchronous Gibbs: bias and mixing time. We show experimentally that our theoretical results match practical outcomes.

    View details for PubMedID 28344730

  • Large-scale extraction of gene interactions from full-text literature using DeepDive BIOINFORMATICS Mallory, E. K., Zhang, C., Re, C., Altman, R. B. 2016; 32 (1): 106-113


    A complete repository of gene-gene interactions is key for understanding cellular processes, human disease and drug response. These gene-gene interactions include both protein-protein interactions and transcription factor interactions. The majority of known interactions are found in the biomedical literature. Interaction databases, such as BioGRID and ChEA, annotate these gene-gene interactions; however, curation becomes difficult as the literature grows exponentially. DeepDive is a trained system for extracting information from a variety of sources, including text. In this work, we used DeepDive to extract both protein-protein and transcription factor interactions from over 100,000 full-text PLOS articles. We built an extractor for gene-gene interactions that identified candidate gene-gene relations within an input sentence. For each candidate relation, DeepDive computed a probability that the relation was a correct interaction. We evaluated this system against the Database of Interacting Proteins and against randomly curated extractions. Our system achieved 76% precision and 49% recall in extracting direct and indirect interactions involving gene symbols co-occurring in a sentence. For randomly curated extractions, the system achieved between 62% and 83% precision based on direct or indirect interactions, as well as sentence-level and document-level precision. Overall, our system extracted 3356 unique gene pairs using 724 features from over 100,000 full-text articles. Application source code is publicly available, and data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btv476

    View details for Web of Science ID 000368357800013

    View details for PubMedCentralID PMC4681986

  • Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms. Advances in neural information processing systems De Sa, C., Zhang, C., Olukotun, K., Ré, C. 2015; 28: 2656-2664


    Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is a martingale-based analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild!) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for non-convex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild!, that uses lower-precision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.

    View details for PubMedID 27330264

  • Energy-Efficient Abundant-Data Computing: The N3XT 1,000x COMPUTER Aly, M. M., Gao, M., Hills, G., Lee, C., Pitner, G., Shulaker, M. M., Wu, T. F., Asheghi, M., Bokor, J., Franchetti, F., Goodson, K. E., Kozyrakis, C., Markov, I., Olukotun, K., Pileggi, L., Pop, E., Rabaey, J., Re, C., Wong, H. P., Mitra, S. 2015; 48 (12): 24-33
  • Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width. Advances in neural information processing systems De Sa, C., Zhang, C., Olukotun, K., Ré, C. 2015; 28: 3079-3087


    Gibbs sampling on factor graphs is a widely used inference technique, which often produces good empirical results. Theoretical guarantees for its performance are weak: even for tree structured graphs, the mixing time of Gibbs may be exponential in the number of variables. To help understand the behavior of Gibbs sampling, we introduce a new (hyper)graph property, called hierarchy width. We show that under suitable conditions on the weights, bounded hierarchy width ensures polynomial mixing time. Our study of hierarchy width is in part motivated by a class of factor graph templates, hierarchical templates, which have bounded hierarchy width, regardless of the data used to instantiate them. We demonstrate a rich application from natural language processing in which Gibbs sampling provably mixes rapidly and achieves accuracy that exceeds human volunteers.

    View details for PubMedID 27279724

  • The mobilize center: an NIH big data to knowledge center to advance human movement research and improve mobility. Journal of the American Medical Informatics Association Ku, J. P., Hicks, J. L., Hastie, T., Leskovec, J., Ré, C., Delp, S. L. 2015; 22 (6): 1120-1125


    Regular physical activity helps prevent heart disease, stroke, diabetes, and other chronic diseases, yet a broad range of conditions impair mobility at great personal and societal cost. Vast amounts of data characterizing human movement are available from research labs, clinics, and millions of smartphones and wearable sensors, but integration and analysis of this large quantity of mobility data are extremely challenging. The authors have established the Mobilize Center to harness these data to improve human mobility and help lay the foundation for using data science methods in biomedicine. The Center is organized around 4 data science research cores: biomechanical modeling, statistical learning, behavioral and social modeling, and integrative modeling. Important biomedical applications, such as osteoarthritis and weight management, will focus the development of new data science methods. By developing these new approaches, sharing data and validated software tools, and training thousands of researchers, the Mobilize Center will transform human movement research.

    View details for DOI 10.1093/jamia/ocv071

    View details for PubMedID 26272077

    View details for PubMedCentralID PMC4639715

  • Mindtagger: A Demonstration of Data Labeling in Knowledge Base Construction. Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases Shin, J., Ré, C., Cafarella, M. 2015; 8 (12): 1920-1923


    End-to-end knowledge base construction systems using statistical inference are enabling more people to automatically extract high-quality domain-specific information from unstructured data. As a result of deploying the DeepDive framework across several domains, we found new challenges in debugging and improving such end-to-end systems to construct high-quality knowledge bases. DeepDive has an iterative development cycle in which users improve the data. To help our users, we needed to develop principles for analyzing the system's error as well as provide tooling for inspecting and labeling various data products of the system. We created guidelines for error analysis modeled after our colleagues' best practices, in which data labeling plays a critical role in every step of the analysis. To enable more productive and systematic data labeling, we created Mindtagger, a versatile tool that can be configured to support a wide range of tasks. In this demonstration, we show in detail what data labeling tasks are modeled in our error analysis guidelines and how each of them is performed using Mindtagger.

    View details for PubMedID 27144082

  • Mindtagger: A Demonstration of Data Labeling in Knowledge Base Construction PROCEEDINGS OF THE VLDB ENDOWMENT Shin, J., Re, C., Cafarella, M. 2015; 8 (12): 1921–24
  • Incremental Knowledge Base Construction Using DeepDive. Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases Shin, J., Wu, S., Wang, F., De Sa, C., Zhang, C., Ré, C. 2015; 8 (11): 1310-1321


    Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.

    View details for PubMedID 27144081

  • Caffe con Troll: Shallow Ideas to Speed Up Deep Learning. Proceedings of the Fourth Workshop on Data analytics at sCale (DanaC 2015) : May 31st, 2015, Melbourne, Australia. Workshop on Data Analytics in the Cloud (4th : 2015 : Melbourne, Vic.) Hadjis, S., Abuzaid, F., Zhang, C., Ré, C. 2015; 2015


    We present Caffe con Troll (CcT), a fully compatible end-to-end version of the popular framework Caffe with rebuilt internals. We built CcT to examine the performance characteristics of training and deploying general-purpose convolutional neural networks across different hardware architectures. We find that, by employing standard batching optimizations for CPU training, we achieve a 4.5× throughput improvement over Caffe on popular networks like CaffeNet. Moreover, with these improvements, the end-to-end training time for CNNs is directly proportional to the FLOPS delivered by the CPU, which enables us to efficiently train hybrid CPU-GPU systems for CNNs.

    View details for PubMedID 27314106

  • A Database Framework for Classifier Engineering. CEUR workshop proceedings Kimelfeld, B., Re, C. 2015; 1378

    View details for PubMedID 27274719

  • An Asynchronous Parallel Stochastic Coordinate Descent Algorithm JOURNAL OF MACHINE LEARNING RESEARCH Liu, J., Wright, S. J., Re, C., Bittorf, V., Sridhar, S. 2015; 16: 285-322
  • Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care Chaturapruek, S., Duchi, J. C., Re, C., Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., Garnett, R. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2015
  • Effectively Creating Weakly Labeled Training Examples via Approximate Domain Knowledge Natarajan, S., Picado, J., Khot, T., Kersting, K., Re, C., Shavlik, J., Davis, J., Ramon, J. SPRINGER-VERLAG BERLIN. 2015: 92–107
  • Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width De Sa, C., Zhang, C., Olukotun, K., Re, C., Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., Garnett, R. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2015
  • Taming the Wild: A Unified Analysis of HOGWILD!-Style Algorithms De Sa, C., Zhang, C., Olukotun, K., Re, C., Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., Garnett, R. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2015
  • A Machine Reading System for Assembling Synthetic Paleontological Databases PLOS ONE Peters, S. E., Zhang, C., Livny, M., Re, C. 2014; 9 (12)


    Many aspects of macroevolutionary theory and our understanding of biotic responses to global environmental change derive from literature-based compilations of paleontological data. Existing manually assembled databases are, however, incomplete and difficult to assess and enhance with new data types. Here, we develop and validate the quality of a machine reading system, PaleoDeepDive, that automatically locates and extracts data from heterogeneous text, tables, and figures in publications. PaleoDeepDive performs comparably to humans in several complex data extraction and inference tasks and generates congruent synthetic results that describe the geological history of taxonomic diversity and genus-level rates of origination and extinction. Unlike traditional databases, PaleoDeepDive produces a probabilistic database that systematically improves as information is added. We show that the system can readily accommodate sophisticated data types, such as morphological data in biological illustrations and associated textual descriptions. Our machine reading approach to scientific data integration and synthesis brings within reach many questions that are currently underdetermined and does so in ways that may stimulate entirely new modes of inquiry.

    View details for DOI 10.1371/journal.pone.0113523

    View details for Web of Science ID 000347114900048

    View details for PubMedID 25436610

    View details for PubMedCentralID PMC4250071

  • DimmWitted: A Study of Main-Memory Statistical Analytics PROCEEDINGS OF THE VLDB ENDOWMENT Zhang, C., Re, C. 2014; 7 (12): 1283–94
  • Transducing Markov Sequences JOURNAL OF THE ACM Kimelfeld, B., Re, C. 2014; 61 (5)

    View details for DOI 10.1145/2630065

    View details for Web of Science ID 000341936800005

  • Skew Strikes Back: New Developments in the Theory of Join Algorithms SIGMOD RECORD Ngo, H. Q., Re, C., Rudra, A. 2013; 42 (4): 5–16
  • Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System PROCEEDINGS OF THE VLDB ENDOWMENT Konda, P., Kumar, A., Re, C., Sashikanth, V. 2013; 6 (12): 1306–9
  • Ringtail: Nowcasting Made Easy Antenucci, D., Cafarella, M., Levenstein, Margaret, C., Ré, C., Shapiro, M. 2013
  • Building an Entity-Centric Stream Filtering Test Collection for TREC 2012 Frank, John, R., Kleiman-Weiner, M., Roberts, Daniel, A., Niu, F., Ré, C., Soboroff, I. 2013
  • GeoDeepDive: Statistical Inference using Familiar Data-Processing Languages. SIGMOD 13 (demo). Zhang, C., Govindaraju, V., Borchardt, J., Foltz, T., Ré, C., Peters, S. 2013
  • Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion Mathematical Programming Computation Recht, B., Ré, C. 2013
  • Using Commonsense Knowledge to Automatically Create (Noisy) Training Examples from Text StarAI with AAAI Natarajan, S., Picado, J., Khot, T., Kersting, K., Ré, C., Shavlik, J. 2013
  • Understanding Tables in Context Using Standard NLP Toolkits ACL 2013 (Short Paper) Govindaraju, V., Zhang, C., Ré, C. 2013
  • Hazy: Making it Easier to Build and Maintain Big-data Analytics Kumar, A., Niu, F., Ré, C. 2013
  • Towards High-Throughput Gibbs Sampling at Scale: A Study across Storage Managers. Zhang, C., Ré, C. 2013
  • Robust Statistics in IceCube Initial Muon Reconstruction Wellons, M., Collaboration, t., Recht, B., Ré, C. 2013
  • Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System Konda, P., Kumar, A., Ré, C., Sashikanth, V. 2013
  • An Approximate, Efficient LP Solver for LP Rounding NIPS Sridhar, S., Bittorf, V., Liu, J., Zhang, C., Ré, C., Wright, Stephen, J. 2013
  • Brainwash: A Data System for Feature Engineering (Vision Track) Anderson, M., Antenucci, D., Bittorf, V., Burgess, M., Cafarella, M., Kumar, A. 2013
  • Ringtail: Nowcasting Made Easy. Antenucci, D., Li, E., Liu, S., Cafarella, Michael, J., Ré, C. 2013
  • DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference VLDS Niu, F., Zhang, C., Ré, C., Shavlik, J. 2012
  • Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences COLT Recht, B., Ré, C. 2012
  • Big Data versus the Crowd: Looking for Relationships in All the Right Places ACL Zhang, C., Niu, F., Ré, C., Shavlik, J. 2012
  • Factoring nonnegative matrices with linear programs. NIPS Bittorf, V., Recht, B., Ré, C., Tropp, Joel, A. 2012
  • Scaling Inference for Markov Logic via Dual Decomposition (Short Paper). ICDM Niu, F., Zhang, C., Ré, C., Shavlik, J. 2012
  • Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference IJSWIS, Special Issue on Knowledge Extraction from the Web, 2012, to appear Niu, F., Zhang, C., Ré, C., Shavlik, J. 2012
  • Understanding cardinality estimation using entropy maximization ACM Trans. Database Syst. Ré, C., Suciu, D. 2012; 37: 6
  • Worst-case Optimal Join Algorithms PODS Ngo, Hung, Q., Porat, E., Ré, C., Rudra, A. 2012
  • The MADlib Analytics Library or MAD Skills, the SQL. PVLDB Hellerstein, Joseph, M., Ré, C., Schoppmann, F., Wang, D. Z., Fratkin, E., Gorajek, A. 2012
  • Probabilistic Management of OCR using an RDBMS Kumar, A., Ré, C. 2012
  • Towards a Unified Architecture for In-Database Analytics Feng, A., Kumar, A., Recht, B., Ré, C. 2012
  • Optimizing Statistical Information Extraction Programs Over Evolving Text Chen, F., Feng, X., Ré, C., Wang, M. 2012
  • Automatic Optimization for MapReduce Programs PVLDB Jahani, E., Cafarella, Michael, J., Ré, C. 2011; 4: 385-396
  • Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS PVLDB Niu, F., Ré, C., Doan, A., Shavlik, Jude, W. 2011; 4: 373-384
  • Queries and materialized views on probabilistic databases J. Comput. Syst. Sci. Dalvi, Nilesh, N., Re, C., Suciu, D. 2011; 77: 473-490
  • Felix: Scaling Inference for Markov Logic with an Operator-based Approach ArXiv e-prints Niu, F., Zhang, C., Ré, C. 2011
  • Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent NIPS Niu, F., Recht, B., Ré, C., Wright, Stephen, J. 2011
  • Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion Optimization Online Recht, B., Ré, C. 2011
  • Incrementally maintaining classification using an RDBMS PVLDB Koc, M. L., Ré, C. 2011; 4: 302-313
  • Manimal: Relational Optimization for Data-Intensive Programs WebDB Cafarella, Michael, J., Ré, C. 2010
  • Approximation Trade-Offs in a Markovian Stream Warehouse: An Empirical Study (Short Paper) ICDE Letchner, J., Ré, C., Balazinska, M., Philipose, M. 2010
  • Understanding Cardinality Estimation using Entropy Maximization PODS Ré, C., Suciu, D. 2010
  • Transducing Markov Sequences PODS Kimelfeld, B., Ré, C. 2010
  • Query Containment of Tier-2 Queries over a Probabilistic Database Management of Uncertain Databases (MUD) Moore, Katherine, F., Rastogi, V., Ré, C., Suciu, D. 2009
  • Lahar Demonstration: Warehousing Markovian Streams PVLDB Letchner, J., Ré, C., Balazinska, M., Philipose, M. 2009; 2: 1610-1613
  • Large-Scale Deduplication with Constraints Using Dedupalog ICDE Arasu, A., Ré, C., Suciu, D. 2009: 952-963
  • The Trichotomy of HAVING Queries on a Probabilistic Database VLDB Journal Ré, C., Suciu, D. 2009
  • Access Methods for Markovian Streams ICDE Letchner, J., Ré, C., Balazinska, M., Philipose, M. 2009: 246-257
  • Probabilistic databases: Diamonds in the dirt Commun. ACM Volume Dalvi, Nilesh, N., Ré, C., Suciu, D. 2009; 52: 86-94
  • General Database Statistics Using Entropy Maximization DBPL Kaushik, R., Ré, C., Suciu, D. 2009: 84-99
  • Managing Large-Scale Probabilistic Databases University of Washington, Seattle Ré, C. 2009
  • Repeatability & Workability Evaluation of SIGMOD 2009 SIGMOD Record Manegold, S., Manolescu, I., Afanasiev, L., Feng, J., Gou, G., Hadjieleftheriou, M., Re, C. M. 2009; 38: 40-43
  • Implementing NOT EXISTS Predicates over a Probabilistic Database QDB/MUD Wang, T., Ré, C., Suciu, D. 2008: 73-86
  • A demonstration of Cascadia through a digital diary application Khoussainova, N., Welbourne, E., Balazinska, M., Borriello, G., Cole, G., Letchner, J. 2008
  • Managing Probabilistic Data with Mystiq (Plenary Talk) Ré, C. 2008
  • Systems aspects of probabilistic data management (Part II) PVLDB Balazinska, M., Ré, C., Suciu, D. 2008; 1: 1520-1521
  • Systems aspects of probabilistic data management (Part I) PVLDB Balazinska, M., Ré, C., Suciu, D. 2008; 1: 1520-1521
  • Approximate lineage for probabilistic databases PVLDB Ré, C., Suciu, D. 2008; 1: 797-808
  • Managing Probabilistic Data with MystiQ: The Can-Do, the Could-Do, and the Can't-Do SUM Ré, C., Suciu, D. 2008: 5-18
  • Event queries on correlated probabilistic streams Ré, C., Letchner, J., Balazinska, M., Suciu, D. 2008
  • Advances in Processing SQL Queries on Probabilistic Data Invited Abstract in INFORMS 2008, Simulation. Ré, C., Suciu, D. 2008
  • Challenges for Event Queries over Markovian Streams IEEE Internet Computing Letchner, J., Ré, C., Balazinska, M., Philipose, M. 2008; 12: 30-36
  • Structured Querying of Web Text Data: A Technical Challenge CIDR Cafarella, Michael, J., Ré, C., Suciu, D., Etzioni, O. 2007: 225-234
  • Managing Uncertainty in Social Networks IEEE Data Eng. Bull. Adar, E., Ré, C. 2007; 30: 15-22
  • Materialized Views in Probabilistic Databases for Information Exchange and Query Optimization VLDB Re, C., Suciu, D. 2007: 51-62
  • Efficient Top-k Query Evaluation on Probabilistic Data ICDE Ré, C., Dalvi, Nilesh, N., Suciu, D. 2007: 886-895
  • Efficient Evaluation of HAVING Queries DBPL Ré, C., Suciu, D. 2007: 186-200
  • Management of data with uncertainties CIKM Re, C., Suciu, D. 2007: 3-8
  • Orderings on Annotated Collections Liber Amicorum in honor of Jan Paredaens 60th Birthday Ré, C., Suciu, D., Tannen, V. 2007
  • A Complete and Efficient Algebraic Compiler for XQuery ICDE Re, C., Siméon, J., Fernández, Mary, F. 2006: 14
  • XQuery!: An XML Query Language with Side Effects Ghelli, G., Ré, C., Siméon, J. 2006
  • Query Evaluation on Probabilistic Databases IEEE Data Eng. Bull. Ré, C., Dalvi, Nilesh, N., Suciu, D. 2006; 29: 25-31
  • MYSTIQ: a system for finding more answers by using probabilities Boulos, J., Dalvi, Nilesh, N., Mandhani, B., Mathur, S., Ré, C., Suciu, D. 2005
  • A Framework for XML-Based Integration of Data, Visualization and Analysis in a Biomedical Domain XSym Bales, N., Brinkley, J., Lee, E., Sally, Mathur, S., Re, C., Suciu, D. 2005: 207-221
  • Supporting workflow in a course management system SIGCSE Botev, C., Chao, H., Chao, T., Cheng, Y., Doyle, R., Grankin, S. 2005: 262-266
  • Distributed XQuery Ré, C., Brinkley, J., Hinshaw, K., Suciu, D. 2004
  • WS-Membership - Failure Management in a Web-Services World WWW (Alternate Paper Tracks) Vogels, W., Ré, C. 2003
  • A Collaborative Infrastructure for Scalable and Robust News Delivery Vogels, W., Ré, C., Renesse, R., Birman, Kenneth, P. 2002
  • Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale. Proceedings. ACM-Sigmod International Conference on Management of Data Bach, S. H., Rodriguez, D., Liu, Y., Luo, C., Shao, H., Xia, C., Sen, S., Ratner, A., Hancock, B., Alborzi, H., Kuchhal, R., Ré, C., Malkin, R. 2019: 362–75


    Labeling training data is one of the most costly bottlenecks in developing machine learning-based applications. We present a first-of-its-kind study showing how existing knowledge resources from across an organization can be used as weak supervision in order to bring development time and cost down by an order of magnitude, and introduce Snorkel DryBell, a new weak supervision management system for this setting. Snorkel DryBell builds on the Snorkel framework, extending it in three critical aspects: flexible, template-based ingestion of diverse organizational knowledge, cross-feature production serving, and scalable, sampling-free execution. On three classification tasks at Google, we find that Snorkel DryBell creates classifiers of comparable quality to ones trained with tens of thousands of hand-labeled examples, converts non-servable organizational resources to servable models for an average 52% performance improvement, and executes over millions of data points in tens of minutes.

    View details for DOI 10.1145/3299869.3314036

    View details for PubMedID 31777414

    View details for PubMedCentralID PMC6879379