Christopher Re
Associate Professor of Computer Science
Web page: http://www.cs.stanford.edu/~chrismre/
Bio
Christopher (Chris) Re is an associate professor in the Department of Computer Science at Stanford University. He is in the Stanford AI Lab and is affiliated with the Machine Learning Group and the Center for Research on Foundation Models. His recent work aims to understand how software and hardware systems will change because of machine learning, along with a continuing, petulant drive to work on math problems. Research from his group has been incorporated into scientific and humanitarian efforts, such as the fight against human trafficking, along with products from technology companies including Apple, Google, YouTube, and more. He has also cofounded companies, including Snorkel, SambaNova, and Together, and a venture firm called Factory.
His family still brags that he received the MacArthur Foundation Fellowship, but his closest friends are confident that it was a mistake. His research contributions have spanned database theory, database systems, and machine learning, and his work has won best paper at a premier venue in each area, respectively, at PODS 2012, SIGMOD 2014, and ICML 2016. Due to great collaborators, he received the NeurIPS 2020 test-of-time award and the PODS 2022 test-of-time award. Due to great students, he received best paper at MIDL 2022, best paper runner up at ICLR22 and ICML22, and best student-paper runner up at UAI22.
Academic Appointments
-
Associate Professor, Computer Science
-
Member, Bio-X
-
Faculty Affiliate, Institute for Human-Centered Artificial Intelligence (HAI)
Program Affiliations
-
Stanford SystemX Alliance
Current Research and Scholarly Interests
Algorithms, systems, and theory for the next generation of data processing and data analytics systems.
2024-25 Courses
-
Independent Studies (15)
- Advanced Reading and Research
CS 499 (Aut, Win, Spr)
- Advanced Reading and Research
CS 499P (Aut, Win, Spr)
- Curricular Practical Training
CS 390A (Aut, Win, Spr)
- Curricular Practical Training
CS 390B (Aut, Win, Spr)
- Curricular Practical Training
CS 390C (Aut, Win, Spr)
- Independent Project
CS 399 (Aut, Win, Spr)
- Independent Project
CS 399P (Aut, Win, Spr)
- Independent Work
CS 199 (Aut, Win, Spr)
- Independent Work
CS 199P (Aut, Win, Spr)
- Master's Research
CME 291 (Aut, Win, Spr)
- Part-time Curricular Practical Training
CS 390D (Aut, Win, Spr)
- Ph.D. Research
CME 400 (Aut, Win, Spr)
- Senior Project
CS 191 (Aut, Win, Spr)
- Supervised Undergraduate Research
CS 195 (Aut, Win, Spr)
- Writing Intensive Senior Research Project
CS 191W (Aut, Win, Spr)
Prior Year Courses
2022-23 Courses
- Advances in Foundation Models
CS 324 (Win)
- Machine Learning
CS 229, STATS 229 (Spr)
2021-22 Courses
- Machine Learning
CS 229, STATS 229 (Spr)
- Machine Learning Systems Seminar
CS 528 (Aut, Win, Spr)
- Understanding and Developing Large Language Models
CS 324 (Win)
Stanford Advisees
-
Doctoral Dissertation Reader (AC)
Jimmy Smith
-
Postdoctoral Faculty Sponsor
Carmen Amo Alonso
-
Doctoral Dissertation Advisor (AC)
Dan Fu, Jerry Liu, Ben Viggiano
-
Master's Program Advisor
Ramya Ayyagari, Sean Bai, Sajid Farook, Sathvik Nallamalli, Raj Palleti, Santino Ramos, Athena Shiravi, Lea Wang-Tomic, Shannon Xiao, James Zheng
-
Doctoral Dissertation Co-Advisor (AC)
Chris Fifty, Jordan Juravsky, Jon Saad-Falcon, Vishnu Sarukkai, Alyssa Unell, Michael Wornow
-
Doctoral Dissertation Co-Advisor (NonAC)
Krista Opsahl-Ong
-
Doctoral (Program)
Simran Arora, Mayee Chen, Sabri Eyuboglu, Dan Fu, Neel Guha, Avanika Narayan, Benjamin Spector, Brandon Yang, Michael Zhang
-
Postdoctoral Research Mentor
Dan Biderman
All Publications
-
Extracting chemical reactions from text using Snorkel.
BMC bioinformatics
2020; 21 (1): 217
Abstract
Enzymatic and chemical reactions are key for understanding biological processes in cells. Curated databases of chemical reactions exist but these databases struggle to keep up with the exponential growth of the biomedical literature. Conventional text mining pipelines provide tools to automatically extract entities and relationships from the scientific literature, and partially replace expert curation, but such machine learning frameworks often require a large amount of labeled training data and thus lack scalability for both larger document corpora and new relationship types. We developed an application of Snorkel, a weakly supervised learning framework, for extracting chemical reaction relationships from biomedical literature abstracts. For this work, we defined a chemical reaction relationship as the transformation of chemical A to chemical B. We built and evaluated our system on small annotated sets of chemical reaction relationships from two corpora: curated bacteria-related abstracts from the MetaCyc database (MetaCyc_Corpus) and a more general set of abstracts annotated with MeSH (Medical Subject Headings) term Bacteria (Bacteria_Corpus; a superset of MetaCyc_Corpus). For the MetaCyc_Corpus, we obtained 84% precision and 41% recall (55% F1 score). Extending to the more general Bacteria_Corpus decreased precision to 62% with only a four-point drop in recall to 37% (46% F1 score). Overall, the Bacteria_Corpus contained two orders of magnitude more candidate chemical reaction relationships (nine million candidates vs 68,000 candidates) and had a larger class imbalance (2.5% positives vs 5% positives) as compared to the MetaCyc_Corpus. In total, we extracted 6871 chemical reaction relationships from nine million candidates in the Bacteria_Corpus. With this work, we built a database of chemical reaction relationships from almost 900,000 scientific abstracts without a large training set of labeled annotations. Further, we showed the generalizability of our initial application built on MetaCyc documents enriched with chemical reactions to a general set of articles related to bacteria.
View details for DOI 10.1186/s12859-020-03542-1
View details for PubMedID 32460703
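The F1 scores quoted in the abstract above follow from the reported precision and recall through the standard harmonic-mean definition. The short check below is only an illustrative sketch: the function and variable names are hypothetical, and the inputs are the rounded percentages from the abstract, not the study's underlying counts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported in the abstract (rounded values).
reported = {
    "MetaCyc_Corpus": (0.84, 0.41),
    "Bacteria_Corpus": (0.62, 0.37),
}

f1_by_corpus = {name: f1_score(p, r) for name, (p, r) in reported.items()}

for name, f1 in f1_by_corpus.items():
    print(f"{name}: F1 = {f1:.0%}")
```

Rounded to whole percentages, the two results agree with the 55% and 46% F1 scores stated in the abstract.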
-
Assessment of Convolutional Neural Networks for Automated Classification of Chest Radiographs.
Radiology
2018: 181422
Abstract
Purpose: To assess the ability of convolutional neural networks (CNNs) to enable high-performance automated binary classification of chest radiographs. Materials and Methods: In a retrospective study, 216 431 frontal chest radiographs obtained between 1998 and 2012 were procured, along with associated text reports and a prospective label from the attending radiologist. This data set was used to train CNNs to classify chest radiographs as normal or abnormal before evaluation on a held-out set of 533 images hand-labeled by expert radiologists. The effects of development set size, training set size, initialization strategy, and network architecture on end performance were assessed by using standard binary classification metrics; detailed error analysis, including visualization of CNN activations, was also performed. Results: Average area under the receiver operating characteristic curve (AUC) was 0.96 for a CNN trained with 200 000 images. This AUC value was greater than that observed when the same model was trained with 2000 images (AUC = 0.84, P < .005) but was not significantly different from that observed when the model was trained with 20 000 images (AUC = 0.95, P > .05). Averaging the CNN output score with the binary prospective label yielded the best-performing classifier, with an AUC of 0.98 (P < .005). Analysis of specific radiographs revealed that the model was heavily influenced by clinically relevant spatial regions but did not reliably generalize beyond thoracic disease. Conclusion: CNNs trained with a modestly sized collection of prospectively labeled chest radiographs achieved high diagnostic performance in the classification of chest radiographs as normal or abnormal; this function may be useful for automated prioritization of abnormal chest radiographs. © RSNA, 2018. Online supplemental material is available for this article. See also the editorial by van Ginneken in this issue.
View details for PubMedID 30422093
-
Snuba: Automating Weak Supervision to Label Training Data
PROCEEDINGS OF THE VLDB ENDOWMENT
2018; 12 (3): 223–36
View details for DOI 10.14778/3291264.3291268
View details for Web of Science ID 000456032800004
-
Research for Practice: Knowledge Base Construction in the Machine Learning Era
COMMUNICATIONS OF THE ACM
2018; 61 (11): 95–97
View details for DOI 10.1145/3233243
View details for Web of Science ID 000448785200030
-
A Relational Framework for Classifier Engineering
ASSOC COMPUTING MACHINERY. 2018
View details for DOI 10.1145/3268931
View details for Web of Science ID 000457121900001
-
A Cloud-Based Metabolite and Chemical Prioritization System for the Biology/Disease-Driven Human Proteome Project.
Journal of proteome research
2018
Abstract
Targeted metabolomics and biochemical studies complement the ongoing investigations led by the Human Proteome Organization (HUPO) Biology/Disease-Driven Human Proteome Project (B/D-HPP). However, it is challenging to identify and prioritize metabolite and chemical targets. Literature-mining-based approaches have been proposed for target proteomics studies, but text mining methods for metabolite and chemical prioritization are hindered by a large number of synonyms and nonstandardized names of each entity. In this study, we developed a cloud-based literature mining and summarization platform that maps metabolites and chemicals in the literature to unique identifiers and summarizes the copublication trends of metabolites/chemicals and B/D-HPP topics using Protein Universal Reference Publication-Originated Search Engine (PURPOSE) scores. We successfully prioritized metabolites and chemicals associated with the B/D-HPP targeted fields and validated the results by checking against expert-curated associations and enrichment analyses. Compared with existing algorithms, our system achieved better precision and recall in retrieving chemicals related to B/D-HPP focused areas. Our cloud-based platform enables queries on all biological terms in multiple species, which will contribute to B/D-HPP and targeted metabolomics/chemical studies.
View details for PubMedID 30094994
-
Fonduer: Knowledge Base Construction from Richly Formatted Data.
Proceedings. ACM-Sigmod International Conference on Management of Data
2018; 2018: 1301–16
Abstract
We focus on knowledge base construction (KBC) from richly formatted data. In contrast to KBC from text or tabular data, KBC from richly formatted data aims to extract relations conveyed jointly via textual, structural, tabular, and visual expressions. We introduce Fonduer, a machine-learning-based KBC system for richly formatted data. Fonduer presents a new data model that accounts for three challenging characteristics of richly formatted data: (1) prevalent document-level relations, (2) multimodality, and (3) data variety. Fonduer uses a new deep-learning model to automatically capture the representation (i.e., features) needed to learn how to extract relations from richly formatted data. Finally, Fonduer provides a new programming model that enables users to convert domain expertise, based on multiple modalities of information, to meaningful signals of supervision for training a KBC system. Fonduer-based KBC systems are in production for a range of use cases, including at a major online retailer. We compare Fonduer against state-of-the-art KBC approaches in four different domains. We show that Fonduer achieves an average improvement of 41 F1 points on the quality of the output knowledge base (and in some cases produces up to 1.87× the number of correct entries) compared to expert-curated public knowledge bases. We also conduct a user study to assess the usability of Fonduer's new programming model. We show that after using Fonduer for only 30 minutes, non-domain experts are able to design KBC systems that achieve on average 23 F1 points higher quality than traditional machine-learning-based KBC approaches.
View details for PubMedID 29937618
-
It's All a Matter of Degree: Using Degree Information to Optimize Multiway Joins
THEORY OF COMPUTING SYSTEMS
2018; 62 (4): 810–53
View details for DOI 10.1007/s00224-017-9811-8
View details for Web of Science ID 000429658000003
-
Systematic Protein Prioritization for Targeted Proteomics Studies through Literature Mining
JOURNAL OF PROTEOME RESEARCH
2018; 17 (4): 1383–96
Abstract
There are more than 3.7 million published articles on the biological functions or disease implications of proteins, constituting an important resource of proteomics knowledge. However, it is difficult to summarize the millions of proteomics findings in the literature manually and quantify their relevance to the biology and diseases of interest. We developed a fully automated bioinformatics framework to identify and prioritize proteins associated with any biological entity. We used the 22 targeted areas of the Biology/Disease-driven (B/D)-Human Proteome Project (HPP) as examples, prioritized the relevant proteins through their Protein Universal Reference Publication-Originated Search Engine (PURPOSE) scores, validated the relevance of the score by comparing the protein prioritization results with a curated database, computed the scores of proteins across the topics of B/D-HPP, and characterized the top proteins in the common model organisms. We further extended the bioinformatics workflow to identify the relevant proteins in all organ systems and human diseases and deployed a cloud-based tool to prioritize proteins related to any custom search terms in real time. Our tool can facilitate the prioritization of proteins for any organ system or disease of interest and can contribute to the development of targeted proteomic studies for precision medicine.
View details for PubMedID 29505266
-
Worst-case Optimal Join Algorithms
JOURNAL OF THE ACM
2018; 65 (3)
View details for DOI 10.1145/3180143
View details for Web of Science ID 000433477000005
-
A Relational Framework for Classifier Engineering
SIGMOD RECORD
2018; 47 (1): 6–13
View details for DOI 10.1145/3034786.3034797
View details for Web of Science ID 000444697600002
-
Weighted SGD for ℓp Regression with Randomized Preconditioning
JOURNAL OF MACHINE LEARNING RESEARCH
2018; 18
View details for Web of Science ID 000435628500001
-
Software 2.0 and Snorkel: Beyond Hand-Labeled Data
ASSOC COMPUTING MACHINERY. 2018: 2876
View details for DOI 10.1145/3219819.3219937
View details for Web of Science ID 000455346400302
-
Association of Omics Features with Histopathology Patterns in Lung Adenocarcinoma
CELL SYSTEMS
2017; 5 (6): 620-+
Abstract
Adenocarcinoma accounts for more than 40% of lung malignancy, and microscopic pathology evaluation is indispensable for its diagnosis. However, how histopathology findings relate to molecular abnormalities remains largely unknown. Here, we obtained H&E-stained whole-slide histopathology images, pathology reports, RNA sequencing, and proteomics data of 538 lung adenocarcinoma patients from The Cancer Genome Atlas and used these to identify molecular pathways associated with histopathology patterns. We report cell-cycle regulation and nucleotide binding pathways underpinning tumor cell dedifferentiation, and we predicted histology grade using transcriptomics and proteomics signatures (area under curve >0.80). We built an integrative histopathology-transcriptomics model to generate better prognostic predictions for stage I patients (p = 0.0182 ± 0.0021) compared with gene expression or histopathology studies alone, and the results were replicated in an independent cohort (p = 0.0220 ± 0.0070). These results motivate the integration of histopathology and omics data to investigate molecular mechanisms of pathology findings and enhance clinical prognostic prediction.
View details for PubMedID 29153840
View details for PubMedCentralID PMC5746468
-
Inferring Generative Model Structure with Static Analysis.
Advances in neural information processing systems
2017; 30: 239–49
Abstract
Obtaining enough labeled data to robustly train complex discriminative models is a major bottleneck in the machine learning pipeline. A popular solution is combining multiple sources of weak supervision using generative models. The structure of these models affects training label quality, but is difficult to learn without any ground truth labels. We instead rely on these weak supervision sources having some structure by virtue of being encoded programmatically. We present Coral, a paradigm that infers generative model structure by statically analyzing the code for these heuristics, thus reducing the data required to learn structure significantly. We prove that Coral's sample complexity scales quasilinearly with the number of heuristics and number of relations found, improving over the standard sample complexity, which is exponential in n for identifying nth degree relations. Experimentally, Coral matches or outperforms traditional structure learning approaches by up to 3.81 F1 points. Using Coral to model dependencies instead of assuming independence results in better performance than a fully supervised model by 3.07 accuracy points when heuristics are used to label radiology data without ground truth labels.
View details for PubMedID 29391769
-
Gaussian Quadrature for Kernel Features.
Advances in neural information processing systems
2017; 30: 6109–19
Abstract
Kernel methods have recently attracted resurgent interest, showing performance competitive with deep neural networks in tasks such as speech recognition. The random Fourier features map is a technique commonly used to scale up kernel machines, but employing the randomized feature map means that O(ε^(-2)) samples are required to achieve an approximation error of at most ε. We investigate some alternative schemes for constructing feature maps that are deterministic, rather than random, by approximating the kernel in the frequency domain using Gaussian quadrature. We show that deterministic feature maps can be constructed, for any γ > 0, to achieve error ε with O(e^γ + ε^(-1/γ)) samples as ε goes to 0. Our method works particularly well with sparse ANOVA kernels, which are inspired by the convolutional layer of CNNs. We validate our methods on datasets in different domains, such as MNIST and TIMIT, showing that deterministic features are faster to generate and achieve accuracy comparable to the state-of-the-art kernel methods based on random Fourier features.
View details for PubMedID 29398882
-
Learning to Compose Domain-Specific Transformations for Data Augmentation.
Advances in neural information processing systems
2017; 30: 3239–49
Abstract
Data augmentation is a ubiquitous technique for increasing the size of labeled training sets by leveraging task-specific data transformations that preserve class labels. While it is often easy for domain experts to specify individual transformations, constructing and tuning the more sophisticated compositions typically needed to achieve state-of-the-art results is a time-consuming manual task in practice. We propose a method for automating this process by learning a generative sequence model over user-specified transformation functions using a generative adversarial approach. Our method can make use of arbitrary, non-deterministic transformation functions, is robust to misspecified user input, and is trained on unlabeled data. The learned transformation model can then be used to perform data augmentation for any end discriminative model. In our experiments, we show the efficacy of our approach on both image and text datasets, achieving improvements of 4.0 accuracy points on CIFAR-10, 1.4 F1 points on the ACE relation extraction task, and 3.4 accuracy points when using domain-specific transformation operations on a medical imaging dataset as compared to standard heuristic augmentation approaches.
View details for PubMedID 29375240
-
Snorkel: Rapid Training Data Creation with Weak Supervision
PROCEEDINGS OF THE VLDB ENDOWMENT
2017; 11 (3): 269–82
Abstract
Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8× faster and increase predictive performance by an average of 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.
View details for PubMedID 29770249
-
EmptyHeaded: A Relational Engine for Graph Processing
ASSOC COMPUTING MACHINERY. 2017
View details for DOI 10.1145/3129246
View details for Web of Science ID 000419302700001
-
Mind the Gap: Bridging Multi-Domain Query Workloads with EmptyHeaded
PROCEEDINGS OF THE VLDB ENDOWMENT
2017; 10 (12): 1849–52
View details for Web of Science ID 000416494000024
-
HoloClean: Holistic Data Repairs with Probabilistic Inference
PROCEEDINGS OF THE VLDB ENDOWMENT
2017; 10 (11): 1190–1201
View details for Web of Science ID 000416492900003
-
Report from the third workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR'16)
SIGMOD RECORD
2017; 46 (2): 43–48
View details for Web of Science ID 000409320200005
-
Snorkel: Fast Training Set Generation for Information Extraction
ASSOC COMPUTING MACHINERY. 2017: 1683–86
View details for DOI 10.1145/3035918.3056442
View details for Web of Science ID 000452550000129
-
Learning to Compose Domain-Specific Transformations for Data Augmentation
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2017
View details for Web of Science ID 000452649403030
-
Inferring Generative Model Structure with Static Analysis
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2017
View details for Web of Science ID 000452649400023
-
SLiM Fast: Guaranteed Results for Data Fusion and Source Reliability
ASSOC COMPUTING MACHINERY. 2017: 1399–1414
View details for DOI 10.1145/3035918.3035951
View details for Web of Science ID 000452550000093
-
Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent
ASSOC COMPUTING MACHINERY. 2017: 561–74
Abstract
Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called Buckwild! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11×. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems.
View details for PubMedID 29391770
View details for PubMedCentralID PMC5789782
-
Data Programming: Creating Large Training Sets, Quickly.
Advances in neural information processing systems
2016; 29: 3567–75
Abstract
Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users express weak supervision strategies or domain heuristics as labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict. We show that by explicitly representing this training set labeling process as a generative model, we can "denoise" the generated training set, and establish theoretically that we can recover the parameters of these generative models in a handful of settings. We then show how to modify a discriminative loss function to make it noise-aware, and demonstrate our method over a range of discriminative models including logistic regression and LSTMs. Experimentally, on the 2014 TAC-KBP Slot Filling challenge, we show that data programming would have led to a new winning score, and also show that applying data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points over a state-of-the-art LSTM baseline (and into second place in the competition). Additionally, in initial user studies we observed that data programming may be an easier way for non-experts to create machine learning models when training data is limited or unavailable.
View details for PubMedID 29872252
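The labeling-function idea described in the abstract above can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the function names, keyword heuristics, and example sentences are all hypothetical, and the votes are combined by a plain majority vote, which is a deliberately simpler stand-in for the generative model the paper uses to denoise the labels. It does not use the Snorkel library's API.

```python
from collections import Counter
from typing import Callable, List

# Label values; ABSTAIN means a labeling function has no opinion.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

LabelingFunction = Callable[[str], int]


# Hypothetical heuristics for "does this sentence describe a reaction?"
def lf_converted_to(text: str) -> int:
    return POSITIVE if "converted to" in text.lower() else ABSTAIN


def lf_catalyzes(text: str) -> int:
    return POSITIVE if "catalyzes" in text.lower() else ABSTAIN


def lf_negation(text: str) -> int:
    return NEGATIVE if "does not" in text.lower() else ABSTAIN


def majority_vote(votes: List[int]) -> int:
    """Combine noisy votes; abstain when nothing fires or the top vote ties."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def label(text: str, lfs: List[LabelingFunction]) -> int:
    return majority_vote([lf(text) for lf in lfs])


lfs = [lf_converted_to, lf_catalyzes, lf_negation]
sentences = [
    "The enzyme catalyzes a step in which A is converted to B.",
    "The enzyme does not act on lactose.",
    "Cells were grown overnight.",
    "The mutant does not ensure that A is converted to B.",
]
labels = [label(s, lfs) for s in sentences]
print(labels)
```

The last sentence triggers conflicting heuristics, so the majority vote abstains; resolving such conflicts by estimating each function's accuracy is the part that the paper's generative model handles and that this sketch omits.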
-
Joins via Geometric Resolutions: Worst Case and Beyond
ASSOC COMPUTING MACHINERY. 2016
View details for DOI 10.1145/2967101
View details for Web of Science ID 000393183800002
-
Extracting Databases from Dark Data with DeepDive.
Proceedings. ACM-Sigmod International Conference on Management of Data
2016; 2016: 847-859
Abstract
DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data (scientific papers, Web classified ads, customer service notes, and so on) were instead in a relational database, it would give analysts a massive and valuable new set of "big data." DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontology, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.
View details for DOI 10.1145/2882903.2904442
View details for PubMedID 28316365
-
EmptyHeaded: A Relational Engine for Graph Processing.
Proceedings. ACM-Sigmod International Conference on Management of Data
2016; 2016: 431-446
Abstract
There are two types of high-performance graph processing engines: low- and high-level engines. Low-level engines (Galois, PowerGraph, Snap) provide optimized data structures and computation models but require users to write low-level imperative code, hence ensuring that efficiency is the burden of the user. In high-level engines, users write in query languages like datalog (SociaLite) or SQL (Grail). High-level engines are easier to use but are orders of magnitude slower than the low-level graph engines. We present EmptyHeaded, a high-level engine that supports a rich datalog-like query language and achieves performance comparable to that of low-level engines. At the core of EmptyHeaded's design is a new class of join algorithms that satisfy strong theoretical guarantees but have thus far not achieved performance comparable to that of specialized graph processing engines. To achieve high performance, EmptyHeaded introduces a new join engine architecture, including a novel query optimizer and data layouts that leverage single-instruction multiple data (SIMD) parallelism. With this architecture, EmptyHeaded outperforms high-level approaches by up to three orders of magnitude on graph pattern queries, PageRank, and Single-Source Shortest Paths (SSSP) and is an order of magnitude faster than many low-level baselines. We validate that EmptyHeaded competes with the best-of-breed low-level engine (Galois), achieving comparable performance on PageRank and at most 3× worse performance on SSSP.
View details for DOI 10.1145/2882903.2915213
View details for PubMedID 28077912
-
Materialization Optimizations for Feature Selection Workloads
ACM TRANSACTIONS ON DATABASE SYSTEMS
2016; 41 (1)
View details for DOI 10.1145/2877204
View details for Web of Science ID 000373901300003
-
DeepDive: Declarative Knowledge Base Construction
SIGMOD RECORD
2016; 45 (1): 60-67
Abstract
The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and pdf reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.
View details for DOI 10.1145/2949741.2949756
View details for Web of Science ID 000377814200014
View details for PubMedCentralID PMC5361060
-
Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features.
Nature communications
2016; 7: 12474-?
Abstract
Lung cancer is the most prevalent cancer worldwide, and histopathological assessment is indispensable for its diagnosis. However, human evaluation of pathology slides cannot accurately predict patients' prognoses. In this study, we obtain 2,186 haematoxylin and eosin stained histopathology whole-slide images of lung adenocarcinoma and squamous cell carcinoma patients from The Cancer Genome Atlas (TCGA), and 294 additional images from Stanford Tissue Microarray (TMA) Database. We extract 9,879 quantitative image features and use regularized machine-learning methods to select the top features and to distinguish shorter-term survivors from longer-term survivors with stage I adenocarcinoma (P<0.003) or squamous cell carcinoma (P=0.023) in the TCGA data set. We validate the survival prediction framework with the TMA cohort (P<0.036 for both tumour types). Our results suggest that automatically derived image features can predict the prognosis of lung cancer patients and thereby contribute to precision oncology. Our methods are extensible to histopathology images of other organs.
View details for DOI 10.1038/ncomms12474
View details for PubMedID 27527408
-
CYCLADES: Conflict-free Asynchronous Machine Learning
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2016
View details for Web of Science ID 000458973705018
-
Dark Data: Are We Solving the Right Problems?
IEEE. 2016: 1444–45
View details for Web of Science ID 000382554200143
-
High Performance Parallel Stochastic Gradient Descent in Shared Memory
IEEE. 2016: 873–82
View details for DOI 10.1109/IPDPS.2016.107
View details for Web of Science ID 000391251800090
-
Asynchrony begets Momentum, with an Application to Deep Learning
IEEE. 2016: 997–1004
View details for Web of Science ID 000400601400141
-
Weighted SGD for ℓp Regression with Randomized Preconditioning.
Proceedings of the ... Annual ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM Symposium on Discrete Algorithms
2016; 2016: 558–69
Abstract
In recent years, stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms have been applied to many large-scale problems in machine learning and data analysis. SGD methods are easy to implement and applicable to a wide range of convex optimization problems. In contrast, RLA algorithms provide much stronger performance guarantees but are applicable to a narrower class of problems. We aim to bridge the gap between these two methods in solving constrained overdetermined linear regression problems, e.g., ℓ2 and ℓ1 regression problems. We propose a hybrid algorithm named pwSGD that uses RLA techniques for preconditioning and constructing an importance sampling distribution, and then performs an SGD-like iterative process with weighted sampling on the preconditioned system. By rewriting a deterministic ℓp regression problem as a stochastic optimization problem, we connect pwSGD to several existing ℓp solvers including RLA methods with algorithmic leveraging (RLA for short). We prove that pwSGD inherits faster convergence rates that only depend on the lower dimension of the linear system, while maintaining low computation complexity. Such SGD convergence rates are superior to those of other related SGD algorithms such as the weighted randomized Kaczmarz algorithm. In particular, when solving ℓ1 regression with size n by d, pwSGD returns an approximate solution with epsilon relative error in the objective value in 𝒪(log n · nnz(A) + poly(d)/epsilon^2) time. This complexity is uniformly better than that of RLA methods in terms of both epsilon and d when the problem is unconstrained. In the presence of constraints, pwSGD only has to solve a sequence of much simpler and smaller optimization problems over the same constraints. In general this is more efficient than solving the constrained subproblem required in RLA. For ℓ2 regression, pwSGD returns an approximate solution with epsilon relative error in the objective value and the solution vector measured in prediction norm in 𝒪(log n · nnz(A) + poly(d) log(1/epsilon)/epsilon) time. We show that for unconstrained ℓ2 regression, this complexity is comparable to that of RLA and is asymptotically better than that of several state-of-the-art solvers in the regime where the desired accuracy epsilon, high dimension n, and low dimension d satisfy d ≥ 1/epsilon and n ≥ d^2/epsilon. We also provide lower bounds on the coreset complexity for more general regression problems, indicating that new ideas will still be needed to extend similar RLA preconditioning ideas to weighted SGD algorithms for more general regression problems. Finally, the effectiveness of such algorithms is illustrated numerically on both synthetic and real datasets, and the results are consistent with our theoretical findings and demonstrate that pwSGD converges to a medium-precision solution, e.g., epsilon = 10^-3, more quickly.
View details for PubMedID 29782626
-
Scan Order in Gibbs Sampling: Models in Which it Matters and Bounds on How Much.
Advances in neural information processing systems
2016; 29
Abstract
Gibbs sampling is a Markov Chain Monte Carlo sampling technique that iteratively samples variables from their conditional distributions. There are two common scan orders for the variables: random scan and systematic scan. Due to the benefits of locality in hardware, systematic scan is commonly used, even though most statistical guarantees are only for random scan. While it has been conjectured that the mixing times of random scan and systematic scan do not differ by more than a logarithmic factor, we show by counterexample that this is not the case, and we prove that the mixing times do not differ by more than a polynomial factor under mild conditions. To prove these relative bounds, we introduce a method of augmenting the state space to study systematic scan using conductance.
View details for PubMedID 28344429
-
Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling.
JMLR workshop and conference proceedings
2016; 48: 1567-1576
Abstract
Gibbs sampling is a Markov chain Monte Carlo technique commonly used for estimating marginal distributions. To speed up Gibbs sampling, there has recently been interest in parallelizing it by executing asynchronously. While empirical results suggest that many models can be efficiently sampled asynchronously, traditional Markov chain analysis does not apply to the asynchronous case, and thus asynchronous Gibbs sampling is poorly understood. In this paper, we derive a better understanding of the two main challenges of asynchronous Gibbs: bias and mixing time. We show experimentally that our theoretical results match practical outcomes.
View details for PubMedID 28344730
-
Large-scale extraction of gene interactions from full-text literature using DeepDive
BIOINFORMATICS
2016; 32 (1): 106-113
Abstract
A complete repository of gene-gene interactions is key for understanding cellular processes, human disease and drug response. These gene-gene interactions include both protein-protein interactions and transcription factor interactions. The majority of known interactions are found in the biomedical literature. Interaction databases, such as BioGRID and ChEA, annotate these gene-gene interactions; however, curation becomes difficult as the literature grows exponentially. DeepDive is a trained system for extracting information from a variety of sources, including text. In this work, we used DeepDive to extract both protein-protein and transcription factor interactions from over 100,000 full-text PLOS articles. We built an extractor for gene-gene interactions that identified candidate gene-gene relations within an input sentence. For each candidate relation, DeepDive computed a probability that the relation was a correct interaction. We evaluated this system against the Database of Interacting Proteins and against randomly curated extractions. Our system achieved 76% precision and 49% recall in extracting direct and indirect interactions involving gene symbols co-occurring in a sentence. For randomly curated extractions, the system achieved between 62% and 83% precision based on direct or indirect interactions, as well as sentence-level and document-level precision. Overall, our system extracted 3356 unique gene pairs using 724 features from over 100,000 full-text articles. Application source code is publicly available at https://github.com/edoughty/deepdive_genegene_app. Contact: russ.altman@stanford.edu. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btv476
View details for Web of Science ID 000368357800013
View details for PubMedCentralID PMC4681986
-
Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms.
Advances in neural information processing systems
2015; 28: 2656-2664
Abstract
Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is a martingale-based analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild!) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for non-convex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild!, that uses lower-precision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.
View details for PubMedID 27330264
-
Energy-Efficient Abundant-Data Computing: The N3XT 1,000x
COMPUTER
2015; 48 (12): 24-33
View details for Web of Science ID 000367689400005
-
Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width.
Advances in neural information processing systems
2015; 28: 3079-3087
Abstract
Gibbs sampling on factor graphs is a widely used inference technique, which often produces good empirical results. Theoretical guarantees for its performance are weak: even for tree structured graphs, the mixing time of Gibbs may be exponential in the number of variables. To help understand the behavior of Gibbs sampling, we introduce a new (hyper)graph property, called hierarchy width. We show that under suitable conditions on the weights, bounded hierarchy width ensures polynomial mixing time. Our study of hierarchy width is in part motivated by a class of factor graph templates, hierarchical templates, which have bounded hierarchy width regardless of the data used to instantiate them. We demonstrate a rich application from natural language processing in which Gibbs sampling provably mixes rapidly and achieves accuracy that exceeds human volunteers.
View details for PubMedID 27279724
-
The mobilize center: an NIH big data to knowledge center to advance human movement research and improve mobility.
Journal of the American Medical Informatics Association
2015; 22 (6): 1120-1125
Abstract
Regular physical activity helps prevent heart disease, stroke, diabetes, and other chronic diseases, yet a broad range of conditions impair mobility at great personal and societal cost. Vast amounts of data characterizing human movement are available from research labs, clinics, and millions of smartphones and wearable sensors, but integration and analysis of this large quantity of mobility data are extremely challenging. The authors have established the Mobilize Center (http://mobilize.stanford.edu) to harness these data to improve human mobility and help lay the foundation for using data science methods in biomedicine. The Center is organized around 4 data science research cores: biomechanical modeling, statistical learning, behavioral and social modeling, and integrative modeling. Important biomedical applications, such as osteoarthritis and weight management, will focus the development of new data science methods. By developing these new approaches, sharing data and validated software tools, and training thousands of researchers, the Mobilize Center will transform human movement research.
View details for DOI 10.1093/jamia/ocv071
View details for PubMedID 26272077
View details for PubMedCentralID PMC4639715
-
Mindtagger: A Demonstration of Data Labeling in Knowledge Base Construction.
Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases
2015; 8 (12): 1920-1923
Abstract
End-to-end knowledge base construction systems using statistical inference are enabling more people to automatically extract high-quality domain-specific information from unstructured data. As a result of deploying the DeepDive framework across several domains, we found new challenges in debugging and improving such end-to-end systems to construct high-quality knowledge bases. DeepDive has an iterative development cycle in which users improve the data. To help our users, we needed to develop principles for analyzing the system's error as well as provide tooling for inspecting and labeling various data products of the system. We created guidelines for error analysis modeled after our colleagues' best practices, in which data labeling plays a critical role in every step of the analysis. To enable more productive and systematic data labeling, we created Mindtagger, a versatile tool that can be configured to support a wide range of tasks. In this demonstration, we show in detail what data labeling tasks are modeled in our error analysis guidelines and how each of them is performed using Mindtagger.
View details for PubMedID 27144082
-
Mindtagger: A Demonstration of Data Labeling in Knowledge Base Construction
PROCEEDINGS OF THE VLDB ENDOWMENT
2015; 8 (12): 1921–24
View details for Web of Science ID 000386424800066
-
Incremental Knowledge Base Construction Using DeepDive.
Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases
2015; 8 (11): 1310-1321
Abstract
Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.
View details for PubMedID 27144081
-
Caffe con Troll: Shallow Ideas to Speed Up Deep Learning.
Proceedings of the Fourth Workshop on Data analytics at sCale (DanaC 2015) : May 31st, 2015, Melbourne, Australia. Workshop on Data Analytics in the Cloud (4th : 2015 : Melbourne, Vic.)
2015; 2015
Abstract
We present Caffe con Troll (CcT), a fully compatible end-to-end version of the popular framework Caffe with rebuilt internals. We built CcT to examine the performance characteristics of training and deploying general-purpose convolutional neural networks across different hardware architectures. We find that, by employing standard batching optimizations for CPU training, we achieve a 4.5× throughput improvement over Caffe on popular networks like CaffeNet. Moreover, with these improvements, the end-to-end training time for CNNs is directly proportional to the FLOPS delivered by the CPU, which enables us to efficiently train hybrid CPU-GPU systems for CNNs.
View details for PubMedID 27314106
-
A Database Framework for Classifier Engineering.
CEUR workshop proceedings
2015; 1378
View details for PubMedID 27274719
-
An Asynchronous Parallel Stochastic Coordinate Descent Algorithm
JOURNAL OF MACHINE LEARNING RESEARCH
2015; 16: 285-322
View details for Web of Science ID 000369885800005
-
Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2015
View details for Web of Science ID 000450913103069
-
Effectively Creating Weakly Labeled Training Examples via Approximate Domain Knowledge
SPRINGER-VERLAG BERLIN. 2015: 92–107
View details for DOI 10.1007/978-3-319-23708-4_7
View details for Web of Science ID 000367791200007
-
Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2015
View details for Web of Science ID 000450913101017
-
Taming the Wild: A Unified Analysis of HOGWILD!-Style Algorithms
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2015
View details for Web of Science ID 000450913100085
-
A Machine Reading System for Assembling Synthetic Paleontological Databases
PLOS ONE
2014; 9 (12)
Abstract
Many aspects of macroevolutionary theory and our understanding of biotic responses to global environmental change derive from literature-based compilations of paleontological data. Existing manually assembled databases are, however, incomplete and difficult to assess and enhance with new data types. Here, we develop and validate the quality of a machine reading system, PaleoDeepDive, that automatically locates and extracts data from heterogeneous text, tables, and figures in publications. PaleoDeepDive performs comparably to humans in several complex data extraction and inference tasks and generates congruent synthetic results that describe the geological history of taxonomic diversity and genus-level rates of origination and extinction. Unlike traditional databases, PaleoDeepDive produces a probabilistic database that systematically improves as information is added. We show that the system can readily accommodate sophisticated data types, such as morphological data in biological illustrations and associated textual descriptions. Our machine reading approach to scientific data integration and synthesis brings within reach many questions that are currently underdetermined and does so in ways that may stimulate entirely new modes of inquiry.
View details for DOI 10.1371/journal.pone.0113523
View details for Web of Science ID 000347114900048
View details for PubMedID 25436610
View details for PubMedCentralID PMC4250071
-
DimmWitted: A Study of Main-Memory Statistical Analytics
PROCEEDINGS OF THE VLDB ENDOWMENT
2014; 7 (12): 1283–94
View details for DOI 10.14778/2732977.2733001
View details for Web of Science ID 000219816900025
-
Transducing Markov Sequences
JOURNAL OF THE ACM
2014; 61 (5)
View details for DOI 10.1145/2630065
View details for Web of Science ID 000341936800005
-
Skew Strikes Back: New Developments in the Theory of Join Algorithms
SIGMOD RECORD
2013; 42 (4): 5–16
View details for DOI 10.1145/2590989.2590991
View details for Web of Science ID 000332485200001
-
Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System
PROCEEDINGS OF THE VLDB ENDOWMENT
2013; 6 (12): 1306–9
View details for DOI 10.14778/2536274.2536302
View details for Web of Science ID 000219777100029
- Ringtail: Nowcasting Made Easy 2013
- Building an Entity-Centric Stream Filtering Test Collection for TREC 2012 2013
- GeoDeepDive: Statistical Inference using Familiar Data-Processing Languages. SIGMOD 13 (demo). 2013
- Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion Mathematical Programming Computation 2013
- Using Commonsense Knowledge to Automatically Create (Noisy) Training Examples from Text StarAI with AAAI 2013
- Understanding Tables in Context Using Standard NLP Toolkits ACL 2013 (Short Paper) 2013
- Hazy: Making it Easier to Build and Maintain Big-data Analytics 2013
- Towards High-Throughput Gibbs Sampling at Scale: A Study across Storage Managers. 2013
- Robust Statistics in IceCube Initial Muon Reconstruction 2013
- Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System 2013
- An Approximate, Efficient LP Solver for LP Rounding NIPS 2013
- Brainwash: A Data System for Feature Engineering (Vision Track) 2013
- DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference VLDS 2012
- Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences COLT 2012
- Big Data versus the Crowd: Looking for Relationships in All the Right Places ACL 2012
- Factoring nonnegative matrices with linear programs. NIPS 2012
- Scaling Inference for Markov Logic via Dual Decomposition (Short Paper). ICDM 2012
- Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference IJSWIS, Special Issue on Knowledge Extraction from the Web, 2012, to appear 2012
- Worst-case Optimal Join Algorithms PODS 2012
- The MADlib Analytics Library or MAD Skills, the SQL. PVLDB 2012
- Probabilistic Management of OCR using an RDBMS 2012
- Towards a Unified Architecture for In-Database Analytics 2012
- Optimizing Statistical Information Extraction Programs Over Evolving Text 2012
- Understanding cardinality estimation using entropy maximization ACM Trans. Database Syst. 2012; 37: 6
- Probabilistic Databases 2011
- Automatic Optimization for MapReduce Programs PVLDB 2011; 4: 385-396
- Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS PVLDB 2011; 4: 373-384
- Queries and materialized views on probabilistic databases J. Comput. Syst. Sci. 2011; 77: 473-490
- Felix: Scaling Inference for Markov Logic with an Operator-based Approach ArXiv e-prints 2011
- Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent NIPS 2011
- Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion Optimization Online 2011
- Incrementally maintaining classification using an RDBMS PVLDB 2011; 4: 302-313
- Manimal: Relational Optimization for Data-Intensive Programs WebDB 2010
- Approximation Trade-Offs in a Markovian Stream Warehouse: An Empirical Study (Short Paper) ICDE 2010
- Understanding Cardinality Estimation using Entropy Maximization PODS 2010
- Transducing Markov Sequences PODS 2010
- Query Containment of Tier-2 Queries over a Probabilistic Database Management of Uncertain Databases (MUD) 2009
- Lahar Demonstration: Warehousing Markovian Streams PVLDB 2009; 2: 1610-1613
- Large-Scale Deduplication with Constraints Using Dedupalog ICDE 2009: 952-963
- The Trichotomy of HAVING Queries on a Probabilistic Database VLDB Journal 2009
- Access Methods for Markovian Streams ICDE 2009: 246-257
- Probabilistic databases: Diamonds in the dirt Commun. ACM 2009; 52: 86-94
- General Database Statistics Using Entropy Maximization DBPL 2009: 84-99
- Managing Large-Scale Probabilistic Databases University of Washington, Seattle 2009
- Repeatability & Workability Evaluation of SIGMOD 2009 SIGMOD Record 2009; 38: 40-43
- Implementing NOT EXISTS Predicates over a Probabilistic Database QDB/MUD 2008: 73-86
- A demonstration of Cascadia through a digital diary application 2008
- Managing Probabilistic Data with Mystiq (Plenary Talk) 2008
- Systems aspects of probabilistic data management (Part II) PVLDB 2008; 1: 1520-1521
- Systems aspects of probabilistic data management (Part I) PVLDB 2008; 1: 1520-1521
- Approximate lineage for probabilistic databases PVLDB 2008; 1: 797-808
- Managing Probabilistic Data with MystiQ: The Can-Do, the Could-Do, and the Can't-Do SUM 2008: 5-18
- Event queries on correlated probabilistic streams 2008
- Advances in Processing SQL Queries on Probabilistic Data Invited Abstract in INFORMS 2008, Simulation. 2008
- Challenges for Event Queries over Markovian Streams IEEE Internet Computing 2008; 12: 30-36
- Structured Querying of Web Text Data: A Technical Challenge CIDR 2007: 225-234
- Managing Uncertainty in Social Networks IEEE Data Eng. Bull. 2007; 30: 15-22
- Materialized Views in Probabilistic Databases for Information Exchange and Query Optimization VLDB 2007: 51-62
- Efficient Top-k Query Evaluation on Probabilistic Data ICDE 2007: 886-895
- Efficient Evaluation of HAVING Queries DBPL 2007: 186-200
- Management of data with uncertainties CIKM 2007: 3-8
- Orderings on Annotated Collections Liber Amicorum in honor of Jan Paredaens 60th Birthday 2007
- A Complete and Efficient Algebraic Compiler for XQuery ICDE 2006: 14
- XQuery!: An XML Query Language with Side Effects 2006
- Query Evaluation on Probabilistic Databases IEEE Data Eng. Bull. 2006; 29: 25-31
- MYSTIQ: a system for finding more answers by using probabilities 2005
- A Framework for XML-Based Integration of Data, Visualization and Analysis in a Biomedical Domain XSym 2005: 207-221
- Supporting workflow in a course management system SIGCSE 2005: 262-266
- Distributed XQuery 2004
- WS-Membership - Failure Management in a Web-Services World WWW (Alternate Paper Tracks) 2003
- A Collaborative Infrastructure for Scalable and Robust News Delivery 2002
-
Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale.
Proceedings. ACM-Sigmod International Conference on Management of Data
2019; 2019: 362–75
Abstract
Labeling training data is one of the most costly bottlenecks in developing machine learning-based applications. We present a first-of-its-kind study showing how existing knowledge resources from across an organization can be used as weak supervision in order to bring development time and cost down by an order of magnitude, and introduce Snorkel DryBell, a new weak supervision management system for this setting. Snorkel DryBell builds on the Snorkel framework, extending it in three critical aspects: flexible, template-based ingestion of diverse organizational knowledge, cross-feature production serving, and scalable, sampling-free execution. On three classification tasks at Google, we find that Snorkel DryBell creates classifiers of comparable quality to ones trained with tens of thousands of hand-labeled examples, converts non-servable organizational resources to servable models for an average 52% performance improvement, and executes over millions of data points in tens of minutes.
View details for DOI 10.1145/3299869.3314036
View details for PubMedID 31777414
View details for PubMedCentralID PMC6879379