Bio
Jure Leskovec is Professor of Computer Science at Stanford University. He is affiliated with the Stanford AI Lab, the Machine Learning Group, and the Center for Research on Foundation Models. In the past, he served as Chief Scientist at Pinterest and was an investigator at Chan Zuckerberg BioHub. Leskovec recently pioneered the field of Graph Neural Networks and co-authored PyG, the most widely used graph neural network library. Research from his group has been used by many countries to fight the COVID-19 pandemic, and has been incorporated into products at Facebook, Pinterest, Uber, YouTube, Amazon, and more.
His research has received several awards, including the Microsoft Research Faculty Fellowship in 2011, the Okawa Research Award in 2012, the Alfred P. Sloan Fellowship in 2012, the Lagrange Prize in 2015, and the ICDM Research Contributions Award in 2019. His research contributions have spanned social networks, data mining and machine learning, and computational biomedicine with a focus on drug discovery. His work has won 12 best paper awards and five 10-year test-of-time awards at premier venues in these research areas.
Leskovec received his bachelor's degree in computer science from the University of Ljubljana, Slovenia, his PhD in machine learning from Carnegie Mellon University, and postdoctoral training at Cornell University.
Academic Appointments
-
Professor, Computer Science
-
Member, Bio-X
-
Faculty Affiliate, Institute for Human-Centered Artificial Intelligence (HAI)
-
Member, Wu Tsai Neurosciences Institute
Professional Education
-
BSc, University of Ljubljana, Slovenia, Computer Science (2004)
-
PhD, Carnegie Mellon University, Computer Science (2008)
2024-25 Courses
- Machine Learning with Graphs
CS 224W (Aut)
- Mining Massive Data Sets
CS 246 (Win)
Independent Studies (19)
- Advanced Reading and Research
CS 499 (Aut, Win, Spr, Sum)
- Advanced Reading and Research
CS 499P (Aut, Win, Spr, Sum)
- Curricular Practical Training
CS 390A (Aut, Win, Spr, Sum)
- Curricular Practical Training
CS 390B (Aut, Win, Spr, Sum)
- Curricular Practical Training
CS 390C (Aut, Win, Spr, Sum)
- DDRL Independent Study-Work with Adviser
DDRL 191 (Aut, Win, Spr)
- Directed Investigation
BIOE 392 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIOMEDIN 299 (Aut, Win, Spr, Sum)
- Directed Study
BIOE 391 (Aut, Win, Spr, Sum)
- Independent Project
CS 399 (Aut, Win, Spr, Sum)
- Independent Project
CS 399P (Aut, Win, Spr, Sum)
- Independent Work
CS 199 (Aut, Win, Spr, Sum)
- Independent Work
CS 199P (Aut, Win, Spr, Sum)
- Master's Research
CME 291 (Aut, Win, Spr, Sum)
- Part-time Curricular Practical Training
CS 390D (Aut, Win, Spr, Sum)
- Programming Service Project
CS 192 (Aut, Win, Spr, Sum)
- Senior Project
CS 191 (Aut, Win, Spr, Sum)
- Supervised Undergraduate Research
CS 195 (Aut, Win, Spr, Sum)
- Writing Intensive Senior Research Project
CS 191W (Aut, Win, Spr)
Prior Year Courses
2023-24 Courses
- Machine Learning with Graphs
CS 224W (Aut)
- Mining Massive Data Sets
CS 246 (Win)
2022-23 Courses
- Machine Learning with Graphs
CS 224W (Win)
- Mining Massive Data Sets
CS 246 (Spr)
2021-22 Courses
- Machine Learning with Graphs
CS 224W (Aut)
- Mining Massive Data Sets
CS 246 (Win)
Stanford Advisees
-
Doctoral Dissertation Reader (AC)
Kristy Carpenter, Jessica Kain, Michael Wornow
-
Postdoctoral Faculty Sponsor
Vijay Prakash Dwivedi
-
Master's Program Advisor
Michael Atkin, Andrew Cheng, Pino Cholsaipant, Thomas Hatcher, Priya Khandelwal, Gabe Magaña, Nathan Maidi, Alex Rivas, Josh Singh, Ron Wang, Mike Yang, Hollie Zheng
-
Doctoral Dissertation Co-Advisor (AC)
Jared Davis, Qian Huang, Minkai Xu
-
Doctoral (Program)
Michael Bereket, Kexin Huang, Hamed Nilforoshan, Rishabh Ranjan, Marcel Roed, Yanay Rosen, Shirley Wu
All Publications
-
Mutual interactors as a principle for phenotype discovery in molecular interaction networks.
Pacific Symposium on Biocomputing
2023; 28: 61-72
Abstract
Biological networks are powerful representations for the discovery of molecular phenotypes. Fundamental to network analysis is the principle, rooted in social networks, that nodes that interact in the network tend to have similar properties. While this long-standing principle underlies powerful methods in biology that associate molecules with phenotypes on the basis of network proximity, interacting molecules are not necessarily similar, and molecules with similar properties do not necessarily interact. Here, we show that molecules are more likely to have similar phenotypes, not if they directly interact in a molecular network, but if they interact with the same molecules. We call this the mutual interactor principle and show that it holds for several kinds of molecular networks, including protein-protein interaction, genetic interaction, and signaling networks. We then develop a machine learning framework for predicting molecular phenotypes on the basis of mutual interactors. Strikingly, the framework can predict drug targets, disease proteins, and protein functions in different species, and it performs better than much more complex algorithms. The framework is robust to incomplete biological data and is capable of generalizing to phenotypes it has not seen during training. Our work represents a network-based predictive platform for phenotypic characterization of biological molecules.
View details for PubMedID 36540965
-
Foundation models for generalist medical artificial intelligence.
Nature
2023; 616 (7956): 259-265
Abstract
The exceptionally rapid development of highly flexible, reusable artificial intelligence (AI) models is likely to usher in newfound capabilities in medicine. We propose a new paradigm for medical AI, which we refer to as generalist medical AI (GMAI). GMAI models will be capable of carrying out a diverse set of tasks using very little or no task-specific labelled data. Built through self-supervision on large, diverse datasets, GMAI will flexibly interpret different combinations of medical modalities, including data from imaging, electronic health records, laboratory results, genomics, graphs or medical text. Models will in turn produce expressive outputs such as free-text explanations, spoken recommendations or image annotations that demonstrate advanced medical reasoning abilities. Here we identify a set of high-impact potential applications for GMAI and lay out specific technical capabilities and training datasets necessary to enable them. We expect that GMAI-enabled applications will challenge current strategies for regulating and validating AI devices for medicine and will shift practices associated with the collection of large medical datasets.
View details for DOI 10.1038/s41586-023-05881-4
View details for PubMedID 37045921
View details for PubMedCentralID 9792464
-
Hybrid forecasting of geopolitical events
AI MAGAZINE
2023; 44 (1): 112-128
View details for DOI 10.1002/aaai.12085
View details for Web of Science ID 000963102400010
-
M2P2: Multimodal Persuasion Prediction Using Adaptive Fusion
IEEE TRANSACTIONS ON MULTIMEDIA
2023; 25: 942-952
View details for DOI 10.1109/TMM.2021.3134168
View details for Web of Science ID 000961977900020
-
Annotation of spatially resolved single-cell data with STELLAR.
Nature methods
2022
Abstract
Accurate cell-type annotation from spatially resolved single cells is crucial to understand functional spatial biology that is the basis of tissue organization. However, current computational methods for annotating spatially resolved single-cell data are typically based on techniques established for dissociated single-cell technologies and thus do not take spatial organization into account. Here we present STELLAR, a geometric deep learning method for cell-type discovery and identification in spatially resolved single-cell datasets. STELLAR automatically assigns cells to cell types present in the annotated reference dataset and discovers novel cell types and cell states. STELLAR transfers annotations across different dissection regions, different tissues and different donors, and learns cell representations that capture higher-order tissue structures. We successfully applied STELLAR to CODEX multiplexed fluorescent microscopy data and multiplexed RNA imaging datasets. Within the Human BioMolecular Atlas Program, STELLAR has annotated 2.6 million spatially resolved single cells with dramatic time savings.
View details for DOI 10.1038/s41592-022-01651-8
View details for PubMedID 36280720
-
Combining Graph Convolutional Neural Networks and Label Propagation
ACM TRANSACTIONS ON INFORMATION SYSTEMS
2022; 40 (4)
View details for DOI 10.1145/3490478
View details for Web of Science ID 000796738200010
-
Artificial intelligence foundation for therapeutic science.
Nature chemical biology
2022
View details for DOI 10.1038/s41589-022-01131-2
View details for PubMedID 36131149
-
Fly Cell Atlas: A single-nucleus transcriptomic atlas of the adult fruit fly.
Science (New York, N.Y.)
2022; 375 (6584): eabk2432
Abstract
For more than 100 years, the fruit fly Drosophila melanogaster has been one of the most studied model organisms. Here, we present a single-cell atlas of the adult fly, Tabula Drosophilae, that includes 580,000 nuclei from 15 individually dissected sexed tissues as well as the entire head and body, annotated to >250 distinct cell types. We provide an in-depth analysis of cell type-related gene signatures and transcription factor markers, as well as sexual dimorphism, across the whole animal. Analysis of common cell types between tissues, such as blood and muscle cells, reveals rare cell types and tissue-specific subtypes. This atlas provides a valuable resource for the Drosophila community and serves as a reference to study genetic perturbations and disease models at single-cell resolution.
View details for DOI 10.1126/science.abk2432
View details for PubMedID 35239393
-
Guest Editorial: Non-Euclidean Machine Learning
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
2022; 44 (2): 723-726
View details for DOI 10.1109/TPAMI.2021.3129857
View details for Web of Science ID 000740006100014
-
Large-scale diet tracking data reveal disparate associations between food environment and diet.
Nature communications
2022; 13 (1): 267
Abstract
An unhealthy diet is a major risk factor for chronic diseases including cardiovascular disease, type 2 diabetes, and cancer [1-4]. Limited access to healthy food options may contribute to unhealthy diets [5,6]. Studying diets is challenging, typically restricted to small sample sizes, single locations, and non-uniform design across studies, and has led to mixed results on the impact of the food environment [7-23]. Here we leverage smartphones to track diet health, operationalized through the self-reported consumption of fresh fruits and vegetables, fast food and soda, as well as body-mass index status in a country-wide observational study of 1,164,926 U.S. participants (MyFitnessPal app users) and 2.3 billion food entries to study the independent contributions of fast food and grocery store access, income and education to diet health outcomes. This study constitutes the largest nationwide study examining the relationship between the food environment and diet to date. We find that higher access to grocery stores, lower access to fast food, higher income and college education are independently associated with higher consumption of fresh fruits and vegetables, lower consumption of fast food and soda, and lower likelihood of being affected by overweight and obesity. However, these associations vary significantly across zip codes with predominantly Black, Hispanic or white populations. For instance, high grocery store access has a significantly larger association with higher fruit and vegetable consumption in zip codes with predominantly Hispanic populations (7.4% difference) and Black populations (10.2% difference) in contrast to zip codes with predominantly white populations (1.7% difference). Policy targeted at improving food access, income and education may increase healthy eating, but intervention allocation may need to be optimized for specific subpopulations and locations.
View details for DOI 10.1038/s41467-021-27522-y
View details for PubMedID 35042849
-
LinkBERT: Pretraining Language Models with Document Links
ASSOC COMPUTATIONAL LINGUISTICS-ACL. 2022: 8003-8016
View details for Web of Science ID 000828702308009
-
Data-Driven Real-Time Strategic Placement of Mobile Vaccine Distribution Sites
ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE. 2022: 12573-12579
View details for Web of Science ID 000893639105074
-
Companies under stress: the impact of shocks on the production network
EPJ DATA SCIENCE
2021; 10 (1)
View details for DOI 10.1140/epjds/s13688-021-00310-w
View details for Web of Science ID 000728537800001
-
Companies under stress: the impact of shocks on the production network.
EPJ data science
2021; 10 (1): 57
Abstract
In this paper we analyze the effect of shocks in production networks. Our work is based on a rich dataset that contains information about companies from Slovenia right after the financial crisis of 2008. The processed data spans for 8 years and covers the transaction history as well as performance indicators and various metadata of the companies. We define sales shocks at different levels, and identify companies impacted by them. Next we investigate stress, the potential immediate upstream and downstream impact of a shock within the production network. We base our main findings on a matched pairs analysis of stressed companies. We find that both shock and stress are associated with reporting bankruptcy in the future and that stress foremost impacts the future sales of customers. Furthermore, we find evidence that stress not only results in performance losses but the reconfiguration of the production network as well. We show that stressed companies actively seek for new trading partners, and that these new links often share the industry of the shocked company. These results suggest that both stressed customers and suppliers react quickly to stress and adjust their trading relationships.
View details for DOI 10.1140/epjds/s13688-021-00310-w
View details for PubMedID 34966638
View details for PubMedCentralID PMC8660722
-
Postmortem memory of public figures in news and social media.
Proceedings of the National Academy of Sciences of the United States of America
2021; 118 (38)
Abstract
Deceased public figures are often said to live on in collective memory. We quantify this phenomenon by tracking mentions of 2,362 public figures in English-language online news and social media (Twitter) 1 y before and after death. We measure the sharp spike and rapid decay of attention following death and model collective memory as a composition of communicative and cultural memory. Clustering reveals four patterns of postmortem memory, and regression analysis shows that boosts in media attention are largest for premortem popular anglophones who died a young, unnatural death; that long-term boosts are smallest for leaders and largest for artists; and that, while both the news and Twitter are triggered by young and unnatural deaths, the news additionally curates collective memory when old persons or leaders die. Overall, we illuminate the age-old question of who is remembered by society, and the distinct roles of news and social media in collective memory formation.
View details for DOI 10.1073/pnas.2106152118
View details for PubMedID 34526401
-
Leveraging the Cell Ontology to classify unseen cell types.
Nature communications
2021; 12 (1): 5556
Abstract
Single cell technologies are rapidly generating large amounts of data that enables us to understand biological systems at single-cell resolution. However, joint analysis of datasets generated by independent labs remains challenging due to a lack of consistent terminology to describe cell types. Here, we present OnClass, an algorithm and accompanying software for automatically classifying cells into cell types that are part of the controlled vocabulary that forms the Cell Ontology. A key advantage of OnClass is its capability to classify cells into cell types not present in the training data because it uses the Cell Ontology graph to infer cell type relationships. Furthermore, OnClass can be used to identify marker genes for all the cell ontology categories, regardless of whether the cell types are present or absent in the training data, suggesting that OnClass goes beyond a simple annotation tool for single cell datasets, being the first algorithm capable to identify marker genes specific to all terms of the Cell Ontology and offering the possibility of refining the Cell Ontology using a data-centric approach.
View details for DOI 10.1038/s41467-021-25725-x
View details for PubMedID 34548483
-
Daily, weekly, seasonal and menstrual cycles in women's mood, behaviour and vital signs.
Nature human behaviour
2021
Abstract
Dimensions of human mood, behaviour and vital signs cycle over multiple timescales. However, it remains unclear which dimensions are most cyclical, and how daily, weekly, seasonal and menstrual cycles compare in magnitude. The menstrual cycle remains particularly understudied because, not being synchronized across the population, it will be averaged out unless menstrual cycles can be aligned before analysis. Here, we analyse 241 million observations from 3.3 million women across 109 countries, tracking 15 dimensions of mood, behaviour and vital signs using a women's health mobile app. Out of the daily, weekly, seasonal and menstrual cycles, the menstrual cycle had the greatest magnitude for most of the measured dimensions of mood, behaviour and vital signs. Mood, vital signs and sexual behaviour vary most substantially over the course of the menstrual cycle, while sleep and exercise behaviour remain more constant. Menstrual cycle effects are directionally consistent across countries.
View details for DOI 10.1038/s41562-020-01046-9
View details for PubMedID 33526880
-
Temporal evolution of single-cell transcriptomes of Drosophila olfactory projection neurons.
eLife
2021; 10
Abstract
Neurons undergo substantial morphological and functional changes during development to form precise synaptic connections and acquire specific physiological properties. What are the underlying transcriptomic bases? Here, we obtained the single-cell transcriptomes of Drosophila olfactory projection neurons (PNs) at four developmental stages. We decoded the identity of 21 transcriptomic clusters corresponding to 20 PN types and developed methods to match transcriptomic clusters representing the same PN type across development. We discovered that PN transcriptomes reflect unique biological processes unfolding at each stage: neurite growth and pruning during metamorphosis at an early pupal stage; peaked transcriptomic diversity during olfactory circuit assembly at mid-pupal stages; and neuronal signaling in adults. At early developmental stages, PN types with adjacent birth order share similar transcriptomes. Together, our work reveals principles of cellular diversity during brain development and provides a resource for future studies of neural development in PNs and other neuronal types.
View details for DOI 10.7554/eLife.63450
View details for PubMedID 33427646
-
WILDS: A Benchmark of in-the-Wild Distribution Shifts
JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2021
View details for Web of Science ID 000683104605062
-
LM-Critic: Language Models for Unsupervised Grammatical Error Correction
ASSOC COMPUTATIONAL LINGUISTICS-ACL. 2021: 7752-7763
View details for Web of Science ID 000860727001059
-
Bipartite Dynamic Representations for Abuse Detection
ASSOC COMPUTING MACHINERY. 2021: 3638-3648
View details for DOI 10.1145/3447548.3467141
View details for Web of Science ID 000749556803068
-
Relational Message Passing for Knowledge Graph Completion
ASSOC COMPUTING MACHINERY. 2021: 1697-1707
View details for DOI 10.1145/3447548.3467247
View details for Web of Science ID 000749556801072
-
Supporting COVID-19 policy response with large-scale mobility-based modeling
ASSOC COMPUTING MACHINERY. 2021: 2632-2642
View details for DOI 10.1145/3447548.3467182
View details for Web of Science ID 000749556802066
-
F-FADE: Frequency Factorization for Anomaly Detection in Edge Streams
ASSOC COMPUTING MACHINERY. 2021: 589-597
View details for DOI 10.1145/3437963.3441806
View details for Web of Science ID 000810499000069
-
Identity-aware Graph Neural Networks
ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE. 2021: 10737-10745
View details for Web of Science ID 000681269802048
-
LEGO: Latent Execution-Guided Reasoning for Multi-Hop Question Answering on Knowledge Graphs
JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2021
View details for Web of Science ID 000768182705011
-
Inductive Learning on Commonsense Knowledge Graph Completion
IEEE. 2021
View details for DOI 10.1109/IJCNN52387.2021.9534355
View details for Web of Science ID 000722581708050
-
GNNAutoScale: Scalable and Expressive Graph Neural Networks via Historical Embeddings
JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2021
View details for Web of Science ID 000683104603029
-
TEDIC: Neural Modeling of Behavioral Patterns in Dynamic Social Interaction Networks
ASSOC COMPUTING MACHINERY. 2021: 693-705
View details for DOI 10.1145/3442381.3450096
View details for Web of Science ID 000733621800062
-
Maximally selective single-cell target for circuit control in epilepsy models.
Neuron
2021
Abstract
Neurological and psychiatric disorders are associated with pathological neural dynamics. The fundamental connectivity patterns of cell-cell communication networks that enable pathological dynamics to emerge remain unknown. Here, we studied epileptic circuits using a newly developed computational pipeline that leveraged single-cell calcium imaging of larval zebrafish and chronically epileptic mice, biologically constrained effective connectivity modeling, and higher-order motif-focused network analysis. We uncovered a novel functional cell type that preferentially emerged in the preseizure state, the superhub, that was unusually richly connected to the rest of the network through feedforward motifs, critically enhancing downstream excitation. Perturbation simulations indicated that disconnecting superhubs was significantly more effective in stabilizing epileptic circuits than disconnecting hub cells that were defined traditionally by connection count. In the dentate gyrus of chronically epileptic mice, superhubs were predominately modeled adult-born granule cells. Collectively, these results predict a new maximally selective and minimally invasive cellular target for seizure control.
View details for DOI 10.1016/j.neuron.2021.06.007
View details for PubMedID 34197732
-
Identification of disease treatment mechanisms through the multiscale interactome.
Nature communications
2021; 12 (1): 1796
Abstract
Most diseases disrupt multiple proteins, and drugs treat such diseases by restoring the functions of the disrupted proteins. How drugs restore these functions, however, is often unknown as a drug's therapeutic effects are not limited to the proteins that the drug directly targets. Here, we develop the multiscale interactome, a powerful approach to explain disease treatment. We integrate disease-perturbed proteins, drug targets, and biological functions into a multiscale interactome network. We then develop a random walk-based method that captures how drug effects propagate through a hierarchy of biological functions and physical protein-protein interactions. On three key pharmacological tasks, the multiscale interactome predicts drug-disease treatment, identifies proteins and biological functions related to treatment, and predicts genes that alter a treatment's efficacy and adverse reactions. Our results indicate that physical interactions between proteins alone cannot explain treatment since many drugs treat diseases by affecting the biological functions disrupted by the disease rather than directly targeting disease proteins or their regulators. We provide a general framework for explaining treatment, even when drugs seem unrelated to the diseases they are recommended for.
View details for DOI 10.1038/s41467-021-21770-8
View details for PubMedID 33741907
-
Single-cell transcriptomes of developing and adult olfactory receptor neurons in Drosophila.
eLife
2021; 10
Abstract
Recognition of environmental cues is essential for the survival of all organisms. Transcriptional changes occur to enable the generation and function of the neural circuits underlying sensory perception. To gain insight into these changes, we generated single-cell transcriptomes of Drosophila olfactory- (ORNs), thermo-, and hygro-sensory neurons at an early developmental and adult stage using single-cell and single-nucleus RNA sequencing. We discovered that ORNs maintain expression of the same olfactory receptors across development. Using receptor expression and computational approaches, we matched transcriptomic clusters corresponding to anatomically and physiologically defined neuron types across multiple developmental stages. We found that cell-type-specific transcriptomes partly reflected axon trajectory choices in development and sensory modality in adults. We uncovered stage-specific genes that could regulate the wiring and sensory responses of distinct ORN types. Collectively, our data reveal transcriptomic features of sensory neuron biology and provide a resource for future studies of their development and physiology.
View details for DOI 10.7554/eLife.63856
View details for PubMedID 33555999
-
An algorithmic approach to reducing unexplained pain disparities in underserved populations.
Nature medicine
2021; 27 (1): 136–40
Abstract
Underserved populations experience higher levels of pain. These disparities persist even after controlling for the objective severity of diseases like osteoarthritis, as graded by human physicians using medical images, raising the possibility that underserved patients' pain stems from factors external to the knee, such as stress. Here we use a deep learning approach to measure the severity of osteoarthritis, by using knee X-rays to predict patients' experienced pain. We show that this approach dramatically reduces unexplained racial disparities in pain. Relative to standard measures of severity graded by radiologists, which accounted for only 9% (95% confidence interval (CI), 3-16%) of racial disparities in pain, algorithmic predictions accounted for 43% of disparities, or 4.7× more (95% CI, 3.2-11.8×), with similar results for lower-income and less-educated patients. This suggests that much of underserved patients' pain stems from factors within the knee not reflected in standard radiographic measures of severity. We show that the algorithm's ability to reduce unexplained disparities is rooted in the racial and socioeconomic diversity of the training set. Because algorithmic severity measures better capture underserved patients' pain, and severity measures influence treatment decisions, algorithmic predictions could potentially redress disparities in access to treatments like arthroplasty.
View details for DOI 10.1038/s41591-020-01192-7
View details for PubMedID 33442014
-
Mobility network models of COVID-19 explain inequities and inform reopening.
Nature
2020
Abstract
The COVID-19 pandemic dramatically changed human mobility patterns, necessitating epidemiological models which capture the effects of changes in mobility on virus spread [1]. We introduce a metapopulation SEIR model that integrates fine-grained, dynamic mobility networks to simulate the spread of SARS-CoV-2 in 10 of the largest US metropolitan statistical areas. Derived from cell phone data, our mobility networks map the hourly movements of 98 million people from neighborhoods (census block groups, or CBGs) to points of interest (POIs) such as restaurants and religious establishments, connecting 57k CBGs to 553k POIs with 5.4 billion hourly edges. We show that by integrating these networks, a relatively simple SEIR model can accurately fit the real case trajectory, despite substantial changes in population behavior over time. Our model predicts that a small minority of "superspreader" POIs account for a large majority of infections and that restricting maximum occupancy at each POI is more effective than uniformly reducing mobility. Our model also correctly predicts higher infection rates among disadvantaged racial and socioeconomic groups [2-8] solely from differences in mobility: we find that disadvantaged groups have not been able to reduce mobility as sharply, and that the POIs they visit are more crowded and therefore higher-risk. By capturing who is infected at which locations, our model supports detailed analyses that can inform more effective and equitable policy responses to COVID-19.
View details for DOI 10.1038/s41586-020-2923-3
View details for PubMedID 33171481
-
MARS: discovering novel cell types across heterogeneous single-cell experiments.
Nature methods
2020
Abstract
Although tremendous effort has been put into cell-type annotation, identification of previously uncharacterized cell types in heterogeneous single-cell RNA-seq data remains a challenge. Here we present MARS, a meta-learning approach for identifying and annotating known as well as new cell types. MARS overcomes the heterogeneity of cell types by transferring latent cell representations across multiple datasets. MARS uses deep learning to learn a cell embedding function as well as a set of landmarks in the cell embedding space. The method has a unique ability to discover cell types that have never been seen before and annotate experiments that are as yet unannotated. We apply MARS to a large mouse cell atlas and show its ability to accurately identify cell types, even when it has never seen them before. Further, MARS automatically generates interpretable names for new cell types by probabilistically defining a cell type in the embedding space.
View details for DOI 10.1038/s41592-020-00979-3
View details for PubMedID 33077966
-
Gender Differences in Patient Perceptions of Physicians' Communal Traits and the Impact on Physician Evaluations.
Journal of women's health (2002)
2020
Abstract
Background: Communal traits, such as empathy, warmth, and consensus-building, are not highly valued in the medical hierarchy. Devaluing communal traits is potentially harmful for two reasons. First, data suggest that patients may prefer when physicians show communal traits. Second, if female physicians are more likely to be perceived as communal, devaluing communal traits may increase the gender inequity already prevalent in medicine. We test for both these effects. Materials and Methods: This study analyzed 22,431 Press Ganey outpatient surveys assessing 480 physicians collected from 2016 to 2017 at a large tertiary hospital. The surveys asked patients to provide qualitative comments and quantitative Likert-scale ratings assessing physician effectiveness. We coded whether patients described physicians with "communal" language using a validated word scale derived from previous work. We used multivariate logistic regressions to assess whether (1) patients were more likely to describe female physicians using communal language and (2) patients gave higher quantitative ratings to physicians they described with communal language, when controlling for physician, patient, and comment characteristics. Results: Female physicians had higher odds of being described with communal language than male physicians (odds ratio 1.29, 95% confidence interval 1.18-1.40, p < 0.001). In addition, patients gave higher quantitative ratings to physicians they described with communal language. These results were robust to inclusion of controls. Conclusions: Female physicians are more likely to be perceived as communal. Being perceived as communal is associated with higher quantitative ratings, including likelihood to recommend. Our study indicates a need to reevaluate what types of behaviors academic hospitals reward in their physicians.
View details for DOI 10.1089/jwh.2019.8233
View details for PubMedID 32857642
-
OCEAN: Online Task Inference for Compositional Tasks with Context Adaptation
JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2020: 1378-1387
View details for Web of Science ID 000723388600139
-
Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities
INFORMATION FUSION
2019; 50: 71–91
View details for DOI 10.1016/j.inffus.2018.09.012
View details for Web of Science ID 000466056900007
-
Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities.
An international journal on information fusion
2019; 50: 71–91
Abstract
New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include myriad properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.
View details for PubMedID 30467459
-
Pretraining deep learning molecular representations for property prediction
AMER CHEMICAL SOC. 2019
View details for Web of Science ID 000525055503355
-
Best practices for analyzing large-scale health data from wearables and smartphone apps.
NPJ digital medicine
2019; 2: 45
Abstract
Smartphone apps and wearable devices for tracking physical activity and other health behaviors have become popular in recent years and provide a largely untapped source of data about health behaviors in the free-living environment. The data are large in scale, collected at low cost in the "wild", and often recorded in an automatic fashion, providing a powerful complement to traditional surveillance studies and controlled trials. These data are helping to reveal, for example, new insights about environmental and social influences on physical activity. The observational nature of the datasets and collection via commercial devices and apps pose challenges, however, including the potential for measurement, population, and/or selection bias, as well as missing data. In this article, we review insights gleaned from these datasets and propose best practices for addressing the limitations of large-scale data from apps and wearables. Our goal is to enable researchers to effectively harness the data from smartphone apps and wearable devices to better understand what drives physical activity and other health behaviors.
View details for DOI 10.1038/s41746-019-0121-1
View details for PubMedID 31304391
View details for PubMedCentralID PMC6550237
-
Best practices for analyzing large-scale health data from wearables and smartphone apps
NPJ DIGITAL MEDICINE
2019; 2
View details for DOI 10.1038/s41746-019-0121-1
View details for Web of Science ID 000470039200002
-
To Embed or Not: Network Embedding as a Paradigm in Computational Biology
FRONTIERS IN GENETICS
2019; 10
View details for DOI 10.3389/fgene.2019.00381
View details for Web of Science ID 000466541400001
-
Goal-setting And Achievement In Activity Tracking Apps: A Case Study Of MyFitnessPal.
Proceedings of the ... International World-Wide Web Conference. International WWW Conference
2019; 2019: 571-582
Abstract
Activity tracking apps often make use of goals as one of their core motivational tools. There are two critical components to this tool: setting a goal, and subsequently achieving that goal. Despite its crucial role in how a number of prominent self-tracking apps function, there has been relatively little investigation of the goal-setting and achievement aspects of self-tracking apps. Here we explore this issue, investigating a particular goal-setting and achievement process that is extensive, recorded, and crucial for both the app and its users' success: weight loss goals in MyFitnessPal. We present a large-scale study of 1.4 million users and weight loss goals, allowing for an unprecedentedly detailed view of how people set and achieve their goals. We find that, even for difficult long-term goals, behavior within the first 7 days predicts those who ultimately achieve their goals, that is, those who lose at least as much weight as they set out to, and those who do not. For instance, high amounts of early weight loss, which some researchers have classified as unsustainable, lead to higher goal achievement rates. We also show that early food intake, self-monitoring motivation, and attitude towards the goal are important factors. We then show that we can use our findings to predict goal achievement with an accuracy of 79% ROC AUC just 7 days after a goal is set. Finally, we discuss how our findings could inform steps to improve goal achievement in self-tracking apps.
View details for DOI 10.1145/3308558.3313432
View details for PubMedID 32368761
View details for PubMedCentralID PMC7197296
-
To Embed or Not: Network Embedding as a Paradigm in Computational Biology.
Frontiers in genetics
2019; 10: 381
Abstract
Current technology is producing high throughput biomedical data at an ever-growing rate. A common approach to interpreting such data is through network-based analyses. Since biological networks are notoriously complex and hard to decipher, a growing body of work applies graph embedding techniques to simplify, visualize, and facilitate the analysis of the resulting networks. In this review, we survey traditional and new approaches for graph embedding and compare their application to fundamental problems in network biology with using the networks directly. We consider a broad variety of applications including protein network alignment, community detection, and protein function prediction. We find that in all of these domains both types of approaches are of value and their performance depends on the evaluation measures being used and the goal of the project. In particular, network embedding methods outshine direct methods according to some of those measures and are, thus, an essential tool in bioinformatics research.
View details for DOI 10.3389/fgene.2019.00381
View details for PubMedID 31118945
View details for PubMedCentralID PMC6504708
-
Inferring Multidimensional Rates of Aging from Cross-Sectional Data.
Proceedings of machine learning research
2019; 89: 97–107
Abstract
Modeling how individuals evolve over time is a fundamental problem in the natural and social sciences. However, existing datasets are often cross-sectional with each individual observed only once, making it impossible to apply traditional time-series methods. Motivated by the study of human aging, we present an interpretable latent-variable model that learns temporal dynamics from cross-sectional data. Our model represents each individual's features over time as a nonlinear function of a low-dimensional, linearly-evolving latent state. We prove that when this nonlinear function is constrained to be order-isomorphic, the model family is identifiable solely from cross-sectional data provided the distribution of time-independent variation is known. On the UK Biobank human health dataset, our model reconstructs the observed data while learning interpretable rates of aging associated with diseases, mortality, and aging risk factors.
View details for PubMedID 31538144
-
Evolution of resilience in protein interactomes across the tree of life.
Proceedings of the National Academy of Sciences of the United States of America
2019
Abstract
Phenotype robustness to environmental fluctuations is a common biological phenomenon. Although most phenotypes involve multiple proteins that interact with each other, the basic principles of how such interactome networks respond to environmental unpredictability and change during evolution are largely unknown. Here we study interactomes of 1,840 species across the tree of life involving a total of 8,762,166 protein-protein interactions. Our study focuses on the resilience of interactomes to network failures and finds that interactomes become more resilient during evolution, meaning that interactomes become more robust to network failures over time. In bacteria, we find that a more resilient interactome is in turn associated with the greater ability of the organism to survive in a more complex, variable, and competitive environment. We find that at the protein family level proteins exhibit a coordinated rewiring of interactions over time and that a resilient interactome arises through gradual change of the network topology. Our findings have implications for understanding molecular network structure in the context of both evolution and environment.
View details for PubMedID 30765515
-
Panel: Computational Methods about Knowledge Graph The First International Workshop on Knowledge Graph Technology and Applications
ASSOC COMPUTING MACHINERY. 2019: 677
View details for DOI 10.1145/3308560.3317712
View details for Web of Science ID 000474353100105
-
Faithful and Customizable Explanations of Black Box Models
ASSOC COMPUTING MACHINERY. 2019: 131–38
View details for DOI 10.1145/3306618.3314229
View details for Web of Science ID 000556121100019
-
Hyperbolic Graph Convolutional Neural Networks.
Advances in neural information processing systems
2019; 32: 4869–80
Abstract
Graph convolutional neural networks (GCNs) embed nodes in a graph into Euclidean space, which has been shown to incur a large distortion when embedding real-world graphs with scale-free or hierarchical structure. Hyperbolic geometry offers an exciting alternative, as it enables embeddings with much smaller distortion. However, extending GCNs to hyperbolic geometry presents several unique challenges because it is not clear how to define neural network operations, such as feature transformation and aggregation, in hyperbolic space. Furthermore, since input features are often Euclidean, it is unclear how to transform the features into hyperbolic embeddings with the right amount of curvature. Here we propose Hyperbolic Graph Convolutional Neural Network (HGCN), the first inductive hyperbolic GCN that leverages both the expressiveness of GCNs and hyperbolic geometry to learn inductive node representations for hierarchical and scale-free graphs. We derive GCN operations in the hyperboloid model of hyperbolic space and map Euclidean input features to embeddings in hyperbolic spaces with different trainable curvature at each layer. Experiments demonstrate that HGCN learns embeddings that preserve hierarchical structure, and leads to improved performance when compared to Euclidean analogs, even with very low dimensional embeddings: compared to state-of-the-art GCNs, HGCN achieves an error reduction of up to 63.1% in ROC AUC for link prediction and of up to 47.5% in F1 score for node classification, also improving state-of-the-art on the Pubmed dataset.
View details for PubMedID 32256024
-
GNNExplainer: Generating Explanations for Graph Neural Networks.
Advances in neural information processing systems
2019; 32: 9240–51
Abstract
Graph Neural Networks (GNNs) are a powerful tool for machine learning on graphs. GNNs combine node feature information with the graph structure by recursively passing neural messages along edges of the input graph. However, incorporating both graph structure and feature information leads to complex models and explaining predictions made by GNNs remains unsolved. Here we propose GnnExplainer, the first general, model-agnostic approach for providing interpretable explanations for predictions of any GNN-based model on any graph-based machine learning task. Given an instance, GnnExplainer identifies a compact subgraph structure and a small subset of node features that have a crucial role in GNN's prediction. Further, GnnExplainer can generate consistent and concise explanations for an entire class of instances. We formulate GnnExplainer as an optimization task that maximizes the mutual information between a GNN's prediction and distribution of possible subgraph structures. Experiments on synthetic and real-world graphs show that our approach can identify important graph structures as well as node features, and outperforms alternative baseline approaches by up to 43.0% in explanation accuracy. GnnExplainer provides a variety of benefits, from the ability to visualize semantically relevant structures to interpretability, to giving insights into errors of faulty GNNs.
View details for PubMedID 32265580
-
G2SAT: Learning to Generate SAT Formulas.
Advances in neural information processing systems
2019; 32: 10552–63
Abstract
The Boolean Satisfiability (SAT) problem is the canonical NP-complete problem and is fundamental to computer science, with a wide array of applications in planning, verification, and theorem proving. Developing and evaluating practical SAT solvers relies on extensive empirical testing on a set of real-world benchmark formulas. However, the availability of such real-world SAT formulas is limited. While these benchmark formulas can be augmented with synthetically generated ones, existing approaches for doing so are heavily hand-crafted and fail to simultaneously capture a wide range of characteristics exhibited by real-world SAT instances. In this work, we present G2SAT, the first deep generative framework that learns to generate SAT formulas from a given set of input formulas. Our key insight is that SAT formulas can be transformed into latent bipartite graph representations which we model using a specialized deep generative neural network. We show that G2SAT can generate SAT formulas that closely resemble given real-world SAT instances, as measured by both graph metrics and SAT solver behavior. Further, we show that our synthetic SAT formulas could be used to improve SAT solver performance on real-world benchmarks, which opens up new opportunities for the continued development of SAT solvers and a deeper understanding of their performance.
View details for PubMedID 32265581
-
Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks.
KDD : proceedings. International Conference on Knowledge Discovery & Data Mining
2019; 2019: 1269–78
Abstract
Modeling sequential interactions between users and items/products is crucial in domains such as e-commerce, social networking, and education. Representation learning presents an attractive opportunity to model the dynamic evolution of users and items, where each user/item can be embedded in a Euclidean space and its evolution can be modeled by an embedding trajectory in this space. However, existing dynamic embedding methods generate embeddings only when users take actions and do not explicitly model the future trajectory of the user/item in the embedding space. Here we propose JODIE, a coupled recurrent neural network model that learns the embedding trajectories of users and items. JODIE employs two recurrent neural networks to update the embedding of a user and an item at every interaction. Crucially, JODIE also models the future embedding trajectory of a user/item. To this end, it introduces a novel projection operator that learns to estimate the embedding of the user at any time in the future. These estimated embeddings are then used to predict future user-item interactions. To make the method scalable, we develop a t-Batch algorithm that creates time-consistent batches and leads to 9× faster training. We conduct six experiments to validate JODIE on two prediction tasks, future interaction prediction and state change prediction, using four real-world datasets. We show that JODIE outperforms six state-of-the-art algorithms in these tasks by at least 20% in predicting future interactions and 12% in state change prediction.
View details for DOI 10.1145/3292500.3330895
View details for PubMedID 31538030
View details for PubMedCentralID PMC6752886
-
The Local Closure Coefficient: A New Perspective On Network Clustering
ASSOC COMPUTING MACHINERY. 2019: 303–11
View details for DOI 10.1145/3289600.3290991
View details for Web of Science ID 000482120400039
-
Hyperbolic Graph Convolutional Neural Networks
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2019
View details for Web of Science ID 000534424304083
-
Predicting pregnancy using large-scale data from a women's health tracking mobile application
ASSOC COMPUTING MACHINERY. 2019: 2999–3005
Abstract
Predicting pregnancy has been a fundamental problem in women's health for more than 50 years. Previous datasets have been collected via carefully curated medical studies, but the recent growth of women's health tracking mobile apps offers potential for reaching a much broader population. However, the feasibility of predicting pregnancy from mobile health tracking data is unclear. Here we develop four models, a logistic regression model and 3 LSTM models, to predict a woman's probability of becoming pregnant using data from a women's health tracking app, Clue by BioWink GmbH. Evaluating our models on a dataset of 79 million logs from 65,276 women with ground truth pregnancy test data, we show that our predicted pregnancy probabilities meaningfully stratify women: women in the top 10% of predicted probabilities have an 89% chance of becoming pregnant over 6 menstrual cycles, as compared to a 27% chance for women in the bottom 10%. We develop a technique for extracting interpretable time trends from our deep learning models, and show these trends are consistent with previous fertility research. Our findings illustrate the potential that women's health tracking data offers for predicting pregnancy on a broader population; we conclude by discussing the steps needed to fulfill this potential.
View details for DOI 10.1145/3308558.3313512
View details for Web of Science ID 000483508403011
View details for PubMedID 31538145
View details for PubMedCentralID PMC6752881
-
Goal-setting And Achievement In Activity Tracking Apps: A Case Study Of MyFitnessPal
ASSOC COMPUTING MACHINERY. 2019: 571–82
View details for DOI 10.1145/3308558.3313432
View details for Web of Science ID 000483508400055
-
Hierarchical Temporal Convolutional Networks for Dynamic Recommender Systems
ASSOC COMPUTING MACHINERY. 2019: 2236–46
View details for DOI 10.1145/3308558.3313747
View details for Web of Science ID 000483508402028
-
Knowledge-aware Graph Neural Networks with Label Smoothness Regularization for Recommender Systems
ASSOC COMPUTING MACHINERY. 2019: 968–77
View details for DOI 10.1145/3292500.3330836
View details for Web of Science ID 000485562501002
-
GNNExplainer: Generating Explanations for Graph Neural Networks
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2019
View details for Web of Science ID 000535866900079
-
G2SAT: Learning to Generate SAT Formulas
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2019
View details for Web of Science ID 000535866902021
-
Complete the Look: Scene-based Complementary Product Recommendation
IEEE. 2019: 10524–33
View details for DOI 10.1109/CVPR.2019.01078
View details for Web of Science ID 000542649304015
-
Inferring Multidimensional Rates of Aging from Cross-Sectional Data
MICROTOME PUBLISHING. 2019: 97–107
View details for Web of Science ID 000509687900011
-
Network enhancement as a general method to denoise weighted biological networks.
Nature communications
2018; 9 (1): 3108
Abstract
Networks are ubiquitous in biology where they encode connectivity patterns at all scales of organization, from molecular to the biome. However, biological networks are noisy due to the limitations of measurement technology and inherent natural variation, which can hamper discovery of network patterns and dynamics. We propose Network Enhancement (NE), a method for improving the signal-to-noise ratio of undirected, weighted networks. NE uses a doubly stochastic matrix operator that induces sparsity and provides a closed-form solution that increases spectral eigengap of the input network. As a result, NE removes weak edges, enhances real connections, and leads to better downstream performance. Experiments show that NE improves gene-function prediction by denoising tissue-specific interaction networks, alleviates interpretation of noisy Hi-C contact maps from the human genome, and boosts fine-grained identification accuracy of species. Our results indicate that NE is widely applicable for denoising biological networks.
View details for PubMedID 30082777
-
Modeling polypharmacy side effects with graph convolutional networks.
Bioinformatics (Oxford, England)
2018; 34 (13): i457–i466
Abstract
Motivation: The use of drug combinations, termed polypharmacy, is common to treat patients with complex diseases or co-existing conditions. However, a major consequence of polypharmacy is a much higher risk of adverse side effects for the patient. Polypharmacy side effects emerge because of drug-drug interactions, in which activity of one drug may change, favorably or unfavorably, if taken with another drug. The knowledge of drug interactions is often limited because these complex relationships are rare, and are usually not observed in relatively small clinical testing. Discovering polypharmacy side effects thus remains an important challenge with significant implications for patient mortality and morbidity. Results: Here, we present Decagon, an approach for modeling polypharmacy side effects. The approach constructs a multimodal graph of protein-protein interactions, drug-protein target interactions and the polypharmacy side effects, which are represented as drug-drug interactions, where each side effect is an edge of a different type. Decagon is developed specifically to handle such multimodal graphs with a large number of edge types. Our approach develops a new graph convolutional neural network for multirelational link prediction in multimodal networks. Unlike approaches limited to predicting simple drug-drug interaction values, Decagon can predict the exact side effect, if any, through which a given drug combination manifests clinically. Decagon accurately predicts polypharmacy side effects, outperforming baselines by up to 69%. We find that it automatically learns representations of side effects indicative of co-occurrence of polypharmacy in patients. Furthermore, Decagon models particularly well polypharmacy side effects that have a strong molecular basis, while on predominantly non-molecular side effects, it achieves good performance because of effective sharing of model parameters across edge types. Decagon opens up opportunities to use large pharmacogenomic and patient population data to flag and prioritize polypharmacy side effects for follow-up analysis via formal pharmacological studies. Availability and implementation: Source code and preprocessed datasets are at: http://snap.stanford.edu/decagon.
View details for PubMedID 29949996
-
Prioritizing network communities.
Nature communications
2018; 9 (1): 2544
Abstract
Uncovering modular structure in networks is fundamental for systems in biology, physics, and engineering. Community detection identifies candidate modules as hypotheses, which then need to be validated through experiments, such as mutagenesis in a biological laboratory. Only a few communities can typically be validated, and it is thus important to prioritize which communities to select for downstream experimentation. Here we develop CRANK, a mathematically principled approach for prioritizing network communities. CRANK efficiently evaluates robustness and magnitude of structural features of each community and then combines these features into the community prioritization. CRANK can be used with any community detection method. It needs only information provided by the network structure and does not require any additional metadata or labels. However, when available, CRANK can incorporate domain-specific information to further boost performance. Experiments on many large networks show that CRANK effectively prioritizes communities, yielding a nearly 50-fold improvement in community prioritization.
View details for PubMedID 29959323
-
Higher-order clustering in networks
PHYSICAL REVIEW E
2018; 97 (5)
View details for DOI 10.1103/PhysRevE.97.052306
View details for Web of Science ID 000433029300004
-
Higher-order clustering in networks.
Physical review. E
2018; 97 (5-1): 052306
Abstract
A fundamental property of complex networks is the tendency for edges to cluster. The extent of the clustering is typically quantified by the clustering coefficient, which is the probability that a length-2 path is closed, i.e., induces a triangle in the network. However, higher-order cliques beyond triangles are crucial to understanding complex networks, and the clustering behavior with respect to such higher-order network structures is not well understood. Here we introduce higher-order clustering coefficients that measure the closure probability of higher-order network cliques and provide a more comprehensive view of how the edges of complex networks cluster. Our higher-order clustering coefficients are a natural generalization of the traditional clustering coefficient. We derive several properties about higher-order clustering coefficients and analyze them under common random graph models. Finally, we use higher-order clustering coefficients to gain new insights into the structure of real-world networks from several domains.
View details for DOI 10.1103/PhysRevE.97.052306
View details for PubMedID 29906904
-
Modeling Individual Cyclic Variation in Human Behavior.
Proceedings of the ... International World-Wide Web Conference. International WWW Conference
2018; 2018: 107–16
Abstract
Cycles are fundamental to human health and behavior. Examples include mood cycles, circadian rhythms, and the menstrual cycle. However, modeling cycles in time series data is challenging because in most cases the cycles are not labeled or directly observed and need to be inferred from multidimensional measurements taken over time. Here, we present Cyclic Hidden Markov Models (CyHMMs) for detecting and modeling cycles in a collection of multidimensional heterogeneous time series data. In contrast to previous cycle modeling methods, CyHMMs deal with a number of challenges encountered in modeling real-world cycles: they can model multivariate data with both discrete and continuous dimensions; they explicitly model and are robust to missing data; and they can share information across individuals to accommodate variation both within and between individual time series. Experiments on synthetic and real-world health-tracking data demonstrate that CyHMMs infer cycle lengths more accurately than existing methods, with 58% lower error on simulated data and 63% lower error on real-world data compared to the best-performing baseline. CyHMMs can also perform functions which baselines cannot: they can model the progression of individual features/symptoms over the course of the cycle, identify the most variable features, and cluster individual time series into groups with distinct characteristics. Applying CyHMMs to two real-world health-tracking datasets (of human menstrual cycle symptoms and physical activity tracking data) yields important insights including which symptoms to expect at each point during the cycle. We also find that people fall into several groups with distinct cycle patterns, and that these groups differ along dimensions not provided to the model. For example, by modeling missing data in the menstrual cycles dataset, we are able to discover a medically relevant group of birth control users even though information on birth control is not given to the model.
View details for PubMedID 29780976
-
Modeling Interdependent and Periodic Real-World Action Sequences.
Proceedings of the ... International World-Wide Web Conference. International WWW Conference
2018; 2018: 803–12
Abstract
Mobile health applications, including those that track activities such as exercise, sleep, and diet, are becoming widely used. Accurately predicting human actions in the real world is essential for targeted recommendations that could improve our health and for personalization of these applications. However, making such predictions is extremely difficult due to the complexities of human behavior, which consists of a large number of potential actions that vary over time, depend on each other, and are periodic. Previous work has not jointly modeled these dynamics and has largely focused on item consumption patterns instead of broader types of behaviors such as eating, commuting or exercising. In this work, we develop a novel statistical model, called TIPAS, for Time-varying, Interdependent, and Periodic Action Sequences. Our approach is based on personalized, multivariate temporal point processes that model time-varying action propensities through a mixture of Gaussian intensities. Our model captures short-term and long-term periodic interdependencies between actions through Hawkes process-based self-excitations. We evaluate our approach on two activity logging datasets comprising 12 million real-world actions (e.g., eating, sleep, and exercise) taken by 20 thousand users over 17 months. We demonstrate that our approach allows us to make successful predictions of future user actions and their timing. Specifically, TIPAS improves predictions of actions, and their timing, over existing methods across multiple datasets by up to 156%, and up to 37%, respectively. Performance improvements are particularly large for relatively rare and periodic actions such as walking and biking, improving over baselines by up to 256%. This demonstrates that explicit modeling of dependencies and periodicities in real-world behavior enables successful predictions of future actions, with implications for modeling human behavior, app personalization, and targeting of health interventions.
View details for PubMedID 29780977
-
I'll Be Back: On the Multiple Lives of Users of a Mobile Activity Tracking Application.
Proceedings of the ... International World-Wide Web Conference. International WWW Conference
2018; 2018: 1501–11
Abstract
Mobile health applications that track activities, such as exercise, sleep, and diet, are becoming widely used. While these activity tracking applications have the potential to improve our health, user engagement and retention are critical factors for their success. However, long-term user engagement patterns in real-world activity tracking applications are not yet well understood. Here we study user engagement patterns within a mobile physical activity tracking application consisting of 115 million logged activities taken by over a million users over 31 months. Specifically, we show that over 75% of users return and re-engage with the application after prolonged periods of inactivity, no matter the duration of the inactivity. We find a surprising result that the re-engagement usage patterns resemble those of the start of the initial engagement period, rather than being a simple continuation of the end of the initial engagement period. This evidence points to a conceptual model of multiple lives of user engagement, extending the prevalent single life view of user activity. We demonstrate that these multiple lives occur because the users have a variety of different primary intents or goals for using the app. These primary intents are associated with how long each life lasts and how likely the user is to re-engage for a new life. We find evidence for users being more likely to stop using the app once they achieved their primary intent or goal (e.g., weight loss). However, these users might return once their original intent resurfaces (e.g., wanting to lose newly gained weight). We discuss implications of the multiple life paradigm and propose a novel prediction task of predicting the number of lives of a user. Based on insights developed in this work, including a marker of improved primary intent performance, our prediction models achieve 71% ROC AUC. Overall, our research has implications for modeling user re-engagement in health activity tracking applications and has consequences for how notifications, recommendations as well as gamification can be used to increase engagement.
View details for PubMedID 29780978
-
HUMAN DECISIONS AND MACHINE PREDICTIONS
QUARTERLY JOURNAL OF ECONOMICS
2018; 133 (1): 237–93
Abstract
Can machine learning improve human decision making? Bail decisions provide a good test case. Millions of times each year, judges make jail-or-release decisions that hinge on a prediction of what a defendant would do if released. The concreteness of the prediction task combined with the volume of data available makes this a promising machine-learning application. Yet comparing the algorithm to judges proves complicated. First, the available data are generated by prior judge decisions. We only observe crime outcomes for released defendants, not for those judges detained. This makes it hard to evaluate counterfactual decision rules based on algorithmic predictions. Second, judges may have a broader set of preferences than the variable the algorithm predicts; for instance, judges may care specifically about violent crimes or about racial inequities. We deal with these problems using different econometric strategies, such as quasi-random assignment of cases to judges. Even accounting for these concerns, our results suggest potentially large welfare gains: one policy simulation shows crime reductions up to 24.7% with no change in jailing rates, or jailing rate reductions up to 41.9% with no increase in crime rates. Moreover, all categories of crime, including violent crimes, show reductions; and these gains can be achieved while simultaneously reducing racial disparities. These results suggest that while machine learning can be valuable, realizing this value requires integrating these tools into an economic framework: being clear about the link between predictions and decisions; specifying the scope of payoff functions; and constructing unbiased decision counterfactuals. JEL Codes: C10 (Econometric and statistical methods and methodology), C55 (Large datasets: Modeling and analysis), K40 (Legal procedure, the legal system, and illegal behavior).
View details for PubMedID 29755141
View details for PubMedCentralID PMC5947971
-
Accurate Influenza Monitoring and Forecasting Using Novel Internet Data Streams: A Case Study in the Boston Metropolis.
JMIR public health and surveillance
2018; 4 (1): e4
Abstract
BACKGROUND: Influenza outbreaks pose major challenges to public health around the world, leading to thousands of deaths a year in the United States alone. Accurate systems that track influenza activity at the city level are necessary to provide actionable information that can be used for clinical, hospital, and community outbreak preparation. OBJECTIVE: Although Internet-based real-time data sources such as Google searches and tweets have been successfully used to produce influenza activity estimates ahead of traditional health care-based systems at national and state levels, influenza tracking and forecasting at finer spatial resolutions, such as the city level, remain an open question. Our study aimed to present a precise, near real-time methodology capable of producing influenza estimates ahead of those collected and published by the Boston Public Health Commission (BPHC) for the Boston metropolitan area. This approach has great potential to be extended to other cities with access to similar data sources. METHODS: We first tested the ability of Google searches, Twitter posts, electronic health records, and a crowd-sourced influenza reporting system to detect influenza activity in the Boston metropolis separately. We then adapted a multivariate dynamic regression method named ARGO (autoregression with general online information), designed for tracking influenza at the national level, and showed that it effectively uses the above data sources to monitor and forecast influenza at the city level 1 week ahead of the current date. Finally, we presented an ensemble-based approach capable of combining information from models based on multiple data sources to more robustly nowcast as well as forecast influenza activity in the Boston metropolitan area. The performances of our models were evaluated in an out-of-sample fashion over 4 influenza seasons within 2012-2016, as well as a holdout validation period from 2016 to 2017. RESULTS: Our ensemble-based methods incorporating information from diverse models based on multiple data sources, including ARGO, produced the most robust and accurate results. The observed Pearson correlations between our out-of-sample flu activity estimates and those historically reported by the BPHC were 0.98 in nowcasting influenza and 0.94 in forecasting influenza 1 week ahead of the current date. CONCLUSIONS: We show that information from Internet-based data sources, when combined using an informed, robust methodology, can be effectively used as early indicators of influenza activity at fine geographic resolutions.
View details for PubMedID 29317382
-
Large-scale analysis of disease pathways in the human interactome.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2018; 23: 111–22
Abstract
Discovering disease pathways, which can be defined as sets of proteins associated with a given disease, is an important problem that has the potential to provide clinically actionable insights for disease diagnosis, prognosis, and treatment. Computational methods aid the discovery by relying on protein-protein interaction (PPI) networks. They start with a few known disease-associated proteins and aim to find the rest of the pathway by exploring the PPI network around the known disease proteins. However, the success of such methods has been limited, and failure cases have not been well understood. Here we study the PPI network structure of 519 disease pathways. We find that 90% of pathways do not correspond to single well-connected components in the PPI network. Instead, proteins associated with a single disease tend to form many separate connected components/regions in the network. We then evaluate state-of-the-art disease pathway discovery methods and show that their performance is especially poor on diseases with disconnected pathways. Thus, we conclude that network connectivity structure alone may not be sufficient for disease pathway discovery. However, we show that higher-order network structures, such as small subgraphs of the pathway, provide a promising direction for the development of new methods.
View details for PubMedID 29218874
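The connectivity check described in the abstract above can be illustrated with a minimal sketch: restrict a PPI network to the proteins of one pathway and count the connected components that remain. The function name `pathway_components`, the edge-list input format, and the toy protein labels are assumptions made for illustration, not the authors' code or data.

```python
from collections import deque


def pathway_components(ppi_edges, pathway_proteins):
    """Connected components of the PPI subgraph induced by one pathway.

    ppi_edges: iterable of (protein_u, protein_v) pairs (hypothetical format).
    pathway_proteins: proteins associated with a single disease.
    """
    pathway = set(pathway_proteins)
    # Keep only interactions whose two endpoints both belong to the pathway.
    adj = {p: set() for p in pathway}
    for u, v in ppi_edges:
        if u in pathway and v in pathway and u != v:
            adj[u].add(v)
            adj[v].add(u)

    seen, components = set(), []
    for start in sorted(pathway):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Toy network: C and X are not pathway proteins, so they cannot bridge the pathway.
edges = [("A", "B"), ("B", "C"), ("C", "X"), ("X", "D"), ("D", "E")]
pathway = ["A", "B", "D", "E", "F"]
components = pathway_components(edges, pathway)
largest_fraction = max(len(c) for c in components) / len(pathway)
```

In this toy case the pathway splits into three pieces, which is the kind of fragmentation the abstract reports for most disease pathways.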
-
Community Interaction and Conflict on the Web
ASSOC COMPUTING MACHINERY. 2018: 933–43
View details for DOI 10.1145/3178876.3186141
View details for Web of Science ID 000460379000092
-
Embedding Logical Queries on Knowledge Graphs
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2018
View details for Web of Science ID 000461823302007
-
Data-Driven Model Predictive Control of Autonomous Mobility-on-Demand Systems
IEEE COMPUTER SOC. 2018: 6019–25
View details for Web of Science ID 000446394504080
-
Drive2Vec: Multiscale State-Space Embedding of Vehicular Sensor Data
IEEE. 2018: 3233–38
View details for Web of Science ID 000457881303034
-
Learning Structural Node Embeddings via Diffusion Wavelets
ASSOC COMPUTING MACHINERY. 2018: 1320–29
View details for DOI 10.1145/3219819.3220025
View details for Web of Science ID 000455346400137
-
Graph Convolutional Neural Networks for Web-Scale Recommender Systems
ASSOC COMPUTING MACHINERY. 2018: 974–83
View details for DOI 10.1145/3219819.3219890
View details for Web of Science ID 000455346400101
-
Hierarchical Graph Representation Learning with Differentiable Pooling
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2018
View details for Web of Science ID 000461823304078
-
Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2018
View details for Web of Science ID 000461852000087
-
MIS2: Misinformation and Misbehavior Mining on the Web
ASSOC COMPUTING MACHINERY. 2018: 799–800
View details for DOI 10.1145/3159652.3160597
View details for Web of Science ID 000456363600112
-
Dynamic Network Model from Partial Observations
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2018
View details for Web of Science ID 000461852004042
-
Large-scale analysis of disease pathways in the human interactome
WORLD SCIENTIFIC PUBL CO PTE LTD. 2018: 111–22
View details for Web of Science ID 000461831500011
-
The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables.
KDD : proceedings. International Conference on Knowledge Discovery & Data Mining
2017; 2017: 275–84
Abstract
Evaluating whether machines improve on human performance is one of the central questions of machine learning. However, there are many domains where the data is selectively labeled in the sense that the observed outcomes are themselves a consequence of the existing choices of the human decision-makers. For instance, in the context of judicial bail decisions, we observe the outcome of whether a defendant fails to return for their court appearance only if the human judge decides to release the defendant on bail. This selective labeling makes it harder to evaluate predictive models as the instances for which outcomes are observed do not represent a random sample of the population. Here we propose a novel framework for evaluating the performance of predictive models on selectively labeled data. We develop an approach called contraction which allows us to compare the performance of predictive models and human decision-makers without resorting to counterfactual inference. Our methodology harnesses the heterogeneity of human decision-makers and facilitates effective evaluation of predictive models even in the presence of unmeasured confounders (unobservables) which influence both human decisions and the resulting outcomes. Experimental results on real world datasets spanning diverse domains such as health care, insurance, and criminal justice demonstrate the utility of our evaluation metric in comparing human decisions and machine predictions.
View details for PubMedID 29780658
-
Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data.
KDD : proceedings. International Conference on Knowledge Discovery & Data Mining
2017; 2017: 215–23
Abstract
Subsequence clustering of multivariate time series is a useful tool for discovering repeated patterns in temporal data. Once these patterns have been discovered, seemingly complicated datasets can be interpreted as a temporal sequence of only a small number of states, or clusters. For example, raw sensor data from a fitness-tracking application can be expressed as a timeline of a select few actions (i.e., walking, sitting, running). However, discovering these patterns is challenging because it requires simultaneous segmentation and clustering of the time series. Furthermore, interpreting the resulting clusters is difficult, especially when the data is high-dimensional. Here we propose a new method of model-based clustering, which we call Toeplitz Inverse Covariance-based Clustering (TICC). Each cluster in the TICC method is defined by a correlation network, or Markov random field (MRF), characterizing the interdependencies between different observations in a typical subsequence of that cluster. Based on this graphical representation, TICC simultaneously segments and clusters the time series data. We solve the TICC problem through alternating minimization, using a variation of the expectation maximization (EM) algorithm. We derive closed-form solutions to efficiently solve the two resulting subproblems in a scalable way, through dynamic programming and the alternating direction method of multipliers (ADMM), respectively. We validate our approach by comparing TICC to several state-of-the-art baselines in a series of synthetic experiments, and we then demonstrate on an automobile sensor dataset how TICC can be used to learn interpretable clusters in real-world scenarios.
View details for PubMedID 29770257
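The abstract above mentions dynamic programming for one of the two TICC subproblems. A minimal sketch of such an assignment step follows, assuming that each time step already has a per-cluster cost (for example a negative log-likelihood) and that every change of cluster between consecutive time steps pays a fixed penalty. The names `assign_clusters`, `cost`, and `beta` are illustrative assumptions and this is not the authors' implementation.

```python
def assign_clusters(cost, beta):
    """Minimum-cost cluster label per time step via dynamic programming.

    cost[t][k]: assumed cost of assigning time step t to cluster k.
    beta: assumed non-negative penalty paid each time the label changes.
    Returns (path, total_cost).
    """
    num_steps, num_clusters = len(cost), len(cost[0])
    best = list(cost[0])   # best[k]: cheapest labeling of steps 0..t ending in k
    back = []              # back[t-1][k]: label at step t-1 on that cheapest path

    for t in range(1, num_steps):
        # The cheapest previous label is the only switch source worth considering.
        cheapest_prev = min(range(num_clusters), key=lambda k: best[k])
        new_best, pointers = [], []
        for k in range(num_clusters):
            stay = best[k]
            switch = best[cheapest_prev] + beta
            if stay <= switch:
                new_best.append(stay + cost[t][k])
                pointers.append(k)
            else:
                new_best.append(switch + cost[t][k])
                pointers.append(cheapest_prev)
        best = new_best
        back.append(pointers)

    # Trace the optimal labeling backwards from the cheapest final label.
    label = min(range(num_clusters), key=lambda k: best[k])
    total_cost = best[label]
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    path.reverse()
    return path, total_cost


# Toy costs: step 2 slightly prefers cluster 1, but not enough to pay two switches.
toy_cost = [[0, 1], [0, 1], [1, 0.9], [0, 1], [1, 0], [1, 0], [1, 0]]
path, total_cost = assign_clusters(toy_cost, beta=0.5)
```

With the switching penalty in place, the brief preference for cluster 1 at step 2 is smoothed away and the labeling changes only once, which is how a penalty of this kind encourages contiguous segments.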
-
Network Inference via the Time-Varying Graphical Lasso.
KDD : proceedings. International Conference on Knowledge Discovery & Data Mining
2017; 2017: 205–13
Abstract
Many important problems can be modeled as a system of interconnected entities, where each entity is recording time-dependent observations or measurements. In order to spot trends, detect anomalies, and interpret the temporal dynamics of such data, it is essential to understand the relationships between the different entities and how these relationships evolve over time. In this paper, we introduce the time-varying graphical lasso (TVGL), a method of inferring time-varying networks from raw time series data. We cast the problem in terms of estimating a sparse time-varying inverse covariance matrix, which reveals a dynamic network of interdependencies between the entities. Since dynamic network inference is a computationally expensive task, we derive a scalable message-passing algorithm based on the Alternating Direction Method of Multipliers (ADMM) to solve this problem in an efficient way. We also discuss several extensions, including a streaming algorithm to update the model and incorporate new observations in real time. Finally, we evaluate our TVGL algorithm on both real and synthetic datasets, obtaining interpretable results and outperforming state-of-the-art baselines in terms of both accuracy and scalability.
View details for PubMedID 29770256
-
Local Higher-Order Graph Clustering.
KDD : proceedings. International Conference on Knowledge Discovery & Data Mining
2017; 2017: 555–64
Abstract
Local graph clustering methods aim to find a cluster of nodes by exploring a small region of the graph. These methods are attractive because they enable targeted clustering around a given seed node and are faster than traditional global graph clustering methods because their runtime does not depend on the size of the input graph. However, current local graph partitioning methods are not designed to account for the higher-order structures crucial to the network, nor can they effectively handle directed networks. Here we introduce a new class of local graph clustering methods that address these issues by incorporating higher-order network information captured by small subgraphs, also called network motifs. We develop the Motif-based Approximate Personalized PageRank (MAPPR) algorithm that finds clusters containing a seed node with minimal motif conductance, a generalization of the conductance metric for network motifs. We generalize existing theory to prove the fast running time (independent of the size of the graph) and obtain theoretical guarantees on the cluster quality (in terms of motif conductance). We also develop a theory of node neighborhoods for finding sets that have small motif conductance, and apply these results to the case of finding good seed nodes to use as input to the MAPPR algorithm. Experimental validation on community detection tasks in both synthetic and real-world networks shows that our new framework MAPPR outperforms the current edge-based personalized PageRank methodology.
View details for PubMedID 29770258
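Motif conductance, the quantity the abstract above says MAPPR minimizes, can be illustrated for the triangle motif on an undirected graph. The sketch below uses the usual definition: the number of motif instances split by the cluster boundary, divided by the smaller of the two motif volumes, where a motif volume counts motif endpoints on one side. It only evaluates the metric for a given node set by brute-force enumeration; it is not the MAPPR algorithm, and the function name and toy graph are assumptions.

```python
from itertools import combinations


def triangle_motif_conductance(edges, cluster):
    """Triangle-motif conductance of `cluster` in an undirected graph."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    inside_set = set(cluster)
    cut = 0          # triangles with nodes on both sides of the boundary
    vol_inside = 0   # triangle endpoints that fall inside the cluster
    vol_outside = 0  # triangle endpoints that fall outside the cluster

    # Brute-force enumeration of node triples; fine for small examples only.
    for a, b, c in combinations(sorted(adj), 3):
        if b in adj[a] and c in adj[a] and c in adj[b]:
            inside = sum(1 for node in (a, b, c) if node in inside_set)
            vol_inside += inside
            vol_outside += 3 - inside
            if 0 < inside < 3:
                cut += 1

    denominator = min(vol_inside, vol_outside)
    return cut / denominator if denominator else float("inf")


# Toy graph: a 4-clique {1,2,3,4} and a triangle {5,6,7}, bridged by triangle {3,4,5}.
toy_edges = [
    (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4),
    (3, 5), (4, 5),
    (5, 6), (5, 7), (6, 7),
]
phi = triangle_motif_conductance(toy_edges, cluster={5, 6, 7})
```

Here exactly one of the six triangles crosses the boundary and the cluster holds four triangle endpoints, so the value is 1/4.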
-
Large-scale physical activity data reveal worldwide activity inequality
NATURE
2017; 547 (7663): 336-+
Abstract
To be able to curb the global pandemic of physical inactivity and the associated 5.3 million deaths per year, we need to understand the basic principles that govern physical activity. However, there is a lack of large-scale measurements of physical activity patterns across free-living populations worldwide. Here we leverage the wide usage of smartphones with built-in accelerometry to measure physical activity at the global scale. We study a dataset consisting of 68 million days of physical activity for 717,527 people, giving us a window into activity in 111 countries across the globe. We find inequality in how activity is distributed within countries and that this inequality is a better predictor of obesity prevalence in the population than average activity volume. Reduced activity in females contributes to a large portion of the observed activity inequality. Aspects of the built environment, such as the walkability of a city, are associated with a smaller gender gap in activity and lower activity inequality. In more walkable cities, activity is greater throughout the day and throughout the week, across age, gender, and body mass index (BMI) groups, with the greatest increases in activity found for females. Our findings have implications for global public health policy and urban planning and highlight the role of activity inequality and the built environment in improving physical activity and health.
View details for PubMedID 28693034
View details for PubMedCentralID PMC5774986
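The abstract above does not say how activity inequality is quantified. One standard way to measure how unevenly a quantity is distributed within a population is the Gini coefficient, and the sketch below assumes that choice. The function name and the step counts are made-up illustrations, not data from the study.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum form of the Gini coefficient on sorted data.
    weighted = sum((rank + 1) * x for rank, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Two hypothetical populations with the same average of 5,000 daily steps.
equal_population = [5000, 5000, 5000, 5000]
unequal_population = [0, 0, 0, 20000]
gini_equal = gini(equal_population)
gini_unequal = gini(unequal_population)
```

The two toy populations have identical average activity but very different inequality, which is the distinction the abstract draws between average activity volume and activity inequality.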
-
Predicting multicellular function through multi-layer tissue networks
BIOINFORMATICS
2017; 33 (14): I190–I198
Abstract
Understanding functions of proteins in specific human tissues is essential for insights into disease diagnostics and therapeutics, yet prediction of tissue-specific cellular function remains a critical challenge for biomedicine. Here, we present OhmNet, a hierarchy-aware unsupervised node feature learning approach for multi-layer networks. We build a multi-layer network, where each layer represents molecular interactions in a different human tissue. OhmNet then automatically learns a mapping of proteins, represented as nodes, to a neural embedding-based low-dimensional space of features. OhmNet encourages sharing of similar features among proteins with similar network neighborhoods and among proteins activated in similar tissues. The algorithm generalizes prior work, which generally ignores relationships between tissues, by modeling tissue organization with a rich multiscale tissue hierarchy. We use OhmNet to study multicellular function in a multi-layer protein interaction network of 107 human tissues. In 48 tissues with known tissue-specific cellular functions, OhmNet provides more accurate predictions of cellular function than alternative approaches, and also generates more accurate hypotheses about tissue-specific protein actions. We show that taking into account the tissue hierarchy leads to improved predictive power. Remarkably, we also demonstrate that it is possible to leverage the tissue hierarchy in order to effectively transfer cellular functions to a functionally uncharacterized tissue. Overall, OhmNet moves from flat networks to multiscale models able to predict a range of phenotypes spanning cellular subsystems. Source code and datasets are available at http://snap.stanford.edu/ohmnet. Contact: jure@cs.stanford.edu.
View details for PubMedID 28881986
View details for PubMedCentralID PMC5870717
-
Network analysis: a novel method for mapping neonatal acute transport patterns in California.
Journal of perinatology
2017; 37 (6): 702-708
Abstract
The objectives of this study are to use network analysis to describe the pattern of neonatal transfers in California, to compare empirical sub-networks with established referral regions and to determine factors associated with transport outside the originating sub-network. This cross-sectional database study included 6546 infants <28 days old transported within California in 2012. After generating a graph representing acute transfers between hospitals (n=6696), we used community detection techniques to identify more tightly connected sub-networks. These empirically derived sub-networks were compared with state-defined regional referral networks. Reasons for transfer between empirical sub-networks were assessed using logistic regression. Empirical sub-networks showed significant overlap with regulatory regions (P<0.001). Transfer outside the empirical sub-network was associated with major congenital anomalies (P<0.001), need for surgery (P=0.01) and insurance as the reason for transfer (P<0.001). Network analysis accurately reflected empirical neonatal transfer patterns, potentially facilitating quantitative, rather than qualitative, analysis of regionalized health care delivery systems. Journal of Perinatology advance online publication, 23 March 2017; doi:10.1038/jp.2017.20.
View details for DOI 10.1038/jp.2017.20
View details for PubMedID 28333155
-
Loyalty in Online Communities.
Proceedings of the ... International AAAI Conference on Weblogs and Social Media. International AAAI Conference on Weblogs and Social Media
2017; 2017: 540–43
Abstract
Loyalty is an essential component of multi-community engagement. When users have the choice to engage with a variety of different communities, they often become loyal to just one, focusing on that community at the expense of others. However, it is unclear how loyalty is manifested in user behavior, or whether certain community characteristics encourage loyalty. In this paper we operationalize loyalty as a user-community relation: users loyal to a community consistently prefer it over all others; loyal communities retain their loyal users over time. By exploring a large set of Reddit communities, we reveal that loyalty is manifested in remarkably consistent behaviors. Loyal users employ language that signals collective identity and engage with more esoteric, less popular content, indicating that they may play a curational role in surfacing new material. Loyal communities have denser user-user interaction networks and lower rates of triadic closure, suggesting that community-level loyalty is associated with more cohesive interactions and less fragmentation into subgroups. We exploit these general patterns to predict future rates of loyalty. Our results show that a user's propensity to become loyal is apparent from their initial interactions with a community, suggesting that some users are intrinsically loyal from the very beginning.
View details for PubMedID 29354326
-
Community Identity and User Engagement in a Multi-Community Landscape.
Proceedings of the ... International AAAI Conference on Weblogs and Social Media. International AAAI Conference on Weblogs and Social Media
2017; 2017: 377–86
Abstract
A community's identity defines and shapes its internal dynamics. Our current understanding of this interplay is mostly limited to glimpses gathered from isolated studies of individual communities. In this work we provide a systematic exploration of the nature of this relation across a wide variety of online communities. To this end we introduce a quantitative, language-based typology reflecting two key aspects of a community's identity: how distinctive, and how temporally dynamic it is. By mapping almost 300 Reddit communities into the landscape induced by this typology, we reveal regularities in how patterns of user engagement vary with the characteristics of a community. Our results suggest that the way new and existing users engage with a community depends strongly and systematically on the nature of the collective identity it fosters, in ways that are highly consequential to community maintainers. For example, communities with distinctive and highly dynamic identities are more likely to retain their users. However, such niche communities also exhibit much larger acculturation gaps between existing users and newcomers, which potentially hinder the integration of the latter. More generally, our methodology reveals differences in how various social phenomena manifest across communities, and shows that structuring the multi-community landscape can lead to a better understanding of the systematic nature of this diversity.
View details for PubMedID 29354325
-
Anyone Can Become a Troll
AMERICAN SCIENTIST
2017; 105 (3): 152–55
View details for DOI 10.1511/2017.126.152
View details for Web of Science ID 000399270400012
-
How Gamification Affects Physical Activity: Large-scale Analysis of Walking Challenges in a Mobile Application.
Proceedings of the ... International World-Wide Web Conference. International WWW Conference
2017; 2017: 455–63
Abstract
Gamification represents an effective way to incentivize user behavior across a number of computing applications. However, despite the fact that physical activity is essential for a healthy lifestyle, surprisingly little is known about how gamification and in particular competitions shape human physical activity. Here we study how competitions affect physical activity. We focus on walking challenges in a mobile activity tracking application where multiple users compete over who takes the most steps over a predefined number of days. We synthesize our findings in a series of game and app design implications. In particular, we analyze nearly 2,500 physical activity competitions over a period of one year capturing more than 800,000 person days of activity tracking. We observe that during walking competitions, the average user increases physical activity by 23%. Furthermore, there are large increases in activity for both men and women across all ages, and weight status, and even for users that were previously fairly inactive. We also find that the composition of participants greatly affects the dynamics of the game. In particular, if highly unequal participants get matched to each other, then competition suffers and the overall effect on the physical activity drops significantly. Furthermore, competitions with an equal mix of both men and women are more effective in increasing the level of activities. We leverage these insights to develop a statistical model to predict whether or not a competition will be particularly engaging with significant accuracy. Our models can serve as a guideline to help design more engaging competitions that lead to most beneficial behavioral changes.
View details for PubMedID 28990011
-
Online Actions with Offline Impact: How Online Social Networks Influence Online and Offline User Behavior.
Proceedings of the ... International Conference on Web Search & Data Mining. International Conference on Web Search & Data Mining
2017; 2017: 537-546
Abstract
Many of today's most widely used computing applications utilize social networking features and allow users to connect, follow each other, share content, and comment on others' posts. However, despite the widespread adoption of these features, there is little understanding of the consequences that social networking has on user retention, engagement, and online as well as offline behavior. Here, we study how social networks influence user behavior in a physical activity tracking application. We analyze 791 million online and offline actions of 6 million users over the course of 5 years, and show that social networking leads to a significant increase in users' online as well as offline activities. Specifically, we establish a causal effect of how social networks influence user behavior. We show that the creation of new social connections increases user online in-application activity by 30%, user retention by 17%, and user offline real-world physical activity by 7% (about 400 steps per day). By exploiting a natural experiment we distinguish the effect of social influence of new social connections from the simultaneous increase in user's motivation to use the app and take more steps. We show that social influence accounts for 55% of the observed changes in user behavior, while the remaining 45% can be explained by the user's increased motivation to use the app. Further, we show that subsequent, individual edge formations in the social network lead to significant increases in daily steps. These effects diminish with each additional edge and vary based on edge attributes and user demographics. Finally, we utilize these insights to develop a model that accurately predicts which users will be most influenced by the creation of new social network connections.
View details for DOI 10.1145/3018661.3018672
View details for PubMedID 28345078
-
SnapVX: A Network-Based Convex Optimization Solver
JOURNAL OF MACHINE LEARNING RESEARCH
2017; 18
View details for Web of Science ID 000397018200001
-
Inductive Representation Learning on Large Graphs
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2017
View details for Web of Science ID 000452649401007
-
Learning the Network Structure of Heterogeneous Data via Pairwise Exponential Markov Random Fields.
Proceedings of machine learning research
2017; 54: 1302–10
Abstract
Markov random fields (MRFs) are a useful tool for modeling relationships present in large and high-dimensional data. Often, this data comes from various sources and can have diverse distributions, for example a combination of numerical, binary, and categorical variables. Here, we define the pairwise exponential Markov random field (PE-MRF), an approach capable of modeling exponential family distributions in heterogeneous domains. We develop a scalable method of learning the graphical structure across the variables by solving a regularized approximated maximum likelihood problem. Specifically, we first derive a tractable upper bound on the log-partition function. We then use this upper bound to derive the group graphical lasso, a generalization of the classic graphical lasso problem to heterogeneous domains. To solve this problem, we develop a fast algorithm based on the alternating direction method of multipliers (ADMM). We also prove that our estimator is sparsistent, with guaranteed recovery of the true underlying graphical structure, and that it has a polynomially faster runtime than the current state-of-the-art method for learning such distributions. Experiments on synthetic and real-world examples demonstrate that our approach is both efficient and accurate at uncovering the structure of heterogeneous data.
View details for PubMedID 30931433
-
Anyone Can Become a Troll: Causes of Trolling Behavior in Online Discussions
ASSOC COMPUTING MACHINERY. 2017: 1217–30
Abstract
In online communities, antisocial behavior such as trolling disrupts constructive discussion. While prior work suggests that trolling behavior is confined to a vocal and antisocial minority, we demonstrate that ordinary people can engage in such behavior as well. We propose two primary trigger mechanisms: the individual's mood, and the surrounding context of a discussion (e.g., exposure to prior trolling behavior). Through an experiment simulating an online discussion, we find that both negative mood and seeing troll posts by others significantly increase the probability of a user trolling, and together double this probability. To support and extend these results, we study how these same mechanisms play out in the wild via a data-driven, longitudinal analysis of a large online news discussion community. This analysis reveals temporal mood effects, and explores long range patterns of repeated exposure to trolling. A predictive model of trolling behavior shows that mood and discussion context together can explain trolling behavior better than an individual's history of trolling. These results combine to suggest that ordinary people can, under the right circumstances, behave like trolls.
View details for PubMedID 29399664
-
Motifs in Temporal Networks
ASSOC COMPUTING MACHINERY. 2017: 601–10
View details for DOI 10.1145/3018661.3018731
View details for Web of Science ID 000455803400066
-
Modeling Affinity based Popularity Dynamics
ASSOC COMPUTING MACHINERY. 2017: 477–86
View details for DOI 10.1145/3132847.3132923
View details for Web of Science ID 000440845300048
-
SnapVX: A Network-Based Convex Optimization Solver.
Journal of machine learning research : JMLR
2017; 18 (1): 110–14
Abstract
SnapVX is a high-performance solver for convex optimization problems defined on networks. For problems of this form, SnapVX provides a fast and scalable solution with guaranteed global convergence. It combines the capabilities of two open source software packages: Snap.py and CVXPY. Snap.py is a large scale graph processing library, and CVXPY provides a general modeling framework for small-scale subproblems. SnapVX offers a customizable yet easy-to-use Python interface with "out-of-the-box" functionality. Based on the Alternating Direction Method of Multipliers (ADMM), it is able to efficiently store, analyze, parallelize, and solve large optimization problems from a variety of different applications. Documentation, examples, and more can be found on the SnapVX website at http://snap.stanford.edu/snapvx.
View details for PubMedID 29599649
-
Large-scale Graph Representation Learning
IEEE. 2017: 4
View details for Web of Science ID 000428073700004
-
Mining Big Data to Extract Patterns and Predict Real-Life Outcomes
PSYCHOLOGICAL METHODS
2016; 21 (4): 493-506
Abstract
This article aims to introduce the reader to essential tools that can be used to obtain insights and build predictive models using large data sets. Recent user proliferation in the digital environment has led to the emergence of large samples containing a wealth of traces of human behaviors, communication, and social interactions. Such samples offer the opportunity to greatly improve our understanding of individuals, groups, and societies, but their analysis presents unique methodological challenges. In this tutorial, we discuss potential sources of such data and explain how to efficiently store them. Then, we introduce two methods that are often employed to extract patterns and reduce the dimensionality of large data sets: singular value decomposition and latent Dirichlet allocation. Finally, we demonstrate how to use dimensions or clusters extracted from data to build predictive models in a cross-validated way. The text is accompanied by examples of R code and a sample data set, allowing the reader to practice the methods discussed here. A companion website (http://dataminingtutorial.com) provides additional learning resources.
View details for DOI 10.1037/met0000105
View details for Web of Science ID 000393202300004
View details for PubMedID 27918179
-
Cultural Shift or Linguistic Drift? Comparing Two Computational Measures of Semantic Change.
Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing
2016; 2016: 2116-2121
Abstract
Words shift in meaning for many reasons, including cultural factors like new technologies and regular linguistic processes like subjectification. Understanding the evolution of language and culture requires disentangling these underlying causes. Here we show how two different distributional measures can be used to detect two different types of semantic change. The first measure, which has been used in many previous works, analyzes global shifts in a word's distributional semantics; it is sensitive to changes due to regular processes of linguistic drift, such as the semantic generalization of promise ("I promise." → "It promised to be exciting."). The second measure, which we develop here, focuses on local changes to a word's nearest semantic neighbors; it is more sensitive to cultural shifts, such as the change in the meaning of cell ("prison cell" → "cell phone"). Comparing measurements made by these two methods allows researchers to determine whether changes are more cultural or linguistic in nature, a distinction that is essential for work in the digital humanities and historical linguistics.
View details for PubMedID 28580459
-
Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora.
Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing
2016; 2016: 595–605
Abstract
A word's sentiment depends on the domain in which it is used. Computational social science research thus requires sentiment lexicons that are specific to the domains being studied. We combine domain-specific word embeddings with a label propagation framework to induce accurate domain-specific sentiment lexicons using small sets of seed words. We show that our approach achieves state-of-the-art performance on inducing sentiment lexicons from domain-specific corpora and that our purely corpus-based approach outperforms methods that rely on hand-curated resources (e.g., WordNet). Using our framework, we induce and release historical sentiment lexicons for 150 years of English and community-specific sentiment lexicons for 250 online communities from the social media forum Reddit. The historical lexicons we induce show that more than 5% of sentiment-bearing (non-neutral) English words completely switched polarity during the last 150 years, and the community-specific lexicons highlight how sentiment varies drastically between different communities.
View details for PubMedID 28660257
-
SNAP: A General-Purpose Network Analysis and Graph-Mining Library
ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY
2016; 8 (1)
Abstract
Large networks are becoming a widely used abstraction for studying complex systems in a broad set of disciplines, ranging from social network analysis to molecular biology and neuroscience. Despite an increasing need to analyze and manipulate large networks, only a limited number of tools are available for this task. Here, we describe Stanford Network Analysis Platform (SNAP), a general-purpose, high-performance system that provides easy-to-use, high-level operations for analysis and manipulation of large networks. We present SNAP functionality, describe its implementational details, and give performance benchmarks. SNAP has been developed for single big-memory machines and it balances the trade-off between maximum performance, compact in-memory graph representation, and the ability to handle dynamic graphs where nodes and edges are being added or removed over time. SNAP can process massive networks with hundreds of millions of nodes and billions of edges. SNAP offers over 140 different graph algorithms that can efficiently manipulate large graphs, calculate structural properties, generate regular and random graphs, and handle attributes and meta-data on nodes and edges. Besides being able to handle large graphs, an additional strength of SNAP is that networks and their attributes are fully dynamic: they can be modified during the computation at low cost. SNAP is provided as an open source library in C++ as well as a module in Python. We also describe the Stanford Large Network Dataset, a set of social and information real-world networks and datasets, which we make publicly available. The collection is a complementary resource to our SNAP software and is widely used for development and benchmarking of graph analytics algorithms.
View details for DOI 10.1145/2898361
View details for Web of Science ID 000385621300001
View details for PubMedID 28344853
View details for PubMedCentralID PMC5361061
-
node2vec: Scalable Feature Learning for Networks.
KDD : proceedings. International Conference on Knowledge Discovery & Data Mining
2016; 2016: 855-864
Abstract
Prediction tasks over nodes and edges in networks require careful effort in engineering features used by learning algorithms. Recent research in the broader field of representation learning has led to significant progress in automating prediction by learning the features themselves. However, present feature learning approaches are not expressive enough to capture the diversity of connectivity patterns observed in networks. Here we propose node2vec, an algorithmic framework for learning continuous feature representations for nodes in networks. In node2vec, we learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes. We define a flexible notion of a node's network neighborhood and design a biased random walk procedure, which efficiently explores diverse neighborhoods. Our algorithm generalizes prior work which is based on rigid notions of network neighborhoods, and we argue that the added flexibility in exploring neighborhoods is the key to learning richer representations. We demonstrate the efficacy of node2vec over existing state-of-the-art techniques on multi-label classification and link prediction in several real-world networks from diverse domains. Taken together, our work represents a new way for efficiently learning state-of-the-art task-independent representations in complex networks.
View details for PubMedID 27853626
-
Interpretable Decision Sets: A Joint Framework for Description and Prediction.
KDD : proceedings. International Conference on Knowledge Discovery & Data Mining
2016; 2016: 1675-1684
Abstract
One of the most important obstacles to deploying predictive models is the fact that humans do not understand and trust them. Knowing which variables are important in a model's prediction and how they are combined can be very powerful in helping people understand and trust automatic decision making systems. Here we propose interpretable decision sets, a framework for building predictive models that are highly accurate, yet also highly interpretable. Decision sets are sets of independent if-then rules. Because each rule can be applied independently, decision sets are simple, concise, and easily interpretable. We formalize decision set learning through an objective function that simultaneously optimizes accuracy and interpretability of the rules. In particular, our approach learns short, accurate, and non-overlapping rules that cover the whole feature space and pay attention to small but important classes. Moreover, we prove that our objective is a non-monotone submodular function, which we efficiently optimize to find a near-optimal set of rules. Experiments show that interpretable decision sets are as accurate at classification as state-of-the-art machine learning techniques. They are also three times smaller on average than rule-based models learned by other methods. Finally, results of a user study show that people are able to answer multiple-choice questions about the decision boundaries of interpretable decision sets and write descriptions of classes based on them faster and more accurately than with other rule-based models that were designed for interpretability. Overall, our framework provides a new approach to interpretable machine learning that balances accuracy, interpretability, and computational efficiency.
View details for PubMedID 27853627
-
Higher-order organization of complex networks
SCIENCE
2016; 353 (6295): 163-166
Abstract
Networks are a fundamental tool for understanding and modeling complex systems in physics, biology, neuroscience, engineering, and social science. Many networks are known to exhibit rich, lower-order connectivity patterns that can be captured at the level of individual nodes and edges. However, higher-order organization of complex networks--at the level of small network subgraphs--remains largely unknown. Here, we develop a generalized framework for clustering networks on the basis of higher-order connectivity patterns. This framework provides mathematical guarantees on the optimality of obtained clusters and scales to networks with billions of edges. The framework reveals higher-order organization in a number of networks, including information propagation units in neuronal networks and hub structure in transportation networks. Results show that networks exhibit rich higher-order organizational structures that are exposed by clustering based on higher-order connectivity patterns.
View details for DOI 10.1126/science.aad9029
View details for Web of Science ID 000379208400037
View details for PubMedID 27387949
-
Growing Wikipedia Across Languages via Recommendation.
Proceedings of the ... International World-Wide Web Conference. International WWW Conference
2016; 2016: 975-985
Abstract
The different Wikipedia language editions vary dramatically in how comprehensive they are. As a result, most language editions contain only a small fraction of the sum of information that exists across all Wikipedias. In this paper, we present an approach to filling gaps in article coverage across different Wikipedia editions. Our main contribution is an end-to-end system for recommending articles for creation that exist in one language but are missing in another. The system involves identifying missing articles, ranking the missing articles according to their importance, and recommending important missing articles to editors based on their interests. We empirically validate our models in a controlled experiment involving 12,000 French Wikipedia editors. We find that personalizing recommendations increases editor engagement by a factor of two. Moreover, recommending articles increases their chance of being created by a factor of 3.2. Finally, articles created as a result of our recommendations are of comparable quality to organically created articles. Overall, our system leads to more engaged editors and faster growth of Wikipedia with no effect on its quality.
View details for PubMedID 27819073
-
Large-scale Analysis of Counseling Conversations: An Application of Natural Language Processing to Mental Health.
Transactions of the Association for Computational Linguistics
2016; 4: 463-476
Abstract
Mental illness is one of the most pressing public health issues of our time. While counseling and psychotherapy can be effective treatments, our knowledge about how to conduct successful counseling conversations has been limited due to lack of large-scale data with labeled outcomes of the conversations. In this paper, we present a large-scale, quantitative study on the discourse of text-message-based counseling conversations. We develop a set of novel computational discourse analysis methods to measure how various linguistic aspects of conversations are correlated with conversation outcomes. Applying techniques such as sequence-based conversation models, language model comparisons, message clustering, and psycholinguistics-inspired word frequency analyses, we discover actionable conversation strategies that are associated with better conversation outcomes.
View details for PubMedID 28344978
-
Improving Website Hyperlink Structure Using Server Logs
ASSOC COMPUTING MACHINERY. 2016: 615–24
Abstract
Good websites should be easy to navigate via hyperlinks, yet maintaining a high-quality link structure is difficult. Identifying pairs of pages that should be linked may be hard for human editors, especially if the site is large and changes frequently. Further, given a set of useful link candidates, the task of incorporating them into the site can be expensive, since it typically involves humans editing pages. In the light of these challenges, it is desirable to develop data-driven methods for automating the link placement task. Here we develop an approach for automatically finding useful hyperlinks to add to a website. We show that passively collected server logs, beyond telling us which existing links are useful, also contain implicit signals indicating which nonexistent links would be useful if they were to be introduced. We leverage these signals to model the future usefulness of yet nonexistent links. Based on our model, we define the problem of link placement under budget constraints and propose an efficient algorithm for solving it. We demonstrate the effectiveness of our approach by evaluating it on Wikipedia, a large website for which we have access to both server logs (used for finding useful new links) and the complete revision history (containing a ground truth of new links). As our method is based exclusively on standard server logs, it may also be applied to any other website, as we show with the example of the biomedical research site Simtk.
View details for PubMedID 28345077
-
Driver Identification Using Automobile Sensor Data from a Single Turn
IEEE. 2016: 953–58
View details for Web of Science ID 000392215500149
-
Seeing the Forest for the Trees: New Approaches to Forecasting Cascades
ASSOC COMPUTING MACHINERY. 2016: 249–58
View details for DOI 10.1145/2908131.2908155
View details for Web of Science ID 000391621700040
-
Confusions over Time: An Interpretable Bayesian Model to Characterize Trends in Decision Making
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2016
View details for Web of Science ID 000458973701091
-
Information Cartography
COMMUNICATIONS OF THE ACM
2015; 58 (11): 62-73
View details for DOI 10.1145/2735624
View details for Web of Science ID 000363563800024
-
The mobilize center: an NIH big data to knowledge center to advance human movement research and improve mobility.
Journal of the American Medical Informatics Association
2015; 22 (6): 1120-1125
Abstract
Regular physical activity helps prevent heart disease, stroke, diabetes, and other chronic diseases, yet a broad range of conditions impair mobility at great personal and societal cost. Vast amounts of data characterizing human movement are available from research labs, clinics, and millions of smartphones and wearable sensors, but integration and analysis of this large quantity of mobility data are extremely challenging. The authors have established the Mobilize Center (http://mobilize.stanford.edu) to harness these data to improve human mobility and help lay the foundation for using data science methods in biomedicine. The Center is organized around 4 data science research cores: biomechanical modeling, statistical learning, behavioral and social modeling, and integrative modeling. Important biomedical applications, such as osteoarthritis and weight management, will focus the development of new data science methods. By developing these new approaches, sharing data and validated software tools, and training thousands of researchers, the Mobilize Center will transform human movement research.
View details for DOI 10.1093/jamia/ocv071
View details for PubMedID 26272077
View details for PubMedCentralID PMC4639715
-
Network Lasso: Clustering and Optimization in Large Graphs.
KDD : proceedings. International Conference on Knowledge Discovery & Data Mining
2015; 2015: 387-396
Abstract
Convex optimization is an essential tool for modern data analysis, as it provides a framework to formulate and solve many problems in machine learning and data mining. However, general convex optimization solvers do not scale well, and scalable solvers are often specialized to only work on a narrow class of problems. Therefore, there is a need for simple, scalable algorithms that can solve many common optimization problems. In this paper, we introduce the network lasso, a generalization of the group lasso to a network setting that allows for simultaneous clustering and optimization on graphs. We develop an algorithm based on the Alternating Direction Method of Multipliers (ADMM) to solve this problem in a distributed and scalable manner, which allows for guaranteed global convergence even on large graphs. We also examine a non-convex extension of this approach. We then demonstrate that many types of problems can be expressed in our framework. We focus on three in particular - binary classification, predicting housing prices, and event detection in time series data - comparing the network lasso to baseline approaches and showing that it is both a fast and accurate method of solving large optimization problems.
View details for PubMedID 27398260
-
Donor Retention in Online Crowdfunding Communities: A Case Study of DonorsChoose.org.
Proceedings of the ... International World-Wide Web Conference. International WWW Conference
2015; 2015: 34-44
Abstract
Online crowdfunding platforms like DonorsChoose.org and Kickstarter allow specific projects to get funded by targeted contributions from a large number of people. Critical for the success of crowdfunding communities is recruitment and continued engagement of donors. With donor attrition rates above 70%, a significant challenge for online crowdfunding platforms as well as traditional offline non-profit organizations is the problem of donor retention. We present a large-scale study of millions of donors and donations on DonorsChoose.org, a crowdfunding platform for education projects. Studying an online crowdfunding platform allows for an unprecedented detailed view of how people direct their donations. We explore various factors impacting donor retention which allows us to identify different groups of donors and quantify their propensity to return for subsequent donations. We find that donors are more likely to return if they had a positive interaction with the receiver of the donation. We also show that this includes appropriate and timely recognition of their support as well as detailed communication of their impact. Finally, we discuss how our findings could inform steps to improve donor retention in crowdfunding communities and non-profit organizations.
View details for PubMedID 27077139
-
Ringo: Interactive Graph Analytics on Big-Memory Machines.
Proceedings. ACM-Sigmod International Conference on Management of Data
2015; 2015: 1105-1110
Abstract
We present Ringo, a system for analysis of large graphs. Graphs provide a way to represent and analyze systems of interacting objects (people, proteins, webpages) with edges between the objects denoting interactions (friendships, physical interactions, links). Mining graphs provides valuable insights about individual objects as well as the relationships among them. In building Ringo, we take advantage of the fact that machines with large memory and many cores are widely available and also relatively affordable. This allows us to build an easy-to-use interactive high-performance graph analytics system. Graphs also need to be built from input data, which often resides in the form of relational tables. Thus, Ringo provides rich functionality for manipulating raw input data tables into various kinds of graphs. Furthermore, Ringo also provides over 200 graph analytics functions that can then be applied to constructed graphs. We show that a single big-memory machine provides a very attractive platform for performing analytics on all but the largest graphs as it offers excellent performance and ease of use as compared to alternative approaches. With Ringo, we also demonstrate how to integrate graph analytics with an iterative process of trial-and-error data exploration and rapid experimentation, common in data mining workloads.
View details for PubMedID 27081215
-
Defining and evaluating network communities based on ground-truth
KNOWLEDGE AND INFORMATION SYSTEMS
2015; 42 (1): 181-213
View details for DOI 10.1007/s10115-013-0693-z
View details for Web of Science ID 000347286900008
-
Mining Online Networks and Communities
SPRINGER-VERLAG BERLIN. 2015
View details for Web of Science ID 000364669900003
-
Large Scale Network Analytics with SNAP Tutorial at the World Wide Web 2015 Conference
ASSOC COMPUTING MACHINERY. 2015: 1537–38
View details for DOI 10.1145/2740908.2744708
View details for Web of Science ID 000382666600335
-
Tensor Spectral Clustering for Partitioning Higher-order Network Structures.
Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining
2015; 2015: 118-126
Abstract
Spectral graph theory-based methods represent an important class of tools for studying the structure of networks. Spectral methods are based on a first-order Markov chain derived from a random walk on the graph and thus they cannot take advantage of important higher-order network substructures such as triangles, cycles, and feed-forward loops. Here we propose a Tensor Spectral Clustering (TSC) algorithm that allows for modeling higher-order network structures in a graph partitioning framework. Our TSC algorithm allows the user to specify which higher-order network structures (cycles, feed-forward loops, etc.) should be preserved by the network clustering. Higher-order network structures of interest are represented using a tensor, which we then partition by developing a multilinear spectral method. Our framework can be applied to discovering layered flows in networks as well as graph anomaly detection, which we illustrate on synthetic networks. In directed networks, a higher-order structure of particular interest is the directed 3-cycle, which captures feedback loops in networks. We demonstrate that our TSC algorithm produces large partitions that cut fewer directed 3-cycles than standard spectral clustering algorithms.
View details for PubMedID 27812399
-
Mining Missing Hyperlinks from Human Navigation Traces: A Case Study of Wikipedia.
Proceedings of the ... International World-Wide Web Conference. International WWW Conference
2015; 2015: 1242-1252
Abstract
Hyperlinks are an essential feature of the World Wide Web. They are especially important for online encyclopedias such as Wikipedia: an article can often only be understood in the context of related articles, and hyperlinks make it easy to explore this context. But important links are often missing, and several methods have been proposed to alleviate this problem by learning a linking model based on the structure of the existing links. Here we propose a novel approach to identifying missing links in Wikipedia. We build on the fact that the ultimate purpose of Wikipedia links is to aid navigation. Rather than merely suggesting new links that are in tune with the structure of existing links, our method finds missing links that would immediately enhance Wikipedia's navigability. We leverage data sets of navigation paths collected through a Wikipedia-based human-computation game in which users must find a short path from a start to a target article by only clicking links encountered along the way. We harness human navigational traces to identify a set of candidates for missing links and then rank these candidates. Experiments show that our procedure identifies missing links of high quality.
View details for PubMedID 26634229
-
Analyzing Information Seeking and Drug-Safety Alert Response by Health Care Professionals as New Methods for Surveillance.
Journal of medical Internet research
2015; 17 (8)
Abstract
Patterns in general consumer online search logs have been used to monitor health conditions and to predict health-related activities, but the multiple contexts within which consumers perform online searches make significant associations difficult to interpret. Physician information-seeking behavior has typically been analyzed through survey-based approaches and literature reviews. Activity logs from health care professionals using online medical information resources are thus a valuable yet relatively untapped resource for large-scale medical surveillance. To analyze health care professionals' information-seeking behavior and assess the feasibility of measuring drug-safety alert response from the usage logs of an online medical information resource. Using two years (2011-2012) of usage logs from UpToDate, we measured the volume of searches related to medical conditions with significant burden in the United States, as well as the seasonal distribution of those searches. We quantified the relationship between searches and resulting page views. Using a large collection of online mainstream media articles and Web log posts we also characterized the uptake of a Food and Drug Administration (FDA) alert via changes in UpToDate search activity compared with general online media activity related to the subject of the alert. Diseases and symptoms dominate UpToDate searches. Some searches result in page views of only short duration, while others consistently result in longer-than-average page views. The response to an FDA alert for Celexa, characterized by a change in UpToDate search activity, differed considerably from general online media activity. Changes in search activity appeared later and persisted longer in UpToDate logs. The volume of searches and page view durations related to Celexa before the alert also differed from those after the alert. Understanding the information-seeking behavior associated with online evidence sources can offer insight into the information needs of health professionals and enable large-scale medical surveillance. Our Web log mining approach has the potential to monitor responses to FDA alerts at a national level. Our findings can also inform the design and content of evidence-based medical information resources such as UpToDate.
View details for DOI 10.2196/jmir.4427
View details for PubMedID 26293444
-
Overlapping Communities Explain Core-Periphery Organization of Networks
PROCEEDINGS OF THE IEEE
2014; 102 (12): 1892-1902
View details for DOI 10.1109/JPROC.2014.2364018
View details for Web of Science ID 000345524100004
-
Geospatial Structure of a Planetary-Scale Social Network
IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS
2014; 1 (3): 156–63
View details for DOI 10.1109/TCSS.2014.2377789
View details for Web of Science ID 000433874800001
-
Structure and Overlaps of Ground-Truth Communities in Networks
ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY
2014; 5 (2)
View details for DOI 10.1145/2594454
View details for Web of Science ID 000335576200005
-
Uncovering the structure and temporal dynamics of information propagation
NETWORK SCIENCE
2014; 2 (1): 26–65
View details for DOI 10.1017/nws.2014.3
View details for Web of Science ID 000218616400002
-
Discovering Social Circles in Ego Networks
ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA
2014; 8 (1): 73-100
View details for DOI 10.1145/2556612
View details for Web of Science ID 000333491900004
-
Status and Friendship: Mechanisms of Social Network Evolution
ASSOC COMPUTING MACHINERY. 2014: 229–30
View details for DOI 10.1145/2567948.257732
View details for Web of Science ID 000455947000061
-
Can Cascades be Predicted?
ASSOC COMPUTING MACHINERY. 2014: 925–35
View details for DOI 10.1145/2566486.2567997
View details for Web of Science ID 000455945100083
-
The Bursty Dynamics of the Twitter Information Network
ASSOC COMPUTING MACHINERY. 2014: 913–23
View details for DOI 10.1145/2566486.2568043
View details for Web of Science ID 000455945100082
-
Engaging with Massive Online Courses
ASSOC COMPUTING MACHINERY. 2014: 687–97
View details for Web of Science ID 000455945100063
-
Finding Progression Stages in Time-evolving Event Sequences
ASSOC COMPUTING MACHINERY. 2014: 783–93
View details for DOI 10.1145/2566486.2568044
View details for Web of Science ID 000455945100071
- Modeling Information Propagation with Survival Theory 2013
-
Community Detection in Networks with Node Attributes
IEEE 13th International Conference on Data Mining (ICDM)
IEEE. 2013: 1151–1156
View details for DOI 10.1109/ICDM.2013.167
View details for Web of Science ID 000332874200130
- Structure and Dynamics of Information Pathways in Online Media 2013
- Nonparametric Multi-group Membership Model for Dynamic Networks 2013
- From Amateurs to Connoisseurs: Modeling the Evolution of User Expertise through Online Reviews 2013
- Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text 2013
- NIFTY: A System for Large Scale Information Flow Tracking and Clustering 2013
- Steering User Behavior With Badges 2013
- Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach 2013
- Information Cartography: Creating Zoomable, Large-Scale Maps of Information 2013
- A computational approach to politeness with application to social factors 2013
- No Country for Old Members: User lifecycle and linguistic change in online communities 2013
- What’s in a name? Understanding the Interplay between Titles, Content, and Communities in Social Media 2013
-
Measurement error in network data: A re-classification
SOCIAL NETWORKS
2012; 34 (4): 396-409
View details for DOI 10.1016/j.socnet.2012.01.003
View details for Web of Science ID 000313304100004
-
Inferring Networks of Diffusion and Influence
ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA
2012; 5 (4)
View details for DOI 10.1145/2086737.2086741
View details for Web of Science ID 000300526600004
-
Community-Affiliation Graph Model for Overlapping Network Community Detection
12th IEEE International Conference on Data Mining (ICDM)
IEEE. 2012: 1170–1175
View details for DOI 10.1109/ICDM.2012.139
View details for Web of Science ID 000316383800143
-
Multiplicative Attribute Graph Model of Real-World Networks
INTERNET MATHEMATICS
2012; 8 (1-2): 113–60
View details for DOI 10.1080/15427951.2012.625257
View details for Web of Science ID 000217673700006
-
Image Labeling on a Network: Using Social-Network Metadata for Image Classification
12th European Conference on Computer Vision (ECCV)
SPRINGER-VERLAG BERLIN. 2012: 828–841
View details for Web of Science ID 000342818800059
- Learning to Discover Social Circles in Ego Networks 2012
- Latent Multi-group Membership Graph Model 2012
- Information Diffusion and External Influence in Networks 2012
- Learning Attitudes and Attributes from Multi-Aspect Reviews 2012
- Automatic versus Human Navigation in Information Networks 2012
- Discovering Value from Community Activity on Focused Question Answering Sites: A Case Study of Stack Overflow 2012
- The Life and Death of Online Groups: Predicting Group Growth and Longevity 2012
- Human Wayfinding in Information Networks 2012
- Effects of User Similarity in Social Media 2012
-
Defining and Evaluating Network Communities based on Ground-truth
12th IEEE International Conference on Data Mining (ICDM)
IEEE. 2012: 745–754
View details for DOI 10.1109/ICDM.2012.138
View details for Web of Science ID 000316383800076
-
Clash of the Contagions: Cooperation and Competition in Information Diffusion
12th IEEE International Conference on Data Mining (ICDM)
IEEE. 2012: 539–548
View details for DOI 10.1109/ICDM.2012.159
View details for Web of Science ID 000316383800055
-
HADI: Mining Radii of Large Graphs
ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA
2011; 5 (2)
View details for DOI 10.1145/1921632.1921634
View details for Web of Science ID 000299341700002
-
Large-Scale Web Data Analysis
IEEE INTELLIGENT SYSTEMS
2011; 26 (1): 11-11
View details for Web of Science ID 000287660800008
-
Kronecker Graphs
GRAPH ALGORITHMS IN THE LANGUAGE OF LINEAR ALGEBRA
2011; 22: 137–204
View details for Web of Science ID 000293907200010
- Sentiment Flow Through Hyperlink Networks 2011
- Modeling Social Networks with Node Attributes using the Multiplicative Attribute Graph Model 2011
- Dynamics of Bidding in a P2P Lending Service: Effects of Herding and Predicting Loan Success 2011
- The Network Completion Problem: Inferring Missing Nodes and Edges in Networks 2011
- Patterns of Temporal Variation in Online Media 2011
- The Role of Social Networks in Online Shopping: Information Passing, Price of Trust, and Consumer Choice 2011
- Supervised Random Walks: Predicting and Recommending Links in Social Networks 2011
- Friendship and Mobility: User Movement In Location-Based Social Networks 2011
- Correcting for Missing Data in Information Cascades 2011
-
Kronecker Graphs: An Approach to Modeling Networks
JOURNAL OF MACHINE LEARNING RESEARCH
2010; 11: 985-1042
View details for Web of Science ID 000277186500021
-
Multiplicative Attribute Graph Model of Real-World Networks
7th Workshop on Algorithms and Models for the Web Graph
SPRINGER-VERLAG BERLIN. 2010: 62–73
View details for Web of Science ID 000297030700007
- Predicting Positive and Negative Links in Online Social Networks 2010
- Citing for High Impact 2010
- Modeling Information Diffusion in Implicit Networks 2010
- Empirical Comparison of Algorithms for Network Community Detection 2010
- On the Convexity of Latent Social Network Inference 2010
- Radius Plots for Mining Tera-byte Scale Graphs: Algorithms, Patterns, and Observations 2010
- Governance in Social Media: A case study of the Wikipedia promotion process 2010
-
Signed Networks in Social Media
28th Annual CHI Conference on Human Factors in Computing Systems
ASSOC COMPUTING MACHINERY. 2010: 1361–1370
View details for Web of Science ID 000281276700157
-
Meme-tracking and the Dynamics of the News Cycle
15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
ASSOC COMPUTING MACHINERY. 2009: 497–505
View details for Web of Science ID 000270922000049
-
Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters
INTERNET MATHEMATICS
2009; 6 (1): 29–123
View details for DOI 10.1080/15427951.2009.10129177
View details for Web of Science ID 000217654500004
- Modeling blog dynamics 2009
- The Battle of the Water Sensor Networks (BWSN): A Design Challenge for Engineers and Algorithms 2009
-
Efficient Sensor Placement Optimization for Securing Large Water Distribution Networks
JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT-ASCE
2008; 134 (6): 516-526
View details for DOI 10.1061/(ASCE)0733-9496(2008)134:6(516)
View details for Web of Science ID 000260124300005
- Mobile Call Graphs: Beyond Power-Law and Lognormal Distributions 2008
- Planetary-Scale Views on a Large Instant-Messaging Network 2008
- Epidemic Thresholds in Real Networks 2008
- Statistical Properties of Community Structure in Large Social and Information Networks 2008
- Microscopic Evolution of Social Networks 2008
- Monitoring Network Evolution using MDL 2008
- Cost-effective Outbreak Detection in Networks 2007
- Web Projections: Learning from Contextual Subgraphs of the Web 2007
- Scalable Modeling of Real Graphs using Kronecker Multiplication 2007
- The Dynamics of Viral Marketing ACM Transactions on the Web (TWEB) 2007; 1 (1)
- Graph Evolution: Densification and Shrinking Diameters 2007
- Cascading Behavior in Large Blog Graphs 2007
- Information Survival Threshold in Sensor and P2P Networks 2007
- Sampling from Large Graphs 2006
- Data Association for Topic Intensity Tracking 2006
- Patterns of Influence in a Recommendation Network 2006
- The Dynamics of Viral Marketing 2006
-
Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication
16th European Conference on Machine Learning (ECML)/9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD)
SPRINGER-VERLAG BERLIN. 2005: 133–145
View details for Web of Science ID 000233235600017
- Semantic Text Features from Small World Graphs 2005
- Impact of Linguistic Analysis on the Semantic Graph Coverage and Learning of Document Extracts 2005
- Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations 2005
- Extracting Summary Sentences Based on the Document Semantic Graph Microsoft Research Technical Report MSR-TR-2005-07 2005
- Learning Sub-structures of Document Semantic Graphs for Document Summarization 2004
- The Download Estimation task on KDD Cup 2003 SIGKDD Explorations 2003
- Linear Programming boost for Uneven Datasets 2003
- KDD Cup 2003: The Download Estimation task Jozef Stefan Institute Technical Report 2003
- Govorec - sistem za slovensko govorjenje racunalniskih besedil Information Society 2001
- Detection of Human Bodies using Computer Analysis of a Sequence of Stereo Images 1999