Bio


Leskovec's research focuses on analyzing and modeling large social and information networks as a means of studying phenomena across the social, technological, and natural worlds. He focuses on statistical modeling of network structure, network evolution, and the spread of information, influence, and viruses over networks. The problems he investigates are motivated by large-scale data, the Web, and other online media. He also works on text mining and applications of machine learning.

Professional Education


  • BSc, University of Ljubljana, Slovenia, Computer Science (2004)
  • PhD, Carnegie Mellon University, Computer Science (2008)

All Publications


  • Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics (Oxford, England) Zitnik, M., Agrawal, M., Leskovec, J. 2018; 34 (13): i457–i466

    Abstract

    Motivation: The use of drug combinations, termed polypharmacy, is common to treat patients with complex diseases or co-existing conditions. However, a major consequence of polypharmacy is a much higher risk of adverse side effects for the patient. Polypharmacy side effects emerge because of drug-drug interactions, in which activity of one drug may change, favorably or unfavorably, if taken with another drug. The knowledge of drug interactions is often limited because these complex relationships are rare, and are usually not observed in relatively small clinical testing. Discovering polypharmacy side effects thus remains an important challenge with significant implications for patient mortality and morbidity. Results: Here, we present Decagon, an approach for modeling polypharmacy side effects. The approach constructs a multimodal graph of protein-protein interactions, drug-protein target interactions and the polypharmacy side effects, which are represented as drug-drug interactions, where each side effect is an edge of a different type. Decagon is developed specifically to handle such multimodal graphs with a large number of edge types. Our approach develops a new graph convolutional neural network for multirelational link prediction in multimodal networks. Unlike approaches limited to predicting simple drug-drug interaction values, Decagon can predict the exact side effect, if any, through which a given drug combination manifests clinically. Decagon accurately predicts polypharmacy side effects, outperforming baselines by up to 69%. We find that it automatically learns representations of side effects indicative of co-occurrence of polypharmacy in patients. Furthermore, Decagon models particularly well polypharmacy side effects that have a strong molecular basis, while on predominantly non-molecular side effects, it achieves good performance because of effective sharing of model parameters across edge types. Decagon opens up opportunities to use large pharmacogenomic and patient population data to flag and prioritize polypharmacy side effects for follow-up analysis via formal pharmacological studies. Availability and implementation: Source code and preprocessed datasets are at: http://snap.stanford.edu/decagon.

    View details for DOI 10.1093/bioinformatics/bty294

    View details for PubMedID 29949996

  • Prioritizing network communities. Nature communications Zitnik, M., Sosic, R., Leskovec, J. 2018; 9 (1): 2544

    Abstract

    Uncovering modular structure in networks is fundamental for systems in biology, physics, and engineering. Community detection identifies candidate modules as hypotheses, which then need to be validated through experiments, such as mutagenesis in a biological laboratory. Only a few communities can typically be validated, and it is thus important to prioritize which communities to select for downstream experimentation. Here we develop CRANK, a mathematically principled approach for prioritizing network communities. CRANK efficiently evaluates robustness and magnitude of structural features of each community and then combines these features into the community prioritization. CRANK can be used with any community detection method. It needs only information provided by the network structure and does not require any additional metadata or labels. However, when available, CRANK can incorporate domain-specific information to further boost performance. Experiments on many large networks show that CRANK effectively prioritizes communities, yielding a nearly 50-fold improvement in community prioritization.

    View details for DOI 10.1038/s41467-018-04948-5

    View details for PubMedID 29959323

  • Higher-order clustering in networks. Physical Review E Yin, H., Benson, A. R., Leskovec, J. 2018; 97 (5)
  • Modeling Individual Cyclic Variation in Human Behavior. Proceedings of the ... International World-Wide Web Conference. International WWW Conference Pierson, E., Althoff, T., Leskovec, J. 2018; 2018: 107–16

    Abstract

    Cycles are fundamental to human health and behavior. Examples include mood cycles, circadian rhythms, and the menstrual cycle. However, modeling cycles in time series data is challenging because in most cases the cycles are not labeled or directly observed and need to be inferred from multidimensional measurements taken over time. Here, we present Cyclic Hidden Markov Models (CyHMMs) for detecting and modeling cycles in a collection of multidimensional heterogeneous time series data. In contrast to previous cycle modeling methods, CyHMMs deal with a number of challenges encountered in modeling real-world cycles: they can model multivariate data with both discrete and continuous dimensions; they explicitly model and are robust to missing data; and they can share information across individuals to accommodate variation both within and between individual time series. Experiments on synthetic and real-world health-tracking data demonstrate that CyHMMs infer cycle lengths more accurately than existing methods, with 58% lower error on simulated data and 63% lower error on real-world data compared to the best-performing baseline. CyHMMs can also perform functions which baselines cannot: they can model the progression of individual features/symptoms over the course of the cycle, identify the most variable features, and cluster individual time series into groups with distinct characteristics. Applying CyHMMs to two real-world health-tracking datasets (of human menstrual cycle symptoms and physical activity tracking data) yields important insights including which symptoms to expect at each point during the cycle. We also find that people fall into several groups with distinct cycle patterns, and that these groups differ along dimensions not provided to the model. For example, by modeling missing data in the menstrual cycles dataset, we are able to discover a medically relevant group of birth control users even though information on birth control is not given to the model.

    View details for DOI 10.1145/3178876.3186052

    View details for PubMedID 29780976

  • Modeling Interdependent and Periodic Real-World Action Sequences. Proceedings of the ... International World-Wide Web Conference. International WWW Conference Kurashima, T., Althoff, T., Leskovec, J. 2018; 2018: 803–12

    Abstract

    Mobile health applications, including those that track activities such as exercise, sleep, and diet, are becoming widely used. Accurately predicting human actions in the real world is essential for targeted recommendations that could improve our health and for personalization of these applications. However, making such predictions is extremely difficult due to the complexities of human behavior, which consists of a large number of potential actions that vary over time, depend on each other, and are periodic. Previous work has not jointly modeled these dynamics and has largely focused on item consumption patterns instead of broader types of behaviors such as eating, commuting or exercising. In this work, we develop a novel statistical model, called TIPAS, for Time-varying, Interdependent, and Periodic Action Sequences. Our approach is based on personalized, multivariate temporal point processes that model time-varying action propensities through a mixture of Gaussian intensities. Our model captures short-term and long-term periodic interdependencies between actions through Hawkes process-based self-excitations. We evaluate our approach on two activity logging datasets comprising 12 million real-world actions (e.g., eating, sleep, and exercise) taken by 20 thousand users over 17 months. We demonstrate that our approach allows us to make successful predictions of future user actions and their timing. Specifically, TIPAS improves predictions of actions, and their timing, over existing methods across multiple datasets by up to 156%, and up to 37%, respectively. Performance improvements are particularly large for relatively rare and periodic actions such as walking and biking, improving over baselines by up to 256%. This demonstrates that explicit modeling of dependencies and periodicities in real-world behavior enables successful predictions of future actions, with implications for modeling human behavior, app personalization, and targeting of health interventions.

    View details for DOI 10.1145/3178876.3186161

    View details for PubMedID 29780977

  • I'll Be Back: On the Multiple Lives of Users of a Mobile Activity Tracking Application. Proceedings of the ... International World-Wide Web Conference. International WWW Conference Lin, Z., Althoff, T., Leskovec, J. 2018; 2018: 1501–11

    Abstract

    Mobile health applications that track activities, such as exercise, sleep, and diet, are becoming widely used. While these activity tracking applications have the potential to improve our health, user engagement and retention are critical factors for their success. However, long-term user engagement patterns in real-world activity tracking applications are not yet well understood. Here we study user engagement patterns within a mobile physical activity tracking application consisting of 115 million logged activities taken by over a million users over 31 months. Specifically, we show that over 75% of users return and re-engage with the application after prolonged periods of inactivity, no matter the duration of the inactivity. We find a surprising result that the re-engagement usage patterns resemble those of the start of the initial engagement period, rather than being a simple continuation of the end of the initial engagement period. This evidence points to a conceptual model of multiple lives of user engagement, extending the prevalent single life view of user activity. We demonstrate that these multiple lives occur because the users have a variety of different primary intents or goals for using the app. These primary intents are associated with how long each life lasts and how likely the user is to re-engage for a new life. We find evidence for users being more likely to stop using the app once they achieved their primary intent or goal (e.g., weight loss). However, these users might return once their original intent resurfaces (e.g., wanting to lose newly gained weight). We discuss implications of the multiple life paradigm and propose a novel prediction task of predicting the number of lives of a user. Based on insights developed in this work, including a marker of improved primary intent performance, our prediction models achieve 71% ROC AUC. 
    Overall, our research has implications for modeling user re-engagement in health activity tracking applications and has consequences for how notifications, recommendations as well as gamification can be used to increase engagement.

    View details for DOI 10.1145/3178876.3186062

    View details for PubMedID 29780978

  • Human Decisions and Machine Predictions. Quarterly Journal of Economics Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., Mullainathan, S. 2018; 133 (1): 237–93

    Abstract

    Can machine learning improve human decision making? Bail decisions provide a good test case. Millions of times each year, judges make jail-or-release decisions that hinge on a prediction of what a defendant would do if released. The concreteness of the prediction task combined with the volume of data available makes this a promising machine-learning application. Yet comparing the algorithm to judges proves complicated. First, the available data are generated by prior judge decisions. We only observe crime outcomes for released defendants, not for those judges detained. This makes it hard to evaluate counterfactual decision rules based on algorithmic predictions. Second, judges may have a broader set of preferences than the variable the algorithm predicts; for instance, judges may care specifically about violent crimes or about racial inequities. We deal with these problems using different econometric strategies, such as quasi-random assignment of cases to judges. Even accounting for these concerns, our results suggest potentially large welfare gains: one policy simulation shows crime reductions up to 24.7% with no change in jailing rates, or jailing rate reductions up to 41.9% with no increase in crime rates. Moreover, all categories of crime, including violent crimes, show reductions; and these gains can be achieved while simultaneously reducing racial disparities. These results suggest that while machine learning can be valuable, realizing this value requires integrating these tools into an economic framework: being clear about the link between predictions and decisions; specifying the scope of payoff functions; and constructing unbiased decision counterfactuals. JEL Codes: C10 (Econometric and statistical methods and methodology), C55 (Large datasets: Modeling and analysis), K40 (Legal procedure, the legal system, and illegal behavior).

    View details for DOI 10.1093/qje/qjx032

    View details for Web of Science ID 000423802600005

    View details for PubMedID 29755141

    View details for PubMedCentralID PMC5947971

  • Large-scale analysis of disease pathways in the human interactome. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Agrawal, M., Zitnik, M., Leskovec, J. 2018; 23: 111–22

    Abstract

    Discovering disease pathways, which can be defined as sets of proteins associated with a given disease, is an important problem that has the potential to provide clinically actionable insights for disease diagnosis, prognosis, and treatment. Computational methods aid the discovery by relying on protein-protein interaction (PPI) networks. They start with a few known disease-associated proteins and aim to find the rest of the pathway by exploring the PPI network around the known disease proteins. However, the success of such methods has been limited, and failure cases have not been well understood. Here we study the PPI network structure of 519 disease pathways. We find that 90% of pathways do not correspond to single well-connected components in the PPI network. Instead, proteins associated with a single disease tend to form many separate connected components/regions in the network. We then evaluate state-of-the-art disease pathway discovery methods and show that their performance is especially poor on diseases with disconnected pathways. Thus, we conclude that network connectivity structure alone may not be sufficient for disease pathway discovery. However, we show that higher-order network structures, such as small subgraphs of the pathway, provide a promising direction for the development of new methods.

    View details for PubMedID 29218874

  • The Selective Labels Problem: Evaluating Algorithmic Predictions in the Presence of Unobservables. KDD : proceedings. International Conference on Knowledge Discovery & Data Mining Lakkaraju, H., Kleinberg, J., Leskovec, J., Ludwig, J., Mullainathan, S. 2017; 2017: 275–84

    Abstract

    Evaluating whether machines improve on human performance is one of the central questions of machine learning. However, there are many domains where the data is selectively labeled in the sense that the observed outcomes are themselves a consequence of the existing choices of the human decision-makers. For instance, in the context of judicial bail decisions, we observe the outcome of whether a defendant fails to return for their court appearance only if the human judge decides to release the defendant on bail. This selective labeling makes it harder to evaluate predictive models as the instances for which outcomes are observed do not represent a random sample of the population. Here we propose a novel framework for evaluating the performance of predictive models on selectively labeled data. We develop an approach called contraction which allows us to compare the performance of predictive models and human decision-makers without resorting to counterfactual inference. Our methodology harnesses the heterogeneity of human decision-makers and facilitates effective evaluation of predictive models even in the presence of unmeasured confounders (unobservables) which influence both human decisions and the resulting outcomes. Experimental results on real world datasets spanning diverse domains such as health care, insurance, and criminal justice demonstrate the utility of our evaluation metric in comparing human decisions and machine predictions.

    View details for DOI 10.1145/3097983.3098066

    View details for PubMedID 29780658

  • Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data. KDD : proceedings. International Conference on Knowledge Discovery & Data Mining Hallac, D., Vare, S., Boyd, S., Leskovec, J. 2017; 2017: 215–23

    Abstract

    Subsequence clustering of multivariate time series is a useful tool for discovering repeated patterns in temporal data. Once these patterns have been discovered, seemingly complicated datasets can be interpreted as a temporal sequence of only a small number of states, or clusters. For example, raw sensor data from a fitness-tracking application can be expressed as a timeline of a select few actions (i.e., walking, sitting, running). However, discovering these patterns is challenging because it requires simultaneous segmentation and clustering of the time series. Furthermore, interpreting the resulting clusters is difficult, especially when the data is high-dimensional. Here we propose a new method of model-based clustering, which we call Toeplitz Inverse Covariance-based Clustering (TICC). Each cluster in the TICC method is defined by a correlation network, or Markov random field (MRF), characterizing the interdependencies between different observations in a typical subsequence of that cluster. Based on this graphical representation, TICC simultaneously segments and clusters the time series data. We solve the TICC problem through alternating minimization, using a variation of the expectation maximization (EM) algorithm. We derive closed-form solutions to efficiently solve the two resulting subproblems in a scalable way, through dynamic programming and the alternating direction method of multipliers (ADMM), respectively. We validate our approach by comparing TICC to several state-of-the-art baselines in a series of synthetic experiments, and we then demonstrate on an automobile sensor dataset how TICC can be used to learn interpretable clusters in real-world scenarios.

    View details for DOI 10.1145/3097983.3098060

    View details for PubMedID 29770257

  • Network Inference via the Time-Varying Graphical Lasso. KDD : proceedings. International Conference on Knowledge Discovery & Data Mining Hallac, D., Park, Y., Boyd, S., Leskovec, J. 2017; 2017: 205–13

    Abstract

    Many important problems can be modeled as a system of interconnected entities, where each entity is recording time-dependent observations or measurements. In order to spot trends, detect anomalies, and interpret the temporal dynamics of such data, it is essential to understand the relationships between the different entities and how these relationships evolve over time. In this paper, we introduce the time-varying graphical lasso (TVGL), a method of inferring time-varying networks from raw time series data. We cast the problem in terms of estimating a sparse time-varying inverse covariance matrix, which reveals a dynamic network of interdependencies between the entities. Since dynamic network inference is a computationally expensive task, we derive a scalable message-passing algorithm based on the Alternating Direction Method of Multipliers (ADMM) to solve this problem in an efficient way. We also discuss several extensions, including a streaming algorithm to update the model and incorporate new observations in real time. Finally, we evaluate our TVGL algorithm on both real and synthetic datasets, obtaining interpretable results and outperforming state-of-the-art baselines in terms of both accuracy and scalability.

    View details for DOI 10.1145/3097983.3098037

    View details for PubMedID 29770256

  • Local Higher-Order Graph Clustering. KDD : proceedings. International Conference on Knowledge Discovery & Data Mining Yin, H., Benson, A. R., Leskovec, J., Gleich, D. F. 2017; 2017: 555–64

    Abstract

    Local graph clustering methods aim to find a cluster of nodes by exploring a small region of the graph. These methods are attractive because they enable targeted clustering around a given seed node and are faster than traditional global graph clustering methods because their runtime does not depend on the size of the input graph. However, current local graph partitioning methods are not designed to account for the higher-order structures crucial to the network, nor can they effectively handle directed networks. Here we introduce a new class of local graph clustering methods that address these issues by incorporating higher-order network information captured by small subgraphs, also called network motifs. We develop the Motif-based Approximate Personalized PageRank (MAPPR) algorithm that finds clusters containing a seed node with minimal motif conductance, a generalization of the conductance metric for network motifs. We generalize existing theory to prove the fast running time (independent of the size of the graph) and obtain theoretical guarantees on the cluster quality (in terms of motif conductance). We also develop a theory of node neighborhoods for finding sets that have small motif conductance, and apply these results to the case of finding good seed nodes to use as input to the MAPPR algorithm. Experimental validation on community detection tasks in both synthetic and real-world networks, shows that our new framework MAPPR outperforms the current edge-based personalized PageRank methodology.

    View details for DOI 10.1145/3097983.3098069

    View details for PubMedID 29770258

  • Large-scale physical activity data reveal worldwide activity inequality. Nature Althoff, T., Sosic, R., Hicks, J. L., King, A. C., Delp, S. L., Leskovec, J. 2017; 547 (7663): 336-+

    Abstract

    To be able to curb the global pandemic of physical inactivity and the associated 5.3 million deaths per year, we need to understand the basic principles that govern physical activity. However, there is a lack of large-scale measurements of physical activity patterns across free-living populations worldwide. Here we leverage the wide usage of smartphones with built-in accelerometry to measure physical activity at the global scale. We study a dataset consisting of 68 million days of physical activity for 717,527 people, giving us a window into activity in 111 countries across the globe. We find inequality in how activity is distributed within countries and that this inequality is a better predictor of obesity prevalence in the population than average activity volume. Reduced activity in females contributes to a large portion of the observed activity inequality. Aspects of the built environment, such as the walkability of a city, are associated with a smaller gender gap in activity and lower activity inequality. In more walkable cities, activity is greater throughout the day and throughout the week, across age, gender, and body mass index (BMI) groups, with the greatest increases in activity found for females. Our findings have implications for global public health policy and urban planning and highlight the role of activity inequality and the built environment in improving physical activity and health.

    View details for DOI 10.1038/nature23018

    View details for Web of Science ID 000405844900031

    View details for PubMedID 28693034

    View details for PubMedCentralID PMC5774986

  • Network analysis: a novel method for mapping neonatal acute transport patterns in California. Journal of perinatology Kunz, S. N., Zupancic, J. A., Rigdon, J., Phibbs, C. S., Lee, H. C., Gould, J. B., Leskovec, J., Profit, J. 2017; 37 (6): 702-708

    Abstract

    The objectives of this study are to use network analysis to describe the pattern of neonatal transfers in California, to compare empirical sub-networks with established referral regions and to determine factors associated with transport outside the originating sub-network. This cross-sectional database study included 6546 infants <28 days old transported within California in 2012. After generating a graph representing acute transfers between hospitals (n=6696), we used community detection techniques to identify more tightly connected sub-networks. These empirically derived sub-networks were compared with state-defined regional referral networks. Reasons for transfer between empirical sub-networks were assessed using logistic regression. Empirical sub-networks showed significant overlap with regulatory regions (P<0.001). Transfer outside the empirical sub-network was associated with major congenital anomalies (P<0.001), need for surgery (P=0.01) and insurance as the reason for transfer (P<0.001). Network analysis accurately reflected empirical neonatal transfer patterns, potentially facilitating quantitative, rather than qualitative, analysis of regionalized health care delivery systems. Journal of Perinatology advance online publication, 23 March 2017; doi:10.1038/jp.2017.20.

    View details for DOI 10.1038/jp.2017.20

    View details for PubMedID 28333155

  • Loyalty in Online Communities. Proceedings of the ... International AAAI Conference on Weblogs and Social Media. International AAAI Conference on Weblogs and Social Media Hamilton, W. L., Zhang, J., Danescu-Niculescu-Mizil, C., Jurafsky, D., Leskovec, J. 2017; 2017: 540–43

    Abstract

    Loyalty is an essential component of multi-community engagement. When users have the choice to engage with a variety of different communities, they often become loyal to just one, focusing on that community at the expense of others. However, it is unclear how loyalty is manifested in user behavior, or whether certain community characteristics encourage loyalty. In this paper we operationalize loyalty as a user-community relation: users loyal to a community consistently prefer it over all others; loyal communities retain their loyal users over time. By exploring a large set of Reddit communities, we reveal that loyalty is manifested in remarkably consistent behaviors. Loyal users employ language that signals collective identity and engage with more esoteric, less popular content, indicating that they may play a curational role in surfacing new material. Loyal communities have denser user-user interaction networks and lower rates of triadic closure, suggesting that community-level loyalty is associated with more cohesive interactions and less fragmentation into subgroups. We exploit these general patterns to predict future rates of loyalty. Our results show that a user's propensity to become loyal is apparent from their initial interactions with a community, suggesting that some users are intrinsically loyal from the very beginning.

    View details for PubMedID 29354326

  • Online Actions with Offline Impact: How Online Social Networks Influence Online and Offline User Behavior. Proceedings of the ... International Conference on Web Search & Data Mining. International Conference on Web Search & Data Mining Althoff, T., Jindal, P., Leskovec, J. 2017; 2017: 537-546

    Abstract

    Many of today's most widely used computing applications utilize social networking features and allow users to connect, follow each other, share content, and comment on others' posts. However, despite the widespread adoption of these features, there is little understanding of the consequences that social networking has on user retention, engagement, and online as well as offline behavior. Here, we study how social networks influence user behavior in a physical activity tracking application. We analyze 791 million online and offline actions of 6 million users over the course of 5 years, and show that social networking leads to a significant increase in users' online as well as offline activities. Specifically, we establish a causal effect of how social networks influence user behavior. We show that the creation of new social connections increases user online in-application activity by 30%, user retention by 17%, and user offline real-world physical activity by 7% (about 400 steps per day). By exploiting a natural experiment we distinguish the effect of social influence of new social connections from the simultaneous increase in user's motivation to use the app and take more steps. We show that social influence accounts for 55% of the observed changes in user behavior, while the remaining 45% can be explained by the user's increased motivation to use the app. Further, we show that subsequent, individual edge formations in the social network lead to significant increases in daily steps. These effects diminish with each additional edge and vary based on edge attributes and user demographics. Finally, we utilize these insights to develop a model that accurately predicts which users will be most influenced by the creation of new social network connections.

    View details for DOI 10.1145/3018661.3018672

    View details for PubMedID 28345078

  • SnapVX: A Network-Based Convex Optimization Solver. Journal of machine learning research : JMLR Hallac, D., Wong, C., Diamond, S., Sharang, A., Sosic, R., Boyd, S., Leskovec, J. 2017; 18 (1): 110–14

    Abstract

    SnapVX is a high-performance solver for convex optimization problems defined on networks. For problems of this form, SnapVX provides a fast and scalable solution with guaranteed global convergence. It combines the capabilities of two open source software packages: Snap.py and CVXPY. Snap.py is a large scale graph processing library, and CVXPY provides a general modeling framework for small-scale subproblems. SnapVX offers a customizable yet easy-to-use Python interface with "out-of-the-box" functionality. Based on the Alternating Direction Method of Multipliers (ADMM), it is able to efficiently store, analyze, parallelize, and solve large optimization problems from a variety of different applications. Documentation, examples, and more can be found on the SnapVX website at http://snap.stanford.edu/snapvx.

    View details for PubMedID 29599649

  • Large-scale Graph Representation Learning. Leskovec, J., Nie, J. Y., Obradovic, Z., Suzumura, T., Ghosh, R., Nambiar, R., Wang, C., Zang, H., Baeza-Yates, R., Hu, Kepner, J., Cuzzocrea, A., Tang, J., Toyoda, M. IEEE. 2017: 4
  • Mining Big Data to Extract Patterns and Predict Real-Life Outcomes. Psychological Methods Kosinski, M., Wang, Y., Lakkaraju, H., Leskovec, J. 2016; 21 (4): 493-506

    Abstract

    This article aims to introduce the reader to essential tools that can be used to obtain insights and build predictive models using large data sets. Recent user proliferation in the digital environment has led to the emergence of large samples containing a wealth of traces of human behaviors, communication, and social interactions. Such samples offer the opportunity to greatly improve our understanding of individuals, groups, and societies, but their analysis presents unique methodological challenges. In this tutorial, we discuss potential sources of such data and explain how to efficiently store them. Then, we introduce two methods that are often employed to extract patterns and reduce the dimensionality of large data sets: singular value decomposition and latent Dirichlet allocation. Finally, we demonstrate how to use dimensions or clusters extracted from data to build predictive models in a cross-validated way. The text is accompanied by examples of R code and a sample data set, allowing the reader to practice the methods discussed here. A companion website (http://dataminingtutorial.com) provides additional learning resources.

    View details for DOI 10.1037/met0000105

    View details for Web of Science ID 000393202300004

    View details for PubMedID 27918179

  • Cultural Shift or Linguistic Drift? Comparing Two Computational Measures of Semantic Change. Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing Hamilton, W. L., Leskovec, J., Jurafsky, D. 2016; 2016: 2116-2121

    Abstract

    Words shift in meaning for many reasons, including cultural factors like new technologies and regular linguistic processes like subjectification. Understanding the evolution of language and culture requires disentangling these underlying causes. Here we show how two different distributional measures can be used to detect two different types of semantic change. The first measure, which has been used in many previous works, analyzes global shifts in a word's distributional semantics; it is sensitive to changes due to regular processes of linguistic drift, such as the semantic generalization of promise ("I promise." → "It promised to be exciting."). The second measure, which we develop here, focuses on local changes to a word's nearest semantic neighbors; it is more sensitive to cultural shifts, such as the change in the meaning of cell ("prison cell" → "cell phone"). Comparing measurements made by these two methods allows researchers to determine whether changes are more cultural or linguistic in nature, a distinction that is essential for work in the digital humanities and historical linguistics.

    View details for PubMedID 28580459

  • SNAP: A General-Purpose Network Analysis and Graph-Mining Library ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY Leskovec, J., Sosic, R. 2016; 8 (1)

    View details for DOI 10.1145/2898361

    View details for Web of Science ID 000385621300001

  • node2vec: Scalable Feature Learning for Networks. KDD : proceedings. International Conference on Knowledge Discovery & Data Mining Grover, A., Leskovec, J. 2016; 2016: 855-864

    Abstract

    Prediction tasks over nodes and edges in networks require careful effort in engineering features used by learning algorithms. Recent research in the broader field of representation learning has led to significant progress in automating prediction by learning the features themselves. However, present feature learning approaches are not expressive enough to capture the diversity of connectivity patterns observed in networks. Here we propose node2vec, an algorithmic framework for learning continuous feature representations for nodes in networks. In node2vec, we learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes. We define a flexible notion of a node's network neighborhood and design a biased random walk procedure, which efficiently explores diverse neighborhoods. Our algorithm generalizes prior work which is based on rigid notions of network neighborhoods, and we argue that the added flexibility in exploring neighborhoods is the key to learning richer representations. We demonstrate the efficacy of node2vec over existing state-of-the-art techniques on multi-label classification and link prediction in several real-world networks from diverse domains. Taken together, our work represents a new way for efficiently learning state-of-the-art task-independent representations in complex networks.

    View details for PubMedID 27853626

    View details for PubMedCentralID PMC5108654
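
    The biased random walk mentioned in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `node2vec_walk` and the dictionary-of-sets graph format are hypothetical, the parameters `p` (return) and `q` (in-out) follow the usual description of the method rather than anything stated in the abstract, and plain weighted sampling stands in for any faster sampling scheme.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Second-order biased random walk on an unweighted graph.

    adj maps each node to the set of its neighbours (hypothetical format).
    p and q are assumed names for the return and in-out parameters.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break  # dead end: stop early
        if len(walk) == 1:
            # First step has no previous node, so sample uniformly.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk
```

    In this sketch a small `q` pushes the walk outward, while a small `p` keeps it near its starting point; the resulting walks would then typically be fed to a skip-gram style model to obtain the node features.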

  • Higher-order organization of complex networks SCIENCE Benson, A. R., Gleich, D. F., Leskovec, J. 2016; 353 (6295): 163-166

    Abstract

    Networks are a fundamental tool for understanding and modeling complex systems in physics, biology, neuroscience, engineering, and social science. Many networks are known to exhibit rich, lower-order connectivity patterns that can be captured at the level of individual nodes and edges. However, higher-order organization of complex networks--at the level of small network subgraphs--remains largely unknown. Here, we develop a generalized framework for clustering networks on the basis of higher-order connectivity patterns. This framework provides mathematical guarantees on the optimality of obtained clusters and scales to networks with billions of edges. The framework reveals higher-order organization in a number of networks, including information propagation units in neuronal networks and hub structure in transportation networks. Results show that networks exhibit rich higher-order organizational structures that are exposed by clustering based on higher-order connectivity patterns.

    View details for DOI 10.1126/science.aad9029

    View details for Web of Science ID 000379208400037

    View details for PubMedID 27387949
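
    One way to make clustering on higher-order connectivity patterns concrete is to reweight each edge by how many instances of a chosen pattern it takes part in, and then cluster the reweighted graph. The sketch below assumes the triangle as the pattern of interest and an undirected graph without self-loops; the function name `triangle_motif_adjacency` and the dictionary-of-sets input are hypothetical, and the abstract itself does not spell out this construction.

```python
def triangle_motif_adjacency(adj):
    """Count, for every edge, the triangles that contain it.

    adj maps each node to the set of its neighbours (undirected, no
    self-loops). Returns a dict {(u, v): count} with u < v, keeping only
    edges that lie in at least one triangle.
    """
    weights = {}
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Each common neighbour of u and v closes one triangle.
                shared = len(adj[u] & adj[v])
                if shared:
                    weights[(u, v)] = shared
    return weights
```

    Edges that belong to no triangle drop out of the weighted graph, so a standard spectral clustering step applied to these weights would tend to avoid cutting through triangles.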

  • Growing Wikipedia Across Languages via Recommendation. Proceedings of the ... International World-Wide Web Conference. International WWW Conference Wulczyn, E., West, R., Zia, L., Leskovec, J. 2016; 2016: 975-985

    Abstract

    The different Wikipedia language editions vary dramatically in how comprehensive they are. As a result, most language editions contain only a small fraction of the sum of information that exists across all Wikipedias. In this paper, we present an approach to filling gaps in article coverage across different Wikipedia editions. Our main contribution is an end-to-end system for recommending articles for creation that exist in one language but are missing in another. The system involves identifying missing articles, ranking the missing articles according to their importance, and recommending important missing articles to editors based on their interests. We empirically validate our models in a controlled experiment involving 12,000 French Wikipedia editors. We find that personalizing recommendations increases editor engagement by a factor of two. Moreover, recommending articles increases their chance of being created by a factor of 3.2. Finally, articles created as a result of our recommendations are of comparable quality to organically created articles. Overall, our system leads to more engaged editors and faster growth of Wikipedia with no effect on its quality.

    View details for PubMedID 27819073

    View details for PubMedCentralID PMC5092237

  • Large-scale Analysis of Counseling Conversations: An Application of Natural Language Processing to Mental Health. Transactions of the Association for Computational Linguistics Althoff, T., Clark, K., Leskovec, J. 2016; 4: 463-476

    Abstract

    Mental illness is one of the most pressing public health issues of our time. While counseling and psychotherapy can be effective treatments, our knowledge about how to conduct successful counseling conversations has been limited due to lack of large-scale data with labeled outcomes of the conversations. In this paper, we present a large-scale, quantitative study on the discourse of text-message-based counseling conversations. We develop a set of novel computational discourse analysis methods to measure how various linguistic aspects of conversations are correlated with conversation outcomes. Applying techniques such as sequence-based conversation models, language model comparisons, message clustering, and psycholinguistics-inspired word frequency analyses, we discover actionable conversation strategies that are associated with better conversation outcomes.

    View details for PubMedID 28344978

  • Information Cartography COMMUNICATIONS OF THE ACM Shahaf, D., Guestrin, C., Horvitz, E., Leskovec, J. 2015; 58 (11): 62-73

    View details for DOI 10.1145/2735624

    View details for Web of Science ID 000363563800024

  • The mobilize center: an NIH big data to knowledge center to advance human movement research and improve mobility. Journal of the American Medical Informatics Association Ku, J. P., Hicks, J. L., Hastie, T., Leskovec, J., Ré, C., Delp, S. L. 2015; 22 (6): 1120-1125

    Abstract

    Regular physical activity helps prevent heart disease, stroke, diabetes, and other chronic diseases, yet a broad range of conditions impair mobility at great personal and societal cost. Vast amounts of data characterizing human movement are available from research labs, clinics, and millions of smartphones and wearable sensors, but integration and analysis of this large quantity of mobility data are extremely challenging. The authors have established the Mobilize Center (http://mobilize.stanford.edu) to harness these data to improve human mobility and help lay the foundation for using data science methods in biomedicine. The Center is organized around 4 data science research cores: biomechanical modeling, statistical learning, behavioral and social modeling, and integrative modeling. Important biomedical applications, such as osteoarthritis and weight management, will focus the development of new data science methods. By developing these new approaches, sharing data and validated software tools, and training thousands of researchers, the Mobilize Center will transform human movement research.

    View details for DOI 10.1093/jamia/ocv071

    View details for PubMedID 26272077

    View details for PubMedCentralID PMC4639715

  • Network Lasso: Clustering and Optimization in Large Graphs. KDD : proceedings. International Conference on Knowledge Discovery & Data Mining Hallac, D., Leskovec, J., Boyd, S. 2015; 2015: 387-396

    Abstract

    Convex optimization is an essential tool for modern data analysis, as it provides a framework to formulate and solve many problems in machine learning and data mining. However, general convex optimization solvers do not scale well, and scalable solvers are often specialized to only work on a narrow class of problems. Therefore, there is a need for simple, scalable algorithms that can solve many common optimization problems. In this paper, we introduce the network lasso, a generalization of the group lasso to a network setting that allows for simultaneous clustering and optimization on graphs. We develop an algorithm based on the Alternating Direction Method of Multipliers (ADMM) to solve this problem in a distributed and scalable manner, which allows for guaranteed global convergence even on large graphs. We also examine a non-convex extension of this approach. We then demonstrate that many types of problems can be expressed in our framework. We focus on three in particular - binary classification, predicting housing prices, and event detection in time series data - comparing the network lasso to baseline approaches and showing that it is both a fast and accurate method of solving large optimization problems.

    View details for PubMedID 27398260

    View details for PubMedCentralID PMC4937836
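
    As a reading aid for the abstract above, the network lasso problem on a graph with node set V and edge set E is usually written in the following form. The symbols here are assumptions for illustration rather than quotations from the abstract: x_i is the variable at node i, f_i is a convex cost at that node, w_{jk} is a nonnegative edge weight, and \lambda \ge 0 is a regularization parameter.

```latex
\begin{equation*}
  \underset{x_i,\; i \in V}{\text{minimize}} \quad
  \sum_{i \in V} f_i(x_i)
  \;+\; \lambda \sum_{(j,k) \in E} w_{jk} \,\lVert x_j - x_k \rVert_2
\end{equation*}
```

    Because the edge penalty is an unsquared norm, neighboring nodes are encouraged to take exactly the same value, which is what allows clustering and optimization to happen at once.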

  • Donor Retention in Online Crowdfunding Communities: A Case Study of DonorsChoose.org. Proceedings of the ... International World-Wide Web Conference. International WWW Conference Althoff, T., Leskovec, J. 2015; 2015: 34-44

    Abstract

    Online crowdfunding platforms like DonorsChoose.org and Kickstarter allow specific projects to get funded by targeted contributions from a large number of people. Critical for the success of crowdfunding communities is recruitment and continued engagement of donors. With donor attrition rates above 70%, a significant challenge for online crowdfunding platforms as well as traditional offline non-profit organizations is the problem of donor retention. We present a large-scale study of millions of donors and donations on DonorsChoose.org, a crowdfunding platform for education projects. Studying an online crowdfunding platform allows for an unprecedented detailed view of how people direct their donations. We explore various factors impacting donor retention which allows us to identify different groups of donors and quantify their propensity to return for subsequent donations. We find that donors are more likely to return if they had a positive interaction with the receiver of the donation. We also show that this includes appropriate and timely recognition of their support as well as detailed communication of their impact. Finally, we discuss how our findings could inform steps to improve donor retention in crowdfunding communities and non-profit organizations.

    View details for PubMedID 27077139

  • Ringo: Interactive Graph Analytics on Big-Memory Machines. Proceedings. ACM-Sigmod International Conference on Management of Data Perez, Y., Sosic, R., Banerjee, A., Puttagunta, R., Raison, M., Shah, P., Leskovec, J. 2015; 2015: 1105-1110

    Abstract

    We present Ringo, a system for analysis of large graphs. Graphs provide a way to represent and analyze systems of interacting objects (people, proteins, webpages) with edges between the objects denoting interactions (friendships, physical interactions, links). Mining graphs provides valuable insights about individual objects as well as the relationships among them. In building Ringo, we take advantage of the fact that machines with large memory and many cores are widely available and also relatively affordable. This allows us to build an easy-to-use interactive high-performance graph analytics system. Graphs also need to be built from input data, which often resides in the form of relational tables. Thus, Ringo provides rich functionality for manipulating raw input data tables into various kinds of graphs. Furthermore, Ringo also provides over 200 graph analytics functions that can then be applied to constructed graphs. We show that a single big-memory machine provides a very attractive platform for performing analytics on all but the largest graphs as it offers excellent performance and ease of use as compared to alternative approaches. With Ringo, we also demonstrate how to integrate graph analytics with an iterative process of trial-and-error data exploration and rapid experimentation, common in data mining workloads.

    View details for PubMedID 27081215

    View details for PubMedCentralID PMC4829061

  • Defining and evaluating network communities based on ground-truth KNOWLEDGE AND INFORMATION SYSTEMS Yang, J., Leskovec, J. 2015; 42 (1): 181-213
  • Tensor Spectral Clustering for Partitioning Higher-order Network Structures. Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining Benson, A. R., Gleich, D. F., Leskovec, J. 2015; 2015: 118-126

    Abstract

    Spectral graph theory-based methods represent an important class of tools for studying the structure of networks. Spectral methods are based on a first-order Markov chain derived from a random walk on the graph and thus they cannot take advantage of important higher-order network substructures such as triangles, cycles, and feed-forward loops. Here we propose a Tensor Spectral Clustering (TSC) algorithm that allows for modeling higher-order network structures in a graph partitioning framework. Our TSC algorithm allows the user to specify which higher-order network structures (cycles, feed-forward loops, etc.) should be preserved by the network clustering. Higher-order network structures of interest are represented using a tensor, which we then partition by developing a multilinear spectral method. Our framework can be applied to discovering layered flows in networks as well as graph anomaly detection, which we illustrate on synthetic networks. In directed networks, a higher-order structure of particular interest is the directed 3-cycle, which captures feedback loops in networks. We demonstrate that our TSC algorithm produces large partitions that cut fewer directed 3-cycles than standard spectral clustering algorithms.

    View details for PubMedID 27812399

    View details for PubMedCentralID PMC5089081

  • Mining Missing Hyperlinks from Human Navigation Traces: A Case Study of Wikipedia. Proceedings of the ... International World-Wide Web Conference. International WWW Conference West, R., Paranjape, A., Leskovec, J. 2015; 2015: 1242-1252

    Abstract

    Hyperlinks are an essential feature of the World Wide Web. They are especially important for online encyclopedias such as Wikipedia: an article can often only be understood in the context of related articles, and hyperlinks make it easy to explore this context. But important links are often missing, and several methods have been proposed to alleviate this problem by learning a linking model based on the structure of the existing links. Here we propose a novel approach to identifying missing links in Wikipedia. We build on the fact that the ultimate purpose of Wikipedia links is to aid navigation. Rather than merely suggesting new links that are in tune with the structure of existing links, our method finds missing links that would immediately enhance Wikipedia's navigability. We leverage data sets of navigation paths collected through a Wikipedia-based human-computation game in which users must find a short path from a start to a target article by only clicking links encountered along the way. We harness human navigational traces to identify a set of candidates for missing links and then rank these candidates. Experiments show that our procedure identifies missing links of high quality.

    View details for PubMedID 26634229

    View details for PubMedCentralID PMC4664478

  • Analyzing Information Seeking and Drug-Safety Alert Response by Health Care Professionals as New Methods for Surveillance. Journal of medical Internet research Callahan, A., Pernek, I., Stiglic, G., Leskovec, J., Strasberg, H. R., Shah, N. H. 2015; 17 (8)

    Abstract

    Background: Patterns in general consumer online search logs have been used to monitor health conditions and to predict health-related activities, but the multiple contexts within which consumers perform online searches make significant associations difficult to interpret. Physician information-seeking behavior has typically been analyzed through survey-based approaches and literature reviews. Activity logs from health care professionals using online medical information resources are thus a valuable yet relatively untapped resource for large-scale medical surveillance. Objective: To analyze health care professionals' information-seeking behavior and assess the feasibility of measuring drug-safety alert response from the usage logs of an online medical information resource. Methods: Using two years (2011-2012) of usage logs from UpToDate, we measured the volume of searches related to medical conditions with significant burden in the United States, as well as the seasonal distribution of those searches. We quantified the relationship between searches and resulting page views. Using a large collection of online mainstream media articles and Web log posts we also characterized the uptake of a Food and Drug Administration (FDA) alert via changes in UpToDate search activity compared with general online media activity related to the subject of the alert. Results: Diseases and symptoms dominate UpToDate searches. Some searches result in page views of only short duration, while others consistently result in longer-than-average page views. The response to an FDA alert for Celexa, characterized by a change in UpToDate search activity, differed considerably from general online media activity. Changes in search activity appeared later and persisted longer in UpToDate logs. The volume of searches and page view durations related to Celexa before the alert also differed from those after the alert. Conclusions: Understanding the information-seeking behavior associated with online evidence sources can offer insight into the information needs of health professionals and enable large-scale medical surveillance. Our Web log mining approach has the potential to monitor responses to FDA alerts at a national level. Our findings can also inform the design and content of evidence-based medical information resources such as UpToDate.

    View details for DOI 10.2196/jmir.4427

    View details for PubMedID 26293444

  • Overlapping Communities Explain Core-Periphery Organization of Networks PROCEEDINGS OF THE IEEE Yang, J., Leskovec, J. 2014; 102 (12): 1892-1902
  • Structure and Overlaps of Ground-Truth Communities in Networks ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY Yang, J., Leskovec, J. 2014; 5 (2)

    View details for DOI 10.1145/2594454

    View details for Web of Science ID 000335576200005

  • Discovering Social Circles in Ego Networks ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA McAuley, J., Leskovec, J. 2014; 8 (1): 73-100

    View details for DOI 10.1145/2556612

    View details for Web of Science ID 000333491900004

  • Modeling Information Propagation with Survival Theory Gomez-Rodriguez, M., Leskovec, J., Schoelkopf, B. 2013
  • Community Detection in Networks with Node Attributes IEEE 13th International Conference on Data Mining (ICDM) Yang, J., McAuley, J., Leskovec, J. IEEE. 2013: 1151–1156
  • Structure and Dynamics of Information Pathways in Online Media Gomez-Rodriguez, M., Leskovec, J., Schoelkopf, B. 2013
  • Nonparametric Multi-group Membership Model for Dynamic Networks Kim, M., Leskovec, J. 2013
  • From Amateurs to Connoisseurs: Modeling the Evolution of User Expertise through Online Reviews McAuley, J., Leskovec, J. 2013
  • Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text McAuley, J., Leskovec, J. 2013
  • NIFTY: A System for Large Scale Information Flow Tracking and Clustering Suen, C., Huang, S., Eksombatchai, C., Sosic, R., Leskovec, J. 2013
  • Steering User Behavior With Badges Anderson, A., Huttenlocher, D., Kleinberg, J., Leskovec, J. 2013
  • Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach Yang, J., Leskovec, J. 2013
  • Information Cartography: Creating Zoomable, Large-Scale Maps of Information Shahaf, D., Yang, J., Suen, C., Jacobs, J., Wang, H., Leskovec, J. 2013
  • Community Detection in Networks with Node Attributes Yang, J., McAuley, J., Leskovec, J. 2013
  • A computational approach to politeness with application to social factors Danescu-Niculescu-Mizil, C., Sudhof, M., Jurafsky, D., Leskovec, J., Potts, C. 2013
  • No Country for Old Members: User lifecycle and linguistic change in online communities Danescu-Niculescu-Mizil, C., West, R., Jurafsky, D., Leskovec, J., Potts, C. 2013
  • What’s in a name? Understanding the Interplay between Titles, Content, and Communities in Social Media Lakkaraju, H., McAuley, J., Leskovec, J. 2013
  • Measurement error in network data: A re-classification SOCIAL NETWORKS Wang, D. J., Shi, X., McFarland, D. A., Leskovec, J. 2012; 34 (4): 396-409
  • Inferring Networks of Diffusion and Influence ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA Gomez-Rodriguez, M., Leskovec, J., Krause, A. 2012; 5 (4)
  • Community-Affiliation Graph Model for Overlapping Network Community Detection 12th IEEE International Conference on Data Mining (ICDM) Yang, J., Leskovec, J. IEEE. 2012: 1170–1175
  • Image Labeling on a Network: Using Social-Network Metadata for Image Classification 12th European Conference on Computer Vision (ECCV) McAuley, J., Leskovec, J. SPRINGER-VERLAG BERLIN. 2012: 828–841
  • Learning to Discover Social Circles in Ego Networks McAuley, J., Leskovec, J. 2012
  • Latent Multi-group Membership Graph Model Kim, M., Leskovec, J. 2012
  • Information Diffusion and External Influence in Networks Myers, S., Zhu, C., Leskovec, J. 2012
  • Learning Attitudes and Attributes from Multi-Aspect Reviews McAuley, J., Leskovec, J., Jurafsky, D. 2012
  • Automatic versus Human Navigation in Information Networks West, R., Leskovec, J. 2012
  • Discovering Value from Community Activity on Focused Question Answering Sites: A Case Study of Stack Overflow Anderson, A., Huttenlocher, D., Kleinberg, J., Leskovec, J. 2012
  • The Life and Death of Online Groups: Predicting Group Growth and Longevity Kairam, S., Wang, D., Leskovec, J. 2012
  • Human Wayfinding in Information Networks West, R., Leskovec, J. 2012
  • Effects of User Similarity in Social Media Anderson, A., Huttenlocher, D., Kleinberg, J., Leskovec, J. 2012
  • Image Labeling on a Network: Using Social-Network Metadata for Image Classification McAuley, J., Leskovec, J. 2012
  • Defining and Evaluating Network Communities based on Ground-truth 12th IEEE International Conference on Data Mining (ICDM) Yang, J., Leskovec, J. IEEE. 2012: 745–754
  • Clash of the Contagions: Cooperation and Competition in Information Diffusion 12th IEEE International Conference on Data Mining (ICDM) Myers, S. A., Leskovec, J. IEEE. 2012: 539–548
  • HADI: Mining Radii of Large Graphs ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA Kang, U., Tsourakakis, C. E., Appel, A. P., Faloutsos, C., Leskovec, J. 2011; 5 (2)
  • Large-Scale Web Data Analysis IEEE INTELLIGENT SYSTEMS Leskovec, J. 2011; 26 (1): 11-11
  • Sentiment Flow Through Hyperlink Networks Miller, M., Sathi, C., Wiesenthal, D., Leskovec, J., Potts, C. 2011
  • Modeling Social Networks with Node Attributes using the Multiplicative Attribute Graph Model Kim, M., Leskovec, J. 2011
  • Dynamics of Bidding in a P2P Lending Service: Effects of Herding and Predicting Loan Success Ceyhan, S., Shi, X., Leskovec, J. 2011
  • The Network Completion Problem: Inferring Missing Nodes and Edges in Networks Kim, M., Leskovec, J. 2011
  • Patterns of Temporal Variation in Online Media Yang, J., Leskovec, J. 2011
  • The Role of Social Networks in Online Shopping: Information Passing, Price of Trust, and Consumer Choice Guo, S., Wang, M., Leskovec, J. 2011
  • Supervised Random Walks: Predicting and Recommending Links in Social Networks Backstrom, L., Leskovec, J. 2011
  • Friendship and Mobility: User Movement In Location-Based Social Networks Cho, E., Myers, S. A., Leskovec, J. 2011
  • Correcting for Missing Data in Information Cascades Sadikov, E., Medina, M., Leskovec, J., Garcia-Molina, H. 2011
  • Kronecker Graphs: An Approach to Modeling Networks JOURNAL OF MACHINE LEARNING RESEARCH Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., Ghahramani, Z. 2010; 11: 985-1042
  • Multiplicative Attribute Graph Model of Real-World Networks 7th Workshop on Algorithms and Models for the Web Graph Kim, M., Leskovec, J. SPRINGER-VERLAG BERLIN. 2010: 62–73
  • Predicting Positive and Negative Links in Online Social Networks Leskovec, J., Huttenlocher, D., Kleinberg, J. 2010
  • Citing for High Impact Shi, X., Leskovec, J., McFarland, D. A. 2010
  • Modeling Information Diffusion in Implicit Networks Yang, J., Leskovec, J. 2010
  • Empirical Comparison of Algorithms for Network Community Detection Leskovec, J., Lang, K., Mahoney, M. 2010
  • On the Convexity of Latent Social Network Inference Myers, S. A., Leskovec, J. 2010
  • Radius Plots for Mining Tera-byte Scale Graphs: Algorithms, Patterns, and Observations Kang, U., Tsourakakis, C., Appel, A., Faloutsos, C., Leskovec, J. 2010
  • Governance in Social Media: A case study of the Wikipedia promotion process Leskovec, J., Huttenlocher, D., Kleinberg, J. 2010
  • Signed Networks in Social Media 28th Annual CHI Conference on Human Factors in Computing Systems Leskovec, J., Huttenlocher, D., Kleinberg, J. ASSOC COMPUTING MACHINERY. 2010: 1361–1370
  • Meme-tracking and the Dynamics of the News Cycle 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Leskovec, J., Backstrom, L., Kleinberg, J. ASSOC COMPUTING MACHINERY. 2009: 497–505
  • Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters Internet Mathematics Leskovec, J., Lang, K., Dasgupta, A., Mahoney, M. 2009; 6 (1): 29-123
  • Modeling blog dynamics Goetz, M., Leskovec, J., McGlohon, M., Faloutsos, C. 2009
  • The Battle of the Water Sensor Networks (BWSN): A Design Challenge for Engineers and Algorithms Leskovec, J., Ostfeld, A., et al. 2009
  • Efficient Sensor Placement Optimization for Securing Large Water Distribution Networks JOURNAL OF WATER RESOURCES PLANNING AND MANAGEMENT-ASCE Krause, A., Leskovec, J., Guestrin, C., VanBriesen, J., Faloutsos, C. 2008; 134 (6): 516-526
  • Mobile Call Graphs: Beyond Power-Law and Lognormal Distributions Seshadri, M., Machiraju, S., Sridharan, A., Bolot, J., Faloutsos, C., Leskovec, J. 2008
  • Planetary-Scale Views on a Large Instant-Messaging Network Leskovec, J., Horvitz, E. 2008
  • Epidemic Thresholds in Real Networks Chakrabarti, D., Wang, Y., Wang, C., Leskovec, J., Faloutsos, C. 2008
  • Statistical Properties of Community Structure in Large Social and Information Networks Leskovec, J., Lang, K., Dasgupta, A., Mahoney, M. 2008
  • Microscopic Evolution of Social Networks Leskovec, J., Backstrom, L., Kumar, R., Tomkins, A. 2008
  • Monitoring Network Evolution using MDL Ferlez, J., Faloutsos, C., Leskovec, J., Mladenic, D., Grobelnik, M. 2008
  • Cost-effective Outbreak Detection in Networks Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J., Glance, N. 2007
  • Web Projections: Learning from Contextual Subgraphs of the Web Leskovec, J., Dumais, S., Horvitz, E. 2007
  • Scalable Modeling of Real Graphs using Kronecker Multiplication Leskovec, J., Faloutsos, C. 2007
  • The Dynamics of Viral Marketing ACM Transactions on the Web (TWEB) Leskovec, J., Adamic, L., Huberman, B. 2007; 1 (1)
  • Graph Evolution: Densification and Shrinking Diameters Leskovec, J., Kleinberg, J., Faloutsos, C. 2007
  • Cascading Behavior in Large Blog Graphs Leskovec, J., McGlohon, M., Faloutsos, C., Glance, N., Hurst, M. 2007
  • Information Survival Threshold in Sensor and P2P Networks Chakrabarti, D., Leskovec, J., Faloutsos, C., Madden, S., Guestrin, C., Faloutsos, M. 2007
  • Sampling from Large Graphs Leskovec, J., Faloutsos, C. 2006
  • Data Association for Topic Intensity Tracking Krause, A., Leskovec, J., Guestrin, C. 2006
  • The Dynamics of Viral Marketing Leskovec, J., Adamic, L., Huberman, B. 2006
  • Patterns of Influence in a Recommendation Network Leskovec, J., Singh, A., Kleinberg, J. 2006
  • Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication 16th European Conference on Machine Learning (ECML)/9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C. SPRINGER-VERLAG BERLIN. 2005: 133–145
  • Semantic Text Features from Small World Graphs Leskovec, J., Shawe-Taylor, J. 2005
  • Impact of Linguistic Analysis on the Semantic Graph Coverage and Learning of Document Extracts Leskovec, J., Milic-Frayling, N., Grobelnik, M. 2005
  • Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations Leskovec, J., Kleinberg, J., Faloutsos, C. 2005
  • Extracting Summary Sentences Based on the Document Semantic Graph Microsoft Research Technical Report MSR-TR-2005-07 Leskovec, J., Milic-Frayling, N., Grobelnik, M. 2005
  • Learning Sub-structures of Document Semantic Graphs for Document Summarization Leskovec, J., Grobelnik, M., Milic-Frayling, N. 2004
  • The Download Estimation task on KDD Cup 2003 SIGKDD Explorations Brank, J., Leskovec, J. 2003
  • Linear Programming boost for Uneven Datasets Leskovec, J., Shawe-Taylor, J. 2003
  • KDD Cup 2003: The Download Estimation task Jozef Stefan Institute Technical Report Brank, J., Leskovec, J. 2003
  • Govorec - sistem za slovensko govorjenje racunalniskih besedil [Govorec: a system for reading computer texts aloud in Slovenian] Information Society Leskovec, J. 2001
  • Detection of Human Bodies using Computer Analysis of a Sequence of Stereo Images Leskovec, J. 1999