Mert Pilanci is an assistant professor of Electrical Engineering at Stanford University. He received his Ph.D. in Electrical Engineering and Computer Science from UC Berkeley in 2016. In 2017, he was a Math+X postdoctoral fellow at Stanford University, working with Emmanuel Candès. Prior to returning to Stanford, he was an assistant professor of Electrical Engineering and Computer Science at the University of Michigan. His research interests are in large-scale machine learning, optimization, and information theory.

Program Affiliations

  • Stanford SystemX Alliance

Professional Education

  • PhD, University of California, Berkeley, Electrical Engineering and Computer Science (2016)

All Publications

  • Convex Geometry of Two-Layer ReLU Networks: Implicit Autoencoding and Interpretable Models Ergen, T., Pilanci, M. 23rd International Conference on Artificial Intelligence and Statistics (AISTATS). 2020: 4024–32
  • Separating the Effects of Batch Normalization on CNN Training Speed and Stability Using Classical Adaptive Filter Theory Chai, E., Pilanci, M., Murmann, B. arXiv:2002.10674. 2020


    Batch Normalization (BatchNorm) is commonly used in Convolutional Neural Networks (CNNs) to improve training speed and stability. However, there is still limited consensus on why this technique is effective. This paper uses concepts from the traditional adaptive filter domain to provide insight into the dynamics and inner workings of BatchNorm. First, we show that the convolution weight updates have natural modes whose stability and convergence speed are tied to the eigenvalues of the input autocorrelation matrices, which are controlled by BatchNorm through the convolution layers' channel-wise structure. Furthermore, our experiments demonstrate that the speed and stability benefits are distinct effects. At low learning rates, it is BatchNorm's amplification of the smallest eigenvalues that improves convergence speed, while at high learning rates, it is BatchNorm's suppression of the largest eigenvalues that ensures stability. Lastly, we prove that in the first training step, when normalization is needed most, BatchNorm satisfies the same optimization condition as the Normalized Least Mean Squares (NLMS) algorithm, while it continues to approximate this condition in subsequent steps. The analyses provided in this paper lay the groundwork for gaining further insight into the operation of modern neural network structures using adaptive filter theory.
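    The NLMS connection above can be made concrete with the classical filter update itself, which normalizes each step by the instantaneous input power. A minimal NumPy sketch (the filter length, step size, and signal model are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Identify an unknown 4-tap filter from noiseless input/output pairs.
w_true = np.array([0.5, -0.3, 0.2, 0.1])
w = np.zeros(4)
mu, eps = 0.5, 1e-8  # step size and small regularizer

for _ in range(2000):
    x = rng.standard_normal(4)      # input snapshot
    d = x @ w_true                  # desired response
    e = d - x @ w                   # a-priori error
    # NLMS: normalize the LMS step by the instantaneous input power,
    # making the convergence speed insensitive to the input scale.
    w += mu * e * x / (eps + x @ x)
```

    Dividing by `x @ x` is exactly the eigenvalue-equalizing effect discussed above: without it, the plain LMS update converges at a rate set by the spread of the input autocorrelation eigenvalues.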

  • Distributed Sketching Methods for Privacy Preserving Regression Bartan, B., Pilanci, M. arXiv:2002.06538. 2020


    In this work, we study distributed sketching methods for large scale regression problems. We leverage multiple randomized sketches for reducing the problem dimensions as well as preserving privacy and improving straggler resilience in asynchronous distributed systems. We derive novel approximation guarantees for classical sketching methods and analyze the accuracy of parameter averaging for distributed sketches. We consider random matrices including Gaussian, randomized Hadamard, uniform sampling and leverage score sampling in the distributed setting. Moreover, we propose a hybrid approach combining sampling and fast random projections for better computational efficiency. We illustrate the performance of distributed sketches in a serverless computing platform with large scale experiments.
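    The parameter-averaging scheme described above can be sketched in a few lines: each (simulated) worker solves a small Gaussian-sketched least-squares problem, and the coordinator averages the solutions. All dimensions, the sketch type, and the noise level below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Overdetermined least-squares problem: min_x ||A x - b||^2.
n, d, workers, m = 2000, 10, 8, 400
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(n)

# Each worker solves a sketched subproblem min_x ||S_k A x - S_k b||^2
# with its own independent Gaussian sketch S_k; the coordinator then
# averages the parameter estimates.
estimates = []
for _ in range(workers):
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    estimates.append(np.linalg.lstsq(S @ A, S @ b, rcond=None)[0])
x_avg = np.mean(estimates, axis=0)
```

    Because each worker only sees `S_k A` and `S_k b` rather than the raw data, the same construction underlies the privacy and straggler-resilience arguments: any subset of worker outputs can be averaged.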

  • Distributed Averaging Methods for Randomized Second Order Optimization Bartan, B., Pilanci, M. arXiv:2002.06540. 2020


    We consider distributed optimization problems where forming the Hessian is computationally challenging and communication is a significant bottleneck. We develop unbiased parameter averaging methods for randomized second order optimization that employ sampling and sketching of the Hessian. Existing works do not take the bias of the estimators into consideration, which limits their application to massively parallel computation. We provide closed-form formulas for regularization parameters and step sizes that provably minimize the bias for sketched Newton directions. We also extend the framework of second order averaging methods to introduce an unbiased distributed optimization framework for heterogeneous computing systems with varying worker resources. Additionally, we demonstrate the implications of our theoretical findings via large scale experiments performed on a serverless computing platform.
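    As an illustration of the sketched-Newton idea (though not the paper's bias-corrected averaging), here is a single-worker sketch for ridge-regularized least squares: the Hessian is approximated through a fresh Gaussian sketch at each step while the gradient stays exact. Dimensions and the regularization value are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Regularized least squares: min_x ||A x - b||^2 / 2 + lam ||x||^2 / 2.
n, d, m, lam = 5000, 20, 400, 1e-3
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

x = np.zeros(d)
for _ in range(20):
    # A worker would form the sketched Hessian (S A)^T (S A) from its own
    # sketch S; here one Gaussian sketch stands in for a single worker.
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    SA = S @ A
    H = SA.T @ SA + lam * np.eye(d)      # sketched Hessian
    g = A.T @ (A @ x - b) + lam * x      # exact gradient
    x -= np.linalg.solve(H, g)           # approximate Newton step
```

    In a multi-worker run, each worker would return its own Newton direction and the coordinator would average them; the paper's contribution is choosing the regularization and step size so that this average is unbiased.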

  • Limiting Spectrum of Randomized Hadamard Transform and Optimal Iterative Sketching Methods Lacotte, J., Liu, S., Dobriban, E., Pilanci, M. International Conference on Machine Learning (ICML). 2020


    We provide an exact analysis of the limiting spectrum of matrices randomly projected either with the subsampled randomized Hadamard transform, or truncated Haar matrices. We characterize this limiting distribution through its Stieltjes transform, a classical object in random matrix theory, and compute the first and second inverse moments. We leverage the limiting spectrum and asymptotic freeness of random matrices to obtain an exact analysis of iterative sketching methods for solving least squares problems. Our results also yield optimal step-sizes and convergence rates in terms of simple closed-form expressions. Moreover, we show that the convergence rates for Haar and randomized Hadamard matrices are identical, and uniformly improve upon Gaussian random projections. The developed techniques and formulas can be applied to a plethora of randomized algorithms that employ fast randomized Hadamard dimension reduction.
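    For reference, the subsampled randomized Hadamard transform itself can be implemented in a few lines with an in-place fast Walsh-Hadamard transform. This sketch (dimensions illustrative) assumes the row count is a power of two:

```python
import numpy as np

rng = np.random.default_rng(3)

def srht(A, m):
    """Subsampled randomized Hadamard transform sketch of A (the row
    count n must be a power of two): S A = sqrt(n/m) * P H D A, with D
    a random sign flip, H the orthonormal Walsh-Hadamard transform, and
    P a uniform row subsample."""
    n = A.shape[0]
    X = A * rng.choice([-1.0, 1.0], size=(n, 1))   # D: random signs
    h = 1
    while h < n:                                   # in-place fast WHT
        for i in range(0, n, 2 * h):
            a, c = X[i:i + h].copy(), X[i + h:i + 2 * h].copy()
            X[i:i + h], X[i + h:i + 2 * h] = a + c, a - c
        h *= 2
    X /= np.sqrt(n)                                # make H orthonormal
    rows = rng.choice(n, size=m, replace=False)    # P: subsample rows
    return np.sqrt(n / m) * X[rows]

A = rng.standard_normal((1024, 8))
SA = srht(A, 128)
```

    The fast transform costs O(nd log n) versus O(mnd) for a dense Gaussian sketch, which is why SRHT-type embeddings appear throughout the iterative sketching methods analyzed above.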

  • Weighted Gradient Coding with Leverage Score Sampling Charalambides, N., Pilanci, M., Hero, A. O. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2020
  • Optimal Randomized First-Order Methods for Least-Squares Problems Lacotte, J., Pilanci, M. arXiv:2002.09488. 2020


    We provide an exact analysis of a class of randomized algorithms for solving overdetermined least-squares problems. We consider first-order methods, where the gradients are pre-conditioned by an approximation of the Hessian, based on a subspace embedding of the data matrix. This class of algorithms encompasses several randomized methods among the fastest solvers for least-squares problems. We focus on two classical embeddings, namely, Gaussian projections and subsampled randomized Hadamard transforms (SRHT). Our key technical innovation is the derivation of the limiting spectral density of SRHT embeddings. Leveraging this novel result, we derive the family of normalized orthogonal polynomials of the SRHT density and we find the optimal pre-conditioned first-order method along with its rate of convergence. Our analysis of Gaussian embeddings proceeds similarly, and leverages classical random matrix theory results. In particular, we show that for a given sketch size, SRHT embeddings exhibit a faster rate of convergence than Gaussian embeddings. Then, we propose a new algorithm by optimizing the computational complexity over the choice of the sketching dimension. To our knowledge, our resulting algorithm yields the best known complexity for solving least-squares problems with no condition number dependence.
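    The pre-conditioned first-order template analyzed above can be illustrated as follows: a sketch of the data matrix yields an R factor whose inverse approximately orthogonalizes A, and gradient descent is run in that metric. This is a plain Gaussian-embedding version with unit step size; the paper's optimal polynomial-based iterations and step sizes are not reproduced here, and all dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

# min_x ||A x - b||^2 with a sketch-based pre-conditioner.
n, d, m = 4000, 50, 1000
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# Build the pre-conditioner from a Gaussian embedding: if S A = Q R,
# then A R^{-1} has near-orthonormal columns, so gradient steps in the
# transformed variable are well conditioned.
S = rng.standard_normal((m, n)) / np.sqrt(m)
_, R = np.linalg.qr(S @ A)

x = np.zeros(d)
for _ in range(30):
    g = A.T @ (A @ x - b)                             # exact gradient
    x -= np.linalg.solve(R, np.linalg.solve(R.T, g))  # (R^T R)^{-1} g
```

    Since (R^T R)^{-1} approximates (A^T A)^{-1}, the effective condition number of the iteration depends only on the embedding quality, not on A itself, which is the sense in which such solvers have no condition number dependence.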

  • Convex Duality of Deep Neural Networks Ergen, T., Pilanci, M. arXiv:2002.09773. 2020


    We study regularized deep neural networks and introduce an analytic framework to characterize the structure of the hidden layers. We show that a set of optimal hidden layer weight matrices for a norm regularized deep neural network training problem can be explicitly found as the extreme points of a convex set. For two-layer linear networks, we first formulate a convex dual program and prove that strong duality holds. We then extend our derivations to prove that strong duality also holds for certain deep networks. In particular, for linear deep networks, we show that each optimal layer weight matrix is rank-one and aligns with the previous layers when the network output is scalar. We also extend our analysis to vector outputs and other convex loss functions. More importantly, we show that the same characterization can also be applied to deep ReLU networks with rank-one inputs, where we prove that strong duality still holds and optimal layer weight matrices are rank-one for scalar output networks. As a corollary, we prove that norm regularized deep ReLU networks yield spline interpolation for one-dimensional datasets, which was previously known only for two-layer networks. We then verify our theoretical results via several numerical experiments.

  • High-Dimensional Optimization in Adaptive Random Subspaces Lacotte, J., Pilanci, M., Pavone, M. Neural Information Processing Systems (NeurIPS). 2019
  • Convex Optimization for Shallow Neural Networks Ergen, T., Pilanci, M. IEEE. 2019: 79–83
  • Distributed Black-Box Optimization via Error Correcting Codes Bartan, B., Pilanci, M. 57th Annual Allerton Conference on Communication, Control, and Computing (IEEE). 2019: 246–52
  • Straggler Resilient Serverless Computing Based on Polar Codes Bartan, B., Pilanci, M. 57th Annual Allerton Conference on Communication, Control, and Computing (IEEE). 2019: 276–83
  • Faster Least Squares Optimization Lacotte, J., Pilanci, M. arXiv:1911.02675. 2019


    We investigate randomized methods for solving overdetermined linear least-squares problems, where the Hessian is approximated based on a random projection of the data matrix. We consider a random subspace embedding which is either drawn at the beginning of the algorithm and fixed throughout, or refreshed at each iteration. For a broad class of random matrices, we provide an exact finite-time analysis of the refreshed embeddings method, an exact asymptotic analysis of the fixed embedding method, as well as a non-asymptotic analysis, with and without momentum acceleration. Surprisingly, we show that, for Gaussian matrices, the refreshed sketching method with no momentum yields the same convergence rate as the fixed embedding method with momentum. Furthermore, we prove that momentum does not accelerate the refreshed embeddings method. Thus, picking the accelerated, fixed embedding method as the algorithm of choice among the methods we consider, we propose a new algorithm by optimizing the computational complexity over the choice of the sketching dimension. Our resulting algorithm yields a smaller complexity compared to current state-of-the-art randomized pre-conditioning methods. In particular, as the sample size grows, the resulting complexity becomes sub-linear in the problem dimensions. We validate our guarantees numerically on large sample datasets, both for Gaussian and SRHT embeddings.
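    The fixed-embedding variant with heavy-ball momentum, identified above as the method of choice, can be sketched as follows. The step size and momentum constants below are illustrative, not the paper's optimized formulas, and all dimensions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

n, d, m = 3000, 30, 600
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# One Gaussian embedding, drawn once and fixed throughout.
S = rng.standard_normal((m, n)) / np.sqrt(m)
SA = S @ A
H = SA.T @ SA                 # fixed Hessian approximation (S A)^T (S A)

a, beta = 0.8, 0.1            # illustrative step size and momentum
x = x_prev = np.zeros(d)
for _ in range(50):
    g = A.T @ (A @ x - b)     # exact gradient
    # Heavy-ball step: pre-conditioned gradient plus momentum term.
    x, x_prev = x - a * np.linalg.solve(H, g) + beta * (x - x_prev), x
```

    Because H is fixed, it could be factored once outside the loop; the per-iteration cost is then dominated by the two matrix-vector products with A, which is what makes the overall complexity sub-linear in the large-sample regime.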

  • Fast and Robust Solution Techniques for Large Scale Linear System of Equations Ozaslan, I. K., Pilanci, M., Arikan, O. IEEE. 2019
  • Iterative Hessian Sketch with Momentum Ozaslan, I., Pilanci, M., Arikan, O. IEEE. 2019: 7470–74
