Mert Pilanci
Assistant Professor of Electrical Engineering
Bio
Mert Pilanci is an assistant professor of Electrical Engineering at Stanford University. He received his Ph.D. in Electrical Engineering and Computer Science from UC Berkeley in 2016. Prior to joining Stanford, he was an assistant professor of Electrical Engineering and Computer Science at the University of Michigan. In 2017, he was a Math+X postdoctoral fellow working with Emmanuel Candès at Stanford University. His research interests are in large scale machine learning, optimization, and information theory.
Honors & Awards

Early Career Award, U.S. Army Research Office (2021)

International Conference on Acoustics, Speech, & Signal Processing (ICASSP) Best Paper Award, IEEE (2021)

Best Poster Award, Conference on the Mathematical Theory of Deep Neural Networks (2020)

Faculty Research Award, Facebook (2020)

Faculty Research Award, Adobe (2019)

Terman Faculty Fellow, Stanford University (2018)

Math+X Postdoctoral Fellowship, Simons Foundation (2016)

PhD Fellowship, Microsoft Research (2013)

Signal Processing and Communications Applications Conference Best Paper Award, IEEE (2010)
Program Affiliations

Stanford SystemX Alliance
Professional Education

Postdoctoral Fellow, Stanford University (2017)

PhD, University of California, Berkeley, Electrical Engineering and Computer Science (2016)
Current Research and Scholarly Interests
Dr. Pilanci's research interests include neural networks, machine learning, mathematical optimization, information theory and signal processing.
202122 Courses
 Convex Optimization II
CME 364B, EE 364B (Spr)  Introductory Research Seminar in Electrical Engineering
EE 301 (Aut)  Signal Processing for Machine Learning
EE 269 (Aut) 
Independent Studies (4)
 Master's Thesis and Thesis Research
EE 300 (Aut, Win)  Ph.D. Research
CME 400 (Spr)  Special Studies and Reports in Electrical Engineering
EE 391 (Aut, Win, Spr, Sum)  Special Studies or Projects in Electrical Engineering
EE 390 (Win, Spr, Sum)
 Master's Thesis and Thesis Research

Prior Year Courses
202021 Courses
 Convex Optimization II
CME 364B, EE 364B (Spr)  Large Scale Matrix Computation, Optimization and Learning
EE 270 (Win)  Signal Processing for Machine Learning
EE 269 (Aut)
201920 Courses
 Convex Optimization II
CME 364B, EE 364B (Spr)  Large Scale Matrix Computation, Optimization and Learning
EE 270 (Win)  Signal Processing for Machine Learning
EE 269 (Aut)
201819 Courses
 Convex Optimization II
CME 364B, EE 364B (Spr)  Signal Processing for Machine Learning
EE 269 (Win)
 Convex Optimization II
Stanford Advisees

Doctoral Dissertation Reader (AC)
Surin Ahn, Hilal Asi, Tavor Baharav, Karan Chadha, Lucas Fuentes Valenzuela, Beliz Gunel, Soheil Hor, Cem Kalkanli, Qianyun Lu, Bennet MeyersIm, Batu Ozturkler, Khaled Saab, Arda Sahiner, Michael Silvernagel, Joe Vincent 
Orals Chair
Chelsea Sidrane 
Doctoral Dissertation Advisor (AC)
Tolga Ergen, Rajarshi Saha, Yifei Wang 
Master's Program Advisor
Yuhan Zhang 
Doctoral (Program)
Amirhossein Afsharrad, WeiNing Chen, Gary Cheng, Tolga Ergen, Dorsa Fathollahi, Ibrahim Gulluk, Erin Kunz, Aaron Mishkin, Rajarshi Saha, Arda Sahiner, Siddharth Tanwar, Yifei Wang
All Publications

Boost AI Power: Data Augmentation Strategies with Unlabeled Data and Conformal Prediction, a Case in Alternative Herbal Medicine Discrimination with Electronic Nose
IEEE Sensors Journal
2021: 111
View details for DOI 10.1109/JSEN.2021.3102488

Revealing the Structure of Deep Neural Networks via Convex Duality
JMLRJOURNAL MACHINE LEARNING RESEARCH. 2021
View details for Web of Science ID 000683104603002

Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs
JMLRJOURNAL MACHINE LEARNING RESEARCH. 2021
View details for Web of Science ID 000683104603001

Adaptive Newton Sketch: Lineartime Optimization with Quadratic Convergence and Effective Hessian Dimensionality
JMLRJOURNAL MACHINE LEARNING RESEARCH. 2021
View details for Web of Science ID 000683104605087
 Boost AI Power: Data Augmentation Strategies with Unlabelled Data and Conformal Prediction, a Case in Alternative Herbal Medicine Discrimination with Electronic Nose IEEE Sensors Journal 2021
 Convex Geometry and Duality of Overparameterized Neural Networks Convex Geometry and Duality of Overparameterized Neural Networks 2021

WEIGHTED GRADIENT CODING WITH LEVERAGE SCORE SAMPLING
IEEE. 2020: 5215–19
View details for Web of Science ID 000615970405095

Optimal Randomized FirstOrder Methods for LeastSquares Problems
JMLRJOURNAL MACHINE LEARNING RESEARCH. 2020
View details for Web of Science ID 000683178505066
 Global Multiclass Classification from Heterogeneous Local Models IEEE Journal on Selected Areas in Information Theory 2020
 Lower Bounds and a NearOptimal Shrinkage Estimator for Least Squares using Random Projections IEEE Journal on Selected Areas in Information Theory 2020

Separating the Effects of Batch Normalization on CNN Training Speed and Stability Using Classical Adaptive Filter Theory
arXiv:2002.10674.
2020
Abstract
Batch Normalization (BatchNorm) is commonly used in Convolutional Neural Networks (CNNs) to improve training speed and stability. However, there is still limited consensus on why this technique is effective. This paper uses concepts from the traditional adaptive filter domain to provide insight into the dynamics and inner workings of BatchNorm. First, we show that the convolution weight updates have natural modes whose stability and convergence speed are tied to the eigenvalues of the input autocorrelation matrices, which are controlled by BatchNorm through the convolution layers' channelwise structure. Furthermore, our experiments demonstrate that the speed and stability benefits are distinct effects. At low learning rates, it is BatchNorm's amplification of the smallest eigenvalues that improves convergence speed, while at high learning rates, it is BatchNorm's suppression of the largest eigenvalues that ensures stability. Lastly, we prove that in the first training step, when normalization is needed most, BatchNorm satisfies the same optimization as Normalized Least Mean Square (NLMS), while it continues to approximate this condition in subsequent steps. The analyses provided in this paper lay the groundwork for gaining further insight into the operation of modern neural network structures using adaptive filter theory.

Distributed Sketching Methods for Privacy Preserving Regression
arXiv:2002.06538.
2020
Abstract
In this work, we study distributed sketching methods for large scale regression problems. We leverage multiple randomized sketches for reducing the problem dimensions as well as preserving privacy and improving straggler resilience in asynchronous distributed systems. We derive novel approximation guarantees for classical sketching methods and analyze the accuracy of parameter averaging for distributed sketches. We consider random matrices including Gaussian, randomized Hadamard, uniform sampling and leverage score sampling in the distributed setting. Moreover, we propose a hybrid approach combining sampling and fast random projections for better computational efficiency. We illustrate the performance of distributed sketches in a serverless computing platform with large scale experiments.

Distributed Averaging Methods for Randomized Second Order Optimization
arXiv:2002.06540.
2020
Abstract
We consider distributed optimization problems where forming the Hessian is computationally challenging and communication is a significant bottleneck. We develop unbiased parameter averaging methods for randomized second order optimization that employ sampling and sketching of the Hessian. Existing works do not take the bias of the estimators into consideration, which limits their application to massively parallel computation. We provide closedform formulas for regularization parameters and step sizes that provably minimize the bias for sketched Newton directions. We also extend the framework of second order averaging methods to introduce an unbiased distributed optimization framework for heterogeneous computing systems with varying worker resources. Additionally, we demonstrate the implications of our theoretical findings via large scale experiments performed on a serverless computing platform.

Limiting Spectrum of Randomized Hadamard Transform and Optimal Iterative Sketching Methods
International Conference on Machine Learning (ICML), 2020.
2020
Abstract
We provide an exact analysis of the limiting spectrum of matrices randomly projected either with the subsampled randomized Hadamard transform, or truncated Haar matrices. We characterize this limiting distribution through its Stieltjes transform, a classical object in random matrix theory, and compute the first and second inverse moments. We leverage the limiting spectrum and asymptotic freeness of random matrices to obtain an exact analysis of iterative sketching methods for solving least squares problems. Our results also yield optimal stepsizes and convergence rates in terms of simple closedform expressions. Moreover, we show that the convergence rate for Haar and randomized Hadamard matrices are identical, and uniformly improve upon Gaussian random projections. The developed techniques and formulas can be applied to a plethora of randomized algorithms that employ fast randomized Hadamard dimension reduction.
 Weighted Gradient Coding with Leverage Score Sampling IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2020
 Convex Geometry of TwoLayer ReLU Networks: Implicit Autoencoding and Interpretable Models 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020

Optimal Randomized FirstOrder Methods for LeastSquares Problems
arXiv:2002.09488 .
2020
Abstract
We provide an exact analysis of a class of randomized algorithms for solving overdetermined leastsquares problems. We consider firstorder methods, where the gradients are preconditioned by an approximation of the Hessian, based on a subspace embedding of the data matrix. This class of algorithms encompasses several randomized methods among the fastest solvers for leastsquares problems. We focus on two classical embeddings, namely, Gaussian projections and subsampled randomized Hadamard transforms (SRHT). Our key technical innovation is the derivation of the limiting spectral density of SRHT embeddings. Leveraging this novel result, we derive the family of normalized orthogonal polynomials of the SRHT density and we find the optimal preconditioned firstorder method along with its rate of convergence. Our analysis of Gaussian embeddings proceeds similarly, and leverages classical random matrix theory results. In particular, we show that for a given sketch size, SRHT embeddings exhibits a faster rate of convergence than Gaussian embeddings. Then, we propose a new algorithm by optimizing the computational complexity over the choice of the sketching dimension. To our knowledge, our resulting algorithm yields the best known complexity for solving leastsquares problems with no condition number dependence.
arXiv 
Convex Duality of Deep Neural Networks
arXiv:2002.09773.
2020
Abstract
We study regularized deep neural networks and introduce an analytic framework to characterize the structure of the hidden layers. We show that a set of optimal hidden layer weight matrices for a norm regularized deep neural network training problem can be explicitly found as the extreme points of a convex set. For twolayer linear networks, we first formulate a convex dual program and prove that strong duality holds. We then extend our derivations to prove that strong duality also holds for certain deep networks. In particular, for linear deep networks, we show that each optimal layer weight matrix is rankone and aligns with the previous layers when the network output is scalar. We also extend our analysis to the vector outputs and other convex loss functions. More importantly, we show that the same characterization can also be applied to deep ReLU networks with rankone inputs, where we prove that strong duality still holds and optimal layer weight matrices are rankone for scalar output networks. As a corollary, we prove that norm regularized deep ReLU networks yield spline interpolation for onedimensional datasets which was previously known only for twolayer networks. We then verify our theoretical results via several numerical experiments.
arXiv 
Neural Networks are Convex Regularizers: Exact Polynomialtime Convex Optimization Formulations for TwoLayer Networks
International Conference on Machine Learning (ICML), 2020.
2020
Abstract
We develop exact representations of two layer neural networks with rectified linear units in terms of a single convex program with number of variables polynomial in the number of training samples and number of hidden neurons. Our theory utilizes semiinfinite duality and minimum norm regularization. Moreover, we show that certain standard multilayer convolutional neural networks are equivalent to L1 regularized linear models in a polynomial sized discrete Fourier feature space. We also introduce exact semidefinite programming representations of convolutional and fully connected linear multilayer networks which are polynomial size in both the sample size and dimension.
arXiv 
Separating the Effects of Batch Normalization on CNN Training Speed and Stability Using Classical Adaptive Filter Theory
IEEE. 2020: 12141221
View details for DOI 10.1109/IEEECONF51394.2020.9443275
View details for Web of Science ID 000681731800234

Convex Geometry and Duality of Overparameterized Neural Networks
arXiv:2002.11219.
2020
Abstract
We develop a convex analytic framework for ReLU neural networks which elucidates the inner workings of hidden neurons and their function space characteristics. We show that neural networks with rectified linear units act as convex regularizers, where simple solutions are encouraged via extreme points of a certain convex set. For one dimensional regression and classification, as well as rankone data matrices, we prove that finite twolayer ReLU networks with norm regularization yield linear spline interpolation. We characterize the classification decision regions in terms of a closed form kernel matrix and minimum L1 norm solutions. This is in contrast to Neural Tangent Kernel which is unable to explain neural network predictions with finitely many neurons. Our convex geometric description also provides intuitive explanations of hidden neurons as autoencoders. In higher dimensions, we show that the training problem for twolayer networks can be cast as a convex optimization problem with infinitely many constraints. We then provide a family of convex relaxations to approximate the solution, and a cuttingplane algorithm to improve the relaxations. We derive conditions for the exactness of the relaxations and provide simple closed form formulas for the optimal neural network weights in certain cases. We also establish a connection to  equivalence for neural networks analogous to the minimal cardinality solutions in compressed sensing. Extensive experimental results show that the proposed approach yields interpretable and accurate models.
arXiv 
Convex Geometry of TwoLayer ReLU Networks: Implicit Autoencoding and Interpretable Models
ADDISONWESLEY PUBL CO. 2020: 4024–32
View details for Web of Science ID 000559931300084

HighDimensional Optimization in Adaptive Random Subspaces
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2019
View details for Web of Science ID 000535866902047

Straggler Resilient Serverless Computing Based on Polar Codes
IEEE. 2019: 276–83
View details for Web of Science ID 000535355700039

Faster Least Squares Optimization
arXiv:1911.02675.
2019
Abstract
We investigate randomized methods for solving overdetermined linear leastsquares problems, where the Hessian is approximated based on a random projection of the data matrix. We consider a random subspace embedding which is either drawn at the beginning of the algorithm and fixed throughout, or, refreshed at each iteration. For a broad class of random matrices, we provide an exact finitetime analysis of the refreshed embeddings method, an exact asymptotic analysis of the fixed embedding method, as well as a nonasymptotic analysis, with and without momentum acceleration. Surprisingly, we show that, for Gaussian matrices, the refreshed sketching method with no momentum yields the same convergence rate as the fixed embedding method with momentum. Furthermore, we prove that momentum does not accelerate the refreshed embeddings method. Thus, picking the accelerated, fixed embedding method as the algorithm of choice among the methods we consider, we propose a new algorithm by optimizing the computational complexity over the choice of the sketching dimension. Our resulting algorithm yields a smaller complexity compared to current stateoftheart randomized preconditioning methods. In particular, as the sample size grows, the resulting complexity becomes sublinear in the problem dimensions. We validate numerically our guarantees on large sample datasets, both for Gaussian and SRHT embeddings.
 Straggler Resilient Serverless Computing Based on Polar Codes 57th Annual Allerton Conference on Communication, Control, and Computing 2019
 Distributed BlackBox Optimization via Error Correcting Codes 57th Annual Allerton Conference on Communication, Control, and Computing 2019
 HighDimensional Optimization in Adaptive Random Subspaces Neural Information Processing Systems (NeurIPS) 2019

Fast and Robust Solution Techniques for Large Scale Linear System of Equations
IEEE. 2019
View details for Web of Science ID 000518994300160

Distributed BlackBox Optimization via Error Correcting Codes
IEEE. 2019: 246–52
View details for Web of Science ID 000535355700035

Convex Optimization for Shallow Neural Networks
IEEE. 2019: 79–83
View details for Web of Science ID 000535355700013

CONVEX RELAXATIONS OF CONVOLUTIONAL NEURAL NETS
IEEE. 2019: 4928–32
View details for Web of Science ID 000482554005033

ITERATIVE HESSIAN SKETCH WITH MOMENTUM
IEEE. 2019: 7470–74
View details for Web of Science ID 000482554007141

NEWTON SKETCH: A NEAR LINEARTIME OPTIMIZATION ALGORITHM WITH LINEARQUADRATIC CONVERGENCE
SIAM JOURNAL ON OPTIMIZATION
2017; 27 (1): 205–45
View details for DOI 10.1137/15M1021106
View details for Web of Science ID 000404178500010
 Randomized sketches for kernels: Fast and optimal nonparametric regression Annals of Statistics 2017
 Iterative Hessian sketch: Fast and accurate solution approximation for constrained leastsquares Journal of Machine Learning Research (JMLR) 2016
 Sparse learning via Boolean relaxations Mathematical Programming 2015
 Randomized Sketches of Convex Programs With Sharp Guarantees IEEE Transactions on Information Theory 2015
 Structured least squares problems and robust estimators IEEE Transactions on Signal Processing 2010