
Silvio Savarese
Associate Professor of Computer Science
Bio
Silvio Savarese is an Associate Professor of Computer Science at Stanford University and the inaugural Mindtree Faculty Scholar. He earned his Ph.D. in Electrical Engineering from the California Institute of Technology in 2005 and was a Beckman Institute Fellow at the University of Illinois at Urbana-Champaign from 2005 to 2008. He joined Stanford in 2013 after serving as Assistant and then Associate Professor of Electrical and Computer Engineering at the University of Michigan, Ann Arbor, from 2008 to 2013. From 2016 to 2018, he served as a director of the SAIL-Toyota Center for AI Research at Stanford. In 2017 he co-founded an AI-to-business startup, where he built and directed a large R&D team as Chief Scientist until 2020.
Dr. Savarese addresses theoretical foundations and practical applications of computational vision and robotic perception. His research focuses on developing algorithms that enable autonomous and embodied systems to understand and interact with the environment. Contributions include: i) investigation of methods for interpreting complex situations and behaviors from sensory streams; ii) development of computational models for capturing social norms and common-sense rules, allowing agents to effectively predict and respond to the environment; iii) exploration of machine vision methodologies for enabling automatic performance analysis and sustainability assessment in construction engineering.
Dr. Savarese has published more than 200 scientific articles in top-tier journals and conferences, including IJCV, IEEE-PAMI, CVPR, ICCV, NIPS, ECCV, ICRA, IROS, and RSS. He was program chair of the Conference on Computer Vision and Pattern Recognition (CVPR) in 2020 and general chair of the 4th International Conference on 3D Vision (3DV) in 2016. He served as area chair of CVPR 2010, ICCV 2011, CVPR 2013, ECCV 2014, CVPR 2015, ICCV 2015, ECCV 2016, and ICCV 2017, and as an associate editor of IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) from 2016 to 2019.
Dr. Savarese has received several awards, including a Best Paper Award at the IEEE International Conference on Robotics and Automation (ICRA) in 2019, a Best Paper Award at the Conference on Computer Vision and Pattern Recognition (CVPR) in 2018, a Best Student Paper Award at CVPR 2016, the James R. Croes Medal in 2013, a TRW Automotive Endowed Research Award in 2012, an NSF Career Award in 2011, and Google Research Awards in 2010 and 2020. In 2002 he was awarded the Walker von Brimer Award for outstanding research initiative. He has been a keynote speaker at various academic conferences, and his work has been featured in a variety of media outlets, magazines, and domestic and international newspapers, including The New York Times, CBS, PBS, Financial Times, Quartz, ABC, BBC, Corriere Della Sera, and La Repubblica.
Academic Appointments
-
Associate Professor, Computer Science
-
Faculty Affiliate, Institute for Human-Centered Artificial Intelligence (HAI)
-
Member, Wu Tsai Neurosciences Institute
Administrative Appointments
-
Co-founder and Chief Scientist in AI startup, AI-to-Business Startup (2018 - 2020)
-
Director of the SAIL-Toyota Center for AI Research, Stanford (2016 - 2018)
Honors & Awards
-
Award, Google Research Award (2020, 2010)
-
Best Paper Award, IEEE International Conference on Robotics and Automation (ICRA) (2019)
-
Best Paper Award, Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
-
Best Student Paper Award, Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
-
Award, James R. Croes Medal (2013)
-
Award, TRW Automotive Endowed Research Award (2012)
-
Career Award, NSF (2011)
-
Award, Walker von Brimer Award for Outstanding Research Initiative (2002)
Program Affiliations
-
Stanford SystemX Alliance
Professional Education
-
PhD, California Institute of Technology, Electrical Engineering (2005)
-
MS, California Institute of Technology, Electrical Engineering
-
Laurea Degree, Università degli Studi di Napoli, Federico II, Electrical Engineering
2020-21 Courses
- Computer Vision: From 3D Reconstruction to Recognition
CS 231A (Win)
- Representation Learning in Computer Vision
CS 331B (Spr)
Independent Studies (15)
- Advanced Reading and Research
CS 499 (Aut, Win, Spr, Sum)
- Advanced Reading and Research
CS 499P (Aut, Win, Sum)
- Curricular Practical Training
CS 390A (Aut, Win, Sum)
- Curricular Practical Training
CS 390B (Aut, Win, Spr, Sum)
- Curricular Practical Training
CS 390C (Aut, Win, Spr, Sum)
- Directed Investigation
BIOE 392 (Sum)
- Directed Study
BIOE 391 (Sum)
- Independent Project
CS 399 (Aut, Win, Spr, Sum)
- Independent Project
CS 399P (Aut, Win, Spr, Sum)
- Independent Work
CS 199 (Aut, Win, Spr)
- Independent Work
CS 199P (Aut, Win, Spr)
- Part-time Curricular Practical Training
CS 390D (Aut, Win)
- Senior Project
CS 191 (Aut, Win, Spr, Sum)
- Supervised Undergraduate Research
CS 195 (Aut)
- Writing Intensive Senior Project (WIM)
CS 191W (Aut, Win, Spr)
Prior Year Courses
2017-18 Courses
- Computer Vision: From 3D Reconstruction to Recognition
CS 231A (Win)
- Representation Learning in Computer Vision
CS 331B (Aut)
Stanford Advisees
-
Doctoral Dissertation Reader (AC)
Panos Achlioptas, Damian Mrowca, Liyue Shen, He Wang
-
Postdoctoral Faculty Sponsor
Roberto Martín Martín, Claudia Perez D'Arpino
-
Doctoral Dissertation Advisor (AC)
Kevin Chen, Kuan Fang, Jingwei Ji, Rachel Luo, Ajay Mandlekar, Lyne Tchapmi P., Fei Xia
-
Master's Program Advisor
Krithika Iyer, Siddharth Kapoor, William Shen
-
Doctoral Dissertation Co-Advisor (AC)
Suraj Nair
-
Doctoral (Program)
JunYoung Gwak, Andrey Kurenkov, William Shen, Trevor Standley, Danfei Xu
All Publications
-
Watch-n-Patch: Unsupervised Learning of Actions and Relations
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
2018; 40 (2): 467–81
Abstract
There is a large variation in the activities that humans perform in their everyday lives. We consider modeling these composite human activities, which comprise multiple basic-level actions, in a completely unsupervised setting. Our model learns high-level co-occurrence and temporal relations between the actions. We consider the video as a sequence of short-term action clips, which contain human-words and object-words. An activity is described by a set of action-topics and object-topics indicating which actions are present and which objects are being interacted with. We then propose a new probabilistic model relating the words and the topics. It allows us to model long-range action relations that commonly exist in composite activities, which was challenging in previous work. We apply our model to unsupervised action segmentation and clustering, and to a novel application that detects forgotten actions, which we call action patching. For evaluation, we contribute a new challenging RGB-D activity video dataset recorded by the new Kinect v2, which contains several human daily activities as compositions of multiple actions interacting with different objects. Moreover, we develop a robotic system that watches and reminds people using our action patching algorithm. Our robotic setup can be easily deployed on any assistive robot.
View details for DOI 10.1109/TPAMI.2017.2679054
View details for Web of Science ID 000422706000015
View details for PubMedID 28287959
-
Lattice Long Short-Term Memory for Human Action Recognition
IEEE. 2017: 2166–75
View details for DOI 10.1109/ICCV.2017.236
View details for Web of Science ID 000425498402024
-
Adversarially Robust Policy Learning: Active Construction of Physically-Plausible Perturbations
IEEE. 2017: 3932–39
View details for Web of Science ID 000426978203126
-
Deep View Morphing
IEEE. 2017: 7092–7100
View details for DOI 10.1109/CVPR.2017.750
View details for Web of Science ID 000418371407021
-
Feedback Networks
IEEE. 2017: 1808–17
View details for DOI 10.1109/CVPR.2017.196
View details for Web of Science ID 000418371401091
-
Tracking The Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies
IEEE. 2017: 300–311
View details for DOI 10.1109/ICCV.2017.41
View details for Web of Science ID 000425498400032
-
Robust real-time tracking combining 3D shape, color, and motion
INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH
2016; 35 (1-3): 30-49
View details for DOI 10.1177/0278364915593399
View details for Web of Science ID 000368032600003
-
Automatic Extrinsic Calibration of Vision and Lidar by Maximizing Mutual Information
JOURNAL OF FIELD ROBOTICS
2015; 32 (5): 696-722
View details for DOI 10.1002/rob.21542
View details for Web of Science ID 000358016200005
-
Indoor Scene Understanding with Geometric and Semantic Contexts
INTERNATIONAL JOURNAL OF COMPUTER VISION
2015; 112 (2): 204-220
View details for DOI 10.1007/s11263-014-0779-4
View details for Web of Science ID 000351518500006
-
Relating Things and Stuff via Object Property Interactions
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
2014; 36 (7): 1370-1383
Abstract
In the last few years, substantially different approaches have been adopted for segmenting and detecting "things" (object categories that have a well-defined shape such as people and cars) and "stuff" (object categories which have an amorphous spatial extent such as grass and sky). While things have been typically detected by sliding window or Hough transform based methods, detection of stuff is generally formulated as a pixel or segment-wise classification problem. This paper proposes a framework for scene understanding that models both things and stuff using a common representation while preserving their distinct nature by using a property list. This representation allows us to enforce sophisticated geometric and semantic relationships between thing and stuff categories via property interactions in a single graphical model. We use the latest advances made in the field of discrete optimization to efficiently perform maximum a posteriori (MAP) inference in this model. We evaluate our method on the Stanford dataset by comparing it against state-of-the-art methods for object segmentation and detection. We also show that our method achieves competitive performance on the challenging PASCAL '09 segmentation dataset.
View details for DOI 10.1109/TPAMI.2013.193
View details for Web of Science ID 000338209900007
View details for PubMedID 26353309
-
Understanding Collective Activities of People from Videos
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
2014; 36 (6): 1242-1257
View details for DOI 10.1109/TPAMI.2013.220
View details for Web of Science ID 000337124200015
-
A Bayesian generative model for learning semantic hierarchies
FRONTIERS IN PSYCHOLOGY
2014; 5
Abstract
Building fine-grained visual recognition systems that are capable of recognizing tens of thousands of categories has received much attention in recent years. The well-known semantic hierarchical structure of categories and concepts has been shown to provide a key prior which allows for optimal predictions. The hierarchical organization of various domains and concepts has been subject to extensive research, and led to the development of the WordNet domains hierarchy (Fellbaum, 1998), which was also used to organize the images in the ImageNet (Deng et al., 2009) dataset, in which the category count approaches the human capacity. Still, for the human visual system, the form of the hierarchy must be discovered with minimal use of supervision or innate knowledge. In this work, we propose a new Bayesian generative model for learning such domain hierarchies, based on semantic input. Our model is motivated by the super-subordinate organization of domain labels and concepts that characterizes WordNet, and accounts for several important challenges: maintaining context information when progressing deeper into the hierarchy, learning a coherent semantic concept for each node, and modeling uncertainty in the perception process.
View details for DOI 10.3389/fpsyg.2014.00417
View details for Web of Science ID 000336085600001
View details for PubMedID 24904452
View details for PubMedCentralID PMC4033064
- Beyond PASCAL: A Benchmark for 3D Object Detection in the Wild 2014
-
Monocular Multiview Object Tracking with 3D Aspect Parts
13th European Conference on Computer Vision (ECCV)
SPRINGER-VERLAG BERLIN. 2014: 220–235
View details for Web of Science ID 000345300000015
-
A Hierarchical Representation for Future Action Prediction
13th European Conference on Computer Vision (ECCV)
SPRINGER INT PUBLISHING AG. 2014: 689–704
View details for Web of Science ID 000345527000045
- Layout Estimation of Highly Cluttered Indoor Scenes using Geometric and Semantic Cues 2013
-
Find the Best Path: an Efficient and Accurate Classifier for Image Hierarchies
IEEE International Conference on Computer Vision (ICCV)
IEEE. 2013: 265–272
View details for DOI 10.1109/ICCV.2013.40
View details for Web of Science ID 000351830500034
-
3D Scene Understanding by Voxel-CRF
IEEE International Conference on Computer Vision (ICCV)
IEEE. 2013: 1425–1432
View details for DOI 10.1109/ICCV.2013.180
View details for Web of Science ID 000351830500178
-
Breaking the chain: liberation from the temporal Markov assumption for tracking human poses
IEEE International Conference on Computer Vision (ICCV)
IEEE. 2013: 2424–2431
View details for DOI 10.1109/ICCV.2013.301
View details for Web of Science ID 000351830500303
-
Object Detection by 3D Aspectlets and Occlusion Reasoning
IEEE International Conference on Computer Vision Workshops (ICCVW)
IEEE. 2013: 530–537
View details for DOI 10.1109/ICCVW.2013.75
View details for Web of Science ID 000349847200072
-
Free your Camera: 3D Indoor Scene Understanding from Arbitrary Camera Motion
24th British Machine Vision Conference
B M V A PRESS. 2013
View details for DOI 10.5244/C.27.24
View details for Web of Science ID 000346352700021
- Accurate Localization of 3D Objects from RGB-D Data using Segmentation Hypotheses 2013
- Dense Object Reconstruction Using Semantic Priors 2013
- Learning Hierarchical Linguistic Descriptions of Visual Datasets NAACL-HLT Workshop on Vision and Language 2013
- Understanding Indoor Scenes using 3D Geometric Phrases 2013
- Weakly Supervised Learning of Mid-Level Features with Beta-Bernoulli Process Restricted Boltzmann Machines 2013
- Recognizing Complex Human Activities via Crowd Context Augmented Vision and Reality Springer. 2013: 1
- Object Detection by 3D Aspectlets and Occlusion Reasoning in the 4th International IEEE Workshop on 3D Representation and Recognition (3dRR) 2013
- Label Transfer Exploiting Three-dimensional Structure for Semantic Segmentation 2013
- Object detection, shape recovery, and 3D modelling by depth-encoded hough voting in Computer Vision and Image Understanding (CVIU) 2013
- Automatic targetless extrinsic calibration of a 3d lidar and camera by maximizing mutual information 2012
- Mobile Object Detection through Client-Server based Vote Transfer 2012
- Multimodality video indexing and retrieval using directed information IEEE Transactions on Multimedia 2012; 14 (1)
- Object Detection using Geometrical Context Feedback International Journal of Computer Vision 2012; 2
- An Efficient Branch-and-Bound Algorithm for Optimal Human Pose Estimation 2012
- Relating Things and Stuff by High-Order Potential Modeling ECCV 2012 Workshop on Higher-Order Models and Global Constraints in Computer Vision (HiPot). 2012
- Estimating the Aspect Layout of Object Categories 2012
- Structure From Motion with Points, Objects, and Regions 2012
- Toward Mutual Information based Automatic Registration of 3D Point Clouds 2012
- A Unified Framework for Multi-Target Tracking and Collective Activity Recognition 2012
- Model-based object recognition Encyclopedia of Computer Vision Springer. 2012: 1
- Object Co-detection 2012
- 3D Shape from Specular Reflections Encyclopedia of Computer Vision Springer. 2012: 1
- MVSS: Michigan Visual Sonification System 2012
- Scene Understanding for the Visually Impaired Using Visual Sonification by Visual Feature Analysis and Auditory Signature 2012
- Efficient and Exact MAP Inference using Branch and Bound 2012
-
Research in Visualization Techniques for Field Construction
JOURNAL OF CONSTRUCTION ENGINEERING AND MANAGEMENT-ASCE
2011; 137 (10): 853-862
View details for DOI 10.1061/(ASCE)CO.1943-7862.0000262
View details for Web of Science ID 000296507700021
- Toward coherent object detection and scene layout understanding Image and Vision Computing 2011; 9
- Visually Bootstrapped generalized ICP 2011
- EFFEX: An Embedded Processor for Computer Vision Based Feature Extraction 2011
- Deformable Part Models Revisited: A Performance Evaluation for Object Category Pose Estimation IEEE Workshop on Challenges and Opportunities in Robot Perception (in conjunction with ICCV-11). 2011
- Monitoring Changes of 3D Building Elements from Unordered Photo Collections IEEE workshop on Computer Vision for Remote Sensing of the Environment (in conjunction with ICCV-11). 2011
- Learning Context for Collective Activity Recognition 2011
- Semantic Structure from Motion 2011
- MEVBench: A Mobile Computer Vision Benchmarking Suite 2011
- Visualization of Construction Progress Monitoring using Unordered Construction Photo Collections and 4D Building Information Models in "Augmented Reality", ISBN 978-953-307-631-7 2011: 1
- Toward Automatic 3D Generic Object Modeling from One Single Image 3DIM-PVT 2011
- Semantic Structure From Motion with Object and Point Interactions IEEE Workshop on Challenges and Opportunities in Robot Perception (in conjunction with ICCV-11). 2011
- Hierarchical Classification of Images by Sparse Approximation 2011
- Articulated Part-based Model for Joint Object Detection and Pose Estimation 2011
- Integrated Sequential As-Built and As-Planned Representation with D4AR Tools in Support of Decision-Making Tasks in the AEC/FM Industry ASCE Journal of Construction Engineering and Management 2011
- Robust Object Pose Estimation via Statistical Manifold Modeling 2011
- Representations and Techniques for 3D Object Recognition and Scene Interpretation Synthesis lecture on Artificial Intelligence and Machine Learning Morgan Claypool Publishers. 2011: 1
- Recognizing Human Actions by Attributes 2011
- Cross-View Action Recognition via View Knowledge Transfer 2011
- Detecting and Tracking People using an RGB-D Camera via Multiple Detector Fusion Workshop on Challenges and Opportunities in Robot Perception (in conjunction with ICCV-11). 2011
- A computer analysis of the mirror in Hans Memling's Virgin and Child and Maarten van Nieuwenhove Digital Imaging for Cultural Heritage Preservation CRC Press. 2011: 1
- Multi-view Object Categorization and Pose Estimation Computer Vision: Detection, Recognition and Reconstruction (Studies in Computational Intelligence) Springer. 2010: 1
- Remote assessment of pre and post-disaster critical physical infrastructures using segway mobile workstation chariot and D4AR 4D augmented reality models. 2010
- Automated model component-based recognition of progress using daily construction photographs and 4D IFC-based models. 2010
- D4AR 4 Dimensional augmented reality - tools for automated remote progress tracking and support of decision-enabling tasks in the AEC/FM industry 2010
- Depth-Encoded Hough Voting for Joint Object Detection and Shape Recovery 2010
- CEC: Research in Visualization Techniques for Field Construction 2010
- D4AR - 4 DIMENSIONAL AUGMENTED REALITY - MODELS FOR AUTOMATION AND INTERACTIVE VISUALIZATION OF CONSTRUCTION PROGRESS MONITORING 2010
- Extrinsic calibration of a 3d laser scanner and an omnidirectional camera. 2010
- Toward automated generation of parametric BIMs based on hybrid video and laser scanning data. In Journal of Advanced Engineering Informatics 2010; 4 (24): 456-465
- Model-based detection of progress using D4AR - A 4 Dimensional augmented reality- models generated by daily site photologs and building information models 2010
- Toward Coherent Object Detection And Scene Layout Understanding 2010
- Multiple Target Tracking in World Coordinate with Single, Minimally Calibrated Camera 2010
- Object Detection with Geometrical Context Feedback Loop 2010
-
Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories
12th IEEE International Conference on Computer Vision
IEEE. 2009: 213–220
View details for Web of Science ID 000294955300028
- A Multi-View Probabilistic Model for 3D Object Classes. 2009
- Monitoring of Construction Performance Using Daily Progress Photograph Logs and 4D As-Planned Models 2009
- What are they doing? : Collective Activity Classification Using Spatio-Temporal Relationship Among People 2009
- D4AR- A 4-Dimensional Augmented Reality Model for Automating Construction Progress Data Collection 2009
- Unsupervised Object Pose Classification from Short Video Sequences 2009
- Sparse Reconstruction and Geo-Registration of Daily Site Photographs for Representation of As-Built Construction Scene and Automatic Construction Progress Data Collection 2009
- Scene Categorization from Low Definition Video 2009
- Interactive Visual Construction Progress Monitoring with 4D Augmented Reality Model 2009
- View synthesis for recognizing unseen poses of object classes. 2008
- Why do we see some surfaces as reflective? 2008
- When are reflections useful in perceiving the shape of shiny surfaces? 2008
- Spatial-Temporal Correlations for Unsupervised Action Classification 2008
- Reflections on praxis and facture in a devotional portrait diptych: A computer analysis of the mirror in Hans Memling’s Virgin and Child and Maarten van Nieuwenhove 2008
- Interactive Visual Construction Progress Monitoring with 4D Augmented Reality Model CCBE-XI 2008
- Detecting Specular Surfaces on Natural Images 2007
- Carving from ray-tracing constraints: IRT-carving 2006
- Discriminative Object Class Models of Appearance and Shape by Correlatons 2006
- 3D Reconstruction by Shadow Carving: Theory and Practical Evaluation International Journal of Computer Vision (IJCV) 2006; 3 (71): 305-336
- Local Shape from Mirror Reflections International Journal of Computer Vision (IJCV) 2005; 1 (64): 31-67
- What do reflections tell us about the shape of a mirror? in Applied Perception in Graphics and Visualization [sponsored by ACM SIGGRAPH] 2004: 115-118
- Recovering local shape of a mirror surface from reflection of a regular grid 2004
- Can We See the Shape of a Mirror? 2003
- Implementation of a Shadow Carving System for Shape Capture 2002
- Local Analysis for 3D Reconstruction of Specular Surfaces -- part II 2002
- Second Order Local Analysis for 3D Reconstruction of Specular Surfaces 2002
- Local Analysis for 3D Reconstruction of Specular Surfaces 2001
- Shadow Carving 2001