Kunle Olukotun
Cadence Design Systems Professor, Professor of Electrical Engineering and of Computer Science
Bio
Kunle Olukotun is the Cadence Design Systems Professor of Electrical Engineering and Computer Science at Stanford University. Olukotun is a pioneer in multicore processor design and the leader of the Stanford Hydra chip multiprocessor (CMP) research project. He founded Afara Websystems to develop high-throughput, low-power multicore processors for server systems. The Afara multicore processor, called Niagara, was acquired by Sun Microsystems and now powers Oracle's SPARC-based servers. In 2017, Olukotun co-founded SambaNova Systems, a machine learning and artificial intelligence company, where he continues to serve as Chief Technologist.
Olukotun is the Director of the Pervasive Parallel Lab and a member of the Data Analytics for What's Next (DAWN) Lab, developing infrastructure for usable machine learning. He is a member of the National Academy of Engineering, an ACM Fellow, and an IEEE Fellow for contributions to multiprocessors on a chip design and the commercialization of this technology. He also received the Harry H. Goode Memorial Award.
Olukotun received his Ph.D. in Computer Engineering from The University of Michigan.
Academic Appointments
-
Professor, Electrical Engineering
-
Professor, Computer Science
-
Faculty Affiliate, Institute for Human-Centered Artificial Intelligence (HAI)
-
Member, Wu Tsai Neurosciences Institute
Honors & Awards
-
Eckert-Mauchly Award, ACM-IEEE (2023)
-
Member, American Academy of Arts and Sciences (2022)
-
Member, National Academy of Engineering (2021)
-
Harry H. Goode Memorial Award, IEEE (2018)
-
Fellow, ACM (2007)
-
Fellow, IEEE (2007)
Professional Education
-
PhD, Michigan (1991)
2024-25 Courses
- Digital Systems Design Lab
EE 109 (Spr)
Independent Studies (19)
- Advanced Reading and Research
CS 499 (Aut, Win, Spr)
- Advanced Reading and Research
CS 499P (Aut, Win, Spr)
- Curricular Practical Training
CS 390A (Aut, Win, Spr)
- Curricular Practical Training
CS 390B (Aut, Win, Spr)
- Curricular Practical Training
CS 390C (Aut, Win, Spr)
- Independent Project
CS 399 (Aut, Win, Spr)
- Independent Project
CS 399P (Aut, Win, Spr)
- Independent Work
CS 199 (Aut, Win, Spr)
- Independent Work
CS 199P (Aut, Win, Spr)
- Master's Thesis and Thesis Research
EE 300 (Aut, Win, Spr)
- Part-time Curricular Practical Training
CS 390D (Aut, Win, Spr)
- Programming Service Project
CS 192 (Aut, Win, Spr)
- Senior Project
CS 191 (Aut, Win, Spr)
- Special Studies and Reports in Electrical Engineering
EE 191 (Aut, Win, Spr)
- Special Studies and Reports in Electrical Engineering
EE 391 (Aut, Win, Spr)
- Special Studies and Reports in Electrical Engineering (WIM)
EE 191W (Aut, Win, Spr)
- Special Studies or Projects in Electrical Engineering
EE 190 (Aut, Win, Spr)
- Special Studies or Projects in Electrical Engineering
EE 390 (Aut, Win, Spr)
- Writing Intensive Senior Research Project
CS 191W (Aut, Win, Spr)
Prior Year Courses
2023-24 Courses
- Digital Systems Design Lab
EE 109 (Spr)
2022-23 Courses
- Digital Systems Design Lab
EE 109 (Spr)
- Hardware Accelerators for Machine Learning
CS 217 (Win)
- Parallel Computing
CS 149 (Aut)
2021-22 Courses
- Digital Systems Design Lab
EE 109 (Spr)
- Parallel Computing
CS 149 (Aut)
Stanford Advisees
-
Doctoral Dissertation Reader (AC)
Trevor Gale, Robert Radway, Eshan Singh, Athinagoras Skiadopoulos
-
Orals Evaluator
Trevor Gale
-
Doctoral Dissertation Advisor (AC)
Olivia Hsu, Sho Ko, Rubens Lacouture, Gina Sohn
-
Master's Program Advisor
Andreas Alexandrou, Seth Carden, Chia-Hsiang Chang, Robert Chen, Janelle Cheung, Suhas Chundi, Hamza El Boudali, Divija Hasteer, Suchen He, Quan Ho, Jinhyo Huh, Matthew Hung, Isabella Jordan, Manas Khadka, Rupert Lu, Patrick McEwen, Sai Gautham Ravipati, Joseph Rejive, Hari Vallabhaneni, Yichen Wang
-
Doctoral (Program)
Konstantin Hossfeld, Olivia Hsu, Wonsuk Jang, Taeyoung Kong, Rubens Lacouture, Louis Le Coeur, Leo Liu, Gina Sohn, Genghan Zhang, Qizheng Zhang
All Publications
-
Revet: A Language and Compiler for Dataflow Threads
IEEE COMPUTER SOC. 2024: 61-74
View details for DOI 10.1109/HPCA57654.2024.00016
View details for Web of Science ID 001207751400005
-
Computing Systems in the Foundation Model Era
IEEE COMPUTER SOC. 2024: 889
View details for DOI 10.1109/IPDPS57955.2024.00083
View details for Web of Science ID 001270389600047
-
CARAVAN: Practical Online Learning of In-Network ML Models with Labeling Agents
USENIX ASSOC. 2024: 325-345
View details for Web of Science ID 001270877200018
-
Mosaic: An Interoperable Compiler for Tensor Algebra
PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL
2023; 7 (PLDI)
View details for DOI 10.1145/3591236
View details for Web of Science ID 001005701900018
-
BaCO: A Fast and Portable Bayesian Compiler Optimization Framework
ASSOC COMPUTING MACHINERY. 2023: 19-42
View details for DOI 10.1145/3623278.3624770
View details for Web of Science ID 001161547900002
-
Sigma: Compiling Einstein Summations to Locality-Aware Dataflow
ASSOC COMPUTING MACHINERY. 2023: 718-732
View details for DOI 10.1145/3575693.3575694
View details for Web of Science ID 001074472300050
-
Global Perspectives of Diversity, Equity, and Inclusion
COMMUNICATIONS OF THE ACM
2022; 65 (12): 30-31
View details for DOI 10.1145/3548454
View details for Web of Science ID 000887945400010
-
Taurus: A Data Plane Architecture for Per-Packet ML
ASSOC COMPUTING MACHINERY. 2022: 1099-1114
View details for DOI 10.1145/3503222.3507726
View details for Web of Science ID 000810486300077
-
Accelerating SLIDE: Exploiting Sparsity on Accelerator Architectures
IEEE COMPUTER SOC. 2022: 663-670
View details for DOI 10.1109/IPDPSW55747.2022.00116
View details for Web of Science ID 000855041000083
-
Compilation of Sparse Array Programming Models
PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL
2021; 5
View details for DOI 10.1145/3485505
View details for Web of Science ID 000731569200032
-
Chopping off the Tail: Bounded Non-Determinism for Real-Time Accelerators
IEEE COMPUTER ARCHITECTURE LETTERS
2021; 20 (2): 110-113
View details for DOI 10.1109/LCA.2021.3102224
View details for Web of Science ID 000685885600002
-
Aurochs: An Architecture for Dataflow Threads
IEEE COMPUTER SOC. 2021: 402-415
View details for DOI 10.1109/ISCA52012.2021.00039
View details for Web of Science ID 000702275600030
-
Bayesian Optimization with a Prior for the Optimum
SPRINGER INTERNATIONAL PUBLISHING AG. 2021: 265-296
View details for DOI 10.1007/978-3-030-86523-8_17
View details for Web of Science ID 000713413200017
-
High Performance Lattice Regression on FPGAs via a High Level Hardware Description Language
IEEE. 2021: 78-87
View details for DOI 10.1109/ICFPT52863.2021.9609893
View details for Web of Science ID 000792703100011
-
SARA: Scaling a Reconfigurable Dataflow Accelerator
IEEE COMPUTER SOC. 2021: 1041-1054
View details for DOI 10.1109/ISCA52012.2021.00085
View details for Web of Science ID 000702275600076
-
Elastic RSS: Co-Scheduling Packets and Cores Using Programmable NICs
ASSOC COMPUTING MACHINERY. 2019: 71–77
View details for DOI 10.1145/3343180.3343184
View details for Web of Science ID 000505066500011
-
Scalable Interconnects for Reconfigurable Spatial Architectures
ASSOC COMPUTING MACHINERY. 2019: 615–28
View details for DOI 10.1145/3307650.3322249
View details for Web of Science ID 000521059600048
-
TensorFlow to Cloud FPGAs: Tradeoffs for Accelerating Deep Neural Networks
IEEE. 2019: 360–66
View details for DOI 10.1109/FPL.2019.00064
View details for Web of Science ID 000518670300054
-
Polystore++: Accelerated Polystore System for Heterogeneous Workloads
IEEE COMPUTER SOC. 2019: 1641–51
View details for DOI 10.1109/ICDCS.2019.00163
View details for Web of Science ID 000565234200152
-
Exploring the Utility of Developer Exhaust.
Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning. Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.)
2018; 2018
Abstract
Using machine learning to analyze data often results in developer exhaust - code, logs, or metadata that do not define the learning algorithm but are byproducts of the data analytics pipeline. We study how the rich information present in developer exhaust can be used to approximately solve otherwise complex tasks. Specifically, we focus on using log data associated with training deep learning models to perform model search by predicting performance metrics for untrained models. Instead of designing a different model for each performance metric, we present two preliminary methods that rely only on information present in logs to predict these characteristics for different architectures. We introduce (i) a nearest neighbor approach with a hand-crafted edit distance metric to compare model architectures and (ii) a more generalizable, end-to-end approach that trains an LSTM using model architectures and associated logs to predict performance metrics of interest. We perform model search optimizing for best validation accuracy, degree of overfitting, and best validation accuracy given a constraint on training time. Our approaches can predict validation accuracy within 1.37% error on average, while the baseline achieves 4.13% by using the performance of a trained model with the closest number of layers. When choosing the best performing model given constraints on training time, our approaches select the top-3 models that overlap with the true top-3 models 82% of the time, while the baseline only achieves this 54% of the time. Our preliminary experiments hold promise for how developer exhaust can help learn models that can approximate various complex tasks efficiently.
View details for DOI 10.1145/3209889.3209895
View details for PubMedID 31131381
-
Plasticine: A Reconfigurable Accelerator for Parallel Patterns
IEEE MICRO
2018; 38 (3): 20–31
View details for Web of Science ID 000432316500004
-
LevelHeaded: A Unified Engine for Business Intelligence and Linear Algebra Querying
IEEE. 2018: 449–60
View details for DOI 10.1109/ICDE.2018.00048
View details for Web of Science ID 000492836500040
-
EmptyHeaded: A Relational Engine for Graph Processing
ASSOC COMPUTING MACHINERY. 2017
View details for DOI 10.1145/3129246
View details for Web of Science ID 000419302700001
-
Mind the Gap: Bridging Multi-Domain Query Workloads with EmptyHeaded
PROCEEDINGS OF THE VLDB ENDOWMENT
2017; 10 (12): 1849–52
View details for Web of Science ID 000416494000024
-
Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent
ASSOC COMPUTING MACHINERY. 2017: 561–74
Abstract
Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called Buckwild! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11×. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems.
View details for PubMedID 29391770
View details for PubMedCentralID PMC5789782
-
EmptyHeaded: A Relational Engine for Graph Processing.
Proceedings. ACM-Sigmod International Conference on Management of Data
2016; 2016: 431-446
Abstract
There are two types of high-performance graph processing engines: low- and high-level engines. Low-level engines (Galois, PowerGraph, Snap) provide optimized data structures and computation models but require users to write low-level imperative code, hence ensuring that efficiency is the burden of the user. In high-level engines, users write in query languages like datalog (SociaLite) or SQL (Grail). High-level engines are easier to use but are orders of magnitude slower than the low-level graph engines. We present EmptyHeaded, a high-level engine that supports a rich datalog-like query language and achieves performance comparable to that of low-level engines. At the core of EmptyHeaded's design is a new class of join algorithms that satisfy strong theoretical guarantees but have thus far not achieved performance comparable to that of specialized graph processing engines. To achieve high performance, EmptyHeaded introduces a new join engine architecture, including a novel query optimizer and data layouts that leverage single-instruction multiple data (SIMD) parallelism. With this architecture, EmptyHeaded outperforms high-level approaches by up to three orders of magnitude on graph pattern queries, PageRank, and Single-Source Shortest Paths (SSSP) and is an order of magnitude faster than many low-level baselines. We validate that EmptyHeaded competes with the best-of-breed low-level engine (Galois), achieving comparable performance on PageRank and at most 3× worse performance on SSSP.
View details for DOI 10.1145/2882903.2915213
View details for PubMedID 28077912
-
Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling.
JMLR workshop and conference proceedings
2016; 48: 1567-1576
Abstract
Gibbs sampling is a Markov chain Monte Carlo technique commonly used for estimating marginal distributions. To speed up Gibbs sampling, there has recently been interest in parallelizing it by executing asynchronously. While empirical results suggest that many models can be efficiently sampled asynchronously, traditional Markov chain analysis does not apply to the asynchronous case, and thus asynchronous Gibbs sampling is poorly understood. In this paper, we derive a better understanding of the two main challenges of asynchronous Gibbs: bias and mixing time. We show experimentally that our theoretical results match practical outcomes.
View details for PubMedID 28344730
-
Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms.
Advances in neural information processing systems
2015; 28: 2656-2664
Abstract
Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is a martingale-based analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild!) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for non-convex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild!, that uses lower-precision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.
View details for PubMedID 27330264
-
Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width.
Advances in neural information processing systems
2015; 28: 3079-3087
Abstract
Gibbs sampling on factor graphs is a widely used inference technique, which often produces good empirical results. Theoretical guarantees for its performance are weak: even for tree structured graphs, the mixing time of Gibbs may be exponential in the number of variables. To help understand the behavior of Gibbs sampling, we introduce a new (hyper)graph property, called hierarchy width. We show that under suitable conditions on the weights, bounded hierarchy width ensures polynomial mixing time. Our study of hierarchy width is in part motivated by a class of factor graph templates, hierarchical templates, which have bounded hierarchy width, regardless of the data used to instantiate them. We demonstrate a rich application from natural language processing in which Gibbs sampling provably mixes rapidly and achieves accuracy that exceeds human volunteers.
View details for PubMedID 27279724
-
Beyond Parallel Programming with Domain Specific Languages
ACM SIGPLAN NOTICES
2014; 49 (8): 179-179
View details for DOI 10.1145/2555243.2557966
View details for Web of Science ID 000349142100016
-
Delite: A Compiler Architecture for Performance-Oriented Embedded Domain-Specific Languages
ACM TRANSACTIONS ON EMBEDDED COMPUTING SYSTEMS
2014; 13
View details for DOI 10.1145/2584665
View details for Web of Science ID 000341390100017
-
Surgical Precision JIT Compilers
35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI)
ASSOC COMPUTING MACHINERY. 2014: 41–52
View details for DOI 10.1145/2594291.2594316
View details for Web of Science ID 000344455800008
-
Forge: Generating a High Performance DSL Implementation from a Declarative Specification
ACM SIGPLAN NOTICES
2014; 49 (3): 145-154
View details for DOI 10.1145/2517208.2517220
View details for Web of Science ID 000338625500017
-
Optimizing Data Structures in High-Level Programs: New Directions for Extensible Compilers based on Staging
ACM SIGPLAN NOTICES
2013; 48 (1): 497-510
View details for DOI 10.1145/2480359.2429128
View details for Web of Science ID 000318629900042
-
High Performance Embedded Domain Specific Languages
ACM SIGPLAN NOTICES
2012; 47 (9): 139-139
View details for DOI 10.1145/2398856.2364548
View details for Web of Science ID 000311296000014
-
Green-Marl: A DSL for Easy and Efficient Graph Analysis
ACM SIGPLAN NOTICES
2012; 47 (4): 349-362
View details for Web of Science ID 000209339300029
- Green-Marl: A DSL for Easy and Efficient Graph Analysis 2012
-
IMPLEMENTING DOMAIN-SPECIFIC LANGUAGES FOR HETEROGENEOUS PARALLEL COMPUTING
IEEE MICRO
2011; 31 (5): 42-52
View details for Web of Science ID 000295883700006
-
Accelerating CUDA Graph Algorithms at Maximum Warp
ACM SIGPLAN NOTICES
2011; 46 (8): 267-276
View details for Web of Science ID 000296264900027
-
A Domain-Specific Approach To Heterogeneous Parallelism
ACM SIGPLAN NOTICES
2011; 46 (8): 35-45
View details for Web of Science ID 000296264900005
-
Hardware Acceleration of Transactional Memory on Commodity Systems
ACM SIGPLAN NOTICES
2011; 46 (3): 27-38
View details for DOI 10.1145/1961296.1950372
View details for Web of Science ID 000290854400004
- Implementing Domain-Specific Languages for Heterogeneous Parallel Computing IEEE Micro: Special Issue on CPU, GPU, and Hybrid Computing 2011
- Hardware Acceleration of Transactional Memory on Commodity Systems 2011
- Accelerating CUDA Graph Algorithms at Maximum Warp 2011
- A Domain-Specific Approach to Heterogeneous Parallelism 2011
- Building-Blocks for Performance Oriented DSLs 2011
- OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning 2011
- Efficient Parallel Graph Exploration on Multi-Core CPU and GPU 2011
- A Heterogeneous Parallel Framework for Domain-Specific Languages 2011
-
Language Virtualization for Heterogeneous Parallel Computing
Conference on Object Oriented Programming Systems, Languages and Applications/SPLASH 2010
ASSOC COMPUTING MACHINERY. 2010: 835–47
View details for DOI 10.1145/1932682.1869527
View details for Web of Science ID 000286595800051
-
A Practical Concurrent Binary Search Tree
ACM SIGPLAN NOTICES
2010; 45 (5): 257-268
View details for Web of Science ID 000280548100024
-
UBIQUITOUS PARALLEL COMPUTING FROM BERKELEY, ILLINOIS, AND STANFORD
IEEE MICRO
2010; 30 (2): 41-55
View details for Web of Science ID 000276473900006
- A Large-scale Architecture for Restricted Boltzmann Machines 2010
- FARM: A Prototyping Environment for Tightly-Coupled, Heterogeneous Architectures 2010
- Implementing and Evaluating Nested Parallel Transactions in Software Transactional Memory 2010
- Transactional Predication: High-Performance Concurrent Sets and Maps for STM 2010
- EigenBench: A Simple Exploration Tool for Orthogonal TM Characteristics 2010
- CCSTM: A Library-Based STM for Scala 2010
- Making Nested Parallel Transactions Practical using Lightweight Hardware Support 2010
- Language Virtualization for Heterogeneous Parallel Computing 2010
- Implementing and Evaluating a Model Checker for Transactional Memory Systems 2010
- A Practical Concurrent Binary Search Tree. 2010
- A Highly Scalable Restricted Boltzmann Machine FPGA Implementation 2009
-
Feedback-Directed Barrier Optimization in a Strongly Isolated STM
ACM SIGPLAN NOTICES
2009; 44 (1): 213-225
View details for Web of Science ID 000272013800020
- Feedback-Directed Barrier Optimization in a Strongly Isolated STM 2009
-
Improving Software Concurrency with Hardware-assisted Memory Snapshot
20th ACM Symposium on Parallelism in Algorithms and Architectures
ASSOC COMPUTING MACHINERY. 2008: 363–363
View details for Web of Science ID 000266217200050
-
STAMP: Stanford Transactional Applications for Multi-Processing
IEEE International Symposium on Workload Characterization
IEEE. 2008: 31–42
View details for Web of Science ID 000263063500004
-
ASeD: Availability, Security, and Debugging Support using Transactional Memory
20th ACM Symposium on Parallelism in Algorithms and Architectures
ASSOC COMPUTING MACHINERY. 2008: 366–366
View details for Web of Science ID 000266217200053
-
Transactional memory: The hardware-software interface
IEEE MICRO
2007; 27 (1): 67-76
View details for Web of Science ID 000246455000009
-
An Effective Hybrid Transactional Memory System with Strong Isolation Guarantees
34th Annual International Symposium on Computer Architecture
ASSOC COMPUTING MACHINERY. 2007: 69–80
View details for Web of Science ID 000265786200007
-
Transactional Collection Classes
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
ASSOC COMPUTING MACHINERY. 2007: 56–67
View details for Web of Science ID 000266870900006
-
A Practical FPGA-based Framework for Novel CMP Research
15th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
ASSOC COMPUTING MACHINERY. 2007: 116–125
View details for Web of Science ID 000268330100013
-
Towards Soft Optimization Techniques for Parallel Cognitive Applications
19th Annual Symposium on Parallelism in Algorithms and Architectures
ASSOC COMPUTING MACHINERY. 2007: 59–60
View details for Web of Science ID 000266371200009
-
A scalable, non-blocking approach to transactional memory
13th International Symposium on High-Performance Computer Architecture
IEEE COMPUTER SOC. 2007: 97–108
View details for Web of Science ID 000245463100010
-
ATLAS: A chip-multiprocessor with Transactional Memory support
Design, Automation and Test in Europe Conference and Exhibition (DATE 07)
IEEE. 2007: 3–8
View details for Web of Science ID 000252175700001
-
Executing Java programs with transactional memory
OOPSLA Workshop on Synchronization and Concurrency in Object-Oriented Languages
ELSEVIER SCIENCE BV. 2006: 111–29
View details for DOI 10.1016/j.scico.2006.05.006
View details for Web of Science ID 000241921200002
-
Tradeoffs in transactional memory virtualization
ACM SIGPLAN NOTICES
2006; 41 (11): 371-381
View details for Web of Science ID 000202972600035
-
The ATOMOΣ transactional programming language
ACM SIGPLAN NOTICES
2006; 41 (6): 1-13
View details for Web of Science ID 000202972100001
- The Atomos Transactional Programming Language 2006
-
Architectural semantics for practical Transactional Memory
33rd International Symposium on Computer Architecture
IEEE COMPUTER SOC. 2006: 53–64
View details for Web of Science ID 000238976500005
-
The common case transactional behavior of multithreaded programs
12th International Symposium on High-Performance Computer Architecture
IEEE COMPUTER SOC. 2006: 271–282
View details for Web of Science ID 000237200400026
- The Common Case Transactional Behavior of Multithreaded Programs 2006
- Architectural Semantics for Practical Transactional Memory 2006
- The Software Stack for Transactional Memory: Challenges and Opportunities 2006
- Tradeoffs in Transactional Memory Virtualizations 2006
-
Niagara: A 32-way multithreaded SPARC processor
IEEE MICRO
2005; 25 (2): 21-29
View details for Web of Science ID 000228487000004
- The Future of Microprocessors ACM QUEUE Magazine 2005
-
Maximizing CMP throughput with mediocre cores
PACT 2005: 14TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES
2005: 51-62
View details for Web of Science ID 000233637100005
-
A new approach to programming and prototyping parallel systems
HIGH PERFORMANCE COMPUTING - HIPC 2005, PROCEEDINGS
2005; 3769: 4-4
View details for Web of Science ID 000235801700003
-
Characterization of TCC on chip-multiprocessors
14th International Conference on Parallel Architectures and Compilation Techniques
IEEE COMPUTER SOC. 2005: 63–74
View details for Web of Science ID 000233637100006
- Maximizing CMP Throughput with Mediocre Cores 2005
- TAPE: A Transactional Application Profiling Environment 2005
- Article about Kunle Olukotun's Niagara processor: Sun's Big Splash IEEE Spectrum Magazine 2005
- Transactional Execution of Java Programs 2005
- Exposing Speculative Thread Parallelism in SPEC2000 2005
- Characterization of TCC on Chip-Multiprocessors 2005
-
Transactional coherence and consistency: Simplifying parallel hardware and software
IEEE MICRO
2004; 24 (6): 92-103
View details for Web of Science ID 000226365900013
-
Programming with transactional coherence and consistency (TCC)
11th International Conference on Architectural Support for Programming Languages and Operating Systems
ASSOC COMPUTING MACHINERY. 2004: 1–13
View details for Web of Science ID 000228341700003
- Transactional Coherence and Consistency: Simplifying Parallel Hardware and Software Micro's Top Picks, IEEE Micro 2004; 24 (6)
-
Transactional memory coherence and consistency
31st Annual International Symposium on Computer Architecture
IEEE COMPUTER SOC. 2004: 102–113
View details for Web of Science ID 000222915900009
- Niagara: A 32-Way Multithreaded SPARC Processor IEEE MICRO Magazine, March-April 2005, and presented at Hot Chips 2004
- Transactional Memory Coherence and Consistency 2004
- Programming with Transactional Coherence and Consistency (TCC) 2004
-
The Jrpm system for dynamically parallelizing sequential Java programs
IEEE MICRO
2003; 23 (6): 26-35
View details for Web of Science ID 000188257700006
-
Using thread-level speculation to simplify manual parallelization
9th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
ASSOC COMPUTING MACHINERY. 2003: 1–12
View details for Web of Science ID 000187366900001
- Using Thread-Level Speculation to Simplify Manual Parallelization 2003
-
The Jrpm system for dynamically parallelizing Java programs
30TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, PROCEEDINGS
2003: 434-445
View details for Web of Science ID 000183763700037
-
TEST: A tracer for extracting speculative threads
CGO 2003: INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION
2003: 301-312
View details for Web of Science ID 000182316800026
- The Jrpm System for Dynamically Parallelizing Java Programs 2003
- TEST: A Tracer for Extracting Speculative Threads 2003
-
Targeting dynamic compilation for embedded environments
USENIX ASSOCIATION PROCEEDINGS OF THE 2ND JAVA(TM) VIRTUAL MACHINE RESEARCH AND TECHNOLOGY SYMPOSIUM
2002: 151-164
View details for Web of Science ID 000178400500013
-
Efficient state representation for symbolic simulation
39TH DESIGN AUTOMATION CONFERENCE, PROCEEDINGS 2002
2002: 99-104
View details for Web of Science ID 000177213300018
-
High bandwidth on-chip cache design
IEEE TRANSACTIONS ON COMPUTERS
2001; 50 (4): 292-307
View details for Web of Science ID 000168145500002
-
The Stanford Hydra CMP
IEEE MICRO
2000; 20 (2): 71-84
View details for Web of Science ID 000086194900013
-
A single chip multiprocessor integrated with high density DRAM
IEICE TRANSACTIONS ON ELECTRONICS
1999; E82C (8): 1567-1577
View details for Web of Science ID 000082243400030
-
REMARC: Reconfigurable multimedia array coprocessor
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS
1999; E82D (2): 389-397
View details for Web of Science ID 000079040600006
- The Stanford Hydra CMP IEEE MICRO Magazine, March-April 2000, and presented at Hot Chips 1999
- Improving the Performance of Speculatively Parallel Applications on the Hydra CMP 1999
-
Data speculation support for a chip multiprocessor
ACM SIGPLAN NOTICES
1998; 33 (11): 58-69
View details for Web of Science ID 000076778700008
- Considerations in the Design of Hydra: A Multiprocessor-on-a-Chip Microarchitecture Stanford University Computer Systems Lab Technical Report CSL-TR-98-749 1998
-
Digital system simulation: Methodologies and examples
35th Design Automation Conference
ASSOC COMPUTING MACHINERY. 1998: 658–663
View details for Web of Science ID 000077273700118
-
Exploiting method-level parallelism in single-threaded Java programs
International Conference on Parallel Architectures and Compilation Techniques
IEEE COMPUTER SOC. 1998: 176–184
View details for Web of Science ID 000076611700022
-
DCP: an algorithm for datapath/control partitioning of synthesizable RTL models
International Conference on Computer Design: VLSI in Computers and Processors
IEEE COMPUTER SOC PRESS. 1998: 442–449
View details for Web of Science ID 000076796900070
- Data Speculation Support for a Chip Multiprocessor 1998
- Exploiting Method-Level Parallelism in Single-Threaded Java Programs 1998
-
Multilevel optimization of pipelined caches
IEEE TRANSACTIONS ON COMPUTERS
1997; 46 (10): 1093-1102
View details for Web of Science ID A1997YB64800004
-
A single-chip multiprocessor
COMPUTER
1997; 30 (9): 79-?
View details for Web of Science ID A1997XU01900018
- A Single Chip Multiprocessor Integrated with DRAM 1997
-
Java as a specification language for hardware-software systems
1997 IEEE/ACM International Conference on Computer-Aided Design (ICCAD 97)
IEEE COMPUTER SOC PRESS. 1997: 690–697
View details for Web of Science ID A1997BK01U00099
-
Verifying correct pipeline implementation for microprocessors
1997 IEEE/ACM International Conference on Computer-Aided Design (ICCAD 97)
IEEE COMPUTER SOC PRESS. 1997: 162–169
View details for Web of Science ID A1997BK01U00026
-
Designing high bandwidth on-chip caches
24th Annual International Symposium on Computer Architecture
ASSOC COMPUTING MACHINERY. 1997: 121–132
View details for Web of Science ID A1997BH95B00011
- A Single-Chip Multiprocessor IEEE Computer Special Issue on "Billion-Transistor Processors" 1997
- Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor Stanford University Computer Systems Lab Technical Report CSL-TR-97-715 1997
-
The case for a single-chip multiprocessor
ACM SIGPLAN NOTICES
1996; 31 (9): 2-11
View details for Web of Science ID A1996VM12800003
- The Case for a Single-Chip Multiprocessor 1996
-
A scalable formal verification methodology for pipelined microprocessors
33rd Design Automation Conference
ASSOC COMPUTING MACHINERY. 1996: 558–563
View details for Web of Science ID A1996BF92A00111
-
The impact of shared-cache clustering in small-scale shared-memory multiprocessors
2nd International Symposium on High-Performance Computer Architecture (HPCA-2)
IEEE COMPUTER SOC PRESS. 1996: 74–84
View details for Web of Science ID A1996BF28H00007
-
Evaluation of design alternatives for a multiprocessor microprocessor
23rd Annual International Symposium on Computer Architecture
ASSOC COMPUTING MACHINERY. 1996: 67–77
View details for Web of Science ID A1996BF68U00007
-
Emulation and prototyping of digital systems
NATO Advanced Study Institute on Hardware/Software Co-Design
SPRINGER. 1996: 339–366
View details for Web of Science ID A1996BF04R00014
-
Increasing cache port efficiency for dynamic superscalar microprocessors
23rd Annual International Symposium on Computer Architecture
ASSOC COMPUTING MACHINERY. 1996: 147–157
View details for Web of Science ID A1996BF68U00014
- Evaluation of Design Alternatives for a Multiprocessor Microprocessor 1996
-
The benefits of clustering in shared address space multiprocessors: An applications-driven investigation
1995 ACM/IEEE Supercomputing Conference (SC 95)
ASSOC COMPUTING MACHINERY. 1995: 1674–1704
View details for Web of Science ID A1995BH56H00055
-
A general method for compiling event driven simulations
32nd Design Automation Conference
ASSOC COMPUTING MACHINERY. 1995: 151–156
View details for Web of Science ID A1995BD41Y00026
-
A SOFTWARE-HARDWARE COSYNTHESIS APPROACH TO DIGITAL SYSTEM SIMULATION
IEEE MICRO
1994; 14 (4): 48-58
View details for Web of Science ID A1994NZ48900009
- Rationale and Design of the Hydra Multiprocessor Stanford University Computer Systems Lab Technical Report CSL-TR-94-645 1994
-
EXPLORING THE DESIGN SPACE FOR A SHARED-CACHE MULTIPROCESSOR
21st Annual International Symposium on Computer Architecture
IEEE COMPUTER SOC PRESS. 1994: 166–175
View details for Web of Science ID A1994BA93B00015
-
ANALYSIS AND DESIGN OF LATCH-CONTROLLED SYNCHRONOUS DIGITAL CIRCUITS
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS
1992; 11 (3): 322-333
View details for Web of Science ID A1992HB92700004
-
THE DESIGN OF A MICROSUPERCOMPUTER
COMPUTER
1991; 24 (1): 57-64
View details for Web of Science ID A1991ER66000009
-
IMPLEMENTING A CACHE FOR A HIGH-PERFORMANCE GAAS MICROPROCESSOR
18TH ANNUAL INTERNATIONAL SYMP ON COMPUTER ARCHITECTURE
ASSOC COMPUTING MACHINERY. 1991: 138–147
View details for Web of Science ID A1991BT52Q00014
-
HIERARCHICAL GATE-ARRAY ROUTING ON A HYPERCUBE MULTIPROCESSOR
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
1990; 8 (4): 313-324
View details for Web of Science ID A1990CY87900003
-
INTERCONNECTING OFF-THE-SHELF MICROPROCESSORS
AFIPS CONFERENCE PROCEEDINGS
1985; 54: 175-?
View details for Web of Science ID A1985ANT7800024
-
Plasticine: A Reconfigurable Architecture For Parallel Patterns
ISCA '17: 44th International Symposium on Computer Architecture, June 2017
2017
View details for DOI 10.1145/3079856.3080256