Bio


Kunle Olukotun is the Cadence Design Professor of Electrical Engineering and Computer Science at Stanford University. Olukotun is a pioneer in multicore processor design and the leader of the Stanford Hydra chip multiprocessor (CMP) research project. He founded Afara Websystems to develop high-throughput, low-power multicore processors for server systems. The Afara multicore processor, called Niagara, was acquired by Sun Microsystems and now powers Oracle's SPARC-based servers. In 2017, Olukotun co-founded SambaNova Systems, a machine learning and artificial intelligence company, and continues to lead as its Chief Technologist.

Olukotun is the Director of the Pervasive Parallelism Lab and a member of the Data Analytics for What's Next (DAWN) Lab, developing infrastructure for usable machine learning. He is a member of the National Academy of Engineering, an ACM Fellow, and an IEEE Fellow for contributions to multiprocessor-on-a-chip design and the commercialization of this technology. He also received the Harry H. Goode Memorial Award.

Olukotun received his Ph.D. in Computer Engineering from The University of Michigan.

Honors & Awards


  • Eckert-Mauchly Award, ACM-IEEE (2023)
  • Member, American Academy of Arts and Sciences (2022)
  • Member, National Academy of Engineering (2021)
  • Harry H. Goode Memorial Award, IEEE (2018)
  • Fellow, ACM (2007)
  • Fellow, IEEE (2007)

Professional Education


  • PhD, Michigan (1991)

All Publications


  • Mosaic: An Interoperable Compiler for Tensor Algebra PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL Bansal, M., Hsu, O., Olukotun, K., Kjolstad, F. 2023; 7 (PLDI)

    View details for DOI 10.1145/3591236

    View details for Web of Science ID 001005701900018

  • Global Perspectives of Diversity, Equity, and Inclusion COMMUNICATIONS OF THE ACM Barroso, L., Choudhury, T., Gupta, M., Olukotun, O., Popa, R., Song, D., Patterson, D. A. 2022; 65 (12): 30-31

    View details for DOI 10.1145/3548454

    View details for Web of Science ID 000887945400010

  • Taurus: A Data Plane Architecture for Per-Packet ML Swamy, T., Rucker, A., Shahbaz, M., Gaur, I., Olukotun, K., Falsafi, B., Ferdman, M., Lu, S., Wenisch, T. ASSOC COMPUTING MACHINERY. 2022: 1099-1114
  • Accelerating SLIDE: Exploiting Sparsity on Accelerator Architectures Ko, S., Rucker, A., Zhang, Y., Mure, P., Olukotun, K., IEEE Comp Soc IEEE COMPUTER SOC. 2022: 663-670
  • Compilation of Sparse Array Programming Models PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL Henry, R., Hsu, O., Yadav, R., Chou, S., Olukotun, K., Amarasinghe, S., Kjolstad, F. 2021; 5

    View details for DOI 10.1145/3485505

    View details for Web of Science ID 000731569200032

  • Chopping off the Tail: Bounded Non-Determinism for Real-Time Accelerators IEEE COMPUTER ARCHITECTURE LETTERS Rucker, A., Shahbaz, M., Olukotun, K. 2021; 20 (2): 110-113
  • Aurochs: An Architecture for Dataflow Threads Vilim, M., Rucker, A., Olukotun, K., IEEE Comp Soc IEEE COMPUTER SOC. 2021: 402-415
  • Bayesian Optimization with a Prior for the Optimum Souza, A., Nardi, L., Oliveira, L. B., Olukotun, K., Lindauer, M., Hutter, F., Oliver, N., PerezCruz, F., Kramer, S., Read, J., Lozano, J. A. SPRINGER INTERNATIONAL PUBLISHING AG. 2021: 265-296
  • High Performance Lattice Regression on FPGAs via a High Level Hardware Description Language Zhang, N., Feldman, M., Olukotun, K., IEEE IEEE. 2021: 78-87
  • SARA: Scaling a Reconfigurable Dataflow Accelerator Zhang, Y., Zhang, N., Zhao, T., Vilim, M., Shahbaz, M., Olukotun, K., IEEE Comp Soc IEEE COMPUTER SOC. 2021: 1041-1054
  • Elastic RSS: Co-Scheduling Packets and Cores Using Programmable NICs Rucker, A., Swamy, T., Shahbaz, M., Olukotun, K., ACM ASSOC COMPUTING MACHINERY. 2019: 71–77
  • Scalable Interconnects for Reconfigurable Spatial Architectures Zhang, Y., Rucker, A., Vilim, M., Prabhakar, R., Hwang, W., Olukotun, K., ACM ASSOC COMPUTING MACHINERY. 2019: 615–28
  • TensorFlow to Cloud FPGAs: Tradeoffs for Accelerating Deep Neural Networks Hadjis, S., Olukotun, K., Sourdis, Bouganis, C. S., Alvarez, C., Toledo, L., Valero, P., Martorell IEEE. 2019: 360–66
  • Polystore++: Accelerated Polystore System for Heterogeneous Workloads Singhal, R., Zhang, N., Nardi, L., Shahbaz, M., Olukotun, K., IEEE Comp Soc IEEE COMPUTER SOC. 2019: 1641–51
  • Exploring the Utility of Developer Exhaust. Proceedings of the Second Workshop on Data Management for End-to-End Machine Learning. Workshop on Data Management for End-to-End Machine Learning (2nd : 2018 : Houston, Tex.) Zhang, J., Lam, M., Wang, S., Varma, P., Nardi, L., Olukotun, K., Re, C. 2018; 2018

    Abstract

    Using machine learning to analyze data often results in developer exhaust: code, logs, or metadata that do not define the learning algorithm but are byproducts of the data analytics pipeline. We study how the rich information present in developer exhaust can be used to approximately solve otherwise complex tasks. Specifically, we focus on using log data associated with training deep learning models to perform model search by predicting performance metrics for untrained models. Instead of designing a different model for each performance metric, we present two preliminary methods that rely only on information present in logs to predict these characteristics for different architectures. We introduce (i) a nearest neighbor approach with a hand-crafted edit distance metric to compare model architectures and (ii) a more generalizable, end-to-end approach that trains an LSTM using model architectures and associated logs to predict performance metrics of interest. We perform model search optimizing for best validation accuracy, degree of overfitting, and best validation accuracy given a constraint on training time. Our approaches can predict validation accuracy within 1.37% error on average, while the baseline achieves 4.13% by using the performance of a trained model with the closest number of layers. When choosing the best performing model given constraints on training time, our approaches select the top-3 models that overlap with the true top-3 models 82% of the time, while the baseline only achieves this 54% of the time. Our preliminary experiments hold promise for how developer exhaust can help learn models that can approximate various complex tasks efficiently.

    View details for DOI 10.1145/3209889.3209895

    View details for PubMedID 31131381

  • Plasticine: A Reconfigurable Accelerator for Parallel Patterns IEEE MICRO Prabhakar, R., Zhang, Y., Koeplinger, D., Feldman, M., Zhao, T., Hadjis, S., Pedram, A., Kozyrakis, C., Olukotun, K. 2018; 38 (3): 20–31
  • LevelHeaded: A Unified Engine for Business Intelligence and Linear Algebra Querying Aberger, C. R., Lamb, A., Olukotun, K., Re, C., IEEE IEEE. 2018: 449–60
  • EmptyHeaded: A Relational Engine for Graph Processing Aberger, C. R., Lamb, A., Tu, S., Noetzli, A., Olukotun, K., Re, C. ASSOC COMPUTING MACHINERY. 2017

    View details for DOI 10.1145/3129246

    View details for Web of Science ID 000419302700001

  • Mind the Gap: Bridging Multi-Domain Query Workloads with EmptyHeaded PROCEEDINGS OF THE VLDB ENDOWMENT Aberger, C. R., Lamb, A., Olukotun, K., Re, C. 2017; 10 (12): 1849–52
  • Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent De Sa, C., Feldman, M., Re, C., Olukotun, K., Assoc Comp Machinery ASSOC COMPUTING MACHINERY. 2017: 561–74

    Abstract

    Stochastic gradient descent (SGD) is one of the most popular numerical algorithms used in machine learning and other domains. Since this is likely to continue for the foreseeable future, it is important to study techniques that can make it run fast on parallel hardware. In this paper, we provide the first analysis of a technique called Buckwild! that uses both asynchronous execution and low-precision computation. We introduce the DMGC model, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and show that it provides a way to both classify these algorithms and model their performance. We leverage this insight to propose and analyze techniques to improve the speed of low-precision SGD. First, we propose software optimizations that can increase throughput on existing CPUs by up to 11×. Second, we propose architectural changes, including a new cache technique we call an obstinate cache, that increase throughput beyond the limits of current-generation hardware. We also implement and analyze low-precision SGD on the FPGA, which is a promising alternative to the CPU for future SGD systems.

    View details for PubMedID 29391770

    View details for PubMedCentralID PMC5789782

  • EmptyHeaded: A Relational Engine for Graph Processing. Proceedings. ACM-Sigmod International Conference on Management of Data Aberger, C. R., Tu, S., Olukotun, K., Ré, C. 2016; 2016: 431-446

    Abstract

    There are two types of high-performance graph processing engines: low- and high-level engines. Low-level engines (Galois, PowerGraph, Snap) provide optimized data structures and computation models but require users to write low-level imperative code, hence ensuring that efficiency is the burden of the user. In high-level engines, users write in query languages like datalog (SociaLite) or SQL (Grail). High-level engines are easier to use but are orders of magnitude slower than the low-level graph engines. We present EmptyHeaded, a high-level engine that supports a rich datalog-like query language and achieves performance comparable to that of low-level engines. At the core of EmptyHeaded's design is a new class of join algorithms that satisfy strong theoretical guarantees but have thus far not achieved performance comparable to that of specialized graph processing engines. To achieve high performance, EmptyHeaded introduces a new join engine architecture, including a novel query optimizer and data layouts that leverage single-instruction multiple data (SIMD) parallelism. With this architecture, EmptyHeaded outperforms high-level approaches by up to three orders of magnitude on graph pattern queries, PageRank, and Single-Source Shortest Paths (SSSP) and is an order of magnitude faster than many low-level baselines. We validate that EmptyHeaded competes with the best-of-breed low-level engine (Galois), achieving comparable performance on PageRank and at most 3× worse performance on SSSP.

    View details for DOI 10.1145/2882903.2915213

    View details for PubMedID 28077912

  • Ensuring Rapid Mixing and Low Bias for Asynchronous Gibbs Sampling. JMLR workshop and conference proceedings De Sa, C., Olukotun, K., Ré, C. 2016; 48: 1567-1576

    Abstract

    Gibbs sampling is a Markov chain Monte Carlo technique commonly used for estimating marginal distributions. To speed up Gibbs sampling, there has recently been interest in parallelizing it by executing asynchronously. While empirical results suggest that many models can be efficiently sampled asynchronously, traditional Markov chain analysis does not apply to the asynchronous case, and thus asynchronous Gibbs sampling is poorly understood. In this paper, we derive a better understanding of the two main challenges of asynchronous Gibbs: bias and mixing time. We show experimentally that our theoretical results match practical outcomes.

    View details for PubMedID 28344730

  • Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms. Advances in neural information processing systems De Sa, C., Zhang, C., Olukotun, K., Ré, C. 2015; 28: 2656-2664

    Abstract

    Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is a martingale-based analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild!) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for non-convex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild!, that uses lower-precision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.

    View details for PubMedID 27330264

  • Rapidly Mixing Gibbs Sampling for a Class of Factor Graphs Using Hierarchy Width. Advances in neural information processing systems De Sa, C., Zhang, C., Olukotun, K., Ré, C. 2015; 28: 3079-3087

    Abstract

    Gibbs sampling on factor graphs is a widely used inference technique, which often produces good empirical results. Theoretical guarantees for its performance are weak: even for tree structured graphs, the mixing time of Gibbs may be exponential in the number of variables. To help understand the behavior of Gibbs sampling, we introduce a new (hyper)graph property, called hierarchy width. We show that under suitable conditions on the weights, bounded hierarchy width ensures polynomial mixing time. Our study of hierarchy width is in part motivated by a class of factor graph templates, hierarchical templates, which have bounded hierarchy width, regardless of the data used to instantiate them. We demonstrate a rich application from natural language processing in which Gibbs sampling provably mixes rapidly and achieves accuracy that exceeds human volunteers.

    View details for PubMedID 27279724

  • Beyond Parallel Programming with Domain Specific Languages ACM SIGPLAN NOTICES Olukotun, K. 2014; 49 (8): 179-179
  • Delite: A Compiler Architecture for Performance-Oriented Embedded Domain-Specific Languages ACM TRANSACTIONS ON EMBEDDED COMPUTING SYSTEMS Sujeeth, A. K., Brown, K. J., Lee, H., Rompf, T., Chafi, H., Odersky, M., Olukotun, K. 2014; 13

    View details for DOI 10.1145/2584665

    View details for Web of Science ID 000341390100017

  • Surgical Precision JIT Compilers 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) Rompf, T., Sujeeth, A. K., Brown, K. J., Lee, H., Chafi, H., Olukotun, K. ASSOC COMPUTING MACHINERY. 2014: 41–52
  • Forge: Generating a High Performance DSL Implementation from a Declarative Specification ACM SIGPLAN NOTICES Sujeeth, A. K., Gibbons, A., Brown, K. J., Lee, H., Rompf, T., Odersky, M., Olukotun, K. 2014; 49 (3): 145-154
  • Optimizing Data Structures in High-Level Programs New Directions for Extensible Compilers based on Staging ACM SIGPLAN NOTICES Rompf, T., Sujeeth, A. K., Amin, N., Brown, K. J., Jovanovic, V., Lee, H., Jonnalagedda, M., Olukotun, K., Odersky, M. 2013; 48 (1): 497-510
  • High Performance Embedded Domain Specific Languages ACM SIGPLAN NOTICES Olukotun, K. 2012; 47 (9): 139-139
  • Green-Marl: A DSL for Easy and Efficient Graph Analysis ACM SIGPLAN NOTICES Hong, S., Chafi, H., Sedlar, E., Olukotun, K. 2012; 47 (4): 349-362
  • Green-Marl: A DSL for Easy and Efficient Graph Analysis Hong, S., Chafi, H., Sedlar, E., Olukotun, K. 2012
  • IMPLEMENTING DOMAIN-SPECIFIC LANGUAGES FOR HETEROGENEOUS PARALLEL COMPUTING IEEE MICRO Lee, H., Brown, K. J., Sujeeth, A. K., Chafi, H., Olukotun, K., Rompf, T., Odersky, M. 2011; 31 (5): 42-52
  • Accelerating CUDA Graph Algorithms at Maximum Warp ACM SIGPLAN NOTICES Hong, S., Kim, S. K., Oguntebi, T., Olukotun, K. 2011; 46 (8): 267-276
  • A Domain-Specific Approach To Heterogeneous Parallelism ACM SIGPLAN NOTICES Chafi, H., Sujeeth, A. K., Brown, K. J., Lee, H., Atreya, A. R., Olukotun, K. 2011; 46 (8): 35-45
  • Hardware Acceleration of Transactional Memory on Commodity Systems ACM SIGPLAN NOTICES Casper, J., Oguntebi, T., Hong, S., Bronson, N. G., Kozyrakis, C., Olukotun, K. 2011; 46 (3): 27-38
  • Implementing Domain-Specific Languages for Heterogeneous Parallel Computing IEEE Micro: Special Issue on CPU, GPU, and Hybrid Computing Lee, H., Brown, K. J., Sujeeth, A. K., Chafi, H., Rompf, T., Odersky, M., Olukotun, O. A. 2011
  • Hardware Acceleration of Transactional Memory on Commodity Systems Casper, J., Oguntebi, T., Hong, S., Bronson, N. G., Kozyrakis, C., Olukotun, K. 2011
  • Accelerating CUDA Graph Algorithms at Maximum Warp Hong, S., Kim, S. K., Oguntebi, T., Olukotun, K. 2011
  • A Domain-Specific Approach to Heterogeneous Parallelism Chafi, H., Sujeeth, A. K., Brown, K. J., Lee, H., Atreya, A. R., Olukotun, K. 2011
  • Building-Blocks for Performance Oriented DSLs Rompf, T., Sujeeth, A. K., Lee, H., Brown, K. J., Chafi, H., Odersky, M., Olukotun, O. A. 2011
  • OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning Sujeeth, A. K., Lee, H., Brown, K. J., Rompf, T., Chafi, H., Wu, M., Olukotun, O. A. 2011
  • Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Hong, S., Oguntebi, T., Olukotun, K. 2011
  • A Heterogeneous Parallel Framework for Domain-Specific Languages Brown, K. J., Sujeeth, A. K., Lee, H., Rompf, T., Chafi, H., Odersky, M., Olukotun, O. A. 2011
  • Language Virtualization for Heterogeneous Parallel Computing Conference on Object Oriented Programming Systems, Languages and Applications/SPLASH 2010 Chafi, H., DeVito, Z., Moors, A., Rompf, T., Sujeeth, A. K., Hanrahan, P., Odersky, M., Olukotun, K. ASSOC COMPUTING MACHINERY. 2010: 835–47
  • A Practical Concurrent Binary Search Tree ACM SIGPLAN NOTICES Bronson, N. G., Casper, J., Chafi, H., Olukotun, K. 2010; 45 (5): 257-268
  • UBIQUITOUS PARALLEL COMPUTING FROM BERKELEY, ILLINOIS, AND STANFORD IEEE MICRO Catanzaro, B., Fox, A., Keutzer, K., Patterson, D., Su, B., Snir, M., Olukotun, K., Hanrahan, P., Chafi, H. 2010; 30 (2): 41-55
  • A Large-scale Architecture for Restricted Boltzmann Machines Kim, S. K., McMahon, P. L., Olukotun, K. 2010
  • FARM: A Prototyping Environment for Tightly-Coupled, Heterogeneous Architectures Oguntebi, T., Hong, S., Casper, J., Bronson, N., Kozyrakis, C., Olukotun, K. 2010
  • Implementing and Evaluating Nested Parallel Transactions in Software Transactional Memory Baek, W., Bronson, N., Kozyrakis, C., Olukotun, K. 2010
  • Transactional Predication: High-Performance Concurrent Sets and Maps for STM Bronson, N. G., Casper, J., Chafi, H., Olukotun, K. 2010
  • EigenBench: A Simple Exploration Tool for Orthogonal TM Characteristics Hong, S., Oguntebi, T., Casper, J., Bronson, N., Kozyrakis, C., Olukotun, K. 2010
  • CCSTM: A Library-Based STM for Scala Bronson, N. G., Chafi, H., Olukotun, K. 2010
  • Making Nested Parallel Transactions Practical using Lightweight Hardware Support Baek, W., Bronson, N., Kozyrakis, C., Olukotun, K. 2010
  • Language Virtualization for Heterogeneous Parallel Computing Chafi, H., DeVito, Z., Moors, A., Rompf, T., Sujeeth, A. K., Hanrahan, P., Olukotun, O. A. 2010
  • Implementing and Evaluating a Model Checker for Transactional Memory Systems Baek, W., Bronson, N. G., Kozyrakis, C., Olukotun, K. 2010
  • A Practical Concurrent Binary Search Tree. Bronson, N. G., Casper, J., Chafi, H., Olukotun, K. 2010
  • A Highly Scalable Restricted Boltzmann Machine FPGA Implementation Kim, S. K., McAfee, L. C., McMahon, P. L., Olukotun, K. 2009
  • Feedback-Directed Barrier Optimization in a Strongly Isolated STM ACM SIGPLAN NOTICES Bronson, N. G., Kozyrakis, C., Olukotun, K. 2009; 44 (1): 213-225
  • Feedback-Directed Barrier Optimization in a Strongly Isolated STM Bronson, N. G., Kozyrakis, C., Olukotun, K. 2009
  • Improving Software Concurrency with Hardware-assisted Memory Snapshot 20th ACM Symposium on Parallelism in Algorithms and Architectures Chung, J., Seo, J., Baek, W., Minh, C. C., McDonald, A., Kozyrakis, C., Olukotun, K. ASSOC COMPUTING MACHINERY. 2008: 363–363
  • STAMP: Stanford Transactional Applications for Multi-Processing IEEE International Symposium on Workload Characterization Minh, C. C., Chung, J., Kozyrakis, C., Olukotun, K. IEEE. 2008: 31–42
  • ASeD: Availability, Security, and Debugging Support using Transactional Memory 20th ACM Symposium on Parallelism in Algorithms and Architectures Chung, J., Baek, W., Bronson, N. G., Seo, J., Kozyrakis, C., Olukotun, K. ASSOC COMPUTING MACHINERY. 2008: 366–366
  • Transactional memory: The hardware-software interface IEEE MICRO McDonald, A., Carlstrom, B. D., Chung, J., Minh, C. C., Chafi, H., Kozyrakis, C., Olukotun, K. 2007; 27 (1): 67-76
  • An Effective Hybrid Transactional Memory System with Strong Isolation Guarantees 34th Annual International Symposium on Computer Architecture Minh, C. C., Trautmann, M., Chung, J., McDonald, A., Bronson, N., Casper, J., Kozyrakis, C., Olukotun, K. ASSOC COMPUTING MACHINERY. 2007: 69–80
  • Transactional Collection Classes ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Carlstrom, B. D., McDonald, A., Carbin, M., Kozyrakis, C., Olukotun, K. ASSOC COMPUTING MACHINERY. 2007: 56–67
  • A Practical FPGA-based Framework for Novel CMP Research 15th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Wee, S., Casper, J., Njoroge, N., Teslyar, Y., Ge, D., Kozyrakis, C., Olukotun, K. ASSOC COMPUTING MACHINERY. 2007: 116–125
  • Towards Soft Optimization Techniques for Parallel Cognitive Applications 19th Annual Symposium on Parallelism in Algorithms and Architectures Baek, W., Chung, J., Minh, C. C., Kozyrakis, C., Olukotun, K. ASSOC COMPUTING MACHINERY. 2007: 59–60
  • A scalable, non-blocking approach to transactional memory 13th International Symposium on High-Performance Computer Architecture Chafi, H., Casper, J., Carlstrom, B. D., McDonald, A., Minh, C. C., Baek, W., Kozyrakis, C., Olukotun, K. IEEE COMPUTER SOC. 2007: 97–108
  • ATLAS: A chip-multiprocessor with Transactional Memory support Design, Automation and Test in Europe Conference and Exhibition (DATE 07) Njoroge, N., Casper, J., Wee, S., Teslyar, Y., Ge, D., Kozyrakis, C., Olukotun, K. IEEE. 2007: 3–8
  • Executing Java programs with transactional memory OOPSLA Workshop on Synchronization and Concurrency in Object-Oriented Languages Carlstrom, B. D., Chung, J., Chafi, H., McDonald, A., Minh, C. C., Hammond, L., Kozyrakis, C., Olukotun, K. ELSEVIER SCIENCE BV. 2006: 111–29
  • Tradeoffs in transactional memory virtualization ACM SIGPLAN NOTICES Chung, J., Minh, C. C., McDonald, A., Skare, T., Chafi, H., Carlstrom, B. D., Kozyrakis, C., Olukotun, K. 2006; 41 (11): 371-381
  • The ATOMOΣ transactional programming language ACM SIGPLAN NOTICES Carlstrom, B. D., McDonald, A., Chafi, H., Chung, J., Minh, C. C., Kozyrakis, C., Olukotun, K. 2006; 41 (6): 1-13
  • The Atomos Transactional Programming Language Carlstrom, B. D., McDonald, A., Chafi, H., Chung, J., Minh, C. C., Kozyrakis, C., Olukotun, O. A. 2006
  • Architectural semantics for practical Transactional Memory 33rd International Symposium on Computer Architecture McDonald, A., Chung, J., Carlstrom, B. D., Minh, C. C., Chafi, H., Kozyrakis, C., Olukotun, K. IEEE COMPUTER SOC. 2006: 53–64
  • The common case transactional behavior of multithreaded programs 12th International Symposium on High-Performance Computer Architecture Chung, J., Chafi, H., Minh, C. C., McDonald, A., Carlstrom, B., Kozyrakis, C., Olukotun, K. IEEE COMPUTER SOC. 2006: 271–282
  • The Common Case Transactional Behavior of Multithreaded Programs Chung, J., Chafi, H., Minh, C. C., McDonald, A., Carlstrom, B. D., Kozyrakis, C., Olukotun, O. A. 2006
  • Architectural Semantics for Practical Transactional Memory McDonald, A., Chung, J., Carlstrom, B. D., Minh, C. C., Chafi, H., Kozyrakis, C., Olukotun, O. A. 2006
  • The Software Stack for Transactional Memory: Challenges and Opportunities Carlstrom, B. D., Chung, J., Kozyrakis, C., Olukotun, K. 2006
  • Tradeoffs in Transactional Memory Virtualizations Chung, J., Minh, C. C., McDonald, A., Chafi, H., Carlstrom, B. D., Skare, T., Olukotun, O. A. 2006
  • Niagara: A 32-way multithreaded SPARC processor IEEE MICRO Kongetira, P., Aingaran, K., Olukotun, K. 2005; 25 (2): 21-29
  • The Future of Microprocessors ACM QUEUE Magazine Olukotun, K., Hammond, L. 2005
  • Maximizing CMP throughput with mediocre cores PACT 2005: 14TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES Davis, J. D., Laudon, J., Olukotun, K. 2005: 51-62
  • A new approach to programming and prototyping parallel systems HIGH PERFORMANCE COMPUTING - HIPC 2005, PROCEEDINGS Olukotun, K. 2005; 3769: 4-4
  • Characterization of TCC on chip-multiprocessors 14th International Conference on Parallel Architectures and Compilation Techniques McDonald, A., Chung, J. W., Chafi, H., Minh, C. C., Carlstrom, B. D., Hammond, L., Kozyrakis, C., Olukotun, K. IEEE COMPUTER SOC. 2005: 63–74
  • Maximizing CMP Throughput with Mediocre Cores Davis, J. D., Laudon, J., Olukotun, K. 2005
  • TAPE: A Transactional Application Profiling Environment Chafi, H., Minh, C. C., McDonald, A., Carlstrom, B. D., Chung, J., Hammond, L., Olukotun, O. A. 2005
  • Article about Kunle Olukotun's Niagara processor: Sun's Big Splash IEEE Spectrum Magazine Olukotun, K., Geppert, L. 2005
  • Transactional Execution of Java Programs Carlstrom, B. D., Chung, J., Chafi, H., McDonald, A., Minh, C. C., Hammond, L., Olukotun, O. A. 2005
  • Exposing Speculative Thread Parallelism in SPEC2000 Prabhu, M., Olukotun, K. 2005
  • Characterization of TCC on Chip-Multiprocessors McDonald, A., Chung, J., Chafi, H., Minh, C. C., Carlstrom, B. D., Hammond, L., Olukotun, O. A. 2005
  • Transactional coherence and consistency: Simplifying parallel hardware and software IEEE MICRO Hammond, L., Carlstrom, B. D., Wong, V., Chen, M., Kozyrakis, C., Olukotun, K. 2004; 24 (6): 92-103
  • Programming with transactional coherence and consistency (TCC) 11th International Conference on Architectural Support for Programming Languages and Operating Systems Hammond, L., Carlstrom, B. D., Wong, V., Hertzberg, B., Chen, M., Kozyrakis, C., Olukotun, K. ASSOC COMPUTING MACHINERY. 2004: 1–13
  • Transactional Coherence and Consistency: Simplifying Parallel Hardware and Software Micro's Top Picks, IEEE Micro Hammond, L., Carlstrom, B. D., Wong, V., Chen, M., Kozyrakis, C., Olukotun, K. 2004; 24 (6)
  • Transactional memory coherence and consistency 31st Annual International Symposium on Computer Architecture Hammond, L., Wong, V., Chen, M., Carlstrom, B. D., Davis, J. D., Hertzberg, B., Prabhu, M. K., Wijaya, H., Kozyrakis, C., Olukotun, K. IEEE COMPUTER SOC. 2004: 102–113
  • Niagara: A 32-Way Multithreaded SPARC Processor IEEE MICRO Magazine, March-April 2005, and presented at Hot Chips Kongetira, P., Aingaran, K., Olukotun, K. 2004
  • Transactional Memory Coherence and Consistency Hammond, L., Wong, V., Chen, M., Hertzberg, B., Carlstrom, B. D., Davis, J. D., Olukotun, O. A. 2004
  • Programming with Transactional Coherence and Consistency (TCC) Hammond, L., Carlstrom, B. D., Wong, V., Hertzberg, B., Chen, M., Kozyrakis, C., Olukotun, O. A. 2004
  • The Jrpm system for dynamically parallelizing sequential Java programs IEEE MICRO Chen, M. K., Olukotun, K. 2003; 23 (6): 26-35
  • Using thread-level speculation to simplify manual parallelization 9th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming Prabhu, M. K., Olukotun, K. ASSOC COMPUTING MACHINERY. 2003: 1–12
  • Using Thread-Level Speculation to Simplify Manual Parallelization Prabhu, M., Olukotun, K. 2003
  • The Jrpm system for dynamically parallelizing Java programs 30TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, PROCEEDINGS Chen, M. K., Olukotun, K. 2003: 434-445
  • TEST: A tracer for extracting speculative threads CGO 2003: INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION Chen, M., Olukotun, K. 2003: 301-312
  • The Jrpm System for Dynamically Parallelizing Java Programs Chen, M., Olukotun, K. 2003
  • TEST: A Tracer for Extracting Speculative Threads Chen, M., Olukotun, K. 2003
  • Targeting dynamic compilation for embedded environments USENIX ASSOCIATION PROCEEDINGS OF THE 2ND JAVA(TM) VIRTUAL MACHINE RESEARCH AND TECHNOLOGY SYMPOSIUM Chen, M., Olukotun, K. 2002: 151-164
  • Efficient state representation for symbolic simulation 39TH DESIGN AUTOMATION CONFERENCE, PROCEEDINGS 2002 Bertacco, V., Olukotun, K. 2002: 99-104
  • High bandwidth on-chip cache design IEEE TRANSACTIONS ON COMPUTERS Wilson, K. M., Olukotun, K. 2001; 50 (4): 292-307
  • The Stanford Hydra CMP IEEE MICRO Hammond, L., Hubbert, B. A., Siu, M., Prabhu, M. K., Chen, M., Olukotun, K. 2000; 20 (2): 71-84
  • A single chip multiprocessor integrated with high density DRAM IEICE TRANSACTIONS ON ELECTRONICS Yamauchi, T., Hammond, L., Olukotun, O. A., Arimoto, K. 1999; E82C (8): 1567-1577
  • REMARC: Reconfigurable multimedia array coprocessor IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS Miyamori, T., Olukotun, K. 1999; E82D (2): 389-397
  • The Stanford Hydra CMP IEEE MICRO Magazine, March-April 2000, and presented at Hot Chips Hammond, L., Hubbert, B., Siu, M., Prabhu, M., Chen, M., Olukotun, K. 1999
  • Improving the Performance of Speculatively Parallel Applications on the Hydra CMP Olukotun, K., Hammond, L., Willey, M. 1999
  • Data speculation support for a chip multiprocessor ACM SIGPLAN NOTICES Hammond, L., Willey, M., Olukotun, K. 1998; 33 (11): 58-69
  • Considerations in the Design of Hydra: A Multiprocessor-on-a-Chip Microarchitecture Stanford University Computer Systems Lab Technical Report CSL-TR-98-749 Hammond, L., Olukotun, K. 1998
  • Digital system simulation: Methodologies and examples 35th Design Automation Conference Olukotun, K., Heinrich, M., Ofelt, D. ASSOC COMPUTING MACHINERY. 1998: 658–663
  • Exploiting method-level parallelism in single-threaded Java programs International Conference on Parallel Architectures and Compilation Techniques Chen, M. K., Olukotun, K. IEEE COMPUTER SOC. 1998: 176–184
  • DCP: an algorithm for datapath/control partitioning of synthesizable RTL models International Conference on Computer Design: VLSI in Computers and Processors Lam, V. J., OLUKOTUN, K. A. I E E E, COMPUTER SOC PRESS. 1998: 442–449
  • Data Speculation Support for a Chip Multiprocessor Hammond, L., Willey, M., Olukotun, K. 1998
  • Exploiting Method-Level Parallelism in Single-Threaded Java Programs Chen, M., Olukotun, K. 1998
  • Multilevel optimization of pipelined caches IEEE TRANSACTIONS ON COMPUTERS Olukotun, K., Mudge, T. N., Brown, R. B. 1997; 46 (10): 1093-1102
  • A single-chip multiprocessor COMPUTER NAYFEH, B. A., Olukotun, K. 1997; 30 (9): 79-?
  • A Single Chip Multiprocessor Integrated with DRAM Yamauchi, T., Hammond, L., Olukotun, K. 1997
  • Java as a specification language for hardware-software systems 1997 IEEE/ACM International Conference on Computer-Aided Design (ICCAD 97) HELAIHEL, R., Olukotun, K. I E E E, COMPUTER SOC PRESS. 1997: 690–697
  • Verifying correct pipeline implementation for microprocessors 1997 IEEE/ACM International Conference on Computer-Aided Design (ICCAD 97) LEVITT, J., Olukotun, K. I E E E, COMPUTER SOC PRESS. 1997: 162–169
  • Designing high bandwidth on-chip caches 24th Annual International Symposium on Computer Architecture Wilson, K. M., Olukotun, K. ASSOC COMPUTING MACHINERY. 1997: 121–132
  • A Single-Chip Multiprocessor IEEE Computer Special Issue on "Billion-Transistor Processors" Hammond, L., Nayfeh, B. A., Olukotun, K. 1997
  • Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor Stanford University Computer Systems Lab Technical Report CSL-TR-97-715 Oplinger, J., Heine, D., Liao, S., Nayfeh, B. A., Lam, M., Olukotun, K. 1997
  • The case for a single-chip multiprocessor ACM SIGPLAN NOTICES Olukotun, K., NAYFEH, B. A., Hammond, L., Wilson, K., Chang, K. Y. 1996; 31 (9): 2-11
  • The Case for a Single-Chip Multiprocessor Olukotun, K., Nayfeh, B. A., Hammond, L., Wilson, K., Chang, K. 1996
  • A scalable formal verification methodology for pipelined microprocessors 33rd Design Automation Conference LEVITT, J., Olukotun, K. ASSOC COMPUTING MACHINERY. 1996: 558–563
  • The impact of shared-cache clustering in small-scale shared-memory multiprocessors 2nd International Symposium on High-Performance Computer Architecture (HPCA-2) NAYFEH, B. A., Olukotun, K., Singh, J. P. I E E E, COMPUTER SOC PRESS. 1996: 74–84
  • Evaluation of design alternatives for a multiprocessor microprocessor 23rd Annual International Symposium on Computer Architecture Nayfeh, B. A., Hammond, L., Olukotun, K. ASSOC COMPUTING MACHINERY. 1996: 67–77
  • Emulation and prototyping of digital systems NATO Advanced Study Institute on Hardware/Software Co-Design HELAIHEL, R., Olukotun, K. SPRINGER. 1996: 339–366
  • Increasing cache port efficiency for dynamic superscalar microprocessors 23rd Annual International Symposium on Computer Architecture Wilson, K. M., Olukotun, K., Rosenblum, M. ASSOC COMPUTING MACHINERY. 1996: 147–157
  • Evaluation of Design Alternatives for a Multiprocessor Microprocessor Nayfeh, B. A., Hammond, L., Olukotun, K. 1996
  • The benefits of clustering in shared address space multiprocessors: An applications-driven investigation 1995 ACM/IEEE Supercomputing Conference (SC 95) Erlichson, A., NAYFEH, B. A., Singh, J. P., Olukotun, K. ASSOC COMPUTING MACHINERY. 1995: 1674–1704
  • A general method for compiling event driven simulations 32nd Design Automation Conference French, R. S., Lam, M. S., LEVITT, J. R., Olukotun, K. ASSOC COMPUTING MACHINERY. 1995: 151–156
  • A SOFTWARE-HARDWARE COSYNTHESIS APPROACH TO DIGITAL SYSTEM SIMULATION IEEE MICRO OLUKOTUN, K. A., HELAIHEL, R., LEVITT, J., Ramirez, R. 1994; 14 (4): 48-58
  • Rationale and Design of the Hydra Multiprocessor Stanford University Computer Systems Lab Technical Report CSL-TR-94-645 Olukotun, K., Bergmann, J., Chang, K., Nayfeh, B. A. 1994
  • EXPLORING THE DESIGN SPACE FOR A SHARED-CACHE MULTIPROCESSOR 21st Annual International Symposium on Computer Architecture NAYFEH, B. A., Olukotun, K. I E E E, COMPUTER SOC PRESS. 1994: 166–175
  • ANALYSIS AND DESIGN OF LATCH-CONTROLLED SYNCHRONOUS DIGITAL CIRCUITS IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS Sakallah, K. A., Mudge, T. N., Olukotun, O. A. 1992; 11 (3): 322-333
  • THE DESIGN OF A MICROSUPERCOMPUTER COMPUTER Mudge, T. N., Brown, R. B., Birmingham, W. P., DYKSTRA, J. A., Kayssi, A. I., Lomax, R. J., Olukotun, O. A., Sakallah, K. A., MILANO, R. A. 1991; 24 (1): 57-64
  • IMPLEMENTING A CACHE FOR A HIGH-PERFORMANCE GAAS MICROPROCESSOR 18TH ANNUAL INTERNATIONAL SYMP ON COMPUTER ARCHITECTURE Olukotun, O. A., Mudge, T. N., Brown, R. B. ASSOC COMPUTING MACHINERY. 1991: 138–147
  • HIERARCHICAL GATE-ARRAY ROUTING ON A HYPERCUBE MULTIPROCESSOR JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING Olukotun, O. A., Mudge, T. N. 1990; 8 (4): 313-324
  • INTERCONNECTING OFF-THE-SHELF MICROPROCESSORS AFIPS CONFERENCE PROCEEDINGS ALSADOUN, H. B., Olukotun, O. A., Mudge, T. N. 1985; 54: 175-?
  • Plasticine: A Reconfigurable Architecture For Parallel Patterns ISCA '17: 44th International Symposium on Computer Architecture, June 2017 Prabhakar, R., Zhang, Y., Koeplinger, D., Feldman, M., Zhao, T., Hadjis, S., Pedram, A., Kozyrakis, C., Olukotun, K. 2017

    View details for DOI 10.1145/3079856.3080256