Bio


Jiajun Wu is an Assistant Professor of Computer Science at Stanford University, working on computer vision, machine learning, and computational cognitive science. Before joining Stanford, he was a Visiting Faculty Researcher at Google Research. He received his PhD in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology. Wu's research has been recognized through the ACM Doctoral Dissertation Award Honorable Mention, the MIT George M. Sprowls PhD Thesis Award in Artificial Intelligence and Decision-Making, the IROS Best Paper Award on Cognitive Robotics, and fellowships from Facebook, Nvidia, Samsung, and Adobe.

Honors & Awards


  • ACM Doctoral Dissertation Award Honorable Mention, ACM (2019)
  • George M. Sprowls PhD Thesis Award in Artificial Intelligence and Decision-Making, MIT (2019)
  • IROS Best Paper Award on Cognitive Robotics, IEEE (2018)

Professional Education


  • Ph.D., MIT, EECS (2020)
  • S.M., MIT, EECS (2016)

Stanford Advisees


  • Doctoral Dissertation Co-Advisor (AC)
    Michelle Guo, Sumith Kulal
  • Master's Program Advisor
    Yinan Zhang, Qirui Zhou

All Publications


  • Visual Dynamics: Stochastic Future Generation via Layered Cross Convolutional Networks IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Xue, T., Wu, J., Bouman, K. L., Freeman, W. T. 2019; 41 (9): 2236–50

    Abstract

    We study the problem of synthesizing a number of likely future frames from a single input image. In contrast to traditional methods that have tackled this problem in a deterministic or non-parametric way, we propose to model future frames in a probabilistic manner. Our probabilistic model makes it possible for us to sample and synthesize many possible future frames from a single input image. To synthesize realistic movement of objects, we propose a novel network structure, namely a Cross Convolutional Network; this network encodes image and motion information as feature maps and convolutional kernels, respectively. In experiments, our model performs well on synthetic data, such as 2D shapes and animated game sprites, and on real-world video frames. We present analyses of the learned network representations, showing it is implicitly learning a compact encoding of object appearance and motion. We also demonstrate a few of its applications, including visual analogy-making and video extrapolation.

    DOI: 10.1109/TPAMI.2018.2854726
    Web of Science ID: 000480343900014
    PubMedID: 30004870

  • See, feel, act: Hierarchical learning for complex manipulation skills with multisensory fusion SCIENCE ROBOTICS Fazeli, N., Oller, M., Wu, J., Wu, Z., Tenenbaum, J. B., Rodriguez, A. 2019; 4 (26)
  • 3D Interpreter Networks for Viewer-Centered Wireframe Modeling INTERNATIONAL JOURNAL OF COMPUTER VISION Wu, J., Xue, T., Lim, J. J., Tian, Y., Tenenbaum, J. B., Torralba, A., Freeman, W. T. 2018; 126 (9): 1009–26
  • Video Enhancement with Task-Oriented Flow INTERNATIONAL JOURNAL OF COMPUTER VISION Xue, T., Chen, B., Wu, J., Wei, D., Freeman, W. T. 2019; 127 (8): 1106–25
  • An integrative computational architecture for object-driven cortex CURRENT OPINION IN NEUROBIOLOGY Yildirim, I., Wu, J., Kanwisher, N., Tenenbaum, J. 2019; 55: 73–81

    Abstract

    Objects in motion activate multiple cortical regions in every lobe of the human brain. Do these regions represent a collection of independent systems, or is there an overarching functional architecture spanning all of object-driven cortex? Inspired by recent work in artificial intelligence (AI), machine learning, and cognitive science, we consider the hypothesis that these regions can be understood as a coherent network implementing an integrative computational system that unifies the functions needed to perceive, predict, reason about, and plan with physical objects, as in the paradigmatic case of using or making tools. Our proposal draws on a modeling framework that combines multiple AI methods, including causal generative models, hybrid symbolic-continuous planning algorithms, and neural recognition networks, with object-centric, physics-based representations. We review evidence relating specific components of our proposal to the specific regions that comprise object-driven cortex, and lay out future research directions with the goal of building a complete functional and mechanistic account of this system.

    DOI: 10.1016/j.conb.2019.01.010
    Web of Science ID: 000472127600011
    PubMedID: 30825704
    PubMedCentralID: PMC6548583

  • Visual Concept-Metaconcept Learning Han, C., Mao, J., Gan, C., Tenenbaum, J. B., Wu, J. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2019
  • Combining Physical Simulators and Object-Based Networks for Control Ajay, A., Bauza, M., Wu, J., Fazeli, N., Tenenbaum, J. B., Rodriguez, A., Kaelbling, L. P. IEEE. 2019: 3217–23
  • Propagation Networks for Model-Based Control Under Partial Observation Li, Y., Wu, J., Zhu, J., Tenenbaum, J. B., Torralba, A., Tedrake, R. IEEE. 2019: 1205–11
  • ChainQueen: A Real-Time Differentiable Physical Simulator for Soft Robotics Hu, Y., Liu, J., Spielberg, A., Tenenbaum, J. B., Freeman, W. T., Wu, J., Rus, D., Matusik, W. IEEE. 2019: 6265–71
  • Program-Guided Image Manipulators Mao, J., Zhang, X., Li, Y., Freeman, W. T., Tenenbaum, J. B., Wu, J. IEEE COMPUTER SOC. 2019: 4029–38
  • Modeling Expectation Violation in Intuitive Physics with Coarse Probabilistic Object Representations Smith, K. A., Mei, L., Yao, S., Wu, J., Spelke, E., Tenenbaum, J. B., Ullman, T. D. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2019
  • Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning INTERNATIONAL JOURNAL OF COMPUTER VISION Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., Torralba, A. 2018; 126 (10): 1120–37
  • Augmenting Physical Simulators with Stochastic Neural Networks: Case Study of Planar Pushing and Bouncing Ajay, A., Wu, J., Fazeli, N., Bauza, M., Kaelbling, L. P., Tenenbaum, J. B., Rodriguez, A. IEEE. 2018: 3066–73
  • Unsupervised Learning of Latent Physical Properties Using Perception-Prediction Networks Zheng, D., Luo, V., Wu, J., Tenenbaum, J. B. AUAI PRESS. 2018: 497–507
  • MoSculp: Interactive Visualization of Shape and Time Zhang, X., Dekel, T., Xue, T., Owens, A., He, Q., Wu, J., Mueller, S., Freeman, W. T. ASSOC COMPUTING MACHINERY. 2018: 275–85
  • Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification Long, X., Gan, C., de Melo, G., Wu, J., Liu, X., Wen, S. IEEE. 2018: 7834–43
  • Pix3D: Dataset and Methods for Single-Image 3D Shape Modeling Sun, X., Wu, J., Zhang, X., Zhang, Z., Zhang, C., Xue, T., Tenenbaum, J. B., Freeman, W. T. IEEE. 2018: 2974–83
  • 3D Shape Perception from Monocular Vision, Touch, and Shape Priors Wang, S., Wu, J., Sun, X., Yuan, W., Freeman, W. T., Tenenbaum, J. B., Adelson, E. H. IEEE. 2018: 1606–13
  • 3D-Aware Scene Manipulation via Inverse Graphics Yao, S., Hsu, T., Zhu, J., Wu, J., Torralba, A., Freeman, W. T., Tenenbaum, J. B. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2018
  • Visual Object Networks: Image Generation with Disentangled 3D Representation Zhu, J., Zhang, Z., Zhang, C., Wu, J., Torralba, A., Tenenbaum, J. B., Freeman, W. T. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2018
  • Learning to Exploit Stability for 3D Scene Parsing Du, Y., Liu, Z., Basevi, H., Leonardis, A., Freeman, W. T., Tenenbaum, J. B., Wu, J. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2018
  • Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., Tenenbaum, J. B. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2018
  • Learning to See Physics via Visual De-animation Wu, J., Lu, E., Kohli, P., Freeman, W. T., Tenenbaum, J. B. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2017
  • MarrNet: 3D Shape Reconstruction via 2.5D Sketches Wu, J., Wang, Y., Xue, T., Sun, X., Freeman, W. T., Tenenbaum, J. B. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2017
  • Synthesizing 3D Shapes via Modeling Multi-View Depth Maps and Silhouettes with Deep Generative Networks Soltani, A., Huang, H., Wu, J., Kulkarni, T. D., Tenenbaum, J. B. IEEE. 2017: 2511–19
  • Neural Scene De-rendering Wu, J., Tenenbaum, J. B., Kohli, P. IEEE. 2017: 7035–43
  • Raster-to-Vector: Revisiting Floorplan Transformation Liu, C., Wu, J., Kohli, P., Furukawa, Y. IEEE. 2017: 2214–22
  • Generative Modeling of Audible Shapes for Object Perception Zhang, Z., Wu, J., Li, Q., Huang, Z., Traer, J., McDermott, J. H., Tenenbaum, J. B., Freeman, W. T. IEEE. 2017: 1260–69
  • Shape and Material from Sound Zhang, Z., Li, Q., Huang, Z., Wu, J., Tenenbaum, J. B., Freeman, W. T. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2017
  • Self-Supervised Intrinsic Image Decomposition Janner, M., Wu, J., Kulkarni, T. D., Yildirim, I., Tenenbaum, J. B. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2017
  • Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks Xue, T., Wu, J., Bouman, K. L., Freeman, W. T. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2016
  • Single Image 3D Interpreter Network Wu, J., Xue, T., Lim, J. J., Tian, Y., Tenenbaum, J. B., Torralba, A., Freeman, W. T. SPRINGER INT PUBLISHING AG. 2016: 365–82
  • Ambient Sound Provides Supervision for Visual Learning Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., Torralba, A. SPRINGER INTERNATIONAL PUBLISHING AG. 2016: 801–16
  • Unsupervised Object Class Discovery via Saliency-Guided Multiple Class Learning IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Zhu, J., Wu, J., Xu, Y., Chang, E., Tu, Z. 2015; 37 (4): 862–75

    Abstract

    In this paper, we tackle the problem of common object (multiple classes) discovery from a set of input images, where we assume the presence of one object class in each image. This problem is, loosely speaking, unsupervised since we do not know a priori about the object type, location, and scale in each image. We observe that the general task of object class discovery in a fully unsupervised manner is intrinsically ambiguous; here we adopt saliency detection to propose candidate image windows/patches to turn an unsupervised learning problem into a weakly-supervised learning problem. In the paper, we propose an algorithm for simultaneously localizing objects and discovering object classes via bottom-up (saliency-guided) multiple class learning (bMCL). Our contributions are three-fold: (1) we adopt saliency detection to convert unsupervised learning into multiple instance learning, formulated as bottom-up multiple class learning (bMCL); (2) we propose an integrated framework that simultaneously performs object localization, object class discovery, and object detector training; (3) we demonstrate that our framework yields significant improvements over existing methods for multi-class object discovery and possesses evident advantages over competing methods in computer vision. In addition, although saliency detection has recently attracted much attention, its practical usage for high-level vision tasks has yet to be justified. Our method validates the usefulness of saliency detection to output "noisy input" for a top-down method to extract common patterns.

    DOI: 10.1109/TPAMI.2014.2353617
    Web of Science ID: 000351213400012
    PubMedID: 26353299

  • Deep Multiple Instance Learning for Image Classification and Auto-Annotation Wu, J., Yu, Y., Huang, C., Yu, K. IEEE. 2015: 3460–69
  • MILCut: A Sweeping Line Multiple Instance Learning Paradigm for Interactive Image Segmentation Wu, J., Zhao, Y., Zhu, J., Luo, S., Tu, Z. IEEE. 2014: 256–63
  • Harvesting Mid-level Visual Concepts from Large-scale Internet Images Li, Q., Wu, J., Tu, Z. IEEE. 2013: 851–58
  • A classification approach to coreference in discharge summaries: 2011 i2b2 challenge JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Xu, Y., Liu, J., Wu, J., Wang, Y., Tu, Z., Sun, J., Tsujii, J., Chang, E. 2012; 19 (5): 897–905

    Abstract

    To create a highly accurate coreference system in discharge summaries for the 2011 i2b2 challenge. The coreference categories include Person, Problem, Treatment, and Test. An integrated coreference resolution system was developed by exploiting Person attributes, contextual semantic clues, and world knowledge. It includes three subsystems: a Person coreference system based on three Person attributes, a Problem/Treatment/Test system based on numerous contextual semantic extractors and world knowledge, and a Pronoun system based on a multi-class support vector machine classifier. The three Person attributes are patient, relative, and hospital personnel. Contextual semantic extractors include anatomy, position, medication, indicator, temporal, spatial, section, modifier, equipment, operation, and assertion. The world knowledge is extracted from external resources such as Wikipedia. Micro-averaged precision, recall, and F-measure in MUC, BCubed, and CEAF were used to evaluate results. The system achieved an overall micro-averaged precision, recall, and F-measure of 0.906, 0.925, and 0.915, respectively, on test data (from four hospitals) released by the challenge organizers. It achieved a precision, recall, and F-measure of 0.905, 0.920, and 0.913, respectively, on test data without Pittsburgh data. We ranked first out of 20 competing teams. Among the four sub-tasks on Person, Problem, Treatment, and Test, the highest F-measure was seen for Person coreference. This system achieved encouraging results. The Person system can determine whether personal pronouns and proper names are coreferent or not. The Problem/Treatment/Test system benefits from both world knowledge in evaluating the similarity of two mentions and contextual semantic extractors in identifying semantic clues. The Pronoun system can automatically detect whether a Pronoun mention is coreferent to one of the other four types. This study demonstrates that it is feasible to accomplish the coreference task in discharge summaries.

    DOI: 10.1136/amiajnl-2011-000734
    Web of Science ID: 000307934600030
    PubMedID: 22505762
    PubMedCentralID: PMC3422828

  • Unsupervised Object Class Discovery via Saliency-Guided Multiple Class Learning Zhu, J., Wu, J., Wei, Y., Chang, E., Tu, Z. IEEE. 2012: 3218–25