Professional Education


  • Doctor of Philosophy, University of Iowa (2023)
  • MS, University of Iowa, Computer Science (2018)
  • BE, Harbin Institute of Technology, Bioinformatics (2016)

All Publications


  • Robust self-supervised machine learning for single cell embeddings and annotations. bioRxiv: the preprint server for biology Yeh, C. Y., Sun, M. W., Zhu, D., Jerby, L. 2025

    Abstract

    Dimensionality reduction and clustering are critical steps in single-cell and spatial genomics studies. Here, we show that existing dimensionality reduction and clustering methods suffer from: (1) overfitting to the dominant patterns while missing unique ones, which impairs the detection and annotation of rare cell types and states, and (2) fitting to technical noise over biological signal. To address this, we developed DR-GEM, a self-supervised meta-algorithm that combines principles in distributionally robust optimization with balanced consensus machine learning. DR-GEM supervises itself by (1) using the reconstruction error to identify and reorient its attention to samples/cells that are otherwise poorly embedded, and (2) using balanced consensus learning as a mechanism to increase robustness and mitigate the impact of low-quality samples/cells. Applied to synthetic and real-world single cell 'omics data, single cell resolution spatial transcriptomics, and Perturb-seq datasets, DR-GEM markedly and consistently outperforms existing methods in obtaining reliable embeddings, recovering rare cell types, filtering noise, and uncovering the underlying biology. In summary, this study surfaces and addresses a gap in single cell genomics and brings self-supervision to the realm of dimensionality reduction and clustering to better support data-driven discoveries.

    View details for DOI 10.1101/2025.06.05.658097

    View details for PubMedID 40502088

    View details for PubMedCentralID PMC12157554

  • LibAUC: A deep learning library for X-risk optimization. ACM SIGKDD Conference on Knowledge Discovery and Data Mining Yuan, Z., Zhu, D., Qiu, Z., Li, G., Wang, X., Yang, T. 2023
  • Non-Smooth Weakly-Convex Finite-sum Coupled Compositional Optimization. Conference on Neural Information Processing Systems (NeurIPS) Hu, Q., Zhu, D., Yang, T. 2023
  • Deep unsupervised binary coding networks for multivariate time series retrieval. AAAI Conference on Artificial Intelligence Zhu, D., Song, D., Chen, Y., Lumezanu, C., Cheng, W., Zong, B., Ni, J., Mizoguchi, T., Yang, T., Chen, H. 2020
  • deBWT: parallel construction of Burrows-Wheeler Transform for large collection of genomes with de Bruijn-branch encoding Liu, B., Zhu, D., Wang, Y. OXFORD UNIV PRESS. 2016: 174-182

    Abstract

    With the development of high-throughput sequencing, the number of assembled genomes continues to rise. It is critical to well organize and index many assembled genomes to promote future genomics studies. Burrows-Wheeler Transform (BWT) is an important data structure of genome indexing, which has many fundamental applications; however, it is still non-trivial to construct BWT for large collection of genomes, especially for highly similar or repetitive genomes. Moreover, the state-of-the-art approaches cannot well support scalable parallel computing owing to their incremental nature, which is a bottleneck to use modern computers to accelerate BWT construction. We propose de Bruijn branch-based BWT constructor (deBWT), a novel parallel BWT construction approach. DeBWT innovatively represents and organizes the suffixes of input sequence with a novel data structure, de Bruijn branch encoding. This data structure takes the advantage of de Bruijn graph to facilitate the comparison between the suffixes with long common prefix, which breaks the bottleneck of the BWT construction of repetitive genomic sequences. Meanwhile, deBWT also uses the structure of de Bruijn graph for reducing unnecessary comparisons between suffixes. The benchmarking suggests that, deBWT is efficient and scalable to construct BWT for large dataset by parallel computing. It is well-suited to index many genomes, such as a collection of individual human genomes, with multiple-core servers or clusters. deBWT is implemented in C language, the source code is available at https://github.com/hitbc/deBWT or https://github.com/DixianZhu/deBWT. Contact: ydwang@hit.edu.cn. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btw266

    View details for Web of Science ID 000379734300020

    View details for PubMedID 27307614

    View details for PubMedCentralID PMC4908350
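
For readers unfamiliar with the Burrows-Wheeler Transform discussed in the deBWT abstract, the following is a minimal Python sketch of the textbook construction by direct suffix sorting. It is not the deBWT algorithm and uses no de Bruijn branch encoding. The function names `bwt` and `inverse_bwt` and the `$` sentinel are illustrative assumptions, not taken from the paper or its source code. Each suffix comparison here can scan an entire long common prefix, which is the cost that the abstract describes as the bottleneck for repetitive genomes.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort every suffix of text + sentinel, then emit the
    character that precedes each suffix in sorted order."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # Direct suffix sorting; each comparison may scan a long common prefix,
    # so this is only practical for short inputs.
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # For the suffix starting at 0, s[-1] wraps around to the sentinel.
    return "".join(s[i - 1] for i in suffix_array)


def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Recover the original text from its BWT by following the mapping
    from the sorted first column back into the last column."""
    n = len(last)
    # A stable sort of positions by character yields, for each row, the row
    # obtained by rotating it left by one character.
    nxt = sorted(range(n), key=lambda i: last[i])
    row = last.index(sentinel)  # row holding the unrotated string
    out = []
    for _ in range(n):
        row = nxt[row]
        out.append(last[row])
    return "".join(out[:-1])  # drop the trailing sentinel


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(inverse_bwt(transformed))
```

The sketch assumes the sentinel sorts before every other character in the input, which holds for `$` against uppercase DNA letters and lowercase ASCII text.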