All Publications


  • A statistical, reference-free algorithm subsumes myriad problems in genome science and enables novel discovery. bioRxiv : the preprint server for biology Chaung, K., Baharav, T., Zheludev, I., Salzman, J. 2022

    Abstract

    We present a unifying statistical formulation for many fundamental problems in genome science and develop a reference-free, highly efficient algorithm that solves it. Sequence diversification - nucleic acid mutation, rearrangement, and reassortment - is necessary for the differentiation and adaptation of all replicating organisms. Identifying sample-dependent sequence diversification, e.g. adaptation or regulated isoform expression, is fundamental to many biological studies, and is achieved today with next-generation sequencing. Paradoxically, current analyses begin with attempts to align to or assemble necessarily incomplete reference genomes, a step that is at odds with detecting the most important examples of sequence diversification. In addition to being computationally expensive, reference-first approaches suffer from diminished discovery power: they are blind to unaligned or mis-aligned sequences. We provide a unifying formulation for detecting sample-dependent sequence diversification that subsumes core problems faced in diverse biological fields. This formulation allows us to construct an algorithm that performs inference on raw reads, avoiding references completely. We illustrate the power of our approach for new data-driven biological discovery with examples of novel single-cell resolved, cell-type-specific isoform expression, including expression in the major histocompatibility complex, and de novo prediction of viral protein adaptation including in SARS-CoV-2.

    View details for DOI 10.1101/2022.06.24.497555

    View details for PubMedID 35794890

  • Approximate Function Evaluation via Multi-Armed Bandits Baharav, T. Z., Cheng, G., Pilanci, M., Tse, D., Camps-Valls, G., Ruiz, F. J., Valera JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2022: 108-135
  • Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments. Patterns (New York, N.Y.) Baharav, T. Z., Kamath, G. M., Tse, D. N., Shomorony, I. 2020; 1 (6): 100081

    Abstract

    Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise k-mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases and repeats), Jaccard similarity is no longer a good proxy for alignment size. In this work, we introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We empirically show that this new metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores.

    View details for DOI 10.1016/j.patter.2020.100081

    View details for PubMedID 33205128

  • Adaptive Learning of Rank-One Models for Efficient Pairwise Sequence Alignment Kamath, G. M., Baharav, T. Z., Shomorony, I., Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., Lin, H. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2020
  • My Fair Bandit: Distributed Learning of Max-Min Fairness with Multi-player Bandits Bistritz, I., Baharav, T. Z., Leshem, A., Bambos, N., Daume, H., Singh, A. JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2020
  • Ultra Fast Medoid Identification via Correlated Sequential Halving Baharav, T. Z., Tse, D., Wallach, H., Larochelle, H., Beygelzimer, A., d'Alche-Buc, F., Fox, E., Garnett, R. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2019