A statistical, reference-free algorithm subsumes myriad problems in genome science and enables novel discovery.
bioRxiv : the preprint server for biology
We present a unifying statistical formulation for many fundamental problems in genome science and develop a reference-free, highly efficient algorithm that solves it. Sequence diversification - nucleic acid mutation, rearrangement, and reassortment - is necessary for the differentiation and adaptation of all replicating organisms. Identifying sample-dependent sequence diversification, e.g. adaptation or regulated isoform expression, is fundamental to many biological studies, and is achieved today with next-generation sequencing. Paradoxically, current analyses begin with attempts to align to or assemble necessarily incomplete reference genomes, a step that is at odds with detecting the most important examples of sequence diversification. In addition to being computationally expensive, reference-first approaches suffer from diminished discovery power: they are blind to unaligned or mis-aligned sequences. We provide a unifying formulation for detecting sample-dependent sequence diversification that subsumes core problems faced in diverse biological fields. This formulation allows us to construct an algorithm that performs inference on raw reads, avoiding references completely. We illustrate the power of our approach for new data-driven biological discovery with examples of novel single-cell resolved, cell-type-specific isoform expression, including expression in the major histocompatibility complex, and de novo prediction of viral protein adaptation including in SARS-CoV-2.
View details for DOI 10.1101/2022.06.24.497555
View details for PubMedID 35794890
Approximate Function Evaluation via Multi-Armed Bandits
JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2022: 108-135
View details for Web of Science ID 000828072700007
Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments.
Patterns (New York, N.Y.)
2020; 1 (6): 100081
Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise k-mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases and repeats), Jaccard similarity is no longer a good proxy for alignment size. In this work, we introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We empirically show that this new metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores.
View details for DOI 10.1016/j.patter.2020.100081
View details for PubMedID 33205128
Adaptive Learning of Rank-One Models for Efficient Pairwise Sequence Alignment
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2020
View details for Web of Science ID 000627697000040
My Fair Bandit: Distributed Learning of Max-Min Fairness with Multi-player Bandits
JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2020
View details for Web of Science ID 000683178501003
Ultra Fast Medoid Identification via Correlated Sequential Halving
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2019
View details for Web of Science ID 000534424303062