Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments.
Patterns (New York, N.Y.)
2020; 1 (6): 100081
Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise k-mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases and repeats), Jaccard similarity is no longer a good proxy for alignment size. In this work, we introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We empirically show that this new metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores.
View details for DOI 10.1016/j.patter.2020.100081
View details for PubMedID 33205128
Adaptive Learning of Rank-One Models for Efficient Pairwise Sequence Alignment
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2020
View details for Web of Science ID 000627697000040
Ultra Fast Medoid Identification via Correlated Sequential Halving
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2019
View details for Web of Science ID 000534424303062