Itschak Weissman
Robert and Barbara Kleist Professor in the School of Engineering
Electrical Engineering
Bio
Tsachy's research focuses on Information Theory, Data Compression and Communications, Statistical Signal Processing, Machine Learning, the interplay between them, and their applications, with recent focus on applications to genomic data compression and processing. He is inventor of several patents and involved in several companies as member of the technical board. IEEE fellow, he serves on the board of governors of the information theory society as well as the editorial boards of the Transactions on Information Theory and Foundations and Trends in Communications and Information Theory. He is founding Director of the Stanford Compression Forum.
Academic Appointments
-
Professor, Electrical Engineering
-
Member, Bio-X
-
Member, Wu Tsai Neurosciences Institute
Administrative Appointments
-
Founder/Director, Stanford Compression Forum (2015 - Present)
-
Academic Affairs Committee, Stanford University (2013 - 2018)
-
Graduate Admissions Committee, Stanford University (2004 - 2012)
Honors & Awards
-
Google Research Award, Google (2016)
-
Yahoo! Faculty Research Award, Yahoo! (2015)
-
Google Research Award, Google (2013)
-
Yahoo! Faculty Research And Engagement Award, Yahoo! (2013)
-
IEEE Fellow Grade, IEEE (2012)
-
Google Research Award, Google (2011)
-
STMicroelectronics Faculty Scholar Award, Stanford University (2011)
Boards, Advisory Committees, Professional Organizations
-
Guest Editor, Proceedings of IEEE (2017 - 2017)
-
Organizer/Technical Program Committee, IEEE (ITW) (2015 - 2015)
-
Technical Program Committee, IEEE (ISIT) (2015 - 2015)
-
Technical Program Committee, IEEE (ISIT) (2014 - 2014)
-
Editorial Board, Foundations and Trends in Communications and Information Theory (2012 - Present)
-
Fellow, IEEE (2012 - Present)
-
Technical Program Committee, IEEE (ISIT) (2012 - 2012)
-
Technical Program Committee, IEEE (ISIT) (2011 - 2011)
-
Editorial Board, Transactions on Information Theory Associate Editor for Shannon Theory (2010 - Present)
-
Organizer/Technical Program Committee, IEEE (ITW) (2010 - 2010)
-
Paper Award Committee, IEEE (2009 - 2009)
-
Technical Program Committee, Data Compression Conference (DCC) (2009 - 2009)
-
Technical Program Committee, IEEE International Symposium on Information Theory (ISIT) (2009 - 2009)
-
Technical Program Committee, Control Over Communication Channels Workshop (ConCom) (2009 - 2009)
-
Technical Program Committee, DCC (Data Compression Conference) (2008 - 2008)
-
Technical Program Committee, IAPR Workshop on Cognitive Information Processing (CIP) (2008 - 2008)
-
Technical Program Committee, IEEE (ISIT) (2007 - 2007)
-
Member, The Mathematical Association of America (MAA) (2006 - Present)
-
Panelist, Allerton Conference (2006 - 2006)
-
Technical Program Committee, IEEE (ISIT) (2005 - 2005)
Professional Education
-
PhD, Technion-Israel Institute of Technology, Electrical Engineering (2001)
-
B.Sc., Technion-Israel Institute of Technology, Electrical Engineering (1997)
Patents
-
Itschak Weissman. "United States Patent US8320687 Universal Lossy Compression Methods", Nov 27, 2012
-
Itschak Weissman. "United States Patent US200404580 Denoising and Error Correction for Finite Input General Output Channel", May 10, 2012
-
Itschak Weissman. "United States Patent US7624009 Method and System for Producing Variable Length Context Models", Nov 24, 2009
-
Itschak Weissman. "United States Patent US7498961 Context Identification Using a Denoised Signal", Mar 3, 2009
-
Itschak Weissman. "United States Patent US7474793 Methods for Compression Using a Denoiser", Jan 6, 2009
-
Itschak Weissman. "United States Patent EP1766832 Discrete Universal Denoising with Error Correction Coding", Dec 10, 2008
-
Itschak Weissman. "United States Patent US7436969 Method and System for Optimizing Denoising Parameters Using Compressibility", Oct 14, 2008
-
Itschak Weissman. "United States Patent US7433427 Enhanced Denoising System Utilizing Incremental Parsing", Oct 7, 2008
-
Itschak Weissman. "United States Patent US7420487 Denoising Video", Sep 2, 2008
-
Itschak Weissman. "United States Patent US7271749 Context-based Denoiser That Simultaneously Updates Probabilities for Multiple Contexts", Sep 18, 2007
-
Itschak Weissman. "United States Patent US7269781 Discrete Universal Denoising with Reliability Information", Sep 11, 2007
-
Itschak Weissman. "United States Patent US7123172 Method and System for Determining an Optimal or Near Optimal Set of Contexts by Constructing a Multi-directional Context Tree", Oct 17, 2006
-
Itschak Weissman. "United States Patent US7047472 Method For Correcting Noise Errors in a Digital Signal", May 16, 2006
2024-25 Courses
- Information Theory
EE 276 (Win) - Universal Information Processing
EE 376C (Aut) -
Independent Studies (7)
- Directed Studies in Applied Physics
APPPHYS 290 (Aut, Win, Spr, Sum) - Master's Thesis and Thesis Research
EE 300 (Aut, Win, Spr, Sum) - Special Studies and Reports in Electrical Engineering
EE 191 (Aut, Win, Spr, Sum) - Special Studies and Reports in Electrical Engineering
EE 391 (Aut, Win, Spr, Sum) - Special Studies and Reports in Electrical Engineering (WIM)
EE 191W (Aut, Win, Spr, Sum) - Special Studies or Projects in Electrical Engineering
EE 190 (Aut, Win, Spr, Sum) - Special Studies or Projects in Electrical Engineering
EE 390 (Aut, Win, Spr, Sum)
- Directed Studies in Applied Physics
-
Prior Year Courses
2023-24 Courses
- Data Compression: Theory and Applications
EE 274 (Aut) - Information Theory
EE 276 (Win)
2022-23 Courses
- Data Compression: Theory and Applications
EE 274 (Aut)
- Data Compression: Theory and Applications
Stanford Advisees
-
Doctoral Dissertation Reader (AC)
Yiqi Jiang, Sreela Kodali, Erin Kunz, Yi Liu, Mark Nishimura, Mira Partha, Saarthak Sarup, Dan Song, Alice Tor, Ajay Tripathi, Praful Vasireddy, Raman Vilkhu, Kailas Vodrahalli -
Doctoral Dissertation Advisor (AC)
Noah Huffman, Pumiao Yan -
Master's Program Advisor
Munir Bshara, Nicholas Bukovec, Aidan Dougherty, Aniketh Iyengar, Xinling Li, Bridget Patrick -
Doctoral (Program)
Lara Arikan, Yiqi Jiang, Sahasrajit Sarmasarkar
All Publications
-
Optimal No-Regret Learning in Repeated First-Price Auctions
OPERATIONS RESEARCH
2024
View details for DOI 10.1287/opre.2020.0282
View details for Web of Science ID 001264870900001
-
Mutual Information Upper Bounds for Uniform Inputs Through the Deletion Channel
IEEE TRANSACTIONS ON INFORMATION THEORY
2024; 70 (7): 4599-4610
View details for DOI 10.1109/TIT.2024.3355841
View details for Web of Science ID 001253773500008
-
Pan-conserved segment tags identify ultra-conserved sequences across assemblies in the human pangenome.
Cell reports methods
2023; 3 (8): 100543
Abstract
The human pangenome, a new reference sequence, addresses many limitations of the current GRCh38 reference. The first release is based on 94 high-quality haploid assemblies from individuals with diverse backgrounds. We employed a k-mer indexing strategy for comparative analysis across multiple assemblies, including the pangenome reference, GRCh38, and CHM13, a telomere-to-telomere reference assembly. Our k-mer indexing approach enabled us to identify a valuable collection of universally conserved sequences across all assemblies, referred to as "pan-conserved segment tags" (PSTs). By examining intervals between these segments, we discerned highly conserved genomic segments and those with structurally related polymorphisms. We found 60,764 polymorphic intervals with unique geo-ethnic features in the pangenome reference. In this study, we utilized ultra-conserved sequences (PSTs) to forge a link between human pangenome assemblies and reference genomes. This methodology enables the examination of any sequence of interest within the pangenome, using the reference genome as a comparative framework.
View details for DOI 10.1016/j.crmeth.2023.100543
View details for PubMedID 37671027
View details for PubMedCentralID PMC10475782
-
Magnetic DNA random access memory with nanopore readouts and exponentially-scaled combinatorial addressing.
Scientific reports
2023; 13 (1): 8514
Abstract
The storage of data in DNA typically involves encoding and synthesizing data into short oligonucleotides, followed by reading with a sequencing instrument. Major challenges include the molecular consumption of synthesized DNA, basecalling errors, and limitations with scaling up read operations for individual data elements. Addressing these challenges, we describe a DNA storage system called MDRAM (Magnetic DNA-based Random Access Memory) that enables repetitive and efficient readouts of targeted files with nanopore-based sequencing. By conjugating synthesized DNA to magnetic agarose beads, we enabled repeated data readouts while preserving the original DNA analyte and maintaining data readout quality. MDRAM utilizes an efficient convolutional coding scheme that leverages soft information in raw nanopore sequencing signals to achieve information reading costs comparable to Illumina sequencing despite higher error rates. Finally, we demonstrate a proof-of-concept DNA-based proto-filesystem that enables an exponentially-scalable data address space using only small numbers of targeting primers for assembly and readout.
View details for DOI 10.1038/s41598-023-29575-z
View details for PubMedID 37231057
-
Neural Network Compression for Noisy Storage Devices
ACM TRANSACTIONS ON EMBEDDED COMPUTING SYSTEMS
2023; 22 (3)
View details for DOI 10.1145/3588436
View details for Web of Science ID 001020424400019
-
Reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach.
Scientific reports
2023; 13 (1): 2082
Abstract
The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files since most existing tools are either general-purpose or specialized for short read data. We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high quality nanopore datasets, NanoSpring, which focuses only on the base sequences in the FASTQ file, uses just 0.35-0.65 bits per base which is 3-6[Formula: see text] lower than general purpose compressors like gzip. NanoSpring is competitive in compression ratio and compression resource usage with the state-of-the-art tool CoLoRd while being significantly faster at decompression when using multiple threads (> 4[Formula: see text] faster decompression with 20 threads). NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring .
View details for DOI 10.1038/s41598-023-29267-8
View details for PubMedID 36747011
-
Exact Optimality of Communication-Privacy-Utility Tradeoffs in Distributed Mean Estimation
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2023
View details for Web of Science ID 001220818806016
-
Txt2Vid-Web: Web-based, Text-to-Video, Video Conferencing Pipeline
IEEE COMPUTER SOC. 2023: 331
View details for DOI 10.1109/DCC55655.2023.00073
View details for Web of Science ID 001008252400034
-
Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text
IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS
2023; 41 (1): 107-118
View details for DOI 10.1109/JSAC.2022.3221953
View details for Web of Science ID 000927934500007
-
KmerKeys: a web resource for searching indexed genome assemblies and variants.
Nucleic acids research
2022
Abstract
K-mers are short DNA sequences that are used for genome sequence analysis. Applications that use k-mers include genome assembly and alignment. However, the wider bioinformatic use of these short sequences has challenges related to the massive scale of genomic sequence data. A single human genome assembly has billions of k-mers. As a result, the computational requirements for analyzing k-mer information is enormous, particularly when involving complete genome assemblies. To address these issues, we developed a new indexing data structure based on a hash table tuned for the lookup of short sequence keys. This web application, referred to as KmerKeys, provides performant, rapid query speeds for cloud computation on genome assemblies. We enable fuzzy as well as exact sequence searches of assemblies. To enable robust and speedy performance, the website implements cache-friendly hash tables, memory mapping and massive parallel processing. Our method employs a scalable and efficient data structure that can be used to jointly index and search a large collection of human genome assembly information. One can include variant databases and their associated metadata such as the gnomAD population variant catalogue. This feature enables the incorporation of future genomic information into sequencing analysis. KmerKeys is freely accessible at https://kmerkeys.dgi-stanford.org.
View details for DOI 10.1093/nar/gkac266
View details for PubMedID 35474383
-
The Human Pangenome Project: a global resource to map genomic diversity.
Nature
2022; 604 (7906): 437-446
Abstract
The human reference genome is the most widely used resource in human genetics and is due for a major update. Its current structure is a linear composite of merged haplotypes from more than 20 people, with a single individual comprising most of the sequence. It contains biases and errors within a framework that does not represent global human genomic variation. A high-quality reference with global representation of common variants, including single-nucleotide variants, structural variants and functional elements, is needed. The Human Pangenome Reference Consortium aims to create a more sophisticated and complete human reference genome with a graph-based, telomere-to-telomere representation of global genomic diversity. Here we leverage innovations in technology, study design and global partnerships with the goalof constructing the highest-possible quality human pangenome reference. Our goal is toimprove data representation and streamline analyses to enable routine assembly of complete diploid genomes. With attention to ethical frameworks, the human pangenome reference will contain a more accurate and diverse representation of global genomic variation, improve gene-disease association studies across populations, expand the scope of genomics research to the most repetitive and polymorphic regions of the genome, and serve as the ultimate genetic resource for future biomedical research and precision medicine.
View details for DOI 10.1038/s41586-022-04601-8
View details for PubMedID 35444317
-
Classification and clustering of RNA crosslink-ligation data reveal complex structures and homodimers.
Genome research
2022
Abstract
The recent development and application of methods based on the general principle of crosslinking and proximity ligation (crosslink-ligation) are revolutionizing RNA structure studies in living cells. However, extracting structure information from such data presents unique challenges. Here we introduce a set of computational tools for the systematic analysis of data from a wide variety of crosslink-ligation methods, specifically focusing on read mapping, alignment classification and clustering. We design a new strategy to map short reads with irregular gaps at high sensitivity and specificity. Analysis of previously published data reveals distinct properties and bias caused by the crosslinking reactions. We perform rigorous and exhaustive classification of alignments and discover 8 types of arrangements that provide distinct information on RNA structures and interactions. To deconvolve the dense and intertwined gapped alignments, we develop a net-work/graph-based tool CRSSANT (Crosslinked RNA Secondary Structure Analysis using Network Techniques), which enables clustering of gapped alignments and discovery of new alternative and dynamic conformations. We discover that multiple crosslinking and ligation events can occur on the same RNA, generating multi-segment alignments to report complex high level RNA structures and multi-RNA interactions. We find that alignments with overlapped segments are produced from potential homodimers and develop a new method for their de novo identification. Analysis of overlapping alignments revealed potential new homodimers in cellular noncoding RNAs and RNA virus genomes in the Picornaviridae family. Together, this suite of computational tools enables rapid and efficient analysis of RNA structure and interaction data in living cells.
View details for DOI 10.1101/gr.275979.121
View details for PubMedID 35332099
-
An Information-Theoretic Justification for Model Pruning
JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2022
View details for Web of Science ID 000828072703040
-
Geometric Lower Bounds for Distributed Parameter Estimation Under Communication Constraints
IEEE TRANSACTIONS ON INFORMATION THEORY
2021; 67 (12): 8248-8263
View details for DOI 10.1109/TIT.2021.3108952
View details for Web of Science ID 000720518300039
-
Optimal Communication Rates and Combinatorial Properties for Common Randomness Generation
IEEE TRANSACTIONS ON INFORMATION THEORY
2021; 67 (12): 7723-7739
View details for DOI 10.1109/TIT.2021.3119976
View details for Web of Science ID 000720518300008
-
Reducing latency and bandwidth for video streaming using keypoint extraction and digital puppetry
IEEE COMPUTER SOC. 2021: 360
View details for DOI 10.1109/DCC50243.2021.00057
View details for Web of Science ID 000675592800059
-
MEOW: A Space-Efficient Nonparametric Bid Shading Algorithm
ASSOC COMPUTING MACHINERY. 2021: 3928-3936
View details for DOI 10.1145/3447548.3467113
View details for Web of Science ID 000749556803097
-
Optimal Communication Rates and Combinatorial Properties for Common Randomness Generation
IEEE. 2021: 1931-1936
View details for DOI 10.1109/ISIT45174.2021.9518090
View details for Web of Science ID 000701502202004
-
OPTIMAL RATES OF ENTROPY ESTIMATION OVER LIPSCHITZ BALLS
ANNALS OF STATISTICS
2020; 48 (6): 3228–50
View details for DOI 10.1214/19-AOS1927
View details for Web of Science ID 000598369200006
-
OVERCOMING HIGH NANOPORE BASECALLER ERROR RATES FOR DNA STORAGE VIA BASECALLER-DECODER INTEGRATION AND CONVOLUTIONAL CODES
IEEE. 2020: 8822–26
View details for Web of Science ID 000615970409020
-
LFZip: Lossy compression of multivariate floating-point time series data via improved prediction
IEEE COMPUTER SOC. 2020: 342–51
View details for DOI 10.1109/DCC47342.2020.00042
View details for Web of Science ID 000591183800035
-
Impact of lossy compression of nanopore raw signal data on basecalling and consensus accuracy
Bioinformatics (Oxford, England)
2020
Abstract
Nanopore sequencing provides a real-time and portable solution to genomic sequencing, enabling better assembly, structural variant discovery and modified base detection than second generation technologies. The sequencing process generates a huge amount of data in the form of raw signal contained in fast5 files, which must be compressed to enable efficient storage and transfer. Since the raw data is inherently noisy, lossy compression has potential to significantly reduce space requirements without adversely impacting performance of downstream applications.We explore the use of lossy compression for nanopore raw data using two state-of-the-art lossy time-series compressors, and evaluate the tradeoff between compressed size and basecalling/consensus accuracy. We test several basecallers and consensus tools on a variety of datasets at varying depths of coverage, and conclude that lossy compression can provide 35–50% further reduction in compressed size of raw data over the state-of-the-art lossless compressor with negligible impact on basecalling accuracy (≲0.2% reduction) and consensus accuracy (≲0.002% reduction). In addition, we evaluate the impact of lossy compression on methylation calling accuracy and observe that this impact is minimal for similar reductions in compressed size, although further evaluation with improved benchmark datasets is required for reaching a definite conclusion. The results suggest the possibility of using lossy compression, potentially on the nanopore sequencing device itself, to achieve significant reductions in storage and transmission costs while preserving the accuracy of downstream applications.The code is available at https://github.com/shubhamchandak94/lossy_compression_evaluation.
View details for DOI 10.1093/bioinformatics/btaa1017
View details for PubMedID 33325499
-
Denoising of Aligned Genomic Data.
Scientific reports
2019; 9 (1): 15067
Abstract
Noise in genomic sequencing data is known to have effects on various stages of genomic data analysis pipelines. Variant identification is an important step of many of these pipelines, and is increasingly being used in clinical settings to aid medical practices. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual chromosome as well as whole genome sequencing (WGS) data sets. In the WGS data set, denoising led to identification of almost 2,000 additional true variants, and elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE .
View details for DOI 10.1038/s41598-019-51418-z
View details for PubMedID 31636330
-
Estimating the Fundamental Limits Is Easier Than Achieving the Fundamental Limits
IEEE TRANSACTIONS ON INFORMATION THEORY
2019; 65 (10): 6704–15
View details for DOI 10.1109/TIT.2019.2927697
View details for Web of Science ID 000487041200041
-
SPRING: a next-generation compressor for FASTQ data
BIOINFORMATICS
2019; 35 (15): 2674–76
View details for DOI 10.1093/bioinformatics/bty1015
View details for Web of Science ID 000484378200023
-
Improved read/write cost tradeoff in DNA-based data storage using LDPC codes
IEEE. 2019: 147–56
View details for Web of Science ID 000535355700022
-
Genomic Data Compression
ANNUAL REVIEW OF BIOMEDICAL DATA SCIENCE, VOL 2, 2019
2019; 2: 19–37
View details for DOI 10.1146/annurev-biodatasci-072018-021229
View details for Web of Science ID 000484798500002
-
Approximate Profile Maximum Likelihood
JOURNAL OF MACHINE LEARNING RESEARCH
2019; 20
View details for Web of Science ID 000487068900006
-
Humans are still the best lossy image compressors
IEEE. 2019: 558
View details for DOI 10.1109/DCC.2019.00070
View details for Web of Science ID 000470908200063
-
SPRING: A next-generation compressor for FASTQ data.
Bioinformatics (Oxford, England)
2018
Abstract
Motivation: High-Throughput Sequencing (HTS) technologies produce huge amounts of data in the form of short genomic reads, associated quality values, and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to completely exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack in support of one or more crucial properties, such as support for variable length reads, scalability to high coverage datasets, pairing-preserving compression and lossless compression.Results: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairingpreserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools, for example, SPRING compresses 195 GB of 25x whole genome human FASTQ from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6x smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources.Availability: SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING.Supplementary Information: Supplementary data are available at Bioinformatics online.
View details for PubMedID 30535063
-
Minimax Estimation of the L-1 Distance
IEEE TRANSACTIONS ON INFORMATION THEORY
2018; 64 (10): 6672–6706
View details for DOI 10.1109/TIT.2018.2846245
View details for Web of Science ID 000444800600016
-
Mutual Information, Relative Entropy and Estimation Error in Semi-Martingale Channels
IEEE TRANSACTIONS ON INFORMATION THEORY
2018; 64 (10): 6662–71
View details for DOI 10.1109/TIT.2018.2829889
View details for Web of Science ID 000444800600015
-
Generalizations of maximal inequalities to arbitrary selection rules
STATISTICS & PROBABILITY LETTERS
2018; 137: 19–25
View details for DOI 10.1016/j.spl.2018.01.002
View details for Web of Science ID 000435046400004
-
Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis
BIOINFORMATICS
2018; 34 (4): 558–67
Abstract
New Generation Sequencing (NGS) technologies for genome sequencing produce large amounts of short genomic reads per experiment, which are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy due to the special structure present in the data.We present a new algorithm for compressing reads both with and without preserving the read order. In both cases, it achieves 1.4×-2× compression gain over state-of-the-art read compression tools for datasets containing as many as 3 billion Illumina reads. Our tool is based on the idea of approximately reordering the reads according to their position in the genome using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and read compression algorithms in general) and helps understand its performance in practice. The algorithm compresses only the read sequence, works with unaligned FASTQ files, and does not require a reference.schandak@stanford.edu.Supplementary material are available at Bioinformatics online. The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.
View details for PubMedID 29444237
View details for PubMedCentralID PMC5860611
-
Minimax redundancy for Markov chains with large state space
IEEE. 2018: 216–20
View details for Web of Science ID 000448139300044
-
Entropy Rate Estimation for Markov Chains with Large State Space
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2018
View details for Web of Science ID 000461852004035
-
Distributed Statistical Estimation of High-Dimensional and Nonparametric Distributions
IEEE. 2018: 506–10
View details for Web of Science ID 000448139300102
-
On Universal Compression with Constant Random Access
IEEE. 2018: 891–95
View details for Web of Science ID 000448139300179
-
Maximum Likelihood Estimation of Functionals of Discrete Distributions
IEEE TRANSACTIONS ON INFORMATION THEORY
2017; 63 (10): 6774–98
View details for DOI 10.1109/TIT.2017.2733537
View details for Web of Science ID 000411021000043
-
DUDE-Seq: Fast, flexible, and robust denoising for targeted amplicon sequencing
PLOS ONE
2017; 12 (7): e0181463
Abstract
We consider the correction of errors from nucleotide sequences produced by next-generation targeted amplicon sequencing. The next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates often tend to be high. Denoising in high-throughput sequencing has thus become a crucial process for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel and effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq.
View details for PubMedID 28749987
-
Relations Between Information and Estimation in Discrete-Time Levy Channels
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2017: 3579–94
View details for DOI 10.1109/TIT.2017.2692211
View details for Web of Science ID 000402058900016
-
Effect of lossy compression of quality scores on variant calling
BRIEFINGS IN BIOINFORMATICS
2017; 18 (2): 183-194
Abstract
Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear.Bioinformatic algorithms to identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets and simulated data, we are able to analyze how accurate the output of the variant calling is, both for the original data and that previously lossily compressed. We show that lossy compression can significantly alleviate the storage while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that using the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.
View details for DOI 10.1093/bib/bbw011
View details for Web of Science ID 000397060200002
-
Effect of lossy compression of quality scores on variant calling.
Briefings in bioinformatics
2017; 18 (2): 183-194
Abstract
Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear.Bioinformatic algorithms to identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets and simulated data, we are able to analyze how accurate the output of the variant calling is, both for the original data and that previously lossily compressed. We show that lossy compression can significantly alleviate the storage while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that using the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.
View details for DOI 10.1093/bib/bbw011
View details for PubMedID 26966283
-
When is Noisy State Information at the Encoder as Useless as No Information or as Good as Noise-Free State?
IEEE TRANSACTIONS ON INFORMATION THEORY
2017; 63 (2): 960-974
View details for DOI 10.1109/TIT.2016.2631150
View details for Web of Science ID 000394667700010
-
Principles and Applications of Science of Information
PROCEEDINGS OF THE IEEE
2017; 105 (2): 183–88
View details for DOI 10.1109/JPROC.2016.2646778
View details for Web of Science ID 000394532000002
-
GeneComp, a new reference-based compressor for SAM files
IEEE COMPUTER SOC. 2017: 330–39
Abstract
The affordability of DNA sequencing has led to unprecedented volumes of genomic data. These data must be stored, processed, and analyzed. The most popular format for genomic data is the SAM format, which contains information such as alignment, quality values, etc. These files are large (on the order of terabytes), which necessitates compression. In this work we propose a new reference-based compressor for SAM files, which can accommodate different levels of compression, based on the specific needs of the user. In particular, the proposed compressor GeneComp allows the user to perform lossy compression of the quality scores, which have been proven to occupy more than half of the compressed file (when losslessly compressed). We show that the proposed compressor GeneComp overall achieves better compression ratios than previously proposed algorithms when working on lossless mode.
View details for PubMedID 29046896
-
Compressing Tabular Data via Pairwise Dependencies
IEEE COMPUTER SOC. 2017: 455
View details for PubMedID 29046897
-
Dependence Measures Bounding the Exploration Bias for General Measurements
IEEE. 2017: 1475–79
View details for Web of Science ID 000430345201108
-
Rateless Lossy Compression via the Extremes
IEEE TRANSACTIONS ON INFORMATION THEORY
2016; 62 (10): 5484-5495
Abstract
We begin by presenting a simple lossy compressor operating at near-zero rate: The encoder merely describes the indices of the few maximal source components, while the decoder's reconstruction is a natural estimate of the source components based on this information. This scheme turns out to be near optimal for the memoryless Gaussian source in the sense of achieving the zero-rate slope of its distortion-rate function. Motivated by this finding, we then propose a scheme comprised of iterating the above lossy compressor on an appropriately transformed version of the difference between the source and its reconstruction from the previous iteration. The proposed scheme achieves the rate distortion function of the Gaussian memoryless source (under squared error distortion) when employed on any finite-variance ergodic source. It further possesses desirable properties, and we, respectively, refer to as infinitesimal successive refinability, ratelessness, and complete separability. Its storage and computation requirements are of order no more than (n2)/(log β n) per source symbol for β > 0 at both the encoder and the decoder. Though the details of its derivation, construction, and analysis differ considerably, we discuss similarities between the proposed scheme and the recently introduced Sparse Regression Codes of Venkataramanan et al.
View details for DOI 10.1109/TIT.2016.2598148
View details for Web of Science ID 000384304600013
View details for PubMedCentralID PMC5786423
-
Rateless Lossy Compression via the Extremes.
IEEE transactions on information theory
2016; 62 (10): 5484-5495
Abstract
We begin by presenting a simple lossy compressor operating at near-zero rate: The encoder merely describes the indices of the few maximal source components, while the decoder's reconstruction is a natural estimate of the source components based on this information. This scheme turns out to be near optimal for the memoryless Gaussian source in the sense of achieving the zero-rate slope of its distortion-rate function. Motivated by this finding, we then propose a scheme comprised of iterating the above lossy compressor on an appropriately transformed version of the difference between the source and its reconstruction from the previous iteration. The proposed scheme achieves the rate distortion function of the Gaussian memoryless source (under squared error distortion) when employed on any finite-variance ergodic source. It further possesses desirable properties, and we, respectively, refer to as infinitesimal successive refinability, ratelessness, and complete separability. Its storage and computation requirements are of order no more than (n2)/(log β n) per source symbol for β > 0 at both the encoder and the decoder. Though the details of its derivation, construction, and analysis differ considerably, we discuss similarities between the proposed scheme and the recently introduced Sparse Regression Codes of Venkataramanan et al.
View details for DOI 10.1109/tit.2016.2598148
View details for PubMedID 29375154
View details for PubMedCentralID PMC5786423
-
GTRAC: fast retrieval from compressed collections of genomic variants
BIOINFORMATICS
2016; 32 (17): 479-486
Abstract
The dramatic decrease in the cost of sequencing has resulted in the generation of huge amounts of genomic data, as evidenced by projects such as the UK10K and the Million Veteran Project, with the number of sequenced genomes ranging in the order of 10 K to 1 M. Due to the large redundancies among genomic sequences of individuals from the same species, most of the medical research deals with the variants in the sequences as compared with a reference sequence, rather than with the complete genomic sequences. Consequently, millions of genomes represented as variants are stored in databases. These databases are constantly updated and queried to extract information such as the common variants among individuals or groups of individuals. Previous algorithms for compression of this type of databases lack efficient random access capabilities, rendering querying the database for particular variants and/or individuals extremely inefficient, to the point where compression is often relinquished altogether.We present a new algorithm for this task, called GTRAC, that achieves significant compression ratios while allowing fast random access over the compressed database. For example, GTRAC is able to compress a Homo sapiens dataset containing 1092 samples in 1.1 GB (compression ratio of 160), while allowing for decompression of specific samples in less than a second and decompression of specific variants in 17 ms. GTRAC uses and adapts techniques from information theory, such as a specialized Lempel-Ziv compressor, and tailored succinct data structures.The GTRAC algorithm is available for download at: https://github.com/kedartatwawadi/GTRAC CONTACT: : kedart@stanford.eduSupplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btw437
View details for Web of Science ID 000384666800012
View details for PubMedCentralID PMC5013914
-
GTRAC: fast retrieval from compressed collections of genomic variants.
Bioinformatics
2016; 32 (17): i479-i486
Abstract
The dramatic decrease in the cost of sequencing has resulted in the generation of huge amounts of genomic data, as evidenced by projects such as the UK10K and the Million Veteran Project, with the number of sequenced genomes ranging in the order of 10 K to 1 M. Due to the large redundancies among genomic sequences of individuals from the same species, most of the medical research deals with the variants in the sequences as compared with a reference sequence, rather than with the complete genomic sequences. Consequently, millions of genomes represented as variants are stored in databases. These databases are constantly updated and queried to extract information such as the common variants among individuals or groups of individuals. Previous algorithms for compression of this type of databases lack efficient random access capabilities, rendering querying the database for particular variants and/or individuals extremely inefficient, to the point where compression is often relinquished altogether.We present a new algorithm for this task, called GTRAC, that achieves significant compression ratios while allowing fast random access over the compressed database. For example, GTRAC is able to compress a Homo sapiens dataset containing 1092 samples in 1.1 GB (compression ratio of 160), while allowing for decompression of specific samples in less than a second and decompression of specific variants in 17 ms. GTRAC uses and adapts techniques from information theory, such as a specialized Lempel-Ziv compressor, and tailored succinct data structures.The GTRAC algorithm is available for download at: https://github.com/kedartatwawadi/GTRAC CONTACT: : kedart@stanford.eduSupplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btw437
View details for PubMedID 27587665
View details for PubMedCentralID PMC5013914
-
Information, Estimation, and Lookahead in the Gaussian Channel
IEEE TRANSACTIONS ON SIGNAL PROCESSING
2016; 64 (14): 3605-3618
View details for DOI 10.1109/TSP.2016.2544748
View details for Web of Science ID 000377370500005
-
Secure Source Coding With a Public Helper
IEEE TRANSACTIONS ON INFORMATION THEORY
2016; 62 (7): 3930-3949
View details for DOI 10.1109/TIT.2016.2570011
View details for Web of Science ID 000382135600009
-
Strong Successive Refinability and Rate-Distortion-Complexity Tradeoff
IEEE TRANSACTIONS ON INFORMATION THEORY
2016; 62 (6): 3618-3635
View details for DOI 10.1109/TIT.2016.2549540
View details for Web of Science ID 000380070600038
-
Compression for Quadratic Similarity Queries: Finite Blocklength and Practical Schemes.
IEEE transactions on information theory
2016; 62 (5): 2737-2747
Abstract
We study the problem of compression for the purpose of similarity identification, where similarity is measured by the mean square Euclidean distance between vectors. While the asymptotical fundamental limits of the problem - the minimal compression rate and the error exponent - were found in a previous work, in this paper we focus on the nonasymptotic domain and on practical, implementable schemes. We first present a finite blocklength achievability bound based on shape-gain quantization: The gain (amplitude) of the vector is compressed via scalar quantization and the shape (the projection on the unit sphere) is quantized using a spherical code. The results are numerically evaluated and they converge to the asymptotic values as predicted by the error exponent. We then give a nonasymptotic lower bound on the performance of any compression scheme, and compare to the upper (achievability) bound. For a practical implementation of such a scheme, we use wrapped spherical codes, studied by Hamkins and Zeger, and use the Leech lattice as an example for an underlying lattice. As a side result, we obtain a bound on the covering angle of any wrapped spherical code, as a function of the covering radius of the underlying lattice.
View details for DOI 10.1109/TIT.2016.2535172
View details for PubMedID 29398721
View details for PubMedCentralID PMC5796586
-
Compression for Quadratic Similarity Queries: Finite Blocklength and Practical Schemes
IEEE TRANSACTIONS ON INFORMATION THEORY
2016; 62 (5): 2737-2747
Abstract
We study the problem of compression for the purpose of similarity identification, where similarity is measured by the mean square Euclidean distance between vectors. While the asymptotical fundamental limits of the problem - the minimal compression rate and the error exponent - were found in a previous work, in this paper we focus on the nonasymptotic domain and on practical, implementable schemes. We first present a finite blocklength achievability bound based on shape-gain quantization: The gain (amplitude) of the vector is compressed via scalar quantization and the shape (the projection on the unit sphere) is quantized using a spherical code. The results are numerically evaluated and they converge to the asymptotic values as predicted by the error exponent. We then give a nonasymptotic lower bound on the performance of any compression scheme, and compare to the upper (achievability) bound. For a practical implementation of such a scheme, we use wrapped spherical codes, studied by Hamkins and Zeger, and use the Leech lattice as an example for an underlying lattice. As a side result, we obtain a bound on the covering angle of any wrapped spherical code, as a function of the covering radius of the underlying lattice.
View details for DOI 10.1109/TIT.2016.2535172
View details for Web of Science ID 000375005500026
View details for PubMedCentralID PMC5796586
-
Comment on: 'ERGC: an efficient referential genome compression algorithm'
BIOINFORMATICS
2016; 32 (7): 1115–17
Abstract
Data compression is crucial in effective handling of genomic data. Among several recently published algorithms, ERGC seems to be surprisingly good, easily beating all of the competitors.We evaluated ERGC and the previously proposed algorithms GDC and iDoComp, which are the ones used in the original paper for comparison, on a wide data set including 12 assemblies of human genome (instead of only four of them in the original paper). ERGC wins only when one of the genomes (referential or target) contains mixed-cased letters (which is the case for only the two Korean genomes). In all other cases ERGC is on average an order of magnitude worse than GDC and iDoComp.sebastian.deorowicz@polsl.pl, iochoa@stanford.eduSupplementary data are available at Bioinformatics online.
View details for PubMedID 26615213
View details for PubMedCentralID PMC4907388
-
smallWig: parallel compression of RNA-seq WIG files
BIOINFORMATICS
2016; 32 (2): 173-180
Abstract
We developed a new lossless compression method for WIG data, named smallWig, offering the best known compression rates for RNA-seq data and featuring random access functionalities that enable visualization, summary statistics analysis and fast queries from the compressed files. Our approach results in order of magnitude improvements compared with bigWig and ensures compression rates only a fraction of those produced by cWig. The key features of the smallWig algorithm are statistical data analysis and a combination of source coding methods that ensure high flexibility and make the algorithm suitable for different applications. Furthermore, for general-purpose file compression, the compression rate of smallWig approaches the empirical entropy of the tested WIG data. For compression with random query features, smallWig uses a simple block-based compression scheme that introduces only a minor overhead in the compression rate. For archival or storage space-sensitive applications, the method relies on context mixing techniques that lead to further improvements of the compression rate. Implementations of smallWig can be executed in parallel on different sets of chromosomes using multiple processors, thereby enabling desirable scaling for future transcriptome Big Data platforms.The development of next-generation sequencing technologies has led to a dramatic decrease in the cost of DNA/RNA sequencing and expression profiling. RNA-seq has emerged as an important and inexpensive technology that provides information about whole transcriptomes of various species and organisms, as well as different organs and cellular communities. The vast volume of data generated by RNA-seq experiments has significantly increased data storage costs and communication bandwidth requirements. Current compression tools for RNA-seq data such as bigWig and cWig either use general-purpose compressors (gzip) or suboptimal compression schemes that leave significant room for improvement. To substantiate this claim, we performed a statistical analysis of expression data in different transform domains and developed accompanying entropy coding methods that bridge the gap between theoretical and practical WIG file compression rates.We tested different variants of the smallWig compression algorithm on a number of integer-and real- (floating point) valued RNA-seq WIG files generated by the ENCODE project. The results reveal that, on average, smallWig offers 18-fold compression rate improvements, up to 2.5-fold compression time improvements, and 1.5-fold decompression time improvements when compared with bigWig. On the tested files, the memory usage of the algorithm never exceeded 90 KB. When more elaborate context mixing compressors were used within smallWig, the obtained compression rates were as much as 23 times better than those of bigWig. For smallWig used in the random query mode, which also supports retrieval of the summary statistics, an overhead in the compression rate of roughly 3-17% was introduced depending on the chosen system parameters. An increase in encoding and decoding time of 30% and 55% represents an additional performance loss caused by enabling random data access. We also implemented smallWig using multi-processor programming. This parallelization feature decreases the encoding delay 2-3.4 times compared with that of a single-processor implementation, with the number of processors used ranging from 2 to 8; in the same parameter regime, the decoding delay decreased 2-5.2 times.The smallWig software can be downloaded from: http://stanford.edu/~zhiyingw/smallWig/smallwig.html, http://publish.illinois.edu/milenkovic/, http://web.stanford.edu/~tsachy/.zhiyingw@stanford.eduSupplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btv561
View details for Web of Science ID 000368360100003
View details for PubMedCentralID PMC6078172
-
smallWig: parallel compression of RNA-seq WIG files.
Bioinformatics (Oxford, England)
2016; 32 (2): 173-80
Abstract
We developed a new lossless compression method for WIG data, named smallWig, offering the best known compression rates for RNA-seq data and featuring random access functionalities that enable visualization, summary statistics analysis and fast queries from the compressed files. Our approach results in order of magnitude improvements compared with bigWig and ensures compression rates only a fraction of those produced by cWig. The key features of the smallWig algorithm are statistical data analysis and a combination of source coding methods that ensure high flexibility and make the algorithm suitable for different applications. Furthermore, for general-purpose file compression, the compression rate of smallWig approaches the empirical entropy of the tested WIG data. For compression with random query features, smallWig uses a simple block-based compression scheme that introduces only a minor overhead in the compression rate. For archival or storage space-sensitive applications, the method relies on context mixing techniques that lead to further improvements of the compression rate. Implementations of smallWig can be executed in parallel on different sets of chromosomes using multiple processors, thereby enabling desirable scaling for future transcriptome Big Data platforms.The development of next-generation sequencing technologies has led to a dramatic decrease in the cost of DNA/RNA sequencing and expression profiling. RNA-seq has emerged as an important and inexpensive technology that provides information about whole transcriptomes of various species and organisms, as well as different organs and cellular communities. The vast volume of data generated by RNA-seq experiments has significantly increased data storage costs and communication bandwidth requirements. Current compression tools for RNA-seq data such as bigWig and cWig either use general-purpose compressors (gzip) or suboptimal compression schemes that leave significant room for improvement. To substantiate this claim, we performed a statistical analysis of expression data in different transform domains and developed accompanying entropy coding methods that bridge the gap between theoretical and practical WIG file compression rates.We tested different variants of the smallWig compression algorithm on a number of integer-and real- (floating point) valued RNA-seq WIG files generated by the ENCODE project. The results reveal that, on average, smallWig offers 18-fold compression rate improvements, up to 2.5-fold compression time improvements, and 1.5-fold decompression time improvements when compared with bigWig. On the tested files, the memory usage of the algorithm never exceeded 90 KB. When more elaborate context mixing compressors were used within smallWig, the obtained compression rates were as much as 23 times better than those of bigWig. For smallWig used in the random query mode, which also supports retrieval of the summary statistics, an overhead in the compression rate of roughly 3-17% was introduced depending on the chosen system parameters. An increase in encoding and decoding time of 30% and 55% represents an additional performance loss caused by enabling random data access. We also implemented smallWig using multi-processor programming. This parallelization feature decreases the encoding delay 2-3.4 times compared with that of a single-processor implementation, with the number of processors used ranging from 2 to 8; in the same parameter regime, the decoding delay decreased 2-5.2 times.The smallWig software can be downloaded from: http://stanford.edu/~zhiyingw/smallWig/smallwig.html, http://publish.illinois.edu/milenkovic/, http://web.stanford.edu/~tsachy/.zhiyingw@stanford.eduSupplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btv561
View details for PubMedID 26424856
View details for PubMedCentralID PMC6078172
-
Distortion Rate Function of Sub-Nyquist Sampled Gaussian Sources
IEEE TRANSACTIONS ON INFORMATION THEORY
2016; 62 (1): 401-429
View details for DOI 10.1109/TIT.2015.2485271
View details for Web of Science ID 000369309500027
-
Mutual Information, Relative Entropy and Estimation Error in Semi-Martingale Channels
IEEE. 2016: 2794–98
View details for Web of Science ID 000390098702172
-
Chained Kullback-Leibler Divergences
IEEE. 2016: 580–84
Abstract
We define and characterize the "chained" Kullback-Leibler divergence min w D(p‖w) + D(w‖q) minimized over all intermediate distributions w and the analogous k-fold chained K-L divergence min D(p‖wk-1) + … + D(w2‖w1) + D(w1‖q) minimized over the entire path (w1,…,wk-1). This quantity arises in a large deviations analysis of a Markov chain on the set of types - the Wright-Fisher model of neutral genetic drift: a population with allele distribution q produces offspring with allele distribution w, which then produce offspring with allele distribution p, and so on. The chained divergences enjoy some of the same properties as the K-L divergence (like joint convexity in the arguments) and appear in k-step versions of some of the same settings as the K-L divergence (like information projections and a conditional limit theorem). We further characterize the optimal k-step "path" of distributions appearing in the definition and apply our findings in a large deviations analysis of the Wright-Fisher process. We make a connection to information geometry via the previously studied continuum limit, where the number of steps tends to infinity, and the limiting path is a geodesic in the Fisher information metric. Finally, we offer a thermodynamic interpretation of the chained divergence (as the rate of operation of an appropriately defined Maxwell's demon) and we state some natural extensions and applications (a k-step mutual information and k-step maximum likelihood inference). We release code for computing the objects we study.
View details for PubMedID 29130024
-
A cluster-based approach to compression of Quality Scores
IEEE. 2016: 261–70
Abstract
Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in the sequencing cost. Storing and sharing this large data has become a major bottleneck in the discovery and analysis of genetic variants that are used for medical inference. As such, lossless compression of this data has been proposed. Of the compressed data, more than 70% correspond to quality scores, which indicate the sequencing machine reliability when calling a particular basepair. Thus, to further improve the compression performance, lossy compression of quality scores is emerging as the natural candidate. Since the data is used for genetic variants discovery, lossy compressors for quality scores are analyzed in terms of their rate-distortion performance, as well as their effect on the variant callers. Previously proposed algorithms do not do well under all performance metrics, and are hence unsuitable for certain applications. In this work we propose a new lossy compressor that first performs a clustering step, by assuming all the quality scores sequences come from a mixture of Markov models. Then, it performs quantization of the quality scores based on the Markov models. Each quantizer targets a specific distortion to optimize for the overall rate-distortion performance. Finally, the quantized values are compressed by an entropy encoder. We demonstrate that the proposed lossy compressor outperforms the previously proposed methods under all analyzed distortion metrics. This suggests that the effect that the proposed algorithm will have on any downstream application will likely be less noticeable than that of previously proposed lossy compressors. Moreover, we analyze how the proposed lossy compressor affects Single Nucleotide Polymorphism (SNP) calling, and show that the variability introduced on the calls is considerably smaller than the variability that exists between different methodologies for SNP calling.
View details for PubMedID 29057318
-
Minimax Estimation of the L-I Distance
IEEE. 2016: 750–54
View details for Web of Science ID 000390098700150
-
When is Noisy State Information at the Eneoder as Useless as No Information or as Good as Noise-Free State?
IEEE. 2016: 890–94
View details for Web of Science ID 000390098700178
-
Minimax Rate-optimal Estimation of KL Divergence between Discrete Distributions
IEEE. 2016: 256–60
Abstract
We refine the general methodology in [1] for the construction and analysis of essentially minimax estimators for a wide class of functionals of finite dimensional parameters, and elaborate on the case of discrete distributions with support size S comparable with the number of observations n. Specifically, we determine the "smooth" and "non-smooth" regimes based on the confidence set and the smoothness of the functional. In the "non-smooth" regime, we apply an unbiased estimator for a "suitable" polynomial approximation of the functional. In the "smooth" regime, we construct a bias corrected version of the Maximum Likelihood Estimator (MLE) based on Taylor expansion. We apply the general methodology to the problem of estimating the KL divergence between two discrete distributions from empirical data. We construct a minimax rate-optimal estimator which is adaptive in the sense that it does not require the knowledge of the support size nor the upper bound on the likelihood ratio. Moreover, the performance of the optimal estimator with n samples is essentially that of the MLE with n ln n samples, i.e., the effective sample size enlargement phenomenon holds.
View details for PubMedID 29457152
-
Beyond Maximum Likelihood: Boosting the Chow-Liu Algorithm for Large Alphabets
IEEE COMPUTER SOC. 2016: 321–25
View details for Web of Science ID 000406057400056
-
Denoising of Quality Scores for Boosted Inference and Reduced Storage
IEEE. 2016: 251–60
Abstract
Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in the sequencing cost. Much of the raw data are comprised of nucleotides and the corresponding quality scores that indicate their reliability. The latter are more difficult to compress and are themselves noisy. Lossless and lossy compression of the quality scores has recently been proposed to alleviate the storage costs, but reducing the noise in the quality scores has remained largely unexplored. This raw data is processed in order to identify variants; these genetic variants are used in important applications, such as medical decision making. Thus improving the performance of the variant calling by reducing the noise contained in the quality scores is important. We propose a denoising scheme that reduces the noise of the quality scores and we demonstrate improved inference with this denoised data. Specifically, we show that replacing the quality scores with those generated by the proposed denoiser results in more accurate variant calling in general. Moreover, a consequence of the denoising is that the entropy of the produced quality scores is smaller, and thus significant compression can be achieved with respect to lossless compression of the original quality scores. We expect our results to provide a baseline for future research in denoising of quality scores. The code used in this work as well as a Supplement with all the results are available at http://web.stanford.edu/~iochoa/DCCdenoiser_CodeAndSupplement.zip.
View details for PubMedID 29098178
-
Minimax Estimation of Discrete Distributions Under l(1) Loss
IEEE TRANSACTIONS ON INFORMATION THEORY
2015; 61 (11): 6343-6354
View details for DOI 10.1109/TIT.2015.2478816
View details for Web of Science ID 000363256500038
-
Justification of Logarithmic Loss via the Benefit of Side Information
IEEE TRANSACTIONS ON INFORMATION THEORY
2015; 61 (10): 5357-5365
View details for DOI 10.1109/TIT.2015.2462848
View details for Web of Science ID 000361486800010
-
QVZ: lossy compression of quality values.
Bioinformatics
2015; 31 (19): 3122-3129
Abstract
Recent advancements in sequencing technology have led to a drastic reduction in the cost of sequencing a genome. This has generated an unprecedented amount of genomic data that must be stored, processed and transmitted. To facilitate this effort, we propose a new lossy compressor for the quality values presented in genomic data files (e.g. FASTQ and SAM files), which comprise roughly half of the storage space (in the uncompressed domain). Lossy compression allows for compression of data beyond its lossless limit.The proposed algorithm QVZ exhibits better rate-distortion performance than the previously proposed algorithms, for several distortion metrics and for the lossless case. Moreover, it allows the user to define any quasi-convex distortion function to be minimized, a feature not supported by the previous algorithms. Finally, we show that QVZ-compressed data exhibit better performance in the genotyping than data compressed with previously proposed algorithms, in the sense that for a similar rate, a genotyping closer to that achieved with the original quality values is obtained.QVZ is written in C and can be downloaded from https://github.com/mikelhernaez/qvz.mhernaez@stanford.edu or gmalysa@stanford.edu or iochoa@stanford.eduSupplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btv330
View details for PubMedID 26026138
-
Network Compression: Worst Case Analysis
IEEE TRANSACTIONS ON INFORMATION THEORY
2015; 61 (7): 3980-3995
Abstract
We study the problem of communicating a distributed correlated memoryless source over a memoryless network, from source nodes to destination nodes, under quadratic distortion constraints. We establish the following two complementary results: 1) for an arbitrary memoryless network, among all distributed memoryless sources of a given correlation, Gaussian sources are least compressible, that is, they admit the smallest set of achievable distortion tuples and 2) for any memoryless source to be communicated over a memoryless additive-noise network, among all noise processes of a given correlation, Gaussian noise admits the smallest achievable set of distortion tuples. We establish these results constructively by showing how schemes for the corresponding Gaussian problems can be applied to achieve similar performance for (source or noise) distributions that are not necessarily Gaussian but have the same covariance.
View details for DOI 10.1109/TIT.2015.2434829
View details for Web of Science ID 000356301600020
View details for PubMedCentralID PMC5786424
-
Network Compression: Worst Case Analysis.
IEEE transactions on information theory
2015; 61 (7): 3980-3995
Abstract
We study the problem of communicating a distributed correlated memoryless source over a memoryless network, from source nodes to destination nodes, under quadratic distortion constraints. We establish the following two complementary results: 1) for an arbitrary memoryless network, among all distributed memoryless sources of a given correlation, Gaussian sources are least compressible, that is, they admit the smallest set of achievable distortion tuples and 2) for any memoryless source to be communicated over a memoryless additive-noise network, among all noise processes of a given correlation, Gaussian noise admits the smallest achievable set of distortion tuples. We establish these results constructively by showing how schemes for the corresponding Gaussian problems can be applied to achieve similar performance for (source or noise) distributions that are not necessarily Gaussian but have the same covariance.
View details for DOI 10.1109/tit.2015.2434829
View details for PubMedID 29375153
View details for PubMedCentralID PMC5786424
-
Minimax Estimation of Functionals of Discrete Distributions
IEEE TRANSACTIONS ON INFORMATION THEORY
2015; 61 (5): 2835-2885
Abstract
We propose a general methodology for the construction and analysis of essentially minimax estimators for a wide class of functionals of finite dimensional parameters, and elaborate on the case of discrete distributions, where the support size S is unknown and may be comparable with or even much larger than the number of observations n. We treat the respective regions where the functional is nonsmooth and smooth separately. In the nonsmooth regime, we apply an unbiased estimator for the best polynomial approximation of the functional whereas, in the smooth regime, we apply a bias-corrected version of the maximum likelihood estimator (MLE). We illustrate the merit of this approach by thoroughly analyzing the performance of the resulting schemes for estimating two important information measures: 1) the entropy [Formula: see text] and 2) [Formula: see text], α > 0. We obtain the minimax L2 rates for estimating these functionals. In particular, we demonstrate that our estimator achieves the optimal sample complexity n ≍ S/ln S for entropy estimation. We also demonstrate that the sample complexity for estimating Fα (P), 0 < α < 1, is n ≍ S1/α /ln S, which can be achieved by our estimator but not the MLE. For 1 < α < 3/2, we show the minimax L2 rate for estimating Fα (P) is (n ln n)-2(α-1) for infinite support size, while the maximum L2 rate for the MLE is n-2(α-1). For all the above cases, the behavior of the minimax rate-optimal estimators with n samples is essentially that of the MLE (plug-in rule) with n ln n samples, which we term "effective sample size enlargement." We highlight the practical advantages of our schemes for the estimation of entropy and mutual information. We compare our performance with various existing approaches, and demonstrate that our approach reduces running time and boosts the accuracy. Moreover, we show that the minimax rate-optimal mutual information estimator yielded by our framework leads to significant performance boosts over the Chow-Liu algorithm in learning graphical models. The wide use of information measure estimation suggests that the insights and estimators obtained in this paper could be broadly applicable.
View details for DOI 10.1109/TIT.2015.2412945
View details for Web of Science ID 000353515700038
View details for PubMedCentralID PMC5786426
-
Compression for Quadratic Similarity Queries.
IEEE transactions on information theory
2015; 61 (5): 2729-2747
Abstract
The problem of performing similarity queries on compressed data is considered. We focus on the quadratic similarity measure, and study the fundamental tradeoff between compression rate, sequence length, and reliability of queries performed on the compressed data. For a Gaussian source, we show that the queries can be answered reliably if and only if the compression rate exceeds a given threshold-the identification rate- which we explicitly characterize. Moreover, when compression is performed at a rate greater than the identification rate, responses to queries on the compressed data can be made exponentially reliable. We give a complete characterization of this exponent, which is analogous to the error and excess-distortion exponents in channel and source coding, respectively. For a general source, we prove that, as with classical compression, the Gaussian source requires the largest compression rate among sources with a given variance. Moreover, a robust scheme is described that attains this maximal rate for any source distribution.
View details for DOI 10.1109/tit.2015.2402972
View details for PubMedID 29375151
View details for PubMedCentralID PMC5786438
-
Compression for Quadratic Similarity Queries
IEEE TRANSACTIONS ON INFORMATION THEORY
2015; 61 (5): 2729-2747
Abstract
The problem of performing similarity queries on compressed data is considered. We focus on the quadratic similarity measure, and study the fundamental tradeoff between compression rate, sequence length, and reliability of queries performed on the compressed data. For a Gaussian source, we show that the queries can be answered reliably if and only if the compression rate exceeds a given threshold-the identification rate- which we explicitly characterize. Moreover, when compression is performed at a rate greater than the identification rate, responses to queries on the compressed data can be made exponentially reliable. We give a complete characterization of this exponent, which is analogous to the error and excess-distortion exponents in channel and source coding, respectively. For a general source, we prove that, as with classical compression, the Gaussian source requires the largest compression rate among sources with a given variance. Moreover, a robust scheme is described that attains this maximal rate for any source distribution.
View details for DOI 10.1109/TIT.2015.2402972
View details for Web of Science ID 000353515700032
View details for PubMedCentralID PMC5786438
-
Minimax Estimation of Functionals of Discrete Distributions.
IEEE transactions on information theory
2015; 61 (5): 2835-2885
Abstract
We propose a general methodology for the construction and analysis of essentially minimax estimators for a wide class of functionals of finite dimensional parameters, and elaborate on the case of discrete distributions, where the support size S is unknown and may be comparable with or even much larger than the number of observations n. We treat the respective regions where the functional is nonsmooth and smooth separately. In the nonsmooth regime, we apply an unbiased estimator for the best polynomial approximation of the functional whereas, in the smooth regime, we apply a bias-corrected version of the maximum likelihood estimator (MLE). We illustrate the merit of this approach by thoroughly analyzing the performance of the resulting schemes for estimating two important information measures: 1) the entropy [Formula: see text] and 2) [Formula: see text], α > 0. We obtain the minimax L2 rates for estimating these functionals. In particular, we demonstrate that our estimator achieves the optimal sample complexity n ≍ S/ln S for entropy estimation. We also demonstrate that the sample complexity for estimating Fα (P), 0 < α < 1, is n ≍ S1/α /ln S, which can be achieved by our estimator but not the MLE. For 1 < α < 3/2, we show the minimax L2 rate for estimating Fα (P) is (n ln n)-2(α-1) for infinite support size, while the maximum L2 rate for the MLE is n-2(α-1). For all the above cases, the behavior of the minimax rate-optimal estimators with n samples is essentially that of the MLE (plug-in rule) with n ln n samples, which we term "effective sample size enlargement." We highlight the practical advantages of our schemes for the estimation of entropy and mutual information. We compare our performance with various existing approaches, and demonstrate that our approach reduces running time and boosts the accuracy. Moreover, we show that the minimax rate-optimal mutual information estimator yielded by our framework leads to significant performance boosts over the Chow-Liu algorithm in learning graphical models. The wide use of information measure estimation suggests that the insights and estimators obtained in this paper could be broadly applicable.
View details for DOI 10.1109/tit.2015.2412945
View details for PubMedID 29375152
View details for PubMedCentralID PMC5786426
-
Comparison of the Achievable Rates in OFDM and Single Carrier Modulation with IID Inputs
IEEE TRANSACTIONS ON INFORMATION THEORY
2015; 61 (4): 1795-1818
View details for DOI 10.1109/TIT.2015.2403354
View details for Web of Science ID 000351470800018
-
iDoComp: a compression scheme for assembled genomes
BIOINFORMATICS
2015; 31 (5): 626-633
Abstract
With the release of the latest next-generation sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing a Human has dropped to a mere $4000. Thus we are approaching a milestone in the sequencing history, known as the $1000 genome era, where the sequencing of individuals is affordable, opening the doors to effective personalized medicine. Massive generation of genomic data, including assembled genomes, is expected in the following years. There is crucial need for compression of genomes guaranteed of performing well simultaneously on different species, from simple bacteria to humans, which will ease their transmission, dissemination and analysis. Further, most of the new genomes to be compressed will correspond to individuals of a species from which a reference already exists on the database. Thus, it is natural to propose compression schemes that assume and exploit the availability of such references.We propose iDoComp, a compressor of assembled genomes presented in FASTA format that compresses an individual genome using a reference genome for both the compression and the decompression. In terms of compression efficiency, iDoComp outperforms previously proposed algorithms in most of the studied cases, with comparable or better running time. For example, we observe compression gains of up to 60% in several cases, including H.sapiens data, when comparing with the best compression performance among the previously proposed algorithms.iDoComp is written in C and can be downloaded from: http://www.stanford.edu/~iochoa/iDoComp.html (We also provide a full explanation on how to run the program and an example with all the necessary files to run it.).
View details for DOI 10.1093/bioinformatics/btu698
View details for Web of Science ID 000352268500002
View details for PubMedID 25344501
-
Macrophages eat cancer cells using their own calreticulin as a guide: Roles of TLR and Btk.
Proceedings of the National Academy of Sciences of the United States of America
2015; 112 (7): 2145-2150
Abstract
Macrophage-mediated programmed cell removal (PrCR) is an important mechanism of eliminating diseased and damaged cells before programmed cell death. The induction of PrCR by eat-me signals on tumor cells is countered by don't-eat-me signals such as CD47, which binds macrophage signal-regulatory protein α to inhibit phagocytosis. Blockade of CD47 on tumor cells leads to phagocytosis by macrophages. Here we demonstrate that the activation of Toll-like receptor (TLR) signaling pathways in macrophages synergizes with blocking CD47 on tumor cells to enhance PrCR. Bruton's tyrosine kinase (Btk) mediates TLR signaling in macrophages. Calreticulin, previously shown to be an eat-me signal on cancer cells, is activated in macrophages for secretion and cell-surface exposure by TLR and Btk to target cancer cells for phagocytosis, even if the cancer cells themselves do not express calreticulin.
View details for DOI 10.1073/pnas.1424907112
View details for PubMedID 25646432
-
Universality of Logarithmic Loss in Lossy Compression
IEEE. 2015: 2166–70
View details for Web of Science ID 000380904702044
-
Compression for Similarity Identification: Computing the Error Exponent
IEEE. 2015: 413–22
Abstract
We consider the problem of compressing discrete memoryless data sequences for the purpose of similarity identification, first studied by Ahlswede et al. (1997). In this setting, a source sequence is compressed, where the goal is to be able to identify whether the original source sequence is similar to another given sequence (called the query sequence). There is no requirement that the source will be reproducible from the compressed version. In the case where no false negatives are allowed, a compression scheme is said to be reliable if the probability of error (false positive) vanishes as the sequence length grows. The minimal compression rate in this sense, which is the parallel of the classical rate distortion function, is called the identification rate. The rate at which the error probability vanishes is measured by its exponent, called the identification exponent (which is the analog of the classical excess distortion exponent). While an information-theoretic expression for the identification exponent was found in past work, it is uncomputable due to a dependency on an auxiliary random variable with unbounded cardinality. The main result of this paper is a cardinality bound on the auxiliary random variable in the identification exponent, thereby making the quantity computable (solving the problem that was left open by Ahlswede et al.). The new proof technique relies on the fact that the Lagrangian in the optimization problem (in the expression for the exponent) can be decomposed by coordinate (of the auxiliary random variable). Then a standard Carathéodory - style argument completes the proof.
View details for PubMedID 29046895
-
Maximum Likelihood Estimation of Information Measures
IEEE. 2015: 839–43
View details for Web of Science ID 000380904700168
-
Minimax Estimation of Information Measures
IEEE. 2015: 2296–2300
View details for Web of Science ID 000380904702070
-
Minimax Estimation of Discrete Distributions
IEEE. 2015: 2291–95
View details for Web of Science ID 000380904702069
-
Adaptive Estimation of Shannon Entropy
IEEE. 2015: 1372–76
View details for Web of Science ID 000380904701085
-
Does Dirichlet Prior Smoothing Solve the Shannon Entropy Estimation Problem?
IEEE. 2015: 1367–71
View details for Web of Science ID 000380904701084
-
Information Measures: The Curious Case of the Binary Alphabet
IEEE TRANSACTIONS ON INFORMATION THEORY
2014; 60 (12): 7616-7626
View details for DOI 10.1109/TIT.2014.2360184
View details for Web of Science ID 000345511400014
-
Aligned genomic data compression via improved modeling
JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY
2014; 12 (6)
Abstract
With the release of the latest Next-Generation Sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing the whole genome of a human is expected to drop to a mere $1000. This milestone in sequencing history marks the era of affordable sequencing of individuals and opens the doors to personalized medicine. In accord, unprecedented volumes of genomic data will require storage for processing. There will be dire need not only of compressing aligned data, but also of generating compressed files that can be fed directly to downstream applications to facilitate the analysis of and inference on the data. Several approaches to this challenge have been proposed in the literature; however, focus thus far has been on the low coverage regime and most of the suggested compressors are not based on effective modeling of the data. We demonstrate the benefit of data modeling for compressing aligned reads. Specifically, we show that, by working with data models designed for the aligned data, we can improve considerably over the best compression ratio achieved by previously proposed algorithms. Our results indicate that the pareto-optimal barrier for compression rate and speed claimed by Bonfield and Mahoney (2013) [Bonfield JK and Mahoneys MV, Compression of FASTQ and SAM format sequencing data, PLOS ONE, 8(3):e59190, 2013.] does not apply for high coverage aligned data. Furthermore, our improved compression ratio is achieved by splitting the data in a manner conducive to operations in the compressed domain by downstream applications.
View details for DOI 10.1142/S0219720014420025
View details for PubMedID 25395305
-
Capacity of a POST Channel With and Without Feedback
IEEE TRANSACTIONS ON INFORMATION THEORY
2014; 60 (10): 6041-6057
View details for DOI 10.1109/TIT.2014.2343232
View details for Web of Science ID 000342416900017
-
To Feed or Not to Feedback
IEEE TRANSACTIONS ON INFORMATION THEORY
2014; 60 (9): 5150-5172
View details for DOI 10.1109/TIT.2014.2331957
View details for Web of Science ID 000342415600006
-
Minimax Filtering Regret via Relations Between Information and Estimation
IEEE TRANSACTIONS ON INFORMATION THEORY
2014; 60 (8): 4832-4847
View details for DOI 10.1109/TIT.2014.2323069
View details for Web of Science ID 000340273700027
-
The Porosity of Additive Noise Channels
IEEE TRANSACTIONS ON INFORMATION THEORY
2014; 60 (6): 3144-3162
View details for DOI 10.1109/TIT.2014.2313578
View details for Web of Science ID 000341980300006
-
Compression With Actions
IEEE TRANSACTIONS ON INFORMATION THEORY
2014; 60 (2): 796-807
View details for DOI 10.1109/TIT.2013.2293339
View details for Web of Science ID 000330286100002
-
Multiterminal Source Coding Under Logarithmic Loss
IEEE TRANSACTIONS ON INFORMATION THEORY
2014; 60 (1): 740-761
View details for DOI 10.1109/TIT.2013.2288257
View details for Web of Science ID 000330282600049
-
Compression for Similarity Identification: Fundamental Limits
IEEE International Symposium on Information Theory (ISIT)
IEEE. 2014: 1–5
View details for Web of Science ID 000346496100001
-
Compression for Quadratic Similarity Queries via Shape-Gain Quantizers
IEEE International Symposium on Information Theory (ISIT)
IEEE. 2014: 2839–2843
View details for Web of Science ID 000346496102196
-
Rateless Lossy Compression via the Extremes
IEEE. 2014: 88–92
View details for Web of Science ID 000380426900014
-
Information Divergences and the Curious Case of the Binary Alphabet
IEEE International Symposium on Information Theory (ISIT)
IEEE. 2014: 351–355
View details for Web of Science ID 000346496100070
-
Relations between Information and Estimation in Scalar Levy Channels
IEEE International Symposium on Information Theory (ISIT)
IEEE. 2014: 2212–2216
View details for Web of Science ID 000346496102070
-
Strong Successive Refinability: Sufficient Conditions
IEEE International Symposium on Information Theory (ISIT)
IEEE. 2014: 2659–2663
View details for Web of Science ID 000346496102160
-
Justification of Logarithmic Loss via the Benefit of Side Information
IEEE International Symposium on Information Theory (ISIT)
IEEE. 2014: 946–950
View details for Web of Science ID 000346496101016
-
Achievable Error Exponents in the Gaussian Channel With Rate-Limited Feedback
IEEE TRANSACTIONS ON INFORMATION THEORY
2013; 59 (12): 8144-8156
View details for DOI 10.1109/TIT.2013.2280918
View details for Web of Science ID 000327594500024
-
Estimation With a Helper Who Knows the Interference
IEEE TRANSACTIONS ON INFORMATION THEORY
2013; 59 (11): 7097-7117
View details for DOI 10.1109/TIT.2013.2273531
View details for Web of Science ID 000325981100008
-
Universal Estimation of Directed Information
IEEE TRANSACTIONS ON INFORMATION THEORY
2013; 59 (10): 6220-6242
View details for DOI 10.1109/TIT.2013.2267934
View details for Web of Science ID 000324573500004
-
Successive Refinement With Decoder Cooperation and Its Channel Coding Duals
IEEE TRANSACTIONS ON INFORMATION THEORY
2013; 59 (9): 5511-5533
View details for DOI 10.1109/TIT.2013.2266655
View details for Web of Science ID 000323455800017
-
The human genome contracts again
BIOINFORMATICS
2013; 29 (17): 2199-2202
Abstract
The number of human genomes that have been sequenced completely for different individuals has increased rapidly in recent years. Storing and transferring complete genomes between computers for the purpose of applying various applications and analysis tools will soon become a major hurdle, hindering the analysis phase. Therefore, there is a growing need to compress these data efficiently. Here, we describe a technique to compress human genomes based on entropy coding, using a reference genome and known Single Nucleotide Polymorphisms (SNPs). Furthermore, we explore several intrinsic features of genomes and information in other genomic databases to further improve the compression attained. Using these methods, we compress James Watson's genome to 2.5 megabytes (MB), improving on recent work by 37%. Similar compression is obtained for most genomes available from the 1000 Genomes Project. Our biologically inspired techniques promise even greater gains for genomes of lower organisms and for human genomes as more genomic data become available.Code is available at sourceforge.net/projects/genomezip/
View details for DOI 10.1093/bioinformatics/btt362
View details for Web of Science ID 000323344800018
View details for PubMedID 23793748
-
QualComp: a new lossy compressor for quality scores based on rate distortion theory
BMC BIOINFORMATICS
2013; 14
Abstract
Next Generation Sequencing technologies have revolutionized many fields in biology by reducing the time and cost required for sequencing. As a result, large amounts of sequencing data are being generated. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users. This data consists of both the nucleotide sequences and per-base quality scores that indicate the level of confidence in the readout of these sequences. Quality scores account for about half of the required disk space in the commonly used FASTQ format (before compression), and therefore the compression of the quality scores can significantly reduce storage requirements and speed up analysis and transmission of sequencing data.In this paper, we present a new scheme for the lossy compression of the quality scores, to address the problem of storage. Our framework allows the user to specify the rate (bits per quality score) prior to compression, independent of the data to be compressed. Our algorithm can work at any rate, unlike other lossy compression algorithms. We envisage our algorithm as being part of a more general compression scheme that works with the entire FASTQ file. Numerical experiments show that we can achieve a better mean squared error (MSE) for small rates (bits per quality score) than other lossy compression schemes. For the organism PhiX, whose assembled genome is known and assumed to be correct, we show that it is possible to achieve a significant reduction in size with little compromise in performance on downstream applications (e.g., alignment).QualComp is an open source software package, written in C and freely available for download at https://sourceforge.net/projects/qualcomp.
View details for DOI 10.1186/1471-2105-14-187
View details for Web of Science ID 000321136300001
View details for PubMedID 23758828
View details for PubMedCentralID PMC3698011
-
Real-Time Coding With Limited Lookahead
IEEE TRANSACTIONS ON INFORMATION THEORY
2013; 59 (6): 3582-3606
View details for DOI 10.1109/TIT.2013.2245396
View details for Web of Science ID 000320709800024
-
Multiterminal Source Coding With Action-Dependent Side Information
IEEE TRANSACTIONS ON INFORMATION THEORY
2013; 59 (6): 3653-3667
View details for DOI 10.1109/TIT.2013.2245398
View details for Web of Science ID 000320709800028
-
Directed Information, Causal Estimation, and Communication in Continuous Time
IEEE TRANSACTIONS ON INFORMATION THEORY
2013; 59 (3): 1271-1287
View details for DOI 10.1109/TIT.2012.2227677
View details for Web of Science ID 000315120400004
- Multiterminal Source Coding with Action Dependent Side Information IEEE Trans. Inform. Theory 2013; 59 (6): 3653-3667
-
Secure Source Coding with a Public Helper
IEEE. 2013: 2209-+
Abstract
We consider secure multi-terminal source coding problems in the presence of a public helper. Two main scenarios are studied: 1) source coding with a helper where the coded side information from the helper is eavesdropped by an external eavesdropper, 2) triangular source coding with a helper where the helper is considered as a public terminal. We are interested in how the helper can support the source transmission subject to a constraint on the amount of information leaked due to its public nature. We characterize the tradeoff between transmission rate, incurred distortion, and information leakage rate at the helper/eavesdropper in the form of a rate-distortion-leakage region for various classes of problems.
View details for PubMedID 29398720
-
Efficient Similarity Queries via Lossy Compression
51st Annual Allerton Conference on Communication, Control, and Computing
IEEE. 2013: 883–889
View details for Web of Science ID 000350802400122
-
Unsupervised Learning and Universal Communication
IEEE International Symposium on Information Theory Proceedings (ISIT)
IEEE. 2013: 261–265
View details for Web of Science ID 000348913400053
-
QualComp: a new lossy compressor for quality scores based on rate distortion theory.
BMC bioinformatics
2013; 14: 187-?
Abstract
Next Generation Sequencing technologies have revolutionized many fields in biology by reducing the time and cost required for sequencing. As a result, large amounts of sequencing data are being generated. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users. This data consists of both the nucleotide sequences and per-base quality scores that indicate the level of confidence in the readout of these sequences. Quality scores account for about half of the required disk space in the commonly used FASTQ format (before compression), and therefore the compression of the quality scores can significantly reduce storage requirements and speed up analysis and transmission of sequencing data.In this paper, we present a new scheme for the lossy compression of the quality scores, to address the problem of storage. Our framework allows the user to specify the rate (bits per quality score) prior to compression, independent of the data to be compressed. Our algorithm can work at any rate, unlike other lossy compression algorithms. We envisage our algorithm as being part of a more general compression scheme that works with the entire FASTQ file. Numerical experiments show that we can achieve a better mean squared error (MSE) for small rates (bits per quality score) than other lossy compression schemes. For the organism PhiX, whose assembled genome is known and assumed to be correct, we show that it is possible to achieve a significant reduction in size with little compromise in performance on downstream applications (e.g., alignment).QualComp is an open source software package, written in C and freely available for download at https://sourceforge.net/projects/qualcomp.
View details for DOI 10.1186/1471-2105-14-187
View details for PubMedID 23758828
-
The Role of Lookahead in Estimation under Gaussian Noise
IEEE International Symposium on Information Theory Proceedings (ISIT)
IEEE. 2013: 2850–2854
View details for Web of Science ID 000348913402197
-
Minimax Filtering Regret via Relations Between Information and Estimation
IEEE International Symposium on Information Theory Proceedings (ISIT)
IEEE. 2013: 444–448
View details for Web of Science ID 000348913400090
-
Complexity and Rate-Distortion Tradeoff via Successive Refinement
51st Annual Allerton Conference on Communication, Control, and Computing
IEEE. 2013: 1531–1536
View details for Web of Science ID 000350802400213
-
Compression for Exact Match Identification
IEEE International Symposium on Information Theory Proceedings (ISIT)
IEEE. 2013: 654–658
View details for Web of Science ID 000348913400132
-
Pointwise Relations between Information and Estimation in the Poisson Channel
IEEE International Symposium on Information Theory Proceedings (ISIT)
IEEE. 2013: 449–453
View details for Web of Science ID 000348913400091
-
Network Compression: Worst-Case Analysis
IEEE International Symposium on Information Theory Proceedings (ISIT)
IEEE. 2013: 196–200
View details for Web of Science ID 000348913400040
-
Capacity of a POST Channel with and without Feedback
IEEE International Symposium on Information Theory Proceedings (ISIT)
IEEE. 2013: 2538–2542
View details for Web of Science ID 000348913402134
- Relations between Information and Estimation in the presence of Feedback (invited contribution to) Information and Control in Networks (accepted) Cambridge University Press. 2013
-
Quadratic Similarity Queries on Compressed Data
Data Compression Conference (DCC)
IEEE. 2013: 441–450
View details for DOI 10.1109/DCC.2013.52
View details for Web of Science ID 000325712000045
- On Real Time Coding with Limited Lookahead IEEE Trans. Inform. Theory 2013; 59 (6): 3582-3606
-
On Information, Estimation and Lookahead
50th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
IEEE. 2013: 1292–1299
View details for Web of Science ID 000320654000176
- Estimation with a Helper who Knows the Interference IEEE Trans. Inform. Theory 2013; 59 (11): 7097-7117
- QualComp: a new lossy compressor for quality scores based on rate distortion theory BMC Bioinformatics 2013: 14:187
- Achievable Error Exponents in the Gaussian Channel with Rate-Limited Feedback IEEE Trans. Inform. Theory 2013; 59 (12): 8144-8156
-
Distortion Rate Function of Sub-Nyquist Sampled Gaussian Sources Corrupted by Noise
51st Annual Allerton Conference on Communication, Control, and Computing
IEEE. 2013: 901–908
View details for Web of Science ID 000350802400125
-
Reliable Uncoded Communication in the Underdetermined SIMO MAC with Low-Complexity Decoding
51st Annual Allerton Conference on Communication, Control, and Computing
IEEE. 2013: 1075–1081
View details for Web of Science ID 000350802400148
-
Reliable Uncoded Communication in the SIMO MAC via Low-Complexity Decoding
IEEE International Symposium on Information Theory Proceedings (ISIT)
IEEE. 2013: 1645–1649
View details for Web of Science ID 000348913401156
-
Operational Extremality of Gaussianity in Network Compression, Communication, and Coding
IEEE Information Theory Workshop (ITW)
IEEE. 2013
View details for Web of Science ID 000330643200008
-
Uncoded transmission in MAC channels achieves arbitrarily small error probability
50th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
IEEE. 2013: 1983–1990
View details for Web of Science ID 000320654000274
- The Human Genome Contracts Again Bioinformatics 2013
-
Pointwise Relations Between Information and Estimation in Gaussian Noise
IEEE TRANSACTIONS ON INFORMATION THEORY
2012; 58 (10): 6264-6281
View details for DOI 10.1109/TIT.2012.2206794
View details for Web of Science ID 000308965600002
-
Achievable Complexity-Performance Tradeoffs in Lossy Compression
PROBLEMS OF INFORMATION TRANSMISSION
2012; 48 (4): 352-375
View details for DOI 10.1134/S0032946012040060
View details for Web of Science ID 000314036400006
-
An MCMC Approach to Universal Lossy Compression of Analog Sources
IEEE TRANSACTIONS ON SIGNAL PROCESSING
2012; 60 (10): 5230-5240
View details for DOI 10.1109/TSP.2012.2206585
View details for Web of Science ID 000308963300018
-
Block and Sliding-Block Lossy Compression via MCMC
IEEE TRANSACTIONS ON COMMUNICATIONS
2012; 60 (8): 2187-2198
View details for DOI 10.1109/TCOMM.2012.061412.110194
View details for Web of Science ID 000309203600017
-
Denoising via MCMC-Based Lossy Compression
IEEE TRANSACTIONS ON SIGNAL PROCESSING
2012; 60 (6): 3092-3100
View details for DOI 10.1109/TSP.2012.2190597
View details for Web of Science ID 000304154500029
-
Cascade and Triangular Source Coding With Side Information at the First Two Nodes
IEEE TRANSACTIONS ON INFORMATION THEORY
2012; 58 (6): 3339-3349
View details for DOI 10.1109/TIT.2012.2188273
View details for Web of Science ID 000304245100002
-
Lossy Compression of Discrete Sources via the Viterbi Algorithm
IEEE TRANSACTIONS ON INFORMATION THEORY
2012; 58 (4): 2475-2489
View details for DOI 10.1109/TIT.2011.2178059
View details for Web of Science ID 000302079800033
-
Mutual Information, Relative Entropy, and Estimation in the Poisson Channel
IEEE TRANSACTIONS ON INFORMATION THEORY
2012; 58 (3): 1302-1318
View details for DOI 10.1109/TIT.2011.2172572
View details for Web of Science ID 000300845900003
-
Reference Based Genome Compression
IEEE Information Theory Workshop (ITW)
IEEE. 2012: 427–431
View details for Web of Science ID 000313526400088
-
The Degraded Broadcast Channel with Action-Dependent States
IEEE International Symposium on Information Theory
IEEE. 2012: 596–600
View details for Web of Science ID 000312544300122
-
Estimation with a Helper Who Knows the Interference
IEEE International Symposium on Information Theory
IEEE. 2012: 706–710
View details for Web of Science ID 000312544300144
-
Multiterminal Source Coding under Logarithmic Loss
IEEE International Symposium on Information Theory
IEEE. 2012: 761–765
View details for Web of Science ID 000312544300155
-
Universal Estimation of Directed Information via Sequential Probability Assignments
IEEE International Symposium on Information Theory
IEEE. 2012: 523–527
View details for Web of Science ID 000312544300107
-
Worst-Case Source for Distributed Compression with Quadratic Distortion
IEEE Information Theory Workshop (ITW)
IEEE. 2012: 187–191
View details for Web of Science ID 000313526400040
-
Successive Refinement with Cribbing Decoders and its Channel Coding Duals
IEEE International Symposium on Information Theory
IEEE. 2012
View details for Web of Science ID 000312544301093
-
Cascade, Triangular, and Two-Way Source Coding With Degraded Side Information at the Second User
IEEE TRANSACTIONS ON INFORMATION THEORY
2012; 58 (1): 189-206
View details for DOI 10.1109/TIT.2011.2167738
View details for Web of Science ID 000298989200016
-
Pointwise Relations between Information and Estimation in Gaussian Noise
IEEE International Symposium on Information Theory
IEEE. 2012: 701–705
View details for Web of Science ID 000312544300143
-
Joint Source-Channel Coding of one Random Variable over the Poisson Channel
IEEE International Symposium on Information Theory
IEEE. 2012
View details for Web of Science ID 000312544302044
-
The Porosity of Additive Noise Sequences
IEEE International Symposium on Information Theory
IEEE. 2012
View details for Web of Science ID 000312544302136
- Achievable Complexity-Performance Tradeoffs in Lossy Compression Problems of Information Transmission 2012; 48 (4): 352-375
- An MCMC Approach to Universal Lossy Compression of Analog Sources IEEE Trans. Sig. Proc. 2012; 60 (10): 5230-5240
- Cascade and Triangular Source Coding with Side Information at the First Two Nodes IEEE Trans. Inform. Theory 2012; 58 (6): 3309-3349
- Denoising via MCMC-based Lossy Compression IEEE Trans. Sig. Proc. 2012; 60 (6): 3092-3100
- Block and Sliding-Block Lossy Compression via MCMC IEEE Trans. on Communications 2012; 60 (8): 2187-2198
- Mutual Information, Relative Entropy, and Estimation in the Poisson Channel IEEE Trans. Inform. Theory 2012; 58 (3): 1302-1318
- Cascade, Triangular and Two Way Source Coding with Degraded Side Information at the Second User IEEE Trans. Inform. Theory 2012; 58 (1): 189-206
-
Probing Capacity
IEEE TRANSACTIONS ON INFORMATION THEORY
2011; 57 (11): 7317-7332
View details for DOI 10.1109/TIT.2011.2162089
View details for Web of Science ID 000297046100008
-
Source Coding With a Side Information "Vending Machine"
IEEE TRANSACTIONS ON INFORMATION THEORY
2011; 57 (7): 4530-4544
View details for DOI 10.1109/TIT.2011.2145510
View details for Web of Science ID 000291922500035
-
Interpretations of Directed Information in Portfolio Theory, Data Compression, and Hypothesis Testing
IEEE TRANSACTIONS ON INFORMATION THEORY
2011; 57 (6): 3248-3259
View details for DOI 10.1109/TIT.2011.2136270
View details for Web of Science ID 000291003900007
-
Error Exponents for the Gaussian Channel With Active Noisy Feedback
IEEE TRANSACTIONS ON INFORMATION THEORY
2011; 57 (3): 1223-1236
View details for DOI 10.1109/TIT.2011.2104991
View details for Web of Science ID 000287657200003
-
Cascade and Triangular source coding with causal side information
IEEE International Symposium on Information Theory (ISIT)
IEEE. 2011: 1683–1687
View details for Web of Science ID 000297465101175
-
Mutual Information, Relative Entropy, and Estimation in the Poisson Channel
IEEE International Symposium on Information Theory (ISIT)
IEEE. 2011: 708–712
View details for Web of Science ID 000297465100142
-
Discrete denoising of heterogeneous two-dimensional data
IEEE International Symposium on Information Theory (ISIT)
IEEE. 2011: 1041–1045
View details for Web of Science ID 000297465101045
-
Continuous-Time Directed Information and Its Role in Communication
IEEE Information Theory Workshop (ITW)
IEEE. 2011
View details for Web of Science ID 000299416200039
- Bounds on the Entropy Rate of Binary Hidden Markov Processes Entropy of Hidden Markov Processes and Connections to Dynamical Systems edited by Marcus, B., Petersen, K., Weissman, T. Cambridge University Press. 2011: 117–171
- Entropy of Hidden Markov Processes and Connections to Dynamical Systems edited by Marcus, B., Petersen, K., Weissman, T. Cambridge University Press. 2011
-
Multi-terminal Source Coding With Action Dependent Side Information
IEEE International Symposium on Information Theory (ISIT)
IEEE. 2011
View details for Web of Science ID 000297465102063
-
To Feed or Not to Feed Back
IEEE International Symposium on Information Theory (ISIT)
IEEE. 2011: 159–163
View details for Web of Science ID 000297465100033
- Error Exponents for the Gaussian Channel with Active Noisy Feedback IEEE Trans. Inform. Theory 2011; 57 (3): 1223-1236
-
Capacity of Channels With Action-Dependent States
IEEE TRANSACTIONS ON INFORMATION THEORY
2010; 56 (11): 5396-5411
View details for DOI 10.1109/TIT.2010.2068991
View details for Web of Science ID 000283449000003
-
The Relationship Between Causal and Noncausal Mismatched Estimation in Continuous-Time AWGN Channels
IEEE TRANSACTIONS ON INFORMATION THEORY
2010; 56 (9): 4256-4273
View details for DOI 10.1109/TIT.2010.2054430
View details for Web of Science ID 000283072400006
-
Tighter Bounds on the Capacity of Finite-State Channels Via Markov Set-Chains
IEEE TRANSACTIONS ON INFORMATION THEORY
2010; 57 (8): 3660-3691
View details for DOI 10.1109/TIT.2010.2050825
View details for Web of Science ID 000282001700003
-
Two-Way Source Coding With a Helper
IEEE TRANSACTIONS ON INFORMATION THEORY
2010; 56 (6): 2905-2919
View details for DOI 10.1109/TIT.2010.2046238
View details for Web of Science ID 000277880200031
-
Universal Reinforcement Learning
IEEE TRANSACTIONS ON INFORMATION THEORY
2010; 56 (5): 2441-2454
View details for DOI 10.1109/TIT.2010.2043762
View details for Web of Science ID 000278067900026
-
A Universal Scheme for Wyner-Ziv Coding of Discrete Sources
IEEE TRANSACTIONS ON INFORMATION THEORY
2010; 56 (4): 1737-1750
View details for DOI 10.1109/TIT.2010.2040889
View details for Web of Science ID 000275999500023
-
Universal Estimation of Directed Information
2010 IEEE International Symposium on Information Theory
IEEE. 2010: 1433–1437
View details for Web of Science ID 000287512700289
-
An MCMC Approach to Lossy Compression of Continuous Sources
IEEE COMPUTER SOC. 2010: 40–48
View details for DOI 10.1109/DCC.2010.11
View details for Web of Science ID 000397228500005
-
Universal Lossless Compression-based Denoising
2010 IEEE International Symposium on Information Theory
IEEE. 2010: 1648–1652
View details for Web of Science ID 000287512700332
- The Relationship Between Causal and Non-Causal Mismatched Estimation in Continuous-Time AWGN Channels IEEE Trans. Inform. Theory 2010; 56 (9): 4256-4273
- A Universal Scheme for Wyner-Ziv Coding of Discrete Sources IEEE Trans. Inform. Theory 2010; 56 (4): 1737-1750
- Capacity of Channels with Action-Dependent States IEEE Trans. Inform. Theory 2010; 56 (11): 5396-5411
-
Cascade and Triangular Source Coding with Side Information at the First Two Nodes
2010 IEEE International Symposium on Information Theory
IEEE. 2010: 31–35
View details for Web of Science ID 000287512700007
-
Discrete Denoising With Shifts
IEEE TRANSACTIONS ON INFORMATION THEORY
2009; 55 (11): 5284-5301
View details for DOI 10.1109/TIT.2009.2030461
View details for Web of Science ID 000271019700036
-
A Context Quantization Approach to Universal Denoising
IEEE TRANSACTIONS ON SIGNAL PROCESSING
2009; 57 (6): 2110-2129
View details for DOI 10.1109/TSP.2008.2011847
View details for Web of Science ID 000266333200007
-
Capacity Region of the Finite-State Multiple-Access Channel With and Without Feedback
IEEE International Symposium on Information Theory
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2009: 2455–77
View details for DOI 10.1109/TIT.2009.2018346
View details for Web of Science ID 000266878500003
-
Universal FIR MMSE Filtering
IEEE TRANSACTIONS ON SIGNAL PROCESSING
2009; 57 (3): 1068-1083
View details for DOI 10.1109/TSP.2008.2009894
View details for Web of Science ID 000263431900021
-
Finite State Channels With Time-Invariant Deterministic Feedback
IEEE TRANSACTIONS ON INFORMATION THEORY
2009; 55 (2): 644-662
View details for DOI 10.1109/TIT.2008.2009849
View details for Web of Science ID 000263375500015
- A Context Quantization Approach to Universal denoising IEEE Trans. Sig. Proc. 2009; 57 (6): 2110-2129
-
Two-way source coding with a common helper
IEEE International Symposium on Information Theory (ISIT 2009)
IEEE. 2009: 1473–1477
View details for Web of Science ID 000280141401006
-
Multiple Description Coding of Discrete Ergodic Sources
47th Annual Allerton Conference on Communication, Control, and Computing
IEEE. 2009: 1256–1261
View details for Web of Science ID 000279627100172
-
Directed Information, Causal Estimation, and Communication in Continuous Time
7th International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless
IEEE. 2009: 632–638
View details for Web of Science ID 000275175100084
-
Source Coding with a Side Information 'Vending Machine' at the Decoder
IEEE International Symposium on Information Theory (ISIT 2009)
IEEE. 2009: 1030–1034
View details for Web of Science ID 000280141400210
-
Directed Information and Causal Estimation in Continuous Time
IEEE International Symposium on Information Theory (ISIT 2009)
IEEE. 2009: 819–823
View details for Web of Science ID 000280141400167
-
Capacity of Channels with Action-Dependent States
IEEE International Symposium on Information Theory (ISIT 2009)
IEEE. 2009: 1794–1798
View details for Web of Science ID 000280141401072
- Finite-state channels with time-invariant deterministic feedback IEEE Trans. Inform. Theory 2009; 55 (2): 644-662
-
Problems We Can Solve With a Helper
IEEE Information Theory Workshop on Networking and Information Theory
IEEE. 2009: 266–270
View details for Web of Science ID 000273966100056
-
An Outer Bound for Side-Information Scalable Source Coding with Partially Cooperating Decoders
IEEE International Symposium on Information Theory (ISIT 2009)
IEEE. 2009: 1194–1198
View details for Web of Science ID 000280141400243
-
An Iterative Scheme for Near Optimal and Universal Lossy Compression
IEEE Information Theory Workshop on Networking and Information Theory
IEEE. 2009: 231–235
View details for Web of Science ID 000273966100049
- Discrete Denoising with Shifts IEEE Trans. Inform. Theory 2009; 55 (11): 5284-5301
- Capacity Region of the Multiple Access Channel With or Without Feedback IEEE Trans. Inform. Theory 2009; 55 (6): 2455-2477
-
Universal Denoising of Discrete-Time Continuous-Amplitude Signals
IEEE TRANSACTIONS ON INFORMATION THEORY
2008; 54 (12): 5632-5660
View details for DOI 10.1109/TIT.2008.2006438
View details for Web of Science ID 000261648100022
-
Scanning and Sequential Decision Making for Multidimensional Data-Part II: The Noisy Case
IEEE TRANSACTIONS ON INFORMATION THEORY
2008; 54 (12): 5609-5631
View details for DOI 10.1109/TIT.2008.2006378
View details for Web of Science ID 000261648100021
-
The Information Lost in Erasures
IEEE International Symposium on Information Theory
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2008: 5030–58
View details for DOI 10.1109/TIT.2008.929968
View details for Web of Science ID 000260426400016
-
Coding for additive white noise channels with feedback corrupted by quantization or bounded noise
IEEE TRANSACTIONS ON INFORMATION THEORY
2008; 54 (9): 4274-4282
View details for DOI 10.1109/TIT.2008.928289
View details for Web of Science ID 000258913400028
-
How to filter an "individual sequence with feedback"
44th Annual Allerton Conference on Communication, Control and Computing
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2008: 3831–41
View details for DOI 10.1109/TIT.2008.926457
View details for Web of Science ID 000257861400047
-
Capacity of the trapdoor channel with feedback
IEEE International Symposium on Information Theory
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2008: 3150–65
View details for DOI 10.1109/TIT.2008.924681
View details for Web of Science ID 000257111500021
-
Universal filtering via hidden Markov modeling
IEEE International Symposium on Information Theory and Its Applications
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2008: 692–708
View details for DOI 10.1109/TIT.2007.913220
View details for Web of Science ID 000252612200012
- Discrete Universal Filtering via Hidden Markov Modeling IEEE Trans. Inform. Theory 2008; 54 (2): 692 - 708
-
On Successive Refinement for the Wyner-Ziv Problem with Partially Cooperating Decoders
2008 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY PROCEEDINGS, VOLS 1-6
2008: 2172-?
View details for Web of Science ID 000260364401158
-
Rate-Distortion in Near-Linear Time
IEEE International Symposium on Information Theory
IEEE. 2008: 847–851
View details for Web of Science ID 000260364400171
-
Near optimal lossy source coding and compression-based denoising via Markov Chain Monte Carlo
42nd Annual Conference on Information Sciences and Systems
IEEE. 2008: 441–446
View details for Web of Science ID 000258931600084
-
Rate-Distortion via Markov Chain Monte Carlo
IEEE International Symposium on Information Theory
IEEE. 2008: 852–856
View details for Web of Science ID 000260364400172
-
Lossy source coding via Markov chain Monte Carlo
International Zurich Seminar on Communications
IEEE. 2008: 80–83
View details for Web of Science ID 000255784400021
-
On the Capacity of Finite-State Channels
IEEE International Symposium on Information Theory
IEEE. 2008: 1223–1227
View details for Web of Science ID 000260364400247
- Universal denoising of discretetime continuous-amplitude signals IEEE Trans. Inform. Theory 2008; 54 (12): 5632-5660
- The Information Lost in Erasures IEEE Trans. Inform. Theory 2008; 54 (11): 5030-5058
- How to filter an 'individual sequence with feedback' IEEE Trans. Inform. Theory 2008; 54 (8): 3831-3841
- Universal FIR MMSE filtering IEEE Trans. Sig. Proc. 2008; 57 (3): 1068-1083
- Coding Schemes for Additive White Noise Channels with Feedback Corrupted by Quantization or Bounded Noise IEEE Trans. Inform. Theory 2008; 54 (9): 4274-4282
-
RATE-DISTORTION WITH COMMON RATE-LIMITED SIDE INFORMATION TO THE ENCODER AND DECODER
47th AIAA Aerospace Sciences Meeting and Exhibit including the New Horizons Forum and Aerospace Exposition
IEEE. 2008: 777–779
View details for Web of Science ID 000265073200166
-
On Directed Information and Gambling
IEEE International Symposium on Information Theory
IEEE. 2008: 1403–1407
View details for Web of Science ID 000260364401001
-
New Bounds for the Capacity Region of the Finite-State Multiple Access Channel
IEEE International Symposium on Information Theory
IEEE. 2008: 394–398
View details for Web of Science ID 000260364400080
- Scanning and sequential decision making for multi-dimensional data, Part II: the noisy case IEEE Trans. Inform. Theory 2008; 54 (12): 5609-5631
-
An Implementable Scheme for Universal Lossy Compression of Discrete Markov Sources
19th Data Compression Conference
IEEE COMPUTER SOC. 2008: 292–301
View details for DOI 10.1109/DCC.2009.72
View details for Web of Science ID 000268029900030
-
Scanning and sequential decision making for multidimensional data - Part I : The noiseless case
IEEE International Symposium on Information Theory
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2007: 3001–20
View details for DOI 10.1109/TIT.2007.903117
View details for Web of Science ID 000249025100001
-
Universal filtering via prediction
Data Compression Conference (DCC 2004)
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2007: 1253–64
View details for DOI 10.1109/TIT.2007.892782
View details for Web of Science ID 000245306000001
-
Denoising and filtering under the probability of excess loss criterion
43rd Annual Allerton Conference on Communication, Control and Computing
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2007: 1265–81
View details for DOI 10.1109/TIT.2007.892772
View details for Web of Science ID 000245306000002
-
Scanning and sequential decision making for multidimensional data
IEEE Information Theory and Applications Workshop
IEEE. 2007: 99–108
View details for Web of Science ID 000250812500016
-
Competitive on-line linear FIR MMSE filtering
IEEE International Symposium on Information Theory
IEEE. 2007: 1126–1130
View details for Web of Science ID 000257010201057
-
A context quantization approach to universal denoising
IEEE International Symposium on Information Theory
IEEE. 2007: 2361–2365
View details for Web of Science ID 000257010202134
- Scanning and sequential decision making for multi-dimensional data, Part I: the noiseless case IEEE Trans. Inform. Theory 2007; 53 (9): 3001 - 3020
-
New bounds on the rate-distortion function of a binary Markov source
IEEE International Symposium on Information Theory
IEEE. 2007: 571–575
View details for Web of Science ID 000257010200115
-
A universal Wyner-Ziv scheme for discrete sources
IEEE International Symposium on Information Theory
IEEE. 2007: 1951–1955
View details for Web of Science ID 000257010202052
-
On separation in the presence of feedback
IEEE Information Theory Workshop
IEEE. 2007: 266–270
View details for Web of Science ID 000253441300046
- Reflections on the DUDE IEEE Information Theory Society Newsletter 2007; 57 (2): 5-10
- Denoising and filtering under the probability of excess loss criterion IEEE Trans. Inform. Theory 2007; 53 (4): 1265 - 1281
-
Scanning, filtering and prediction for random fields corrupted by Gaussian noise
IEEE International Symposium on Information Theory
IEEE. 2007: 691–695
View details for Web of Science ID 000257010200139
-
The Gaussian channel with noisy feedback
IEEE International Symposium on Information Theory
IEEE. 2007: 1416–1420
View details for Web of Science ID 000257010201115
-
Capacity and zero-error capacity of the chemical channel with feedback
IEEE International Symposium on Information Theory
IEEE. 2007: 1866–1870
View details for Web of Science ID 000257010202035
-
Source coding with limited-look-ahead side information at the decoder
IEEE International Symposium on Information Theory
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2006: 5218–39
View details for DOI 10.1109/TIT.2006.885500
View details for Web of Science ID 000242503300003
-
Universal zero-delay joint source-channel coding
38th Conference on Information Sciences and Systems
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2006: 5240–50
View details for DOI 10.1109/TIT.2006.885537
View details for Web of Science ID 000242503300004
-
On the entropy rate of pattern processes
IEEE TRANSACTIONS ON INFORMATION THEORY
2006; 52 (9): 3994-4007
View details for DOI 10.1109/TIT.2006.880044
View details for Web of Science ID 000240076700008
-
Coding for the feedback Gel'fand-Pinsker channel and the feedforward Wyner-Ziv source
IEEE International Symposium on Information Theory and Its Applications
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2006: 4207–11
View details for Web of Science ID 000240076700023
-
Universal minimax discrete denoising under channel uncertainty
IEEE International Symposium on Information Theory
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2006: 3476–97
View details for DOI 10.1109/TIT.2006.878234
View details for Web of Science ID 000239408700006
-
Algorithms for discrete denoising under channel uncertainty
IEEE TRANSACTIONS ON SIGNAL PROCESSING
2006; 54 (6): 2263-2276
View details for DOI 10.1109/TSP.2006.874295
View details for Web of Science ID 000237900000026
-
Source coding with limited side information lookahead at the decoder
IEEE International Symposium on Information Theory
IEEE. 2006: 2441–2445
View details for Web of Science ID 000245289705001
-
Universal scanning and sequential decision making for multidimensional data
IEEE International Symposium on Information Theory
IEEE. 2006: 431–435
View details for Web of Science ID 000245289700089
-
Erasure entropy
IEEE International Symposium on Information Theory
IEEE. 2006: 98–102
View details for Web of Science ID 000245289700021
-
Universal scanning of mixing ra the performance of the peano ndom fields and Peano-Hilbert scan
24th IEEE Convention of Electrical and Electronics Engineers in Israel
IEEE. 2006: 62–66
View details for Web of Science ID 000246334200014
-
On the optimality of symbol-by-symbol filtering and denoising
IEEE International Symposium on Information Theory
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2006: 19–40
View details for DOI 10.1109/TIT.2005.860432
View details for Web of Science ID 000234412500002
- Universal Zero-Delay Joint Source-Channel Coding IEEE Trans. Inform. Theory 2006; 52 (12): 5240 - 5250
- Universal Minimax Discrete Denoising under Channel Uncertainty IEEE Trans. Inform. Theory 2006; 52 (8): 3476-3497
- Source Coding with Limited Side Information Lookahead at the Decoder IEEE Trans. Inform. Theory 2006; 52 (12): 5218 - 5239
- 2006 Kailath Lecture and Colloquia Newsletter article in the IEEE Information Theory Society Newsletter. 2006
- On the Entropy Rate of Pattern Processes IEEE Trans. Inform. Theory 2006; 52 (9): 3994 - 4007
- Algorithms for Discrete Denoising under Channel Uncertainty IEEE Trans. Inform. Theory 2006; 54 (6): 2263-2276
- On the Optimality of Symbol by Symbol Fitering and Denoising IEEE Trans. Inform. Theory 2006; 52 (1): 19-40
- Coding for the feedback Gel'fand-Pinsker channel and the feedforward Wyner-Ziv source IEEE Trans. Inform. Theory 2006; 52 (9): 4207 - 4211
-
Compound sequential decisions against the well-informed antagonist
IEEE Information Theory Workshop
IEEE. 2006: 77–81
View details for Web of Science ID 000245205900017
-
Universal denoising of continuous amplitude signals with applications to images
IEEE International Conference on Image Processing (ICIP 2006)
IEEE. 2006: 2609–2612
View details for Web of Science ID 000245768501271
-
Bounds on the error exponent of the AWGN channel with AWGN-corrupted feedback
24th IEEE Convention of Electrical and Electronics Engineers in Israel
IEEE. 2006: 184–188
View details for Web of Science ID 000246334200042
-
Capacity of finite-state channels with time-invariant deterministic feedback
IEEE International Symposium on Information Theory
IEEE. 2006: 64–68
View details for Web of Science ID 000245289700014
-
Universal denoising of discrete-time continuous-amplitude signals
IEEE International Symposium on Information Theory
IEEE. 2006: 2531–2535
View details for Web of Science ID 000245289705020
-
The empirical distribution of rate-constrained source codes
IEEE TRANSACTIONS ON INFORMATION THEORY
2005; 51 (11): 3718-3733
View details for DOI 10.1109/TIT.2005.856982
View details for Web of Science ID 000233047800002
-
On causal source codes with side information
40th Annual Allerton Conference on Communication, Control and Computing
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2005: 4003–13
View details for DOI 10.1109/TIT.2005.856978
View details for Web of Science ID 000233047800025
-
Universal denoising for the finite-input general-output channel
IEEE International Symposium on Information Theory
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2005: 1507–17
View details for DOI 10.1109/TIT.2005.844104
View details for Web of Science ID 000228050800020
- Universal Denoising for the Finite-Input- General-Output Channel IEEE Trans. Inform. Theory 2005; 51 (4): 1507-1517
-
DISCRETE DENOISING FOR CHANNELS WITH MEMORY
COMMUNICATIONS IN INFORMATION AND SYSTEMS
2005; 5 (2): 257–88
View details for Web of Science ID 000213658900005
-
On the entropy rate of pattern processes
IEEE Data Compression Conference
IEEE COMPUTER SOC. 2005: 233–242
View details for Web of Science ID 000229070000024
-
Universal discrete denoising: Known channel
IEEE International Symposium on Information Theory
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2005: 5–28
View details for DOI 10.1109/TIT.2004.839518
View details for Web of Science ID 000226179300001
-
On the relationship between process and pattern entropy rate
IEEE International Symposium on Information Theory and Its Applications
IEEE. 2005: 2208–2212
View details for Web of Science ID 000234713801208
-
Coding for the feedback Gel'fand-Pinsker channel and the feedforward Wyner-Ziv source
IEEE International Symposium on Information Theory and Its Applications
IEEE. 2005: 1506–1510
View details for Web of Science ID 000234713801064
-
Approximations for the entropy rate of a hidden Markov process
IEEE International Symposium on Information Theory and Its Applications
IEEE. 2005: 2198–2202
View details for Web of Science ID 000234713801206
-
Multi-directional context sets with applications to universal denoising and compression
IEEE International Symposium on Information Theory and Its Applications
IEEE. 2005: 1270–1274
View details for Web of Science ID 000234713801015
-
Discrete universal filtering via hidden Markov modelling
IEEE International Symposium on Information Theory and Its Applications
IEEE. 2005: 1285–1289
View details for Web of Science ID 000234713801018
- On Causal Source Codes with Side Information IEEE Trans. Inform. Theory 2005; 51 (11): 4003-4013
- Discrete Denoising for Channels with Memory Comm. in Information and Systems 2005; 5 (2): 257-288
- The empirical distribution of rate-constrained codes IEEE Trans. Inform. Theory 2005; 51 (11): 3718-3733
- Universal Discrete Denoising: Known Channel IEEE Trans. Inform. Theory 2005; 51 (1): 5-28
-
A universal scheme for learning
IEEE International Symposium on Information Theory and Its Applications
IEEE. 2005: 1158–1162
View details for Web of Science ID 000234713800244
-
Asymptotic filtering and entropy rate of a hidden Markov process in the rare transitions regime
IEEE International Symposium on Information Theory and Its Applications
IEEE. 2005: 1838–1842
View details for Web of Science ID 000234713801132
-
Universally attainable error exponents for rate-distortion coding of noisy sources
IEEE TRANSACTIONS ON INFORMATION THEORY
2004; 50 (6): 1229-1246
View details for DOI 10.1109/TIT.2004.828061
View details for Web of Science ID 000221803300019
-
Universal prediction of random binary sequences in a noisy environment
ANNALS OF APPLIED PROBABILITY
2004; 14 (1): 54-89
View details for Web of Science ID 000188878000002
-
Universal minimax binary image denoising under channel uncertainty
International Conference on Image Processing (ICIP 2004)
IEEE. 2004: 997–1000
View details for Web of Science ID 000228043501072
-
Universal denoising for the finite-input-general-output channel
IEEE International Symposium on Information Theory
IEEE. 2004: 201–201
View details for Web of Science ID 000223202600201
-
The empirical distribution of rate - Constrained source codes
IEEE International Symposium on Information Theory
IEEE. 2004: 464–464
View details for Web of Science ID 000223202600464
-
New bounds on the entropy rate of hidden Markov processes
IEEE Information Theory Workshop
IEEE. 2004: 117–122
View details for Web of Science ID 000226087200022
-
Channel decoding of systematically encoded unknown redundant sources
IEEE International Symposium on Information Theory
IEEE. 2004: 165–165
View details for Web of Science ID 000223202600165
-
Efficient pruning of bi-directional context trees with applications to universal denoising and compression
IEEE Information Theory Workshop
IEEE. 2004: 94–98
View details for Web of Science ID 000226087200018
-
On the optimality of symbol by symbol filtering and denoising
IEEE International Symposium on Information Theory
IEEE. 2004: 200–200
View details for Web of Science ID 000223202600200
-
Discrete universal filtering through incremental parsing
Data Compression Conference (DCC 2004)
IEEE COMPUTER SOC. 2004: 352–361
View details for Web of Science ID 000189502800036
-
Universal minimax discrete denoising under channel uncertainty
IEEE International Symposium on Information Theory
IEEE. 2004: 199–199
View details for Web of Science ID 000223202600199
- On the Entropy Rate of Pattern Sequences HP Laboratories Technical Report 2004
- Universally Attainable Error-Exponents for Rate-Distortion Coding of Noisy Sources IEEE Trans. Inform. Theory 2004; 50 (6): 1229-1246
- Universal prediction of random binary sequences in a noisy environment Annals of Applied Probability 2004; 14 (1): 54-89
-
On competitive prediction and its relation to rate-distortion theory
IEEE TRANSACTIONS ON INFORMATION THEORY
2003; 49 (12): 3185-3194
View details for DOI 10.1109/TIT.2003.820014
View details for Web of Science ID 000187958600005
-
The minimax distortion redundancy in noisy source coding
IEEE TRANSACTIONS ON INFORMATION THEORY
2003; 49 (11): 3020-3030
View details for DOI 10.1109/TIT.2003.819336
View details for Web of Science ID 000186618500022
- Scanning and Prediction in Multi-Dimensional Data Arrays IEEE Trans. Inform. Theory 2003; IT-49 (1): 65-82
-
Scanning and prediction in multidimensional data arrays
IEEE International Symposium on Information Theory
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2003: 65–82
View details for DOI 10.1109/TIT.2002.806134
View details for Web of Science ID 000180370400006
-
A discrete universal denoiser and its application to binary images
IEEE International Conference on Image Processing
IEEE. 2003: 117–120
View details for Web of Science ID 000187010000030
-
Universally attainable error-exponents for rate-distortion coding of noisy sources
IEEE International Symposium on Information Theory
IEEE. 2003: 194–194
View details for Web of Science ID 000186112600194
-
On competitive prediction and its relation to rate-distortion theory and to channel capacity theory
IEEE International Symposium on Information Theory
IEEE. 2003: 81–81
View details for Web of Science ID 000186112600081
- On Competitive Prediction and its Relation to Rate-Distortion Theory IEEE Trans. Inform. Theory 2003; 49 (12): 3185-3194
-
On limited-delay lossy coding and filtering of individual sequences
IEEE TRANSACTIONS ON INFORMATION THEORY
2002; 48 (3): 721-733
View details for Web of Science ID 000173968200012
-
Tradeoffs between the excess-code-length exponent and the excess-distortion exponent in lossy source coding
IEEE TRANSACTIONS ON INFORMATION THEORY
2002; 48 (2): 396-415
View details for Web of Science ID 000173485600004
- Universally Attainable Error-Exponents for Rate-Constrained Denoising of Noisy Sources HP Laboratories Technical Report 2002
- On Competitive Prediction and its Relation to Rate-Distortion Theory and to Channel Capacity Theory Technion – I.I.T., CC Pub., EE Pub. 2002
-
The minimax distortion redundancy in noisy source coding
IEEE International Symposium on Information Theory
IEEE. 2002: 318–318
View details for Web of Science ID 000177912700317
- On limited-delay lossy coding and filtering of individual sequences IEEE Trans. Inform. Theory 2002; IT-48 (3): 721-733
- Tradeoffs between the excess-code-length exponent and the excess-distortion exponent in lossy source coding IEEE Trans. Inform. Theory 2002; IT-48 (2): 396-415
-
Universal prediction of individual binary sequences in the presence of noise
IEEE TRANSACTIONS ON INFORMATION THEORY
2001; 47 (6): 2151-2173
View details for Web of Science ID 000170717700003
-
Twofold universal prediction schemes for achieving the finite-state predictability of a noisy individual binary sequence
IEEE International Symposium on Information Theory
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2001: 1849–66
View details for Web of Science ID 000169487900013
- Twofold universal prediction schemes for achieving the finite-state predictability of a noisy individual binary sequence IEEE Trans. Inform. Theory 2001; IT-47 (5): 1849-1866
- Tradeoffs between the excess-code-length exponent and the excess-distortion exponent in lossy source coding Technical Report, CCIT Pub., EE Pub. 2001
- Universal prediction of binary individual sequences in the presence of noise IEEE Trans. Inform. Theory 2001; IT-47 (6): 2151-2173
- Universal prediction of individual binary sequences in the presence of noise Technion – I.I.T., CC Pub., EE Pub. 1999
- Not all universal codes are pointwise universal unpublished manuscript available upon request and at http://www.stanford.edu/~tsachy/papers.html.