Bio


Tsachy's research focuses on Information Theory, Data Compression and Communications, Statistical Signal Processing, Machine Learning, the interplay between them, and their applications, with recent focus on applications to genomic data compression and processing. He is inventor of several patents and involved in several companies as member of the technical board. IEEE fellow, he serves on the board of governors of the information theory society as well as the editorial boards of the Transactions on Information Theory and Foundations and Trends in Communications and Information Theory. He is founding Director of the Stanford Compression Forum.

Academic Appointments


Administrative Appointments


  • Founder/Director, Stanford Compression Forum (2015 - Present)
  • Academic Affairs Committee, Stanford University (2013 - 2018)
  • Graduate Admissions Committee, Stanford University (2004 - 2012)

Honors & Awards


  • Google Research Award, Google (2016)
  • Yahoo! Faculty Research Award, Yahoo! (2015)
  • Google Research Award, Google (2013)
  • Yahoo! Faculty Research And Engagement Award, Yahoo! (2013)
  • IEEE Fellow Grade, IEEE (2012)
  • Google Research Award, Google (2011)
  • STMicroelectronics Faculty Scholar Award, Stanford University (2011)

Boards, Advisory Committees, Professional Organizations


  • Guest Editor, Proceedings of IEEE (2017 - 2017)
  • Organizer/Technical Program Committee, IEEE (ITW) (2015 - 2015)
  • Technical Program Committee, IEEE (ISIT) (2015 - 2015)
  • Technical Program Committee, IEEE (ISIT) (2014 - 2014)
  • Editorial Board, Foundations and Trends in Communications and Information Theory (2012 - Present)
  • Fellow, IEEE (2012 - Present)
  • Technical Program Committee, IEEE (ISIT) (2012 - 2012)
  • Technical Program Committee, IEEE (ISIT) (2011 - 2011)
  • Editorial Board, Transactions on Information Theory Associate Editor for Shannon Theory (2010 - Present)
  • Organizer/Technical Program Committee, IEEE (ITW) (2010 - 2010)
  • Paper Award Committee, IEEE (2009 - 2009)
  • Technical Program Committee, Data Compression Conference (DCC) (2009 - 2009)
  • Technical Program Committee, IEEE International Symposium on Information Theory (ISIT) (2009 - 2009)
  • Technical Program Committee, Control Over Communication Channels Workshop (ConCom) (2009 - 2009)
  • Technical Program Committee, DCC (Data Compression Conference) (2008 - 2008)
  • Technical Program Committee, IAPR Workshop on Cognitive Information Processing (CIP) (2008 - 2008)
  • Technical Program Committee, IEEE (ISIT) (2007 - 2007)
  • Member, The Mathematical Association of America (MAA) (2006 - Present)
  • Panelist, Allerton Conference (2006 - 2006)
  • Technical Program Committee, IEEE (ISIT) (2005 - 2005)

Professional Education


  • PhD, Technion-Israel Institute of Technology, Electrical Engineering (2001)
  • B.Sc., Technion-Israel Institute of Technology, Electrical Engineering (1997)

Patents


  • Itschak Weissman. "United States Patent US8320687 Universal Lossy Compression Methods", Nov 27, 2012
  • Itschak Weissman. "United States Patent US200404580 Denoising and Error Correction for Finite Input General Output Channel", May 10, 2012
  • Itschak Weissman. "United States Patent US7624009 Method and System for Producing Variable Length Context Models", Nov 24, 2009
  • Itschak Weissman. "United States Patent US7498961 Context Identification Using a Denoised Signal", Mar 3, 2009
  • Itschak Weissman. "United States Patent US7474793 Methods for Compression Using a Denoiser", Jan 6, 2009
  • Itschak Weissman. "United States Patent EP1766832 Discrete Universal Denoising with Error Correction Coding", Dec 10, 2008
  • Itschak Weissman. "United States Patent US7436969 Method and System for Optimizing Denoising Parameters Using Compressibility", Oct 14, 2008
  • Itschak Weissman. "United States Patent US7433427 Enhanced Denoising System Utilizing Incremental Parsing", Oct 7, 2008
  • Itschak Weissman. "United States Patent US7420487 Denoising Video", Sep 2, 2008
  • Itschak Weissman. "United States Patent US7271749 Context-based Denoiser That Simultaneously Updates Probabilities for Multiple Contexts", Sep 18, 2007
  • Itschak Weissman. "United States Patent US7269781 Discrete Universal Denoising with Reliability Information", Sep 11, 2007
  • Itschak Weissman. "United States Patent US7123172 Method and System for Determining an Optimal or Near Optimal Set of Contexts by Constructing a Multi-directional Context Tree", Oct 17, 2006
  • Itschak Weissman. "United States Patent US7047472 Method For Correcting Noise Errors in a Digital Signal", May 16, 2006

Stanford Advisees


All Publications


  • Magnetic DNA random access memory with nanopore readouts and exponentially-scaled combinatorial addressing. Scientific reports Lau, B., Chandak, S., Roy, S., Tatwawadi, K., Wootters, M., Weissman, T., Ji, H. P. 2023; 13 (1): 8514

    Abstract

    The storage of data in DNA typically involves encoding and synthesizing data into short oligonucleotides, followed by reading with a sequencing instrument. Major challenges include the molecular consumption of synthesized DNA, basecalling errors, and limitations with scaling up read operations for individual data elements. Addressing these challenges, we describe a DNA storage system called MDRAM (Magnetic DNA-based Random Access Memory) that enables repetitive and efficient readouts of targeted files with nanopore-based sequencing. By conjugating synthesized DNA to magnetic agarose beads, we enabled repeated data readouts while preserving the original DNA analyte and maintaining data readout quality. MDRAM utilizes an efficient convolutional coding scheme that leverages soft information in raw nanopore sequencing signals to achieve information reading costs comparable to Illumina sequencing despite higher error rates. Finally, we demonstrate a proof-of-concept DNA-based proto-filesystem that enables an exponentially-scalable data address space using only small numbers of targeting primers for assembly and readout.

    View details for DOI 10.1038/s41598-023-29575-z

    View details for PubMedID 37231057

  • Reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach. Scientific reports Meng, Q., Chandak, S., Zhu, Y., Weissman, T. 2023; 13 (1): 2082

    Abstract

    The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files since most existing tools are either general-purpose or specialized for short read data. We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high quality nanopore datasets, NanoSpring, which focuses only on the base sequences in the FASTQ file, uses just 0.35-0.65 bits per base which is 3-6[Formula: see text] lower than general purpose compressors like gzip. NanoSpring is competitive in compression ratio and compression resource usage with the state-of-the-art tool CoLoRd while being significantly faster at decompression when using multiple threads (> 4[Formula: see text] faster decompression with 20 threads). NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring .

    View details for DOI 10.1038/s41598-023-29267-8

    View details for PubMedID 36747011

  • Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS Tandon, P., Chandak, S., Pataranutaporn, P., Liu, Y., Mapuranga, A. M., Maes, P., Weissman, T., Sra, M. 2023; 41 (1): 107-118
  • KmerKeys: a web resource for searching indexed genome assemblies and variants. Nucleic acids research Pavlichin, D. S., Lee, H., Greer, S. U., Grimes, S. M., Weissman, T., Ji, H. P. 2022

    Abstract

    K-mers are short DNA sequences that are used for genome sequence analysis. Applications that use k-mers include genome assembly and alignment. However, the wider bioinformatic use of these short sequences has challenges related to the massive scale of genomic sequence data. A single human genome assembly has billions of k-mers. As a result, the computational requirements for analyzing k-mer information is enormous, particularly when involving complete genome assemblies. To address these issues, we developed a new indexing data structure based on a hash table tuned for the lookup of short sequence keys. This web application, referred to as KmerKeys, provides performant, rapid query speeds for cloud computation on genome assemblies. We enable fuzzy as well as exact sequence searches of assemblies. To enable robust and speedy performance, the website implements cache-friendly hash tables, memory mapping and massive parallel processing. Our method employs a scalable and efficient data structure that can be used to jointly index and search a large collection of human genome assembly information. One can include variant databases and their associated metadata such as the gnomAD population variant catalogue. This feature enables the incorporation of future genomic information into sequencing analysis. KmerKeys is freely accessible at https://kmerkeys.dgi-stanford.org.

    View details for DOI 10.1093/nar/gkac266

    View details for PubMedID 35474383

  • The Human Pangenome Project: a global resource to map genomic diversity. Nature Wang, T., Antonacci-Fulton, L., Howe, K., Lawson, H. A., Lucas, J. K., Phillippy, A. M., Popejoy, A. B., Asri, M., Carson, C., Chaisson, M. J., Chang, X., Cook-Deegan, R., Felsenfeld, A. L., Fulton, R. S., Garrison, E. P., Garrison, N. A., Graves-Lindsay, T. A., Ji, H., Kenny, E. E., Koenig, B. A., Li, D., Marschall, T., McMichael, J. F., Novak, A. M., Purushotham, D., Schneider, V. A., Schultz, B. I., Smith, M. W., Sofia, H. J., Weissman, T., Flicek, P., Li, H., Miga, K. H., Paten, B., Jarvis, E. D., Hall, I. M., Eichler, E. E., Haussler, D., Human Pangenome Reference Consortium 2022; 604 (7906): 437-446

    Abstract

    The human reference genome is the most widely used resource in human genetics and is due for a major update. Its current structure is a linear composite of merged haplotypes from more than 20 people, with a single individual comprising most of the sequence. It contains biases and errors within a framework that does not represent global human genomic variation. A high-quality reference with global representation of common variants, including single-nucleotide variants, structural variants and functional elements, is needed. The Human Pangenome Reference Consortium aims to create a more sophisticated and complete human reference genome with a graph-based, telomere-to-telomere representation of global genomic diversity. Here we leverage innovations in technology, study design and global partnerships with the goalof constructing the highest-possible quality human pangenome reference. Our goal is toimprove data representation and streamline analyses to enable routine assembly of complete diploid genomes. With attention to ethical frameworks, the human pangenome reference will contain a more accurate and diverse representation of global genomic variation, improve gene-disease association studies across populations, expand the scope of genomics research to the most repetitive and polymorphic regions of the genome, and serve as the ultimate genetic resource for future biomedical research and precision medicine.

    View details for DOI 10.1038/s41586-022-04601-8

    View details for PubMedID 35444317

  • Classification and clustering of RNA crosslink-ligation data reveal complex structures and homodimers. Genome research Zhang, M., Hwang, I. T., Li, K., Bai, J., Chen, J., Weisman, T., Zou, J. Y., Lu, Z. 2022

    Abstract

    The recent development and application of methods based on the general principle of crosslinking and proximity ligation (crosslink-ligation) are revolutionizing RNA structure studies in living cells. However, extracting structure information from such data presents unique challenges. Here we introduce a set of computational tools for the systematic analysis of data from a wide variety of crosslink-ligation methods, specifically focusing on read mapping, alignment classification and clustering. We design a new strategy to map short reads with irregular gaps at high sensitivity and specificity. Analysis of previously published data reveals distinct properties and bias caused by the crosslinking reactions. We perform rigorous and exhaustive classification of alignments and discover 8 types of arrangements that provide distinct information on RNA structures and interactions. To deconvolve the dense and intertwined gapped alignments, we develop a net-work/graph-based tool CRSSANT (Crosslinked RNA Secondary Structure Analysis using Network Techniques), which enables clustering of gapped alignments and discovery of new alternative and dynamic conformations. We discover that multiple crosslinking and ligation events can occur on the same RNA, generating multi-segment alignments to report complex high level RNA structures and multi-RNA interactions. We find that alignments with overlapped segments are produced from potential homodimers and develop a new method for their de novo identification. Analysis of overlapping alignments revealed potential new homodimers in cellular noncoding RNAs and RNA virus genomes in the Picornaviridae family. Together, this suite of computational tools enables rapid and efficient analysis of RNA structure and interaction data in living cells.

    View details for DOI 10.1101/gr.275979.121

    View details for PubMedID 35332099

  • An Information-Theoretic Justification for Model Pruning Isik, B., Weissman, T., No, A., Camps-Valls, G., Ruiz, F. J., Valera JMLR-JOURNAL MACHINE LEARNING RESEARCH. 2022
  • Geometric Lower Bounds for Distributed Parameter Estimation Under Communication Constraints IEEE TRANSACTIONS ON INFORMATION THEORY Han, Y., Ozgur, A., Weissman, T. 2021; 67 (12): 8248-8263
  • Optimal Communication Rates and Combinatorial Properties for Common Randomness Generation IEEE TRANSACTIONS ON INFORMATION THEORY Han, Y., Tatwawadi, K., Kurri, G. R., Zhou, Z., Prabhakaran, V. M., Weissman, T. 2021; 67 (12): 7723-7739
  • Reducing latency and bandwidth for video streaming using keypoint extraction and digital puppetry Prabhakar, R., Chandak, S., Chiu, C., Liang, R., Nguyen, H., Tatwawadi, K., Weissman, T., Bilgin, A., Marcellin, M. W., SerraSagrista, J., Storer, J. A. IEEE COMPUTER SOC. 2021: 360
  • MEOW: A Space-Efficient Nonparametric Bid Shading Algorithm Zhang, W., Kitts, B., Han, Y., Zhou, Z., Mao, T., He, H., Pan, S., Flores, A., Gultekin, S., Weissman, T., ASSOC COMP MACHINERY ASSOC COMPUTING MACHINERY. 2021: 3928-3936
  • Optimal Communication Rates and Combinatorial Properties for Common Randomness Generation Han, Y., Tatwawadi, K., Kurri, G. R., Zhou, Z., Prabhakaran, V. M., Weissman, T., IEEE IEEE. 2021: 1931-1936
  • OPTIMAL RATES OF ENTROPY ESTIMATION OVER LIPSCHITZ BALLS ANNALS OF STATISTICS Han, Y., Jiao, J., Weissman, T., Wu, Y. 2020; 48 (6): 3228–50

    View details for DOI 10.1214/19-AOS1927

    View details for Web of Science ID 000598369200006

  • Impact of lossy compression of nanopore raw signal data on basecalling and consensus accuracy Bioinformatics (Oxford, England) Chandak, S. n., Tatwawadi, T. n., Sridhar, S. n., Weissman, T. n. 2020

    Abstract

    Nanopore sequencing provides a real-time and portable solution to genomic sequencing, enabling better assembly, structural variant discovery and modified base detection than second generation technologies. The sequencing process generates a huge amount of data in the form of raw signal contained in fast5 files, which must be compressed to enable efficient storage and transfer. Since the raw data is inherently noisy, lossy compression has potential to significantly reduce space requirements without adversely impacting performance of downstream applications.We explore the use of lossy compression for nanopore raw data using two state-of-the-art lossy time-series compressors, and evaluate the tradeoff between compressed size and basecalling/consensus accuracy. We test several basecallers and consensus tools on a variety of datasets at varying depths of coverage, and conclude that lossy compression can provide 35–50% further reduction in compressed size of raw data over the state-of-the-art lossless compressor with negligible impact on basecalling accuracy (⁠≲0.2% reduction) and consensus accuracy (⁠≲0.002% reduction). In addition, we evaluate the impact of lossy compression on methylation calling accuracy and observe that this impact is minimal for similar reductions in compressed size, although further evaluation with improved benchmark datasets is required for reaching a definite conclusion. The results suggest the possibility of using lossy compression, potentially on the nanopore sequencing device itself, to achieve significant reductions in storage and transmission costs while preserving the accuracy of downstream applications.The code is available at https://github.com/shubhamchandak94/lossy_compression_evaluation.

    View details for DOI 10.1093/bioinformatics/btaa1017

    View details for PubMedID 33325499

  • OVERCOMING HIGH NANOPORE BASECALLER ERROR RATES FOR DNA STORAGE VIA BASECALLER-DECODER INTEGRATION AND CONVOLUTIONAL CODES Chandak, S., Neu, J., Tatwawadi, K., Mardia, J., Lau, B., Kubit, M., Hulett, R., Griffin, P., Wootters, M., Weissman, T., Ji, H., IEEE IEEE. 2020: 8822–26
  • LFZip: Lossy compression of multivariate floating-point time series data via improved prediction Chandak, S., Tatwawadi, K., Wen, C., Wang, L., Ojea, J., Weissman, T., Bilgin, A., Marcellin, M. W., SerraSagrista, J., Storer, J. A. IEEE COMPUTER SOC. 2020: 342–51
  • Denoising of Aligned Genomic Data. Scientific reports Fischer-Hwang, I., Ochoa, I., Weissman, T., Hernaez, M. 2019; 9 (1): 15067

    Abstract

    Noise in genomic sequencing data is known to have effects on various stages of genomic data analysis pipelines. Variant identification is an important step of many of these pipelines, and is increasingly being used in clinical settings to aid medical practices. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data in order to improve variant calling performance. Denoising human data with SAMDUDE resulted in improved variant identification in both individual chromosome as well as whole genome sequencing (WGS) data sets. In the WGS data set, denoising led to identification of almost 2,000 additional true variants, and elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE .

    View details for DOI 10.1038/s41598-019-51418-z

    View details for PubMedID 31636330

  • Estimating the Fundamental Limits Is Easier Than Achieving the Fundamental Limits IEEE TRANSACTIONS ON INFORMATION THEORY Jiao, J., Han, Y., Fischer-Hwang, I., Weissman, T. 2019; 65 (10): 6704–15
  • SPRING: a next-generation compressor for FASTQ data BIOINFORMATICS Chandak, S., Tatwawadi, K., Ochoa, I., Hernaez, M., Weissman, T. 2019; 35 (15): 2674–76
  • Humans are still the best lossy image compressors Bhown, A., Mukherjee, S., Yang, S., Chandak, S., Fischer-Hwang, I., Tatwawadi, K., Weissman, T., IEEE IEEE. 2019: 558
  • Improved read/write cost tradeoff in DNA-based data storage using LDPC codes Chandak, S., Tatwawadi, K., Lau, B., Mardia, J., Kubit, M., Neu, J., Griffin, P., Wootters, M., Weissman, T., Ji, H., IEEE IEEE. 2019: 147–56
  • Genomic Data Compression ANNUAL REVIEW OF BIOMEDICAL DATA SCIENCE, VOL 2, 2019 Hernaez, M., Pavlichin, D., Weissman, T., Ochoa, I., Altman, R. B., Levitt, M. 2019; 2: 19–37
  • Approximate Profile Maximum Likelihood JOURNAL OF MACHINE LEARNING RESEARCH Pavlichin, D. S., Jiao, J., Weissman, T. 2019; 20
  • SPRING: A next-generation compressor for FASTQ data. Bioinformatics (Oxford, England) Chandak, S., Tatwawadi, K., Ochoa, I., Hernaez, M., Weissman, T. 2018

    Abstract

    Motivation: High-Throughput Sequencing (HTS) technologies produce huge amounts of data in the form of short genomic reads, associated quality values, and read identifiers. Because of the significant structure present in these FASTQ datasets, general-purpose compressors are unable to completely exploit much of the inherent redundancy. Although there has been a lot of work on designing FASTQ compressors, most of them lack in support of one or more crucial properties, such as support for variable length reads, scalability to high coverage datasets, pairing-preserving compression and lossless compression.Results: In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairingpreserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools, for example, SPRING compresses 195 GB of 25x whole genome human FASTQ from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6x smaller than previous state-of-the-art FASTQ compressors. SPRING achieves this improvement while using comparable computational resources.Availability: SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING.Supplementary Information: Supplementary data are available at Bioinformatics online.

    View details for PubMedID 30535063

  • Mutual Information, Relative Entropy and Estimation Error in Semi-Martingale Channels IEEE TRANSACTIONS ON INFORMATION THEORY Jiao, J., Venkat, K., Weissman, T. 2018; 64 (10): 6662–71
  • Minimax Estimation of the L-1 Distance IEEE TRANSACTIONS ON INFORMATION THEORY Jiao, J., Han, Y., Weissman, T. 2018; 64 (10): 6672–6706
  • Generalizations of maximal inequalities to arbitrary selection rules STATISTICS & PROBABILITY LETTERS Jiao, J., Han, Y., Weissman, T. 2018; 137: 19–25
  • Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis BIOINFORMATICS Chandak, S., Tatwawadi, K., Weissman, T. 2018; 34 (4): 558–67

    Abstract

    New Generation Sequencing (NGS) technologies for genome sequencing produce large amounts of short genomic reads per experiment, which are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy due to the special structure present in the data.We present a new algorithm for compressing reads both with and without preserving the read order. In both cases, it achieves 1.4×-2× compression gain over state-of-the-art read compression tools for datasets containing as many as 3 billion Illumina reads. Our tool is based on the idea of approximately reordering the reads according to their position in the genome using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and read compression algorithms in general) and helps understand its performance in practice. The algorithm compresses only the read sequence, works with unaligned FASTQ files, and does not require a reference.schandak@stanford.edu.Supplementary material are available at Bioinformatics online. The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.

    View details for PubMedID 29444237

    View details for PubMedCentralID PMC5860611

  • Minimax redundancy for Markov chains with large state space Tatwawadi, K., Jiao, J., Weissman, T., IEEE IEEE. 2018: 216–20
  • Entropy Rate Estimation for Markov Chains with Large State Space Han, Y., Jiao, J., Lee, C., Weissman, T., Wu, Y., Yu, T., Bengio, S., Wallach, H., Larochelle, H., Grauman, K., CesaBianchi, N., Garnett, R. NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2018
  • Distributed Statistical Estimation of High-Dimensional and Nonparametric Distributions Han, Y., Mukherjee, P., Ozgur, A., Weissman, T., IEEE IEEE. 2018: 506–10
  • On Universal Compression with Constant Random Access Tatwawadi, K., Bidokhti, S., Weissman, T., IEEE IEEE. 2018: 891–95
  • Maximum Likelihood Estimation of Functionals of Discrete Distributions IEEE TRANSACTIONS ON INFORMATION THEORY Jiao, J., Venkat, K., Han, Y., Weissman, T. 2017; 63 (10): 6774–98
  • DUDE-Seq: Fast, flexible, and robust denoising for targeted amplicon sequencing PLOS ONE Lee, B., Moon, T., Yoon, S., Weissman, T. 2017; 12 (7): e0181463

    Abstract

    We consider the correction of errors from nucleotide sequences produced by next-generation targeted amplicon sequencing. The next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates often tend to be high. Denoising in high-throughput sequencing has thus become a crucial process for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel and effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq.

    View details for PubMedID 28749987

  • Relations Between Information and Estimation in Discrete-Time Levy Channels Jiao, J., Venkat, K., Weissman, T. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2017: 3579–94
  • Effect of lossy compression of quality scores on variant calling BRIEFINGS IN BIOINFORMATICS Ochoa, I., Hernaez, M., Goldfeder, R., Weissman, T., Ashley, E. 2017; 18 (2): 183-194

    Abstract

    Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear.Bioinformatic algorithms to identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets and simulated data, we are able to analyze how accurate the output of the variant calling is, both for the original data and that previously lossily compressed. We show that lossy compression can significantly alleviate the storage while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that using the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.

    View details for DOI 10.1093/bib/bbw011

    View details for Web of Science ID 000397060200002

  • Effect of lossy compression of quality scores on variant calling. Briefings in bioinformatics Ochoa, I., Hernaez, M., Goldfeder, R., Weissman, T., Ashley, E. 2017; 18 (2): 183-194

    Abstract

    Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear.Bioinformatic algorithms to identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets and simulated data, we are able to analyze how accurate the output of the variant calling is, both for the original data and that previously lossily compressed. We show that lossy compression can significantly alleviate the storage while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that using the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.

    View details for DOI 10.1093/bib/bbw011

    View details for PubMedID 26966283

  • When is Noisy State Information at the Encoder as Useless as No Information or as Good as Noise-Free State? IEEE TRANSACTIONS ON INFORMATION THEORY Xu, R., Chen, J., Weissman, T., Zhang, J. 2017; 63 (2): 960-974
  • Principles and Applications of Science of Information PROCEEDINGS OF THE IEEE Courtade, T., Grama, A., Mahoney, M. W., Weissman, T. 2017; 105 (2): 183–88
  • GeneComp, a new reference-based compressor for SAM files Long, R., Hernaez, M., Ochoa, I., Weissman, T., Bilgin, A., Marcellin, M. W., SerraSagrista, J., Storer, J. A. IEEE COMPUTER SOC. 2017: 330–39

    Abstract

    The affordability of DNA sequencing has led to unprecedented volumes of genomic data. These data must be stored, processed, and analyzed. The most popular format for genomic data is the SAM format, which contains information such as alignment, quality values, etc. These files are large (on the order of terabytes), which necessitates compression. In this work we propose a new reference-based compressor for SAM files, which can accommodate different levels of compression, based on the specific needs of the user. In particular, the proposed compressor GeneComp allows the user to perform lossy compression of the quality scores, which have been proven to occupy more than half of the compressed file (when losslessly compressed). We show that the proposed compressor GeneComp overall achieves better compression ratios than previously proposed algorithms when working on lossless mode.

    View details for PubMedID 29046896

  • Compressing Tabular Data via Pairwise Dependencies Pavlichin, D. S., Ingber, A., Weissman, T., Bilgin, A., Marcellin, M. W., SerraSagrista, J., Storer, J. A. IEEE COMPUTER SOC. 2017: 455

    View details for PubMedID 29046897

  • Dependence Measures Bounding the Exploration Bias for General Measurements Jiao, J., Han, Y., Weissman, T., IEEE IEEE. 2017: 1475–79
  • Rateless Lossy Compression via the Extremes IEEE TRANSACTIONS ON INFORMATION THEORY No, A., Weissman, T. 2016; 62 (10): 5484-5495

    Abstract

    We begin by presenting a simple lossy compressor operating at near-zero rate: The encoder merely describes the indices of the few maximal source components, while the decoder's reconstruction is a natural estimate of the source components based on this information. This scheme turns out to be near optimal for the memoryless Gaussian source in the sense of achieving the zero-rate slope of its distortion-rate function. Motivated by this finding, we then propose a scheme comprised of iterating the above lossy compressor on an appropriately transformed version of the difference between the source and its reconstruction from the previous iteration. The proposed scheme achieves the rate distortion function of the Gaussian memoryless source (under squared error distortion) when employed on any finite-variance ergodic source. It further possesses desirable properties, and we, respectively, refer to as infinitesimal successive refinability, ratelessness, and complete separability. Its storage and computation requirements are of order no more than (n2)/(log β n) per source symbol for β > 0 at both the encoder and the decoder. Though the details of its derivation, construction, and analysis differ considerably, we discuss similarities between the proposed scheme and the recently introduced Sparse Regression Codes of Venkataramanan et al.

    View details for DOI 10.1109/TIT.2016.2598148

    View details for Web of Science ID 000384304600013

    View details for PubMedCentralID PMC5786423

  • Rateless Lossy Compression via the Extremes. IEEE transactions on information theory No, A., Weissman, T. 2016; 62 (10): 5484-5495

    Abstract

    We begin by presenting a simple lossy compressor operating at near-zero rate: The encoder merely describes the indices of the few maximal source components, while the decoder's reconstruction is a natural estimate of the source components based on this information. This scheme turns out to be near optimal for the memoryless Gaussian source in the sense of achieving the zero-rate slope of its distortion-rate function. Motivated by this finding, we then propose a scheme comprised of iterating the above lossy compressor on an appropriately transformed version of the difference between the source and its reconstruction from the previous iteration. The proposed scheme achieves the rate distortion function of the Gaussian memoryless source (under squared error distortion) when employed on any finite-variance ergodic source. It further possesses desirable properties, and we, respectively, refer to as infinitesimal successive refinability, ratelessness, and complete separability. Its storage and computation requirements are of order no more than (n2)/(log β n) per source symbol for β > 0 at both the encoder and the decoder. Though the details of its derivation, construction, and analysis differ considerably, we discuss similarities between the proposed scheme and the recently introduced Sparse Regression Codes of Venkataramanan et al.

    View details for DOI 10.1109/tit.2016.2598148

    View details for PubMedID 29375154

    View details for PubMedCentralID PMC5786423

  • GTRAC: fast retrieval from compressed collections of genomic variants BIOINFORMATICS Tatwawadi, K., Hernaez, M., Ochoa, I., Weissman, T. 2016; 32 (17): 479-486

    Abstract

    The dramatic decrease in the cost of sequencing has resulted in the generation of huge amounts of genomic data, as evidenced by projects such as the UK10K and the Million Veteran Project, with the number of sequenced genomes ranging in the order of 10 K to 1 M. Due to the large redundancies among genomic sequences of individuals from the same species, most of the medical research deals with the variants in the sequences as compared with a reference sequence, rather than with the complete genomic sequences. Consequently, millions of genomes represented as variants are stored in databases. These databases are constantly updated and queried to extract information such as the common variants among individuals or groups of individuals. Previous algorithms for compression of this type of databases lack efficient random access capabilities, rendering querying the database for particular variants and/or individuals extremely inefficient, to the point where compression is often relinquished altogether.We present a new algorithm for this task, called GTRAC, that achieves significant compression ratios while allowing fast random access over the compressed database. For example, GTRAC is able to compress a Homo sapiens dataset containing 1092 samples in 1.1 GB (compression ratio of 160), while allowing for decompression of specific samples in less than a second and decompression of specific variants in 17 ms. GTRAC uses and adapts techniques from information theory, such as a specialized Lempel-Ziv compressor, and tailored succinct data structures.The GTRAC algorithm is available for download at: https://github.com/kedartatwawadi/GTRAC CONTACT: : kedart@stanford.eduSupplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btw437

    View details for Web of Science ID 000384666800012

    View details for PubMedCentralID PMC5013914

  • GTRAC: fast retrieval from compressed collections of genomic variants. Bioinformatics Tatwawadi, K., Hernaez, M., Ochoa, I., Weissman, T. 2016; 32 (17): i479-i486

    Abstract

    The dramatic decrease in the cost of sequencing has resulted in the generation of huge amounts of genomic data, as evidenced by projects such as the UK10K and the Million Veteran Project, with the number of sequenced genomes ranging in the order of 10 K to 1 M. Due to the large redundancies among genomic sequences of individuals from the same species, most of the medical research deals with the variants in the sequences as compared with a reference sequence, rather than with the complete genomic sequences. Consequently, millions of genomes represented as variants are stored in databases. These databases are constantly updated and queried to extract information such as the common variants among individuals or groups of individuals. Previous algorithms for compression of this type of databases lack efficient random access capabilities, rendering querying the database for particular variants and/or individuals extremely inefficient, to the point where compression is often relinquished altogether.We present a new algorithm for this task, called GTRAC, that achieves significant compression ratios while allowing fast random access over the compressed database. For example, GTRAC is able to compress a Homo sapiens dataset containing 1092 samples in 1.1 GB (compression ratio of 160), while allowing for decompression of specific samples in less than a second and decompression of specific variants in 17 ms. GTRAC uses and adapts techniques from information theory, such as a specialized Lempel-Ziv compressor, and tailored succinct data structures.The GTRAC algorithm is available for download at: https://github.com/kedartatwawadi/GTRAC CONTACT: : kedart@stanford.eduSupplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btw437

    View details for PubMedID 27587665

    View details for PubMedCentralID PMC5013914

  • Information, Estimation, and Lookahead in the Gaussian Channel IEEE TRANSACTIONS ON SIGNAL PROCESSING Venkat, K., Weissman, T., Carmon, Y., Shamai, S. 2016; 64 (14): 3605-3618
  • Secure Source Coding With a Public Helper IEEE TRANSACTIONS ON INFORMATION THEORY Kittichokechai, K., Chia, Y., Oechtering, T. J., Skoglund, M., Weissman, T. 2016; 62 (7): 3930-3949
  • Strong Successive Refinability and Rate-Distortion-Complexity Tradeoff IEEE TRANSACTIONS ON INFORMATION THEORY No, A., Ingber, A., Weissman, T. 2016; 62 (6): 3618-3635
  • Compression for Quadratic Similarity Queries: Finite Blocklength and Practical Schemes IEEE TRANSACTIONS ON INFORMATION THEORY Steiner, F., Dempfle, S., Ingber, A., Weissman, T. 2016; 62 (5): 2737-2747

    Abstract

    We study the problem of compression for the purpose of similarity identification, where similarity is measured by the mean square Euclidean distance between vectors. While the asymptotical fundamental limits of the problem - the minimal compression rate and the error exponent - were found in a previous work, in this paper we focus on the nonasymptotic domain and on practical, implementable schemes. We first present a finite blocklength achievability bound based on shape-gain quantization: The gain (amplitude) of the vector is compressed via scalar quantization and the shape (the projection on the unit sphere) is quantized using a spherical code. The results are numerically evaluated and they converge to the asymptotic values as predicted by the error exponent. We then give a nonasymptotic lower bound on the performance of any compression scheme, and compare to the upper (achievability) bound. For a practical implementation of such a scheme, we use wrapped spherical codes, studied by Hamkins and Zeger, and use the Leech lattice as an example for an underlying lattice. As a side result, we obtain a bound on the covering angle of any wrapped spherical code, as a function of the covering radius of the underlying lattice.

    View details for DOI 10.1109/TIT.2016.2535172

    View details for Web of Science ID 000375005500026

    View details for PubMedCentralID PMC5796586

  • Compression for Quadratic Similarity Queries: Finite Blocklength and Practical Schemes. IEEE transactions on information theory Steiner, F., Dempfle, S., Ingber, A., Weissman, T. 2016; 62 (5): 2737-2747

    Abstract

    We study the problem of compression for the purpose of similarity identification, where similarity is measured by the mean square Euclidean distance between vectors. While the asymptotical fundamental limits of the problem - the minimal compression rate and the error exponent - were found in a previous work, in this paper we focus on the nonasymptotic domain and on practical, implementable schemes. We first present a finite blocklength achievability bound based on shape-gain quantization: The gain (amplitude) of the vector is compressed via scalar quantization and the shape (the projection on the unit sphere) is quantized using a spherical code. The results are numerically evaluated and they converge to the asymptotic values as predicted by the error exponent. We then give a nonasymptotic lower bound on the performance of any compression scheme, and compare to the upper (achievability) bound. For a practical implementation of such a scheme, we use wrapped spherical codes, studied by Hamkins and Zeger, and use the Leech lattice as an example for an underlying lattice. As a side result, we obtain a bound on the covering angle of any wrapped spherical code, as a function of the covering radius of the underlying lattice.

    View details for DOI 10.1109/TIT.2016.2535172

    View details for PubMedID 29398721

    View details for PubMedCentralID PMC5796586

  • Comment on: 'ERGC: an efficient referential genome compression algorithm' BIOINFORMATICS Deorowicz, S., Grabowski, S., Ochoa, I., Hernaez, M., Weissman, T. 2016; 32 (7): 1115–17

    Abstract

    Data compression is crucial in effective handling of genomic data. Among several recently published algorithms, ERGC seems to be surprisingly good, easily beating all of the competitors.We evaluated ERGC and the previously proposed algorithms GDC and iDoComp, which are the ones used in the original paper for comparison, on a wide data set including 12 assemblies of human genome (instead of only four of them in the original paper). ERGC wins only when one of the genomes (referential or target) contains mixed-cased letters (which is the case for only the two Korean genomes). In all other cases ERGC is on average an order of magnitude worse than GDC and iDoComp.sebastian.deorowicz@polsl.pl, iochoa@stanford.eduSupplementary data are available at Bioinformatics online.

    View details for PubMedID 26615213

    View details for PubMedCentralID PMC4907388

  • smallWig: parallel compression of RNA-seq WIG files BIOINFORMATICS Wang, Z., Weissman, T., Milenkovic, O. 2016; 32 (2): 173-180

    Abstract

    We developed a new lossless compression method for WIG data, named smallWig, offering the best known compression rates for RNA-seq data and featuring random access functionalities that enable visualization, summary statistics analysis and fast queries from the compressed files. Our approach results in order of magnitude improvements compared with bigWig and ensures compression rates only a fraction of those produced by cWig. The key features of the smallWig algorithm are statistical data analysis and a combination of source coding methods that ensure high flexibility and make the algorithm suitable for different applications. Furthermore, for general-purpose file compression, the compression rate of smallWig approaches the empirical entropy of the tested WIG data. For compression with random query features, smallWig uses a simple block-based compression scheme that introduces only a minor overhead in the compression rate. For archival or storage space-sensitive applications, the method relies on context mixing techniques that lead to further improvements of the compression rate. Implementations of smallWig can be executed in parallel on different sets of chromosomes using multiple processors, thereby enabling desirable scaling for future transcriptome Big Data platforms.The development of next-generation sequencing technologies has led to a dramatic decrease in the cost of DNA/RNA sequencing and expression profiling. RNA-seq has emerged as an important and inexpensive technology that provides information about whole transcriptomes of various species and organisms, as well as different organs and cellular communities. The vast volume of data generated by RNA-seq experiments has significantly increased data storage costs and communication bandwidth requirements. Current compression tools for RNA-seq data such as bigWig and cWig either use general-purpose compressors (gzip) or suboptimal compression schemes that leave significant room for improvement. To substantiate this claim, we performed a statistical analysis of expression data in different transform domains and developed accompanying entropy coding methods that bridge the gap between theoretical and practical WIG file compression rates.We tested different variants of the smallWig compression algorithm on a number of integer-and real- (floating point) valued RNA-seq WIG files generated by the ENCODE project. The results reveal that, on average, smallWig offers 18-fold compression rate improvements, up to 2.5-fold compression time improvements, and 1.5-fold decompression time improvements when compared with bigWig. On the tested files, the memory usage of the algorithm never exceeded 90 KB. When more elaborate context mixing compressors were used within smallWig, the obtained compression rates were as much as 23 times better than those of bigWig. For smallWig used in the random query mode, which also supports retrieval of the summary statistics, an overhead in the compression rate of roughly 3-17% was introduced depending on the chosen system parameters. An increase in encoding and decoding time of 30% and 55% represents an additional performance loss caused by enabling random data access. We also implemented smallWig using multi-processor programming. This parallelization feature decreases the encoding delay 2-3.4 times compared with that of a single-processor implementation, with the number of processors used ranging from 2 to 8; in the same parameter regime, the decoding delay decreased 2-5.2 times.The smallWig software can be downloaded from: http://stanford.edu/~zhiyingw/smallWig/smallwig.html, http://publish.illinois.edu/milenkovic/, http://web.stanford.edu/~tsachy/.zhiyingw@stanford.eduSupplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btv561

    View details for Web of Science ID 000368360100003

    View details for PubMedCentralID PMC6078172

  • smallWig: parallel compression of RNA-seq WIG files. Bioinformatics (Oxford, England) Wang, Z., Weissman, T., Milenkovic, O. 2016; 32 (2): 173-80

    Abstract

    We developed a new lossless compression method for WIG data, named smallWig, offering the best known compression rates for RNA-seq data and featuring random access functionalities that enable visualization, summary statistics analysis and fast queries from the compressed files. Our approach results in order of magnitude improvements compared with bigWig and ensures compression rates only a fraction of those produced by cWig. The key features of the smallWig algorithm are statistical data analysis and a combination of source coding methods that ensure high flexibility and make the algorithm suitable for different applications. Furthermore, for general-purpose file compression, the compression rate of smallWig approaches the empirical entropy of the tested WIG data. For compression with random query features, smallWig uses a simple block-based compression scheme that introduces only a minor overhead in the compression rate. For archival or storage space-sensitive applications, the method relies on context mixing techniques that lead to further improvements of the compression rate. Implementations of smallWig can be executed in parallel on different sets of chromosomes using multiple processors, thereby enabling desirable scaling for future transcriptome Big Data platforms.The development of next-generation sequencing technologies has led to a dramatic decrease in the cost of DNA/RNA sequencing and expression profiling. RNA-seq has emerged as an important and inexpensive technology that provides information about whole transcriptomes of various species and organisms, as well as different organs and cellular communities. The vast volume of data generated by RNA-seq experiments has significantly increased data storage costs and communication bandwidth requirements. Current compression tools for RNA-seq data such as bigWig and cWig either use general-purpose compressors (gzip) or suboptimal compression schemes that leave significant room for improvement. To substantiate this claim, we performed a statistical analysis of expression data in different transform domains and developed accompanying entropy coding methods that bridge the gap between theoretical and practical WIG file compression rates.We tested different variants of the smallWig compression algorithm on a number of integer-and real- (floating point) valued RNA-seq WIG files generated by the ENCODE project. The results reveal that, on average, smallWig offers 18-fold compression rate improvements, up to 2.5-fold compression time improvements, and 1.5-fold decompression time improvements when compared with bigWig. On the tested files, the memory usage of the algorithm never exceeded 90 KB. When more elaborate context mixing compressors were used within smallWig, the obtained compression rates were as much as 23 times better than those of bigWig. For smallWig used in the random query mode, which also supports retrieval of the summary statistics, an overhead in the compression rate of roughly 3-17% was introduced depending on the chosen system parameters. An increase in encoding and decoding time of 30% and 55% represents an additional performance loss caused by enabling random data access. We also implemented smallWig using multi-processor programming. This parallelization feature decreases the encoding delay 2-3.4 times compared with that of a single-processor implementation, with the number of processors used ranging from 2 to 8; in the same parameter regime, the decoding delay decreased 2-5.2 times.The smallWig software can be downloaded from: http://stanford.edu/~zhiyingw/smallWig/smallwig.html, http://publish.illinois.edu/milenkovic/, http://web.stanford.edu/~tsachy/.zhiyingw@stanford.eduSupplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btv561

    View details for PubMedID 26424856

    View details for PubMedCentralID PMC6078172

  • Denoising of Quality Scores for Boosted Inference and Reduced Storage Ochoa, I., Hernaez, M., Goldfeder, R., Weissman, T., Ashley, E., Bilgin, A., Marcellin, M. W., SerraSagrista, J., Storer, J. A. IEEE. 2016: 251–60

    Abstract

    Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in the sequencing cost. Much of the raw data are comprised of nucleotides and the corresponding quality scores that indicate their reliability. The latter are more difficult to compress and are themselves noisy. Lossless and lossy compression of the quality scores has recently been proposed to alleviate the storage costs, but reducing the noise in the quality scores has remained largely unexplored. This raw data is processed in order to identify variants; these genetic variants are used in important applications, such as medical decision making. Thus improving the performance of the variant calling by reducing the noise contained in the quality scores is important. We propose a denoising scheme that reduces the noise of the quality scores and we demonstrate improved inference with this denoised data. Specifically, we show that replacing the quality scores with those generated by the proposed denoiser results in more accurate variant calling in general. Moreover, a consequence of the denoising is that the entropy of the produced quality scores is smaller, and thus significant compression can be achieved with respect to lossless compression of the original quality scores. We expect our results to provide a baseline for future research in denoising of quality scores. The code used in this work as well as a Supplement with all the results are available at http://web.stanford.edu/~iochoa/DCCdenoiser_CodeAndSupplement.zip.

    View details for PubMedID 29098178

  • Mutual Information, Relative Entropy and Estimation Error in Semi-Martingale Channels Jiao, J., Venkat, K., Weissman, T., IEEE IEEE. 2016: 2794–98
  • Chained Kullback-Leibler Divergences Pavlichin, D. S., Weissman, T., IEEE IEEE. 2016: 580–84

    Abstract

    We define and characterize the "chained" Kullback-Leibler divergence min w D(p‖w) + D(w‖q) minimized over all intermediate distributions w and the analogous k-fold chained K-L divergence min D(p‖wk-1) + … + D(w2‖w1) + D(w1‖q) minimized over the entire path (w1,…,wk-1). This quantity arises in a large deviations analysis of a Markov chain on the set of types - the Wright-Fisher model of neutral genetic drift: a population with allele distribution q produces offspring with allele distribution w, which then produce offspring with allele distribution p, and so on. The chained divergences enjoy some of the same properties as the K-L divergence (like joint convexity in the arguments) and appear in k-step versions of some of the same settings as the K-L divergence (like information projections and a conditional limit theorem). We further characterize the optimal k-step "path" of distributions appearing in the definition and apply our findings in a large deviations analysis of the Wright-Fisher process. We make a connection to information geometry via the previously studied continuum limit, where the number of steps tends to infinity, and the limiting path is a geodesic in the Fisher information metric. Finally, we offer a thermodynamic interpretation of the chained divergence (as the rate of operation of an appropriately defined Maxwell's demon) and we state some natural extensions and applications (a k-step mutual information and k-step maximum likelihood inference). We release code for computing the objects we study.

    View details for PubMedID 29130024

  • A cluster-based approach to compression of Quality Scores Hernaez, M., Ochoa, I., Weissman, T., Bilgin, A., Marcellin, M. W., SerraSagrista, J., Storer, J. A. IEEE. 2016: 261–70

    Abstract

    Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in the sequencing cost. Storing and sharing this large data has become a major bottleneck in the discovery and analysis of genetic variants that are used for medical inference. As such, lossless compression of this data has been proposed. Of the compressed data, more than 70% correspond to quality scores, which indicate the sequencing machine reliability when calling a particular basepair. Thus, to further improve the compression performance, lossy compression of quality scores is emerging as the natural candidate. Since the data is used for genetic variants discovery, lossy compressors for quality scores are analyzed in terms of their rate-distortion performance, as well as their effect on the variant callers. Previously proposed algorithms do not do well under all performance metrics, and are hence unsuitable for certain applications. In this work we propose a new lossy compressor that first performs a clustering step, by assuming all the quality scores sequences come from a mixture of Markov models. Then, it performs quantization of the quality scores based on the Markov models. Each quantizer targets a specific distortion to optimize for the overall rate-distortion performance. Finally, the quantized values are compressed by an entropy encoder. We demonstrate that the proposed lossy compressor outperforms the previously proposed methods under all analyzed distortion metrics. This suggests that the effect that the proposed algorithm will have on any downstream application will likely be less noticeable than that of previously proposed lossy compressors. Moreover, we analyze how the proposed lossy compressor affects Single Nucleotide Polymorphism (SNP) calling, and show that the variability introduced on the calls is considerably smaller than the variability that exists between different methodologies for SNP calling.

    View details for PubMedID 29057318

  • Minimax Estimation of the L-I Distance Jiao, J., Han, Y., Weissman, T., IEEE IEEE. 2016: 750–54
  • When is Noisy State Information at the Eneoder as Useless as No Information or as Good as Noise-Free State? Xu, R., Chen, J., Weissman, T., Zhang, J., IEEE IEEE. 2016: 890–94
  • Minimax Rate-optimal Estimation of KL Divergence between Discrete Distributions Han, Y., Jiao, J., Weissman, T., IEEE IEEE. 2016: 256–60

    Abstract

    We refine the general methodology in [1] for the construction and analysis of essentially minimax estimators for a wide class of functionals of finite dimensional parameters, and elaborate on the case of discrete distributions with support size S comparable with the number of observations n. Specifically, we determine the "smooth" and "non-smooth" regimes based on the confidence set and the smoothness of the functional. In the "non-smooth" regime, we apply an unbiased estimator for a "suitable" polynomial approximation of the functional. In the "smooth" regime, we construct a bias corrected version of the Maximum Likelihood Estimator (MLE) based on Taylor expansion. We apply the general methodology to the problem of estimating the KL divergence between two discrete distributions from empirical data. We construct a minimax rate-optimal estimator which is adaptive in the sense that it does not require the knowledge of the support size nor the upper bound on the likelihood ratio. Moreover, the performance of the optimal estimator with n samples is essentially that of the MLE with n ln n samples, i.e., the effective sample size enlargement phenomenon holds.

    View details for PubMedID 29457152

  • Beyond Maximum Likelihood: Boosting the Chow-Liu Algorithm for Large Alphabets Jiao, J., Han, Y., Weissman, T., Matthews, M. B. IEEE COMPUTER SOC. 2016: 321–25
  • Distortion Rate Function of Sub-Nyquist Sampled Gaussian Sources IEEE TRANSACTIONS ON INFORMATION THEORY Kipnis, A., Goldsmith, A. J., Eldar, Y. C., Weissman, T. 2016; 62 (1): 401-429
  • Minimax Estimation of Discrete Distributions Under l(1) Loss IEEE TRANSACTIONS ON INFORMATION THEORY Han, Y., Jiao, J., Weissman, T. 2015; 61 (11): 6343-6354
  • Justification of Logarithmic Loss via the Benefit of Side Information IEEE TRANSACTIONS ON INFORMATION THEORY Jiao, J., Courtade, T. A., Venkat, K., Weissman, T. 2015; 61 (10): 5357-5365
  • QVZ: lossy compression of quality values. Bioinformatics Malysa, G., Hernaez, M., Ochoa, I., Rao, M., Ganesan, K., Weissman, T. 2015; 31 (19): 3122-3129

    Abstract

    Recent advancements in sequencing technology have led to a drastic reduction in the cost of sequencing a genome. This has generated an unprecedented amount of genomic data that must be stored, processed and transmitted. To facilitate this effort, we propose a new lossy compressor for the quality values presented in genomic data files (e.g. FASTQ and SAM files), which comprise roughly half of the storage space (in the uncompressed domain). Lossy compression allows for compression of data beyond its lossless limit.The proposed algorithm QVZ exhibits better rate-distortion performance than the previously proposed algorithms, for several distortion metrics and for the lossless case. Moreover, it allows the user to define any quasi-convex distortion function to be minimized, a feature not supported by the previous algorithms. Finally, we show that QVZ-compressed data exhibit better performance in the genotyping than data compressed with previously proposed algorithms, in the sense that for a similar rate, a genotyping closer to that achieved with the original quality values is obtained.QVZ is written in C and can be downloaded from https://github.com/mikelhernaez/qvz.mhernaez@stanford.edu or gmalysa@stanford.edu or iochoa@stanford.eduSupplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btv330

    View details for PubMedID 26026138

  • Network Compression: Worst Case Analysis IEEE TRANSACTIONS ON INFORMATION THEORY Asnani, H., Shomorony, I., Avestimehr, A. S., Weissman, T. 2015; 61 (7): 3980-3995

    Abstract

    We study the problem of communicating a distributed correlated memoryless source over a memoryless network, from source nodes to destination nodes, under quadratic distortion constraints. We establish the following two complementary results: 1) for an arbitrary memoryless network, among all distributed memoryless sources of a given correlation, Gaussian sources are least compressible, that is, they admit the smallest set of achievable distortion tuples and 2) for any memoryless source to be communicated over a memoryless additive-noise network, among all noise processes of a given correlation, Gaussian noise admits the smallest achievable set of distortion tuples. We establish these results constructively by showing how schemes for the corresponding Gaussian problems can be applied to achieve similar performance for (source or noise) distributions that are not necessarily Gaussian but have the same covariance.

    View details for DOI 10.1109/TIT.2015.2434829

    View details for Web of Science ID 000356301600020

    View details for PubMedCentralID PMC5786424

  • Network Compression: Worst Case Analysis. IEEE transactions on information theory Asnani, H., Shomorony, I., Avestimehr, A. S., Weissman, T. 2015; 61 (7): 3980-3995

    Abstract

    We study the problem of communicating a distributed correlated memoryless source over a memoryless network, from source nodes to destination nodes, under quadratic distortion constraints. We establish the following two complementary results: 1) for an arbitrary memoryless network, among all distributed memoryless sources of a given correlation, Gaussian sources are least compressible, that is, they admit the smallest set of achievable distortion tuples and 2) for any memoryless source to be communicated over a memoryless additive-noise network, among all noise processes of a given correlation, Gaussian noise admits the smallest achievable set of distortion tuples. We establish these results constructively by showing how schemes for the corresponding Gaussian problems can be applied to achieve similar performance for (source or noise) distributions that are not necessarily Gaussian but have the same covariance.

    View details for DOI 10.1109/tit.2015.2434829

    View details for PubMedID 29375153

    View details for PubMedCentralID PMC5786424

  • Compression for Quadratic Similarity Queries IEEE TRANSACTIONS ON INFORMATION THEORY Ingber, A., Courtade, T., Weissman, T. 2015; 61 (5): 2729-2747

    Abstract

    The problem of performing similarity queries on compressed data is considered. We focus on the quadratic similarity measure, and study the fundamental tradeoff between compression rate, sequence length, and reliability of queries performed on the compressed data. For a Gaussian source, we show that the queries can be answered reliably if and only if the compression rate exceeds a given threshold-the identification rate- which we explicitly characterize. Moreover, when compression is performed at a rate greater than the identification rate, responses to queries on the compressed data can be made exponentially reliable. We give a complete characterization of this exponent, which is analogous to the error and excess-distortion exponents in channel and source coding, respectively. For a general source, we prove that, as with classical compression, the Gaussian source requires the largest compression rate among sources with a given variance. Moreover, a robust scheme is described that attains this maximal rate for any source distribution.

    View details for DOI 10.1109/TIT.2015.2402972

    View details for Web of Science ID 000353515700032

    View details for PubMedCentralID PMC5786438

  • Compression for Quadratic Similarity Queries. IEEE transactions on information theory Ingber, A., Courtade, T., Weissman, T. 2015; 61 (5): 2729-2747

    Abstract

    The problem of performing similarity queries on compressed data is considered. We focus on the quadratic similarity measure, and study the fundamental tradeoff between compression rate, sequence length, and reliability of queries performed on the compressed data. For a Gaussian source, we show that the queries can be answered reliably if and only if the compression rate exceeds a given threshold-the identification rate- which we explicitly characterize. Moreover, when compression is performed at a rate greater than the identification rate, responses to queries on the compressed data can be made exponentially reliable. We give a complete characterization of this exponent, which is analogous to the error and excess-distortion exponents in channel and source coding, respectively. For a general source, we prove that, as with classical compression, the Gaussian source requires the largest compression rate among sources with a given variance. Moreover, a robust scheme is described that attains this maximal rate for any source distribution.

    View details for DOI 10.1109/tit.2015.2402972

    View details for PubMedID 29375151

    View details for PubMedCentralID PMC5786438

  • Minimax Estimation of Functionals of Discrete Distributions. IEEE transactions on information theory Jiao, J., Venkat, K., Han, Y., Weissman, T. 2015; 61 (5): 2835-2885

    Abstract

    We propose a general methodology for the construction and analysis of essentially minimax estimators for a wide class of functionals of finite dimensional parameters, and elaborate on the case of discrete distributions, where the support size S is unknown and may be comparable with or even much larger than the number of observations n. We treat the respective regions where the functional is nonsmooth and smooth separately. In the nonsmooth regime, we apply an unbiased estimator for the best polynomial approximation of the functional whereas, in the smooth regime, we apply a bias-corrected version of the maximum likelihood estimator (MLE). We illustrate the merit of this approach by thoroughly analyzing the performance of the resulting schemes for estimating two important information measures: 1) the entropy [Formula: see text] and 2) [Formula: see text], α > 0. We obtain the minimax L2 rates for estimating these functionals. In particular, we demonstrate that our estimator achieves the optimal sample complexity n ≍ S/ln S for entropy estimation. We also demonstrate that the sample complexity for estimating Fα (P), 0 < α < 1, is n ≍ S1/α /ln S, which can be achieved by our estimator but not the MLE. For 1 < α < 3/2, we show the minimax L2 rate for estimating Fα (P) is (n ln n)-2(α-1) for infinite support size, while the maximum L2 rate for the MLE is n-2(α-1). For all the above cases, the behavior of the minimax rate-optimal estimators with n samples is essentially that of the MLE (plug-in rule) with n ln n samples, which we term "effective sample size enlargement." We highlight the practical advantages of our schemes for the estimation of entropy and mutual information. We compare our performance with various existing approaches, and demonstrate that our approach reduces running time and boosts the accuracy. Moreover, we show that the minimax rate-optimal mutual information estimator yielded by our framework leads to significant performance boosts over the Chow-Liu algorithm in learning graphical models. The wide use of information measure estimation suggests that the insights and estimators obtained in this paper could be broadly applicable.

    View details for DOI 10.1109/tit.2015.2412945

    View details for PubMedID 29375152

    View details for PubMedCentralID PMC5786426

  • Minimax Estimation of Functionals of Discrete Distributions IEEE TRANSACTIONS ON INFORMATION THEORY Jiao, J., Venkat, K., Han, Y., Weissman, T. 2015; 61 (5): 2835-2885

    Abstract

    We propose a general methodology for the construction and analysis of essentially minimax estimators for a wide class of functionals of finite dimensional parameters, and elaborate on the case of discrete distributions, where the support size S is unknown and may be comparable with or even much larger than the number of observations n. We treat the respective regions where the functional is nonsmooth and smooth separately. In the nonsmooth regime, we apply an unbiased estimator for the best polynomial approximation of the functional whereas, in the smooth regime, we apply a bias-corrected version of the maximum likelihood estimator (MLE). We illustrate the merit of this approach by thoroughly analyzing the performance of the resulting schemes for estimating two important information measures: 1) the entropy [Formula: see text] and 2) [Formula: see text], α > 0. We obtain the minimax L2 rates for estimating these functionals. In particular, we demonstrate that our estimator achieves the optimal sample complexity n ≍ S/ln S for entropy estimation. We also demonstrate that the sample complexity for estimating Fα (P), 0 < α < 1, is n ≍ S1/α /ln S, which can be achieved by our estimator but not the MLE. For 1 < α < 3/2, we show the minimax L2 rate for estimating Fα (P) is (n ln n)-2(α-1) for infinite support size, while the maximum L2 rate for the MLE is n-2(α-1). For all the above cases, the behavior of the minimax rate-optimal estimators with n samples is essentially that of the MLE (plug-in rule) with n ln n samples, which we term "effective sample size enlargement." We highlight the practical advantages of our schemes for the estimation of entropy and mutual information. We compare our performance with various existing approaches, and demonstrate that our approach reduces running time and boosts the accuracy. Moreover, we show that the minimax rate-optimal mutual information estimator yielded by our framework leads to significant performance boosts over the Chow-Liu algorithm in learning graphical models. The wide use of information measure estimation suggests that the insights and estimators obtained in this paper could be broadly applicable.

    View details for DOI 10.1109/TIT.2015.2412945

    View details for Web of Science ID 000353515700038

    View details for PubMedCentralID PMC5786426

  • Comparison of the Achievable Rates in OFDM and Single Carrier Modulation with IID Inputs IEEE TRANSACTIONS ON INFORMATION THEORY Carmon, Y., Shamai, S., Weissman, T. 2015; 61 (4): 1795-1818
  • iDoComp: a compression scheme for assembled genomes BIOINFORMATICS Ochoa, I., Hernaez, M., Weissman, T. 2015; 31 (5): 626-633

    Abstract

    With the release of the latest next-generation sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing a Human has dropped to a mere $4000. Thus we are approaching a milestone in the sequencing history, known as the $1000 genome era, where the sequencing of individuals is affordable, opening the doors to effective personalized medicine. Massive generation of genomic data, including assembled genomes, is expected in the following years. There is crucial need for compression of genomes guaranteed of performing well simultaneously on different species, from simple bacteria to humans, which will ease their transmission, dissemination and analysis. Further, most of the new genomes to be compressed will correspond to individuals of a species from which a reference already exists on the database. Thus, it is natural to propose compression schemes that assume and exploit the availability of such references.We propose iDoComp, a compressor of assembled genomes presented in FASTA format that compresses an individual genome using a reference genome for both the compression and the decompression. In terms of compression efficiency, iDoComp outperforms previously proposed algorithms in most of the studied cases, with comparable or better running time. For example, we observe compression gains of up to 60% in several cases, including H.sapiens data, when comparing with the best compression performance among the previously proposed algorithms.iDoComp is written in C and can be downloaded from: http://www.stanford.edu/~iochoa/iDoComp.html (We also provide a full explanation on how to run the program and an example with all the necessary files to run it.).

    View details for DOI 10.1093/bioinformatics/btu698

    View details for Web of Science ID 000352268500002

    View details for PubMedID 25344501

  • Macrophages eat cancer cells using their own calreticulin as a guide: Roles of TLR and Btk. Proceedings of the National Academy of Sciences of the United States of America Feng, M., Chen, J. Y., Weissman-Tsukamoto, R., Volkmer, J., Ho, P. Y., McKenna, K. M., Cheshier, S., Zhang, M., Guo, N., Gip, P., Mitra, S. S., Weissman, I. L. 2015; 112 (7): 2145-2150

    Abstract

    Macrophage-mediated programmed cell removal (PrCR) is an important mechanism of eliminating diseased and damaged cells before programmed cell death. The induction of PrCR by eat-me signals on tumor cells is countered by don't-eat-me signals such as CD47, which binds macrophage signal-regulatory protein α to inhibit phagocytosis. Blockade of CD47 on tumor cells leads to phagocytosis by macrophages. Here we demonstrate that the activation of Toll-like receptor (TLR) signaling pathways in macrophages synergizes with blocking CD47 on tumor cells to enhance PrCR. Bruton's tyrosine kinase (Btk) mediates TLR signaling in macrophages. Calreticulin, previously shown to be an eat-me signal on cancer cells, is activated in macrophages for secretion and cell-surface exposure by TLR and Btk to target cancer cells for phagocytosis, even if the cancer cells themselves do not express calreticulin.

    View details for DOI 10.1073/pnas.1424907112

    View details for PubMedID 25646432

  • Universality of Logarithmic Loss in Lossy Compression No, A., Weissman, T., IEEE IEEE. 2015: 2166–70
  • Compression for Similarity Identification: Computing the Error Exponent Ingber, A., Weissman, T., Bilgin, A., Marcellin, M. W., SerraSagrista, J., Storer, J. A. IEEE. 2015: 413–22

    Abstract

    We consider the problem of compressing discrete memoryless data sequences for the purpose of similarity identification, first studied by Ahlswede et al. (1997). In this setting, a source sequence is compressed, where the goal is to be able to identify whether the original source sequence is similar to another given sequence (called the query sequence). There is no requirement that the source will be reproducible from the compressed version. In the case where no false negatives are allowed, a compression scheme is said to be reliable if the probability of error (false positive) vanishes as the sequence length grows. The minimal compression rate in this sense, which is the parallel of the classical rate distortion function, is called the identification rate. The rate at which the error probability vanishes is measured by its exponent, called the identification exponent (which is the analog of the classical excess distortion exponent). While an information-theoretic expression for the identification exponent was found in past work, it is uncomputable due to a dependency on an auxiliary random variable with unbounded cardinality. The main result of this paper is a cardinality bound on the auxiliary random variable in the identification exponent, thereby making the quantity computable (solving the problem that was left open by Ahlswede et al.). The new proof technique relies on the fact that the Lagrangian in the optimization problem (in the expression for the exponent) can be decomposed by coordinate (of the auxiliary random variable). Then a standard Carathéodory - style argument completes the proof.

    View details for PubMedID 29046895

  • Maximum Likelihood Estimation of Information Measures Jiao, J., Venkat, K., Han, Y., Weissman, T., IEEE IEEE. 2015: 839–43
  • Minimax Estimation of Information Measures Jiao, J., Venkat, K., Han, Y., Weissman, T., IEEE IEEE. 2015: 2296–2300
  • Minimax Estimation of Discrete Distributions Han, Y., Jiao, J., Weissman, T., IEEE IEEE. 2015: 2291–95
  • Adaptive Estimation of Shannon Entropy Han, Y., Jiao, J., Weissman, T., IEEE IEEE. 2015: 1372–76
  • Does Dirichlet Prior Smoothing Solve the Shannon Entropy Estimation Problem? Han, Y., Jiao, J., Weissman, T., IEEE IEEE. 2015: 1367–71
  • Information Measures: The Curious Case of the Binary Alphabet IEEE TRANSACTIONS ON INFORMATION THEORY Jiao, J., Courtade, T. A., No, A., Venkat, K., Weissman, T. 2014; 60 (12): 7616-7626
  • Aligned genomic data compression via improved modeling JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY Ochoa, I., Hernaez, M., Weissman, T. 2014; 12 (6)

    Abstract

    With the release of the latest Next-Generation Sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing the whole genome of a human is expected to drop to a mere $1000. This milestone in sequencing history marks the era of affordable sequencing of individuals and opens the doors to personalized medicine. In accord, unprecedented volumes of genomic data will require storage for processing. There will be dire need not only of compressing aligned data, but also of generating compressed files that can be fed directly to downstream applications to facilitate the analysis of and inference on the data. Several approaches to this challenge have been proposed in the literature; however, focus thus far has been on the low coverage regime and most of the suggested compressors are not based on effective modeling of the data. We demonstrate the benefit of data modeling for compressing aligned reads. Specifically, we show that, by working with data models designed for the aligned data, we can improve considerably over the best compression ratio achieved by previously proposed algorithms. Our results indicate that the pareto-optimal barrier for compression rate and speed claimed by Bonfield and Mahoney (2013) [Bonfield JK and Mahoneys MV, Compression of FASTQ and SAM format sequencing data, PLOS ONE, 8(3):e59190, 2013.] does not apply for high coverage aligned data. Furthermore, our improved compression ratio is achieved by splitting the data in a manner conducive to operations in the compressed domain by downstream applications.

    View details for DOI 10.1142/S0219720014420025

    View details for PubMedID 25395305

  • Capacity of a POST Channel With and Without Feedback IEEE TRANSACTIONS ON INFORMATION THEORY Permuter, H. H., Asnani, H., Weissman, T. 2014; 60 (10): 6041-6057
  • To Feed or Not to Feedback IEEE TRANSACTIONS ON INFORMATION THEORY Asnani, H., Permuter, H. H., Weissman, T. 2014; 60 (9): 5150-5172
  • Minimax Filtering Regret via Relations Between Information and Estimation IEEE TRANSACTIONS ON INFORMATION THEORY No, A., Weissman, T. 2014; 60 (8): 4832-4847
  • The Porosity of Additive Noise Channels IEEE TRANSACTIONS ON INFORMATION THEORY Misra, V., Weissman, T. 2014; 60 (6): 3144-3162
  • Compression With Actions IEEE TRANSACTIONS ON INFORMATION THEORY Zhao, L., Chia, Y., Weissman, T. 2014; 60 (2): 796-807
  • Multiterminal Source Coding Under Logarithmic Loss IEEE TRANSACTIONS ON INFORMATION THEORY Courtade, T. A., Weissman, T. 2014; 60 (1): 740-761
  • Compression for Similarity Identification: Fundamental Limits IEEE International Symposium on Information Theory (ISIT) Ingber, A., Weissman, T. IEEE. 2014: 1–5
  • Compression for Quadratic Similarity Queries via Shape-Gain Quantizers IEEE International Symposium on Information Theory (ISIT) Dempfle, S., Steiner, F., Ingber, A., Weissman, T. IEEE. 2014: 2839–2843
  • Information Divergences and the Curious Case of the Binary Alphabet IEEE International Symposium on Information Theory (ISIT) Jiao, J., Courtade, T., No, A., Venkat, K., Weissman, T. IEEE. 2014: 351–355
  • Relations between Information and Estimation in Scalar Levy Channels IEEE International Symposium on Information Theory (ISIT) Jiao, J., Venkat, K., Weissman, T. IEEE. 2014: 2212–2216
  • Strong Successive Refinability: Sufficient Conditions IEEE International Symposium on Information Theory (ISIT) No, A., Ingber, A., Weissman, T. IEEE. 2014: 2659–2663
  • Justification of Logarithmic Loss via the Benefit of Side Information IEEE International Symposium on Information Theory (ISIT) Jiao, J., Courtade, T., Venkat, K., Weissman, T. IEEE. 2014: 946–950
  • Rateless Lossy Compression via the Extremes No, A., Weissman, T., IEEE IEEE. 2014: 88–92
  • Achievable Error Exponents in the Gaussian Channel With Rate-Limited Feedback IEEE TRANSACTIONS ON INFORMATION THEORY Mirghaderi, R., Goldsmith, A., Weissman, T. 2013; 59 (12): 8144-8156
  • Estimation With a Helper Who Knows the Interference IEEE TRANSACTIONS ON INFORMATION THEORY Chia, Y., Soundararajan, R., Weissman, T. 2013; 59 (11): 7097-7117
  • Universal Estimation of Directed Information IEEE TRANSACTIONS ON INFORMATION THEORY Jiao, J., Permuter, H. H., Zhao, L., Kim, Y., Weissman, T. 2013; 59 (10): 6220-6242
  • The human genome contracts again BIOINFORMATICS Pavlichin, D. S., Weissman, T., Yona, G. 2013; 29 (17): 2199-2202

    Abstract

    The number of human genomes that have been sequenced completely for different individuals has increased rapidly in recent years. Storing and transferring complete genomes between computers for the purpose of applying various applications and analysis tools will soon become a major hurdle, hindering the analysis phase. Therefore, there is a growing need to compress these data efficiently. Here, we describe a technique to compress human genomes based on entropy coding, using a reference genome and known Single Nucleotide Polymorphisms (SNPs). Furthermore, we explore several intrinsic features of genomes and information in other genomic databases to further improve the compression attained. Using these methods, we compress James Watson's genome to 2.5 megabytes (MB), improving on recent work by 37%. Similar compression is obtained for most genomes available from the 1000 Genomes Project. Our biologically inspired techniques promise even greater gains for genomes of lower organisms and for human genomes as more genomic data become available.Code is available at sourceforge.net/projects/genomezip/

    View details for DOI 10.1093/bioinformatics/btt362

    View details for Web of Science ID 000323344800018

    View details for PubMedID 23793748

  • Successive Refinement With Decoder Cooperation and Its Channel Coding Duals IEEE TRANSACTIONS ON INFORMATION THEORY Asnani, H., Permuter, H. H., Weissman, T. 2013; 59 (9): 5511-5533
  • QualComp: a new lossy compressor for quality scores based on rate distortion theory BMC BIOINFORMATICS Ochoa, I., Asnani, H., Bharadia, D., Chowdhury, M., Weissman, T., Yona, G. 2013; 14

    Abstract

    Next Generation Sequencing technologies have revolutionized many fields in biology by reducing the time and cost required for sequencing. As a result, large amounts of sequencing data are being generated. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users. This data consists of both the nucleotide sequences and per-base quality scores that indicate the level of confidence in the readout of these sequences. Quality scores account for about half of the required disk space in the commonly used FASTQ format (before compression), and therefore the compression of the quality scores can significantly reduce storage requirements and speed up analysis and transmission of sequencing data.In this paper, we present a new scheme for the lossy compression of the quality scores, to address the problem of storage. Our framework allows the user to specify the rate (bits per quality score) prior to compression, independent of the data to be compressed. Our algorithm can work at any rate, unlike other lossy compression algorithms. We envisage our algorithm as being part of a more general compression scheme that works with the entire FASTQ file. Numerical experiments show that we can achieve a better mean squared error (MSE) for small rates (bits per quality score) than other lossy compression schemes. For the organism PhiX, whose assembled genome is known and assumed to be correct, we show that it is possible to achieve a significant reduction in size with little compromise in performance on downstream applications (e.g., alignment).QualComp is an open source software package, written in C and freely available for download at https://sourceforge.net/projects/qualcomp.

    View details for DOI 10.1186/1471-2105-14-187

    View details for Web of Science ID 000321136300001

    View details for PubMedID 23758828

    View details for PubMedCentralID PMC3698011

  • Real-Time Coding With Limited Lookahead IEEE TRANSACTIONS ON INFORMATION THEORY Asnani, H., Weissman, T. 2013; 59 (6): 3582-3606
  • Multiterminal Source Coding With Action-Dependent Side Information IEEE TRANSACTIONS ON INFORMATION THEORY Chia, Y., Asnani, H., Weissman, T. 2013; 59 (6): 3653-3667
  • Directed Information, Causal Estimation, and Communication in Continuous Time IEEE TRANSACTIONS ON INFORMATION THEORY Weissman, T., Kim, Y., Permuter, H. H. 2013; 59 (3): 1271-1287
  • Multiterminal Source Coding with Action Dependent Side Information IEEE Trans. Inform. Theory Chi, Y.K., Asnani, H., Weissman, T. 2013; 59 (6): 3653-3667
  • Secure Source Coding with a Public Helper Kittichokechai, K., Chia, Y., Oechtering, T. J., Skoglund, M., Weissman, T., IEEE IEEE. 2013: 2209-+

    Abstract

    We consider secure multi-terminal source coding problems in the presence of a public helper. Two main scenarios are studied: 1) source coding with a helper where the coded side information from the helper is eavesdropped by an external eavesdropper, 2) triangular source coding with a helper where the helper is considered as a public terminal. We are interested in how the helper can support the source transmission subject to a constraint on the amount of information leaked due to its public nature. We characterize the tradeoff between transmission rate, incurred distortion, and information leakage rate at the helper/eavesdropper in the form of a rate-distortion-leakage region for various classes of problems.

    View details for PubMedID 29398720

  • QualComp: a new lossy compressor for quality scores based on rate distortion theory. BMC bioinformatics Ochoa, I., Asnani, H., Bharadia, D., Chowdhury, M., Weissman, T., Yona, G. 2013; 14: 187-?

    Abstract

    Next Generation Sequencing technologies have revolutionized many fields in biology by reducing the time and cost required for sequencing. As a result, large amounts of sequencing data are being generated. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users. This data consists of both the nucleotide sequences and per-base quality scores that indicate the level of confidence in the readout of these sequences. Quality scores account for about half of the required disk space in the commonly used FASTQ format (before compression), and therefore the compression of the quality scores can significantly reduce storage requirements and speed up analysis and transmission of sequencing data.In this paper, we present a new scheme for the lossy compression of the quality scores, to address the problem of storage. Our framework allows the user to specify the rate (bits per quality score) prior to compression, independent of the data to be compressed. Our algorithm can work at any rate, unlike other lossy compression algorithms. We envisage our algorithm as being part of a more general compression scheme that works with the entire FASTQ file. Numerical experiments show that we can achieve a better mean squared error (MSE) for small rates (bits per quality score) than other lossy compression schemes. For the organism PhiX, whose assembled genome is known and assumed to be correct, we show that it is possible to achieve a significant reduction in size with little compromise in performance on downstream applications (e.g., alignment).QualComp is an open source software package, written in C and freely available for download at https://sourceforge.net/projects/qualcomp.

    View details for DOI 10.1186/1471-2105-14-187

    View details for PubMedID 23758828

  • Efficient Similarity Queries via Lossy Compression 51st Annual Allerton Conference on Communication, Control, and Computing Ochoa, I., Ingber, A., Weissman, T. IEEE. 2013: 883–889
  • Unsupervised Learning and Universal Communication IEEE International Symposium on Information Theory Proceedings (ISIT) Misra, V., Weissman, T. IEEE. 2013: 261–265
  • The Role of Lookahead in Estimation under Gaussian Noise IEEE International Symposium on Information Theory Proceedings (ISIT) Venkat, K., Weissman, T., Carmon, Y., Shamai, S. IEEE. 2013: 2850–2854
  • Minimax Filtering Regret via Relations Between Information and Estimation IEEE International Symposium on Information Theory Proceedings (ISIT) No, A., Weissman, T. IEEE. 2013: 444–448
  • Complexity and Rate-Distortion Tradeoff via Successive Refinement 51st Annual Allerton Conference on Communication, Control, and Computing No, A., Ingber, A., Weissman, T. IEEE. 2013: 1531–1536
  • Distortion Rate Function of Sub-Nyquist Sampled Gaussian Sources Corrupted by Noise 51st Annual Allerton Conference on Communication, Control, and Computing Kipnis, A., Goldsmith, A. J., Weissman, T., Eldar, Y. C. IEEE. 2013: 901–908
  • Reliable Uncoded Communication in the Underdetermined SIMO MAC with Low-Complexity Decoding 51st Annual Allerton Conference on Communication, Control, and Computing Chowdhury, M., Goldsmith, A., Weissman, T. IEEE. 2013: 1075–1081
  • Compression for Exact Match Identification IEEE International Symposium on Information Theory Proceedings (ISIT) Ingber, A., Courtade, T., Weissman, T. IEEE. 2013: 654–658
  • Pointwise Relations between Information and Estimation in the Poisson Channel IEEE International Symposium on Information Theory Proceedings (ISIT) Jiao, J., Venkat, K., Weissman, T. IEEE. 2013: 449–453
  • Reliable Uncoded Communication in the SIMO MAC via Low-Complexity Decoding IEEE International Symposium on Information Theory Proceedings (ISIT) Chowdhury, M., Goldsmith, A., Weissman, T. IEEE. 2013: 1645–1649
  • Network Compression: Worst-Case Analysis IEEE International Symposium on Information Theory Proceedings (ISIT) Asnani, H., Shomorony, I., Avestimehr, A. S., Weissman, T. IEEE. 2013: 196–200
  • Capacity of a POST Channel with and without Feedback IEEE International Symposium on Information Theory Proceedings (ISIT) Asnani, H., Permuter, H. H., Weissman, T. IEEE. 2013: 2538–2542
  • Relations between Information and Estimation in the presence of Feedback (invited contribution to) Information and Control in Networks (accepted) Asnani, H., Venkat, K., Weissman, T. Cambridge University Press. 2013
  • Operational Extremality of Gaussianity in Network Compression, Communication, and Coding IEEE Information Theory Workshop (ITW) Asnani, H., Shomorony, I., Avestimehr, A. S., Weissman, T. IEEE. 2013
  • Quadratic Similarity Queries on Compressed Data Data Compression Conference (DCC) Ingber, A., Courtade, T., Weissman, T. IEEE. 2013: 441–450
  • Uncoded transmission in MAC channels achieves arbitrarily small error probability 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton) Chowdhury, M., Goldsmith, A., Weissman, T. IEEE. 2013: 1983–1990
  • On Real Time Coding with Limited Lookahead IEEE Trans. Inform. Theory Asnani, H., Weissman, T. 2013; 59 (6): 3582-3606
  • On Information, Estimation and Lookahead 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton) Venkat, K., Weissman, T., Carmon, Y., Shamai, S. IEEE. 2013: 1292–1299
  • The Human Genome Contracts Again Bioinformatics Pavlichin, D., Yona, G., Weissman, T. 2013
  • Estimation with a Helper who Knows the Interference IEEE Trans. Inform. Theory Chi, Y.K., Soundararajan, R., Weissman, T. 2013; 59 (11): 7097-7117
  • QualComp: a new lossy compressor for quality scores based on rate distortion theory BMC Bioinformatics Ochoa, I., Asnani, H., Bharadia, D., Chowdhury, M., Weissman, T. 2013: 14:187
  • Achievable Error Exponents in the Gaussian Channel with Rate-Limited Feedback IEEE Trans. Inform. Theory Mirghaderi, R., Goldsmith, A., Weissman, T. 2013; 59 (12): 8144-8156
  • Pointwise Relations Between Information and Estimation in Gaussian Noise IEEE TRANSACTIONS ON INFORMATION THEORY Venkat, K., Weissman, T. 2012; 58 (10): 6264-6281
  • Achievable Complexity-Performance Tradeoffs in Lossy Compression PROBLEMS OF INFORMATION TRANSMISSION Gupta, A., Verdu, S., Weissman, T. 2012; 48 (4): 352-375
  • An MCMC Approach to Universal Lossy Compression of Analog Sources IEEE TRANSACTIONS ON SIGNAL PROCESSING Baron, D., Weissman, T. 2012; 60 (10): 5230-5240
  • Block and Sliding-Block Lossy Compression via MCMC IEEE TRANSACTIONS ON COMMUNICATIONS Jalali, S., Weissman, T. 2012; 60 (8): 2187-2198
  • Cascade and Triangular Source Coding With Side Information at the First Two Nodes IEEE TRANSACTIONS ON INFORMATION THEORY Permuter, H. H., Weissman, T. 2012; 58 (6): 3339-3349
  • Denoising via MCMC-Based Lossy Compression IEEE TRANSACTIONS ON SIGNAL PROCESSING Jalali, S., Weissman, T. 2012; 60 (6): 3092-3100
  • Lossy Compression of Discrete Sources via the Viterbi Algorithm IEEE TRANSACTIONS ON INFORMATION THEORY Jalali, S., Montanari, A., Weissman, T. 2012; 58 (4): 2475-2489
  • Mutual Information, Relative Entropy, and Estimation in the Poisson Channel IEEE TRANSACTIONS ON INFORMATION THEORY Atar, R., Weissman, T. 2012; 58 (3): 1302-1318
  • Cascade, Triangular and Two Way Source Coding with Degraded Side Information at the Second User IEEE Trans. Inform. Theory Chia, Y. K., Permuter, H., Weissman, T. 2012; 58 (1): 189-206
  • The Degraded Broadcast Channel with Action-Dependent States IEEE International Symposium on Information Theory Steinberg, Y., Weissman, T. IEEE. 2012: 596–600
  • Estimation with a Helper Who Knows the Interference IEEE International Symposium on Information Theory Chia, Y., Soundararajan, R., Weissman, T. IEEE. 2012: 706–710
  • Multiterminal Source Coding under Logarithmic Loss IEEE International Symposium on Information Theory Courtade, T. A., Weissman, T. IEEE. 2012: 761–765
  • Reference Based Genome Compression IEEE Information Theory Workshop (ITW) Chern, B. G., Ochoa, I., Manolakos, A., No, A., Venkat, K., Weissman, T. IEEE. 2012: 427–431
  • Universal Estimation of Directed Information via Sequential Probability Assignments IEEE International Symposium on Information Theory Jiao, J., Permuter, H. H., Zhao, L., Kim, Y., Weissman, T. IEEE. 2012: 523–527
  • Worst-Case Source for Distributed Compression with Quadratic Distortion IEEE Information Theory Workshop (ITW) Shomorony, I., Avestimehr, A. S., Asnani, H., Weissman, T. IEEE. 2012: 187–191
  • Successive Refinement with Cribbing Decoders and its Channel Coding Duals IEEE International Symposium on Information Theory Asnani, H., Permuter, H., Weissman, T. IEEE. 2012
  • Cascade, Triangular, and Two-Way Source Coding With Degraded Side Information at the Second User IEEE TRANSACTIONS ON INFORMATION THEORY Chia, Y., Permuter, H. H., Weissman, T. 2012; 58 (1): 189-206
  • Pointwise Relations between Information and Estimation in Gaussian Noise IEEE International Symposium on Information Theory Venkat, K., Weissman, T. IEEE. 2012: 701–705
  • Joint Source-Channel Coding of one Random Variable over the Poisson Channel IEEE International Symposium on Information Theory No, A., Venkat, K., Weissman, T. IEEE. 2012
  • The Porosity of Additive Noise Sequences IEEE International Symposium on Information Theory Misra, V., Weissman, T. IEEE. 2012
  • Achievable Complexity-Performance Tradeoffs in Lossy Compression Problems of Information Transmission Gupta, A., Verdú, S., Weissman, T. 2012; 48 (4): 352-375
  • An MCMC Approach to Universal Lossy Compression of Analog Sources IEEE Trans. Sig. Proc. Baron, D., Weissman, T. 2012; 60 (10): 5230-5240
  • Cascade and Triangular Source Coding with Side Information at the First Two Nodes IEEE Trans. Inform. Theory Permuter, H. H., Weissman, T. 2012; 58 (6): 3309-3349
  • Denoising via MCMC-based Lossy Compression IEEE Trans. Sig. Proc. Jalali, S., Weissman, T. 2012; 60 (6): 3092-3100
  • Block and Sliding-Block Lossy Compression via MCMC IEEE Trans. on Communications Jalali, S., Weissman, T. 2012; 60 (8): 2187-2198
  • Mutual Information, Relative Entropy, and Estimation in the Poisson Channel IEEE Trans. Inform. Theory Atar, R., Weissman, T. 2012; 58 (3): 1302-1318
  • Probing Capacity IEEE TRANSACTIONS ON INFORMATION THEORY Asnani, H., Permuter, H., Weissman, T. 2011; 57 (11): 7317-7332
  • Source Coding With a Side Information "Vending Machine" IEEE TRANSACTIONS ON INFORMATION THEORY Permuter, H. H., Weissman, T. 2011; 57 (7): 4530-4544
  • Interpretations of Directed Information in Portfolio Theory, Data Compression, and Hypothesis Testing IEEE TRANSACTIONS ON INFORMATION THEORY Permuter, H. H., Kim, Y., Weissman, T. 2011; 57 (6): 3248-3259
  • Error Exponents for the Gaussian Channel With Active Noisy Feedback IEEE TRANSACTIONS ON INFORMATION THEORY Kim, Y., Lapidoth, A., Weissman, T. 2011; 57 (3): 1223-1236
  • Error Exponents for the Gaussian Channel with Active Noisy Feedback IEEE Trans. Inform. Theory Kim, Y.H., Lapidoth, A., Weissman, T. 2011; 57 (3): 1223-1236
  • Mutual Information, Relative Entropy, and Estimation in the Poisson Channel IEEE International Symposium on Information Theory (ISIT) Atar, R., Weissman, T. IEEE. 2011: 708–712
  • Discrete denoising of heterogeneous two-dimensional data IEEE International Symposium on Information Theory (ISIT) Moon, T., Weissman, T., Kim, J. IEEE. 2011: 1041–1045
  • Continuous-Time Directed Information and Its Role in Communication IEEE Information Theory Workshop (ITW) Permuter, H. H., Kim, Y., Weissman, T. IEEE. 2011
  • Bounds on the Entropy Rate of Binary Hidden Markov Processes Entropy of Hidden Markov Processes and Connections to Dynamical Systems Ordentlich, E., Weissman, T. edited by Marcus, B., Petersen, K., Weissman, T. Cambridge University Press. 2011: 117–171
  • Entropy of Hidden Markov Processes and Connections to Dynamical Systems Marcus, B., Petersen, K., Weissman, T. edited by Marcus, B., Petersen, K., Weissman, T. Cambridge University Press. 2011
  • Cascade and Triangular source coding with causal side information IEEE International Symposium on Information Theory (ISIT) Chia, Y., Weissman, T. IEEE. 2011: 1683–1687
  • Multi-terminal Source Coding With Action Dependent Side Information IEEE International Symposium on Information Theory (ISIT) Chia, Y., Asnani, H., Weissman, T. IEEE. 2011
  • To Feed or Not to Feed Back IEEE International Symposium on Information Theory (ISIT) Asnani, H., Permuter, H., Weissman, T. IEEE. 2011: 159–163
  • Capacity of Channels With Action-Dependent States IEEE TRANSACTIONS ON INFORMATION THEORY Weissman, T. 2010; 56 (11): 5396-5411
  • The Relationship Between Causal and Noncausal Mismatched Estimation in Continuous-Time AWGN Channels IEEE TRANSACTIONS ON INFORMATION THEORY Weissman, T. 2010; 56 (9): 4256-4273
  • Tighter Bounds on the Capacity of Finite-State Channels Via Markov Set-Chains IEEE TRANSACTIONS ON INFORMATION THEORY Chen, J., Permuter, H., Weissman, T. 2010; 57 (8): 3660-3691
  • Two-Way Source Coding With a Helper IEEE TRANSACTIONS ON INFORMATION THEORY Permuter, H. H., Steinberg, Y., Weissman, T. 2010; 56 (6): 2905-2919
  • Universal Reinforcement Learning IEEE TRANSACTIONS ON INFORMATION THEORY Farias, V. F., Moallemi, C. C., Van Roy, B., Weissman, T. 2010; 56 (5): 2441-2454
  • A Universal Scheme for Wyner-Ziv Coding of Discrete Sources IEEE TRANSACTIONS ON INFORMATION THEORY Jalali, S., Verdu, S., Weissman, T. 2010; 56 (4): 1737-1750
  • Capacity of Channels with Action-Dependent States IEEE Trans. Inform. Theory Weissman, T. 2010; 56 (11): 5396-5411
  • An MCMC Approach to Lossy Compression of Continuous Sources Baron, D., Weissman, T., Storer, J. A., Marcellin, M. W. IEEE COMPUTER SOC. 2010: 40–48
  • Cascade and Triangular Source Coding with Side Information at the First Two Nodes 2010 IEEE International Symposium on Information Theory Permuter, H. H., Weissman, T. IEEE. 2010: 31–35
  • Universal Lossless Compression-based Denoising 2010 IEEE International Symposium on Information Theory Su, H., Weissman, T. IEEE. 2010: 1648–1652
  • The Relationship Between Causal and Non-Causal Mismatched Estimation in Continuous-Time AWGN Channels IEEE Trans. Inform. Theory Weissman, T. 2010; 56 (9): 4256-4273
  • A Universal Scheme for Wyner-Ziv Coding of Discrete Sources IEEE Trans. Inform. Theory Jalali, S., Verdú, S., Weissman, T. 2010; 56 (4): 1737-1750
  • Universal Estimation of Directed Information 2010 IEEE International Symposium on Information Theory Zhao, L., Permuter, H., Kim, Y., Weissman, T. IEEE. 2010: 1433–1437
  • Discrete Denoising With Shifts IEEE TRANSACTIONS ON INFORMATION THEORY Moon, T., Weissman, T. 2009; 55 (11): 5284-5301
  • A Context Quantization Approach to Universal Denoising IEEE TRANSACTIONS ON SIGNAL PROCESSING Sivaramakrishnan, K., Weissman, T. 2009; 57 (6): 2110-2129
  • Capacity Region of the Finite-State Multiple-Access Channel With and Without Feedback IEEE International Symposium on Information Theory Permuter, H. H., Weissman, T., Chen, J. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2009: 2455–77
  • Universal FIR MMSE Filtering IEEE TRANSACTIONS ON SIGNAL PROCESSING Moon, T., Weissman, T. 2009; 57 (3): 1068-1083
  • Finite State Channels With Time-Invariant Deterministic Feedback IEEE TRANSACTIONS ON INFORMATION THEORY Permuter, H. H., Weissman, T., Goldsmith, A. J. 2009; 55 (2): 644-662
  • A Context Quantization Approach to Universal denoising IEEE Trans. Sig. Proc. Sivaramakrishnan, K., Weissman, T. 2009; 57 (6): 2110-2129
  • Two-way source coding with a common helper IEEE International Symposium on Information Theory (ISIT 2009) Permuter, H., Steinberg, Y., Weissman, T. IEEE. 2009: 1473–1477
  • Problems We Can Solve With a Helper IEEE Information Theory Workshop on Networking and Information Theory Permuter, H., Steinberg, Y., Weissman, T. IEEE. 2009: 266–270
  • An Outer Bound for Side-Information Scalable Source Coding with Partially Cooperating Decoders IEEE International Symposium on Information Theory (ISIT 2009) Bross, S. I., Weissman, T. IEEE. 2009: 1194–1198
  • Multiple Description Coding of Discrete Ergodic Sources 47th Annual Allerton Conference on Communication, Control, and Computing Jalali, S., Weissman, T. IEEE. 2009: 1256–1261
  • Directed Information, Causal Estimation, and Communication in Continuous Time 7th International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Kim, Y., Permuter, H. H., Weissman, T. IEEE. 2009: 632–638
  • Source Coding with a Side Information 'Vending Machine' at the Decoder IEEE International Symposium on Information Theory (ISIT 2009) Permuter, H., Weissman, T. IEEE. 2009: 1030–1034
  • Directed Information and Causal Estimation in Continuous Time IEEE International Symposium on Information Theory (ISIT 2009) Kim, Y., Permuter, H. H., Weissman, T. IEEE. 2009: 819–823
  • An Iterative Scheme for Near Optimal and Universal Lossy Compression IEEE Information Theory Workshop on Networking and Information Theory Jalali, S., Montanari, A., Weissman, T. IEEE. 2009: 231–235
  • Capacity of Channels with Action-Dependent States IEEE International Symposium on Information Theory (ISIT 2009) Weissman, T. IEEE. 2009: 1794–1798
  • Discrete Denoising with Shifts IEEE Trans. Inform. Theory Moon, T., Weissman, T. 2009; 55 (11): 5284-5301
  • Finite-state channels with time-invariant deterministic feedback IEEE Trans. Inform. Theory Permuter, H., Weissman, T. 2009; 55 (2): 644-662
  • Capacity Region of the Multiple Access Channel With or Without Feedback IEEE Trans. Inform. Theory Permuter, H., Weissman, T. 2009; 55 (6): 2455-2477
  • Universal Denoising of Discrete-Time Continuous-Amplitude Signals IEEE TRANSACTIONS ON INFORMATION THEORY Sivaramakrishnan, K., Weissman, T. 2008; 54 (12): 5632-5660
  • Scanning and Sequential Decision Making for Multidimensional Data-Part II: The Noisy Case IEEE TRANSACTIONS ON INFORMATION THEORY Cohen, A., Weissman, T., Merhav, N. 2008; 54 (12): 5609-5631
  • The Information Lost in Erasures IEEE International Symposium on Information Theory Verdu, S., Weissman, T. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2008: 5030–58
  • Coding for additive white noise channels with feedback corrupted by quantization or bounded noise IEEE TRANSACTIONS ON INFORMATION THEORY Martins, N. C., Weissman, T. 2008; 54 (9): 4274-4282
  • How to filter an "individual sequence with feedback" 44th Annual Allerton Conference on Communication, Control and Computing Weissman, T. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2008: 3831–41
  • Capacity of the trapdoor channel with feedback IEEE International Symposium on Information Theory Permuter, H., Cuff, P., Van Roy, B., Weissman, T. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2008: 3150–65
  • Universal filtering via hidden Markov modeling IEEE International Symposium on Information Theory and Its Applications Moon, T., Weissman, T. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2008: 692–708
  • Coding Schemes for Additive White Noise Channels with Feedback Corrupted by Quantization or Bounded Noise IEEE Trans. Inform. Theory Martins, N.C., Weissman, T. 2008; 54 (9): 4274-4282
  • On Successive Refinement for the Wyner-Ziv Problem with Partially Cooperating Decoders 2008 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY PROCEEDINGS, VOLS 1-6 Bross, S. I., Weissman, T. 2008: 2172-?
  • Rate-Distortion in Near-Linear Time IEEE International Symposium on Information Theory Gupta, A., Verdu, S., Weissman, T. IEEE. 2008: 847–851
  • Near optimal lossy source coding and compression-based denoising via Markov Chain Monte Carlo 42nd Annual Conference on Information Sciences and Systems Jalali, S., Weissman, T. IEEE. 2008: 441–446
  • RATE-DISTORTION WITH COMMON RATE-LIMITED SIDE INFORMATION TO THE ENCODER AND DECODER 47th AIAA Aerospace Sciences Meeting and Exhibit including the New Horizons Forum and Aerospace Exposition Permuter, H., Steinberg, Y., Weissman, T. IEEE. 2008: 777–779
  • Rate-Distortion via Markov Chain Monte Carlo IEEE International Symposium on Information Theory Jalali, S., Weissman, T. IEEE. 2008: 852–856
  • Lossy source coding via Markov chain Monte Carlo International Zurich Seminar on Communications Jalali, S., Weissman, T. IEEE. 2008: 80–83
  • On the Capacity of Finite-State Channels IEEE International Symposium on Information Theory Chen, J., Permuter, H., Weissman, T. IEEE. 2008: 1223–1227
  • New Bounds for the Capacity Region of the Finite-State Multiple Access Channel IEEE International Symposium on Information Theory Permuter, H. H., Weissman, T., Chen, J. IEEE. 2008: 394–398
  • Scanning and sequential decision making for multi-dimensional data, Part II: the noisy case IEEE Trans. Inform. Theory Cohen, A., Weissman, T., Merhav, N. 2008; 54 (12): 5609-5631
  • An Implementable Scheme for Universal Lossy Compression of Discrete Markov Sources 19th Data Compression Conference Jalali, S., Montanari, A., Weissman, T. IEEE COMPUTER SOC. 2008: 292–301
  • On Directed Information and Gambling IEEE International Symposium on Information Theory Permuter, H. H., Kim, Y., Weissman, T. IEEE. 2008: 1403–1407
  • Universal denoising of discretetime continuous-amplitude signals IEEE Trans. Inform. Theory Sivaramakrishnan, K., Weissman, T. 2008; 54 (12): 5632-5660
  • The Information Lost in Erasures IEEE Trans. Inform. Theory Verdú, S., Weissman, T. 2008; 54 (11): 5030-5058
  • How to filter an 'individual sequence with feedback' IEEE Trans. Inform. Theory Weissman, T. 2008; 54 (8): 3831-3841
  • Discrete Universal Filtering via Hidden Markov Modeling IEEE Trans. Inform. Theory Moon, T., Weissman, T. 2008; 54 (2): 692 - 708
  • Universal FIR MMSE filtering IEEE Trans. Sig. Proc. Moon, T., Weissman, T. 2008; 57 (3): 1068-1083
  • Scanning and sequential decision making for multidimensional data - Part I : The noiseless case IEEE International Symposium on Information Theory Cohen, A., Merhav, N., Weissman, T. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2007: 3001–20
  • Universal filtering via prediction Data Compression Conference (DCC 2004) Weissman, T., Ordentlich, E., Weinberger, M. J., Somekh-Baruch, A., Merhav, N. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2007: 1253–64
  • Denoising and filtering under the probability of excess loss criterion 43rd Annual Allerton Conference on Communication, Control and Computing Pereira, S., Weissman, T. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2007: 1265–81
  • Capacity and zero-error capacity of the chemical channel with feedback IEEE International Symposium on Information Theory Permuter, H., Cuff, P., Van Roy, B., Weissman, T. IEEE. 2007: 1866–1870
  • Scanning and sequential decision making for multidimensional data IEEE Information Theory and Applications Workshop Cohen, A., Merhav, N., Weissman, T. IEEE. 2007: 99–108
  • Scanning, filtering and prediction for random fields corrupted by Gaussian noise IEEE International Symposium on Information Theory Cohen, A., Merhav, N., Weissman, T. IEEE. 2007: 691–695
  • The Gaussian channel with noisy feedback IEEE International Symposium on Information Theory Kim, Y., Lapidoth, A., Weissman, T. IEEE. 2007: 1416–1420
  • Competitive on-line linear FIR MMSE filtering IEEE International Symposium on Information Theory Moon, T., Weissman, T. IEEE. 2007: 1126–1130
  • A context quantization approach to universal denoising IEEE International Symposium on Information Theory Sivaramakrishnan, K., Weissman, T. IEEE. 2007: 2361–2365
  • Scanning and sequential decision making for multi-dimensional data, Part I: the noiseless case IEEE Trans. Inform. Theory Cohen, A., Merhav, N., Weissman, T. 2007; 53 (9): 3001 - 3020
  • New bounds on the rate-distortion function of a binary Markov source IEEE International Symposium on Information Theory Jalali, S., Weissman, T. IEEE. 2007: 571–575
  • A universal Wyner-Ziv scheme for discrete sources IEEE International Symposium on Information Theory Jalali, S., Verdu, S., Weissman, T. IEEE. 2007: 1951–1955
  • On separation in the presence of feedback IEEE Information Theory Workshop Permuter, H., Weissman, T. IEEE. 2007: 266–270
  • Reflections on the DUDE IEEE Information Theory Society Newsletter Ordentlich, E., Serouss, G., Verd_x0013_u, S., Weinberger, M., Weissman, T. 2007; 57 (2): 5-10
  • Denoising and filtering under the probability of excess loss criterion IEEE Trans. Inform. Theory Matloub, S., Weissman, T. 2007; 53 (4): 1265 - 1281
  • Source coding with limited-look-ahead side information at the decoder IEEE International Symposium on Information Theory Weissman, T., El Gamal, A. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2006: 5218–39
  • Universal zero-delay joint source-channel coding 38th Conference on Information Sciences and Systems Matloub, S., Weissman, T. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2006: 5240–50
  • On the entropy rate of pattern processes IEEE TRANSACTIONS ON INFORMATION THEORY Gemelos, G. M., Weissman, T. 2006; 52 (9): 3994-4007
  • Coding for the feedback Gel'fand-Pinsker channel and the feedforward Wyner-Ziv source IEEE International Symposium on Information Theory and Its Applications Merhav, N., Weissman, T. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2006: 4207–11
  • Universal minimax discrete denoising under channel uncertainty IEEE International Symposium on Information Theory Gemelos, G. M., Sigurjonsson, S., Weissman, T. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2006: 3476–97
  • Algorithms for discrete denoising under channel uncertainty IEEE TRANSACTIONS ON SIGNAL PROCESSING Gemelos, G. A., Sigurjonsson, S., Weissman, T. 2006; 54 (6): 2263-2276
  • Source coding with limited side information lookahead at the decoder IEEE International Symposium on Information Theory El Gamal, A., Weissman, T. IEEE. 2006: 2441–2445
  • Universal scanning and sequential decision making for multidimensional data IEEE International Symposium on Information Theory Cohen, A., Merhav, N., Weissman, T. IEEE. 2006: 431–435
  • Erasure entropy IEEE International Symposium on Information Theory Verdu, S., Weissman, T. IEEE. 2006: 98–102
  • Universal scanning of mixing ra the performance of the peano ndom fields and Peano-Hilbert scan 24th IEEE Convention of Electrical and Electronics Engineers in Israel Cohen, A., Merhav, N., Weissman, T. IEEE. 2006: 62–66
  • On the optimality of symbol-by-symbol filtering and denoising IEEE International Symposium on Information Theory Ordentlich, E., Weissman, T. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2006: 19–40
  • Compound sequential decisions against the well-informed antagonist IEEE Information Theory Workshop Weissman, T. IEEE. 2006: 77–81
  • Universal denoising of continuous amplitude signals with applications to images IEEE International Conference on Image Processing (ICIP 2006) Sivaramakrishnan, K., Weissman, T. IEEE. 2006: 2609–2612
  • Bounds on the error exponent of the AWGN channel with AWGN-corrupted feedback 24th IEEE Convention of Electrical and Electronics Engineers in Israel Kim, Y., Lapidoth, A., Weissman, T. IEEE. 2006: 184–188
  • Capacity of finite-state channels with time-invariant deterministic feedback IEEE International Symposium on Information Theory Permuter, H., Weissman, T., Goldsmith, A. IEEE. 2006: 64–68
  • Universal Zero-Delay Joint Source-Channel Coding IEEE Trans. Inform. Theory Matloub, S., Weissman, T. 2006; 52 (12): 5240 - 5250
  • Universal Minimax Discrete Denoising under Channel Uncertainty IEEE Trans. Inform. Theory Gemelos, G., Sigurjonsson, S., Weissman, T. 2006; 52 (8): 3476-3497
  • Universal denoising of discrete-time continuous-amplitude signals IEEE International Symposium on Information Theory Sivaramakrishnan, K., Weissman, T. IEEE. 2006: 2531–2535
  • Source Coding with Limited Side Information Lookahead at the Decoder IEEE Trans. Inform. Theory Weissman, T., Gamal, A., El 2006; 52 (12): 5218 - 5239
  • 2006 Kailath Lecture and Colloquia Verdú, S., Weissman, T. Newsletter article in the IEEE Information Theory Society Newsletter. 2006
  • On the Entropy Rate of Pattern Processes IEEE Trans. Inform. Theory Gemelos, G., Weissman, T. 2006; 52 (9): 3994 - 4007
  • Algorithms for Discrete Denoising under Channel Uncertainty IEEE Trans. Inform. Theory Gemelos, G., Sigurjonsson, S., Weissman, T. 2006; 54 (6): 2263-2276
  • On the Optimality of Symbol by Symbol Fitering and Denoising IEEE Trans. Inform. Theory Ordentlich, E., Weissman, T. 2006; 52 (1): 19-40
  • Coding for the feedback Gel'fand-Pinsker channel and the feedforward Wyner-Ziv source IEEE Trans. Inform. Theory Weissman, T., Merhav, N. 2006; 52 (9): 4207 - 4211
  • The empirical distribution of rate-constrained source codes IEEE TRANSACTIONS ON INFORMATION THEORY Weissman, T., Ordentlich, E. 2005; 51 (11): 3718-3733
  • On causal source codes with side information 40th Annual Allerton Conference on Communication, Control and Computing Weissman, T., Merhav, N. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2005: 4003–13
  • Universal denoising for the finite-input general-output channel IEEE International Symposium on Information Theory Dembo, A., Weissman, T. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2005: 1507–17
  • Universal Denoising for the Finite-Input- General-Output Channel IEEE Trans. Inform. Theory Dembo, A., Weissman, T. 2005; 51 (4): 1507-1517
  • DISCRETE DENOISING FOR CHANNELS WITH MEMORY COMMUNICATIONS IN INFORMATION AND SYSTEMS Zhang, R., Weissman, T. 2005; 5 (2): 257–88
  • On the entropy rate of pattern processes IEEE Data Compression Conference Gemelos, G. M., Weissman, T. IEEE COMPUTER SOC. 2005: 233–242
  • Universal discrete denoising: Known channel IEEE International Symposium on Information Theory Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., Weinberger, M. J. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2005: 5–28
  • On the relationship between process and pattern entropy rate IEEE International Symposium on Information Theory and Its Applications Gemelos, G. M., Weissman, T. IEEE. 2005: 2208–2212
  • Coding for the feedback Gel'fand-Pinsker channel and the feedforward Wyner-Ziv source IEEE International Symposium on Information Theory and Its Applications Merhav, N., Weissman, T. IEEE. 2005: 1506–1510
  • Approximations for the entropy rate of a hidden Markov process IEEE International Symposium on Information Theory and Its Applications Ordentlich, E., Weissman, T. IEEE. 2005: 2198–2202
  • Multi-directional context sets with applications to universal denoising and compression IEEE International Symposium on Information Theory and Its Applications Ordentlich, E., Weinberger, M. J., Weissman, T. IEEE. 2005: 1270–1274
  • Discrete universal filtering via hidden Markov modelling IEEE International Symposium on Information Theory and Its Applications Moon, T., Weissman, T. IEEE. 2005: 1285–1289
  • A universal scheme for learning IEEE International Symposium on Information Theory and Its Applications Farias, V. F., Moallemi, C. C., Van Roy, B., Weissman, T. IEEE. 2005: 1158–1162
  • Asymptotic filtering and entropy rate of a hidden Markov process in the rare transitions regime IEEE International Symposium on Information Theory and Its Applications Nair, C., Ordentlich, E., Weissman, T. IEEE. 2005: 1838–1842
  • On Causal Source Codes with Side Information IEEE Trans. Inform. Theory Weissman, T., Merhav, N. 2005; 51 (11): 4003-4013
  • Discrete Denoising for Channels with Memory Comm. in Information and Systems Zhang, R., Weissman, T. 2005; 5 (2): 257-288
  • The empirical distribution of rate-constrained codes IEEE Trans. Inform. Theory Weissman, T., Ordentlich, E. 2005; 51 (11): 3718-3733
  • Universal Discrete Denoising: Known Channel IEEE Trans. Inform. Theory Weissman, T., Ordentlich, E., Seroussi, G., Verdú, S., Weinberger, M. 2005; 51 (1): 5-28
  • Universally attainable error exponents for rate-distortion coding of noisy sources IEEE TRANSACTIONS ON INFORMATION THEORY Weissman, T. 2004; 50 (6): 1229-1246
  • Universal prediction of random binary sequences in a noisy environment ANNALS OF APPLIED PROBABILITY Weissman, T., Merhav, N. 2004; 14 (1): 54-89
  • Universal prediction of random binary sequences in a noisy environment Annals of Applied Probability Weissman, T., Merhav, N. 2004; 14 (1): 54-89
  • Universal denoising for the finite-input-general-output channel IEEE International Symposium on Information Theory Dembo, A., Weissman, T. IEEE. 2004: 201–201
  • The empirical distribution of rate - Constrained source codes IEEE International Symposium on Information Theory Weissman, T., Ordentlich, E. IEEE. 2004: 464–464
  • Channel decoding of systematically encoded unknown redundant sources IEEE International Symposium on Information Theory Ordentlich, E., Seroussi, G., Verdu, S., Viswanathan, K., Weinberger, M. J., Weissman, T. IEEE. 2004: 165–165
  • Efficient pruning of bi-directional context trees with applications to universal denoising and compression IEEE Information Theory Workshop Ordentlich, E., Weinberger, M. J., Weissman, T. IEEE. 2004: 94–98
  • On the optimality of symbol by symbol filtering and denoising IEEE International Symposium on Information Theory Ordentlich, E., Weissman, T. IEEE. 2004: 200–200
  • Universal minimax binary image denoising under channel uncertainty International Conference on Image Processing (ICIP 2004) Gemelos, G., Sigurjonsson, S., Weissman, T. IEEE. 2004: 997–1000
  • New bounds on the entropy rate of hidden Markov processes IEEE Information Theory Workshop Ordentlich, E., Weissman, T. IEEE. 2004: 117–122
  • Discrete universal filtering through incremental parsing Data Compression Conference (DCC 2004) Ordentlich, E., Weissman, T., Weinberger, M. J., Somekh-Baruch, A., Merhav, N. IEEE COMPUTER SOC. 2004: 352–361
  • Universal minimax discrete denoising under channel uncertainty IEEE International Symposium on Information Theory Gemelos, G., Sigurjonsson, S., Weissman, T. IEEE. 2004: 199–199
  • On the Entropy Rate of Pattern Sequences HP Laboratories Technical Report Gemelos, G., Weissman, T. 2004
  • Universally Attainable Error-Exponents for Rate-Distortion Coding of Noisy Sources IEEE Trans. Inform. Theory Weissman, T. 2004; 50 (6): 1229-1246
  • On competitive prediction and its relation to rate-distortion theory IEEE TRANSACTIONS ON INFORMATION THEORY Weissman, T., Merhav, N. 2003; 49 (12): 3185-3194
  • The minimax distortion redundancy in noisy source coding IEEE TRANSACTIONS ON INFORMATION THEORY Dembo, A., Weissman, T. 2003; 49 (11): 3020-3030
  • Scanning and Prediction in Multi-Dimensional Data Arrays IEEE Trans. Inform. Theory Merhav, N., Weissman, T. 2003; IT-49 (1): 65-82
  • Scanning and prediction in multidimensional data arrays IEEE International Symposium on Information Theory Merhav, N., Weissman, T. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2003: 65–82
  • A discrete universal denoiser and its application to binary images IEEE International Conference on Image Processing Ordentlich, E., Seroussi, G., Verdu, S., Weinberger, M., Weissman, T. IEEE. 2003: 117–120
  • Universally attainable error-exponents for rate-distortion coding of noisy sources IEEE International Symposium on Information Theory Weissman, T. IEEE. 2003: 194–194
  • On competitive prediction and its relation to rate-distortion theory and to channel capacity theory IEEE International Symposium on Information Theory Weissman, T., Merhav, N. IEEE. 2003: 81–81
  • On Competitive Prediction and its Relation to Rate-Distortion Theory IEEE Trans. Inform. Theory Weissman, T., Merhav, N. 2003; 49 (12): 3185-3194
  • On limited-delay lossy coding and filtering of individual sequences IEEE TRANSACTIONS ON INFORMATION THEORY Weissman, T., Merhav, N. 2002; 48 (3): 721-733
  • Tradeoffs between the excess-code-length exponent and the excess-distortion exponent in lossy source coding IEEE TRANSACTIONS ON INFORMATION THEORY Weissman, T., Merhav, N. 2002; 48 (2): 396-415
  • Universally Attainable Error-Exponents for Rate-Constrained Denoising of Noisy Sources HP Laboratories Technical Report Weissman, T. 2002
  • The minimax distortion redundancy in noisy source coding IEEE International Symposium on Information Theory Dembo, A., Weissman, T. IEEE. 2002: 318–318
  • On Competitive Prediction and its Relation to Rate-Distortion Theory and to Channel Capacity Theory Technion – I.I.T., CC Pub., EE Pub. Weissman, T., Merhav, N. 2002
  • On limited-delay lossy coding and filtering of individual sequences IEEE Trans. Inform. Theory Weissman, T., Merhav, N. 2002; IT-48 (3): 721-733
  • Tradeoffs between the excess-code-length exponent and the excess-distortion exponent in lossy source coding IEEE Trans. Inform. Theory Weissman, T., Merhav, N. 2002; IT-48 (2): 396-415
  • Universal prediction of individual binary sequences in the presence of noise IEEE TRANSACTIONS ON INFORMATION THEORY Weissman, T., Merhav, N. 2001; 47 (6): 2151-2173
  • Twofold universal prediction schemes for achieving the finite-state predictability of a noisy individual binary sequence IEEE International Symposium on Information Theory Weissman, T., Merhav, N., Somekh-Baruch, A. IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC. 2001: 1849–66
  • Twofold universal prediction schemes for achieving the finite-state predictability of a noisy individual binary sequence IEEE Trans. Inform. Theory Weissman, T., Merhav, N., Somekh-Baruch, A. 2001; IT-47 (5): 1849-1866
  • Tradeoffs between the excess-code-length exponent and the excess-distortion exponent in lossy source coding Technical Report, CCIT Pub., EE Pub. Weissman, T., Merhav, N. 2001
  • Universal prediction of binary individual sequences in the presence of noise IEEE Trans. Inform. Theory Weissman, T., Merhav, N. 2001; IT-47 (6): 2151-2173
  • Universal prediction of individual binary sequences in the presence of noise Technion – I.I.T., CC Pub., EE Pub. Weissman, T., Merhav, N. 1999
  • Not all universal codes are pointwise universal Weissman, T. unpublished manuscript available upon request and at http://www.stanford.edu/~tsachy/papers.html.