Robert Tibshirani's main interests are in applied statistics, biostatistics, and data mining. He is co-author of the books "Generalized Additive Models" (with Trevor Hastie, Stanford), "An Introduction to the Bootstrap" (with Brad Efron, Stanford), and "The Elements of Statistical Learning" (with Trevor Hastie and Jerry Friedman, Stanford). His current research focuses on problems in biology and genomics, medicine, and industry. With Stanford collaborator Balasubramanian Narasimhan, he also develops software packages for genomics and proteomics.

Administrative Appointments

  • Professor, Department of Biomedical Data Science and Department of Statistics, Stanford University (2015 - Present)
  • Professor, Department of Health Research and Policy and Department of Statistics, Stanford University (1998 - 2015)
  • Professor, Department of Public Health Sciences and Department of Statistics, University of Toronto (1994 - 1998)
  • Associate Professor, Department of Statistics, University of Toronto (1989 - 1994)
  • Associate Professor, Department of Preventive Medicine and Biostatistics, University of Toronto (1989 - 1994)
  • Assistant Professor, Department of Statistics, University of Toronto (1985 - 1989)
  • Assistant Professor, Department of Preventive Medicine and Biostatistics, University of Toronto (1985 - 1989)

Honors & Awards

  • Doctor Honoris Causa, University of Waterloo (2018)
  • Elected Member, National Academy of Sciences (2012)
  • Gold Medal, Statistical Society of Canada (2012)
  • Alumni Achievement Award, University of Waterloo (2006)
  • Fellow, Royal Society of Canada (2001)
  • CRM-SSC Prize in Statistics, Statistical Society of Canada (2000)
  • E.W. Steacie Memorial Fellowship, Natural Sciences and Engineering Research Council of Canada (1997)
  • President's Award, Committee of Presidents of Statistical Societies (1996)
  • Guggenheim Fellowship, J. Guggenheim Foundation (1994)
  • Fellow, Institute of Mathematical Statistics (1993)
  • Fellow, American Statistical Association (1992)

Boards, Advisory Committees, Professional Organizations

  • Associate Editor, Annals of Applied Statistics (2006 - Present)
  • Associate Editor, PLoS Biology (2001 - 2004)
  • Member, Screening Panel, National Science Foundation (1999)
  • Associate Editor, Annals of Statistics (1998 - Present)
  • Associate Editor, Statistical Science (1995 - Present)
  • Chair, Committee on Computerization, Institute of Mathematical Statistics (1995 - Present)
  • Associate Editor, Canadian Journal of Statistics (1995 - 1997)
  • Program Chair, Statistical Computing, American Statistical Association (1995 - 1996)
  • Annual Meeting Program Chair, Statistical Society of Canada (1994)
  • Series Editor, Computing and Graphics Monographs, Chapman & Hall (1994)
  • Council Member, Institute of Mathematical Statistics (1991 - 1994)
  • Member, Statistical Sciences Grant Selection Committee, Natural Sciences and Engineering Research Council of Canada (1989 - 1993)
  • Associate Editor, Canadian Journal of Statistics (1988 - 1991)
  • Associate Editor, Theory and Methods, Journal of the American Statistical Association (1986 - 1995)

Professional Education

  • B.Math., University of Waterloo, Statistics and Computer Science (1979)
  • M.Sc., University of Toronto, Statistics (1980)
  • Ph.D., Stanford University, Statistics (1984)

Current Research and Scholarly Interests

My research is in applied statistics and biostatistics. I specialize in computer-intensive methods for regression and classification, bootstrap, cross-validation and statistical inference, and signal and image analysis for medical diagnosis.

All Publications

  • Shaping of infant B cell receptor repertoires by environmental factors and infectious disease. Science translational medicine Nielsen, S. C., Roskin, K. M., Jackson, K. J., Joshi, S. A., Nejad, P., Lee, J., Wagar, L. E., Pham, T. D., Hoh, R. A., Nguyen, K. D., Tsunemoto, H. Y., Patel, S. B., Tibshirani, R., Ley, C., Davis, M. M., Parsonnet, J., Boyd, S. D. 2019; 11 (481)


    Antigenic exposures at epithelial sites in infancy and early childhood are thought to influence the maturation of humoral immunity and modulate the risk of developing immunoglobulin E (IgE)-mediated allergic disease. How different kinds of environmental exposures influence B cell isotype switching to IgE, IgG, or IgA, and the somatic mutation maturation of these antibody pools, is not fully understood. We sequenced antibody repertoires in longitudinal blood samples in a birth cohort from infancy through the first 3 years of life and found that, whereas IgG and IgA show linear increases in mutational maturation with age, IgM and IgD mutations are more closely tied to pathogen exposure. IgE mutation frequencies are primarily increased in children with impaired skin barrier conditions such as eczema, suggesting that IgE affinity maturation could provide a mechanistic link between epithelial barrier failure and allergy development.

    View details for PubMedID 30814336

  • Reply to J. Wang et al. Journal of clinical oncology : official journal of the American Society of Clinical Oncology Kurtz, D. M., Scherer, F., Jin, M. C., Soo, J., Craig, A. F., Esfahani, M. S., Chabon, J. J., Stehr, H., Liu, C. L., Tibshirani, R., Maeda, L. S., Gupta, N. K., Khodadoust, M. S., Advani, R. H., Newman, A. M., Duhrsen, U., Huttmann, A., Meignan, M., Casasnovas, O., Westin, J. R., Roschewski, M., Wilson, W. H., Gaidano, G., Rossi, D., Diehn, M., Alizadeh, A. A. 2019: JCO1801907

    View details for PubMedID 30753108

  • Proliferation tracing with single-cell mass cytometry optimizes generation of stem cell memory-like T cells. Nature biotechnology Good, Z., Borges, L., Vivanco Gonzalez, N., Sahaf, B., Samusik, N., Tibshirani, R., Nolan, G. P., Bendall, S. C. 2019


    Selective differentiation of naive T cells into multipotent T cells is of great interest clinically for the generation of cell-based cancer immunotherapies. Cellular differentiation depends crucially on division state and time. Here we adapt a dye dilution assay for tracking cell proliferative history through mass cytometry and uncouple division, time and regulatory protein expression in single naive human T cells during their activation and expansion in a complex ex vivo milieu. Using 23 markers, we defined groups of proteins controlled predominantly by division state or time and found that undivided cells account for the majority of phenotypic diversity. We next built a map of cell state changes during naive T-cell expansion. By examining cell signaling on this map, we rationally selected ibrutinib, a BTK and ITK inhibitor, and administered it before T cell activation to direct differentiation toward a T stem cell memory (TSCM)-like phenotype. This method for tracing cell fate across division states and time can be broadly applied for directing cellular differentiation.

    View details for PubMedID 30742126

  • Desensitization rates to peanut protein during OIT among children, adolescents, and adults Long, A. J., Purington, N., Woch, M., O'Laughlin, K., Tan, T., Kost, L., Hijazi, S., Shojinaga, M., Raeber, O., Alvarez, A., Andorf, S., Tibshirani, R., Galli, S. J., Nadeau, K. C., Chinthrajah, R. MOSBY-ELSEVIER. 2019: AB245

  • Multiomics modeling of the immunome, transcriptome, microbiome, proteome and metabolome adaptations during human pregnancy. Bioinformatics (Oxford, England) Ghaemi, M. S., DiGiulio, D. B., Contrepois, K., Callahan, B., Ngo, T. T., Lee-McMullen, B., Lehallier, B., Robaczewska, A., Mcilwain, D., Rosenberg-Hasson, Y., Wong, R. J., Quaintance, C., Culos, A., Stanley, N., Tanada, A., Tsai, A., Gaudilliere, D., Ganio, E., Han, X., Ando, K., McNeil, L., Tingle, M., Wise, P., Maric, I., Sirota, M., Wyss-Coray, T., Winn, V. D., Druzin, M. L., Gibbs, R., Darmstadt, G. L., Lewis, D. B., Partovi Nia, V., Agard, B., Tibshirani, R., Nolan, G., Snyder, M. P., Relman, D. A., Quake, S. R., Shaw, G. M., Stevenson, D. K., Angst, M. S., Gaudilliere, B., Aghaeepour, N. 2019; 35 (1): 95–103


    Motivation: Multiple biological clocks govern a healthy pregnancy. These biological mechanisms produce immunologic, metabolomic, proteomic, genomic and microbiomic adaptations during the course of pregnancy. Modeling the chronology of these adaptations during full-term pregnancy provides the frameworks for future studies examining deviations implicated in pregnancy-related pathologies including preterm birth and preeclampsia. Results: We performed a multiomics analysis of 51 samples from 17 pregnant women, delivering at term. The datasets included measurements from the immunome, transcriptome, microbiome, proteome and metabolome of samples obtained simultaneously from the same patients. Multivariate predictive modeling using the Elastic Net (EN) algorithm was used to measure the ability of each dataset to predict gestational age. Using stacked generalization, these datasets were combined into a single model. This model not only significantly increased predictive power by combining all datasets, but also revealed novel interactions between different biological modalities. Future work includes expansion of the cohort to preterm-enriched populations and in vivo analysis of immune-modulating interventions based on the mechanisms identified. Availability and implementation: Datasets and scripts for reproduction of results are available through: information: Supplementary data are available at Bioinformatics online.

    View details for PubMedID 30561547

  • Found In Translation: a machine learning model for mouse-to-human inference. Nature methods Normand, R., Du, W., Briller, M., Gaujoux, R., Starosvetsky, E., Ziv-Kenet, A., Shalev-Malul, G., Tibshirani, R. J., Shen-Orr, S. S. 2018


    Cross-species differences form barriers to translational research that ultimately hinder the success of clinical trials, yet knowledge of species differences has yet to be systematically incorporated in the interpretation of animal models. Here we present Found In Translation (FIT; ), a statistical methodology that leverages public gene expression data to extrapolate the results of a new mouse experiment to expression changes in the equivalent human condition. We applied FIT to data from mouse models of 28 different human diseases and identified experimental conditions in which FIT predictions outperformed direct cross-species extrapolation from mouse results, increasing the overlap of differentially expressed genes by 20-50%. FIT predicted novel disease-associated genes, an example of which we validated experimentally. FIT highlights signals that may otherwise be missed and reduces false leads, with no experimental cost.

    View details for PubMedID 30478323

  • Analyzing Excess Risk from Matched Designs with Double Controls: Author response. Journal of clinical epidemiology Redelmeier, D., Tibshirani, R. J. 2018

    View details for PubMedID 30453039

  • Log-ratio Lasso: Scalable, Sparse Estimation for Log-ratio Models. Biometrics Bates, S., Tibshirani, R. 2018


    Positive-valued signal data is common in the biological and medical sciences, due to the prevalence of mass spectrometry and other imaging techniques. With such data, only the relative intensities of the raw measurements are meaningful. It is desirable to consider models consisting of the log-ratios of all pairs of the raw features, since log-ratios are the simplest meaningful derived features. In this case, however, the dimensionality of the predictor space becomes large, and computationally efficient estimation procedures are required. In this work, we introduce an embedding of the log-ratio parameter space into a space of much lower dimension and use this representation to develop an efficient penalized fitting procedure. This procedure serves as the foundation for a two-step fitting procedure that combines a convex filtering step with a second non-convex pruning step to yield highly sparse solutions. On a cancer proteomics data set, the proposed method fits a highly sparse model consisting of features of known biological relevance while greatly improving upon the predictive accuracy of less interpretable methods.

    View details for PubMedID 30387139

  • Multicenter Study Using Desorption-Electrospray-Ionization-Mass-Spectrometry Imaging for Breast-Cancer Diagnosis ANALYTICAL CHEMISTRY Porcari, A. M., Zhang, J., Garza, K. Y., Rodrigues-Peres, R. M., Lin, J. Q., Young, J. H., Tibshirani, R., Nagi, C., Paiva, G. R., Carter, S. A., Sarian, L., Eberlin, M. N., Eberlin, L. S. 2018; 90 (19): 11324–32


    The histological and molecular subtypes of breast cancer demand distinct therapeutic approaches. Invasive ductal carcinoma (IDC) is subtyped according to estrogen-receptor (ER), progesterone-receptor (PR), and HER2 status, among other markers. Desorption-electrospray-ionization-mass-spectrometry imaging (DESI-MSI) is an ambient-ionization MS technique that has been previously used to diagnose IDC. Aiming to investigate the robustness of ambient-ionization MS for IDC diagnosis and subtyping over diverse patient populations and interlaboratory use, we report a multicenter study using DESI-MSI to analyze samples from 103 patients independently analyzed in the United States and Brazil. The lipid profiles of IDC and normal breast tissues were consistent across different patient races and were unrelated to country of sample collection. Similar experimental parameters used in both laboratories yielded consistent mass-spectral data in mass-to-charge ratios (m/z) above 700, where complex lipids are observed. Statistical classifiers built using data acquired in the United States yielded 97.6% sensitivity, 96.7% specificity, and 97.6% accuracy for cancer diagnosis. Equivalent performance was observed for the intralaboratory validation set (99.2% accuracy) and, most remarkably, for the interlaboratory validation set independently acquired in Brazil (95.3% accuracy). Separate classification models built for ER and PR statuses as well as the status of their combined hormone receptor (HR) provided predictive accuracies (>89.0%), although low classification accuracies were achieved for HER2 status. Altogether, our multicenter study demonstrates that DESI-MSI is a robust and reproducible technology for rapid breast-cancer-tissue diagnosis and therefore is of value for clinical use.

    View details for PubMedID 30170496

  • Circulating Tumor DNA Measurements As Early Outcome Predictors in Diffuse Large B-Cell Lymphoma. Journal of clinical oncology : official journal of the American Society of Clinical Oncology Kurtz, D. M., Scherer, F., Jin, M. C., Soo, J., Craig, A. F., Esfahani, M. S., Chabon, J. J., Stehr, H., Liu, C. L., Tibshirani, R., Maeda, L. S., Gupta, N. K., Khodadoust, M. S., Advani, R. H., Levy, R., Newman, A. M., Duhrsen, U., Huttmann, A., Meignan, M., Casasnovas, R., Westin, J. R., Roschewski, M., Wilson, W. H., Gaidano, G., Rossi, D., Diehn, M., Alizadeh, A. A. 2018: JCO2018785246


    Purpose Outcomes for patients with diffuse large B-cell lymphoma remain heterogeneous, with existing methods failing to consistently predict treatment failure. We examined the additional prognostic value of circulating tumor DNA (ctDNA) before and during therapy for predicting patient outcomes. Patients and Methods We studied the dynamics of ctDNA from 217 patients treated at six centers, using a training and validation framework. We densely characterized early ctDNA dynamics during therapy using cancer personalized profiling by deep sequencing to define response-associated thresholds within a discovery set. These thresholds were assessed in two independent validation sets. Finally, we assessed the prognostic value of ctDNA in the context of established risk factors, including the International Prognostic Index and interim positron emission tomography/computed tomography scans. Results Before therapy, ctDNA was detectable in 98% of patients; pretreatment levels were prognostic in both front-line and salvage settings. In the discovery set, ctDNA levels changed rapidly, with a 2-log decrease after one cycle (early molecular response [EMR]) and a 2.5-log decrease after two cycles (major molecular response [MMR]) stratifying outcomes. In the first validation set, patients receiving front-line therapy achieving EMR or MMR had superior outcomes at 24 months (EMR: EFS, 83% v 50%; P = .0015; MMR: EFS, 82% v 46%; P < .001). EMR also predicted superior 24-month outcomes in patients receiving salvage therapy in the first validation set (EFS, 100% v 13%; P = .011). The prognostic value of EMR and MMR was further confirmed in the second validation set. In multivariable analyses including International Prognostic Index and interim positron emission tomography/computed tomography scans across both cohorts, molecular response was independently prognostic of outcomes, including event-free and overall survival. 
    Conclusion Pretreatment ctDNA levels and molecular responses are independently prognostic of outcomes in aggressive lymphomas. These risk factors could potentially guide future personalized risk-directed approaches.

    View details for PubMedID 30125215

  • SUPERVISED LEARNING VIA THE "HUBNET" PROCEDURE STATISTICA SINICA Guan, L., Fan, Z., Tibshirani, R. 2018; 28 (3): 1225–43

  • Pharmacogenetics and progression to neovascular age-related macular degeneration-Evidence supporting practice change REPLY PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Vavvas, D. G., Small, K. W., Awh, C., Zanke, B. W., Tibshirani, R. J., Kustra, R. 2018; 115 (25): E5640–E5641

    View details for PubMedID 29880713

  • Noninvasive blood tests for fetal development predict gestational age and preterm delivery SCIENCE Ngo, T. M., Moufarrej, M. N., Rasmussen, M. H., Camunas-Soler, J., Pan, W., Okamoto, J., Neff, N. F., Liu, K., Wong, R. J., Downes, K., Tibshirani, R., Shaw, G. M., Skotte, L., Stevenson, D. K., Biggio, J. R., Elovitz, M. A., Melbye, M., Quake, S. R. 2018; 360 (6393): 1133–36


    Noninvasive blood tests that provide information about fetal development and gestational age could potentially improve prenatal care. Ultrasound, the current gold standard, is not always affordable in low-resource settings and does not predict spontaneous preterm birth, a leading cause of infant death. In a pilot study of 31 healthy pregnant women, we found that measurement of nine cell-free RNA (cfRNA) transcripts in maternal blood predicted gestational age with comparable accuracy to ultrasound but at substantially lower cost. In a related study of 38 women (23 full-term and 15 preterm deliveries), all at elevated risk of delivering preterm, we identified seven cfRNA transcripts that accurately classified women who delivered preterm up to 2 months in advance of labor. These tests hold promise for prenatal care in both the developed and developing worlds, although they require validation in larger, blinded clinical trials.

    View details for PubMedID 29880692

  • Methods for analyzing matched designs with double controls: excess risk is easily estimated and misinterpreted when evaluating traffic deaths JOURNAL OF CLINICAL EPIDEMIOLOGY Redelmeier, D. A., Tibshirani, R. J. 2018; 98: 117–22


    To demonstrate analytic approaches for matched studies where two controls are linked to each case and events are accumulating counts rather than binary outcomes. A secondary intent is to clarify the distinction between total risk and excess risk (unmatched vs. matched perspectives). We review past research testing whether elections can lead to increased traffic risks. The results are reinterpreted by analyzing both the total count of individuals in fatal crashes and the excess count of individuals in fatal crashes, each time accounting for the matched double controls. Overall, 1,546 individuals were in fatal crashes on the 10 election days (average = 155/d), and 2,593 individuals were in fatal crashes on the 20 control days (average = 130/d). Poisson regression of total counts yielded a relative risk of 1.19 (95% confidence interval: 1.12-1.27). Poisson regression of excess counts yielded a relative risk of 3.22 (95% confidence interval: 2.72-3.80). The discrepancy between analyses of total counts and excess counts replicated with alternative statistical models and was visualized in graphical displays. Available approaches provide methods for analyzing count data in matched designs with double controls and help clarify the distinction between increases in total risk and increases in excess risk.

    View details for PubMedID 29452220

  • Some methods for heterogeneous treatment effect estimation in high dimensions STATISTICS IN MEDICINE Powers, S., Qian, J., Jung, K., Schuler, A., Shah, N. H., Hastie, T., Tibshirani, R. 2018; 37 (11): 1767–87


    When devising a course of treatment for a patient, doctors often have little quantitative evidence on which to base their decisions, beyond their medical education and published clinical trials. Stanford Health Care alone has millions of electronic medical records that are only just recently being leveraged to inform better treatment recommendations. These data present a unique challenge because they are high dimensional and observational. Our goal is to make personalized treatment recommendations based on the outcomes for past patients similar to a new patient. We propose and analyze 3 methods for estimating heterogeneous treatment effects using observational data. Our methods perform well in simulations using a wide variety of treatment effect functions, and we present results of applying the 2 most promising methods to data from The SPRINT Data Analysis Challenge, from a large randomized trial of a treatment for high blood pressure.

    View details for PubMedID 29508417

  • Single-cell developmental classification of B cell precursor acute lymphoblastic leukemia at diagnosis reveals predictors of relapse. Nature medicine Good, Z., Sarno, J., Jager, A., Samusik, N., Aghaeepour, N., Simonds, E. F., White, L., Lacayo, N. J., Fantl, W. J., Fazio, G., Gaipa, G., Biondi, A., Tibshirani, R., Bendall, S. C., Nolan, G. P., Davis, K. L. 2018; 24 (4): 474–83


    Insight into the cancer cell populations that are responsible for relapsed disease is needed to improve outcomes. Here we report a single-cell-based study of B cell precursor acute lymphoblastic leukemia at diagnosis that reveals hidden developmentally dependent cell signaling states that are uniquely associated with relapse. By using mass cytometry we simultaneously quantified 35 proteins involved in B cell development in 60 primary diagnostic samples. Each leukemia cell was then matched to its nearest healthy B cell population by a developmental classifier that operated at the single-cell level. Machine learning identified six features of expanded leukemic populations that were sufficient to predict patient relapse at diagnosis. These features implicated the pro-BII subpopulation of B cells with activated mTOR signaling, and the pre-BI subpopulation of B cells with activated and unresponsive pre-B cell receptor signaling, to be associated with relapse. This model, termed 'developmentally dependent predictor of relapse' (DDPR), significantly improves currently established risk stratification methods. DDPR features exist at diagnosis and persist at relapse. By leveraging a data-driven approach, we demonstrate the predictive value of single-cell 'omics' for patient stratification in a translational setting and provide a framework for its application to human cancer.

    View details for PubMedID 29505032

  • Post-selection inference for ℓ1-penalized likelihood models CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE Taylor, J., Tibshirani, R. 2018; 46 (1): 41–61

    View details for DOI 10.1002/cjs.11313

    View details for Web of Science ID 000425130100004

  • CFH and ARMS2 genetic risk determines progression to neovascular age-related macular degeneration after antioxidant and zinc supplementation PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Vavvas, D. G., Small, K. W., Awh, C. C., Zanke, B. W., Tibshirani, R. J., Kustra, R. 2018; 115 (4): E696–E704


    We evaluated the influence of an antioxidant and zinc nutritional supplement [the Age-Related Eye Disease Study (AREDS) formulation] on delaying or preventing progression to neovascular AMD (NV) in persons with age-related macular degeneration (AMD). AREDS subjects (n = 802) with category 3 or 4 AMD at baseline who had been treated with placebo or the AREDS formulation were evaluated for differences in the risk of progression to NV as a function of complement factor H (CFH) and age-related maculopathy susceptibility 2 (ARMS2) genotype groups. We used published genetic grouping: a two-SNP haplotype risk-calling algorithm to assess CFH, and either the single SNP rs10490924 or 372_815del443ins54 to mark ARMS2 risk. Progression risk was determined using the Cox proportional hazard model. Genetics-treatment interaction on NV risk was assessed using a multiiterative bootstrap validation analysis. We identified strong interaction of genetics with AREDS formulation treatment on the development of NV. Individuals with high CFH and no ARMS2 risk alleles and taking the AREDS formulation had increased progression to NV compared with placebo. Those with low CFH risk and high ARMS2 risk had decreased progression risk. Analysis of CFH and ARMS2 genotype groups from a validation dataset reinforces this conclusion. Bootstrapping analysis confirms the presence of a genetics-treatment interaction and suggests that individual treatment response to the AREDS formulation is largely determined by genetics. The AREDS formulation modifies the risk of progression to NV based on individual genetics. Its use should be based on patient-specific genotype.

    View details for PubMedID 29311295

  • Genomic feature selection by coverage design optimization Journal of Applied Statistics Reid, S., Newman, A. M., Diehn, M., Alizadeh, A. A., Tibshirani, R. 2018

  • Distinguishing malignant from benign microscopic skin lesions using desorption electrospray ionization mass spectrometry imaging. Proceedings of the National Academy of Sciences of the United States of America Margulis, K., Chiou, A. S., Aasi, S. Z., Tibshirani, R. J., Tang, J. Y., Zare, R. N. 2018


    Detection of microscopic skin lesions presents a considerable challenge in diagnosing early-stage malignancies as well as in residual tumor interrogation after surgical intervention. In this study, we established the capability of desorption electrospray ionization mass spectrometry imaging (DESI-MSI) to distinguish between micrometer-sized tumor aggregates of basal cell carcinoma (BCC), a common skin cancer, and normal human skin. We analyzed 86 human specimens collected during Mohs micrographic surgery for BCC to cross-examine spatial distributions of numerous lipids and metabolites in BCC aggregates versus adjacent skin. Statistical analysis using the least absolute shrinkage and selection operation (Lasso) was employed to categorize each 200-µm-diameter picture element (pixel) of investigated skin tissue map as BCC or normal. Lasso identified 24 molecular ion signals, which are significant for pixel classification. These ion signals included lipids observed at m/z 200-1,200 and Krebs cycle metabolites observed at m/z < 200. Based on these features, Lasso yielded an overall 94.1% diagnostic accuracy pixel by pixel of the skin map compared with histopathological evaluation. We suggest that DESI-MSI/Lasso analysis can be employed as a complementary technique for delineation of microscopic skin tumors.

    View details for PubMedID 29866838

  • Food allergy and omics. The Journal of allergy and clinical immunology Dhondalay, G. K., Rael, E., Acharya, S., Zhang, W., Sampath, V., Galli, S. J., Tibshirani, R., Boyd, S. D., Maecker, H., Nadeau, K. C., Andorf, S. 2018; 141 (1): 20–29


    Food allergy (FA) prevalence has been increasing over the last few decades and is now a global health concern. Current diagnostic methods for FA result in a high number of false-positive results, and the standard of care is either allergen avoidance or use of epinephrine on accidental exposure, although currently with no other approved treatments. The increasing prevalence of FA, lack of robust biomarkers, and inadequate treatments warrants further research into the mechanism underlying food allergies. Recent technological advances have made it possible to move beyond traditional biological techniques to more sophisticated high-throughput approaches. These technologies have created the burgeoning field of omics sciences, which permit a more systematic investigation of biological problems. Omics sciences, such as genomics, epigenomics, transcriptomics, proteomics, metabolomics, microbiomics, and exposomics, have enabled the construction of regulatory networks and biological pathway models. Parallel advances in bioinformatics and computational techniques have enabled the integration, analysis, and interpretation of these exponentially growing data sets and opens the possibility of personalized or precision medicine for FA.

    View details for PubMedID 29307411

  • DRUG-NEM: Optimizing drug combinations using single-cell perturbation response to account for intratumoral heterogeneity. Proceedings of the National Academy of Sciences of the United States of America Anchang, B., Davis, K. L., Fienberg, H. G., Williamson, B. D., Bendall, S. C., Karacosta, L. G., Tibshirani, R., Nolan, G. P., Plevritis, S. K. 2018; 115 (18): E4294–E4303


    An individual malignant tumor is composed of a heterogeneous collection of single cells with distinct molecular and phenotypic features, a phenomenon termed intratumoral heterogeneity. Intratumoral heterogeneity poses challenges for cancer treatment, motivating the need for combination therapies. Single-cell technologies are now available to guide effective drug combinations by accounting for intratumoral heterogeneity through the analysis of the signaling perturbations of an individual tumor sample screened by a drug panel. In particular, Mass Cytometry Time-of-Flight (CyTOF) is a high-throughput single-cell technology that enables the simultaneous measurements of multiple ([Formula: see text]40) intracellular and surface markers at the level of single cells for hundreds of thousands of cells in a sample. We developed a computational framework, entitled Drug Nested Effects Models (DRUG-NEM), to analyze CyTOF single-drug perturbation data for the purpose of individualizing drug combinations. DRUG-NEM optimizes drug combinations by choosing the minimum number of drugs that produce the maximal desired intracellular effects based on nested effects modeling. We demonstrate the performance of DRUG-NEM using single-cell drug perturbation data from tumor cell lines and primary leukemia samples.

    View details for PubMedID 29654148

  • A General Framework for Estimation and Inference From Clusters of Features JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Reid, S., Taylor, J., Tibshirani, R. 2018; 113 (521): 280–93

    View details for DOI 10.1214/16-AOS1536

    View details for Web of Science ID 000418371600011

  • KLHL6 Is Preferentially Expressed in Germinal Center-Derived B-Cell Lymphomas AMERICAN JOURNAL OF CLINICAL PATHOLOGY Kunder, C. A., Roncador, G., Advani, R. H., Gualco, G., Bacchi, C. E., Sabile, J. M., Lossos, I. S., Nie, K., Tibshirani, R., Green, M. R., Alizadeh, A. A., Natkunam, Y. 2017; 148 (6): 465–76


    KLHL6 is a recently described BTB-Kelch protein with selective expression in lymphoid tissues and is most strongly expressed in germinal center B cells. Using gene expression profiling as well as immunohistochemistry with an anti-KLHL6 monoclonal antibody, we have characterized the expression of this molecule in normal and neoplastic tissues. Protein expression was evaluated in 1,058 hematopoietic neoplasms. Consistent with its discovery as a germinal center marker, KLHL6 was positive mainly in B-cell neoplasms of germinal center derivation, including 95% of follicular lymphomas (106/112). B-cell lymphomas of non-germinal center derivation were generally negative (0/33 chronic lymphocytic leukemias/small lymphocytic lymphomas, 3/49 marginal zone lymphomas, and 2/66 mantle cell lymphomas). In addition to other germinal center markers, including BCL6, CD10, HGAL, and LMO2, KLHL6 immunohistochemistry may prove a useful adjunct in the diagnosis and future classification of B-cell lymphomas.

    View details for PubMedID 29140403

  • Big data modeling to predict platelet usage and minimize wastage in a tertiary care system PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Guan, L., Tian, X., Gombar, S., Zemek, A. J., Krishnan, G., Scott, R., Narasimhan, B., Tibshirani, R. J., Pham, T. D. 2017; 114 (43): 11368–73


    Maintaining a robust blood product supply is an essential requirement to guarantee optimal patient care in modern health care systems. However, daily blood product use is difficult to anticipate. Platelet products are the most variable in daily usage, have short shelf lives, and are also the most expensive to produce, test, and store. Due to the combination of absolute need, uncertain daily demand, and short shelf life, platelet products are frequently wasted due to expiration. Our aim is to build and validate a statistical model to forecast future platelet demand and thereby reduce wastage. We have investigated platelet usage patterns at our institution, and specifically interrogated the relationship between platelet usage and aggregated hospital-wide patient data over a recent consecutive 29-mo period. Using a convex statistical formulation, we have found that platelet usage is highly dependent on weekday/weekend pattern, number of patients with various abnormal complete blood count measurements, and location-specific hospital census data. We incorporated these relationships in a mathematical model to guide collection and ordering strategy. This model minimizes waste due to expiration while avoiding shortages; the number of remaining platelet units at the end of any day stays above 10 in our model during the same period. Compared with historical expiration rates during the same period, our model reduces the expiration rate from 10.5 to 3.2%. Extrapolating our results to the ∼2 million units of platelets transfused annually within the United States, if implemented successfully, our model can potentially save ∼80 million dollars in health care costs.

    View details for PubMedID 29073058

  • Post-selection point and interval estimation of signal sizes in Gaussian samples CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE Reid, S., Taylor, J., Tibshirani, R. 2017; 45 (2): 128-148

    View details for DOI 10.1002/cjs.11320

    View details for Web of Science ID 000400027400001

  • Chemical Space Mimicry for Drug Discovery JOURNAL OF CHEMICAL INFORMATION AND MODELING Yuan, W., Jiang, D., Nambiar, D. K., Liew, L. P., Hay, M. P., Bloomstein, J., Lu, P., Turner, B., Le, Q., Tibshirani, R., Khatri, P., Moloney, M. G., Koong, A. C. 2017; 57 (4): 875-882


    We describe a new library generation method, Machine-based Identification of Molecules Inside Characterized Space (MIMICS), that generates sets of molecules inspired by a text-based input. MIMICS-generated libraries were found to preserve distributions of properties while simultaneously increasing structural diversity. Newly identified MIMICS-generated compounds were found to be bioactive as inhibitors of specific components of the unfolded protein response (UPR) and the VEGFR2 pathway in cell-based assays, thus confirming the applicability of this methodology toward drug design applications. Wider application of MIMICS could facilitate the efficient utilization of chemical space.

    View details for DOI 10.1021/acs.jcim.6b00754

    View details for PubMedID 28257191

  • Diagnosis of prostate cancer by desorption electrospray ionization mass spectrometric imaging of small metabolites and lipids PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Banerjee, S., Zare, R. N., Tibshirani, R. J., Kunder, C. A., Nolley, R., Fan, R., Brooks, J. D., Sonn, G. A. 2017; 114 (13): 3334-3339


    Accurate identification of prostate cancer in frozen sections at the time of surgery can be challenging, limiting the surgeon's ability to best determine resection margins during prostatectomy. We performed desorption electrospray ionization mass spectrometry imaging (DESI-MSI) on 54 banked human cancerous and normal prostate tissue specimens to investigate the spatial distribution of a wide variety of small metabolites, carbohydrates, and lipids. In contrast to several previous studies, our method included Krebs cycle intermediates (m/z <200), which we found to be highly informative in distinguishing cancer from benign tissue. Malignant prostate cells showed marked metabolic derangements compared with their benign counterparts. Using the "Least absolute shrinkage and selection operator" (Lasso), we analyzed all metabolites from the DESI-MS data and identified parsimonious sets of metabolic profiles for distinguishing between cancer and normal tissue. In an independent set of samples, we could use these models to classify prostate cancer from benign specimens with nearly 90% accuracy per patient. Based on previous work in prostate cancer showing that glucose levels are high while citrate is low, we found that measurement of the glucose/citrate ion signal ratio accurately predicted cancer when this ratio exceeds 1.0 and normal prostate when the ratio is less than 0.5. After brief tissue preparation, the glucose/citrate ratio can be recorded on a tissue sample in 1 min or less, which is in sharp contrast to the 20 min or more required by histopathological examination of frozen tissue specimens.

    View details for DOI 10.1073/pnas.1700677114

    View details for PubMedID 28292895

  • Landscape of monoallelic DNA accessibility in mouse embryonic stem cells and neural progenitor cells. Nature genetics Xu, J., Carter, A. C., Gendrel, A., Attia, M., Loftus, J., Greenleaf, W. J., Tibshirani, R., Heard, E., Chang, H. Y. 2017; 49 (3): 377-386


    We developed an allele-specific assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) to genotype and profile active regulatory DNA across the genome. Using a mouse hybrid F1 system, we found that monoallelic DNA accessibility across autosomes was pervasive, developmentally programmed and composed of several patterns. Genetically determined accessibility was enriched at distal enhancers, but random monoallelically accessible (RAMA) elements were enriched at promoters and may act as gatekeepers of monoallelic mRNA expression. Allelic choice at RAMA elements was stable across cell generations and bookmarked through mitosis. RAMA elements in neural progenitor cells were biallelically accessible in embryonic stem cells but premarked with bivalent histone modifications; one allele was silenced during differentiation. Quantitative analysis indicated that allelic choice at the majority of RAMA elements is consistent with a stochastic process; however, up to 30% of RAMA elements may deviate from the expected pattern, suggesting a regulated or counting mechanism.

    View details for DOI 10.1038/ng.3769

    View details for PubMedID 28112738

  • Long-term course of patients with primary ocular adnexal MALT lymphoma: a large single-institution cohort study BLOOD Desai, A., Joag, M. G., Lekakis, L., Chapman, J. R., Vega, F., Tibshirani, R., Tse, D., Markoe, A., Lossos, I. S. 2017; 129 (3): 324-332


    While Primary Ocular Adnexal MALT Lymphoma (POAML) is the most common orbital tumor, there are large gaps in knowledge of its natural history. We conducted a retrospective analysis of the largest reported cohort, consisting of 182 patients with POAML, diagnosed or treated at our institution to analyze long-term outcome, response to treatment, incidence and localization of relapse and transformation. The majority of patients (80%) presented with stage I disease. Overall, 84% of treated patients achieved a complete response after first-line therapy. In patients with stage I disease treated with radiation therapy (RT), doses ≥ 30.6Gy were associated with significantly better complete response rate (p=0.04) and progression free survival (PFS) at 5 and 10-year (p<0.0001). Median overall survival and PFS for all patients were 250 months (95% CI: 222 - upper limit not reached) and 134 months (95% CI: 87 - 198), respectively. Kaplan-Meier estimates for the PFS at 1, 5, and 10 years were 91.5% (95% CI: 86.1% - 94.9%), 68.5% (95% CI: 60.4% - 75.6%), and 50.9% (95% CI: 40.5% - 61.6%), respectively. In univariate analysis, age > 60 years, radiation dose, bilateral ocular involvement at presentation and advanced stage were significantly correlated with shorter PFS (p=0.006, p=0.0001, p=0.002 and p=0.0001, respectively). Multivariate analysis showed that age >60 years (HR= 2.44) and RT<30.6Gy (HR=4.17) were the only factors correlated with shorter PFS (p=0.01 and p=0.0003, respectively). We demonstrate that POAMLs harbor a persistent and ongoing risk for relapses, including in central nervous system, and transformation to aggressive lymphoma (4%), requiring long-term follow up.

    View details for DOI 10.1182/blood-2016-05-714584

    View details for Web of Science ID 000396529800010

  • An immune clock of human pregnancy. Science immunology Aghaeepour, N., Ganio, E. A., Mcilwain, D., Tsai, A. S., Tingle, M., Van Gassen, S., Gaudilliere, D. K., Baca, Q., McNeil, L., Okada, R., Ghaemi, M. S., Furman, D., Wong, R. J., Winn, V. D., Druzin, M. L., El-Sayed, Y. Y., Quaintance, C., Gibbs, R., Darmstadt, G. L., Shaw, G. M., Stevenson, D. K., Tibshirani, R., Nolan, G. P., Lewis, D. B., Angst, M. S., Gaudilliere, B. 2017; 2 (15)


    The maintenance of pregnancy relies on finely tuned immune adaptations. We demonstrate that these adaptations are precisely timed, reflecting an immune clock of pregnancy in women delivering at term. Using mass cytometry, the abundance and functional responses of all major immune cell subsets were quantified in serial blood samples collected throughout pregnancy. Cell signaling-based Elastic Net, a regularized regression method adapted from the elastic net algorithm, was developed to infer and prospectively validate a predictive model of interrelated immune events that accurately captures the chronology of pregnancy. Model components highlighted existing knowledge and revealed previously unreported biology, including a critical role for the interleukin-2-dependent STAT5ab signaling pathway in modulating T cell function during pregnancy. These findings unravel the precise timing of immunological events occurring during a term pregnancy and provide the analytical framework to identify immunological deviations implicated in pregnancy-related pathologies.

    View details for PubMedID 28864494

  • An Ordered Lasso and Sparse Time-Lagged Regression TECHNOMETRICS Tibshirani, R., Suo, X. 2016; 58 (4): 415-423
  • High-dimensional regression adjustments in randomized experiments. Proceedings of the National Academy of Sciences of the United States of America Wager, S., Du, W., Taylor, J., Tibshirani, R. J. 2016


    We study the problem of treatment effect estimation in randomized experiments with high-dimensional covariate information and show that essentially any risk-consistent regression adjustment can be used to obtain efficient estimates of the average treatment effect. Our results considerably extend the range of settings where high-dimensional regression adjustments are guaranteed to provide valid inference about the population average treatment effect. We then propose cross-estimation, a simple method for obtaining finite-sample-unbiased treatment effect estimates that leverages high-dimensional regression adjustments. Our method can be used when the regression model is estimated using the lasso, the elastic net, subset selection, etc. Finally, we extend our analysis to allow for adaptive specification search via cross-validation and flexible nonparametric regression adjustments with machine-learning methods such as random forests or neural networks.

    View details for PubMedID 27791165

  • Cardiolipins Are Biomarkers of Mitochondria-Rich Thyroid Oncocytic Tumors. Cancer research Zhang, J., Yu, W., Ryu, S. W., Lin, J., Buentello, G., Tibshirani, R., Suliburk, J., Eberlin, L. S. 2016


    Oncocytic tumors are characterized by an excessive eosinophilic, granular cytoplasm due to aberrant accumulation of mitochondria. Mutations in mitochondrial DNA occur in oncocytic thyroid tumors, but there is no information about their lipid composition, which might reveal candidate theranostic molecules. Here, we used desorption electrospray ionization mass spectrometry (DESI-MS) to image and chemically characterize the lipid composition of oncocytic thyroid tumors, as compared with nononcocytic thyroid tumors and normal thyroid samples. We identified a novel molecular signature of oncocytic tumors characterized by an abnormally high abundance and chemical diversity of cardiolipins (CL), including many oxidized species. DESI-MS imaging and IHC experiments confirmed that the spatial distribution of CLs overlapped with regions of accumulation of mitochondria-rich oncocytic cells. Fluorescent imaging and mitochondrial isolation showed that both mitochondrial accumulation and alteration in CL composition of mitochondria occurred in oncocytic tumor cells, thus contributing to the aberrant molecular signatures detected. A total of 219 molecular ions, including CLs, other glycerophospholipids, fatty acids, and metabolites, were found at increased or decreased abundance in oncocytic, nononcocytic, or normal thyroid tissues. Our findings suggest new candidate targets for clinical and therapeutic use against oncocytic tumors. Cancer Res; 76(22); 1-10. ©2016 AACR.

    View details for PubMedID 27659048

  • Data Shared Lasso: A novel tool to discover uplift COMPUTATIONAL STATISTICS & DATA ANALYSIS Gross, S. M., Tibshirani, R. 2016; 101: 226-235
  • Pancreatic Cancer Surgical Resection Margins: Molecular Assessment by Mass Spectrometry Imaging. PLoS medicine Eberlin, L. S., Margulis, K., Planell-Mendez, I., Zare, R. N., Tibshirani, R., Longacre, T. A., Jalali, M., Norton, J. A., Poultsides, G. A. 2016; 13 (8)


    Surgical resection with microscopically negative margins remains the main curative option for pancreatic cancer; however, in practice intraoperative delineation of resection margins is challenging. Ambient mass spectrometry imaging has emerged as a powerful technique for chemical imaging and real-time diagnosis of tissue samples. We applied an approach combining desorption electrospray ionization mass spectrometry imaging (DESI-MSI) with the least absolute shrinkage and selection operator (Lasso) statistical method to diagnose pancreatic tissue sections and prospectively evaluate surgical resection margins from pancreatic cancer surgery. Our methodology was developed and tested using 63 banked pancreatic cancer samples and 65 samples (tumor and specimen margins) collected prospectively during 32 pancreatectomies from February 27, 2013, to January 16, 2015. In total, mass spectra for 254,235 individual pixels were evaluated. When cross-validation was employed in the training set of samples, 98.1% agreement with histopathology was obtained. Using an independent set of samples, 98.6% agreement was achieved. We used a statistical approach to evaluate 177,727 mass spectra from samples with complex, mixed histology, achieving an agreement of 81%. The developed method showed agreement with frozen section evaluation of specimen margins in 24 of 32 surgical cases prospectively evaluated. In the remaining eight patients, margins were found to be positive by DESI-MSI/Lasso, but negative by frozen section analysis. The median overall survival after resection was only 10 mo for these eight patients as opposed to 26 mo for patients with negative margins by both techniques. This observation suggests that our method (as opposed to the standard method to date) was able to detect tumor involvement at the margin in patients who developed early recurrence. Nonetheless, a larger cohort of samples is needed to validate the findings described in this study. Careful evaluation of the long-term benefits to patients of the use of DESI-MSI for surgical margin evaluation is also needed to determine its value in clinical practice. Our findings provide evidence that the molecular information obtained by DESI-MSI/Lasso from pancreatic tissue samples has the potential to transform the evaluation of surgical specimens. With further development, we believe the described methodology could be routinely used for intraoperative surgical margin assessment of pancreatic cancer.

    View details for DOI 10.1371/journal.pmed.1002108

    View details for PubMedID 27575375

  • Pathophysiological significance and therapeutic targeting of germinal center kinase in diffuse large B-cell lymphoma. Blood Matthews, J. M., Bhatt, S., Patricelli, M. P., Nomanbhoy, T. K., Jiang, X., Natkunam, Y., Gentles, A. J., Martinez, E., Zhu, D., Chapman, J. R., Cortizas, E., Shyam, R., Chinichian, S., Advani, R., Tan, L., Zhang, J., Choi, H. G., Tibshirani, R., Buhrlage, S. J., Gratzinger, D., Verdun, R., Gray, N. S., Lossos, I. S. 2016; 128 (2): 239-248


    Diffuse large B-cell lymphoma (DLBCL) is the most common subtype of non-Hodgkin lymphoma (NHL), yet 40-50% of patients will eventually succumb to their disease, demonstrating a pressing need for novel therapeutic options. Gene expression profiling has identified messenger RNAs that lead to transformation, but critical events transforming cells are normally executed by kinases. Therefore, we hypothesized that previously unrecognized kinases may contribute to DLBCL pathogenesis. We performed the first comprehensive analysis of global kinase activity in DLBCL to identify novel therapeutic targets, and discovered that Germinal Center Kinase (GCK) was extensively activated. GCK RNA interference and small molecule inhibition induced cell cycle arrest and apoptosis in DLBCL cell lines and primary tumors in vitro and decreased the tumor growth rate in vivo, resulting in a significantly extended lifespan of mice bearing DLBCL xenografts. GCK expression was also linked to adverse clinical outcome in a cohort of 151 primary DLBCL patients. These studies demonstrate, for the first time, that GCK is a molecular therapeutic target in DLBCL tumors and that inhibiting GCK may significantly extend DLBCL patient survival. Since the majority of DLBCL tumors (~80%) exhibit activation of GCK, this therapy may be applicable to most patients.

    View details for DOI 10.1182/blood-2016-02-696856

    View details for PubMedID 27151888

  • Exact Post-Selection Inference for Sequential Regression Procedures JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Tibshirani, R. J., Taylor, J., Lockhart, R., Tibshirani, R. 2016; 111 (514): 600-614
  • INFERENCE IN ADAPTIVE REGRESSION VIA THE KAC-RICE FORMULA ANNALS OF STATISTICS Taylor, J. E., Loftus, J. R., Tibshirani, R. J. 2016; 44 (2): 743-770

    View details for DOI 10.1214/15-AOS1386

    View details for Web of Science ID 000372594300011

  • Sparse regression and marginal testing using cluster prototypes. Biostatistics Reid, S., Tibshirani, R. 2016; 17 (2): 364-376


    We propose a new approach for sparse regression and marginal testing, for data with correlated features. Our procedure first clusters the features, and then chooses as the cluster prototype the most informative feature in that cluster. Then we apply either sparse regression (lasso) or marginal significance testing to these prototypes. While this kind of strategy is not entirely new, a key feature of our proposal is its use of the post-selection inference theory of Taylor and others (2014, Exact post-selection inference for forward stepwise and least angle regression, Preprint, arXiv:1401.3889) and Lee and others (2014, Exact post-selection inference with the lasso, Preprint, arXiv:1311.6238v5) to compute exact p-values and confidence intervals that properly account for the selection of prototypes. We also apply the recent "knockoff" idea of Barber and Candès (2014, Controlling the false discovery rate via knockoffs, Preprint, arXiv:1404.5609) to provide exact finite sample control of the FDR of our regression procedure. We illustrate our proposals on both real and simulated data.

    View details for DOI 10.1093/biostatistics/kxv049

    View details for PubMedID 26614384

  • Sequential selection procedures and false discovery rate control JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY G'Sell, M. G., Wager, S., Chouldechova, A., Tibshirani, R. 2016; 78 (2): 423-444

    View details for DOI 10.1111/rssb.12122

    View details for Web of Science ID 000369136600005

  • Successful immunotherapy induces previously unidentified allergen-specific CD4+ T-cell subsets. Proceedings of the National Academy of Sciences of the United States of America Ryan, J. F., Hovde, R., Glanville, J., Lyu, S., Ji, X., Gupta, S., Tibshirani, R. J., Jay, D. C., Boyd, S. D., Chinthrajah, R. S., Davis, M. M., Galli, S. J., Maecker, H. T., Nadeau, K. C. 2016; 113 (9): E1286-95


    Allergen immunotherapy can desensitize even subjects with potentially lethal allergies, but the changes induced in T cells that underpin successful immunotherapy remain poorly understood. In a cohort of peanut-allergic participants, we used allergen-specific T-cell sorting and single-cell gene expression to trace the transcriptional "roadmap" of individual CD4+ T cells throughout immunotherapy. We found that successful immunotherapy induces allergen-specific CD4+ T cells to expand and shift toward an "anergic" Th2 T-cell phenotype largely absent in both pretreatment participants and healthy controls. These findings show that sustained success, even after immunotherapy is withdrawn, is associated with the induction, expansion, and maintenance of immunotherapy-specific memory and naive T-cell phenotypes as early as 3 mo into immunotherapy. These results suggest an approach for immune monitoring participants undergoing immunotherapy to predict the success of future treatment and could have implications for immunotherapy targets in other diseases like cancer, autoimmune disease, and transplantation.

    View details for DOI 10.1073/pnas.1520180113

    View details for PubMedID 26811452


    View details for DOI 10.1214/15-AOAS866

    View details for Web of Science ID 000370445600001

  • A Permutation Approach to Testing Interactions for Binary Response by Comparing Correlations Between Classes JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Simon, N., Tibshirani, R. 2015; 110 (512): 1707-1716
  • A component lasso CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE Hussami, N., Tibshirani, R. J. 2015; 43 (4): 624-646

    View details for DOI 10.1002/cjs.11267

    View details for Web of Science ID 000367667700008

  • The Radiogenomic Risk Score: Construction of a Prognostic Quantitative, Noninvasive Image-based Molecular Assay for Renal Cell Carcinoma RADIOLOGY Jamshidi, N., Jonasch, E., Zapala, M., Korn, R. L., Aganovic, L., Zhao, H., Sitaram, R. T., Tibshirani, R. J., Banerjee, S., Brooks, J. D., Ljungberg, B., Kuo, M. D. 2015; 277 (1): 114-123


    Purpose: To evaluate the feasibility of constructing radiogenomic-based surrogates of molecular assays (SOMAs) in patients with clear-cell renal cell carcinoma (CCRCC) by using data extracted from a single computed tomographic (CT) image. Materials and Methods: In this institutional review board approved study, gene expression profile data and contrast material-enhanced CT images from 70 patients with CCRCC in a training set were independently assessed by two radiologists for a set of predefined imaging features. A SOMA for a previously validated CCRCC-specific supervised principal component (SPC) risk score prognostic gene signature was constructed and termed the radiogenomic risk score (RRS). It uses the microarray data and a 28-trait image array to evaluate each CT image with multiple regression of gene expression analysis. The predictive power of the RRS SOMA was then prospectively validated in an independent dataset to confirm its relationship to the SPC gene signature (n = 70) and determination of patient outcome (n = 77). Data were analyzed by using multivariate linear regression-based methods and Cox regression modeling, and significance was assessed with receiver operator characteristic curves and Kaplan-Meier survival analysis. Results: Our SOMA faithfully represents the tissue-based molecular assay it models. The RRS scaled with the SPC gene signature (R = 0.57, P < .001, classification accuracy 70.1%, P < .001) and predicted disease-specific survival (log-rank P < .001). Independent validation confirmed the relationship between the RRS and the SPC gene signature (R = 0.45, P < .001, classification accuracy 68.6%, P < .001) and disease-specific survival (log-rank P < .001) and that it was independent of stage, grade, and performance status (multivariate Cox model P < .05, log-rank P < .001). Conclusion: A SOMA for the CCRCC-specific SPC prognostic gene signature that is predictive of disease-specific survival and independent of stage was constructed and validated, confirming that SOMA construction is feasible. © RSNA, 2015. Online supplemental material is available for this article. An earlier incorrect version of this article appeared online. This article was corrected on August 24, 2015.

    View details for DOI 10.1148/radiol.2015150800

    View details for Web of Science ID 000368434000014

  • Statistical learning and selective inference PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Taylor, J., Tibshirani, R. J. 2015; 112 (25): 7629-7634


    We describe the problem of "selective inference." This addresses the following challenge: Having mined a set of data to find potential associations, how do we properly assess the strength of these associations? The fact that we have "cherry-picked" (searched for the strongest associations) means that we must set a higher bar for declaring significant the associations that we see. This challenge becomes more important in the era of big data and complex statistical modeling. The cherry tree (dataset) can be very large and the tools for cherry picking (statistical learning methods) are now very sophisticated. We describe some recent new developments in selective inference and illustrate their use in forward stepwise regression, the lasso, and principal components analysis.

    View details for DOI 10.1073/pnas.1507583112

    View details for PubMedID 26100887

  • Collaborative regression BIOSTATISTICS Gross, S. M., Tibshirani, R. 2015; 16 (2): 326-338


    We consider the scenario where one observes an outcome variable and sets of features from multiple assays, all measured on the same set of samples. One approach that has been proposed for dealing with this type of data is "sparse multiple canonical correlation analysis" (sparse mCCA). All of the current sparse mCCA techniques are biconvex and thus have no guarantees about reaching a global optimum. We propose a method for performing sparse supervised canonical correlation analysis (sparse sCCA), a specific case of sparse mCCA when one of the datasets is a vector. Our proposal for sparse sCCA is convex and thus does not face the same difficulties as the other methods. We derive efficient algorithms for this problem that can be implemented with off-the-shelf solvers, and illustrate their use on simulated and real data.

    View details for DOI 10.1093/biostatistics/kxu047

    View details for PubMedID 25406332


    View details for DOI 10.1214/14-AOAS758

    View details for Web of Science ID 000358354400002

  • Molecular subtyping for clinically defined breast cancer subgroups BREAST CANCER RESEARCH Zhao, X., Rodland, E. A., Tibshirani, R., Plevritis, S. 2015; 17


    Breast cancer is commonly classified into intrinsic molecular subtypes. Standard gene centering is routinely done prior to molecular subtyping, but it can produce inaccurate classifications when the distribution of clinicopathological characteristics in the study cohort differs from that of the training cohort used to derive the classifier. We propose a subgroup-specific gene-centering method to perform molecular subtyping on a study cohort that has a skewed distribution of clinicopathological characteristics relative to the training cohort. On such a study cohort, we center each gene on a specified percentile, where the percentile is determined from a subgroup of the training cohort with clinicopathological characteristics similar to the study cohort. We demonstrate our method using the PAM50 classifier and its associated University of North Carolina (UNC) training cohort. We considered study cohorts with skewed clinicopathological characteristics, including subgroups composed of a single prototypic subtype of the UNC-PAM50 training cohort (n = 139), an external estrogen receptor (ER)-positive cohort (n = 48) and an external triple-negative cohort (n = 77). Subgroup-specific gene centering improved prediction performance with the accuracies between 77% and 100%, compared to accuracies between 17% and 33% from standard gene centering, when applied to the prototypic tumor subsets of the PAM50 training cohort. It reduced classification error rates on the ER-positive (11% versus 28%; P = 0.0389), the ER-negative (5% versus 41%; P < 0.0001) and the triple-negative (11% versus 56%; P = 0.1336) subgroups of the PAM50 training cohort. In addition, it produced higher accuracy for subtyping study cohorts composed of varying proportions of ER-positive versus ER-negative cases. Finally, it increased the percentage of assigned luminal subtypes on the external ER-positive cohort and basal-like subtype on the external triple-negative cohort. Gene centering is often necessary to accurately apply a molecular subtype classifier. Compared with standard gene centering, our proposed subgroup-specific gene centering produced more accurate molecular subtype assignments in a study cohort with skewed clinicopathological characteristics relative to the training cohort.

    View details for DOI 10.1186/s13058-015-0520-4

    View details for Web of Science ID 000351829500001

    View details for PubMedID 25849221

    View details for PubMedCentralID PMC4365540

  • Pancancer analysis of DNA methylation-driven genes using MethylMix GENOME BIOLOGY Gevaert, O., Tibshirani, R., Plevritis, S. K. 2015; 16


    Aberrant DNA methylation is an important mechanism that contributes to oncogenesis. Yet, few algorithms exist that exploit this vast dataset to identify hypo- and hypermethylated genes in cancer. We developed a novel computational algorithm called MethylMix to identify differentially methylated genes that are also predictive of transcription. We apply MethylMix to 12 individual cancer sites, and additionally combine all cancer sites in a pancancer analysis. We discover pancancer hypo- and hypermethylated genes and identify novel methylation-driven subgroups with clinical implications. MethylMix analysis on combined cancer sites reveals 10 pancancer clusters reflecting new similarities across malignantly transformed tissues.

    View details for DOI 10.1186/s13059-014-0579-8

    View details for Web of Science ID 000351817300001

    View details for PubMedID 25631659

    View details for PubMedCentralID PMC4365533

  • A Simple Method for Estimating Interactions Between a Treatment and a Large Number of Covariates JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Tian, L., Alizadeh, A. A., Gentles, A. J., Tibshirani, R. 2014; 109 (508): 1517-1532


    We consider a setting in which we have a treatment and a potentially large number of covariates for a set of observations, and wish to model their relationship with an outcome of interest. We propose a simple method for modeling interactions between the treatment and covariates. The idea is to modify the covariate in a simple way, and then fit a standard model using the modified covariates and no main effects. We show that coupled with an efficiency augmentation procedure, this method produces clinically meaningful estimators in a variety of settings. It can be useful for practicing personalized medicine: determining from a large set of biomarkers the subset of patients that can potentially benefit from a treatment. We apply the method to both simulated datasets and real trial data. The modified covariates idea can be used for other purposes, for example, large scale hypothesis testing for determining which of a set of covariates interact with a treatment variable.

    View details for DOI 10.1080/01621459.2014.951443

    View details for Web of Science ID 000346797000016

    View details for PubMedCentralID PMC4338439

  • Quantitative SD-OCT imaging biomarkers as indicators of age-related macular degeneration progression. Investigative ophthalmology & visual science de Sisternes, L., Simon, N., Tibshirani, R., Leng, T., Rubin, D. L. 2014; 55 (11): 7093-7103


    Purpose: We developed a statistical model based on quantitative characteristics of drusen to estimate the likelihood of conversion from early and intermediate age-related macular degeneration (AMD) to its advanced exudative form (AMD progression) in the short term (less than 5 years), a crucial task to enable early intervention and improve outcomes. Methods: Image features of drusen quantifying their number, morphology, and reflectivity properties, as well as the longitudinal evolution in these characteristics, were automatically extracted from 2146 spectral domain optical coherence tomography (SD-OCT) scans of 330 AMD eyes in 244 patients collected over a period of 5 years, with 36 eyes showing progression during clinical follow-up. We developed and evaluated a statistical model to predict the likelihood of progression at pre-determined times using clinical and image features as predictors. Results: Area, volume, height, and reflectivity of drusen were informative features distinguishing between progressing and non-progressing cases. Discerning progression at follow-up (mean 6.16 months) resulted in a mean area under the receiver operating characteristic curve (AUC) of 0.74 (95% confidence interval [CI], 0.58-0.85). The maximum predictive performance was observed at 11 months after a patient's first early AMD diagnosis, with mean AUC 0.92 (95% CI, 0.83-0.98). Those eyes predicted to progress showed a much higher progression rate than those predicted not to progress at any given time from the initial visit. Conclusions: Our results demonstrate the potential ability of our model to identify those AMD patients at risk of progressing to exudative AMD from an early or intermediate stage.

    View details for DOI 10.1167/iovs.14-14918

    View details for PubMedID 25301882

  • Alteration of the lipid profile in lymphomas induced by MYC overexpression. Proceedings of the National Academy of Sciences of the United States of America Eberlin, L. S., Gabay, M., Fan, A. C., Gouw, A. M., Tibshirani, R. J., Felsher, D. W., Zare, R. N. 2014; 111 (29): 10450-10455


    Overexpression of the v-myc avian myelocytomatosis viral oncogene homolog (MYC) oncogene is one of the most commonly implicated causes of human tumorigenesis. MYC is known to regulate many aspects of cellular biology including glucose and glutamine metabolism. Little is known about the relationship between MYC and the appearance and disappearance of specific lipid species. We use desorption electrospray ionization mass spectrometry imaging (DESI-MSI), statistical analysis, and conditional transgenic animal models and cell samples to investigate changes in lipid profiles in MYC-induced lymphoma. We have detected a lipid signature distinct from that observed in normal tissue and in rat sarcoma-induced lymphoma cells. We found 104 distinct molecular ions that have an altered abundance in MYC lymphoma compared with normal control tissue by statistical analysis with a false discovery rate of less than 5%. Of these, 86 molecular ions were specifically identified as complex phospholipids. To evaluate whether the lipid signature could also be observed in human tissue, we examined 15 human lymphoma samples with varying expression levels of MYC oncoprotein. Distinct lipid profiles in lymphomas with high and low MYC expression were observed, including many of the lipid species identified as significant for MYC-induced animal lymphoma tissue. Our results suggest a relationship between the appearance of specific lipid species and the overexpression of MYC in lymphomas.

    View details for DOI 10.1073/pnas.1409778111

    View details for PubMedID 24994904

  • Automated identification of stratifying signatures in cellular subpopulations PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Bruggner, R. V., Bodenmiller, B., Dill, D. L., Tibshirani, R. J., Nolan, G. P. 2014; 111 (26): E2770-E2777


    Elucidation and examination of cellular subpopulations that display condition-specific behavior can play a critical contributory role in understanding disease mechanism, as well as provide a focal point for development of diagnostic criteria linking such a mechanism to clinical prognosis. Despite recent advancements in single-cell measurement technologies, the identification of relevant cell subsets through manual efforts remains standard practice. As new technologies such as mass cytometry increase the parameterization of single-cell measurements, the scalability and subjectivity inherent in manual analyses slows both analysis and progress. We therefore developed Citrus (cluster identification, characterization, and regression), a data-driven approach for the identification of stratifying subpopulations in multidimensional cytometry datasets. The methodology of Citrus is demonstrated through the identification of known and unexpected pathway responses in a dataset of stimulated peripheral blood mononuclear cells measured by mass cytometry. Additionally, the performance of Citrus is compared with that of existing methods through the analysis of several publicly available datasets. As the complexity of flow cytometry datasets continues to increase, methods such as Citrus will be needed to aid investigators in the performance of unbiased--and potentially more thorough--correlation-based mining and inspection of cell subsets nested within high-dimensional datasets.

    View details for DOI 10.1073/pnas.1408792111

    View details for Web of Science ID 000338118900020

    View details for PubMedCentralID PMC4084463

  • Active idiotypic vaccination versus control immunotherapy for follicular lymphoma. Journal of clinical oncology Levy, R., Ganjoo, K. N., Leonard, J. P., Vose, J. M., Flinn, I. W., Ambinder, R. F., Connors, J. M., Berinstein, N. L., Belch, A. R., Bartlett, N. L., Nichols, C., Emmanouilides, C. E., Timmerman, J. M., Gregory, S. A., Link, B. K., Inwards, D. J., Freedman, A. S., Matous, J. V., Robertson, M. J., Kunkel, L. A., Ingolia, D. E., Gentles, A. J., Liu, C. L., Tibshirani, R., Alizadeh, A. A., Denney, D. W. 2014; 32 (17): 1797-1803

    View details for DOI 10.1200/JCO.2012.43.9273

    View details for PubMedID 24799467

  • Regularization Paths for Conditional Logistic Regression: The clogitL1 Package JOURNAL OF STATISTICAL SOFTWARE Reid, S., Tibshirani, R. 2014; 58 (12): 1-23
  • Sensitivity analysis for inference with partially identifiable covariance matrices COMPUTATIONAL STATISTICS G'Sell, M. G., Shen-Orr, S. S., Tibshirani, R. 2014; 29 (3-4): 529-546
  • LMO2 and BCL6 are associated with improved survival in primary central nervous system lymphoma BRITISH JOURNAL OF HAEMATOLOGY Lossos, C., Bayraktar, S., Weinzierl, E., Younes, S. F., Hosein, P. J., Tibshirani, R. J., Posthumus, J. S., DeAngelis, L. M., Raizer, J., Schiff, D., Abrey, L., Natkunam, Y., Lossos, I. S. 2014; 165 (5): 640-648


    Primary central nervous system lymphoma (PCNSL) is an aggressive sub-variant of non-Hodgkin lymphoma (NHL) with morphological similarities to diffuse large B-cell lymphoma (DLBCL). While methotrexate (MTX)-based therapies have improved patient survival, the disease remains incurable in most cases and its pathogenesis is poorly understood. We evaluated 69 cases of PCNSL for the expression of HGAL (also known as GCSAM), LMO2 and BCL6 - genes associated with DLBCL prognosis and pathobiology, and analysed their correlation to survival in 49 PCNSL patients receiving MTX-based therapy. We demonstrate that PCNSL expresses LMO2, HGAL (also known as GCSAM) and BCL6 proteins in 52%, 65% and 56% of tumours, respectively. BCL6 protein expression was associated with longer progression-free survival (P = 0·006) and overall survival (OS, P = 0·05), while expression of LMO2 protein was associated with longer OS (P = 0·027). Further research is needed to elucidate the function of BCL6 and LMO2 in PCNSL.

    View details for DOI 10.1111/bjh.12801

    View details for Web of Science ID 000335826500008

    View details for PubMedID 24571259

    View details for PubMedCentralID PMC4123533

  • A multicentre study of primary breast diffuse large B-cell lymphoma in the rituximab era BRITISH JOURNAL OF HAEMATOLOGY Hosein, P. J., Maragulia, J. C., Salzberg, M. P., Press, O. W., Habermann, T. M., Vose, J. M., Bast, M., Advani, R. H., Tibshirani, R., Evens, A. M., Islam, N., Leonard, J. P., Martin, P., Zelenetz, A. D., Lossos, I. S. 2014; 165 (3): 358-363


    Primary breast diffuse large B-cell lymphoma (DLBCL) is a rare subtype of non-Hodgkin lymphoma (NHL) with limited data on pathology and outcome. A multicentre retrospective study was undertaken to determine prognostic factors and the incidence of central nervous system (CNS) relapses. Data was retrospectively collected on patients from 8 US academic centres. Only patients with stage I/II disease (involvement of breast and localized lymph nodes) were included. Histologies apart from primary DLBCL were excluded. Between 1992 and 2012, 76 patients met the eligibility criteria. Most patients (86%) received chemotherapy, and 69% received immunochemotherapy with rituximab; 65% received radiation therapy and 9% received prophylactic CNS chemotherapy. After a median follow-up of 4·5 years (range 0·6-20·6 years), the Kaplan-Meier estimated median progression-free survival was 10·4 years (95% confidence interval [CI] 5·8-14·9 years), and the median overall survival was 14·6 years (95% CI 10·2-19 years). Twelve patients (16%) had CNS relapse. A low stage-modified International Prognostic Index (IPI) was associated with longer overall survival. Rituximab use was not associated with a survival advantage. Primary breast DLBCL has a high rate of CNS relapse. The stage-modified IPI score is associated with survival.

    View details for DOI 10.1111/bjh.12753

    View details for Web of Science ID 000334031000011

    View details for PubMedID 24467658

    View details for PubMedCentralID PMC3990235

  • A SIGNIFICANCE TEST FOR THE LASSO ANNALS OF STATISTICS Lockhart, R., Taylor, J., Tibshirani, R. J., Tibshirani, R. 2014; 42 (2): 413-468

    View details for DOI 10.1214/13-AOS1175

    View details for Web of Science ID 000336888400001

  • Molecular assessment of surgical-resection margins of gastric cancer by mass-spectrometric imaging. Proceedings of the National Academy of Sciences of the United States of America Eberlin, L. S., Tibshirani, R. J., Zhang, J., Longacre, T. A., Berry, G. J., Bingham, D. B., Norton, J. A., Zare, R. N., Poultsides, G. A. 2014; 111 (7): 2436-2441


    Surgical resection is the main curative option for gastrointestinal cancers. The extent of cancer resection is commonly assessed during surgery by pathologic evaluation of (frozen sections of) the tissue at the resected specimen margin(s) to verify whether cancer is present. We compare this method to an alternative procedure, desorption electrospray ionization mass spectrometric imaging (DESI-MSI), for 62 banked human cancerous and normal gastric-tissue samples. In DESI-MSI, microdroplets strike the tissue sample, the resulting splash enters a mass spectrometer, and a statistical analysis, here, the Lasso method (which stands for least absolute shrinkage and selection operator and which is a multiclass logistic regression with L1 penalty), is applied to classify tissues based on the molecular information obtained directly from DESI-MSI. The methodology developed with 28 frozen training samples of clear histopathologic diagnosis showed an overall accuracy value of 98% for the 12,480 pixels evaluated in cross-validation (CV), and 97% when a completely independent set of samples was tested. By applying an additional spatial smoothing technique, the accuracy for both CV and the independent set of samples was 99% compared with histological diagnoses. To test our method for clinical use, we applied it to a total of 21 tissue-margin samples prospectively obtained from nine gastric-cancer patients. The results obtained suggest that DESI-MSI/Lasso may be valuable for routine intraoperative assessment of the specimen margins during gastric-cancer surgery.

    View details for DOI 10.1073/pnas.1400274111

    View details for PubMedID 24550265

  • Systems analysis of sex differences reveals an immunosuppressive role for testosterone in the response to influenza vaccination. Proceedings of the National Academy of Sciences of the United States of America Furman, D., Hejblum, B. P., Simon, N., Jojic, V., Dekker, C. L., Thiébaut, R., Tibshirani, R. J., Davis, M. M. 2014; 111 (2): 869-874


    Females have generally more robust immune responses than males for reasons that are not well-understood. Here we used a systems analysis to investigate these differences by analyzing the neutralizing antibody response to a trivalent inactivated seasonal influenza vaccine (TIV) and a large number of immune system components, including serum cytokines and chemokines, blood cell subset frequencies, genome-wide gene expression, and cellular responses to diverse in vitro stimuli, in 53 females and 34 males of different ages. We found elevated antibody responses to TIV and expression of inflammatory cytokines in the serum of females compared with males regardless of age. This inflammatory profile correlated with the levels of phosphorylated STAT3 proteins in monocytes but not with the serological response to the vaccine. In contrast, using a machine learning approach, we identified a cluster of genes involved in lipid biosynthesis and previously shown to be up-regulated by testosterone that correlated with poor virus-neutralizing activity in men. Moreover, men with elevated serum testosterone levels and associated gene signatures exhibited the lowest antibody responses to TIV. These results demonstrate a strong association between androgens and genes involved in lipid metabolism, suggesting that these could be important drivers of the differences in immune responses between males and females.

    View details for DOI 10.1073/pnas.1321060111

    View details for PubMedID 24367114

  • Increasing value and reducing waste in research design, conduct, and analysis. Lancet Ioannidis, J. P., Greenland, S., Hlatky, M. A., Khoury, M. J., Macleod, M. R., Moher, D., Schulz, K. F., Tibshirani, R. 2014; 383 (9912): 166-175


    Correctable weaknesses in the design, conduct, and analysis of biomedical and public health research studies can produce misleading results and waste valuable resources. Small effects can be difficult to distinguish from bias introduced by study design and analyses. An absence of detailed written protocols and poor documentation of research is common. Information obtained might not be useful or important, and statistical precision or power is often too low or used in a misleading way. Insufficient consideration might be given to both previous and continuing studies. Arbitrary choice of analyses and an overemphasis on random extremes might affect the reported findings. Several problems relate to the research workforce, including failure to involve experienced statisticians and methodologists, failure to train clinical researchers and laboratory scientists in research methods and design, and the involvement of stakeholders with conflicts of interest. Inadequate emphasis is placed on recording of research decisions and on reproducibility of research. Finally, reward systems incentivise quantity more than quality, and novelty more than reliability. We propose potential solutions for these problems, including improvements in protocols and documentation, consideration of evidence from studies in progress, standardisation of research efforts, optimisation and training of an experienced and non-conflicted scientific workforce, and reconsideration of scientific reward systems.

    View details for DOI 10.1016/S0140-6736(13)62227-8

    View details for PubMedID 24411645

  • A shared transcriptional program in early breast neoplasias despite genetic and clinical distinctions GENOME BIOLOGY Brunner, A. L., Li, J., Guo, X., Sweeney, R. T., Varma, S., Zhu, S. X., Li, R., Tibshirani, R., West, R. B. 2014; 15 (5)


    The earliest recognizable stages of breast neoplasia are lesions that represent a heterogeneous collection of epithelial proliferations currently classified based on morphology. Their role in the development of breast cancer is not well understood but insight into the critical events at this early stage will improve efforts in breast cancer detection and prevention. These microscopic lesions are technically difficult to study so very little is known about their molecular alterations. To characterize the transcriptional changes of early breast neoplasia, we sequenced 3'-end enriched RNAseq libraries from formalin-fixed paraffin-embedded tissue of early neoplasia samples and matched normal breast and carcinoma samples from 25 patients. We find that gene expression patterns within early neoplasias are distinct from both normal and breast cancer patterns and identify a pattern of pro-oncogenic changes, including elevated transcription of ERBB2, FOXA1, and GATA3 at this early stage. We validate these findings on a second independent gene expression profile data set generated by whole transcriptome sequencing. Measurements of protein expression by immunohistochemistry on an independent set of early neoplasias confirm that ER pathway regulators FOXA1 and GATA3, as well as ER itself, are consistently upregulated at this early stage. The early neoplasia samples also demonstrate coordinated changes in long non-coding RNA expression and microenvironment stromal gene expression patterns. This study is the first examination of global gene expression in early breast neoplasia, and the genes identified here represent candidate participants in the earliest molecular events in the development of breast cancer.

    View details for DOI 10.1186/gb-2014-15-5-r71

    View details for Web of Science ID 000338981700005

    View details for PubMedCentralID PMC4072957

  • Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-Seq data STATISTICAL METHODS IN MEDICAL RESEARCH Li, J., Tibshirani, R. 2013; 22 (5): 519-536


    We discuss the identification of features that are associated with an outcome in RNA-Sequencing (RNA-Seq) and other sequencing-based comparative genomic experiments. RNA-Seq data takes the form of counts, so models based on the normal distribution are generally unsuitable. The problem is especially challenging because different sequencing experiments may generate quite different total numbers of reads, or 'sequencing depths'. Existing methods for this problem are based on Poisson or negative binomial models: they are useful but can be heavily influenced by 'outliers' in the data. We introduce a simple, non-parametric method with resampling to account for the different sequencing depths. The new method is more robust than parametric methods. It can be applied to data with quantitative, survival, two-class or multiple-class outcomes. We compare our proposed method to Poisson and negative binomial-based methods in simulated and real data sets, and find that our method discovers more consistent patterns than competing methods.

    View details for DOI 10.1177/0962280211428386

    View details for PubMedID 22127579

  • Identification of gene microarray expression profiles in patients with chronic graft-versus-host disease following allogeneic hematopoietic cell transplantation. Clinical immunology Kohrt, H. E., Tian, L., Li, L., Alizadeh, A. A., Hsieh, S., Tibshirani, R. J., Strober, S., Sarwal, M., Lowsky, R. 2013; 148 (1): 124-135


    Chronic graft-versus-host disease (GVHD) results in significant morbidity and mortality, limiting the benefit of allogeneic hematopoietic cell transplantation (HCT). Peripheral blood gene expression profiling of the donor immune repertoire following HCT may provide associated genes and pathways thereby improving the pathophysiologic understanding of chronic GVHD. We profiled 70 patients and identified candidate genes that provided mechanistic insight in the biologic pathways that underlie chronic GVHD. Our data revealed that the dominant gene signature in patients with chronic GVHD represented compensatory responses that control inflammation and included the interleukin-1 decoy receptor, IL-1 receptor type II, and genes that were profibrotic and associated with the IL-4, IL-6 and IL-10 signaling pathways. In addition, we identified three genes that were important regulators of extracellular matrix. Validation of this discovery phase study will determine if the identified genes have diagnostic, prognostic or therapeutic implications.

    View details for DOI 10.1016/j.clim.2013.04.013

    View details for PubMedID 23685278

  • A LASSO FOR HIERARCHICAL INTERACTIONS ANNALS OF STATISTICS Bien, J., Taylor, J., Tibshirani, R. 2013; 41 (3): 1111-1141

    View details for DOI 10.1214/13-AOS1096

    View details for Web of Science ID 000321847600003

  • A Sparse-Group Lasso JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Simon, N., Friedman, J., Hastie, T., Tibshirani, R. 2013; 22 (2): 231-245
  • Classification of patients from time-course gene expression BIOSTATISTICS Zhang, Y., Tibshirani, R., Davis, R. 2013; 14 (1): 87-98


    Classifying patients into different risk groups based on their genomic measurements can help clinicians design appropriate clinical treatment plans. To produce such a classification, gene expression data were collected on a cohort of burn patients, who were monitored across multiple time points. This led us to develop a new classification method using time-course gene expressions. Our results showed that making good use of time-course information of gene expression improved the performance of classification compared with using gene expression from individual time points only. Our method is implemented into an R-package: time-course prediction analysis using microarray.

    View details for DOI 10.1093/biostatistics/kxs027

    View details for Web of Science ID 000312636300007

    View details for PubMedID 22926914

    View details for PubMedCentralID PMC3520502

  • Scientific research in the age of omics: the good, the bad, and the sloppy JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Witten, D. M., Tibshirani, R. 2013; 20 (1): 125-127


    It has been claimed that most research findings are false, and it is known that large-scale studies involving omics data are especially prone to errors in design, execution, and analysis. The situation is alarming because taxpayer dollars fund a substantial amount of biomedical research, and because the publication of a research article that is later determined to be flawed can erode the credibility of an entire field, resulting in a severe and negative impact for years to come. Here, we urge the development of an online, open-access, postpublication, peer review system that will increase the accountability of scientists for the quality of their research and the ability of readers to distinguish good from sloppy science.

    View details for DOI 10.1136/amiajnl-2012-000972

    View details for Web of Science ID 000313512900020

    View details for PubMedID 23037799

  • Coronary risk assessment among intermediate risk patients using a clinical and biomarker based algorithm developed and validated in two population cohorts CURRENT MEDICAL RESEARCH AND OPINION Cross, D. S., McCarty, C. A., Hytopoulos, E., Beggs, M., Nolan, N., Harrington, D. S., Hastie, T., Tibshirani, R., Tracy, R. P., Psaty, B. M., McClelland, R., Tsao, P. S., Quertermous, T. 2012; 28 (11): 1819-1830


    Many coronary heart disease (CHD) events occur in individuals classified as intermediate risk by commonly used assessment tools. Over half the individuals presenting with a severe cardiac event, such as myocardial infarction (MI), have at most one risk factor as included in the widely used Framingham risk assessment. Individuals classified as intermediate risk, who are actually at high risk, may not receive guideline recommended treatments. A clinically useful method for accurately predicting 5-year CHD risk among intermediate risk patients remains an unmet medical need. This study sought to develop a CHD Risk Assessment (CHDRA) model that improves 5-year risk stratification among intermediate risk individuals. Assay panels for biomarkers associated with atherosclerosis biology (inflammation, angiogenesis, apoptosis, chemotaxis, etc.) were optimized for measuring baseline serum samples from 1084 initially CHD-free Marshfield Clinic Personalized Medicine Research Project (PMRP) individuals. A multivariable Cox regression model was fit using the most powerful risk predictors within the clinical and protein variables identified by repeated cross-validation. The resulting CHDRA algorithm was validated in a Multi-Ethnic Study of Atherosclerosis (MESA) case-cohort sample. A CHDRA algorithm of age, sex, diabetes, and family history of MI, combined with serum levels of seven biomarkers (CTACK, Eotaxin, Fas Ligand, HGF, IL-16, MCP-3, and sFas) yielded a clinical net reclassification index of 42.7% (p < 0.001) for MESA patients with a recalibrated Framingham 5-year intermediate risk level. Across all patients, the model predicted acute coronary events (hazard ratio = 2.17, p < 0.001), and remained an independent predictor after Framingham risk factor adjustments. Limitations include the slightly different event definition with the MESA samples and the inability to include PMRP fatal CHD events. A novel risk score of serum protein levels plus clinical risk factors, developed and validated in independent cohorts, demonstrated clinical utility for assessing the true risk of CHD events in intermediate risk patients. Improved accuracy in cardiovascular risk classification could lead to improved preventive care and fewer deaths.

    View details for DOI 10.1185/03007995.2012.742878

    View details for Web of Science ID 000310985600009

    View details for PubMedID 23092312

  • Genome-wide Measurement of RNA Folding Energies MOLECULAR CELL Wan, Y., Qu, K., Ouyang, Z., Kertesz, M., Li, J., Tibshirani, R., Makino, D. L., Nutter, R. C., Segal, E., Chang, H. Y. 2012; 48 (2): 169-181


    RNA structural transitions are important in the function and regulation of RNAs. Here, we reveal a layer of transcriptome organization in the form of RNA folding energies. By probing yeast RNA structures at different temperatures, we obtained relative melting temperatures (Tm) for RNA structures in over 4000 transcripts. Specific signatures of RNA Tm demarcated the polarity of mRNA open reading frames and highlighted numerous candidate regulatory RNA motifs in 3' untranslated regions. RNA Tm distinguished noncoding versus coding RNAs and identified mRNAs with distinct cellular functions. We identified thousands of putative RNA thermometers, and their presence is predictive of the pattern of RNA decay in vivo during heat shock. The exosome complex recognizes unpaired bases during heat shock to degrade these RNAs, coupling intrinsic structural stabilities to gene regulation. Thus, genome-wide structural dynamics of RNA can parse functional elements of the transcriptome and reveal diverse biological insights.

    View details for DOI 10.1016/j.molcel.2012.08.008

    View details for PubMedID 22981864

  • Normalization, testing, and false discovery rate estimation for RNA-sequencing data BIOSTATISTICS Li, J., Witten, D. M., Johnstone, I. M., Tibshirani, R. 2012; 13 (3): 523-538


    We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.

    View details for DOI 10.1093/biostatistics/kxr031

    View details for PubMedID 22003245

  • Autoantibody Epitope Spreading in the Pre-Clinical Phase Predicts Progression to Rheumatoid Arthritis PLOS ONE Sokolove, J., Bromberg, R., Deane, K. D., Lahey, L. J., Derber, L. A., Chandra, P. E., Edison, J. D., Gilliland, W. R., Tibshirani, R. J., Norris, J. M., Holers, V. M., Robinson, W. H. 2012; 7 (5)


    Rheumatoid arthritis (RA) is a prototypical autoimmune arthritis affecting nearly 1% of the world population and is a significant cause of worldwide disability. Though prior studies have demonstrated the appearance of RA-related autoantibodies years before the onset of clinical RA, the pattern of immunologic events preceding the development of RA remains unclear. To characterize the evolution of the autoantibody response in the preclinical phase of RA, we used a novel multiplex autoantigen array to evaluate development of the anti-citrullinated protein antibodies (ACPA) and to determine if epitope spread correlates with rise in serum cytokines and imminent onset of clinical RA. To do so, we utilized a cohort of 81 patients with clinical RA for whom stored serum was available from 1-12 years prior to disease onset. We evaluated the accumulation of ACPA subtypes over time and correlated this accumulation with elevations in serum cytokines. We then used logistic regression to identify a profile of biomarkers which predicts the imminent onset of clinical RA (defined as within 2 years of testing). We observed a time-dependent expansion of ACPA specificity with the number of ACPA subtypes. At the earliest timepoints, we found autoantibodies targeting several innate immune ligands including citrullinated histones, fibrinogen, and biglycan, thus providing insights into the earliest autoantigen targets and potential mechanisms underlying the onset and development of autoimmunity in RA. Additionally, expansion of the ACPA response strongly predicted elevations in many inflammatory cytokines including TNF-α, IL-6, IL-12p70, and IFN-γ. Thus, we observe that the preclinical phase of RA is characterized by an accumulation of multiple autoantibody specificities reflecting the process of epitope spread. Epitope expansion is closely correlated with the appearance of preclinical inflammation, and we identify a biomarker profile including autoantibodies and cytokines which predicts the imminent onset of clinical arthritis.

    View details for DOI 10.1371/journal.pone.0035296

    View details for PubMedID 22662108

  • DEGREES OF FREEDOM IN LASSO PROBLEMS ANNALS OF STATISTICS Tibshirani, R. J., Taylor, J. 2012; 40 (2): 1198-1232

    View details for DOI 10.1214/12-AOS1003

    View details for Web of Science ID 000307608000021

  • In situ vaccination against mycosis fungoides by intratumoral injection of a TLR9 agonist combined with radiation: a phase 1/2 study BLOOD Kim, Y. H., Gratzinger, D., Harrison, C., Brody, J. D., Czerwinski, D. K., Ai, W. Z., Morales, A., Abdulla, F., Xing, L., Navi, D., Tibshirani, R. J., Advani, R. H., Lingala, B., Shah, S., Hoppe, R. T., Levy, R. 2012; 119 (2): 355-363


    We have developed and previously reported on a therapeutic vaccination strategy for indolent B-cell lymphoma that combines local radiation to enhance tumor immunogenicity with the injection into the tumor of a TLR9 agonist. As a result, antitumor CD8+ T cells are induced, and systemic tumor regression was documented. Because the vaccination occurs in situ, there is no need to manufacture a vaccine product. We have now explored this strategy in a second disease: mycosis fungoides (MF). We treated 15 patients. Clinical responses were assessed at the distant, untreated sites as a measure of systemic antitumor activity. Five clinically meaningful responses were observed. The procedure was well tolerated and adverse effects consisted mostly of mild and transient injection site or flu-like symptoms. The immunized sites showed a significant reduction of CD25+, Foxp3+ T cells that could be either MF cells or tissue regulatory T cells and a similar reduction in S100+, CD1a+ dendritic cells. There was a trend toward greater reduction of CD25+ T cells and skin dendritic cells in clinical responders versus nonresponders. Our in situ vaccination strategy is feasible also in MF and the clinical responses that occurred in a subset of patients warrant further study with modifications to augment these therapeutic effects. This study is registered as NCT00226993.

    View details for DOI 10.1182/blood-2011-05-355222

    View details for PubMedID 22045986

  • Strong rules for discarding predictors in lasso-type problems JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., Tibshirani, R. J. 2012; 74: 245-266


    We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui and his colleagues have proposed 'SAFE' rules, based on univariate inner products between each predictor and the outcome, which guarantee that a coefficient will be 0 in the solution vector. This provides a reduction in the number of variables that need to be entered into the optimization. We propose strong rules that are very simple and yet screen out far more predictors than the SAFE rules. This great practical improvement comes at a price: the strong rules are not foolproof and can mistakenly discard active predictors, i.e. predictors that have non-zero coefficients in the solution. We therefore combine them with simple checks of the Karush-Kuhn-Tucker conditions to ensure that the exact solution to the convex problem is delivered. Of course, any (approximate) screening method can be combined with the Karush-Kuhn-Tucker conditions to ensure the exact solution; the strength of the strong rules lies in the fact that, in practice, they discard a very large number of the inactive predictors and almost never commit mistakes. We also derive conditions under which they are foolproof. Strong rules provide substantial savings in computational time for a variety of statistical optimization problems.

    View details for DOI 10.1111/j.1467-9868.2011.01004.x

    View details for Web of Science ID 000301286200004

    View details for PubMedCentralID PMC4262615
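
    The screen-then-check loop described in the abstract above can be sketched as follows. The abstract does not state the rule itself; this sketch assumes the commonly cited basic form, which discards predictor j when |x_j'y| < 2*lam - lam_max for the objective 0.5*||y - Xb||^2 + lam*||b||_1. The coordinate descent solver, the function names, and the simulated data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500):
    """Plain coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.astype(float).copy()            # residual y - X @ beta
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Inner product of x_j with the partial residual (excluding j).
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def lasso_strong_rule(X, y, lam, tol=1e-8):
    """Screen with the (assumed) basic strong rule, then repair via KKT checks."""
    score = np.abs(X.T @ y)
    lam_max = score.max()
    keep = score >= 2.0 * lam - lam_max       # survivors of the screen
    while True:
        beta = np.zeros(X.shape[1])
        beta[keep] = lasso_cd(X[:, keep], y, lam)
        # KKT condition for a zero coefficient: |x_j'(y - X b)| <= lam.
        grad = np.abs(X.T @ (y - X @ beta))
        violators = ~keep & (grad > lam + tol)
        if not violators.any():
            return beta, keep
        keep = keep | violators               # re-admit mistakes and refit

# Simulated data (hypothetical): 5 true signals among 100 predictors.
rng = np.random.default_rng(0)
n, p = 60, 100
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)                # unit-norm columns
y = X[:, :5] @ np.full(5, 3.0) + 0.1 * rng.standard_normal(n)
y -= y.mean()

lam = 0.8 * np.abs(X.T @ y).max()
beta_screened, keep = lasso_strong_rule(X, y, lam)
beta_full = lasso_cd(X, y, lam)
print(f"predictors passed to the solver: {keep.sum()} of {p}")
```

    Because every discarded predictor is verified against the Karush-Kuhn-Tucker condition before the loop exits, the screened fit should agree with the unscreened fit up to solver tolerance.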

  • Inference with transposable data: modelling the effects of row and column correlations JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Allen, G. I., Tibshirani, R. 2012; 74: 721-743
  • Transcriptional profiling of long non-coding RNAs and novel transcribed regions across a diverse panel of archived human cancers GENOME BIOLOGY Brunner, A. L., Beck, A. H., Edris, B., Sweeney, R. T., Zhu, S. X., Li, R., Montgomery, K., Varma, S., Gilks, T., Guo, X., Foley, J. W., Witten, D. M., Giacomini, C. P., Flynn, R. A., Pollack, J. R., Tibshirani, R., Chang, H. Y., van de Rijn, M., West, R. B. 2012; 13 (8)


    BACKGROUND: Molecular characterization of tumors has been critical for identifying important genes in cancer biology and for improving tumor classification and diagnosis. Long non-coding RNAs, as a new, relatively unstudied class of transcripts, provide a rich opportunity to identify both functional drivers and cancer-type-specific biomarkers. However, despite the potential importance of long non-coding RNAs to the cancer field, no comprehensive survey of long non-coding RNA expression across various cancers has been reported. RESULTS: We performed a sequencing-based transcriptional survey of both known long non-coding RNAs and novel intergenic transcripts across a panel of 64 archival tumor samples comprising 17 diagnostic subtypes of adenocarcinomas, squamous cell carcinomas and sarcomas. We identified hundreds of transcripts from among the known 1,065 long non-coding RNAs surveyed that showed variability in transcript levels between the tumor types and are therefore potential biomarker candidates. We discovered 1,071 novel intergenic transcribed regions and demonstrate that these show similar patterns of variability between tumor types. We found that many of these differentially expressed cancer transcripts are also expressed in normal tissues. One such novel transcript specifically expressed in breast tissue was further evaluated using RNA in situ hybridization on a panel of breast tumors. It was shown to correlate with low tumor grade and estrogen receptor expression, thereby representing a potentially important new breast cancer biomarker. CONCLUSIONS: This study provides the first large survey of long non-coding RNA expression within a panel of solid cancers and also identifies a number of novel transcribed regions differentially expressed across distinct cancer types that represent candidate biomarkers for future research.

    View details for Web of Science ID 000315867500009

  • Sparse estimation of a covariance matrix BIOMETRIKA Bien, J., Tibshirani, R. J. 2011; 98 (4): 807-820


    We suggest a method for estimating a covariance matrix on the basis of a sample of vectors drawn from a multivariate normal distribution. In particular, we penalize the likelihood with a lasso penalty on the entries of the covariance matrix. This penalty plays two important roles: it reduces the effective number of parameters, which is important even when the dimension of the vectors is smaller than the sample size since the number of parameters grows quadratically in the number of variables, and it produces an estimate which is sparse. In contrast to sparse inverse covariance estimation, our method's close relative, the sparsity attained here is in the covariance matrix itself rather than in the inverse matrix. Zeros in the covariance matrix correspond to marginal independencies; thus, our method performs model selection while providing a positive definite estimate of the covariance. The proposed penalized maximum likelihood problem is not convex, so we use a majorize-minimize approach in which we iteratively solve convex approximations to the original nonconvex problem. We discuss tuning parameter selection and demonstrate on a flow-cytometry dataset how our method produces an interpretable graphical display of the relationship between variables. We perform simulations that suggest that simple elementwise thresholding of the empirical covariance matrix is competitive with our method for identifying the sparsity structure. Additionally, we show how our method can be used to solve a previously studied special case in which a desired sparsity pattern is prespecified.

    View details for DOI 10.1093/biomet/asr054

    View details for Web of Science ID 000297366000004
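
    The abstract above notes that simple elementwise thresholding of the empirical covariance matrix is competitive for identifying the sparsity structure. A minimal sketch of that baseline is given below; it is not the penalized-likelihood method the paper proposes. The function name, the hard-threshold form, and the cutoff value are assumptions made for illustration, and a thresholded matrix is not guaranteed to be positive definite.

```python
import numpy as np

def threshold_covariance(samples, cutoff):
    """Hard-threshold the off-diagonal entries of the empirical covariance.

    samples: array of shape (n_observations, n_variables).
    cutoff:  off-diagonal entries with absolute value below this are set to 0.
    """
    S = np.cov(samples, rowvar=False)
    T = np.where(np.abs(S) >= cutoff, S, 0.0)
    np.fill_diagonal(T, np.diag(S))           # variances are never thresholded
    return T

# Hypothetical data: variables 0 and 1 are correlated, 2 and 3 are independent.
rng = np.random.default_rng(1)
z = rng.standard_normal((500, 4))
z[:, 1] += z[:, 0]
S_hat = np.cov(z, rowvar=False)
T_hat = threshold_covariance(z, cutoff=0.5)
print(np.round(T_hat, 2))
```

    Zeros in the thresholded matrix are read as estimated marginal independencies, which is the same interpretation the abstract gives to zeros in the penalized estimate.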


    View details for DOI 10.1214/11-AOAS495

    View details for Web of Science ID 000300382800008

  • A fused lasso latent feature model for analyzing multi-sample aCGH data BIOSTATISTICS Nowak, G., Hastie, T., Pollack, J. R., Tibshirani, R. 2011; 12 (4): 776-791


    Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Some simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.

    View details for DOI 10.1093/biostatistics/kxr012

    View details for Web of Science ID 000294806800014

    View details for PubMedID 21642389

  • Hierarchical Clustering With Prototypes via Minimax Linkage JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Bien, J., Tibshirani, R. 2011; 106 (495): 1075-1084
  • Prediction of survival in diffuse large B-cell lymphoma based on the expression of 2 genes reflecting tumor and microenvironment BLOOD Alizadeh, A. A., Gentles, A. J., Alencar, A. J., Liu, C. L., Kohrt, H. E., Houot, R., Goldstein, M. J., Zhao, S., Natkunam, Y., Advani, R. H., Gascoyne, R. D., Briones, J., Tibshirani, R. J., Myklebust, J. H., Plevritis, S. K., Lossos, I. S., Levy, R. 2011; 118 (5): 1350-1358


    Several gene-expression signatures predict survival in diffuse large B-cell lymphoma (DLBCL), but the lack of practical methods for genome-scale analysis has limited translation to clinical practice. We built and validated a simple model using one gene expressed by tumor cells and another expressed by host immune cells, assessing added prognostic value to the clinical International Prognostic Index (IPI). LIM domain only 2 (LMO2) was validated as an independent predictor of survival and the "germinal center B cell-like" subtype. Expression of tumor necrosis factor receptor superfamily member 9 (TNFRSF9) from the DLBCL microenvironment was the best gene in bivariate combination with LMO2. Study of TNFRSF9 tissue expression in 95 patients with DLBCL showed expression limited to infiltrating T cells. A model integrating these 2 genes was independent of "cell-of-origin" classification, "stromal signatures," IPI, and added to the predictive power of the IPI. A composite score integrating these genes with IPI performed well in 3 independent cohorts of 545 DLBCL patients, as well as in a simple assay of routine formalin-fixed specimens from a new validation cohort of 147 patients with DLBCL. We conclude that the measurement of a single gene expressed by tumor cells (LMO2) and a single gene expressed by the immune microenvironment (TNFRSF9) powerfully predicts overall survival in patients with DLBCL.

    View details for DOI 10.1182/blood-2011-03-345272

    View details for PubMedID 21670469

  • MicroRNAs Are Independent Predictors of Outcome in Diffuse Large B-Cell Lymphoma Patients Treated with R-CHOP CLINICAL CANCER RESEARCH Alencar, A. J., Malumbres, R., Kozloski, G. A., Advani, R., Talreja, N., Chinichian, S., Briones, J., Natkunam, Y., Sehn, L. H., Gascoyne, R. D., Tibshirani, R., Lossos, I. S. 2011; 17 (12): 4125-4135


    Diffuse large B-cell lymphoma (DLBCL) heterogeneity has prompted investigations for new biomarkers that can accurately predict survival. A previously reported 6-gene model combined with the International Prognostic Index (IPI) could predict patients' outcome. However, even these predictors are not capable of unambiguously identifying outcome, suggesting that additional biomarkers might improve their predictive power. We studied expression of 11 microRNAs (miRNA) that had previously been reported to have variable expression in DLBCL tumors. We measured the expression of each miRNA by quantitative real-time PCR analyses in 176 samples from uniformly treated DLBCL patients and correlated the results to survival. In a univariate analysis, the expression of miR-18a correlated with overall survival (OS), whereas the expression of miR-181a and miR-222 correlated with progression-free survival (PFS). A multivariate Cox regression analysis including the IPI, the 6-gene model-derived mortality predictor score and expression of the miR-18a, miR-181a, and miR-222, revealed that all variables were independent predictors of survival except the expression of miR-222 for OS and the expression of miR-18a for PFS. The expression of specific miRNAs may be useful for DLBCL survival prediction and their role in the pathogenesis of this disease should be examined further.

    View details for DOI 10.1158/1078-0432.CCR-11-0224

    View details for Web of Science ID 000291644700029

    View details for PubMedID 21525173

    View details for PubMedCentralID PMC3117929

  • THE SOLUTION PATH OF THE GENERALIZED LASSO ANNALS OF STATISTICS Tibshirani, R. J., Taylor, J. 2011; 39 (3): 1335-1371

    View details for DOI 10.1214/11-AOS878

    View details for Web of Science ID 000293716500001

  • Regularization Paths for Cox's Proportional Hazards Model via Coordinate Descent JOURNAL OF STATISTICAL SOFTWARE Simon, N., Friedman, J., Hastie, T., Tibshirani, R. 2011; 39 (5): 1-13


    We introduce a pathwise algorithm for the Cox proportional hazards model, regularized by convex combinations of ℓ1 and ℓ2 penalties (elastic net). Our algorithm fits via cyclical coordinate descent, and employs warm starts to find a solution along a regularization path. We demonstrate the efficacy of our algorithm on real and simulated data sets, and find considerable speedup between our algorithm and competing methods.

    View details for Web of Science ID 000288204000001

    View details for PubMedCentralID PMC4824408

  • The Prognostic Value of Tumor-Associated Macrophages in Leiomyosarcoma A Single Institution Study AMERICAN JOURNAL OF CLINICAL ONCOLOGY-CANCER CLINICAL TRIALS Ganjoo, K. N., Witten, D., Patel, M., Espinosa, I., La, T., Tibshirani, R., van de Rijn, M., Jacobs, C., West, R. B. 2011; 34 (1): 82-86


    High numbers of tumor-associated macrophages (TAMs) have been associated with poor outcome in several solid tumors. In 2 previous studies, we showed that colony stimulating factor-1 (CSF1) is secreted by leiomyosarcoma (LMS) and that the increase in macrophages and CSF1 associated proteins are markers for poor prognosis in both gynecologic and nongynecologic LMS in a multicentered study. The purpose of this study is to evaluate the outcome of patients with LMS from a single institution according to the number of TAMs evaluated through 3 CSF1 associated proteins. Patients with LMS treated at Stanford University with adequate archived tissue and clinical data were eligible for this retrospective study. Data from chart reviews included tumor site, size, grade, stage, treatment, and disease status at the time of last follow-up. The 3 CSF1 associated proteins (CD163, CD16, and cathepsin L) were evaluated by immunohistochemistry on tissue microarrays. Kaplan-Meier survival curves and univariate Cox proportional hazards models were fit to assess the association of clinical predictors as well as CSF1 associated proteins with overall survival. A total of 52 patients diagnosed from 1983 to 2007 were evaluated. Univariate Cox proportional hazards models were fit to assess the significance of grade, size, stage, and the 3 CSF1 associated proteins in predicting OS. Grade, size, and stage were not significantly associated with survival in the full patient cohort, but grade and stage were significant predictors of survival in the gynecologic (GYN) LMS samples (P = 0.038 and P = 0.0164, respectively). Increased cathepsin L was associated with a worse outcome in GYN LMS (P = 0.049). Similar findings were seen with CD16 (P < 0.0001). In addition, CSF1 response enriched (all 3 stains positive) GYN LMS had a poor overall survival when compared with CSF1 response poor tumors (P = 0.001). These results were not seen in non-GYN LMS. Our data form an independent confirmation of the prognostic significance of TAMs and the CSF1 associated proteins in LMS. More aggressive or targeted therapies could be considered in the subset of LMS patients that highly express these markers.

    View details for DOI 10.1097/COC.0b013e3181d26d5e

    View details for PubMedID 23781555

  • Nearly-Isotonic Regression TECHNOMETRICS Tibshirani, R. J., Hoefling, H., Tibshirani, R. 2011; 53 (1): 54-61
  • Adaptive index models for marker-based risk stratification BIOSTATISTICS Tian, L., Tibshirani, R. 2011; 12 (1): 68-86


    We use the term "index predictor" to denote a score that consists of K binary rules such as "age > 60" or "blood pressure > 120 mm Hg." The index predictor is the sum of these binary scores, yielding a value from 0 to K. Such indices are often used in clinical studies to stratify population risk; they are usually derived from subject area considerations. In this paper, we propose a fast data-driven procedure for automatically constructing such indices for linear, logistic, and Cox regression models. We also extend the procedure to create indices for detecting treatment-marker interactions. The methods are illustrated on a study with protein biomarkers as well as a large microarray gene expression study.

    View details for DOI 10.1093/biostatistics/kxq047

    View details for PubMedID 20663850
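
    The index predictor defined in the abstract above is a count of satisfied binary rules. The sketch below shows only how such an index is evaluated for a subject, using the two example rules quoted in the abstract; it does not implement the paper's data-driven procedure for choosing the rules. The field names `age` and `blood_pressure` and the function name are hypothetical.

```python
from typing import Callable, Dict, List

Subject = Dict[str, float]
Rule = Callable[[Subject], bool]

# The two example rules quoted in the abstract (K = 2).
RULES: List[Rule] = [
    lambda s: s["age"] > 60,
    lambda s: s["blood_pressure"] > 120,      # mm Hg
]

def index_predictor(subject: Subject, rules: List[Rule]) -> int:
    """Sum of binary rule indicators: an integer from 0 to len(rules)."""
    return sum(1 for rule in rules if rule(subject))

patients = [
    {"age": 45, "blood_pressure": 110},
    {"age": 67, "blood_pressure": 118},
    {"age": 72, "blood_pressure": 135},
]
scores = [index_predictor(p, RULES) for p in patients]
print(scores)
```

    Each subject falls into one of K + 1 risk strata, which is what makes the score easy to report and apply in clinical settings.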

  • Regression shrinkage and selection via the lasso: a retrospective JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Tibshirani, R. 2011; 73: 273-282
  • Penalized classification using Fisher's linear discriminant JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Witten, D. M., Tibshirani, R. 2011; 73: 753-772
  • Bayesian gene set analysis for identifying significant biological pathways JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES C-APPLIED STATISTICS Shahbaba, B., Tibshirani, R., Shachaf, C. M., Plevritis, S. K. 2011; 60: 541-557


    We propose a hierarchical Bayesian model for analyzing gene expression data to identify pathways differentiating between two biological states (e.g., cancer vs. non-cancer and mutant vs. normal). Finding significant pathways can improve our understanding of biological processes. When the biological process of interest is related to a specific disease, eliciting a better understanding of the underlying pathways can lead to designing a more effective treatment. We apply our method to data obtained by interrogating the mutational status of p53 in 50 cancer cell lines (33 mutated and 17 normal). We identify several significant pathways with strong biological connections. We show that our approach provides a natural framework for incorporating prior biological information, and it has the best overall performance in terms of correctly identifying significant pathways compared to several alternative methods.

    View details for DOI 10.1111/j.1467-9876.2011.00765.x

    View details for Web of Science ID 000293235800004

    View details for PubMedCentralID PMC3156489

  • Supervised multidimensional scaling for visualization, classification, and bipartite ranking COMPUTATIONAL STATISTICS & DATA ANALYSIS Witten, D. M., Tibshirani, R. 2011; 55 (1): 789-801
  • A statistician plays darts JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES A-STATISTICS IN SOCIETY Tibshirani, R. J., Price, A., Taylor, J. 2011; 174: 213-226
  • In Situ Vaccination with TLR9 Agonist Combined with Local Radiation In Mycosis Fungoides: Analysis of Phase I/II Study 52nd Annual Meeting and Exposition of the American-Society-of-Hematology (ASH) Kim, Y. H., Gratzinger, D., Harrison, C., Brody, J., Czerwinski, D., Xing, L., Morales, A., Ai, W., Abdulla, F., Navi, D., Tibshirani, R. J., Advani, R., Natkunam, Y., Hoppe, R. T., Levy, R. AMER SOC HEMATOLOGY. 2010: 130–30
  • Prediction of Survival In Diffuse Large B-Cell Lymphoma Based On the Expression of Two Genes Reflecting Tumor and Microenvironment 52nd Annual Meeting and Exposition of the American-Society-of-Hematology (ASH) Alizadeh, A. A., Gentles, A. J., Alencar, A. J., Kohrt, H. E., Houot, R., Goldstein, M. J., Zhao, S., Natkunam, Y., Advani, R., Gascoyne, R. D., Briones, J., Tibshirani, R. J., Myklebust, J. H., Plevritis, S. K., Lossos, I. S., Levy, R. AMER SOC HEMATOLOGY. 2010: 836–37
  • In Situ Vaccination With a TLR9 Agonist Induces Systemic Lymphoma Regression: A Phase I/II Study JOURNAL OF CLINICAL ONCOLOGY Brody, J. D., Ai, W. Z., Czerwinski, D. K., Torchia, J. A., Levy, M., Advani, R. H., Kim, Y. H., Hoppe, R. T., Knox, S. J., Shin, L. K., Wapnir, I., Tibshirani, R. J., Levy, R. 2010; 28 (28): 4324-4332


    Combining tumor antigens with an immunostimulant can induce the immune system to specifically eliminate cancer cells. Generally, this combination is accomplished in an ex vivo, customized manner. In a preclinical lymphoma model, intratumoral injection of a Toll-like receptor 9 (TLR9) agonist induced systemic antitumor immunity and cured large, disseminated tumors. We treated 15 patients with low-grade B-cell lymphoma using low-dose radiotherapy to a single tumor site and, at that same site, injected the C-G enriched, synthetic oligodeoxynucleotide (also referred to as CpG) TLR9 agonist PF-3512676. Clinical responses were assessed at distant, untreated tumor sites. Immune responses were evaluated by measuring T-cell activation after in vitro restimulation with autologous tumor cells. This in situ vaccination maneuver was well-tolerated with only grade 1 to 2 local or systemic reactions and no treatment-limiting adverse events. One patient had a complete clinical response, three others had partial responses, and two patients had stable but continually regressing disease for periods significantly longer than that achieved with prior therapies. Vaccination induced tumor-reactive memory CD8 T cells. Some patients' tumors were able to induce a suppressive, regulatory phenotype in autologous T cells in vitro; these patients tended to have a shorter time to disease progression. One clinically responding patient received a second course of vaccination after relapse resulting in a second, more rapid clinical response. In situ tumor vaccination with a TLR9 agonist induces systemic antilymphoma clinical responses. This maneuver is clinically feasible and does not require the production of a customized vaccine product.

    View details for DOI 10.1200/JCO.2010.28.9793

    View details for Web of Science ID 000282272700032

    View details for PubMedID 20697067

    View details for PubMedCentralID PMC2954133

  • Spectral Regularization Algorithms for Learning Large Incomplete Matrices JOURNAL OF MACHINE LEARNING RESEARCH Mazumder, R., Hastie, T., Tibshirani, R. 2010; 11: 2287-2322


    We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm Soft-Impute iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices: for example it can obtain a rank-80 approximation of a 10^6 × 10^6 incomplete matrix with 10^5 observed entries in 2.5 hours, and can fit a rank 40 approximation to the full Netflix training set in 6.6 hours. Our methods show very good performance both in training and test error when compared to other competitive state-of-the-art techniques.

    View details for Web of Science ID 000282523300010
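
    The Soft-Impute iteration described in the abstract above (fill the missing entries from the current estimate, then take a soft-thresholded SVD) can be sketched as below. This is a small dense illustration under stated assumptions: it uses a full SVD from NumPy rather than the structured low-rank SVD the abstract refers to, it runs a fixed number of iterations instead of testing convergence, and the function and argument names are hypothetical.

```python
import numpy as np

def soft_impute(X, observed, lam, n_iter=100):
    """Iterate Z <- soft-thresholded SVD of (X on observed cells, Z elsewhere).

    X:        2-D array; values at unobserved cells are ignored.
    observed: boolean array, True where X is observed.
    lam:      amount subtracted from each singular value (floored at 0).
    """
    Z = np.zeros_like(X, dtype=float)
    for _ in range(n_iter):
        filled = np.where(observed, X, Z)     # keep data, impute the rest
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        Z = (U * np.maximum(s - lam, 0.0)) @ Vt
    return Z

# Hypothetical example: a rank-1 matrix with roughly 30% of entries hidden.
rng = np.random.default_rng(2)
truth = np.outer(rng.standard_normal(20), rng.standard_normal(15))
observed = rng.random(truth.shape) > 0.3
estimate = soft_impute(truth, observed, lam=0.1)
print("estimated rank:", np.linalg.matrix_rank(estimate, tol=1e-8))
```

    Rerunning with a decreasing grid of `lam` values and starting each run from the previous estimate would give the warm-started regularization path mentioned in the abstract; the sketch restarts from zero for simplicity.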

  • Analysis of factorial time-course microarrays with application to a clinical study of burn injury. Proceedings of the National Academy of Sciences of the United States of America Zhou, B., Xu, W., Herndon, D., Tompkins, R., Davis, R., Xiao, W., Wong, W. H., Toner, M., Warren, H. S., Schoenfeld, D. A., Rahme, L., McDonald-Smith, G. P., Hayden, D., Mason, P., Fagan, S., Yu, Y., Cobb, J. P., Remick, D. G., Mannick, J. A., Lederer, J. A., Gamelli, R. L., Silver, G. M., West, M. A., Shapiro, M. B., Smith, R., Camp, D. G., Qian, W., Storey, J., Mindrinos, M., Tibshirani, R., Lowry, S., Calvano, S., Chaudry, I., West, M. A., Cohen, M., Moore, E. E., Johnson, J., Moldawer, L. L., Baker, H. V., Efron, P. A., Balis, U. G., Billiar, T. R., Ochoa, J. B., Sperry, J. L., Miller-Graziano, C. L., De, A. K., Bankey, P. E., Finnerty, C. C., Jeschke, M. G., Minei, J. P., Arnoldo, B. D., Hunt, J. L., Horton, J., Cobb, J. P., Brownstein, B., Freeman, B., Maier, R. V., Nathens, A. B., Cuschieri, J., Gibran, N., Klein, M., O'Keefe, G. 2010; 107 (22): 9923-9928


    Time-course microarray experiments are capable of capturing dynamic gene expression profiles. It is important to study how these dynamic profiles depend on the multiple factors that characterize the experimental condition under which the time course is observed. Analytic methods are needed to simultaneously handle the time course and factorial structure in the data. We developed a method to evaluate factor effects by pooling information across the time course while accounting for multiple testing and nonnormality of the microarray data. The method effectively extracts gene-specific response features and models their dependency on the experimental factors. Both longitudinal and cross-sectional time-course data can be handled by our approach. The method was used to analyze the impact of age on the temporal gene response to burn injury in a large-scale clinical study. Our analysis reveals that 21% of the genes responsive to burn are age-specific, among which expressions of mitochondria and immunoglobulin genes are differentially perturbed in pediatric and adult patients by burn injury. These new findings in the body's response to burn injury between children and adults support further investigations of therapeutic options targeting specific age groups. The methodology proposed here has been implemented in the R package "TANOVA" and submitted to the Comprehensive R Archive Network. It is also available for download.

    View details for DOI 10.1073/pnas.1002757107

    View details for PubMedID 20479259

  • A Framework for Feature Selection in Clustering JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Witten, D. M., Tibshirani, R. 2010; 105 (490): 713-726

    View details for DOI 10.1214/09-AOAS314

    View details for Web of Science ID 000283528500011

  • Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls BMC BIOLOGY Witten, D., Tibshirani, R., Gu, S. G., Fire, A., Lui, W. 2010; 8


    Ultra-high throughput sequencing technologies provide opportunities both for discovery of novel molecular species and for detailed comparisons of gene expression patterns. Small RNA populations are particularly well suited to this analysis, as many different small RNAs can be completely sequenced in a single instrument run. We prepared small RNA libraries from 29 tumour/normal pairs of human cervical tissue samples. Analysis of the resulting sequences (42 million in total) defined 64 new human microRNA (miRNA) genes. Both arms of the hairpin precursor were observed in twenty-three of the newly identified miRNA candidates. We tested several computational approaches for the analysis of class differences between high throughput sequencing datasets and describe a novel application of a log linear model that has provided the most effective analysis for this data. This method resulted in the identification of 67 miRNAs that were differentially-expressed between the tumour and normal samples at a false discovery rate less than 0.001. This approach can potentially be applied to any kind of RNA sequencing data for analysing differential sequence representation between biological sample sets.

    View details for DOI 10.1186/1741-7007-8-58

    View details for PubMedID 20459774

  • Cell type-specific gene expression differences in complex tissues NATURE METHODS Shen-Orr, S. S., Tibshirani, R., Khatri, P., Bodian, D. L., Staedtler, F., Perry, N. M., Hastie, T., Sarwal, M. M., Davis, M. M., Butte, A. J. 2010; 7 (4): 287-289


    We describe cell type-specific significance analysis of microarrays (csSAM) for analyzing differential gene expression for each cell type in a biological sample from microarray data and relative cell-type frequencies. First, we validated csSAM with predesigned mixtures and then applied it to whole-blood gene expression datasets from stable post-transplant kidney transplant recipients and those experiencing acute transplant rejection, which revealed hundreds of differentially expressed genes that were otherwise undetectable.

    View details for DOI 10.1038/NMETH.1439

    View details for PubMedID 20208531

  • Novel Cell-Type Specific Deconvolution of Whole-Blood Gene Expression Profiles in Renal Acute Rejection 10th American Transplant Congress Khatri, P., Shen-Orr, S., Tibshirani, R., Butte, A. J., Sarwal, M. M. WILEY-BLACKWELL. 2010: 294–294
  • C-C Chemokine Receptor 1 Expression in Human Hematolymphoid Neoplasia AMERICAN JOURNAL OF CLINICAL PATHOLOGY Anderson, M. W., Zhao, S., Ai, W. Z., Tibshirani, R., Levy, R., Lossos, I. S., Natkunam, Y. 2010; 133 (3): 473-483


    Chemokine receptor 1 (CCR1) is a G protein-coupled receptor that binds to members of the C-C chemokine family. Recently, CCL3 (MIP-1alpha), a high-affinity CCR1 ligand, was identified as part of a model that independently predicts survival in patients with diffuse large B-cell lymphoma (DLBCL). However, the role of chemokine signaling in the pathogenesis of human lymphomas is unclear. In normal human hematopoietic tissues, we found CCR1 expression in intraepithelial B cells of human tonsil and granulocytic/monocytic cells in the bone marrow. Immunohistochemical analysis of 944 cases of hematolymphoid neoplasia identified CCR1 expression in a subset of B- and T-cell lymphomas, plasma cell myeloma, acute myeloid leukemia, and classical Hodgkin lymphoma. CCR1 expression correlated with the non-germinal center subtype of DLBCL but did not predict overall survival in follicular lymphoma. These data suggest that CCR1 may be useful for lymphoma classification and support a role for chemokine signaling in the pathogenesis of hematolymphoid neoplasia.

    View details for DOI 10.1309/AJCP1TA3FLOQTMHF

    View details for PubMedID 20154287

  • Discovery of molecular subtypes in leiomyosarcoma through integrative molecular profiling ONCOGENE Beck, A. H., Lee, C., Witten, D. M., Gleason, B. C., Edris, B., Espinosa, I., Zhu, S., Li, R., Montgomery, K. D., Marinelli, R. J., Tibshirani, R., Hastie, T., Jablons, D. M., Rubin, B. P., Fletcher, C. D., West, R. B., van de Rijn, M. 2010; 29 (6): 845-854


    Leiomyosarcoma (LMS) is a soft tissue tumor with a significant degree of morphologic and molecular heterogeneity. We used integrative molecular profiling to discover and characterize molecular subtypes of LMS. Gene expression profiling was performed on 51 LMS samples. Unsupervised clustering showed three reproducible LMS clusters. Array comparative genomic hybridization (aCGH) was performed on 20 LMS samples and showed that the molecular subtypes defined by gene expression showed distinct genomic changes. Tumors from the 'muscle-enriched' cluster showed significantly increased copy number changes (P=0.04). A majority of the muscle-enriched cases showed loss at 16q24, which contains Fanconi anemia, complementation group A, known to have an important role in DNA repair, and loss at 1p36, which contains PRDM16, of which loss promotes muscle differentiation. Immunohistochemistry (IHC) was performed on LMS tissue microarrays (n=377) for five markers with high levels of messenger RNA in the muscle-enriched cluster (ACTG2, CASQ2, SLMAP, CFL2 and MYLK) and showed significantly correlated expression of the five proteins (all pairwise P<0.005). Expression of the five markers was associated with improved disease-specific survival in a multivariate Cox regression analysis (P<0.04). In this analysis that combined gene expression profiling, aCGH and IHC, we characterized distinct molecular LMS subtypes, provided insight into their pathogenesis, and identified prognostic biomarkers.

    View details for DOI 10.1038/onc.2009.381

    View details for PubMedID 19901961

  • Regularization Paths for Generalized Linear Models via Coordinate Descent JOURNAL OF STATISTICAL SOFTWARE Friedman, J., Hastie, T., Tibshirani, R. 2010; 33 (1): 1-22


    We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.

    View details for Web of Science ID 000275203200001

    View details for PubMedCentralID PMC2929880
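
As a rough illustration of the cyclical coordinate descent idea in the abstract above, here is a minimal sketch for the linear-regression case only. It is not the authors' software. It assumes centered columns, no intercept, and the objective (1/2n)||y - Xb||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2). The function names `soft_threshold`, `elastic_net_cd`, and `elastic_net_path` are hypothetical, chosen for this example.

```python
def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, lam, alpha=1.0, beta0=None, n_sweeps=200, tol=1e-10):
    """Cyclical coordinate descent for the elastic net (illustrative sketch).

    X is a list of n rows (each a list of p floats), y a list of n floats.
    alpha=1 gives the lasso, alpha=0 ridge regression.
    Assumption: columns are centered and no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = list(beta0) if beta0 is not None else [0.0] * p
    # Current residual y - X beta (supports warm starts via beta0).
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # (1/n) * <x_j, partial residual>, adding back coordinate j's own fit.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


def elastic_net_path(X, y, lambdas, alpha=1.0):
    """Fit along a decreasing lambda sequence, warm-starting from the previous fit."""
    beta, path = None, []
    for lam in sorted(lambdas, reverse=True):
        beta = elastic_net_cd(X, y, lam, alpha, beta0=beta)
        path.append((lam, beta))
    return path
```

Warm starts are what make a whole path cheap in this sketch: each fit begins from the solution at the previous, slightly larger penalty, so only a few sweeps are usually needed.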

  • CD81 protein is expressed at high levels in normal germinal center B cells and in subtypes of human lymphomas HUMAN PATHOLOGY Luo, R. F., Zhao, S., Tibshirani, R., Myklebust, J. H., Sanyal, M., Fernandez, R., Gratzinger, D., Marinelli, R. J., Lu, Z. S., Wong, A., Levy, R., Levy, S., Natkunam, Y. 2010; 41 (2): 271-280


    CD81 is a tetraspanin cell surface protein that regulates CD19 expression in B lymphocytes and enables hepatitis C virus infection of human cells. Immunohistologic analysis in normal hematopoietic tissue showed strong staining for CD81 in normal germinal center B cells, a cell type in which its increased expression has not been previously recognized. High-dimensional flow cytometry analysis of normal hematopoietic tissue confirmed that among B- and T-cell subsets, germinal center B cells showed the highest level of CD81 expression. In more than 800 neoplastic tissue samples, its expression was also found in most non-Hodgkin lymphomas. Staining for CD81 was rarely seen in multiple myeloma, Hodgkin lymphoma, or myeloid leukemia. In hierarchical cluster analysis of diffuse large B-cell lymphoma, staining for CD81 was most similar to other germinal center B cell-associated markers, particularly LMO2. By flow cytometry, CD81 was expressed in diffuse large B-cell lymphoma cells independent of the presence or absence of CD10, another germinal center B-cell marker. The detection of CD81 in routine biopsy samples and its differential expression in lymphoma subtypes, particularly diffuse large B-cell lymphoma, warrant further study to assess CD81 expression and its role in the risk stratification of patients with diffuse large B-cell lymphoma.

    View details for DOI 10.1016/j.humpath.2009.07.022

    View details for PubMedID 20004001

  • DR-Integrator: a new analytic tool for integrating DNA copy number and gene expression data BIOINFORMATICS Salari, K., Tibshirani, R., Pollack, J. R. 2010; 26 (3): 414-416


    DNA copy number alterations (CNA) frequently underlie gene expression changes by increasing or decreasing gene dosage. However, only a subset of genes with altered dosage exhibit concordant changes in gene expression. This subset is likely to be enriched for oncogenes and tumor suppressor genes, and can be identified by integrating these two layers of genome-scale data. We introduce DNA/RNA-Integrator (DR-Integrator), a statistical software tool to perform integrative analyses on paired DNA copy number and gene expression data. DR-Integrator identifies genes with significant correlations between DNA copy number and gene expression, and implements a supervised analysis that captures genes with significant alterations in both DNA copy number and gene expression between two sample classes. DR-Integrator is freely available for non-commercial use from the Pollack Lab at and can be downloaded as a plug-in application to Microsoft Excel and as a package for the R statistical computing environment. The R package is available under the name 'DRI' at An example analysis using DR-Integrator is included as supplemental material. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btp702

    View details for PubMedID 20031972

  • Survival analysis with high-dimensional covariates STATISTICAL METHODS IN MEDICAL RESEARCH Witten, D. M., Tibshirani, R. 2010; 19 (1): 29-51


    In recent years, breakthroughs in biomedical technology have led to a wealth of data in which the number of features (for instance, genes on which expression measurements are available) exceeds the number of observations (e.g. patients). Sometimes survival outcomes are also available for those same observations. In this case, one might be interested in (a) identifying features that are associated with survival (in a univariate sense), and (b) developing a multivariate model for the relationship between the features and survival that can be used to predict survival in a new observation. Due to the high dimensionality of this data, most classical statistical methods for survival analysis cannot be applied directly. Here, we review a number of methods from the literature that address these two problems.

    View details for DOI 10.1177/0962280209105024

    View details for PubMedID 19654171

  • 3'-End Sequencing for Expression Quantification (3SEQ) from Archival Tumor Samples PLOS ONE Beck, A. H., Weng, Z., Witten, D. M., Zhu, S., Foley, J. W., Lacroute, P., Smith, C. L., Tibshirani, R., van de Rijn, M., Sidow, A., West, R. B. 2010; 5 (1)


    Gene expression microarrays are the most widely used technique for genome-wide expression profiling. However, microarrays do not perform well on formalin fixed paraffin embedded tissue (FFPET). Consequently, microarrays cannot be effectively utilized to perform gene expression profiling on the vast majority of archival tumor samples. To address this limitation of gene expression microarrays, we designed a novel procedure (3'-end sequencing for expression quantification (3SEQ)) for gene expression profiling from FFPET using next-generation sequencing. We performed gene expression profiling by 3SEQ and microarray on both frozen tissue and FFPET from two soft tissue tumors (desmoid type fibromatosis (DTF) and solitary fibrous tumor (SFT)) (total n = 23 samples, which were each profiled by at least one of the four platform-tissue preparation combinations). Analysis of 3SEQ data revealed many genes differentially expressed between the tumor types (FDR<0.01) on both the frozen tissue (approximately 9.6K genes) and FFPET (approximately 8.1K genes). Analysis of microarray data from frozen tissue revealed fewer differentially expressed genes (approximately 4.64K), and analysis of microarray data on FFPET revealed very few (69) differentially expressed genes. Functional gene set analysis of 3SEQ data from both frozen tissue and FFPET identified biological pathways known to be important in DTF and SFT pathogenesis and suggested several additional candidate oncogenic pathways in these tumors. These findings demonstrate that 3SEQ is an effective technique for gene expression profiling from archival tumor samples and may facilitate significant advances in translational cancer research.

    View details for DOI 10.1371/journal.pone.0008768

    View details for PubMedID 20098735

  • Predicting Patient Survival from Longitudinal Gene Expression STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY Zhang, Y., Tibshirani, R. J., Davis, R. W. 2010; 9 (1)


    Characterizing dynamic gene expression pattern and predicting patient outcome is now significant and will be of more interest in the future with large scale clinical investigation of microarrays. However, there is currently no method that has been developed for prediction of patient outcome using longitudinal gene expression, where gene expression of patients is being monitored across time. Here, we propose a novel prediction approach for patient survival time that makes use of time course structure of gene expression. This method is applied to a burn study. The genes involved in the final predictors are enriched in the inflammatory response and immune system related pathways. Moreover, our method is consistently better than prediction methods using individual time point gene expression or simply pooling gene expression from each time point.

    View details for DOI 10.2202/1544-6115.1617

    View details for PubMedID 21126232

  • Lymphoma cell VEGFR2 expression detected by immunohistochemistry predicts poor overall survival in diffuse large B cell lymphoma treated with immunochemotherapy (R-CHOP) BRITISH JOURNAL OF HAEMATOLOGY Gratzinger, D., Advani, R., Zhao, S., Talreja, N., Tibshirani, R. J., Shyam, R., Horning, S., Sehn, L. H., Farinha, P., Briones, J., Lossos, I. S., Gascoyne, R. D., Natkunam, Y. 2010; 148 (2): 235-244


    Diffuse large B cell lymphoma (DLBCL) is clinically and biologically heterogeneous. In most cases of DLBCL, lymphoma cells co-express vascular endothelial growth factor (VEGF) and its receptors VEGFR1 and VEGFR2, suggesting autocrine in addition to angiogenic effects. We enumerated microvessel density and scored lymphoma cell expression of VEGF, VEGFR1, VEGFR2 and phosphorylated VEGFR2 in 162 de novo DLBCL patients treated with R-CHOP (rituximab, cyclophosphamide, vincristine, doxorubicin and prednisone)-like regimens. VEGFR2 expression correlated with shorter overall survival (OS) independent of International Prognostic Index (IPI) (P = 0.0028). Phosphorylated VEGFR2 (detected in 13% of cases) correlated with shorter progression-free survival (PFS, P = 0.044) and trended toward shorter OS on univariate analysis. VEGFR1 was not predictive of survival on univariate analysis, but it did correlate with better OS on multivariate analysis with VEGF, VEGFR2 and IPI (P = 0.036); in patients with weak VEGFR2, lack of VEGFR1 coexpression was significantly correlated with poor OS independent of IPI (P = 0.01). These results are concordant with our prior finding of an association of VEGFR1 with longer OS in DLBCL treated with chemotherapy alone. We postulate that VEGFR1 may oppose autocrine VEGFR2 signalling in DLBCL by competing for VEGF binding. In contrast to our prior results with chemotherapy alone, microvessel density was not prognostic of PFS or OS with R-CHOP-like therapy.

    View details for DOI 10.1111/j.1365-2141.2009.07942.x

    View details for PubMedID 19821819

  • Local false discovery rate facilitates comparison of different microarray experiments NUCLEIC ACIDS RESEARCH Hong, W., Tibshirani, R., Chu, G. 2009; 37 (22): 7483-7497


    The local false discovery rate (LFDR) estimates the probability of falsely identifying specific genes with changes in expression. In computer simulations, LFDR <10% successfully identified genes with changes in expression, while LFDR >90% identified genes without changes. We used LFDR to compare different microarray experiments quantitatively: (i) Venn diagrams of genes with and without changes in expression, (ii) scatter plots of the genes, (iii) correlation coefficients in the scatter plots and (iv) distributions of gene function. To illustrate, we compared three methods for pre-processing microarray data. Correlations between methods were high (r = 0.84-0.92). However, responses were often different in magnitude, and sometimes discordant, even though the methods used the same raw data. LFDR complements functional assessments like gene set enrichment analysis. To illustrate, we compared responses to ultraviolet radiation (UV), ionizing radiation (IR) and tobacco smoke. Compared to unresponsive genes, genes responsive to both UV and IR were enriched for cell cycle, mitosis, and DNA repair functions. Genes responsive to UV but not IR were depleted for cell adhesion functions. Genes responsive to tobacco smoke were enriched for detoxification functions. Thus, LFDR reveals differences and similarities among experiments.

    View details for DOI 10.1093/nar/gkp813

    View details for Web of Science ID 000272935000021

    View details for PubMedID 19825981

    View details for PubMedCentralID PMC2794175

  • Relationship of differential gene expression profiles in CD34(+) myelodysplastic syndrome marrow cells to disease subtype and progression BLOOD Sridhar, K., Ross, D. T., Tibshirani, R., Butte, A. J., Greenberg, P. L. 2009; 114 (23): 4847-4858


    Microarray analysis with 40 000 cDNA gene chip arrays determined differential gene expression profiles (GEPs) in CD34(+) marrow cells from myelodysplastic syndrome (MDS) patients compared with healthy persons. Using focused bioinformatics analyses, we found 1175 genes significantly differentially expressed by MDS versus normal, requiring a minimum of 39 genes to separately classify these patients. Major GEP differences were demonstrated between healthy and MDS patients and between several MDS subgroups: (1) those whose disease remained stable and those who subsequently transformed (tMDS) to acute myeloid leukemia; (2) between del(5q) and other MDS patients. A 6-gene "poor risk" signature was defined, which was associated with acute myeloid leukemia transformation and provided additive prognostic information for International Prognostic Scoring System Intermediate-1 patients. Overexpression of genes generating ribosomal proteins and for other signaling pathways was demonstrated in the tMDS patients. Comparison of del(5q) with the remaining MDS patients showed 1924 differentially expressed genes, with underexpression of 1014 genes, 11 of which were within the 5q31-32 commonly deleted region. These data demonstrated (1) GEPs distinguishing MDS patients from healthy and between those with differing clinical outcomes (tMDS vs those whose disease remained stable) and cytogenetics [eg, del(5q)]; and (2) molecular criteria refining prognostic categorization and associated biologic processes in MDS.

    View details for DOI 10.1182/blood-2009-08-236422

    View details for PubMedID 19801443

  • Disease signatures are robust across tissues and experiments MOLECULAR SYSTEMS BIOLOGY Dudley, J. T., Tibshirani, R., Deshpande, T., Butte, A. J. 2009; 5


    Meta-analyses combining gene expression microarray experiments offer new insights into the molecular pathophysiology of disease not evident from individual experiments. Although the established technical reproducibility of microarrays serves as a basis for meta-analysis, pathophysiological reproducibility across experiments is not well established. In this study, we carried out a large-scale analysis of disease-associated experiments obtained from NCBI GEO, and evaluated their concordance across a broad range of diseases and tissue types. On evaluating 429 experiments, representing 238 diseases and 122 tissues from 8435 microarrays, we find evidence for a general, pathophysiological concordance between experiments measuring the same disease condition. Furthermore, we find that the molecular signature of disease across tissues is overall more prominent than the signature of tissue expression across diseases. The results offer new insight into the quality of public microarray data using pathophysiological metrics, and support new directions in meta-analysis that include characterization of the commonalities of disease irrespective of tissue, as well as the creation of multi-tissue systems models of disease pathology using public data.

    View details for DOI 10.1038/msb.2009.66

    View details for PubMedID 19756046

  • A Network Model of a Cooperative Genetic Landscape in Brain Tumors JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION Bredel, M., Scholtens, D. M., Harsh, G. R., Bredel, C., Chandler, J. P., Renfrow, J. J., Yadav, A. K., Vogel, H., Scheck, A. C., Tibshirani, R., Sikic, B. I. 2009; 302 (3): 261-275


    Gliomas, particularly glioblastomas, are among the deadliest of human tumors. Gliomas emerge through the accumulation of recurrent chromosomal alterations, some of which target yet-to-be-discovered cancer genes. A persistent question concerns the biological basis for the coselection of these alterations during gliomagenesis. To describe a network model of a cooperative genetic landscape in gliomas and to evaluate its clinical relevance. Multidimensional genomic profiles and clinical profiles of 501 patients with gliomas (45 tumors in an initial discovery set collected between 2001 and 2004 and 456 tumors in validation sets made public between 2006 and 2008) from multiple academic centers in the United States and The Cancer Genome Atlas Pilot Project (TCGA). Identification of genes with coincident genetic alterations, correlated gene dosage and gene expression, and multiple functional interactions; association between those genes and patient survival. Gliomas select for a nonrandom genetic landscape (a consistent pattern of chromosomal alterations) that involves altered regions ("territories") on chromosomes 1p, 7, 8q, 9p, 10, 12q, 13q, 19q, 20, and 22q (false-discovery rate-corrected P<.05). A network model shows that these territories harbor genes with putative synergistic, tumor-promoting relationships. The coalteration of the most interactive of these genes in glioblastoma is associated with unfavorable patient survival. A multigene risk scoring model based on 7 landscape genes (POLD2, CYCS, MYC, AKR1C3, YME1L1, ANXA7, and PDCD4) is associated with the duration of overall survival in 189 glioblastoma samples from TCGA (global log-rank P = .02 comparing 3 survival curves for patients with 0-2, 3-4, and 5-7 dosage-altered genes). Groups of patients with 0 to 2 (low-risk group) and 5 to 7 (high-risk group) dosage-altered genes experienced 49.24 and 79.56 deaths per 100 person-years (hazard ratio [HR], 1.63; 95% confidence interval [CI], 1.10-2.40; Cox regression model P = .02), respectively. These associations with survival are validated using gene expression data in 3 independent glioma studies, comprising 76 (global log-rank P = .003; 47.89 vs 15.13 deaths per 100 person-years for high risk vs low risk; Cox model HR, 3.04; 95% CI, 1.49-6.20; P = .002) and 70 (global log-rank P = .008; 83.43 vs 16.14 deaths per 100 person-years for high risk vs low risk; HR, 3.86; 95% CI, 1.59-9.35; P = .003) high-grade gliomas and 191 glioblastomas (global log-rank P = .002; 83.23 vs 34.16 deaths per 100 person-years for high risk vs low risk; HR, 2.27; 95% CI, 1.44-3.58; P<.001). The alteration of multiple networking genes by recurrent chromosomal aberrations in gliomas deregulates critical signaling pathways through multiple, cooperative mechanisms. These mutations, which are likely due to nonrandom selection of a distinct genetic landscape during gliomagenesis, are associated with patient prognosis.

    View details for Web of Science ID 000267948100020

    View details for PubMedID 19602686

  • A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis BIOSTATISTICS Witten, D. M., Tibshirani, R., Hastie, T. 2009; 10 (3): 515-534


    We present a penalized matrix decomposition (PMD), a new framework for computing a rank-K approximation for a matrix. We approximate the matrix $X$ as $\hat{X} = \sum_{k=1}^{K} d_k u_k v_k^T$, where $d_k$, $u_k$, and $v_k$ minimize the squared Frobenius norm of $X - \hat{X}$, subject to penalties on $u_k$ and $v_k$. This results in a regularized version of the singular value decomposition. Of particular interest is the use of $L_1$-penalties on $u_k$ and $v_k$, which yields a decomposition of $X$ using sparse vectors. We show that when the PMD is applied using an $L_1$-penalty on $v_k$ but not on $u_k$, a method for sparse principal components results. In fact, this yields an efficient algorithm for the "SCoTLASS" proposal (Jolliffe and others 2003) for obtaining sparse principal components. This method is demonstrated on a publicly available gene expression data set. We also establish connections between the SCoTLASS method for sparse principal component analysis and the method of Zou and others (2006). In addition, we show that when the PMD is applied to a cross-products matrix, it results in a method for penalized canonical correlation analysis (CCA). We apply this penalized CCA method to simulated data and to a genomic data set consisting of gene expression and DNA copy number measurements on the same set of samples.

    View details for DOI 10.1093/biostatistics/kxp008

    View details for PubMedID 19377034
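
To make the rank-1 case of the decomposition above concrete, here is a minimal sketch, not the authors' implementation. It assumes the penalties are L1-norm bounds `c_u` and `c_v` (each at least 1) on unit-length vectors u and v, and it alternates soft-thresholded updates of u and v, choosing each threshold by bisection. The names `pmd_rank1` and `_l1_constrained_unit`, and the all-ones starting vector, are hypothetical choices made for this example; the start can fail if every row of X sums to zero.

```python
import math


def _soft(a, delta):
    """Element-wise soft-thresholding of the list a by delta >= 0."""
    return [math.copysign(max(abs(x) - delta, 0.0), x) for x in a]


def _l2_normalize(a):
    norm = math.sqrt(sum(x * x for x in a))
    return [x / norm for x in a] if norm > 0 else list(a)


def _l1_constrained_unit(a, c, n_bisect=60):
    """Unit-L2 vector proportional to S(a, delta), with delta >= 0 found by
    bisection so that the L1 norm is at most c (assumes c >= 1)."""
    u = _l2_normalize(a)
    if sum(abs(x) for x in u) <= c:
        return u  # the L1 bound is inactive, so no thresholding is needed
    lo, hi = 0.0, max(abs(x) for x in a)
    for _ in range(n_bisect):
        mid = (lo + hi) / 2.0
        if sum(abs(x) for x in _l2_normalize(_soft(a, mid))) > c:
            lo = mid
        else:
            hi = mid
    return _l2_normalize(_soft(a, hi))


def pmd_rank1(X, c_u, c_v, n_iter=100):
    """Rank-1 penalized decomposition X ~ d * u * v^T (illustrative sketch)."""
    n, p = len(X), len(X[0])
    v = _l2_normalize([1.0] * p)  # assumed starting point, not from the paper
    u = [0.0] * n
    for _ in range(n_iter):
        Xv = [sum(X[i][j] * v[j] for j in range(p)) for i in range(n)]
        u = _l1_constrained_unit(Xv, c_u)
        Xtu = [sum(X[i][j] * u[i] for i in range(n)) for j in range(p)]
        v = _l1_constrained_unit(Xtu, c_v)
    d = sum(u[i] * X[i][j] * v[j] for i in range(n) for j in range(p))
    return d, u, v
```

Setting `c_u` to at least the square root of the number of rows leaves u unpenalized, which corresponds to the sparse principal components setting mentioned in the abstract.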

  • Alteration of Gene Expression Signatures of Cortical Differentiation and Wound Response in Lethal Clear Cell Renal Cell Carcinomas PLOS ONE Zhao, H., Ma, Z., Tibshirani, R., Higgins, J. P., Ljungberg, B., Brooks, J. D. 2009; 4 (6)


    Clear cell renal cell carcinoma (ccRCC) is the most common malignancy of the adult kidney and displays heterogeneity in clinical outcomes. Through comprehensive gene expression profiling, we have identified previously a set of transcripts that predict survival following nephrectomy independent of tumor stage, grade, and performance status. These transcripts, designated as the SPC (supervised principal components) gene set, show no apparent biological or genetic features that provide insight into renal carcinogenesis or tumor progression. We explored the relationship of this gene list to a set of genes expressed in different anatomical segments of the normal kidney including the cortex (cortex gene set) and the glomerulus (glomerulus gene set), and a gene set expressed after serum stimulation of quiescent fibroblasts (the core serum response or CSR gene set). Interestingly, the normal cortex, glomerulus (part of the normal renal cortex), and CSR gene sets captured more than 1/5 of the genes in the highly prognostic SPC gene set. Based on gene expression patterns alone, the SPC gene set could be used to sort samples from normal adult kidneys by the anatomical regions from which they were dissected. Tumors whose gene expression profiles most resembled the normal renal cortex or glomerulus showed better survival than those that did not, and those with expression features more similar to CSR showed poorer survival. While the cortex, glomerulus, and CSR signatures predicted survival independent of traditional clinical parameters, they were not independent of the SPC gene list. Our findings suggest that critical biological features of lethal ccRCC include loss of normal cortical differentiation and activation of programs associated with wound healing.

    View details for DOI 10.1371/journal.pone.0006039

    View details for PubMedID 19557179

  • Anti-idiotype antibody response after vaccination correlates with better overall survival in follicular lymphoma BLOOD Ai, W. Z., Tibshirani, R., Taidi, B., Czerwinski, D., Levy, R. 2009; 113 (23): 5743-5746


    Previous studies demonstrated that vaccination-induced tumor-specific immune response is associated with superior clinical outcome in patients with follicular lymphoma. Here, we investigated whether this positive correlation extends to overall survival (OS). We analyzed 91 untreated patients who received CVP chemotherapy (cyclophosphamide, vincristine, and prednisone) followed by idiotype vaccination. Idiotype proteins were produced either by the hybridoma method or by expression of recombinant idiotype-encoding sequences in mammalian or plant-based expression systems. We found that achieving a complete response/complete response unconfirmed (CR/CRu) to CVP and making an anti-idiotype antibody are 2 independent factors that each correlated with longer OS at 10 years (89% vs 68% with or without a CR/CRu, P = .024; 90% vs 69% with or without tumor-specific antibody production; P = .027). In the subset of patients who received hybridoma-generated vaccines, we found that anti-idiotype production was even more highly associated with superior OS (P < .002); this was the case even in patients with a partial response (PR) to CVP (P < .001).

    View details for DOI 10.1182/blood-2009-01-201988

    View details for Web of Science ID 000266656100013

    View details for PubMedID 19346494

    View details for PubMedCentralID PMC2700314


    View details for DOI 10.1214/08-AOAS224

    View details for Web of Science ID 000271979600014

  • Prognostic significance of vascular endothelial growth factor (VEGF), VEGF receptors (VEGFR), and vascularity in diffuse large B-cell lymphoma treated with immunochemotherapy (R-CHOP) 45th Annual Meeting of the American-Society-of-Clinical-Oncology (ASCO) Gratzinger, D., Advani, R., Zhao, S., Talreja, N., Tibshirani, R. J., Horning, S. J., Levy, R., Lossos, I. S., Gascoyne, R. D., Natkunam, Y. AMER SOC CLINICAL ONCOLOGY. 2009
  • Correlation of RRM1 expression in muscle invasive locally advanced urothelial cancer with age 45th Annual Meeting of the American-Society-of-Clinical-Oncology (ASCO) Harshman, L. C., Bepler, G., Zheng, Z., Higgins, J. P., Allen, G. I., Tibshirani, R., Srinivas, S. AMER SOC CLINICAL ONCOLOGY. 2009
  • Differentiation stage-specific expression of microRNAs in B lymphocytes and diffuse large B-cell lymphomas BLOOD Malumbres, R., Sarosiek, K. A., Cubedo, E., Ruiz, J. W., Jiang, X., Gascoyne, R. D., Tibshirani, R., Lossos, I. S. 2009; 113 (16): 3754-3764


    miRNAs are small RNA molecules binding to partially complementary sites in the 3'-UTR of target transcripts and repressing their expression. miRNAs orchestrate multiple cellular functions and play critical roles in cell differentiation and cancer development. We analyzed miRNA profiles in B-cell subsets during peripheral B-cell differentiation as well as in diffuse large B-cell lymphoma (DLBCL) cells. Our results show temporal changes in the miRNA expression during B-cell differentiation with a highly unique miRNA profile in germinal center (GC) lymphocytes. We provide experimental evidence that these changes may be physiologically relevant by demonstrating that GC-enriched hsa-miR-125b down-regulates the expression of IRF4 and PRDM1/BLIMP1, and memory B cell-enriched hsa-miR-223 down-regulates the expression of LMO2. We further demonstrate that although an important component of the biology of a malignant cell is inherited from its nontransformed cellular progenitor (GC centroblasts), aberrant miRNA expression is acquired upon cell transformation. A 9-miRNA signature was identified that could precisely differentiate the 2 major subtypes of DLBCL. Finally, expression of some of the miRNAs in this signature is correlated with clinical outcome of uniformly treated DLBCL patients.

    View details for DOI 10.1182/blood-2008-10-184077

    View details for Web of Science ID 000265445900016

    View details for PubMedID 19047678

  • Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods JOURNAL OF MACHINE LEARNING RESEARCH Hoefling, H., Tibshirani, R. 2009; 10: 883-906


    We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudo-likelihood criterion so that each additional iteration moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate.

    View details for Web of Science ID 000270824600003

  • Temporal Changes in Gene Expression Induced by Sulforaphane in Human Prostate Cancer Cells PROSTATE Bhamre, S., Sahoo, D., Tibshirani, R., Dill, D. L., Brooks, J. D. 2009; 69 (2): 181-190


    Prostate cancer is thought to arise as a result of oxidative stresses and induction of antioxidant electrophile defense (phase 2) enzymes has been proposed as a prostate cancer prevention strategy. The isothiocyanate sulforaphane, derived from cruciferous vegetables like broccoli, potently induces surrogate markers of phase 2 enzyme activity in prostate cells in vitro and in vivo. To better understand the temporal effects of sulforaphane and broccoli sprouts on gene expression in prostate cells, we carried out comprehensive transcriptome analysis using cDNA microarrays. Transcripts significantly modulated by sulforaphane over time were identified using StepMiner analysis. Ingenuity Pathway Analysis (IPA) was used to identify biological pathways, networks, and functions significantly altered by sulforaphane treatment. StepMiner and IPA revealed significant changes in many transcripts associated with cell growth and cell cycle, as well as a significant number associated with cellular response to oxidative damage and stress. Comparison to an existing dataset suggested that sulforaphane blocked cell growth by inducing G2/M arrest. Cell growth assays and flow cytometry analysis confirmed that sulforaphane inhibited cell growth and induced cell cycle arrest. Our data suggest that in prostate cells sulforaphane primarily induces cellular defenses and inhibits cell growth by causing G2/M phase arrest. Furthermore, based on the striking similarities in the gene expression patterns induced across experiments in these cells, sulforaphane appears to be the primary bioactive compound present in broccoli sprouts, suggesting that broccoli sprouts can serve as a suitable source for sulforaphane in intervention trials.

    View details for DOI 10.1002/pros.20869

    View details for PubMedID 18973173

  • Discovery of Molecular Subtypes in Leiomyosarcoma through Integrative Molecular Profiling 98th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology Beck, A. H., Lee, C. H., Witten, D. M., Zhou, S., Montgomery, K., Tibshirani, R., Hastie, T., West, R. B., van de Rijn, M. NATURE PUBLISHING GROUP. 2009: 368A–368A
  • CD81 Protein Is Expressed in Normal Germinal Center B-Cells and in Subtypes of Human Non-Hodgkin Lymphomas 98th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology Luo, R. F., Zhao, S., Tibshirani, R., Lossos, I. S., Advani, R., Gratzinger, D., Wong, A., Talrega, N., Levy, R., Levy, S., Natkunam, Y. NATURE PUBLISHING GROUP. 2009: 275A–275A
  • Covariance-regularized regression and classification for high dimensional problems JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Witten, D. M., Tibshirani, R. 2009; 71: 615-636


    In recent years, many methods have been developed for regression in high-dimensional settings. We propose covariance-regularized regression, a family of methods that use a shrunken estimate of the inverse covariance matrix of the features in order to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing its log likelihood, under a multivariate normal model, subject to a constraint on its elements; this estimate is then used to estimate coefficients for the regression of the response onto the features. We show that ridge regression, the lasso, and the elastic net are special cases of covariance-regularized regression, and we demonstrate that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations. The covariance-regularized regression framework is extended to generalized linear models and linear discriminant analysis, and is used to analyze gene expression data sets with multiple class and survival outcomes.

    View details for DOI 10.1111/j.1467-9868.2009.00699.x

    View details for Web of Science ID 000266602200003

  • Blood autoantibody and cytokine profiles predict response to anti-tumor necrosis factor therapy in rheumatoid arthritis ARTHRITIS RESEARCH & THERAPY Hueber, W., Tomooka, B. H., Batliwalla, F., Li, W., Monach, P. A., Tibshirani, R. J., Van Vollenhoven, R. F., Lampa, J., Saito, K., Tanaka, Y., Genovese, M. C., Klareskog, L., Gregersen, P. K., Robinson, W. H. 2009; 11 (3)


    Anti-TNF therapies have revolutionized the treatment of rheumatoid arthritis (RA), a common systemic autoimmune disease involving destruction of the synovial joints. However, in the practice of rheumatology approximately one-third of patients demonstrate no clinical improvement in response to treatment with anti-TNF therapies, while another third demonstrate a partial response, and one-third an excellent and sustained response. Since no clinical or laboratory tests are available to predict response to anti-TNF therapies, great need exists for predictive biomarkers. Here we present a multi-step proteomics approach using arthritis antigen arrays, a multiplex cytokine assay, and conventional ELISA, with the objective to identify a biomarker signature in three ethnically diverse cohorts of RA patients treated with the anti-TNF therapy etanercept. We identified a 24-biomarker signature that enabled prediction of a positive clinical response to etanercept in all three cohorts (positive predictive values 58 to 72%; negative predictive values 63 to 78%). We identified a multi-parameter protein biomarker that enables pretreatment classification and prediction of etanercept responders, and tested this biomarker using three independent cohorts of RA patients. Although further validation in prospective and larger cohorts is needed, our observations demonstrate that multiplex characterization of autoantibodies and cytokines provides clinical utility for predicting response to the anti-TNF therapy etanercept in RA patients.

    View details for DOI 10.1186/ar2706

    View details for PubMedID 19460157

  • Univariate Shrinkage in the Cox Model for High Dimensional Data STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY Tibshirani, R. J. 2009; 8 (1)


    We propose a method for prediction in Cox's proportional hazards model, when the number of features (regressors), p, exceeds the number of observations, n. The method assumes that the features are independent in each risk set, so that the partial likelihood factors into a product. As such, it is analogous to univariate thresholding in linear regression and nearest shrunken centroids in classification. We call the procedure Cox univariate shrinkage and demonstrate its usefulness on real and simulated data. The method has the attractive property of being essentially univariate in its operation: the features are entered into the model based on the size of their Cox score statistics. We illustrate the new method on real and simulated data, and compare it to other proposed methods for survival prediction with a large number of predictors.

    View details for PubMedID 19409065
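
    The ranking step mentioned in the abstract above (features ordered by the size of their Cox score statistics) can be sketched as follows. This is only an illustration under stated assumptions, not the published Cox univariate shrinkage procedure: it computes the standard score statistic at beta = 0 for one feature at a time, ignores special handling of tied event times, and uses made-up toy data. The function names `cox_score` and `rank_features` are hypothetical.

```python
import math


def cox_score(times, events, x):
    """Score statistic (evaluated at beta = 0) for a single feature.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    x      : values of one feature, one per observation
    """
    u = 0.0  # sum over events of (x_i - mean of x in the risk set)
    v = 0.0  # sum over events of the variance of x in the risk set
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored observations contribute only through risk sets
        risk = [x[k] for k, t_k in enumerate(times) if t_k >= t_i]
        mean = sum(risk) / len(risk)
        u += x[i] - mean
        v += sum((r - mean) ** 2 for r in risk) / len(risk)
    return u / math.sqrt(v) if v > 0 else 0.0


def rank_features(times, events, features):
    """Return feature names ordered by decreasing |score|, plus the scores."""
    scores = {name: cox_score(times, events, col) for name, col in features.items()}
    order = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return order, scores


# Toy data (hypothetical): "risk" is high for subjects who fail early.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
features = {
    "risk": [6, 5, 4, 3, 2, 1],
    "noise": [2, 1, 2, 1, 2, 1],
}
order, scores = rank_features(times, events, features)
print(order, scores)
```

    Because each score depends on one feature only, the ranking costs one pass per feature, which is what makes a univariate approach attractive when p greatly exceeds n.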

  • Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY Witten, D. M., Tibshirani, R. J. 2009; 8 (1)


    In recent work, several authors have introduced methods for sparse canonical correlation analysis (sparse CCA). Suppose that two sets of measurements are available on the same set of observations. Sparse CCA is a method for identifying sparse linear combinations of the two sets of variables that are highly correlated with each other. It has been shown to be useful in the analysis of high-dimensional genomic data, when two sets of assays are available on the same set of samples. In this paper, we propose two extensions to the sparse CCA methodology. (1) Sparse CCA is an unsupervised method; that is, it does not make use of outcome measurements that may be available for each observation (e.g., survival time or cancer subtype). We propose an extension to sparse CCA, which we call sparse supervised CCA, which results in the identification of linear combinations of the two sets of variables that are correlated with each other and associated with the outcome. (2) It is becoming increasingly common for researchers to collect data on more than two assays on the same set of samples; for instance, SNP, gene expression, and DNA copy number measurements may all be available. We develop sparse multiple CCA in order to extend the sparse CCA methodology to the case of more than two data sets. We demonstrate these new methods on simulated data and on a recently published and publicly available diffuse large B-cell lymphoma data set.

    View details for DOI 10.2202/1544-6115.1470

    View details for PubMedID 19572827

  • Lymphoma-Expressed VEGF-A, VEGFR-1, VEGFR-2, and Microvessel Density Are Not Predictive of Overall Survival in Follicular Lymphoma. 50th Annual Meeting of the American-Society-of-Hematology/ASH/ASCO Joint Symposium Gratzinger, D., Zhao, S., Ai, W., Tibshirani, R., Levy, R., Natkunam, Y. AMER SOC HEMATOLOGY. 2008: 1290–90
  • Differentiation-Stage-Specific Expression of MicroRNAs in B-Lymphocytes and Diffuse Large B-Cell Lymphomas (DLBCL) 50th Annual Meeting of the American-Society-of-Hematology/ASH/ASCO Joint Symposium Malumbres, R., Tibshirani, R., Cubedo, E., Sarosiek, K. A., Jiang, X., Ruiz, J., Lossos, I. AMER SOC HEMATOLOGY. 2008: 299–99
  • LMO2 Protein Expression Predicts Survival in Patients with Diffuse Large B-Cell Lymphoma Treated with Immunochemotherapy (RCHOP): A Multicenter Validation Study. 50th Annual Meeting of the American-Society-of-Hematology/ASH/ASCO Joint Symposium Advani, R., Talreja, N., Tibshirani, R., Zhao, S., Alizadeh, A., Briones, J., Bordes, R., Cohen, J., Horning, S., Levy, R., Lossos, I. S., Natkunam, Y. AMER SOC HEMATOLOGY. 2008: 1291–91
  • Neither CD68+ Nor CD163+ Macrophages Are Associated with Decreased Survival in Follicular Lymphoma 50th Annual Meeting of the American-Society-of-Hematology/ASH/ASCO Joint Symposium Gratzinger, D., Ai, W., Tibshirani, R., Levy, R., Natkunam, Y. AMER SOC HEMATOLOGY. 2008: 1284–84


    We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially-expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit simple) approach might be to compute a two-sample t-statistic for each gene. The LPC method involves projecting these conventional gene scores onto the eigenvectors of the gene expression data covariance matrix and then applying an L1 penalty in order to de-noise the resulting projections. We present a theoretical framework under which LPC is the logical choice for identifying significant genes, and we show that LPC can provide a marked reduction in false discovery rates over the conventional methods on both real and simulated data. Moreover, this flexible procedure can be applied to a variety of types of data and can be used to improve many existing methods for the identification of significant features.

    View details for DOI 10.1214/08-AOAS182

    View details for Web of Science ID 000261057900009

  • "Preconditioning" for feature selection and regression in high-dimensional problems ANNALS OF STATISTICS Paul, D., Bair, E., Hastie, T., Tibshirani, R. 2008; 36 (4): 1595-1618
  • Sparse inverse covariance estimation with the graphical lasso BIOSTATISTICS Friedman, J., Hastie, T., Tibshirani, R. 2008; 9 (3): 432-441


    We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop a simple algorithm--the graphical lasso--that is remarkably fast: it solves a 1000-node problem (approximately 500,000 parameters) in at most a minute and is 30-4000 times faster than competing methods. It also provides a conceptual link between the exact problem and the approximation suggested by Meinshausen and Bühlmann (2006). We illustrate the method on some cell-signaling data from proteomics.

    View details for DOI 10.1093/biostatistics/kxm045

    View details for PubMedID 18079126
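
    The abstract above builds on a coordinate descent procedure for the lasso. The sketch below shows only that building block (cyclic coordinate descent with soft-thresholding for an ordinary lasso regression), not the graphical lasso itself. The objective scaling, the function names `soft_threshold` and `lasso_cd`, and the toy data are assumptions made for illustration.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for
        (1 / 2n) * sum_i (y_i - x_i . beta)^2 + lam * sum_j |beta_j|.
    X is a list of rows; no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            norm = sum(c * c for c in col) / n
            if norm == 0.0:
                continue
            # Inner product of column j with the partial residual
            # (the residual with feature j's own contribution added back).
            rho = sum(col[i] * (resid[i] + col[i] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= col[i] * delta
            beta[j] = new
    return beta


# Toy example (hypothetical) with orthogonal columns: the first feature
# carries the signal, the second is weak enough to be set exactly to zero.
X = [[1, 0], [0, 1], [-1, 0], [0, -1]]
y = [2.0, 0.1, -2.0, -0.1]
beta = lasso_cd(X, y, lam=0.1)
print(beta)
```

    Each coordinate update is a closed-form soft-threshold, which is why sweeping over coordinates is cheap; the graphical lasso applies updates of this kind to one row and column of the covariance estimate at a time.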

  • Complementary hierarchical clustering BIOSTATISTICS Nowak, G., Tibshirani, R. 2008; 9 (3): 467-483


    When applying hierarchical clustering algorithms to cluster patient samples from microarray data, the clustering patterns generated by most algorithms tend to be dominated by groups of highly differentially expressed genes that have closely related expression patterns. Sometimes, these genes may not be relevant to the biological process under study or their functions may already be known. The problem is that these genes can potentially drown out the effects of other genes that are relevant or have novel functions. We propose a procedure called complementary hierarchical clustering that is designed to uncover the structures arising from these novel genes that are not as highly expressed. Simulation studies show that the procedure is effective when applied to a variety of examples. We also define a concept called relative gene importance that can be used to identify the influential genes in a given clustering. Finally, we analyze a microarray data set from 295 breast cancer patients, using clustering with the correlation-based distance measure. The complementary clustering reveals a grouping of the patients which is uncorrelated with a number of known prognostic signatures and significantly differing distant metastasis-free probabilities.

    View details for DOI 10.1093/biostatistics/kxm046

    View details for PubMedID 18093965

  • Paraffin-based 6-gene model predicts outcome in diffuse large B-cell lymphoma patients treated with R-CHOP BLOOD Malumbres, R., Chen, J., Tibshirani, R., Johnson, N. A., Sehn, L. H., Natkunam, Y., Briones, J., Advani, R., Connors, J. M., Byrne, G. E., Levy, R., Gascoyne, R. D., Lossos, I. S. 2008; 111 (12): 5509-5514


    Diffuse large B-cell lymphoma (DLBCL) is a heterogeneous disease characterized by variable clinical outcomes. Outcome prediction at the time of diagnosis is of paramount importance. Previously, we constructed a 6-gene model for outcome prediction of DLBCL patients treated with anthracycline-based chemotherapies. However, the standard therapy has evolved into rituximab, cyclophosphamide, doxorubicin, vincristine and prednisone (R-CHOP). Herein, we evaluated the predictive power of a paraffin-based 6-gene model in R-CHOP-treated DLBCL patients. RNA was successfully extracted from 132 formalin-fixed paraffin-embedded (FFPE) specimens. Expression of the 6 genes comprising the model was measured and the mortality predictor score was calculated for each patient. The mortality predictor score divided patients into low-risk (below median) and high-risk (above median) subgroups with significantly different overall survival (OS; P = .002) and progression-free survival (PFS; P = .038). The model also predicted OS and PFS when the mortality predictor score was considered as a continuous variable (P = .002 and .010, respectively) and was independent of the IPI for prediction of OS (P = .008). These findings demonstrate that the prognostic value of the 6-gene model remains significant in the era of R-CHOP treatment and that the model can be applied to routine FFPE tissue from initial diagnostic biopsies.

    View details for DOI 10.1182/blood-2008-02-136374

    View details for Web of Science ID 000256786500021

    View details for PubMedID 18445689

    View details for PubMedCentralID PMC2424149

  • A STUDY OF PRE-VALIDATION ANNALS OF APPLIED STATISTICS Hoefling, H., Tibshirani, R. 2008; 2 (2): 643-664

    View details for DOI 10.1214/07-AOAS152

    View details for Web of Science ID 000261057800015

  • An FLT3 gene-expression signature predicts clinical outcome in normal karyotype AML BLOOD Bullinger, L., Doehner, K., Kranz, R., Stirner, C., Froeling, S., Scholl, C., Kim, Y. H., Schlenk, R. F., Tibshirani, R., Doehner, H., Pollack, J. R. 2008; 111 (9): 4490-4495


    Acute myeloid leukemia with normal karyotype (NK-AML) represents a cytogenetic grouping with intermediate prognosis but substantial molecular and clinical heterogeneity. Within this subgroup, presence of FLT3 (FMS-like tyrosine kinase 3) internal tandem duplication (ITD) mutation predicts less favorable outcome. The goal of our study was to discover gene-expression patterns correlated with FLT3-ITD mutation and to evaluate the utility of a FLT3 signature for prognostication. DNA microarrays were used to profile gene expression in a training set of 65 NK-AML cases, and supervised analysis, using the Prediction Analysis of Microarrays method, was applied to build a gene expression-based predictor of FLT3-ITD mutation status. The optimal predictor, composed of 20 genes, was then evaluated by classifying expression profiles from an independent test set of 72 NK-AML cases. The predictor exhibited modest performance (73% sensitivity; 85% specificity) in classifying FLT3-ITD status. Remarkably, however, the signature outperformed FLT3-ITD mutation status in predicting clinical outcome. The signature may better define clinically relevant FLT3 signaling and/or alternative changes that phenocopy FLT3-ITD, whereas the signature genes provide a starting point to dissect these pathways. Our findings support the potential clinical utility of a gene expression-based measure of FLT3 pathway activation in AML.

    View details for DOI 10.1182/blood-2007-09-115055

    View details for Web of Science ID 000255387400016

    View details for PubMedID 18309032

  • IRF9 and STAT1 are required for IgG autoantibody production and B cell expression of TLR7 in mice JOURNAL OF CLINICAL INVESTIGATION Thibault, D. L., Chu, A. D., Graham, K. L., Balboni, I., Lee, L. Y., Kohlmoos, C., Landrigan, A., Higgins, J. P., Tibshirani, R., Utz, P. J. 2008; 118 (4): 1417-1426


    A hallmark of SLE is the production of high-titer, high-affinity, isotype-switched IgG autoantibodies directed against nucleic acid-associated antigens. Several studies have established a role for both type I IFN (IFN-I) and the activation of TLRs by nucleic acid-associated autoantigens in the pathogenesis of this disease. Here, we demonstrate that 2 IFN-I signaling molecules, IFN regulatory factor 9 (IRF9) and STAT1, were required for the production of IgG autoantibodies in the pristane-induced mouse model of SLE. In addition, levels of IgM autoantibodies were increased in pristane-treated Irf9 -/- mice, suggesting that IRF9 plays a role in isotype switching in response to self antigens. Upregulation of TLR7 by IFN-alpha was greatly reduced in Irf9 -/- and Stat1 -/- B cells. Irf9 -/- B cells were incapable of being activated through TLR7, and Stat1 -/- B cells were impaired in activation through both TLR7 and TLR9. These data may reveal a novel role for IFN-I signaling molecules in both TLR-specific B cell responses and production of IgG autoantibodies directed against nucleic acid-associated autoantigens. Our results suggest that IFN-I is upstream of TLR signaling in the activation of autoreactive B cells in SLE.

    View details for DOI 10.1172/JCI30065

    View details for PubMedID 18340381

  • Multiplexed proximity ligation assays to profile putative plasma biomarkers relevant to pancreatic and ovarian cancer CLINICAL CHEMISTRY Fredriksson, S., Horecka, J., Brustugun, O. T., Schlingemann, J., Koong, A. C., Tibshirani, R., Davis, R. W. 2008; 54 (3): 582-589


    Sensitive methods are needed for biomarker discovery and validation. We tested one promising technology, multiplex proximity ligation assay (PLA), in a pilot study profiling plasma biomarkers in pancreatic and ovarian cancer. We used 4 panels of 6- and 7-plex PLAs to detect biomarkers, with each assay consuming 1 µL plasma and using either matched monoclonal antibody pairs or single batches of polyclonal antibody. Protein analytes were converted to unique DNA amplicons by proximity ligation and subsequently detected by quantitative PCR. We profiled 18 pancreatic cancer cases and 19 controls and 19 ovarian cancer cases and 20 controls for the following proteins: a disintegrin and metalloprotease 8, CA-125, CA 19-9, carboxypeptidase A1, carcinoembryonic antigen, connective tissue growth factor, epidermal growth factor receptor, epithelial cell adhesion molecule, Her2, galectin-1, insulin-like growth factor 2, interleukin-1alpha, interleukin-7, mesothelin, macrophage migration inhibitory factor, osteopontin, secretory leukocyte peptidase inhibitor, tumor necrosis factor alpha, vascular endothelial growth factor, and chitinase 3-like 1. Probes for CA-125 were present in 3 of the multiplex panels. We measured plasma concentrations of the CA-125-mesothelin complex by use of a triple-specific PLA with 2 ligation events among 3 probes. The assays displayed consistent measurements of CA-125 independent of which other markers were simultaneously detected and showed good correlation with Luminex data. In comparison to literature reports, we achieved expected results for other putative markers. Multiplex PLA using either matched monoclonal antibodies or single batches of polyclonal antibody should prove useful for identifying and validating sets of putative disease biomarkers and finding multimarker panels.

    View details for DOI 10.1373/clinchem.2007.093195

    View details for PubMedID 18171715

  • hCAP-D3 expression marks a prostate cancer subtype with favorable clinical behavior and androgen signaling signature AMERICAN JOURNAL OF SURGICAL PATHOLOGY Lapointe, J., Malhotra, S., Higgins, J. P., Bair, E., Thompson, M., Salari, K., Giacomini, C. P., Ferrari, M., Montgomery, K., Tibshirani, R., van de Rijn, M., Brooks, J. D., Pollack, J. R. 2008; 32 (2): 205-209


    Growing evidence suggests that only a fraction of prostate cancers detected clinically are potentially lethal. An important clinical issue is identifying men with indolent cancer who might be spared aggressive therapies with associated morbidities. Previously, using microarray analysis we defined 3 molecular subtypes of prostate cancer with different gene-expression patterns. One, subtype-1, displayed features consistent with more indolent behavior, where an immunohistochemical marker (AZGP1) for subtype-1 predicted favorable outcome after radical prostatectomy. Here we characterize a second candidate tissue biomarker, hCAP-D3, expressed in subtype-1 prostate tumors. hCAP-D3 expression, assayed by RNA in situ hybridization on a tissue microarray comprising 225 cases, was associated with decreased tumor recurrence after radical prostatectomy (P=0.004), independent of pathologic tumor stage, Gleason grade, and preoperative prostate-specific antigen levels. Simultaneous assessment of hCAP-D3 and AZGP1 expression in this tumor set improved outcome prediction. We have previously demonstrated that hCAP-D3 is induced by androgen in prostate cells. Extending this finding, Gene Set Enrichment Analysis revealed enrichment of androgen-responsive genes in subtype-1 tumors (P=0.019). Our findings identify hCAP-D3 as a new biomarker for subtype-1 tumors that improves prognostication, and reveal androgen signaling as an important biologic feature of this potentially clinically favorable molecular subtype.

    View details for PubMedID 18223322

  • LMO2 protein expression predicts survival in patients with diffuse large B-Cell lymphoma treated with anthracycline-based chemotherapy with and without rituximab JOURNAL OF CLINICAL ONCOLOGY Natkunam, Y., Farinha, P., Hsi, E. D., Hans, C. P., Tibshirani, R., Sehn, L. H., Connors, J. M., Gratzinger, D., Rosado, M., Zhao, S., Pohlman, B., Wongchaowart, N., Bast, M., Avigdor, A., Schiby, G., Nagler, A., Byrne, G. E., Levy, R., Gascoyne, R. D., Lossos, I. S. 2008; 26 (3): 447-454


    The heterogeneity of diffuse large B-cell lymphoma (DLBCL) has prompted the search for new markers that can accurately separate prognostic risk groups. We previously showed in a multivariate model that LMO2 mRNA was a strong predictor of superior outcome in DLBCL patients. Here, we tested the prognostic impact of LMO2 protein expression in DLBCL patients treated with anthracycline-based chemotherapy with or without rituximab. DLBCL patients treated with anthracycline-based chemotherapy alone (263 patients) or with the addition of rituximab (80 patients) were studied using immunohistochemistry for LMO2 on tissue microarrays of original biopsies. Staining results were correlated with outcome. In anthracycline-treated patients, LMO2 protein expression was significantly correlated with improved overall survival (OS) and progression-free survival (PFS) in univariate analyses (OS, P = .018; PFS, P = .010) and was a significant predictor independent of the clinical International Prognostic Index (IPI) in multivariate analysis. Similarly, in patients treated with the combination of anthracycline-containing regimens and rituximab, LMO2 protein expression was also significantly correlated with improved OS and PFS (OS, P = .005; PFS, P = .009) and was a significant predictor independent of the IPI in multivariate analysis. We conclude that LMO2 protein expression is a prognostic marker in DLBCL patients treated with anthracycline-based regimens alone or in combination with rituximab. After further validation, immunohistologic analysis of LMO2 protein expression may become a practical assay for newly diagnosed DLBCL patients to optimize their clinical management.

    View details for DOI 10.1200/JCO.2007.13.0690

    View details for PubMedID 18086797

  • Boolean implication networks derived from large scale, whole genome microarray datasets GENOME BIOLOGY Sahoo, D., Dill, D. L., Gentles, A. J., Tibshirani, R., Plevritis, S. K. 2008; 9 (10)


    We describe a method for extracting Boolean implications (if-then relationships) in very large amounts of gene expression microarray data. A meta-analysis of data from thousands of microarrays for humans, mice, and fruit flies finds millions of implication relationships between genes that would be missed by other methods. These relationships capture gender differences, tissue differences, development, and differentiation. New relationships are discovered that are preserved across all three species.

    View details for PubMedID 18973690
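
    The idea of a Boolean implication between two genes, as described in the abstract above, can be illustrated with a four-quadrant count. This is a simplified sketch under stated assumptions: each gene is split into "low" and "high" by a fixed threshold supplied by the caller, and an implication is reported when the quadrant that would contradict it holds at most a chosen fraction of the samples. The published method's threshold selection and statistical test are not reproduced here, and the names `quadrant_counts`, `implications`, and `max_frac` are hypothetical.

```python
def quadrant_counts(a, b, thr_a, thr_b):
    """Count samples in each (A level, B level) quadrant."""
    counts = {
        ("low", "low"): 0,
        ("low", "high"): 0,
        ("high", "low"): 0,
        ("high", "high"): 0,
    }
    for va, vb in zip(a, b):
        level_a = "high" if va > thr_a else "low"
        level_b = "high" if vb > thr_b else "low"
        counts[(level_a, level_b)] += 1
    return counts


def implications(a, b, thr_a, thr_b, max_frac=0.0):
    """List if-then rules whose contradicting quadrant is (nearly) empty."""
    counts = quadrant_counts(a, b, thr_a, thr_b)
    n = sum(counts.values())
    if n == 0:
        return []
    # Each rule is contradicted by exactly one quadrant.
    contradicted_by = {
        ("high", "low"): "A high => B high",
        ("high", "high"): "A high => B low",
        ("low", "low"): "A low => B high",
        ("low", "high"): "A low => B low",
    }
    return [rule for quad, rule in contradicted_by.items()
            if counts[quad] <= max_frac * n]


# Toy expression values (hypothetical) for genes A and B across six samples.
gene_a = [1, 2, 8, 9, 1, 9]
gene_b = [1, 9, 9, 8, 2, 9]
rules = implications(gene_a, gene_b, thr_a=5, thr_b=5)
print(rules)
```

    Note that an implication is asymmetric: here B can be high while A is low, so a correlation-based measure could understate a relationship that the quadrant count states exactly.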

  • LMO2 protein expression predicts survival in patients with diffuse large B-cell lymphoma treated with anthracycline-based chemotherapy with or without rituximab 97th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology Natkunam, Y., Farinha, P., Hsi, E. D., Hans, C. P., Tibshirani, R., Sehn, L. H., Connors, J. M., Gratzinger, D., Zhan, S., Pohlman, B., Nagler, A., Levy, R., Gascoyne, R. D., Lossos, I. S. NATURE PUBLISHING GROUP. 2008: 267A–267A
  • Prognostic significance of VEGF, VEGF receptors, and microvessel density in diffuse large B cell lymphoma treated with anthracycline-based chemotherapy LABORATORY INVESTIGATION Gratzinger, D., Zhao, S., Tibshirani, R. J., Hsi, E. D., Hans, C. P., Pohlman, B., Bast, M., Avigdor, A., Schiby, G., Nagler, A., Byrne, G. E., Lossos, I. S., Natkunam, Y. 2008; 88 (1): 38-47


    Vascular endothelial growth factor-mediated signaling has at least two potential roles in diffuse large B cell lymphoma: potentiation of angiogenesis, and potentiation of lymphoma cell proliferation and/or survival induced by autocrine vascular endothelial growth factor receptor-mediated signaling. We have recently shown that diffuse large B cell lymphomas expressing high levels of vascular endothelial growth factor protein also express high levels of vascular endothelial growth factor receptor-1 and vascular endothelial growth factor receptor-2. We have now assessed a larger multi-institutional cohort of patients with de novo diffuse large B cell lymphoma treated with anthracycline-based therapy to address whether tumor vascularity, or expression of vascular endothelial growth factor protein and its receptors, contribute to patient outcomes. Our results show that increased tumor vascularity is associated with poor overall survival (P=0.047), and is independent of the international prognostic index. High expression of vascular endothelial growth factor receptor-1 by lymphoma cells by contrast is associated with improved overall survival (P=0.044). The combination of high vascular endothelial growth factor and vascular endothelial growth factor receptor-1 protein expression by lymphoma cells identifies a subgroup of patients with improved overall (P=0.003) and progression-free (P=0.026) survival; these findings are also independent of the international prognostic index. The prognostic significance of overexpression of this ligand-receptor pair suggests that autocrine signaling via vascular endothelial growth factor receptor-1 may represent a survival or proliferation pathway in diffuse large B cell lymphoma. Dependence on autocrine vascular endothelial growth factor receptor-1-mediated signaling may render a subset of diffuse large B-cell lymphomas susceptible to anthracycline-based therapy.

    View details for DOI 10.1038/labinvest.3700697

    View details for PubMedID 17998899

  • Spatial smoothing and hot spot detection for CGH data using the fused lasso BIOSTATISTICS Tibshirani, R., Wang, P. 2008; 9 (1): 18-29


    We apply the "fused lasso" regression method of (TSRZ2004) to the problem of "hot-spot detection", in particular, detection of regions of gain or loss in comparative genomic hybridization (CGH) data. The fused lasso criterion leads to a convex optimization problem, and we provide a fast algorithm for its solution. Estimates of false-discovery rate are also provided. Our studies show that the new method generally outperforms competing methods for calling gains and losses in CGH data.

    View details for DOI 10.1093/biostatistics/kxm013

    View details for PubMedID 17513312
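
    To make the fused lasso criterion in the abstract above concrete, the sketch below evaluates one common form of it for a one-dimensional signal: a squared-error fit term, an L1 penalty on the coefficients (sparsity), and an L1 penalty on successive differences (smoothness). The factor of 1/2 on the fit term, the parameter names `lam1` and `lam2`, and the toy data are assumptions. The code only evaluates the objective and does not implement the fast algorithm the paper provides.

```python
def fused_lasso_objective(y, beta, lam1, lam2):
    """Evaluate
        0.5 * sum_i (y_i - beta_i)^2
        + lam1 * sum_i |beta_i|
        + lam2 * sum_{i>=2} |beta_i - beta_{i-1}|
    for a signal y and a candidate fit beta of the same length.
    """
    if len(y) != len(beta):
        raise ValueError("y and beta must have the same length")
    fit = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    sparsity = lam1 * sum(abs(bi) for bi in beta)
    smoothness = lam2 * sum(abs(beta[i] - beta[i - 1]) for i in range(1, len(beta)))
    return fit + sparsity + smoothness


# Toy copy-number-like signal (hypothetical): a region of gain in the middle.
y = [0.1, -0.2, 2.1, 1.9, 2.0, 0.0]
piecewise = [0.0, 0.0, 2.0, 2.0, 2.0, 0.0]  # flat baseline plus one "hot spot"
raw = list(y)                               # interpolates the noisy data

obj_piecewise = fused_lasso_objective(y, piecewise, lam1=0.1, lam2=0.5)
obj_raw = fused_lasso_objective(y, raw, lam1=0.1, lam2=0.5)
print(obj_piecewise, obj_raw)
```

    With these penalty weights the piecewise-constant candidate has the lower objective even though it fits the data less closely, which is the behavior that makes the criterion suitable for calling contiguous regions of gain or loss.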

  • Polymorphisms in hypoxia inducible factor 1 and the initial clinical presentation of coronary disease AMERICAN HEART JOURNAL Hlatky, M. A., Quertermous, T., Boothroyd, D. B., Priest, J. R., Glassford, A. J., Myers, R. M., Fortmann, S. P., Iribarren, C., Tabor, H. K., Assimes, T. L., Tibshirani, R. J., Go, A. S. 2007; 154 (6): 1035-1042


    Only some patients with coronary artery disease (CAD) develop acute myocardial infarction (MI), and emerging evidence suggests vulnerability to MI varies systematically among patients and may have a genetic component. The goal of this study was to assess whether polymorphisms in genes encoding elements of pathways mediating the response to ischemia affect vulnerability to MI among patients with underlying CAD. We prospectively identified patients at the time of their initial clinical presentation of CAD who had either an acute MI or stable exertional angina. We collected clinical data and genotyped 34 polymorphisms in 6 genes (ANGPT1, HIF1A, THBS1, VEGFA, VEGFC, VEGFR2). The 909 patients with acute MI were significantly more likely than the 466 patients with stable angina to be male, current smokers, and hypertensive, and less likely to be taking beta-blockers or statins. Three polymorphisms in HIF1A (Pro582Ser, rs11549465; rs1087314; and Thr418Ile, rs41508050) were significantly more common in patients who presented with stable exertional angina rather than acute MI, even after statistical adjustment for cardiac risk factors and medications. The HIF-mediated transcriptional activity was significantly lower when HIF1A null fibroblasts were transfected with variant HIF1A alleles than with wild-type HIF1A alleles. Polymorphisms in HIF1A were associated with development of stable exertional angina rather than acute MI as the initial clinical presentation of CAD.

    View details for DOI 10.1016/j.ahj.2007.07.042

    View details for PubMedID 18035072

  • PATHWISE COORDINATE OPTIMIZATION ANNALS OF APPLIED STATISTICS Friedman, J., Hastie, T., Hoefling, H., Tibshirani, R. 2007; 1 (2): 302-332

    View details for DOI 10.1214/07-AOAS131

    View details for Web of Science ID 000261057600003

  • Major histocompatibility class II (MHCII) and germinal center associated gene expression correlate with overall survival in rituximab and CHOP-like treated diffuse large B-cell lymphoma (DLBCL) patients, using 49th Annual Meeting of the American-Society-of-Hematology Malumbres, R., Johnson, N. A., Sehn, L. H., Natkunam, Y., Tibshirani, R., Briones, J., Connors, J. M., Levy, R., Gascoyne, R. D., Lossos, I. S. AMER SOC HEMATOLOGY. 2007: 23A–23A
  • Survival in follicular lymphoma: The Stanford experience, 1960-2003. 49th Annual Meeting of the American-Society-of-Hematology Tan, D., Rosenberg, S. A., Levy, R., Lavori, P., Tibshirani, R., Hoppe, R. T., Warnke, R., Advani, R., Natkunam, Y., Yuen, A., Horning, S. J. AMER SOC HEMATOLOGY. 2007: 1005A–1005A
  • LMO2 protein expression predicts survival in patients with diffuse large B-cell lymphoma in the pre- and post-rituximab treatment eras 49th Annual Meeting of the American-Society-of-Hematology Natkunam, Y., Farinha, P., Hsi, E. D., Hans, C. P., Tibshirani, R., Sehn, L. H., Connors, J. M., Zhao, S., Pohlman, B., Spinelli, J., Bast, M., Nagler, A., Levy, R., Gascoyne, R. D., Lossos, I. S. AMER SOC HEMATOLOGY. 2007: 24A–24A
  • Anti-idiotype antibody response after vaccination correlates with better overall survival in follicular lymphoma 49th Annual Meeting of the American-Society-of-Hematology Ai, W. Z., Tibshirani, R., Taidi, B., Czerwinski, D., Levy, R. AMER SOC HEMATOLOGY. 2007: 199A–199A
  • Classification and prediction of clinical Alzheimer's diagnosis based on plasma signaling proteins NATURE MEDICINE Ray, S., Britschgi, M., Herbert, C., Takeda-Uchimura, Y., Boxer, A., Blennow, K., Friedman, L. F., Galasko, D. R., Jutel, M., Karydas, A., Kaye, J. A., Leszek, J., Miller, B. L., Minthon, L., Quinn, J. F., Rabinovici, G. D., Robinson, W. H., Sabbagh, M. N., So, Y. T., Sparks, D. L., Tabaton, M., Tinklenberg, J., Yesavage, J. A., Tibshirani, R., Wyss-Coray, T. 2007; 13 (11): 1359-1362


    A molecular test for Alzheimer's disease could lead to better treatment and therapies. We found 18 signaling proteins in blood plasma that can be used to classify blinded samples from Alzheimer's and control subjects with close to 90% accuracy and to identify patients who had mild cognitive impairment that progressed to Alzheimer's disease 2-6 years later. Biological analysis of the 18 proteins points to systemic dysregulation of hematopoiesis, immune responses, apoptosis and neuronal support in presymptomatic Alzheimer's disease.

    View details for DOI 10.1038/nm1653

    View details for Web of Science ID 000250736900029

    View details for PubMedID 17934472

  • On the "degrees of freedom" of the lasso ANNALS OF STATISTICS Zou, H., Hastie, T., Tibshirani, R. 2007; 35 (5): 2173-2192
  • Expression and prognostic significance of a panel of tissue hypoxia markers in head-and-neck squamous cell carcinomas 48th Annual Meeting of the American-Society-for-Therapeutic-Radiology-and-Oncology (ASTRO) Le, Q., Kong, C., Lavori, P. W., O'Byrne, K., Erler, J. T., Huang, X., Chen, Y., Cao, H., Tibshirani, R., Denko, N., Giaccia, A. J., Koong, A. C. ELSEVIER SCIENCE INC. 2007: 167–75


    To investigate the expression pattern of hypoxia-induced proteins identified as being involved in malignant progression of head-and-neck squamous cell carcinoma (HNSCC) and to determine their relationship to tumor pO(2) and prognosis. We performed immunohistochemical staining of hypoxia-induced proteins (carbonic anhydrase IX [CA IX], BNIP3L, connective tissue growth factor, osteopontin, ephrin A1, hypoxia inducible gene-2, dihydrofolate reductase, galectin-1, IkappaB kinase beta, and lysyl oxidase) on tumor tissue arrays of 101 HNSCC patients with pretreatment pO(2) measurements. Analysis of variance and Fisher's exact tests were used to evaluate the relationship between marker expression, tumor pO(2), and CA IX staining. Cox proportional hazard model and log-rank tests were used to determine the relationship between markers and prognosis. Osteopontin expression correlated with tumor pO(2) (Eppendorf measurements) (p = 0.04). However, there was a strong correlation between lysyl oxidase, ephrin A1, and galectin-1 and CA IX staining. These markers also predicted for cancer-specific survival and overall survival on univariate analysis. A hypoxia score of 0-5 was assigned to each patient, on the basis of the presence of strong staining for these markers, whereby a higher score signifies increased marker expression. On multivariate analysis, increasing hypoxia score was an independent prognostic factor for cancer-specific survival (p = 0.015) and was borderline significant for overall survival (p = 0.057) when adjusted for other independent predictors of outcomes (hemoglobin and age). We identified a panel of hypoxia-related tissue markers that correlates with treatment outcomes in HNSCC. Validation of these markers will be needed to determine their utility in identifying patients for hypoxia-targeted therapy.

    View details for DOI 10.1016/j.ijrobp.2007.01.071

    View details for PubMedID 17707270

  • Notch signals positively regulate activity of the mTOR pathway in T-cell acute lymphoblastic leukemia BLOOD Chan, S. M., Weng, A. P., Tibshirani, R., Aster, J. C., Utz, P. J. 2007; 110 (1): 278-286


    Constitutive Notch activation is required for the proliferation of a subgroup of T-cell acute lymphoblastic leukemia (T-ALL). Downstream pathways that transmit pro-oncogenic signals are not well characterized. To identify these pathways, protein microarrays were used to profile the phosphorylation state of 108 epitopes on 82 distinct signaling proteins in a panel of 13 T-cell leukemia cell lines treated with a gamma-secretase inhibitor (GSI) to inhibit Notch signals. The microarray screen detected GSI-induced hypophosphorylation of multiple signaling proteins in the mTOR pathway. This effect was rescued by expression of the intracellular domain of Notch and mimicked by dominant negative MAML1, confirming Notch specificity. Withdrawal of Notch signals prevented stimulation of the mTOR pathway by mitogenic factors. These findings collectively suggest that the mTOR pathway is positively regulated by Notch in T-ALL cells. The effect of GSI on the mTOR pathway was independent of changes in phosphatidylinositol-3 kinase and Akt activity, but was rescued by expression of c-Myc, a direct transcriptional target of Notch, implicating c-Myc as an intermediary between Notch and mTOR. T-ALL cell growth was suppressed in a highly synergistic manner by simultaneous treatment with the mTOR inhibitor rapamycin and GSI, which represents a rational drug combination for treating this aggressive human malignancy.

    View details for DOI 10.1182/blood-2006-08-039883

    View details for PubMedID 17363738

  • Extracting binary signals from microarray time-course data NUCLEIC ACIDS RESEARCH Sahoo, D., Dill, D. L., Tibshirani, R., Plevritis, S. K. 2007; 35 (11): 3705-3712


    This article presents a new method for analyzing microarray time courses by identifying genes that undergo abrupt transitions in expression level, and the time at which the transitions occur. The algorithm matches the sequence of expression levels for each gene against temporal patterns having one or two transitions between two expression levels. The algorithm reports a P-value for the matching pattern of each gene, and a global false discovery rate can also be computed. After matching, genes can be sorted by the direction and time of transitions. Genes can be partitioned into sets based on the direction and time of change for further analysis, such as comparison with Gene Ontology annotations or binding site motifs. The method is evaluated on simulated and actual time-course data. On microarray data for budding yeast, it is shown that the groups of genes that change in similar ways and at similar times have significant and relevant Gene Ontology annotations.

    View details for DOI 10.1093/nar/gkm284

    View details for PubMedID 17517782
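
    The one-transition case described in the abstract above can be sketched as a least-squares step fit. This is only an illustration under stated assumptions: the function name `fit_one_step` and the example time course are hypothetical, and the sketch leaves out the two-transition patterns, the P-value, and the false discovery rate that the article reports.

```python
def fit_one_step(levels):
    """Fit a single-transition step pattern to a time course by least squares.

    Returns (split, mean_before, mean_after, sse), where `split` is the index
    of the first time point after the transition. Hypothetical helper, not the
    authors' implementation.
    """
    n = len(levels)
    if n < 2:
        raise ValueError("need at least two time points")
    best = None
    for split in range(1, n):
        before, after = levels[:split], levels[split:]
        mean_before = sum(before) / len(before)
        mean_after = sum(after) / len(after)
        # Sum of squared errors of the two-level step pattern.
        sse = (sum((x - mean_before) ** 2 for x in before)
               + sum((x - mean_after) ** 2 for x in after))
        if best is None or sse < best[3]:
            best = (split, mean_before, mean_after, sse)
    return best


# Made-up expression levels with an abrupt rise at the fourth time point.
course = [0.1, -0.2, 0.0, 2.1, 1.9, 2.0]
split, mean_before, mean_after, sse = fit_one_step(course)
direction = "up" if mean_after > mean_before else "down"
print(split, direction)
```

    Genes could then be grouped by `split` and `direction`, mirroring the sorting by time and direction of transition that the abstract mentions.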


    View details for DOI 10.1214/07-AOAS101

    View details for Web of Science ID 000261050400006

  • Oncogenic regulators and substrates of the anaphase promoting complex/cyclosome are frequently overexpressed in malignant tumors AMERICAN JOURNAL OF PATHOLOGY Lehman, N. L., Tibshirani, R., Hsu, J. Y., Natkunam, Y., Harris, B. T., West, R. B., Masek, M. A., Montgomery, K., van de Rijn, M., Jackson, P. K. 2007; 170 (5): 1793-1805


    The fidelity of cell division is dependent on the accumulation and ordered destruction of critical protein regulators. By triggering the appropriately timed, ubiquitin-dependent proteolysis of the mitotic regulatory proteins securin, cyclin B, aurora A kinase, and polo-like kinase 1, the anaphase promoting complex/cyclosome (APC/C) ubiquitin ligase plays an essential role in maintaining genomic stability. Misexpression of these APC/C substrates, individually, has been implicated in genomic instability and cancer. However, no comprehensive survey of the extent of their misregulation in tumors has been performed. Here, we analyzed more than 1600 benign and malignant tumors by immunohistochemical staining of tissue microarrays and found frequent overexpression of securin, polo-like kinase 1, aurora A, and Skp2 in malignant tumors. Positive and negative APC/C regulators, Cdh1 and Emi1, respectively, were also more strongly expressed in malignant versus benign tumors. Clustering and statistical analysis supports the finding that malignant tumors generally show broad misregulation of mitotic APC/C substrates not seen in benign tumors, suggesting that a "mitotic profile" in tumors may result from misregulation of the APC/C destruction pathway. This profile of misregulated mitotic APC/C substrates and regulators in malignant tumors suggests that analysis of this pathway may be diagnostically useful and represent a potentially important therapeutic target.

    View details for DOI 10.2353/ajpath.2007.060767

    View details for PubMedID 17456782

  • Disease-specific genomic analysis: identifying the signature of pathologic biology BIOINFORMATICS Nicolau, M., Tibshirani, R., Borresen-Dale, A., Jeffrey, S. S. 2007; 23 (8): 957-965


    Genomic high-throughput technology generates massive data, providing opportunities to understand countless facets of the functioning genome. It also raises profound issues in identifying data relevant to the biology being studied. We introduce a method for the analysis of pathologic biology that unravels the disease characteristics of high dimensional data. The method, disease-specific genomic analysis (DSGA), is intended to precede standard techniques like clustering or class prediction, and enhance their performance and ability to detect disease. DSGA measures the extent to which the disease deviates from a continuous range of normal phenotypes, and isolates the aberrant component of data. In several microarray cancer datasets, we show that DSGA outperforms standard methods. We then use DSGA to highlight a novel subdivision of an important class of genes in breast cancer, the estrogen receptor (ER) cluster. We also identify new markers distinguishing ductal and lobular breast cancers. Although our examples focus on microarrays, DSGA generalizes to any high dimensional genomic/proteomic data.

    View details for DOI 10.1093/bioinformatics/btm033

    View details for PubMedID 17277331

  • Averaged gene expressions for regression BIOSTATISTICS Park, M. Y., Hastie, T., Tibshirani, R. 2007; 8 (2): 212-227


    Although averaging is a simple technique, it plays an important role in reducing variance. We use this essential property of averaging in regression of the DNA microarray data, which poses the challenge of having far more features than samples. In this paper, we introduce a two-step procedure that combines (1) hierarchical clustering and (2) Lasso. By averaging the genes within the clusters obtained from hierarchical clustering, we define supergenes and use them to fit regression models, thereby attaining concise interpretation and accuracy. Our methods are supported with theoretical justifications and demonstrated on simulated and real data sets.

    View details for DOI 10.1093/biostatistics/kxl002

    View details for Web of Science ID 000245512000004

    View details for PubMedID 16698769

  • Microvessel density and expression of vascular endothelial growth factor and its receptors in diffuse large B-cell lymphoma subtypes AMERICAN JOURNAL OF PATHOLOGY Gratzinger, D., Zhao, S., Marinelli, R. J., Kapp, A. V., Tibshirani, R. J., Hammer, A. S., Hamilton-Dutoit, S., Natkunam, Y. 2007; 170 (4): 1362-1369


    Angiogenesis is known to play a major role in neoplasia, including hematolymphoid neoplasia. We assessed the relationships among angiogenesis and expression of vascular endothelial growth factor and its receptors in the context of clinically and biologically relevant subtypes of diffuse large B-cell lymphoma using immunohistochemical evaluation of tissue microarrays. We found that diffuse large B-cell lymphoma specimens showing higher local vascular endothelial growth factor expression showed correspondingly higher microvessel density, implying that lymphoma cells induce local tumor angiogenesis. In addition, local vascular endothelial growth factor expression was higher in those specimens showing higher expression of the receptors of the growth factor, suggesting an autocrine growth-promoting feedback loop. The germinal center-like and nongerminal center-like subtypes of diffuse large B-cell lymphoma were biologically and prognostically distinct. Interestingly, only in the more clinically aggressive nongerminal center-like subtype were microvessel densities significantly higher in specimens showing higher vascular endothelial growth factor expression; the same was true for the finding of higher vascular endothelial growth factor receptor-1 expression in conjunction with higher vascular endothelial growth factor expression. These differences may have important implications for the responsiveness of the two diffuse large B-cell lymphoma subtypes to anti-vascular endothelial growth factor and anti-angiogenic therapies.

    View details for DOI 10.2353/ajpath.2007.060901

    View details for PubMedID 17392174

  • Margin trees for high-dimensional classification JOURNAL OF MACHINE LEARNING RESEARCH Tibshirani, R., Hastie, T. 2007; 8: 637-652
  • Outlier sums for differential gene expression analysis BIOSTATISTICS Tibshirani, R., Hastie, T. 2007; 8 (1): 2-8


    We propose a method for detecting genes that, in a disease group, exhibit unusually high gene expression in some but not all samples. This can be particularly useful in cancer studies, where mutations that can amplify or turn off gene expression often occur in only a minority of samples. In real and simulated examples, the new method often exhibits lower false discovery rates than simple t-statistic thresholding. We also compare our approach to the recent cancer profile outlier analysis proposal of Tomlins and others (2005).

    View details for DOI 10.1093/biostatistics/kxl005

    View details for PubMedID 16702229
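
    The idea of summing unusually high values within the disease group can be sketched as follows. The abstract does not give the formula, so the details here are assumptions made for illustration: each gene is standardized by its median and median absolute deviation over all samples, and a disease-group value counts as an outlier when it exceeds the 75th percentile plus the interquartile range of the standardized values. The function name `outlier_sum` and the data are hypothetical.

```python
import statistics


def outlier_sum(normal, disease):
    """Sum the standardized disease-group values that lie above an outlier cutoff.

    Assumed definition (not taken from the abstract): standardize by the pooled
    median and MAD, then use q75 + IQR of the pooled standardized values as cutoff.
    """
    pooled = list(normal) + list(disease)
    center = statistics.median(pooled)
    mad = statistics.median([abs(x - center) for x in pooled])
    if mad == 0:
        raise ValueError("median absolute deviation is zero")
    standardized = [(x - center) / mad for x in pooled]
    q1, _, q3 = statistics.quantiles(standardized, n=4, method="inclusive")
    cutoff = q3 + (q3 - q1)
    disease_std = [(x - center) / mad for x in disease]
    return sum(z for z in disease_std if z > cutoff)


# Made-up gene: two of five disease samples show very high expression.
amplified = outlier_sum([0, 1, -1, 0.5, -0.5], [0, 1, -1, 10, 12])
# Made-up gene with no unusually high disease samples.
flat = outlier_sum([0, 1, -1, 0.5, -0.5], [0.2, -0.2, 0.8, -0.8, 0.1])
print(amplified, flat)
```

    A gene whose elevated expression appears in only a minority of disease samples gets a large score here, while its two-sample t-statistic may stay modest, which is the situation the abstract describes.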

  • Forward stagewise regression and the monotone lasso ELECTRONIC JOURNAL OF STATISTICS Hastie, T., Taylor, J., Tibshirani, R., Walther, G. 2007; 1: 1-29

    View details for DOI 10.1214/07-EJS004

    View details for Web of Science ID 000207854200001

  • Are clusters found in one dataset present in another dataset? BIOSTATISTICS Kapp, A. V., Tibshirani, R. 2007; 8 (1): 9-31


    In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be "reproducible" and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the "in-group proportion" (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare the four versions of the validation procedure which all use the IGP, but differ in the way in which the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and one version of the validation procedure is more widely applicable than the other three. An implementation of this algorithm is in a package called "clusterRepro" available through The Comprehensive R Archive Network.

    View details for DOI 10.1093/biostatistics/kxj029

    View details for PubMedID 16613834
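
    A rough sketch of an in-group proportion follows. The abstract names the measure without defining it, so this assumes the IGP of a cluster is the fraction of its members whose nearest neighbor belongs to the same cluster, and it uses Euclidean distance for simplicity. The function name and the toy data are hypothetical, and this is not the clusterRepro implementation.

```python
import math


def in_group_proportion(points, labels, cluster):
    """Fraction of points labeled `cluster` whose nearest neighbor shares the label.

    Assumed definition with Euclidean distance; illustrative only.
    """
    members = [i for i, label in enumerate(labels) if label == cluster]
    if not members or len(points) < 2:
        raise ValueError("need a non-empty cluster and at least two points")
    hits = 0
    for i in members:
        others = (j for j in range(len(points)) if j != i)
        nearest = min(others, key=lambda j: math.dist(points[i], points[j]))
        if labels[nearest] == cluster:
            hits += 1
    return hits / len(members)


# Toy data: the last point is labeled "a" but sits closer to the "b" group.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (2.9, 3.1)]
labels = ["a", "a", "a", "b", "b", "a"]
igp_a = in_group_proportion(points, labels, "a")
igp_b = in_group_proportion(points, labels, "b")
print(igp_a, igp_b)
```

    In the validation setting of the abstract, the labels would come from classifying new data to previously defined clusters, and a value near 1 would suggest that the cluster reappears in the new dataset.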

  • Regularized linear discriminant analysis and its application in microarrays BIOSTATISTICS Guo, Y., Hastie, T., Tibshirani, R. 2007; 8 (1): 86-100


    In this paper, we introduce a modified version of linear discriminant analysis, called the "shrunken centroids regularized discriminant analysis" (SCRDA). This method generalizes the idea of the "nearest shrunken centroids" (NSC) (Tibshirani and others, 2003) into the classical discriminant analysis. The SCRDA method is specially designed for classification problems in high dimension low sample size situations, for example, microarray data. Through both simulated data and real life data, it is shown that this method performs very well in multivariate classification problems, often outperforms the PAM method (using the NSC algorithm) and can be as competitive as support vector machine classifiers. It is also suitable for feature elimination and can be used as a gene selection method. The open source R package for this method (named "rda") is available on CRAN for download and testing.

    View details for DOI 10.1093/biostatistics/kxj035

    View details for PubMedID 16603682

  • Tumor-infiltrating T cells are not predictive of clinical outcome in follicular lymphoma. 48th Annual Meeting of the American-Society-of-Hematology Ai, W. Y., Czerwinski, D., Horning, S. J., Allen, J., Tibshirani, R., Levy, R. AMER SOC HEMATOLOGY. 2006: 247A–248A
  • Preliminary report on a phase I/II study of intratumoral injection of PF-3512676 (CpG 7909), a TLR9 agonist, combined with radiation in recurrent low-grade lymphomas. 48th Annual Meeting of the American-Society-of-Hematology Ai, W. Y., Kim, Y., Hoppe, R. T., Shah, S., Horning, S. J., Tibshirani, R., Levy, R. AMER SOC HEMATOLOGY. 2006: 767A–768A
  • Distinct patterns of DNA copy number alteration are associated with different clinicopathological features and gene-expression subtypes of breast cancer GENES CHROMOSOMES & CANCER Bergamaschi, A., Kim, Y. H., Wang, P., Sorlie, T., Hernandez-Boussard, T., Lonning, P. E., Tibshirani, R., Borresen-Dale, A., Pollack, J. R. 2006; 45 (11): 1033-1040


    Breast cancer is a leading cause of cancer-death among women, where the clinicopathological features of tumors are used to prognosticate and guide therapy. DNA copy number alterations (CNAs), which occur frequently in breast cancer and define key pathogenetic events, are also potentially useful prognostic or predictive factors. Here, we report a genome-wide array-based comparative genomic hybridization (array CGH) survey of CNAs in 89 breast tumors from a patient cohort with locally advanced disease. Statistical analysis links distinct cytoband loci harboring CNAs to specific clinicopathological parameters, including tumor grade, estrogen receptor status, presence of TP53 mutation, and overall survival. Notably, distinct spectra of CNAs also underlie the different subtypes of breast cancer recently defined by expression-profiling, implying these subtypes develop along distinct genetic pathways. In addition, higher numbers of gains/losses are associated with the "basal-like" tumor subtype, while high-level DNA amplification is more frequent in "luminal-B" subtype tumors, suggesting also that distinct mechanisms of genomic instability might underlie their pathogenesis. The identified CNAs may provide a basis for improved patient prognostication, as well as a starting point to define important genes to further our understanding of the pathobiology of breast cancer. This article contains Supplementary Material available at

    View details for DOI 10.1002/gcc.20366

    View details for Web of Science ID 000240601400005

    View details for PubMedID 16897746

  • Discovery and validation of breast cancer subtypes BMC GENOMICS Kapp, A. V., Jeffrey, S. S., Langerod, A., Borresen-Dale, A., Han, W., Noh, D., Bukholm, I. R., Nicolau, M., Brown, P. O., Tibshirani, R. 2006; 7


    Previous studies demonstrated breast cancer tumor tissue samples could be classified into different subtypes based upon DNA microarray profiles. The most recent study presented evidence for the existence of five different subtypes: normal breast-like, basal, luminal A, luminal B, and ERBB2+. Based upon the analysis of 599 microarrays (five separate cDNA microarray datasets) using a novel approach, we present evidence in support of the most consistently identifiable subtypes of breast cancer tumor tissue microarrays being: ESR1+/ERBB2-, ESR1-/ERBB2-, and ERBB2+ (collectively called the ESR1/ERBB2 subtypes). We validate all three subtypes statistically and show the subtype to which a sample belongs is a significant predictor of overall survival and distant-metastasis free probability. As a consequence of the statistical validation procedure we have a set of centroids which can be applied to any microarray (indexed by UniGene Cluster ID) to classify it to one of the ESR1/ERBB2 subtypes. Moreover, the method used to define the ESR1/ERBB2 subtypes is not specific to the disease. The method can be used to identify subtypes in any disease for which there are at least two independent microarray datasets of disease samples.

    View details for DOI 10.1186/1471-2164-7-231

    View details for PubMedID 16965636

  • Global transcriptional response to interferon is a determinant of HCV treatment outcome and is modified by race HEPATOLOGY He, X., Ji, X., Hale, M. B., Cheung, R., Ahmed, A., Guo, Y., Nolan, G. P., Pfeffer, L. M., Wright, T. L., Risch, N., Tibshirani, R., Greenberg, H. B. 2006; 44 (2): 352-359


    Interferon (IFN)-alpha-based therapy for chronic hepatitis C is effective in fewer than 50% of all treated patients, with a substantially lower response rate in black patients. The goal of this study was to investigate the underlying host transcriptional response associated with interferon treatment outcomes. We collected peripheral blood mononuclear cells from chronic hepatitis C patients before initiation of IFN-alpha therapy and incubated the cells with or without IFN-alpha for 6 hours, followed by microarray assay to identify IFN-induced gene transcription. The microarray datasets were analyzed statistically according to the patients' race and virological responses to subsequent IFN-alpha treatment. The global induction of IFN-stimulated genes (ISGs) was significantly greater in sustained virological responders compared with nonresponders and in white patients compared with black patients. In addition, a significantly greater global induction of ISGs was observed in sustained virological responders compared with nonresponders within the group of white patients. The level of IFN-induced signal transducer and activator of transcription (STAT) 1 activation, a key component of the Janus kinase (JAK)-STAT signaling pathway, correlated with the global induction of ISGs and was significantly higher in white patients than in black patients. In conclusion, both treatment outcome and race are associated with different transcriptional responses to IFN-alpha. Because this difference is evident in the global induction of ISGs rather than a selective effect on a subset of such genes, key factors affecting the outcome of IFN-alpha therapy are likely to act at the JAK-STAT pathway that controls transcription of downstream ISGs.

    View details for DOI 10.1002/hep.21267

    View details for PubMedID 16871572

  • Sparse principal component analysis JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Zou, H., Hastie, T., Tibshirani, R. 2006; 15 (2): 265-286
  • A tail strength measure for assessing the overall univariate significance in a dataset BIOSTATISTICS Taylor, J., Tibshirani, R. 2006; 7 (2): 167-181


    We propose an overall measure of significance for a set of hypothesis tests. The 'tail strength' is a simple function of the p-values computed for each of the tests. This measure is useful, for example, in assessing the overall univariate strength of a large set of features in microarray and other genomic and biomedical studies. It also has a simple relationship to the false discovery rate of the collection of tests. We derive the asymptotic distribution of the tail strength measure, and illustrate its use on a number of real datasets.

    View details for DOI 10.1093/biostatistics/kxj009

    View details for PubMedID 16332926
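
    As a sketch of how such a measure can be computed from p-values, the code below assumes the definition TS = (1/m) Σ_k [1 − p_(k)·(m+1)/k], where p_(1) ≤ ... ≤ p_(m) are the ordered p-values. The abstract only says the tail strength is a simple function of the p-values, so this formula and the function name `tail_strength` should be read as assumptions rather than a statement of the paper's definition.

```python
def tail_strength(p_values):
    """Assumed tail strength: mean over k of 1 - p_(k) * (m + 1) / k.

    Under the assumed formula, ordered p-values equal to their uniform
    expectations k / (m + 1) give a value of zero; smaller p-values push
    the value toward one.
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    ordered = sorted(p_values)
    total = sum(1.0 - p * (m + 1) / k for k, p in enumerate(ordered, start=1))
    return total / m


# p-values at their null expectations versus uniformly tiny p-values.
null_like = tail_strength([0.2, 0.4, 0.6, 0.8])
strong = tail_strength([0.0, 0.0, 0.0])
print(null_like, strong)
```

    Values near zero would then indicate little overall signal across the tests, and values well above zero would indicate more small p-values than expected by chance.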

  • Hybrid hierarchical clustering with applications to microarray data BIOSTATISTICS Chipman, H., Tibshirani, R. 2006; 7 (2): 286-301


    In this paper, we propose a hybrid clustering method that combines the strengths of bottom-up hierarchical clustering with that of top-down clustering. The first method is good at identifying small clusters but not large ones; the strengths are reversed for the second method. The hybrid method is built on the new idea of a mutual cluster: a group of points closer to each other than to any other points. Theoretical connections between mutual clusters and bottom-up clustering methods are established, aiding in their interpretation and providing an algorithm for identification of mutual clusters. We illustrate the technique on simulated and real microarray datasets.

    View details for DOI 10.1093/biostatistics/kxj007

    View details for Web of Science ID 000236436300009

    View details for PubMedID 16301308
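
    The mutual cluster idea can be illustrated with a small check. This sketch reads "closer to each other than to any other points" as: the largest distance between two group members is smaller than the smallest distance from any member to a point outside the group. That reading, the Euclidean distance, and the function name `is_mutual_cluster` are assumptions for illustration.

```python
import math


def is_mutual_cluster(points, group):
    """Return True if `group` (a set of indices into `points`) is a mutual cluster.

    Assumed criterion: group diameter < smallest distance from the group to
    any point outside it.
    """
    group = set(group)
    inside = [points[i] for i in group]
    outside = [p for i, p in enumerate(points) if i not in group]
    if len(inside) < 2 or not outside:
        raise ValueError("need at least two group members and one outside point")
    diameter = max(math.dist(a, b) for a in inside for b in inside)
    gap = min(math.dist(a, b) for a in inside for b in outside)
    return diameter < gap


# Toy data: three tightly packed points and two distant ones.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
tight = is_mutual_cluster(points, {0, 1, 2})
mixed = is_mutual_cluster(points, {0, 3})
print(tight, mixed)
```

    In a hybrid scheme of the kind the abstract describes, groups that pass such a check could be kept intact while a top-down method splits the rest of the data.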

  • A simple method for assessing sample sizes in microarray experiments BMC BIOINFORMATICS Tibshirani, R. 2006; 7


    In this short article, we discuss a simple method for assessing sample size requirements in microarray experiments. Our method starts with the output from a permutation-based analysis for a set of pilot data, e.g. from the SAM package. Then for a given hypothesized mean difference and various sample sizes, we estimate the false discovery rate and false negative rate of a list of genes; these are also interpretable as per gene power and type I error. We also discuss application of our method to other kinds of response variables, for example survival outcomes. Our method seems to be useful for sample size assessment in microarray experiments.

    View details for DOI 10.1186/1471-2105-7-106

    View details for PubMedID 16512900

  • An evaluation of tumor oxygenation and gene expression in patients with early stage non-small cell lung cancers CLINICAL CANCER RESEARCH Le, Q. T., Chen, E., Salim, A., Cao, H. B., Kong, C. S., Whyte, R., Donington, J., Cannon, W., Wakelee, H., Tibshirani, R., Mitchell, J. D., Richardson, D., O'Byrne, K. J., Koong, A. C., Giaccia, A. J. 2006; 12 (5): 1507-1514


    To directly assess tumor oxygenation in resectable non-small cell lung cancers (NSCLC) and to correlate tumor pO2 and the selected gene and protein expression to treatment outcomes. Twenty patients with resectable NSCLC were enrolled. Intraoperative measurements of normal lung and tumor pO2 were done with the Eppendorf polarographic electrode. All patients had plasma osteopontin measurements by ELISA. Carbonic anhydrase-IX (CA IX) staining of tumor sections was done in the majority of patients (n = 16), as was gene expression profiling (n = 12) using cDNA microarrays. Tumor pO2 was correlated with CA IX staining, osteopontin levels, and treatment outcomes. The median tumor pO2 ranged from 0.7 to 46 mm Hg (median, 16.6) and was lower than normal lung pO2 in all but one patient. Because both variables were affected by the completeness of lung deflation during measurement, we used the ratio of tumor/normal lung (T/L) pO2 as a reflection of tumor oxygenation. The median T/L pO2 was 0.13. T/L pO2 correlated significantly with plasma osteopontin levels (r = 0.53, P = 0.02) and CA IX expression (P = 0.006). Gene expression profiling showed that high CD44 expression was a predictor for relapse, which was confirmed by tissue staining of CD44 variant 6 protein. Other variables associated with the risk of relapse were T stage (P = 0.02), T/L pO2 (P = 0.04), and osteopontin levels (P = 0.001). Tumor hypoxia exists in resectable NSCLC and is associated with elevated expression of osteopontin and CA IX. Tumor hypoxia and elevated osteopontin levels and CD44 expression correlated with poor prognosis. A larger study is needed to confirm the prognostic significance of these factors.

    View details for DOI 10.1158/1078-0432.CCR-05-2049

    View details for PubMedID 16533775

  • Prediction by supervised principal components JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Bair, E., Hastie, T., Paul, D., Tibshirani, R. 2006; 101 (473): 119-137
  • Changes of gene expression in gastric preneoplasia following Helicobacter pylori eradication therapy CANCER EPIDEMIOLOGY BIOMARKERS & PREVENTION Tsai, C. J., Herrera-Goepfert, R., Tibshirani, R. J., Yang, S. F., Mohar, A., Guarner, J., Parsonnet, J. 2006; 15 (2): 272-280


    Helicobacter pylori causes gastric preneoplasia and neoplasia. Eradicating H. pylori can result in partial regression of preneoplastic lesions; however, the molecular underpinning of this change is unknown. To identify molecular changes in the gastric mucosa following H. pylori eradication, we used cDNA microarrays (with each array containing approximately 30,300 genes) to analyze 54 gastric biopsies from a randomized, placebo-controlled trial of H. pylori therapy. The 54 biopsies were obtained from 27 subjects (13 from the treatment and 14 from the placebo group) with chronic gastritis, atrophy, and/or intestinal metaplasia. Each subject contributed one biopsy before and another biopsy 1 year after the intervention. Significant analysis of microarrays (SAM) was used to compare the gene expression profiles of pre-intervention and post-intervention biopsies. In the treatment group, SAM identified 30 genes whose expression changed significantly from baseline to 1 year after treatment (0 up-regulated and 30 down-regulated). In the placebo group, the expression of 55 genes differed significantly over the 1-year period (32 up-regulated and 23 down-regulated). Five genes involved in cell-cell adhesion and lining (TACSTD1 and MUC13), cell cycle differentiation (S100A10), and lipid metabolism and transport (FABP1 and MTP) were down-regulated over time in the treatment group but up-regulated in the placebo group. Immunohistochemistry for one of these differentially expressed genes (FABP1) confirmed the changes in gene expression observed by microarray. In conclusion, H. pylori eradication may stop or reverse ongoing molecular processes in the stomach. Further studies are needed to evaluate the use of these genes as markers for gastric cancer risk.

    View details for DOI 10.1158/1055-9965.EPI-05-0362

    View details for PubMedID 16492915

  • Gene expression profiling predicts survival in conventional renal cell carcinoma PLOS MEDICINE Zhao, H. J., Ljungberg, B., Grankvist, K., Rasmuson, T., Tibshirani, R., Brooks, J. D. 2006; 3 (1): 115-124


    Conventional renal cell carcinoma (cRCC) accounts for most of the deaths due to kidney cancer. Tumor stage, grade, and patient performance status are used currently to predict survival after surgery. Our goal was to identify gene expression features, using comprehensive gene expression profiling, that correlate with survival. Gene expression profiles were determined in 177 primary cRCCs using DNA microarrays. Unsupervised hierarchical clustering analysis segregated cRCC into five gene expression subgroups. Expression subgroup was correlated with survival in long-term follow-up and was independent of grade, stage, and performance status. The tumors were then divided evenly into training and test sets that were balanced for grade, stage, performance status, and length of follow-up. A semisupervised learning algorithm (supervised principal components analysis) was applied to identify transcripts whose expression was associated with survival in the training set, and the performance of this gene expression-based survival predictor was assessed using the test set. With this method, we identified 259 genes that accurately predicted disease-specific survival among patients in the independent validation group (p < 0.001). In multivariate analysis, the gene expression predictor was a strong predictor of survival independent of tumor stage, grade, and performance status (p < 0.001). cRCC displays molecular heterogeneity and can be separated into gene expression subgroups that correlate with survival after surgery. We have identified a set of 259 genes that predict survival after surgery independent of clinical prognostic factors.

    View details for DOI 10.1371/journal.pmed.0030013

    View details for PubMedID 16318415

  • Autoantibody profiling of lupus mice deficient for interferon signaling components. 6th Annual Meeting of the Federation-of-Clinical-Immunology-Societies Thibault, D., Graham, K., Balboni, I., Lee, L., Kohlmoos, C., Tibshirani, R., Utz, P. ACADEMIC PRESS INC ELSEVIER SCIENCE. 2006: S72–S73
  • Combined microarray analysis of small cell lung cancer reveals altered apoptotic balance and distinct expression signatures of MYC family gene amplification ONCOGENE Kim, Y. H., Girard, L., Giacomini, C. P., Wang, P., Hernandez-Boussard, T., Tibshirani, R., Minna, J. D., Pollack, J. R. 2006; 25 (1): 130-138


    DNA amplifications and deletions frequently contribute to the development and progression of lung cancer. To identify such novel alterations in small cell lung cancer (SCLC), we performed comparative genomic hybridization on a set of 24 SCLC cell lines, using cDNA microarrays representing approximately 22,000 human genes (providing an average mapping resolution of <70 kb). We identified localized DNA amplifications corresponding to oncogenes known to be amplified in SCLC, including MYC (8q24), MYCN (2p24) and MYCL1 (1p34). Additional highly localized DNA amplifications suggested candidate oncogenes not previously identified as amplified in SCLC, including the antiapoptotic genes TNFRSF4 (1p36), DAD1 (14q11), BCL2L1 (20q11) and BCL2L2 (14q11). Likewise, newly discovered PCR-validated homozygous deletions suggested candidate tumor-suppressor genes, including the proapoptotic genes MAPK10 (4q21) and TNFRSF6 (10q23). To characterize the effect of DNA amplification on gene expression patterns, we performed expression profiling using the same microarray platform. Among our findings, we identified sets of genes whose expression correlated with MYC, MYCN or MYCL1 amplification, with surprisingly little overlap among gene sets. While both MYC and MYCN amplification were associated with increased and decreased expression of known MYC upregulated and downregulated targets, respectively, MYCL1 amplification was associated only with the latter. Our findings support a role of altered apoptotic balance in the pathogenesis of SCLC, and suggest that MYC family genes might affect oncogenesis through distinct sets of targets, in particular implicating the importance of transcriptional repression.

    View details for DOI 10.1038/sj.onc.1208997

    View details for PubMedID 16116477

  • Gene expression profiling differentiates germ cell tumors from other cancers and defines subtype-specific signatures PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Juric, D., Sale, S., Hromas, R. A., Yu, R., Wang, Y., Duran, G. E., Tibshirani, R., Einhorn, L. H., Sikic, B. I. 2005; 102 (49): 17763-17768


    Germ cell tumors (GCTs) of the testis are the predominant cancer among young men. We analyzed gene expression profiles of 50 GCTs of various subtypes, and we compared them with 443 other common malignant tumors of epithelial, mesenchymal, and lymphoid origins. Significant differences in gene expression were found among major histological subtypes of GCTs, and between them and other malignancies. We identified 511 genes, belonging to several critical functional groups such as cell cycle progression, cell proliferation, and apoptosis, to be significantly differentially expressed in GCTs compared with other tumor types. Sixty-five genes were sufficient for the construction of a GCT class predictor of high predictive accuracy (100% training set, 96% test set), which might be useful in the diagnosis of tumors of unknown primary origin. Previously described diagnostic and prognostic markers were found to be expressed by the appropriate GCT subtype (AFP, POU5F1, POV1, CCND2, and KIT). Several additional differentially expressed genes were identified in teratomas (EGR1 and MMP7), yolk sac tumors (PTPN13 and FN1), and seminomas (NR6A1, DPPA4, and IRX1). Dynamic computation of interaction networks and mapping to existing pathways knowledge databases revealed a potential role of EGR1 in p21-induced cell cycle arrest and intrinsic chemotherapy resistance of mature teratomas.

    View details for DOI 10.1073/pnas.0509082102

    View details for PubMedID 16306258

  • Differential gene expression profiles in CD34+ myelodysplastic syndrome marrow cells. 47th Annual Meeting of the American-Society-of-Hematology Sridhar, K., Brown, P. O., Tibshirani, R., Jamieson, C., Weissman, I., Ross, D. T., Greenberg, P. L. AMER SOC HEMATOLOGY. 2005: 956A–956A
  • Gene expression profiling and FLT3 status correlate with outcome in de novo acute myeloid leukemia (AML) with normal karyotype: Results of children's oncology group (COG) study POG #9421. 47th Annual Meeting of the American-Society-of-Hematology Lacayo, N., Meshinchi, S., Raimondi, S., Saraiya, C., O'Brien, M., Yu, R., Juric, D., Chang, M., Willman, C., Tibshirani, R., Ravindranath, Y., Sikic, B., Weinstein, H., Dahl, G. V. AMER SOC HEMATOLOGY. 2005: 667A–667A
  • Cluster validation by prediction strength JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Tibshirani, R., Walther, G. 2005; 14 (3): 511-528
  • Signature patterns of gene expression in mouse atherosclerosis and their correlation to human coronary disease PHYSIOLOGICAL GENOMICS Tabibiazar, R., Wagner, R. A., Ashley, E. A., King, J. Y., Ferrara, R., Spin, J. M., Sanan, D. A., Narasimhan, B., Tibshirani, R., Tsao, P. S., Efron, B., Quertermous, T. 2005; 22 (2): 213-226


    The propensity for developing atherosclerosis is dependent on underlying genetic risk and varies as a function of age and exposure to environmental risk factors. Employing three mouse models with different disease susceptibility, two diets, and a longitudinal experimental design, it was possible to manipulate each of these factors to focus analysis on genes most likely to have a specific disease-related function. To identify differences in longitudinal gene expression patterns of atherosclerosis, we have developed and employed a statistical algorithm that relies on generalized regression and permutation analysis. Comprehensive annotation of the array with ontology and pathway terms has allowed rigorous identification of molecular and biological processes that underlie disease pathophysiology. The repertoire of atherosclerosis-related immunomodulatory genes has been extended, and additional fundamental pathways have been identified. This highly disease-specific group of mouse genes was combined with an extensive human coronary artery data set to identify a shared group of genes differentially regulated among atherosclerotic tissues from different species and different vascular beds. A small core subset of these differentially regulated genes was sufficient to accurately classify various stages of the disease in mouse. The same gene subset was also found to accurately classify human coronary lesion severity. In addition, this classifier gene set was able to distinguish with high accuracy atherectomy specimens from native coronary artery disease vs. those collected from in-stent restenosis lesions, thus identifying molecular differences between these two processes. These studies significantly focus efforts aimed at identifying central gene regulatory pathways that mediate atherosclerotic disease, and the identification of classification gene sets offers unique insights into potential diagnostic and therapeutic strategies in atherosclerotic disease.

    View details for DOI 10.1152/physiolgenomics.00001.2005

    View details for Web of Science ID 000230987900011

    View details for PubMedID 15870398

  • Array-based comparative genomic hybridization identifies localized DNA amplifications and homozygous deletions in pancreatic cancer NEOPLASIA Bashyam, M. D., Bair, R., Kim, Y. H., Wang, P., Hernandez-Boussard, T., Karikari, C. A., Tibshirani, R., Maitra, A., Pollack, J. R. 2005; 7 (6): 556-562


    Pancreatic cancer, the fourth leading cause of cancer death in the United States, is frequently associated with the amplification and deletion of specific oncogenes and tumor-suppressor genes (TSGs), respectively. To identify such novel alterations and to discover the underlying genes, we performed comparative genomic hybridization on a set of 22 human pancreatic cancer cell lines, using cDNA microarrays measuring approximately 26,000 human genes (thereby providing an average mapping resolution of <60 kb). To define the subset of amplified and deleted genes with correspondingly altered expression, we also profiled mRNA levels in parallel using the same cDNA microarray platform. In total, we identified 14 high-level amplifications (38-4934 kb in size) and 15 homozygous deletions (46-725 kb). We discovered novel localized amplicons, suggesting previously unrecognized candidate oncogenes at 6p21, 7q21 (SMURF1, TRRAP), 11q22 (BIRC2, BIRC3), 12p12, 14q24 (TGFB3), 17q12, and 19q13. Likewise, we identified novel polymerase chain reaction-validated homozygous deletions indicating new candidate TSGs at 6q25, 8p23, 8p22 (TUSC3), 9q33 (TNC, TNFSF15), 10q22, 10q24 (CHUK), 11p15 (DKK3), 16q23, 18q23, 21q22 (PRDM15, ANKRD3), and Xp11. Our findings suggest candidate genes and pathways, which may contribute to the development or progression of pancreatic cancer.

    View details for DOI 10.1593/neo.04586

    View details for PubMedID 16036106

  • Genome-wide characterization of gene expression variations and DNA copy number changes in prostate cancer cell lines PROSTATE Zhao, H. J., Kim, Y., Wang, P., Lapointe, J., Tibshirani, R., Pollack, J. R., Brooks, J. D. 2005; 63 (2): 187-197


    The aim of this study was to characterize gene expression and DNA copy number profiles in androgen sensitive (AS) and androgen insensitive (AI) prostate cancer cell lines on a genome-wide scale. Gene expression profiles and DNA copy number changes were examined using DNA microarrays in eight commonly used prostate cancer cell lines. Chromosomal regions with DNA copy number changes were identified using cluster along chromosome (CLAC). There were discrete differences in gene expression patterns between AS and AI cells that were not limited to androgen-responsive genes. AI cells displayed more DNA copy number changes, especially amplifications, than AS cells. The gene expression profiles of cell lines showed limited similarities to prostate tumors harvested at surgery. AS and AI cell lines are different in their transcriptional programs and degree of DNA copy number alterations. This dataset provides a context for the use of prostate cancer cell lines as models for clinical cancers.

    View details for DOI 10.1002/pros.20158

    View details for PubMedID 15486987

  • Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Chang, H. Y., Nuyten, D. S., Sneddon, J. B., Hastie, T., Tibshirani, R., Sorlie, T., Dai, H. Y., He, Y. D., Van't Veer, L. J., Bartelink, H., van de Rijn, M., Brown, P. O., van de Vijver, M. J. 2005; 102 (10): 3738-3743


    Based on the hypothesis that features of the molecular program of normal wound healing might play an important role in cancer metastasis, we previously identified consistent features in the transcriptional response of normal fibroblasts to serum, and used this "wound-response signature" to reveal links between wound healing and cancer progression in a variety of common epithelial tumors. Here, in a consecutive series of 295 early breast cancer patients, we show that both overall survival and distant metastasis-free survival are markedly diminished in patients whose tumors expressed this wound-response signature compared to tumors that did not express this signature. A gene expression centroid of the wound-response signature provides a basis for prospectively assigning a prognostic score that can be scaled to suit different clinical purposes. The wound-response signature improves risk stratification independently of known clinico-pathologic risk factors and previously established prognostic signatures based on unsupervised hierarchical clustering ("molecular subtypes") or supervised predictors of metastasis ("70-gene prognosis signature").

    View details for DOI 10.1073/pnas.0409462102

    View details for PubMedID 15701700

  • Mouse strain-specific differences in vascular wall gene expression and their relationship to vascular disease ARTERIOSCLEROSIS THROMBOSIS AND VASCULAR BIOLOGY Tabibiazar, R., Wagner, R. A., Spin, J. M., Ashley, E. A., Narasimhan, B., Rubin, E. M., Efron, B., Tsao, P. S., Tibshirani, R., Quertermous, T. 2005; 25 (2): 302-308


    Different strains of inbred mice exhibit different susceptibility to the development of atherosclerosis. The C3H/HeJ and C57Bl/6 mice have been used in several studies aimed at understanding the genetic basis of atherosclerosis. Under controlled environmental conditions, variations in susceptibility to atherosclerosis reflect differences in genetic makeup, and these differences must be reflected in gene expression patterns that are temporally related to the development of disease. In this study, we sought to identify the genetic pathways that are differentially activated in the aortas of these mice. We performed genome-wide transcriptional profiling of aortas from C3H/HeJ and C57Bl/6 mice. Differences in gene expression were identified at baseline as well as during normal aging and longitudinal exposure to high-fat diet. The significance of these genes to the development of atherosclerosis was evaluated by observing their temporal pattern of expression in the well-studied apolipoprotein E model of atherosclerosis. Gene expression differences between the 2 strains suggest that aortas of C57Bl/6 mice have a higher genetic propensity to develop inflammation in response to appropriate atherogenic stimuli. This study expands the repertoire of factors in known disease-related signaling pathways and identifies novel candidate genes for future study. To gain insights into the molecular pathways that are differentially activated in strains of mice with varied susceptibility to atherosclerosis, we performed comprehensive transcriptional profiling of their vascular wall. Genes identified through these studies expand the repertoire of factors in disease-related signaling pathways and identify novel candidate genes in atherosclerosis.

    View details for DOI 10.1161/01.ATV.0000151372.86863.a5

    View details for PubMedID 15550693

  • The 'miss rate' for the analysis of gene expression data BIOSTATISTICS Taylor, J., Tibshirani, R., Efron, B. 2005; 6 (1): 111-117


    Multiple testing issues are important in gene expression studies, where typically thousands of genes are compared over two or more experimental conditions. The false discovery rate has become a popular measure in this setting. Here we discuss a complementary measure, the 'miss rate', and show how to estimate it in practice.

    View details for DOI 10.1093/biostatistics/kxh021

    View details for PubMedID 15618531

  • Early detection of breast cancer based on gene-expression patterns in peripheral blood cells BREAST CANCER RESEARCH Sharma, P., Sahni, N. S., Tibshirani, R., Skaane, P., Urdal, P., Berghagen, H., Jensen, M., Kristiansen, L., Moen, C., Sharma, P., Zaka, A., Arnes, J., Sauer, T., Akslen, L. A., Schlichting, E., Borresen-Dale, A. L., Lonneborg, A. 2005; 7 (5): R634-R644


    Existing methods to detect breast cancer in asymptomatic patients have limitations, and there is a need to develop more accurate and convenient methods. In this study, we investigated whether early detection of breast cancer is possible by analyzing gene-expression patterns in peripheral blood cells. Using macroarrays and nearest-shrunken-centroid method, we analyzed the expression pattern of 1,368 genes in peripheral blood cells of 24 women with breast cancer and 32 women with no signs of this disease. The results were validated using a standard leave-one-out cross-validation approach. We identified a set of 37 genes that correctly predicted the diagnostic class in at least 82% of the samples. The majority of these genes had a decreased expression in samples from breast cancer patients, and predominantly encoded proteins implicated in ribosome production and translation control. In contrast, the expression of some defense-related genes was increased in samples from breast cancer patients. The results show that a blood-based gene-expression test can be developed to detect breast cancer early in asymptomatic patients. Additional studies with a large sample size, from women both with and without the disease, are warranted to confirm or refute this finding.

    View details for DOI 10.1186/bcr1203

    View details for Web of Science ID 000232332200021

    View details for PubMedID 16168108

    View details for PubMedCentralID PMC1242124

  • Sparsity and smoothness via the fused lasso JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K. 2005; 67: 91-108
  • CSF1 expression signature identifies a subset of breast carcinomas and influences outcome. 28th Annual San Antonio Breast Cancer Symposium West, R. B., Horlings, H., Nuyten, D. S., Subramanian, S., Zhu, S. X., Miller, M., Rubin, B. P., Nielsen, T. O., Gilks, C. B., Huntsman, D. G., Tibshirani, R., van De Vijver, M., van de Rijn, M. SPRINGER. 2005: S135–S135
  • A method for calling gains and losses in array CGH data BIOSTATISTICS Wang, P., Kim, Y., Pollack, J., Narasimhan, B., Tibshirani, R. 2005; 6 (1): 45-58


    Array CGH is a powerful technique for genomic studies of cancer. It enables one to carry out genome-wide screening for regions of genetic alterations, such as chromosome gains and losses, or localized amplifications and deletions. In this paper, we propose a new algorithm 'Cluster along chromosomes' (CLAC) for the analysis of array CGH data. CLAC builds hierarchical clustering-style trees along each chromosome arm (or chromosome), and then selects the 'interesting' clusters by controlling the False Discovery Rate (FDR) at a certain level. In addition, it provides a consensus summary across a set of arrays, as well as an estimate of the corresponding FDR. We illustrate the method using an application of CLAC on a lung cancer microarray CGH data set as well as a BAC array CGH data set of aneuploid cell strains.

    View details for DOI 10.1093/biostatistics/kxh017

    View details for PubMedID 15618527

  • Sample classification from protein mass spectrometry, by 'peak probability contrasts' BIOINFORMATICS Tibshirani, R., Hastie, T., Narasimhan, B., Soltys, S., Shi, G. Y., Koong, A., Le, Q. T. 2004; 20 (17): 3034-3044


    Early cancer detection has always been a major research focus in solid tumor oncology. Early tumor detection can theoretically result in lower stage tumors, more treatable diseases and ultimately higher cure rates with less treatment-related morbidities. Protein mass spectrometry is a potentially powerful tool for early cancer detection. We propose a novel method for sample classification from protein mass spectrometry data. When applied to spectra from both diseased and healthy patients, the 'peak probability contrast' technique provides a list of all common peaks among the spectra, their statistical significance and their relative importance in discriminating between the two groups. We illustrate the method on matrix-assisted laser desorption and ionization mass spectrometry data from a study of ovarian cancers. Compared to other statistical approaches for class prediction, the peak probability contrast method performs as well or better than several methods that require the full spectra, rather than just labelled peaks. It is also much more interpretable biologically. The peak probability contrast method is a potentially useful tool for sample classification from protein mass spectrometry data.

    View details for DOI 10.1093/bioinformatics/bth357

    View details for PubMedID 15226172

  • The percentage of tumor-infiltrating T cells is not correlated with overall survival in follicular B-cell lymphomas 46th Annual Meeting of the American-Society-of-Hematology Ai, W. Y., Czerwinski, D. K., Tibshirani, R., Horning, S. J., Levy, R. AMER SOC HEMATOLOGY. 2004: 891A–891A
  • Gene expression profiles at diagnosis in de novo childhood AML patients identify FLT3 mutations with good clinical outcomes BLOOD Lacayo, N. J., Meshinchi, S., Kinnunen, P., Yu, R., Wang, Y., Stuber, C. M., Douglas, L., Wahab, R., Becton, D. L., Weinstein, H., Chang, M. N., Willman, C. L., Radich, J. P., Tibshirani, R., Ravindranath, Y., Sikic, B. I., Dahl, G. V. 2004; 104 (9): 2646-2654


    Fms-like tyrosine kinase 3 (FLT3) mutations are associated with unfavorable outcomes in children with acute myeloid leukemia (AML). We used DNA microarrays to identify gene expression profiles related to FLT3 status and outcome in childhood AML. Among 81 diagnostic specimens, 36 had FLT3 mutations (FLT3-MUs), 24 with internal tandem duplications (ITDs) and 12 with activating loop mutations (ALMs). In addition, 8 of 19 specimens from patients with relapses had FLT3-MUs. Predictive analysis of microarrays (PAM) identified genes that differentiated FLT3-ITD from FLT3-ALM and FLT3 wild-type (FLT3-WT) cases. Among the 42 specimens with FLT3-MUs, PAM identified 128 genes that correlated with clinical outcome. Event-free survival (EFS) in FLT3-MU patients with a favorable signature was 45% versus 5% for those with an unfavorable signature (P = .018). Among FLT3-MU specimens, high expression of the RUNX3 gene and low expression of the ATRX gene were associated with inferior outcome. The ratio of RUNX3 to ATRX expression was used to classify FLT3-MU cases into 3 EFS groups: 70%, 37%, and 0% for low, intermediate, and high ratios, respectively (P < .0001). Thus, gene expression profiling identified AML patients with divergent prognoses within the FLT3-MU group, and the RUNX3 to ATRX expression ratio should be a useful prognostic indicator in these patients.

    View details for DOI 10.1182/blood-2003-12-4449

    View details for PubMedID 15251987

  • The entire regularization path for the support vector machine JOURNAL OF MACHINE LEARNING RESEARCH Hastie, T., Rosset, S., Tibshirani, R., Zhu, J. 2004; 5: 1391-1415
  • Developmental response to hypoxia FASEB JOURNAL Huang, S. T., Vo, K. C., Lyell, D. J., Faessen, G. H., Tulac, S., Tibshirani, R., Giaccia, A. J., Giudice, L. C. 2004; 18 (12): 1348-1365


    Molecular mechanisms underlying fetal growth restriction due to placental insufficiency and in utero hypoxia are not well understood. In the current study, time-dependent (3 h-11 days) changes in fetal tissue gene expression in a rat model of in utero hypoxia compared with normoxic controls were investigated as an initial approach to understand molecular events underlying fetal development in response to hypoxia. Under hypoxic conditions, litter size was reduced and IGFBP-1 was up-regulated in maternal serum and in fetal liver and heart. Tissue-specific, distinct regulatory patterns of gene expression were observed under acute vs. chronic hypoxic conditions. Induction of glycolytic enzymes was an early event in response to hypoxia during organ development; consistently, tissue-specific induction of calcium homeostasis-related genes and suppression of growth-related genes were observed, suggesting mechanisms underlying hypoxia-related fetal growth restriction. Furthermore, induction of inflammation-related genes in placentas exposed to long-term hypoxia (11 days) suggests a mechanism for placental dysfunction and impaired pregnancy outcome accompanying in utero hypoxia.

    View details for DOI 10.1096/fj.03-1377com

    View details for PubMedID 15333578

  • Toxicity from radiation therapy associated with abnormal transcriptional responses to DNA damage Annual Scientific Meeting on Exploring Genomics in Radiation Oncology Rieger, K. E., Hong, W. J., Tusher, V. G., Tang, J., Tibshirani, R., Chu, G. ELSEVIER IRELAND LTD. 2004: S29–S29
  • The use of plasma surface-enhanced laser desorption/ionization time-of-flight mass spectrometry proteomic patterns for detection of head and neck squamous cell cancers 45th Annual Meeting of the American-Society-for-Therapeutic-Radiology-and-Oncology (ASTRO) Soltys, S. G., Le, Q. T., Shi, G. Y., Tibshirani, R., Giaccia, A. J., Koong, A. C. AMER ASSOC CANCER RESEARCH. 2004: 4806–12


    Our study was undertaken to determine the utility of plasma proteomic profiling using surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) mass spectrometry for the detection of head and neck squamous cell carcinomas (HNSCCs). Pretreatment plasma samples from HNSCC patients or controls without known neoplastic disease were analyzed on the Protein Biology System IIc SELDI-TOF mass spectrometer (Ciphergen Biosystems, Fremont, CA). Proteomic spectra of mass:charge ratio (m/z) were generated by the application of plasma to immobilized metal-affinity-capture (IMAC) ProteinChip arrays activated with copper. A total of 37356 data points were generated for each sample. A training set of spectra from 56 cancer patients and 52 controls were applied to the "Lasso" technique to identify protein profiles that can distinguish cancer from noncancer, and cross-validation was used to determine test errors in this training set. The discovery pattern was then used to classify a separate masked test set of 57 cancer and 52 controls. In total, we analyzed the proteomic spectra of 113 cancer patients and 104 controls. The Lasso approach identified 65 significant data points for the discrimination of normal from cancer profiles. The discriminatory pattern correctly identified 39 of 57 HNSCC patients and 40 of 52 noncancer controls in the masked test set. These results yielded a sensitivity of 68% and specificity of 73%. Subgroup analyses in the test set of four different demographic factors (age, gender, and cigarette and alcohol use) that can potentially confound the interpretation of the results suggest that this model tended to overpredict cancer in control smokers. Plasma proteomic profiling with SELDI-TOF mass spectrometry provides moderate sensitivity and specificity in discriminating HNSCC. Further improvement and validation of this approach is needed to determine its usefulness in screening for this disease.

    View details for PubMedID 15269156

  • Efficient quadratic regularization for expression arrays BIOSTATISTICS Hastie, T., Tibshirani, R. 2004; 5 (3): 329-340


    Gene expression arrays typically have 50 to 100 samples and 1000 to 20,000 variables (genes). There have been many attempts to adapt statistical models for regression and classification to these data, and in many cases these attempts have challenged the computational resources. In this article we expose a class of techniques based on quadratic regularization of linear models, including regularized (ridge) regression, logistic and multinomial regression, linear and mixture discriminant analysis, the Cox model and neural networks. For all of these models, we show that dramatic computational savings are possible over naive implementations, using standard transformations in numerical linear algebra.

    View details for DOI 10.1093/biostatistics/kxh010

    View details for PubMedID 15208198

  • Different gene expression patterns in invasive lobular and ductal carcinomas of the breast MOLECULAR BIOLOGY OF THE CELL Zhao, H. J., Langerod, A., Ji, Y., Nowels, K. W., Nesland, J. M., Tibshirani, R., Bukholm, I. K., Karesen, R., Botstein, D., Borresen-Dale, A. L., Jeffrey, S. S. 2004; 15 (6): 2523-2536


    Invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) are the two major histological types of breast cancer worldwide. Whereas IDC incidence has remained stable, ILC is the most rapidly increasing breast cancer phenotype in the United States and Western Europe. It is not clear whether IDC and ILC represent molecularly distinct entities and what genes might be involved in the development of these two phenotypes. We conducted comprehensive gene expression profiling studies to address these questions. Total RNA from 21 ILCs, 38 IDCs, two lymph node metastases, and three normal tissues were amplified and hybridized to approximately 42,000 clone cDNA microarrays. Data were analyzed using hierarchical clustering algorithms and statistical analyses that identify differentially expressed genes (significance analysis of microarrays) and minimal subsets of genes (prediction analysis for microarrays) that succinctly distinguish ILCs and IDCs. Eleven of 21 (52%) of the ILCs ("typical" ILCs) clustered together and displayed different gene expression profiles from IDCs, whereas the other ILCs ("ductal-like" ILCs) were distributed between different IDC subtypes. Many of the differentially expressed genes between ILCs and IDCs code for proteins involved in cell adhesion/motility, lipid/fatty acid transport and metabolism, immune/defense response, and electron transport. Many genes that distinguish typical and ductal-like ILCs are involved in regulation of cell growth and immune response. Our data strongly suggest that over half the ILCs differ from IDCs not only in histological and clinical features but also in global transcription programs. The remaining ILCs closely resemble IDCs in their transcription patterns. Further studies are needed to explore the differences between ILC molecular subtypes and to determine whether they require different therapeutic strategies.

    View details for DOI 10.1091/mbc.E03-11-0786

    View details for PubMedID 15034139

  • Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes NEW ENGLAND JOURNAL OF MEDICINE Lossos, I. S., Czerwinski, D. K., Alizadeh, A. A., Wechser, M. A., Tibshirani, R., Botstein, D., Levy, R. 2004; 350 (18): 1828-1837


    Several gene-expression signatures can be used to predict the prognosis in diffuse large-B-cell lymphoma, but the lack of practical tests for a genome-scale analysis has restricted the use of this method. We studied 36 genes whose expression had been reported to predict survival in diffuse large-B-cell lymphoma. We measured the expression of each of these genes in independent samples of lymphoma from 66 patients by quantitative real-time polymerase-chain-reaction analyses and related the results to overall survival. In a univariate analysis, genes were ranked on the basis of their ability to predict survival. The genes that were the strongest predictors were LMO2, BCL6, FN1, CCND2, SCYA3, and BCL2. We developed a multivariate model that was based on the expression of these six genes, and we validated the model in two independent microarray data sets. The model was independent of the International Prognostic Index and added to its predictive power. Measurement of the expression of six genes is sufficient to predict overall survival in diffuse large-B-cell lymphoma.

    View details for PubMedID 15115829

  • Toxicity from radiation therapy associated with abnormal transcriptional responses to DNA damage PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Rieger, K. E., Hong, W. J., Tusher, V. G., Tang, J., Tibshirani, R., Chu, G. 2004; 101 (17): 6635-6640


    Toxicity from radiation therapy is a grave problem for cancer patients. We hypothesized that some cases of toxicity are associated with abnormal transcriptional responses to radiation. We used microarrays to measure responses to ionizing and UV radiation in lymphoblastoid cells derived from 14 patients with acute radiation toxicity. The analysis used heterogeneity-associated transformation of the data to account for a clinical outcome arising from more than one underlying cause. To compute the risk of toxicity for each patient, we applied nearest shrunken centroids, a method that identifies and cross-validates predictive genes. Transcriptional responses in 24 genes predicted radiation toxicity in 9 of 14 patients with no false positives among 43 controls (P = 2.2 × 10⁻⁷). The responses of these nine patients displayed significant heterogeneity. Of the five patients with toxicity and normal responses, two were treated with protocols that proved to be highly toxic. These results may enable physicians to predict toxicity and tailor treatment for individual patients.

    View details for DOI 10.1073/pnas.0307761101

    View details for Web of Science ID 000221107900056

    View details for PubMedID 15096622

    View details for PubMedCentralID PMC404097

  • Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia NEW ENGLAND JOURNAL OF MEDICINE Bullinger, L., Dohner, K., Bair, E., Frohling, S., Schlenk, R. F., Tibshirani, R., Dohner, H., Pollack, J. R. 2004; 350 (16): 1605-1616


    In patients with acute myeloid leukemia (AML), the presence or absence of recurrent cytogenetic aberrations is used to identify the appropriate therapy. However, the current classification system does not fully reflect the molecular heterogeneity of the disease, and treatment stratification is difficult, especially for patients with intermediate-risk AML with a normal karyotype. We used complementary-DNA microarrays to determine the levels of gene expression in peripheral-blood samples or bone marrow samples from 116 adults with AML (including 45 with a normal karyotype). We used unsupervised hierarchical clustering analysis to identify molecular subgroups with distinct gene-expression signatures. Using a training set of samples from 59 patients, we applied a novel supervised learning algorithm to devise a gene-expression-based clinical-outcome predictor, which we then tested using an independent validation group comprising the 57 remaining patients. Unsupervised analysis identified new molecular subtypes of AML, including two prognostically relevant subgroups in AML with a normal karyotype. Using the supervised learning algorithm, we constructed an optimal 133-gene clinical-outcome predictor, which accurately predicted overall survival among patients in the independent validation group (P=0.006), including the subgroup of patients with AML with a normal karyotype (P=0.046). In multivariate analysis, the gene-expression predictor was a strong independent prognostic factor (odds ratio, 8.8; 95 percent confidence interval, 2.6 to 29.3; P<0.001). The use of gene-expression profiling improves the molecular classification of adult AML.

    View details for PubMedID 15084693

  • Least angle regression ANNALS OF STATISTICS Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. 2004; 32 (2): 407-451
  • Semi-supervised methods to predict patient survival from gene expression data PLOS BIOLOGY Bair, E., Tibshirani, R. 2004; 2 (4): 511-522


    An important goal of DNA microarray research is to develop tools to diagnose cancer more accurately based on the genetic profile of a tumor. There are several existing techniques in the literature for performing this type of diagnosis. Unfortunately, most of these techniques assume that different subtypes of cancer are already known to exist. Their utility is limited when such subtypes have not been previously identified. Although methods for identifying such subtypes exist, these methods do not work well for all datasets. It would be desirable to develop a procedure to find such subtypes that is applicable in a wide variety of circumstances. Even if no information is known about possible subtypes of a certain form of cancer, clinical information about the patients, such as their survival time, is often available. In this study, we develop some procedures that utilize both the gene expression data and the clinical data to identify subtypes of cancer and use this knowledge to diagnose future patients. These procedures were successfully applied to several publicly available datasets. We present diagnostic procedures that accurately predict the survival of future patients based on the gene expression profile and survival times of previous patients. This has the potential to be a powerful tool for diagnosing and treating cancer.

    View details for PubMedID 15094809

  • Cancer characterization and feature set extraction by discriminative margin clustering BMC BIOINFORMATICS Munagala, K., Tibshirani, R., Brown, P. O. 2004; 5


    A central challenge in the molecular diagnosis and treatment of cancer is to define a set of molecular features that, taken together, distinguish a given cancer, or type of cancer, from all normal cells and tissues. Discriminative margin clustering is a new technique for analyzing high dimensional quantitative datasets, especially applicable to gene expression data from microarray experiments related to cancer. The goal of the analysis is to find highly specialized sub-types of a tumor type which are similar in having a small combination of genes which together provide a unique molecular portrait for distinguishing the sub-type from any normal cell or tissue. Detection of the products of these genes can then, in principle, provide a basis for detection and diagnosis of a cancer, and a therapy directed specifically at the distinguishing constellation of molecular features can, in principle, provide a way to eliminate the cancer cells, while minimizing toxicity to any normal cell. The new methodology yields highly specialized tumor subtypes which are similar in terms of potential diagnostic markers.

    View details for PubMedID 15070405

  • Guidelines - Expression profiling - best practices for data generation and interpretation in clinical trials NATURE REVIEWS GENETICS Hoffman, E. P., Awad, T., Palma, J., Webster, T., Hubbell, E., Warrington, J. A., Spirais, A., Wright, G., Buckley, J., Triche, T., Davis, R., Tibshirani, R., Xiao, W. H., Jones, W., Tompkins, R., West, M. 2004; 5 (3): 229-237

    View details for DOI 10.1038/nrg1297

    View details for Web of Science ID 000189334500018

  • Gene expression profiling identifies clinically relevant subtypes of prostate cancer PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Lapointe, J., Li, C., Higgins, J. P., van de Rijn, M., Bair, E., Montgomery, K., Ferrari, M., Egevad, L., Rayford, W., Bergerheim, U., Ekman, P., DeMarzo, A. M., Tibshirani, R., Botstein, D., Brown, P. O., Brooks, J. D., Pollack, J. R. 2004; 101 (3): 811-816


    Prostate cancer, a leading cause of cancer death, displays a broad range of clinical behavior from relatively indolent to aggressive metastatic disease. To explore potential molecular variation underlying this clinical heterogeneity, we profiled gene expression in 62 primary prostate tumors, as well as 41 normal prostate specimens and nine lymph node metastases, using cDNA microarrays containing approximately 26,000 genes. Unsupervised hierarchical clustering readily distinguished tumors from normal samples, and further identified three subclasses of prostate tumors based on distinct patterns of gene expression. High-grade and advanced stage tumors, as well as tumors associated with recurrence, were disproportionately represented among two of the three subtypes, one of which also included most lymph node metastases. To further characterize the clinical relevance of tumor subtypes, we evaluated as surrogate markers two genes differentially expressed among tumor subgroups by using immunohistochemistry on tissue microarrays representing an independent set of 225 prostate tumors. Positive staining for MUC1, a gene highly expressed in the subgroups with "aggressive" clinicopathological features, was associated with an elevated risk of recurrence (P = 0.003), whereas strong staining for AZGP1, a gene highly expressed in the other subgroup, was associated with a decreased risk of recurrence (P = 0.0008). In multivariate analysis, MUC1 and AZGP1 staining were strong predictors of tumor recurrence independent of tumor grade, stage, and preoperative prostate-specific antigen levels. Our results suggest that prostate tumors can be usefully classified according to their gene expression patterns, and these tumor subtypes may provide a basis for improved prognostication and treatment stratification.

    View details for DOI 10.1073/pnas.0304146101

    View details for PubMedID 14711987

  • 1-norm support vector machines 17th Annual Conference on Neural Information Processing Systems (NIPS) Zhu, J., Rosset, S., Hastie, T., Tibshirani, R. M I T PRESS. 2004: 49–56
  • Boosted PRIM with application to searching for oncogenic pathway of lung cancer IEEE Computational Systems Bioinformatics Conference (CSB 2004) Wang, P., Kim, Y., Pollack, J., Tibshirani, R. IEEE COMPUTER SOC. 2004: 604–609
  • Central carbon metabolism genes that predict disease-free survival in hormone receptor negative tumors. 27th Annual San Antonio Breast Cancer Symposium Funari, V. A., Tibshirani, R., Ji, Y., Nicolau, M., Carlson, R. W., Brown, P. O., Noh, D. Y., Jeffrey, S. S. SPRINGER. 2004: S115–S115
  • Gene expression patterns in ovarian carcinomas MOLECULAR BIOLOGY OF THE CELL Schaner, M. E., Ross, D. T., Ciaravino, G., Sorlie, T., Troyanskaya, O., Diehn, M., Wang, Y. C., Duran, G. E., Sikic, T. L., Caldeira, S., Skomedal, H., Tu, I. P., Hernandez-Boussard, T., Johnson, S. W., O'Dwyer, P. J., Fero, M. J., Kristensen, G. B., Borresen-Dale, A. L., Hastie, T., Tibshirani, R., van de Rijn, M., Teng, N. N., Longacre, T. A., Botstein, D., Brown, P. O., Sikic, B. I. 2003; 14 (11): 4376-4386


    We used DNA microarrays to characterize the global gene expression patterns in surface epithelial cancers of the ovary. We identified groups of genes that distinguished the clear cell subtype from other ovarian carcinomas, grade I and II from grade III serous papillary carcinomas, and ovarian from breast carcinomas. Six clear cell carcinomas were distinguished from 36 other ovarian carcinomas (predominantly serous papillary) based on their gene expression patterns. The differences may yield insights into the worse prognosis and therapeutic resistance associated with clear cell carcinomas. A comparison of the gene expression patterns in the ovarian cancers to published data of gene expression in breast cancers revealed a large number of differentially expressed genes. We identified a group of 62 genes that correctly classified all 125 breast and ovarian cancer specimens. Among the best discriminators more highly expressed in the ovarian carcinomas were PAX8 (paired box gene 8), mesothelin, and ephrin-B1 (EFNB1). Although estrogen receptor was expressed in both the ovarian and breast cancers, genes that are coregulated with the estrogen receptor in breast cancers, including GATA-3, LIV-1, and X-box binding protein 1, did not show a similar pattern of coexpression in the ovarian cancers.

    View details for PubMedID 12960427

  • Changes in gene expression in intermediate endpoints of gastric cancer: A randomized, placebo-controlled trial of Helicobacter pylori eradication therapy. 2nd Annual Conference on Frontiers in Cancer Prevention Research Tsai, C. J., Yang, S. F., Tibshirani, R. J., Guarner, J., Mohar, A., Herrera-Goepfert, R., Parsonnet, J. AMER ASSOC CANCER RESEARCH. 2003: 1280S–1280S
  • Characterization of variant patterns of nodular lymphocyte predominant Hodgkin lymphoma with immunohistologic and clinical correlation AMERICAN JOURNAL OF SURGICAL PATHOLOGY Fan, Z., Natkunam, Y., Bair, E., Tibshirani, R., Warnke, R. A. 2003; 27 (10): 1346-1356


    Nodular lymphocyte predominant Hodgkin lymphoma (NLPHL) has traditionally been recognized as having two morphologic patterns, nodular and diffuse, and the current WHO definition of NLPHL requires at least a partial nodular pattern. Variant patterns have not been well documented. We analyzed retrospectively the morphologic and immunophenotypic patterns of NLPHL from 118 patients (total of 137 biopsy samples). Histology plus antibodies directed against CD20, CD3, and CD21 were used to evaluate the immunoarchitecture. We identified six distinct immunoarchitectural patterns in our cases of NLPHL: "classic" (B-cell-rich) nodular, serpiginous/interconnected nodular, nodular with prominent extranodular L&H cells, T-cell-rich nodular, diffuse with a T-cell-rich background (T-cell-rich B-cell lymphoma [TCRBCL]-like), and a (diffuse) B-cell-rich pattern. Small germinal centers within neoplastic nodules were found in approximately 15% of cases, a finding not previously emphasized in NLPHL. Prominent sclerosis was identified in approximately 20% of cases and was frequently seen in recurrent disease. Clinical follow-up was obtained on 56 patients, including 26 patients who had not had recurrence of disease and 30 patients who had recurrence. The follow-up period was 5 months to 16 years (median 2.5 years). The presence of a diffuse (TCRBCL-like) pattern was significantly more common in patients with recurrent disease than those without recurrence. Furthermore, the presence of a diffuse pattern (TCRBCL-like) was shown to be an independent predictor of recurrent disease (P = 0.00324). In addition, there is a tendency for progression to an increasingly more diffuse pattern over time. Analysis of sequential biopsies from patients with recurrent disease suggests that the presence of prominent extranodular L&H cells might represent early evolution to a diffuse (TCRBCL-like) pattern. We also report three patients who presented initially with diffuse large B-cell lymphoma and later developed NLPHL.

    View details for PubMedID 14508396

  • Statistical significance for genomewide studies PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Storey, J. D., Tibshirani, R. 2003; 100 (16): 9440-9445


    With the increase in genomewide experiments and the sequencing of multiple genomes, the analysis of large data sets has become commonplace in biology. It is often the case that thousands of features in a genomewide data set are tested against some null hypothesis, where a number of features are expected to be significant. Here we propose an approach to measuring statistical significance in these genomewide studies based on the concept of the false discovery rate. This approach offers a sensible balance between the number of true and false positives that is automatically calibrated and easily interpreted. In doing so, a measure of statistical significance called the q value is associated with each tested feature. The q value is similar to the well known p value, except it is a measure of significance in terms of the false discovery rate rather than the false positive rate. Our approach avoids a flood of false positive results, while offering a more liberal criterion than what has been used in genome scans for linkage.

    View details for DOI 10.1073/pnas.1530509100

    View details for Web of Science ID 000184620000062

    View details for PubMedID 12883005
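
    The q value described in the abstract above can be illustrated with a short sketch. This is a simplified illustration, not the authors' implementation: it assumes the proportion of true null features is estimated at a single fixed tuning value (the hypothetical parameter `lam`, default 0.5), and the function name `q_values` is made up for this example.

```python
def q_values(p_values, lam=0.5):
    """Return one q value per p-value (simplified sketch, fixed `lam`)."""
    m = len(p_values)
    if m == 0:
        return []
    # Assumption: estimate the proportion of true nulls (pi0) from the
    # p-values that fall above the fixed cutoff `lam`, capped at 1.
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1.0 - lam)))
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q value is the minimum
    # estimated false discovery rate over all thresholds at or above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q


if __name__ == "__main__":
    pvals = [0.01, 0.02, 0.03, 0.04, 0.6, 0.7, 0.8, 0.9]
    for p, q in zip(pvals, q_values(pvals)):
        print(f"p = {p:.2f}  q = {q:.3f}")
```

    Calling a feature significant when its q value is at most 0.05 then corresponds to an estimated false discovery rate of about 5% among the features called significant, under the assumptions of this sketch.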

  • Repeated observation of breast tumor subtypes in independent gene expression data sets PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Sorlie, T., Tibshirani, R., Parker, J., Hastie, T., Marron, J. S., Nobel, A., Deng, S., Johnsen, H., Pesich, R., Geisler, S., Demeter, J., Perou, C. M., Lonning, P. E., Brown, P. O., Borresen-Dale, A. L., Botstein, D. 2003; 100 (14): 8418-8423


    Characteristic patterns of gene expression measured by DNA microarrays have been used to classify tumors into clinically relevant subgroups. In this study, we have refined the previously defined subtypes of breast tumors that could be distinguished by their distinct patterns of gene expression. A total of 115 malignant breast tumors were analyzed by hierarchical clustering based on patterns of expression of 534 "intrinsic" genes and shown to subdivide into one basal-like, one ERBB2-overexpressing, two luminal-like, and one normal breast tissue-like subgroup. The genes used for classification were selected based on their similar expression levels between pairs of consecutive samples taken from the same tumor separated by 15 weeks of neoadjuvant treatment. Similar cluster analyses of two published, independent data sets representing different patient cohorts from different laboratories, uncovered some of the same breast cancer subtypes. In the one data set that included information on time to development of distant metastasis, subtypes were associated with significant differences in this clinical feature. By including a group of tumors from BRCA1 carriers in the analysis, we found that this genotype predisposes to the basal tumor subtype. Our results strongly support the idea that many of these breast tumor subtypes represent biologically distinct disease entities.

    View details for DOI 10.1073/pnas.0932692100

    View details for PubMedID 12829800

  • Note on "Comparison of model selection for regression" by Vladimir Cherkassky and Yunqian Ma NEURAL COMPUTATION Hastie, T., Tibshirani, R., Friedman, J. 2003; 15 (7): 1477-1480


    While Cherkassky and Ma (2003) raise some interesting issues in comparing techniques for model selection, their article appears to be written largely in protest of comparisons made in our book, Elements of Statistical Learning (2001). Cherkassky and Ma feel that we falsely represented the structural risk minimization (SRM) method, which they defend strongly here. In a two-page section of our book (pp. 212-213), we made an honest attempt to compare the SRM method with two related techniques, Akaike information criterion (AIC) and Bayesian information criterion (BIC). Apparently, we did not apply SRM in the optimal way. We are also accused of using contrived examples, designed to make SRM look bad. Alas, we did introduce some careless errors in our original simulation--errors that were corrected in the second and subsequent printings. Some of these errors were pointed out to us by Cherkassky and Ma (we supplied them with our source code), and as a result we replaced the assessment "SRM performs poorly overall" with a more moderate "the performance of SRM is mixed" (p. 212).

    View details for PubMedID 12816562

  • Class prediction by nearest shrunken centroids, with applications to DNA microarrays STATISTICAL SCIENCE Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G. 2003; 18 (1): 104-117
  • HGAL is a novel interleukin-4-inducible gene that strongly predicts survival in diffuse large B-cell lymphoma BLOOD Lossos, I. S., Alizadeh, A. A., Rajapaksa, R., Tibshirani, R., Levy, R. 2003; 101 (2): 433-440


    We have cloned and characterized a novel human gene, HGAL (human germinal center-associated lymphoma), which predicts outcome in patients with diffuse large B-cell lymphoma (DLBCL). The HGAL gene comprises 6 exons and encodes a cytoplasmic protein of 178 amino acids that contains an immunoreceptor tyrosine-based activation motif (ITAM). It is highly expressed in germinal center (GC) lymphocytes and GC-derived lymphomas and is homologous to the mouse GC-specific gene M17. Expression of the HGAL gene is specifically induced in B cells by interleukin-4 (IL-4). Patients with DLBCL expressing high levels of HGAL mRNA demonstrate significantly longer overall survival than do patients with low HGAL expression. This association was independent of the clinical international prognostic index. High HGAL mRNA expression should be used as a prognostic factor in DLBCL.

    View details for PubMedID 12509382

  • Statistical methods for identifying differentially expressed genes in DNA microarrays. Methods in molecular biology (Clifton, N.J.) Storey, J. D., Tibshirani, R. 2003; 224: 149-157

    View details for PubMedID 12710672

  • Expression of cytokeratins 17 and 5 identifies a group of breast carcinomas with poor clinical outcome AMERICAN JOURNAL OF PATHOLOGY van de Rijn, M., Perou, C. M., Tibshirani, R., Haas, P., Kallioniemi, C., Kononen, J., Torhorst, J., Sauter, G., Zuber, M., Kochli, O. R., Mross, F., Dieterich, H., Seitz, R., Ross, D., Botstein, D., Brown, P. 2002; 161 (6): 1991-1996


    While several prognostic factors have been identified in breast carcinoma, the clinical outcome remains hard to predict for individual patients. Better predictive markers are needed to help guide difficult treatment decisions. In a previous study of 78 breast carcinoma specimens, we noted an association between poor clinical outcome and the expression of cytokeratin 17 and/or cytokeratin 5 mRNAs. Here we describe the results of immunohistochemistry studies using monoclonal antibodies against these markers to analyze more than 600 paraffin-embedded breast tumors in tissue microarrays. We found that expression of cytokeratin 17 and/or cytokeratin 5/6 in tumor cells was associated with a poor clinical outcome. Moreover, multivariate analysis showed that in node-negative breast carcinoma, expression of these cytokeratins was a prognostic factor independent of tumor size and tumor grade.

    View details for PubMedID 12466114

  • Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Pollack, J. R., Sorlie, T., Perou, C. M., Rees, C. A., Jeffrey, S. S., Lonning, P. E., Tibshirani, R., Botstein, D., Borresen-Dale, A. L., Brown, P. O. 2002; 99 (20): 12963-12968


    Genomic DNA copy number alterations are key genetic events in the development and progression of human cancers. Here we report a genome-wide microarray comparative genomic hybridization (array CGH) analysis of DNA copy number variation in a series of primary human breast tumors. We have profiled DNA copy number alteration across 6,691 mapped human genes, in 44 predominantly advanced, primary breast tumors and 10 breast cancer cell lines. While the overall patterns of DNA amplification and deletion corroborate previous cytogenetic studies, the high-resolution (gene-by-gene) mapping of amplicon boundaries and the quantitative analysis of amplicon shape provide significant improvement in the localization of candidate oncogenes. Parallel microarray measurements of mRNA levels reveal the remarkable degree to which variation in gene copy number contributes to variation in gene expression in tumor cells. Specifically, we find that 62% of highly amplified genes show moderately or highly elevated expression, that DNA copy number influences gene expression across a wide range of DNA copy number alterations (deletion, low-, mid- and high-level amplification), that on average, a 2-fold change in DNA copy number is associated with a corresponding 1.5-fold change in mRNA levels, and that overall, at least 12% of all the variation in gene expression among the breast tumors is directly attributable to underlying variation in gene copy number. These findings provide evidence that widespread DNA copy number alteration can lead directly to global deregulation of gene expression, which may contribute to the development or progression of cancer.

    View details for DOI 10.1073/pnas.162471999

    View details for PubMedID 12297621

  • Empirical Bayes methods and false discovery rates for microarrays GENETIC EPIDEMIOLOGY Efron, B., Tibshirani, R. 2002; 23 (1): 70-86


    In a classic two-sample problem, one might use Wilcoxon's statistic to test for a difference between treatment and control subjects. The analogous microarray experiment yields thousands of Wilcoxon statistics, one for each gene on the array, and confronts the statistician with a difficult simultaneous inference situation. We will discuss two inferential approaches to this problem: an empirical Bayes method that requires very little a priori Bayesian modeling, and the frequentist method of "false discovery rates" proposed by Benjamini and Hochberg in 1995. It turns out that the two methods are closely related and can be used together to produce sensible simultaneous inferences.

    View details for DOI 10.1002/gepi.01124

    View details for PubMedID 12112249

  • Diagnosis of multiple cancer types by shrunken centroids of gene expression PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G. 2002; 99 (10): 6567-6572


    We have devised an approach to cancer class prediction from gene expression profiling, based on an enhancement of the simple nearest prototype (centroid) classifier. We shrink the prototypes and hence obtain a classifier that is often more accurate than competing methods. Our method of "nearest shrunken centroids" identifies subsets of genes that best characterize each class. The technique is general and can be used in many other classification problems. To demonstrate its effectiveness, we show that the method was highly efficient in finding genes for classifying small round blue cell tumors and leukemias.

    View details for PubMedID 12011421
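
    The idea of shrinking class prototypes, summarized in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the published method or its software: the function names, the scaling factor `m_k`, and the default offset `s0` (taken here as the median within-class standard deviation) are choices made for this example.

```python
import math
import statistics


def soft_threshold(value, delta):
    """Move `value` toward zero by `delta`, stopping at zero."""
    return math.copysign(max(abs(value) - delta, 0.0), value)


def fit_shrunken_centroids(X, y, delta, s0=None):
    """X: list of samples (lists of feature values); y: class labels."""
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    overall = [sum(row[j] for row in X) / n for j in range(p)]
    members = {k: [r for r, lab in zip(X, y) if lab == k] for k in classes}
    centroids = {k: [sum(r[j] for r in rows) / len(rows) for j in range(p)]
                 for k, rows in members.items()}
    # Pooled within-class standard deviation of each feature.
    sds = []
    for j in range(p):
        ss = sum((r[j] - centroids[lab][j]) ** 2 for r, lab in zip(X, y))
        sds.append(math.sqrt(ss / (n - len(classes))))
    if s0 is None:
        s0 = statistics.median(sds)  # assumption: offset guards tiny variances
    scale = [s + s0 for s in sds]
    shrunken = {}
    for k, rows in members.items():
        m_k = math.sqrt(1.0 / len(rows) - 1.0 / n)  # assumed scaling factor
        shrunken[k] = []
        for j in range(p):
            # Standardized distance of the class centroid from the overall
            # centroid, shrunk toward zero by soft thresholding.
            d = (centroids[k][j] - overall[j]) / (m_k * scale[j])
            shrunken[k].append(
                overall[j] + m_k * scale[j] * soft_threshold(d, delta))
    priors = {k: len(rows) / n for k, rows in members.items()}
    return {"centroids": shrunken, "scale": scale,
            "priors": priors, "overall": overall}


def predict(model, x):
    """Assign `x` to the class with the nearest shrunken centroid."""
    def score(k):
        dist = sum(((xj - cj) / sj) ** 2 for xj, cj, sj in
                   zip(x, model["centroids"][k], model["scale"]))
        return dist - 2.0 * math.log(model["priors"][k])
    return min(model["centroids"], key=score)
```

    A feature whose standardized difference is shrunk to zero in every class has the same centroid value everywhere and so no longer affects the classification, which is how thresholding selects a subset of genes in this sketch.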

  • Precision and functional specificity in mRNA decay PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Wang, Y. L., Liu, C. L., Storey, J. D., Tibshirani, R. J., Herschlag, D., Brown, P. O. 2002; 99 (9): 5860-5865


    Posttranscriptional processing of mRNA is an integral component of the gene expression program. By using DNA microarrays, we precisely measured the decay of each yeast mRNA, after thermal inactivation of a temperature-sensitive RNA polymerase II. The half-lives varied widely, ranging from approximately 3 min to more than 90 min. We found no simple correlation between mRNA half-lives and ORF size, codon bias, ribosome density, or abundance. However, the decay rates of mRNAs encoding groups of proteins that act together in stoichiometric complexes were generally closely matched, and other evidence pointed to a more general relationship between physiological function and mRNA turnover rates. The results provide strong evidence that precise control of the decay of each mRNA is a fundamental feature of the gene expression program in yeast.

    View details for DOI 10.1073/pnas.092538799

    View details for PubMedID 11972065

  • Exploratory screening of genes and clusters from microarray experiments STATISTICA SINICA Tibshirani, R., Hastie, T., Narasimhan, B., Eisen, M., Sherlock, G., Brown, P., Botstein, D. 2002; 12 (1): 47-59
  • Transcriptional programs activated by exposure of human prostate cancer cells to androgen GENOME BIOLOGY DePrimo, S. E., Diehn, M., Nelson, J. B., Reiter, R. E., Matese, J., Fero, M., Tibshirani, R., Brown, P. O., Brooks, J. D. 2002; 3 (7)


    Androgens are required for both normal prostate development and prostate carcinogenesis. We used DNA microarrays, representing approximately 18,000 genes, to examine the temporal program of gene expression following treatment of the human prostate cancer cell line LNCaP with a synthetic androgen. We observed statistically significant changes in levels of transcripts of more than 500 genes. Many of these genes were previously reported androgen targets, but most were not previously known to be regulated by androgens. The androgen-induced expression programs in three additional androgen-responsive human prostate cancer cell lines, and in four androgen-independent subclones derived from LNCaP, shared many features with those observed in LNCaP, but some differences were observed. A remarkable fraction of the genes induced by androgen appeared to be related to production of seminal fluid and these genes included many with roles in protein folding, trafficking, and secretion. Prostate cancer cell lines retain features of androgen responsiveness that reflect normal prostatic physiology. These results provide a broad view of the effect of androgen signaling on the transcriptional program in these cancer cells, and a foundation for further studies of androgen action.

    View details for PubMedID 12184806

  • Pre-validation and inference in microarrays. Statistical applications in genetics and molecular biology Tibshirani, R. J., Efron, B. 2002; 1: Article 1


    In microarray studies, an important problem is to compare a predictor of disease outcome derived from gene expression levels to standard clinical predictors. Comparing them on the same dataset that was used to derive the microarray predictor can lead to results strongly biased in favor of the microarray predictor. We propose a new technique called "pre-validation" for making a fairer comparison between the two sets of predictors. We study the method analytically and explore its application in a recent study on breast cancer.

    View details for PubMedID 16646777

  • Supervised learning from microarray data 15th Biannual Conference on Computational Statistics (COMPSTAT) Hastie, T., Tibshirani, R., Narasimhan, B., Chu, G. PHYSICA-VERLAG GMBH & CO. 2002: 67–77
  • Empirical Bayes analysis of a microarray experiment 160th Annual Meeting of the American-Statistical-Association Efron, B., Tibshirani, R., Storey, J. D., Tusher, V. AMER STATISTICAL ASSOC. 2001: 1151–60
  • Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Sorlie, T., Perou, C. M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., Thorsen, T., Quist, H., Matese, J. C., Brown, P. O., Botstein, D., Lonning, P. E., Borresen-Dale, A. L. 2001; 98 (19): 10869-10874


    The purpose of this study was to classify breast carcinomas based on variations in gene expression patterns derived from cDNA microarrays and to correlate tumor characteristics to clinical outcome. A total of 85 cDNA microarray experiments representing 78 cancers, three fibroadenomas, and four normal breast tissues were analyzed by hierarchical clustering. As reported previously, the cancers could be classified into a basal epithelial-like group, an ERBB2-overexpressing group and a normal breast-like group based on variations in gene expression. A novel finding was that the previously characterized luminal epithelial/estrogen receptor-positive group could be divided into at least two subgroups, each with a distinctive expression profile. These subtypes proved to be reasonably robust by clustering using two different gene sets: first, a set of 456 cDNA clones previously selected to reflect intrinsic properties of the tumors and, second, a gene set that highly correlated with patient outcome. Survival analyses on a subcohort of patients with locally advanced breast cancer uniformly treated in a prospective study showed significantly different outcomes for the patients belonging to the various groups, including a poor prognosis for the basal-like subtype and a significant difference in outcome for the two estrogen receptor-positive groups.

    View details for Web of Science ID 000170966800067

    View details for PubMedID 11553815

    View details for PubMedCentralID PMC58566

  • Expression of a single gene, BCL-6, strongly predicts survival in patients with diffuse large B-cell lymphoma BLOOD Lossos, I. S., Jones, C. D., Warnke, R., Natkunam, Y., Kaizer, H., Zehnder, J. L., Tibshirani, R., Levy, R. 2001; 98 (4): 945-951


    Diffuse large B-cell lymphoma (DLBCL) is characterized by a marked degree of morphologic and clinical heterogeneity. Establishment of parameters that can predict outcome could help to identify patients who may benefit from risk-adjusted therapies. BCL-6 is a proto-oncogene commonly implicated in DLBCL pathogenesis. A real-time reverse transcription-polymerase chain reaction assay was established for accurate and reproducible determination of BCL-6 mRNA expression. The method was applied to evaluate the prognostic significance of BCL-6 expression in DLBCL. BCL-6 mRNA expression was assessed in tumor specimens obtained at the time of diagnosis from 22 patients with primary DLBCL. All patients were subsequently treated with anthracycline-based chemotherapy regimens. These patients could be divided into 2 DLBCL subgroups, one with high BCL-6 gene expression whose median overall survival (OS) time was 171 months and the other with low BCL-6 gene expression whose median OS was 24 months (P = .007). BCL-6 gene expression also predicted OS in an independent validation set of 39 patients with primary DLBCL (P = .01). BCL-6 protein expression, assessed by immunohistochemistry, also predicted longer OS in patients with DLBCL. BCL-6 gene expression was an independent survival predicting factor in multivariate analysis together with the elements of the International Prognostic Index (IPI) (P = .038). By contrast, the aggregate IPI score did not add further prognostic information to the patients' stratification by BCL-6 gene expression. High BCL-6 mRNA expression should be considered a new favorable prognostic factor in DLBCL and should be used in the stratification and the design of risk-adjusted therapies for patients with DLBCL. (Blood. 2001;98:945-951)

    View details for PubMedID 11493437

  • Missing value estimation methods for DNA microarrays BIOINFORMATICS Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R. B. 2001; 17 (6): 520-525


    Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1-20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.

    View details for PubMedID 11395428
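
The weighted K-nearest-neighbors idea named in the abstract above can be illustrated with a minimal sketch. This is not the published KNNimpute code: the function name `knn_impute`, the root-mean-square distance over shared columns, and the inverse-distance weights are all assumptions made for illustration.

```python
import numpy as np

def knn_impute(X, k=3, eps=1e-12):
    """Fill NaNs in a genes-by-arrays matrix from the k most similar rows.

    Assumed details (not taken from the paper): similarity is the
    root-mean-square difference over columns observed in both rows,
    and neighbors are averaged with inverse-distance weights.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i, j in zip(*np.nonzero(missing)):
        candidates = []
        for r in range(X.shape[0]):
            # A neighbor must be a different row with column j observed.
            if r == i or missing[r, j]:
                continue
            shared = ~missing[i] & ~missing[r]
            if not shared.any():
                continue
            d = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
            candidates.append((d, X[r, j]))
        if not candidates:
            continue  # no usable neighbor: leave the value missing
        candidates.sort(key=lambda t: t[0])
        d, v = np.array(candidates[:k]).T
        w = 1.0 / (d + eps)  # closer rows get larger weights
        filled[i, j] = np.sum(w * v) / np.sum(w)
    return filled
```

Distances are always computed from the original matrix, so imputed values never influence other imputations in this sketch.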

  • Significance analysis of microarrays applied to the ionizing radiation response PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Tusher, V. G., Tibshirani, R., Chu, G. 2001; 98 (9): 5116-5121


    Microarrays can measure the expression of thousands of genes to identify changes in expression between different biological states. Methods are needed to determine the significance of these changes while accounting for the enormous number of genes. We describe a method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared with FDRs of 60 and 84% by using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation and 3 in apoptosis. Surprisingly, four nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation.

    View details for PubMedID 11309499
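
The two steps described in the abstract above (a per-gene score relative to the standard deviation of repeated measurements, then permutations to estimate the FDR) can be sketched as follows. This is a simplified two-group illustration, not the SAM software: the function names, the small constant `s0` added to the denominator, and the fixed absolute-score threshold are assumptions.

```python
import numpy as np

def relative_difference(X, labels, s0=0.1):
    """Per-gene score: difference in group means over (pooled std. error + s0)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    a, b = X[:, labels], X[:, ~labels]
    n1, n2 = a.shape[1], b.shape[1]
    diff = a.mean(axis=1) - b.mean(axis=1)
    ss = ((a - a.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) \
       + ((b - b.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    s = np.sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return diff / (s + s0)

def estimate_fdr(X, labels, threshold, n_perm=200, s0=0.1, seed=0):
    """Return (number of genes called, permutation-based FDR estimate)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)
    observed = np.abs(relative_difference(X, labels, s0))
    n_called = int((observed > threshold).sum())
    if n_called == 0:
        return 0, float("nan")
    false_counts = []
    for _ in range(n_perm):
        perm = rng.permutation(labels)  # shuffle the group labels
        scores = np.abs(relative_difference(X, perm, s0))
        false_counts.append((scores > threshold).sum())
    # Average count of genes passing by chance, relative to genes called.
    return n_called, min(1.0, float(np.mean(false_counts)) / n_called)
```

Raising `threshold` calls fewer genes and usually lowers the estimated FDR, which is the trade-off the adjustable threshold in the abstract controls.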

  • Supervised harvesting of expression trees GENOME BIOLOGY Hastie, T., Tibshirani, R., Botstein, D., Brown, P. 2001; 2 (1)


    We propose a new method for supervised learning from gene expression data. We call it 'tree harvesting'. This technique starts with a hierarchical clustering of genes, then models the outcome variable as a sum of the average expression profiles of chosen clusters and their products. It can be applied to many different kinds of outcome measures such as censored survival times, or a response falling in two or more classes (for example, cancer classes). The method can discover genes that have strong effects on their own, and genes that interact with other genes. We illustrate the method on data from a lymphoma study, and on a dataset containing samples from eight different cancers. It identified some potentially interesting gene clusters. In simulation studies we found that the procedure may require a large number of experimental samples to successfully discover interactions. Tree harvesting is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worthy of further investigation.

    View details for PubMedID 11178280

  • Estimating the number of clusters in a data set via the gap statistic JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Tibshirani, R., Walther, G., Hastie, T. 2001; 63: 411-423
  • The inference of antigen selection on Ig genes JOURNAL OF IMMUNOLOGY Lossos, I. S., Tibshirani, R., Narasimhan, B., Levy, R. 2000; 165 (9): 5122-5126


    Analysis of somatic mutations in V regions of Ig genes is important for understanding various biological processes. It is customary to estimate Ag selection on Ig genes by assessment of replacement (R) as opposed to silent (S) mutations in the complementarity-determining regions and S as opposed to R mutations in the framework regions. In the past such an evaluation was performed using a binomial distribution model equation, which is inappropriate for Ig genes in which mutations have four different distribution possibilities (R and S mutations in the complementarity-determining region and/or framework regions of the gene). In the present work, we propose a multinomial distribution model for assessment of Ag selection. Side-by-side application of multinomial and binomial models on 86 previously established Ig sequences disclosed 8 discrepancies, leading to opposite statistical conclusions about Ag selection. We suggest the use of the multinomial model for all future analysis of Ag selection.

    View details for Web of Science ID 000090076000047

    View details for PubMedID 11046043

  • Bayesian backfitting STATISTICAL SCIENCE Hastie, T., Tibshirani, R. 2000; 15 (3): 196-213
  • Bayesian backfitting - Comments and rejoinder STATISTICAL SCIENCE Cook, R. D., Pardoe, L., Gelfand, A. E., Green, P. J., Hastie, T., Tibshirani, R. 2000; 15 (3): 213-223
  • Additive logistic regression: A statistical view of boosting ANNALS OF STATISTICS Friedman, J., Hastie, T., Tibshirani, R. 2000; 28 (2): 337-374
  • Molecular analysis of immunoglobulin genes in diffuse large B-cell lymphomas BLOOD Lossos, I. S., Okada, C. Y., Tibshirani, R., Warnke, R., Vose, J. M., Greiner, T. C., Levy, R. 2000; 95 (5): 1797-1803


    Diffuse large B-cell lymphoma (DLBCL) is a common type of non-Hodgkin's lymphoma (NHL) that is highly heterogeneous from both clinical and histopathologic viewpoints. The immunoglobulin (Ig) heavy (H) chain variable region genes were examined in 71 patients with untreated primary DLBCL. Fifty-eight potentially functional V(H) genes were detected in 53 DLBCL cases; V(H) genes were nonfunctional in 9 cases and were not detected in an additional 9 cases. The use of V(H) gene families by DLBCL tumors was unbiased without overrepresentation of any particular V(H) gene or gene family. Analysis of Ig mutations in comparison to the most closely related germline gene disclosed mutated V(H) genes in all but 1 DLBCL case. More than 2% difference from the most similar germline sequence was detected in 52 potentially functional and the 8 nonfunctional V(H) gene sequences, whereas less than 2% difference from the germline sequence was observed in 3 V(H) gene isolates. Only 3 V(H) gene isolates were unmutated. No correlation was found between V(H) gene use, mutation level, and International Prognostic Index (IPI) or survival. Six of 8 tested tumors showed evidence of ongoing somatic mutations. Evidence for positive or negative antigen selection pressure was observed in 65% of mutated DLBCL cases. Our findings indicate that the etiology and the driving forces for clonal expansion are heterogeneous, which may explain the well-known clinical and pathologic heterogeneity of DLBCL. (Blood. 2000;95:1797-1803)

    View details for Web of Science ID 000085564700037

    View details for PubMedID 10688840

  • Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling NATURE Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. G., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L. M., Marti, G. E., Moore, T., Hudson, J., Lu, L. S., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O., Staudt, L. M. 2000; 403 (6769): 503-511


    Diffuse large B-cell lymphoma (DLBCL), the most common subtype of non-Hodgkin's lymphoma, is clinically heterogeneous: 40% of patients respond well to current therapy and have prolonged survival, whereas the remainder succumb to the disease. We proposed that this variability in natural history reflects unrecognized molecular heterogeneity in the tumours. Using DNA microarrays, we have conducted a systematic characterization of gene expression in B-cell malignancies. Here we show that there is diversity in gene expression among the tumours of DLBCL patients, apparently reflecting the variation in tumour proliferation rate, host response and differentiation state of the tumour. We identified two molecularly distinct forms of DLBCL which had gene expression patterns indicative of different stages of B-cell differentiation. One type expressed genes characteristic of germinal centre B cells ('germinal centre B-like DLBCL'); the second type expressed genes normally induced during in vitro activation of peripheral blood B cells ('activated B-like DLBCL'). Patients with germinal centre B-like DLBCL had a significantly better overall survival than those with activated B-like DLBCL. The molecular classification of tumours on the basis of gene expression can thus identify previously undetected and clinically significant subtypes of cancer.

    View details for PubMedID 10676951

  • 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome biology Hastie, T., Tibshirani, R., Eisen, M. B., Alizadeh, A., Levy, R., Staudt, L., Chan, W. C., Botstein, D., Brown, P. 2000; 1 (2): RESEARCH0003


    Large gene expression studies, such as those conducted using DNA arrays, often provide millions of different pieces of data. To address the problem of analyzing such data, we describe a statistical method, which we have called 'gene shaving'. The method identifies subsets of genes with coherent expression patterns and large variation across conditions. Gene shaving differs from hierarchical clustering and other widely used methods for analyzing gene expression studies in that genes may belong to more than one cluster, and the clustering may be supervised by an outcome measure. The technique can be 'unsupervised', that is, the genes and samples are treated as unlabeled, or partially or fully supervised by using known properties of the genes or samples to assist in finding meaningful groupings. We illustrate the use of the gene shaving method to analyze gene expression measurements made on samples from patients with diffuse large B-cell lymphoma. The method identifies a small cluster of genes whose expression is highly predictive of survival. The gene shaving method is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worth further investigation.

    View details for PubMedID 11178228

  • Model search by bootstrap "bumping" JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Tibshirani, R., Knight, K. 1999; 8 (4): 671-686
  • Statistical measures for the computer-aided diagnosis of mammographic masses JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Hastie, T., Ikeda, D., Tibshirani, R. 1999; 8 (3): 531-543
  • The covariance inflation criterion for adaptive model selection JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Tibshirani, R., Knight, K. 1999; 61: 529-546
  • The problem of regions ANNALS OF STATISTICS Efron, B., Tibshirani, R. 1998; 26 (5): 1687-1718
  • Classification by pairwise coupling ANNALS OF STATISTICS Hastie, T., Tibshirani, R. 1998; 26 (2): 451-471
  • Classification by pairwise coupling 11th Annual Conference on Neural Information Processing Systems (NIPS) Hastie, T., Tibshirani, R. MIT PRESS. 1998: 507–513
  • Improvements on cross-validation: The .632+ bootstrap method JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Efron, B., Tibshirani, R. 1997; 92 (438): 548-560
  • The lasso method for variable selection in the cox model STATISTICS IN MEDICINE Tibshirani, R. 1997; 16 (4): 385-395


    I propose a new method for variable selection and shrinkage in Cox's proportional hazards model. My proposal maximizes the log partial likelihood subject to the sum of the absolute values of the parameters being bounded by a constant. Because of the nature of this constraint, it shrinks coefficients and produces some coefficients that are exactly zero. As a result it reduces the estimation variance while providing an interpretable final model. The method is a variation of the 'lasso' proposal of Tibshirani, designed for the linear regression context. Simulations indicate that the lasso can be more accurate than stepwise selection in this setting.

    View details for Web of Science ID A1997WK01900006

    View details for PubMedID 9044528
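
The constrained problem described in the abstract above can be written out as follows. The notation is an assumption introduced here for illustration ($x_i$ for covariates, $\delta_i$ for the event indicator, $R_i$ for the risk set at the $i$-th event time, $s$ for the bound), and the partial likelihood is shown in its standard form without tied event times.

```latex
\hat{\beta} \;=\; \arg\max_{\beta}\; \ell(\beta)
\quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s,
\qquad
\ell(\beta) \;=\; \sum_{i:\,\delta_i = 1}
\Bigl[\, x_i^{\top}\beta \;-\; \log \sum_{k \in R_i} \exp\bigl(x_k^{\top}\beta\bigr) \Bigr].
```

Maximizing $\ell(\beta)$ is the same as minimizing $-\ell(\beta)$. Smaller values of $s$ shrink the coefficients more strongly and set more of them exactly to zero.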

  • Association between cellular phones and car collisions New Engl. J. Med. Tibshirani, R., Redelmeier, D. 1997
  • Using specially designed exponential families for density estimation ANNALS OF STATISTICS Efron, B., Tibshirani, R. 1996; 24 (6): 2431-2461
  • Discriminant adaptive nearest neighbor classification and regression 9th Annual Conference on Neural Information Processing Systems (NIPS) Hastie, T., Tibshirani, R. MIT PRESS. 1996: 409–415
  • Generalized additive models for medical research. Statistical methods in medical research Hastie, T., Tibshirani, R. 1995; 4 (3): 187-196


    This article reviews flexible statistical methods that are useful for characterizing the effect of potential prognostic factors on disease endpoints. Applications to survival models and binary outcome models are illustrated.

    View details for PubMedID 8548102

  • Flexible discriminant analysis J. Amer. Statist. Assoc. Tibshirani, R., Hastie, T., Buja, A. 1994
  • An Introduction to the Bootstrap Chapman and Hall, New York and London. Tibshirani, R., Efron, B. 1993
  • Generalized additive models Chapman and Hall, London. Tibshirani, R., Hastie, T. 1990