Bio


Robert Tibshirani's main interests are in applied statistics, biostatistics, and data mining. He is co-author of the books "Generalized Additive Models" (with Trevor Hastie, Stanford), "An Introduction to the Bootstrap" (with Brad Efron, Stanford), and "Elements of Statistical Learning" (with Trevor Hastie and Jerry Friedman, Stanford). His current research focuses on problems in biology and genomics, medicine, and industry. With Stanford collaborator Balasubramanian Narasimhan, he also develops software packages for genomics and proteomics.

Administrative Appointments


  • Professor, Department of Biomedical Data Science and Department of Statistics, Stanford University (2015 - Present)
  • Professor, Department of Health Research and Policy and Department of Statistics, Stanford University (1998 - 2015)
  • Professor, Department of Public Health Sciences and Department of Statistics, University of Toronto (1994 - 1998)
  • Associate Professor, Department of Statistics, University of Toronto (1989 - 1994)
  • Associate Professor, Department of Preventive Medicine and Biostatistics, University of Toronto (1989 - 1994)
  • Assistant Professor, Department of Statistics, University of Toronto (1985 - 1989)
  • Assistant Professor, Department of Preventive Medicine and Biostatistics, University of Toronto (1985 - 1989)

Honors & Awards


  • Doctor Honoris Causa, University of Waterloo (2018)
  • Elected Member, National Academy of Sciences (2012)
  • Gold Medal, Statistical Society of Canada (2012)
  • Alumni Achievement Award, University of Waterloo (2006)
  • Fellow, Royal Society of Canada (2001)
  • CRM-SSC Prize in Statistics, Statistical Society of Canada (2000)
  • E.W. Steacie Memorial Fellowship, Natural Sciences and Engineering Research Council of Canada (1997)
  • Presidents' Award, Committee of Presidents of Statistical Societies (1996)
  • Guggenheim Fellowship, John Simon Guggenheim Memorial Foundation (1994)
  • Fellow, Institute of Mathematical Statistics (1993)
  • Fellow, American Statistical Association (1992)

Boards, Advisory Committees, Professional Organizations


  • Associate Editor, Annals of Applied Statistics (2006 - Present)
  • Associate Editor, PLoS Biology (2001 - 2004)
  • Member, Screening Panel, National Science Foundation (1999)
  • Associate Editor, Annals of Statistics (1998 - Present)
  • Associate Editor, Statistical Science (1995 - Present)
  • Chair, Committee on Computerization, Institute of Mathematical Statistics (1995 - Present)
  • Associate Editor, Canadian Journal of Statistics (1995 - 1997)
  • Program Chair, Statistical Computing, American Statistical Association (1995 - 1996)
  • Annual Meeting Program Chair, Statistical Society of Canada (1994)
  • Series Editor, Computing and Graphics Monographs, Chapman & Hall (1994)
  • Council Member, Institute of Mathematical Statistics (1991 - 1994)
  • Member, Statistical Sciences Grant Selection Committee, Natural Sciences and Engineering Research Council of Canada (1989 - 1993)
  • Associate Editor, Canadian Journal of Statistics (1988 - 1991)
  • Associate Editor, Theory and Methods, Journal of the American Statistical Association (1986 - 1995)

Professional Education


  • B.Math., University of Waterloo, Statistics and Computer Science (1979)
  • M.Sc., University of Toronto, Statistics (1980)
  • Ph.D., Stanford University, Statistics (1984)

Current Research and Scholarly Interests


My research is in applied statistics and biostatistics. I specialize in computer-intensive methods for regression and classification, bootstrap, cross-validation and statistical inference, and signal and image analysis for medical diagnosis.

All Publications


  • Smooth Multi-Period Forecasting With Application to Prediction of COVID-19 Cases JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Tuzhilina, E., Hastie, T. J., McDonald, D. J., Tay, J., Tibshirani, R. 2024
  • Semi-supervised Cooperative Learning for Multiomics Data Fusion Ding, D., Shen, X., Snyder, M., Tibshirani, R., Maier, A. K., Schnabel, J. A., Tiwari, P., Stegle, O. SPRINGER INTERNATIONAL PUBLISHING AG. 2024: 54-63
  • Evaluating a shrinkage estimator for the treatment effect in clinical trials. Statistics in medicine van Zwet, E. W., Tian, L., Tibshirani, R. 2023

    Abstract

    The main objective of most clinical trials is to estimate the effect of some treatment compared to a control condition. We define the signal-to-noise ratio (SNR) as the ratio of the true treatment effect to the SE of its estimate. In a previous publication in this journal, we estimated the distribution of the SNR among the clinical trials in the Cochrane Database of Systematic Reviews (CDSR). We found that the SNR is often low, which implies that the power against the true effect is also low in many trials. Here we use the fact that the CDSR is a collection of meta-analyses to quantitatively assess the consequences. Among trials that have reached statistical significance we find considerable overoptimism of the usual unbiased estimator and under-coverage of the associated confidence interval. Previously, we have proposed a novel shrinkage estimator to address this "winner's curse." We compare the performance of our shrinkage estimator to the usual unbiased estimator in terms of the root mean squared error, the coverage and the bias of the magnitude. We find superior performance of the shrinkage estimator both conditionally and unconditionally on statistical significance.

    View details for DOI 10.1002/sim.9992

    View details for PubMedID 38111969

  • Multimodal Biomedical Data Fusion Using Sparse Canonical Correlation Analysis and Cooperative Learning: A Cohort Study on COVID-19. Research square Er, A. G., Ding, D. Y., Er, B., Uzun, M., Cakmak, M., Sadee, C., Durhan, G., Ozmen, M. N., Tanriover, M. D., Topeli, A., Son, Y. A., Tibshirani, R., Unal, S., Gevaert, O. 2023

    Abstract

    Through technological innovations, patient cohorts can be examined from multiple views with high-dimensional, multiscale biomedical data to classify clinical phenotypes and predict outcomes. Here, we aim to present our approach for analyzing multimodal data using unsupervised and supervised sparse linear methods in a COVID-19 patient cohort. This prospective cohort study of 149 adult patients was conducted in a tertiary care academic center. First, we used sparse canonical correlation analysis (CCA) to identify and quantify relationships across different data modalities, including viral genome sequencing, imaging, clinical data, and laboratory results. Then, we used cooperative learning to predict the clinical outcome of COVID-19 patients. We show that serum biomarkers representing severe disease and acute phase response correlate with original and wavelet radiomics features in the LLL frequency channel (corr(Xu1, Zv1) = 0.596, p-value < 0.001). Among radiomics features, histogram-based first-order features reporting the skewness, kurtosis, and uniformity have the lowest negative, whereas entropy-related features have the highest positive coefficients. Moreover, unsupervised analysis of clinical data and laboratory results gives insights into distinct clinical phenotypes. Leveraging the availability of global viral genome databases, we demonstrate that the Word2Vec natural language processing model can be used for viral genome encoding. It not only separates major SARS-CoV-2 variants but also allows the preservation of phylogenetic relationships among them. Our quadruple model using Word2Vec encoding achieves better prediction results in the supervised task. The model yields area under the curve (AUC) and accuracy values of 0.87 and 0.77, respectively. 
    Our study illustrates that sparse CCA analysis and cooperative learning are powerful techniques for handling high-dimensional, multimodal data to investigate multivariate associations in unsupervised and supervised tasks.

    View details for DOI 10.21203/rs.3.rs-3569833/v1

    View details for PubMedID 38045288

  • Public health factors help explain cross country heterogeneity in excess death during the COVID19 pandemic. Scientific reports Sun, M. W., Troxell, D., Tibshirani, R. 2023; 13 (1): 16196

    Abstract

    The COVID-19 pandemic has taken a devastating toll around the world. Since January 2020, the World Health Organization estimates 14.9 million excess deaths have occurred globally. Despite this grim number quantifying the deadly impact, the underlying factors contributing to COVID-19 deaths at the population level remain unclear. Prior studies indicate that demographic factors like proportion of population older than 65 and population health explain the cross-country difference in COVID-19 deaths. However, there has not been a comprehensive analysis including variables describing government policies and COVID-19 vaccination rate. Furthermore, prior studies focus on COVID-19 death rather than excess death to assess the impact of the pandemic. Through a robust statistical modeling framework, we analyze 80 countries and show that actionable public health efforts beyond just the factors intrinsic to each country are important for explaining the cross-country heterogeneity in excess death.

    View details for DOI 10.1038/s41598-023-43407-0

    View details for PubMedID 37758827

    View details for PubMedCentralID PMC10533501

  • Confidence intervals for the Cox model test error from cross-validation. Statistics in medicine Sun, M. W., Tibshirani, R. 2023

    Abstract

    Cross-validation (CV) is one of the most widely used techniques in statistical learning for estimating the test error of a model, but its behavior is not yet fully understood. It has been shown that standard confidence intervals for test error using estimates from CV may have coverage below nominal levels. This phenomenon occurs because each sample is used in both the training and testing procedures during CV and as a result, the CV estimates of the errors become correlated. Without accounting for this correlation, the estimate of the variance is smaller than it should be. One way to mitigate this issue is by estimating the mean squared error of the prediction error instead using nested CV. This approach has been shown to achieve superior coverage compared to intervals derived from standard CV. In this work, we generalize the nested CV idea to the Cox proportional hazards model and explore various choices of test error for this setting.

    View details for DOI 10.1002/sim.9873

    View details for PubMedID 37580906

  • Spatial proteomics reveals human microglial states shaped by anatomy and neuropathology. Research square Mrdjen, D., Amouzgar, M., Cannon, B., Liu, C., Spence, A., McCaffrey, E., Bharadwaj, A., Tebaykin, D., Bukhari, S., Hartmann, F. J., Kagel, A., Vijayaragavan, K., Oliveria, J. P., Yakabi, K., Serrano, G. E., Corrada, M. M., Kawas, C. H., Camacho, C., Bosse, M., Tibshirani, R., Beach, T. G., Angelo, M., Montine, T., Bendall, S. C. 2023

    Abstract

    Microglia are implicated in aging, neurodegeneration, and Alzheimer's disease (AD). Traditional, low-plex, imaging methods fall short of capturing in situ cellular states and interactions in the human brain. We utilized Multiplexed Ion Beam Imaging (MIBI) and data-driven analysis to spatially map proteomic cellular states and niches in healthy human brain, identifying a spectrum of microglial profiles, called the microglial state continuum (MSC). The MSC ranged from senescent-like to active proteomic states that were skewed across large brain regions and compartmentalized locally according to their immediate microenvironment. While more active microglial states were proximal to amyloid plaques, globally, microglia significantly shifted towards a, presumably, dysfunctional low MSC in the AD hippocampus, as confirmed in an independent cohort (n=26). This provides an in situ single cell framework for mapping human microglial states along a continuous, shifting existence that is differentially enriched between healthy brain regions and disease, reinforcing differential microglial functions overall.

    View details for DOI 10.21203/rs.3.rs-2987263/v1

    View details for PubMedID 37398389

    View details for PubMedCentralID PMC10312937

  • Distinguishing Renal Cell Carcinoma From Normal Kidney Tissue Using Mass Spectrometry Imaging Combined With Machine Learning. JCO precision oncology Shankar, V., Vijayalakshmi, K., Nolley, R., Sonn, G. A., Kao, C. S., Zhao, H., Wen, R., Eberlin, L. S., Tibshirani, R., Zare, R. N., Brooks, J. D. 2023; 7: e2200668

    Abstract

    Accurately distinguishing renal cell carcinoma (RCC) from normal kidney tissue is critical for identifying positive surgical margins (PSMs) during partial and radical nephrectomy, which remains the primary intervention for localized RCC. Techniques that detect PSM with higher accuracy and faster turnaround time than intraoperative frozen section (IFS) analysis can help decrease reoperation rates, relieve patient anxiety and costs, and potentially improve patient outcomes. Here, we extended our combined desorption electrospray ionization mass spectrometry imaging (DESI-MSI) and machine learning methodology to identify metabolite and lipid species from tissue surfaces that can distinguish normal tissues from clear cell RCC (ccRCC), papillary RCC (pRCC), and chromophobe RCC (chRCC) tissues. From 24 normal and 40 renal cancer (23 ccRCC, 13 pRCC, and 4 chRCC) tissues, we developed a multinomial lasso classifier that selects 281 total analytes from over 27,000 detected molecular species that distinguishes all histological subtypes of RCC from normal kidney tissues with 84.5% accuracy. On the basis of independent test data reflecting distinct patient populations, the classifier achieves 85.4% and 91.2% accuracy on a Stanford test set (20 normal and 28 RCC) and a Baylor-UT Austin test set (16 normal and 41 RCC), respectively. The majority of the model's selected features show consistent trends across data sets affirming its stable performance, where the suppression of arachidonic acid metabolism is identified as a shared molecular feature of ccRCC and pRCC. Together, these results indicate that signatures derived from DESI-MSI combined with machine learning may be used to rapidly determine surgical margin status with accuracies that meet or exceed those reported for IFS.

    View details for DOI 10.1200/PO.22.00668

    View details for PubMedID 37285559

  • Cross-Validation: What Does It Estimate and How Well Does It Do It? JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Bates, S., Hastie, T., Tibshirani, R. 2023
  • STABL Enables Reliable and Selective Biomarker Discovery in Predictive Modeling of High Dimensional Omics Data Verdonk, F., Hedou, J., Maric, I., Bellan, G., Einhaus, J., Gaudilliere, D., Ladant, F., Stelzer, I., Feyaerts, D., Tsai, A., Bonham, A., Angst, M., Aghaeepour, N., Stevenson, D., Tibshirani, R., Gaudilliere, B. LIPPINCOTT WILLIAMS & WILKINS. 2023: 814-821
  • Leading edge competition promotes context-dependent responses to receptor inputs to resolve directional dilemmas in neutrophil migration. Cell systems Hadjitheodorou, A., Bell, G. R., Ellett, F., Irimia, D., Tibshirani, R., Collins, S. R., Theriot, J. A. 2023

    Abstract

    Maintaining persistent migration in complex environments is critical for neutrophils to reach infection sites. Neutrophils avoid getting trapped, even when obstacles split their front into multiple leading edges. How they re-establish polarity to move productively while incorporating receptor inputs under such conditions remains unclear. Here, we challenge chemotaxing HL60 neutrophil-like cells with symmetric bifurcating microfluidic channels to probe cell-intrinsic processes during the resolution of competing fronts. Using supervised statistical learning, we demonstrate that cells commit to one leading edge late in the process, rather than amplifying structural asymmetries or early fluctuations. Using optogenetic tools, we show that receptor inputs only bias the decision similarly late, once mechanical stretching begins to weaken each front. Finally, a retracting edge commits to retraction, with ROCK limiting sensitivity to receptor inputs until the retraction completes. Collectively, our results suggest that cell edges locally adopt highly stable protrusion/retraction programs that are modulated by mechanical feedback.

    View details for DOI 10.1016/j.cels.2023.02.001

    View details for PubMedID 36827986

  • A tissue atlas of ulcerative colitis revealing evidence of sex-dependent differences in disease-driving inflammatory cell types and resistance to TNF inhibitor therapy. Science advances Mayer, A. T., Holman, D. R., Sood, A., Tandon, U., Bhate, S. S., Bodapati, S., Barlow, G. L., Chang, J., Black, S., Crenshaw, E. C., Koron, A. N., Streett, S. E., Gambhir, S. S., Sandborn, W. J., Boland, B. S., Hastie, T., Tibshirani, R., Chang, J. T., Nolan, G. P., Schürch, C. M., Rogalla, S. 2023; 9 (3): eadd1166

    Abstract

    Although literature suggests that resistance to TNF inhibitor (TNFi) therapy in patients with ulcerative colitis (UC) is partially linked to immune cell populations in the inflamed region, there is still substantial uncertainty underlying the relevant spatial context. Here, we used the highly multiplexed immunofluorescence imaging technology CODEX to create a publicly browsable tissue atlas of inflammation in 42 tissue regions from 29 patients with UC and 5 healthy individuals. We analyzed 52 biomarkers on 1,710,973 spatially resolved single cells to determine cell types, cell-cell contacts, and cellular neighborhoods. We observed that cellular functional states are associated with cellular neighborhoods. We further observed that a subset of inflammatory cell types and cellular neighborhoods are present in patients with UC with TNFi treatment, potentially indicating resistant niches. Last, we explored applying convolutional neural networks (CNNs) to our dataset with respect to patient clinical variables. We note concerns and offer guidelines for reporting CNN-based predictions in similar datasets.

    View details for DOI 10.1126/sciadv.add1166

    View details for PubMedID 36662860

  • Feature-weighted elastic net: using "features of features" for better prediction. Statistica Sinica Tay, J. K., Aghaeepour, N., Hastie, T., Tibshirani, R. 2023; 33 (1): 259-279

    Abstract

    In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penalties on the feature coefficients in the elastic net penalty. In our simulations, fwelnet outperforms the lasso in terms of test mean squared error and usually gives an improvement in true positive rate or false positive rate for feature selection. We also apply this method to early prediction of preeclampsia, where fwelnet outperforms the lasso in terms of 10-fold cross-validated area under the curve (0.86 vs. 0.80). We also provide a connection between fwelnet and the group lasso and suggest how fwelnet might be used for multi-task learning.

    View details for DOI 10.5705/ss.202020.0226

    View details for PubMedID 37102071

  • Improved Relapse Prediction in Pediatric Acute Myeloid Leukemia By Deconvolving Lineage-Specific and Cancer-Specific Features in Single-Cell Data Keyes, T., Jager, A., Krueger, M., Plevritis, S., Tibshirani, R., Aplenc, R., Nolan, G. P., Redell, M. S., Davis, K. L. AMER SOC HEMATOLOGY. 2022: 6288-6289
  • CD8+ T cell differentiation status correlates with the feasibility of sustained unresponsiveness following oral immunotherapy. Nature communications Kaushik, A., Dunham, D., Han, X., Do, E., Andorf, S., Gupta, S., Fernandes, A., Kost, L. E., Sindher, S. B., Yu, W., Tsai, M., Tibshirani, R., Boyd, S. D., Desai, M., Maecker, H. T., Galli, S. J., Chinthrajah, R. S., DeKruyff, R. H., Manohar, M., Nadeau, K. C. 2022; 13 (1): 6646

    Abstract

    While food allergy oral immunotherapy (OIT) can provide safe and effective desensitization (DS), the immune mechanisms underlying development of sustained unresponsiveness (SU) following a period of avoidance are largely unknown. Here, we compare high dimensional phenotypes of innate and adaptive immune cell subsets of participants in a previously reported, phase 2 randomized, controlled, peanut OIT trial who achieved SU vs. DS (no vs. with allergic reactions upon food challenge after a withdrawal period; n=21 vs. 30 respectively among total 120 intent-to-treat participants). Lower frequencies of naive CD8+ T cells and terminally differentiated CD57+CD8+ T cell subsets at baseline (pre-OIT) are associated with SU. Frequency of naive CD8+ T cells shows a significant positive correlation with peanut-specific and Ara h 2-specific IgE levels at baseline. Higher frequencies of IL-4+ and IFNgamma+ CD4+ T cells post-OIT are negatively correlated with SU. Our findings provide evidence that an immune signature consisting of certain CD8+ T cell subset frequencies is potentially predictive of SU following OIT.

    View details for DOI 10.1038/s41467-022-34222-8

    View details for PubMedID 36333296

  • Cooperative learning for multiview analysis. Proceedings of the National Academy of Sciences of the United States of America Ding, D. Y., Li, S., Narasimhan, B., Tibshirani, R. 2022; 119 (38): e2202113119

    Abstract

    We propose a method for supervised learning with multiple sets of features ("views"). The multiview problem is especially important in biology and medicine, where "-omics" data, such as genomics, proteomics, and radiomics, are measured on a common set of samples. "Cooperative learning" combines the usual squared-error loss of predictions with an "agreement" penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g., lasso, random forests, boosting, or neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty, yielding feature sparsity. The method can be especially powerful when the different data views share some underlying relationship in their signals that can be exploited to boost the signals. We show that cooperative learning achieves higher predictive accuracy on simulated data and real multiomics examples of labor-onset prediction. By leveraging aligned signals and allowing flexible fitting mechanisms for different modalities, cooperative learning offers a powerful approach to multiomics data fusion.

    View details for DOI 10.1073/pnas.2202113119

    View details for PubMedID 36095183

  • Post-infusion CAR T-Reg cells identify patients resistant to CD19-CAR therapy NATURE MEDICINE Good, Z., Spiegel, J. Y., Sahaf, B., Malipatlolla, M. B., Ehlinger, Z. J., Kurra, S., Desai, M. H., Reynolds, W. D., Lin, A., Vandris, P., Wu, F., Prabhu, S., Hamilton, M. P., Tamaresis, J. S., Hanson, P. J., Patel, S., Feldman, S. A., Frank, M. J., Baird, J. H., Muffly, L., Claire, G. K., Craig, J., Kong, K. A., Wagh, D., Coller, J., Bendall, S. C., Tibshirani, R. J., Plevritis, S. K., Miklos, D. B., Mackall, C. L. 2022

    Abstract

    Approximately 60% of patients with large B cell lymphoma treated with chimeric antigen receptor (CAR) T cell therapies targeting CD19 experience disease progression, and neurotoxicity remains a challenge. Biomarkers associated with resistance and toxicity are limited. In this study, single-cell proteomic profiling of circulating CAR T cells in 32 patients treated with CD19-CAR identified that CD4+Helios+ CAR T cells on day 7 after infusion are associated with progressive disease and less severe neurotoxicity. Deep profiling demonstrated that this population is non-clonal and manifests hallmark features of T regulatory (TReg) cells. Validation cohort analysis upheld the link between higher CAR TReg cells with clinical progression and less severe neurotoxicity. A model combining expansion of this subset with lactate dehydrogenase levels, as a surrogate for tumor burden, was superior for predicting durable clinical response compared to models relying on each feature alone. These data credential CAR TReg cell expansion as a novel biomarker of response and toxicity after CAR T cell therapy and raise the prospect that this subset may regulate CAR T cell responses in humans.

    View details for DOI 10.1038/s41591-022-01960-7

    View details for Web of Science ID 000852940800007

    View details for PubMedID 36097223

  • LARGE-SCALE MULTIVARIATE SPARSE REGRESSION WITH APPLICATIONS TO UK BIOBANK. The annals of applied statistics Qian, J., Tanigawa, Y., Li, R., Tibshirani, R., Rivas, M. A., Hastie, T. 2022; 16 (3): 1891-1918

    Abstract

    In high-dimensional regression problems, often a relatively small subset of the features are relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that underlie the data. Our proposal is motivated by the UK Biobank population-based cohort study, where we are faced with large-scale, ultrahigh-dimensional features, and have access to a large number of outcomes (phenotypes): lifestyle measures, biomarkers, and disease outcomes. We are hence led to fit sparse reduced-rank regression models, using computational strategies that allow us to scale to problems of this size. We use a scheme that alternates between solving the sparse regression problem and solving the reduced rank decomposition. For the sparse regression component we propose a scalable iterative algorithm based on adaptive screening that leverages the sparsity assumption and enables us to focus on solving much smaller subproblems. The full solution is reconstructed and tested via an optimality condition to make sure it is a valid solution for the original problem. We further extend the method to cope with practical issues, such as the inclusion of confounding variables and imputation of missing values among the phenotypes. Experiments on both synthetic data and the UK Biobank data demonstrate the effectiveness of the method and the algorithm. We present multiSnpnet package, available at http://github.com/junyangq/multiSnpnet that works on top of PLINK2 files, which we anticipate to be a valuable tool for generating polygenic risk scores from human genetic studies.

    View details for DOI 10.1214/21-aoas1575

    View details for PubMedID 36091495

    View details for PubMedCentralID PMC9454085

  • Prediction and outlier detection in classification problems. Journal of the Royal Statistical Society. Series B, Statistical methodology Guan, L., Tibshirani, R. 2022; 84 (2): 524-546

    Abstract

    We consider the multi-class classification problem when the training data and the out-of-sample test data may have different distributions and propose a method called BCOPS (balanced and conformal optimized prediction sets). BCOPS constructs a prediction set C(x) as a subset of class labels, possibly empty. It tries to optimize the out-of-sample performance, aiming to include the correct class and to detect outliers x as often as possible. BCOPS returns no prediction (corresponding to C(x) equal to the empty set) if it infers x to be an outlier. The proposed method combines supervised learning algorithms with conformal prediction to minimize a misclassification loss averaged over the out-of-sample distribution. The constructed prediction sets have a finite sample coverage guarantee without distributional assumptions. We also propose a method to estimate the outlier detection rate of a given procedure. We prove asymptotic consistency and optimality of our proposals under suitable assumptions and illustrate our methods on real data examples.

    View details for DOI 10.1111/rssb.12443

    View details for PubMedID 35910400

    View details for PubMedCentralID PMC9305480

  • Significant sparse polygenic risk scores across 813 traits in UK Biobank. PLoS genetics Tanigawa, Y., Qian, J., Venkataraman, G., Justesen, J. M., Li, R., Tibshirani, R., Hastie, T., Rivas, M. A. 2022; 18 (3): e1010105

    Abstract

    We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant (p < 2.5 × 10⁻⁵) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotyping arrays, and the principal component loadings of genotypes. We report a significant correlation between the number of genetic variants selected in the sparse PRS model and the incremental predictive performance (Spearman's ρ = 0.61, p = 2.2 × 10⁻⁵⁹ for quantitative traits, ρ = 0.21, p = 9.6 × 10⁻⁴ for binary traits). The sparse PRS model trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank. We provide the PRS model weights on the Global Biobank Engine (https://biobankengine.stanford.edu/prs).

    View details for DOI 10.1371/journal.pgen.1010105

    View details for PubMedID 35324888

  • Identification of end-stage renal disease metabolic signatures from human perspiration Natural Sciences Shankar, V., Michael, B., Celli, A., Zhou, Z., Ashland, M., Tibshirani, R., Snyder, M., Zare, R. 2022

    View details for DOI 10.1002/ntls.20220048

  • Can auxiliary indicators improve COVID-19 forecasting and hotspot prediction? Proceedings of the National Academy of Sciences of the United States of America McDonald, D. J., Bien, J., Green, A., Hu, A. J., DeFries, N., Hyun, S., Oliveira, N. L., Sharpnack, J., Tang, J., Tibshirani, R., Ventura, V., Wasserman, L., Tibshirani, R. J. 2021; 118 (51)

    Abstract

    Short-term forecasts of traditional streams from public health reporting (such as cases, hospitalizations, and deaths) are a key input to public health decision-making during a pandemic. Since early 2020, our research group has worked with data partners to collect, curate, and make publicly available numerous real-time COVID-19 indicators, providing multiple views of pandemic activity in the United States. This paper studies the utility of five such indicators (derived from deidentified medical insurance claims, self-reported symptoms from online surveys, and COVID-related Google search activity) from a forecasting perspective. For each indicator, we ask whether its inclusion in an autoregressive (AR) model leads to improved predictive accuracy relative to the same model excluding it. Such an AR model, without external features, is already competitive with many top COVID-19 forecasting models in use today. Our analysis reveals that 1) inclusion of each of these five indicators improves on the overall predictive accuracy of the AR model; 2) predictive gains are in general most pronounced during times in which COVID cases are trending in "flat" or "down" directions; and 3) one indicator, based on Google searches, seems to be particularly helpful during "up" trends.

    View details for DOI 10.1073/pnas.2111453118

    View details for PubMedID 34903655
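    The comparison described in the abstract above (an AR model with and without an auxiliary indicator) can be sketched as follows. This is a minimal illustration on simulated data, not the authors' pipeline: the abstract does not state the fitting method, so ordinary least squares is used as a stand-in, and the names `simulate`, `lagged_design`, and `holdout_mae`, the lag count, and the simulation coefficients are all assumptions made for the example.

```python
import numpy as np

def simulate(n, seed=0):
    """Simulate a target y that is partly driven by a leading indicator x (toy data)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    y = np.zeros(n)
    for t in range(1, n):
        x[t] = 0.7 * x[t - 1] + rng.normal()
        y[t] = 0.6 * y[t - 1] + 0.8 * x[t - 1] + 0.5 * rng.normal()
    return x, y

def lagged_design(series_list, lags, horizon):
    """Rows are [1, s(t), s(t-1), ..., s(t-lags+1)] for each series s; also return the times t."""
    n = len(series_list[0])
    t_idx = np.arange(lags - 1, n - horizon)
    cols = [np.ones(len(t_idx))]
    for s in series_list:
        for k in range(lags):
            cols.append(s[t_idx - k])
    return np.column_stack(cols), t_idx

def holdout_mae(y, features, lags=3, horizon=1, n_test=100):
    """Fit by least squares on the early rows, report mean absolute error on the last n_test rows."""
    X, t_idx = lagged_design(features, lags, horizon)
    target = y[t_idx + horizon]
    coef, *_ = np.linalg.lstsq(X[:-n_test], target[:-n_test], rcond=None)
    return float(np.mean(np.abs(X[-n_test:] @ coef - target[-n_test:])))

x, y = simulate(600)
mae_ar = holdout_mae(y, [y])        # AR model: lags of y only
mae_ar_x = holdout_mae(y, [y, x])   # same model plus lags of the indicator
print(f"AR only: {mae_ar:.3f}   AR + indicator: {mae_ar_x:.3f}")
```

    Holding out the most recent rows, rather than a random subset, keeps the evaluation forward-looking, which matches the forecasting setting the abstract describes.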

  • Author Correction: Genetics of 35 blood and urine biomarkers in the UK Biobank. Nature genetics Sinnott-Armstrong, N., Tanigawa, Y., Amar, D., Mars, N., Benner, C., Aguirre, M., Venkataraman, G. R., Wainberg, M., Ollila, H. M., Kiiskinen, T., Havulinna, A. S., Pirruccello, J. P., Qian, J., Shcherbina, A., FinnGen, Rodriguez, F., Assimes, T. L., Agarwala, V., Tibshirani, R., Hastie, T., Ripatti, S., Pritchard, J. K., Daly, M. J., Rivas, M. A. 2021

    View details for DOI 10.1038/s41588-021-00956-2

    View details for PubMedID 34608296

  • Rapid Screening of COVID-19 Directly from Clinical Nasopharyngeal Swabs Using the MasSpec Pen. Analytical chemistry Garza, K. Y., Silva, A. A., Rosa, J. R., Keating, M. F., Povilaitis, S. C., Spradlin, M., Sanches, P. H., Varao Moura, A., Marrero Gutierrez, J., Lin, J. Q., Zhang, J., DeHoog, R. J., Bensussan, A., Badal, S., Cardoso de Oliveira, D., Dias Garcia, P. H., Dias de Oliveira Negrini, L., Antonio, M. A., Canevari, T. C., Eberlin, M. N., Tibshirani, R., Eberlin, L. S., Porcari, A. M. 2021

    Abstract

    The outbreak of COVID-19 has created an unprecedented global crisis. While the polymerase chain reaction (PCR) is the gold standard method for detecting active SARS-CoV-2 infection, alternative high-throughput diagnostic tests are of significant value to meet universal testing demands. Here, we describe a new design of the MasSpec Pen technology integrated to electrospray ionization (ESI) for direct analysis of clinical swabs and investigate its use for COVID-19 screening. The redesigned MasSpec Pen system incorporates a disposable sampling device refined for uniform and efficient analysis of swab tips via liquid extraction directly coupled to an ESI source. Using this system, we analyzed nasopharyngeal swabs from 244 individuals including symptomatic COVID-19 positive, symptomatic negative, and asymptomatic negative individuals, enabling rapid detection of rich lipid profiles. Two statistical classifiers were generated based on the lipid information acquired. Classifier 1 was built to distinguish symptomatic PCR-positive from asymptomatic PCR-negative individuals, yielding a cross-validation accuracy of 83.5%, sensitivity of 76.6%, and specificity of 86.6%, and validation set accuracy of 89.6%, sensitivity of 100%, and specificity of 85.3%. Classifier 2 was built to distinguish symptomatic PCR-positive patients from negative individuals including symptomatic PCR-negative patients with moderate to severe symptoms and asymptomatic individuals, yielding a cross-validation accuracy of 78.4%, specificity of 77.21%, and sensitivity of 81.8%. Collectively, this study suggests that the lipid profiles detected directly from nasopharyngeal swabs using MasSpec Pen-ESI mass spectrometry (MS) allow fast (under a minute) screening of the COVID-19 disease using minimal operating steps and no specialized reagents, thus representing a promising alternative high-throughput method for screening of COVID-19.

    View details for DOI 10.1021/acs.analchem.1c01937

    View details for PubMedID 34432430

  • Testing for a Sweet Spot in Randomized Trials. Medical decision making : an international journal of the Society for Medical Decision Making Redelmeier, D. A., Thiruchelvam, D., Tibshirani, R. J. 2021: 272989X211025525

    Abstract

    INTRODUCTION: Randomized trials recruit diverse patients, including some individuals who may be unresponsive to the treatment. Here we follow up on prior conceptual advances and introduce a specific method that does not rely on stratification analysis and that tests whether patients in the intermediate range of disease severity experience more relative benefit than patients at the extremes of disease severity (sweet spot). METHODS: We contrast linear models to sigmoidal models when describing associations between disease severity and accumulating treatment benefit. The Gompertz curve is highlighted as a specific sigmoidal curve along with the Akaike information criterion (AIC) as a measure of goodness of fit. This approach is then applied to a matched analysis of a published landmark randomized trial evaluating whether implantable defibrillators reduce overall mortality in cardiac patients (n = 2,521). RESULTS: The linear model suggested a significant survival advantage across the spectrum of increasing disease severity (beta = 0.0847, P < 0.001, AIC = 2,491). Similarly, the sigmoidal model suggested a significant survival advantage across the spectrum of disease severity (alpha = 93, beta = 4.939, gamma = 0.00316, P < 0.001 for all, AIC = 1,660). The discrepancy between the 2 models indicated worse goodness of fit with a linear model compared to a sigmoidal model (AIC: 2,491 v. 1,660, P < 0.001), thereby suggesting a sweet spot in the midrange of disease severity. Model cross-validation using computational statistics also confirmed the superior goodness of fit of the sigmoidal curve with a concentration of survival benefits for patients in the midrange of disease severity. CONCLUSION: Systematic methods are available beyond simple stratification for identifying a sweet spot according to disease severity. The approach can assess whether some patients experience more relative benefit than other patients in a randomized trial.

    View details for DOI 10.1177/0272989X211025525

    View details for PubMedID 34378458
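    For orientation, the two model shapes and the fit criterion named in the abstract above can be written out as below. The abstract gives only parameter names and fitted values, so this uses the standard textbook Gompertz parameterization; mapping alpha, beta, and gamma to it in this way is an assumption, as are the symbols $x$ (disease severity), $a$, and $b$.

```latex
\begin{align*}
  f_{\mathrm{lin}}(x) &= a + b\,x
    && \text{linear: benefit changes at a constant rate with severity} \\
  f_{\mathrm{Gom}}(x) &= \alpha \exp\!\bigl(-\beta\, e^{-\gamma x}\bigr)
    && \text{Gompertz: sigmoidal, with upper asymptote } \alpha \\
  \mathrm{AIC} &= 2k - 2\ln \hat{L}
    && k \text{ parameters, } \hat{L} \text{ the maximized likelihood}
\end{align*}
```

    A lower AIC indicates a better trade-off between fit and complexity; the values reported in the abstract differ by 2,491 - 1,660 = 831 in favor of the sigmoidal model.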

  • Author Correction: An inflammatory aging clock (iAge) based on deep learning tracks multimorbidity, immunosenescence, frailty and cardiovascular aging. Nature aging Sayed, N., Huang, Y., Nguyen, K., Krejciova-Rajaniemi, Z., Grawe, A. P., Gao, T., Tibshirani, R., Hastie, T., Alpert, A., Cui, L., Kuznetsova, T., Rosenberg-Hasson, Y., Ostan, R., Monti, D., Lehallier, B., Shen-Orr, S. S., Maecker, H. T., Dekker, C. L., Wyss-Coray, T., Franceschi, C., Jojic, V., Haddad, F., Montoya, J. G., Wu, J. C., Davis, M. M., Furman, D. 2021; 1 (8): 748

    View details for DOI 10.1038/s43587-021-00102-x

    View details for PubMedID 37117770

  • Penalized regression for left-truncated and right-censored survival data. Statistics in medicine McGough, S. F., Incerti, D., Lyalina, S., Copping, R., Narasimhan, B., Tibshirani, R. 2021

    Abstract

    High-dimensional data are becoming increasingly common in the medical field as large volumes of patient information are collected and processed by high-throughput screening, electronic health records, and comprehensive genomic testing. Statistical models that attempt to study the effects of many predictors on survival typically implement feature selection or penalized methods to mitigate the undesirable consequences of overfitting. In some cases, survival data are also left-truncated, which can give rise to an immortal time bias, but penalized survival methods that adjust for left truncation are not commonly implemented. To address these challenges, we apply a penalized Cox proportional hazards model for left-truncated and right-censored survival data and assess implications of left truncation adjustment on bias and interpretation. We use simulation studies and a high-dimensional, real-world clinico-genomic database to highlight the pitfalls of failing to account for left truncation in survival modeling.

    View details for DOI 10.1002/sim.9136

    View details for PubMedID 34302373

  • The stanford prostate cancer calculator: Development and external validation of online nomograms incorporating PIRADS scores to predict clinically significant prostate cancer. Urologic oncology Wang, N. N., Zhou, S. R., Chen, L., Tibshirani, R., Fan, R. E., Ghanouni, P., Thong, A. E., To'o, K. J., Amirkhiz, K., Nix, J. W., Gordetsky, J. B., Sprenkle, P., Rais-Bahrami, S., Sonn, G. A. 2021

    Abstract

    BACKGROUND: While multiparametric MRI (mpMRI) has high sensitivity for detection of clinically significant prostate cancer (CSC), false positives and negatives remain common. Calculators that combine mpMRI with clinical variables can improve cancer risk assessment, while providing more accurate predictions for individual patients. We sought to create and externally validate nomograms incorporating Prostate Imaging Reporting and Data System (PIRADS) scores and clinical data to predict the presence of CSC in men of all biopsy backgrounds. METHODS: Data from 2125 men undergoing mpMRI and MR fusion biopsy from 2014 to 2018 at Stanford, Yale, and UAB were prospectively collected. Clinical data included age, race, PSA, biopsy status, PIRADS scores, and prostate volume. A nomogram predicting detection of CSC on targeted or systematic biopsy was created. RESULTS: Biopsy history, Prostate Specific Antigen (PSA) density, PIRADS score of 4 or 5, Caucasian race, and age were significant independent predictors. Our nomogram, the Stanford Prostate Cancer Calculator (SPCC), combined these factors in a logistic regression to provide stronger predictive accuracy than PSA density or PIRADS alone. Validation of the SPCC using data from Yale and UAB yielded robust AUC values. CONCLUSIONS: The SPCC combines pre-biopsy mpMRI with clinical data to more accurately predict the probability of CSC in men of all biopsy backgrounds. The SPCC demonstrates strong external generalizability with successful validation in two separate institutions. The calculator is available as a free web-based tool that can direct real-time clinical decision-making.

    View details for DOI 10.1016/j.urolonc.2021.06.004

    View details for PubMedID 34247909

  • Corrigendum to: Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank. Biostatistics (Oxford, England) Li, R., Chang, C., Justesen, J. M., Tanigawa, Y., Qian, J., Hastie, T., Rivas, M. A., Tibshirani, R. 2021

    View details for DOI 10.1093/biostatistics/kxab019

    View details for PubMedID 34269393

  • An inflammatory aging clock (iAge) based on deep learning tracks multimorbidity, immunosenescence, frailty and cardiovascular aging. Nature aging Sayed, N., Huang, Y., Nguyen, K., Krejciova-Rajaniemi, Z., Grawe, A. P., Gao, T., Tibshirani, R., Hastie, T., Alpert, A., Cui, L., Kuznetsova, T., Rosenberg-Hasson, Y., Ostan, R., Monti, D., Lehallier, B., Shen-Orr, S. S., Maecker, H. T., Dekker, C. L., Wyss-Coray, T., Franceschi, C., Jojic, V., Haddad, F., Montoya, J. G., Wu, J. C., Davis, M. M., Furman, D. 2021; 1: 598-615

    Abstract

    While many diseases of aging have been linked to the immunological system, immune metrics capable of identifying the most at-risk individuals are lacking. From the blood immunome of 1,001 individuals aged 8-96 years, we developed a deep-learning method based on patterns of systemic age-related inflammation. The resulting inflammatory clock of aging (iAge) tracked with multimorbidity, immunosenescence, frailty and cardiovascular aging, and is also associated with exceptional longevity in centenarians. The strongest contributor to iAge was the chemokine CXCL9, which was involved in cardiac aging, adverse cardiac remodeling and poor vascular function. Furthermore, aging endothelial cells in humans and mice show loss of function, cellular senescence and hallmark phenotypes of arterial stiffness, all of which are reversed by silencing CXCL9. In conclusion, we identify a key role of CXCL9 in age-related chronic inflammation and derive a metric for multimorbidity that can be utilized for the early detection of age-related clinical phenotypes.

    View details for DOI 10.1038/s43587-021-00082-y

    View details for PubMedID 34888528

  • Fast Numerical Optimization for Genome Sequencing Data in Population Biobanks. Bioinformatics (Oxford, England) Li, R., Chang, C., Tanigawa, Y., Narasimhan, B., Hastie, T., Tibshirani, R., Rivas, M. A. 2021

    Abstract

    MOTIVATION: Large-scale and high-dimensional genome sequencing data poses computational challenges. General purpose optimization tools are usually not optimal in terms of computational and memory performance for genetic data. RESULTS: We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals. These genetic variants are encoded by the values in the set {0, 1, 2, NA}. We take advantage of this fact and use two bits to represent each entry in a genetic matrix, which reduces memory requirement by a factor of 32 compared to a double precision floating point representation. Using this representation, we implemented an iteratively reweighted least squares algorithm to solve Lasso regressions on genetic matrices, which we name snpnet-2.0. When the dataset contains many rare variants, the predictors can be encoded in a sparse matrix. We utilize the sparsity in the predictor matrix to further reduce memory requirement and improve computational speed. Our sparse genetic matrix implementation uses both the compact 2-bit representation and a simplified version of compressed sparse block format so that matrix-vector multiplications can be effectively parallelized on multiple CPU cores. To demonstrate the effectiveness of this representation, we implement an accelerated proximal gradient method to solve group Lasso on these sparse genetic matrices. This solver is named sparse-snpnet, and will also be included as part of the snpnet R package. Our implementation is able to solve Lasso and group Lasso, linear, logistic and Cox regression problems on sparse genetic matrices that contain 1,000,000 variants and almost 100,000 individuals within 10 minutes and using less than 32 GB of memory. AVAILABILITY: https://github.com/rivas-lab/snpnet/tree/compact.

    View details for DOI 10.1093/bioinformatics/btab452

    View details for PubMedID 34146108
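    The factor of 32 quoted in the abstract above follows from storing each genotype in 2 bits instead of a 64-bit double. The sketch below illustrates that idea with NumPy; it is not the snpnet implementation. The bit layout, the choice of code 3 for NA, and the function names `pack_genotypes` and `unpack_genotypes` are assumptions made for the example.

```python
import numpy as np

NA = 3  # 2-bit code used here for a missing genotype (an assumption of this sketch)

def pack_genotypes(g):
    """Pack genotype codes in {0, 1, 2, NA} into bytes, four codes per byte."""
    g = np.asarray(g, dtype=np.uint8)
    if g.size and g.max() > 3:
        raise ValueError("codes must be in {0, 1, 2, 3}")
    pad = (-g.size) % 4  # pad with zeros so the length is a multiple of 4
    g = np.concatenate([g, np.zeros(pad, dtype=np.uint8)])
    quads = g.reshape(-1, 4)
    # Entry k of each group of four occupies bits 2k and 2k+1 of the byte.
    packed = quads[:, 0] | (quads[:, 1] << 2) | (quads[:, 2] << 4) | (quads[:, 3] << 6)
    return packed.astype(np.uint8)

def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from the packed bytes."""
    packed = np.asarray(packed, dtype=np.uint8)
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 3
    return codes.reshape(-1)[:n]

genotypes = np.array([0, 1, 2, NA, 2, 2, 0, 1], dtype=np.uint8)
packed = pack_genotypes(genotypes)
dense_bytes = genotypes.astype(np.float64).nbytes  # 8 bytes per entry
print(dense_bytes // packed.nbytes)                # memory ratio: dense / packed
```

    With a length that is a multiple of four, the dense double-precision array takes 8 bytes per entry and the packed array takes a quarter of a byte per entry, which gives the ratio of 32.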

  • Assessment of heterogeneous treatment effect estimation accuracy via matching. Statistics in medicine Gao, Z., Hastie, T., Tibshirani, R. 2021

    Abstract

    We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo-observations of the HTE based on matching. Our contributions are three-fold: first, we introduce a novel matching distance derived from proximity scores in random forests; second, we formulate the matching problem as an average minimum-cost flow problem and provide an efficient algorithm; third, we propose a match-then-split principle for the assessment with cross-validation. We demonstrate the efficacy of the assessment approach using simulations and a real dataset.

    View details for DOI 10.1002/sim.9010

    View details for PubMedID 33915600

  • Principal component-guided sparse regression CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE Tay, J. K., Friedman, J., Tibshirani, R. 2021

    View details for DOI 10.1002/cjs.11617

    View details for Web of Science ID 000640651700001

  • LassoNet: Neural Networks with Feature Sparsity. Proceedings of machine learning research Lemhadri, I., Ruan, F., Tibshirani, R. 2021; 130: 10-18

    Abstract

    Much work has been done recently to make neural networks more interpretable, and one approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or ℓ1-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However, the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach achieves feature sparsity by allowing a feature to participate in a hidden unit only if its linear representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection with the parameter learning directly. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. In experiments with real and simulated data, LassoNet significantly outperforms state-of-the-art methods for feature selection and regression. The LassoNet method uses projected proximal gradient descent, and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network.

    View details for PubMedID 36092461

    View details for PubMedCentralID PMC9453696
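    The "modified objective function with constraints" mentioned in the abstract above is not spelled out there. As best recalled from the LassoNet paper, it takes roughly the form below; the notation ($\theta$ for the linear skip-connection weights, $W^{(1)}_j$ for the first-layer weights leaving feature $j$, $\lambda$ and $M$ as tuning parameters, $d$ features) should be treated as an assumption and checked against the paper.

```latex
\begin{align*}
  \min_{\theta,\, W} \quad & L(\theta, W) + \lambda \,\lVert \theta \rVert_1 \\
  \text{subject to} \quad & \bigl\lVert W^{(1)}_j \bigr\rVert_\infty \le M \,\lvert \theta_j \rvert,
  \qquad j = 1, \dots, d
\end{align*}
```

    Under this constraint, $\theta_j = 0$ forces $W^{(1)}_j = 0$, so a feature whose linear weight is zeroed out by the penalty cannot enter any hidden unit, which is the sparsity mechanism the abstract describes.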

  • Survival Analysis on Rare Events Using Group-Regularized Multi-Response Cox Regression. Bioinformatics (Oxford, England) Li, R., Tanigawa, Y., Justesen, J. M., Taylor, J., Hastie, T., Tibshirani, R., Rivas, M. A. 2021

    Abstract

    MOTIVATION: The prediction performance of the Cox proportional hazards model suffers when there are only a few uncensored events in the training data. RESULTS: We propose a Sparse-Group regularized Cox regression method to improve the prediction performance of large-scale and high-dimensional survival data with few observed events. Our approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare event response. This scenario is common in the UK Biobank (Sudlow et al., 2015) dataset where records for a large number of common and less prevalent diseases of the same set of individuals are available. By analyzing these responses together, we hope to achieve higher prediction performance than when they are analyzed individually. To make this approach practical for large-scale data, we developed an accelerated proximal gradient optimization algorithm as well as a screening procedure inspired by Qian et al. (2020). AVAILABILITY: https://github.com/rivas-lab/multisnpnet-Cox. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btab095

    View details for PubMedID 33560296

  • Basophil activation tests identify a peanut OIT subgroup with improved safety and outcomes Chinthrajah, S., Cao, S., Tsai, M., Mukai, K., Tibshirani, R., Sindher, S., Nadeau, K., Galli, S. MOSBY-ELSEVIER. 2021: AB166
  • Genetics of 35 blood and urine biomarkers in the UK Biobank. Nature genetics Sinnott-Armstrong, N., Tanigawa, Y., Amar, D., Mars, N., Benner, C., Aguirre, M., Venkataraman, G. R., Wainberg, M., Ollila, H. M., Kiiskinen, T., Havulinna, A. S., Pirruccello, J. P., Qian, J., Shcherbina, A., FinnGen, Rodriguez, F., Assimes, T. L., Agarwala, V., Tibshirani, R., Hastie, T., Ripatti, S., Pritchard, J. K., Daly, M. J., Rivas, M. A. 2021

    Abstract

    Clinical laboratory tests are a critical component of the continuum of care. We evaluate the genetic basis of 35 blood and urine laboratory measurements in the UK Biobank (n = 363,228 individuals). We identify 1,857 loci associated with at least one trait, containing 3,374 fine-mapped associations and additional sets of large-effect (> 0.1 s.d.) protein-altering, human leukocyte antigen (HLA) and copy number variant (CNV) associations. Through Mendelian randomization (MR) analysis, we discover 51 causal relationships, including previously known agonistic effects of urate on gout and cystatin C on stroke. Finally, we develop polygenic risk scores (PRSs) for each biomarker and build 'multi-PRS' models for diseases using 35 PRSs simultaneously, which improved chronic kidney disease, type 2 diabetes, gout and alcoholic cirrhosis genetic risk stratification in an independent dataset (FinnGen; n = 135,500) relative to single-disease PRSs. Together, our results delineate the genetic basis of biomarkers and their causal influences on diseases and improve genetic risk stratification for common diseases.

    View details for DOI 10.1038/s41588-020-00757-z

    View details for PubMedID 33462484

  • An open repository of real-time COVID-19 indicators. Proceedings of the National Academy of Sciences of the United States of America Reinhart, A., Brooks, L., Jahja, M., Rumack, A., Tang, J., Agrawal, S., Al Saeed, W., Arnold, T., Basu, A., Bien, J., Cabrera, Á. A., Chin, A., Chua, E. J., Clark, B., Colquhoun, S., DeFries, N., Farrow, D. C., Forlizzi, J., Grabman, J., Gratzl, S., Green, A., Haff, G., Han, R., Harwood, K., Hu, A. J., Hyde, R., Hyun, S., Joshi, A., Kim, J., Kuznetsov, A., La Motte-Kerr, W., Lee, Y. J., Lee, K., Lipton, Z. C., Liu, M. X., Mackey, L., Mazaitis, K., McDonald, D. J., McGuinness, P., Narasimhan, B., O'Brien, M. P., Oliveira, N. L., Patil, P., Perer, A., Politsch, C. A., Rajanala, S., Rucker, D., Scott, C., Shah, N. H., Shankar, V., Sharpnack, J., Shemetov, D., Simon, N., Smith, B. Y., Srivastava, V., Tan, S., Tibshirani, R., Tuzhilina, E., Van Nortwick, A. K., Ventura, V., Wasserman, L., Weaver, B., Weiss, J. C., Whitman, S., Williams, K., Rosenfeld, R., Tibshirani, R. J. 2021; 118 (51)

    Abstract

    The COVID-19 pandemic presented enormous data challenges in the United States. Policy makers, epidemiological modelers, and health researchers all require up-to-date data on the pandemic and relevant public behavior, ideally at fine spatial and temporal resolution. The COVIDcast API is our attempt to fill this need: Operational since April 2020, it provides open access to both traditional public health surveillance signals (cases, deaths, and hospitalizations) and many auxiliary indicators of COVID-19 activity, such as signals extracted from deidentified medical claims data, massive online surveys, cell phone mobility data, and internet search trends. These are available at a fine geographic resolution (mostly at the county level) and are updated daily. The COVIDcast API also tracks all revisions to historical data, allowing modelers to account for the frequent revisions and backfill that are common for many public health data sources. All of the data are available in a common format through the API and accompanying R and Python software packages. This paper describes the data sources and signals, and provides examples demonstrating that the auxiliary signals in the COVIDcast API present information relevant to tracking COVID activity, augmenting traditional public health reporting and empowering research and decision-making.

    View details for DOI 10.1073/pnas.2111452118

    View details for PubMedID 34903654

  • MassExplorer: a computational tool for analyzing desorption electrospray ionization mass spectrometry data Bioinformatics (Oxford, England) Shankar, V., Tibshirani, R., Zare, R. N. 2021

    Abstract

    High-throughput gene expression can be used to address a wide range of fundamental biological problems, but datasets of an appropriate size are often unavailable. Moreover, existing transcriptomics simulators have been criticised because they fail to emulate key properties of gene expression data. In this paper, we develop a method based on a conditional generative adversarial network to generate realistic transcriptomics data for E. coli and humans. We assess the performance of our approach across several tissues and cancer types. We show that our model preserves several gene expression properties significantly better than widely used simulators such as SynTReN or GeneNetWeaver. The synthetic data preserves tissue and cancer-specific properties of transcriptomics data. Moreover, it exhibits real gene clusters and ontologies both at local and global scales, suggesting that the model learns to approximate the gene expression manifold in a biologically meaningful way. Code is available at: https://github.com/rvinas/adversarial-gene-expression. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btab282

    View details for PubMedID 34009252

  • De novo mutational signature discovery in tumor genomes using SparseSignatures. PLoS computational biology Lal, A., Liu, K., Tibshirani, R., Sidow, A., Ramazzotti, D. 2021; 17 (6): e1009119

    Abstract

    Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or "mutational signatures". Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates a user-specified background signature, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using a variety of standard metrics. We then apply SparseSignatures to whole genome sequences of pancreatic and breast tumors, discovering well-differentiated signatures that are linked to known mutagenic mechanisms and are strongly associated with patient clinical features.

    View details for DOI 10.1371/journal.pcbi.1009119

    View details for PubMedID 34181655

  • LassoNet: Neural Networks with Feature Sparsity Lemhadri, I., Ruan, F., Tibshirani, R., Banerjee, A., Fukumizu, K. MICROTOME PUBLISHING. 2021: 10-+
  • LassoNet: A Neural Network with Feature Sparsity JOURNAL OF MACHINE LEARNING RESEARCH Lemhadri, I., Ruan, F., Abraham, L., Tibshirani, R. 2021; 22
  • Discussion of "Prediction, Estimation, and Attribution" by Bradley Efron INTERNATIONAL STATISTICAL REVIEW Friedman, J., Hastie, T., Tibshirani, R. 2020; 88: S73–S74

    View details for DOI 10.1111/insr.12414

    View details for Web of Science ID 000603161400008

  • Reluctant Generalised Additive Modelling. International statistical review = Revue internationale de statistique Tay, J. K., Tibshirani, R. 2020; 88 (Suppl 1): S205-S224

    Abstract

    Sparse generalised additive models (GAMs) are an extension of sparse generalised linear models that allow a model's prediction to vary non-linearly with an input variable. This enables the data analyst to build more accurate models, especially when the linearity assumption is known to be a poor approximation of reality. Motivated by reluctant interaction modelling, we propose a multi-stage algorithm, called reluctant generalised additive modelling (RGAM), that can fit sparse GAMs at scale. It is guided by the principle that, if all else is equal, one should prefer a linear feature over a non-linear feature. Unlike existing methods for sparse GAMs, RGAM can be extended easily to binary, count and survival data. We demonstrate the method's effectiveness on real and simulated examples.

    View details for DOI 10.1111/insr.12429

    View details for PubMedID 36062079

    View details for PubMedCentralID PMC9435322

  • Reluctant Generalised Additive Modelling INTERNATIONAL STATISTICAL REVIEW Tay, J., Tibshirani, R. 2020

    View details for DOI 10.1111/insr.12429

    View details for Web of Science ID 000591285600001

  • Metabolic Dynamics and Prediction of Gestational Age and Time to Delivery in Pregnant Women OBSTETRICAL & GYNECOLOGICAL SURVEY Liang, L., Rasmussen, M., Piening, B., Shen, X., Chen, S., Rost, H., Snyder, J. K., Tibshirani, R., Skotte, L., Lee, N. Y., Contrepois, K., Feenstra, B., Zackriah, H., Snyder, M., Melbye, M. 2020; 75 (11): 649–51
  • Rejoinder: Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on Extensive Comparisons STATISTICAL SCIENCE Hastie, T., Tibshirani, R., Tibshirani, R. J. 2020; 35 (4): 625–26
  • Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on Extensive Comparisons STATISTICAL SCIENCE Hastie, T., Tibshirani, R., Tibshirani, R. J. 2020; 35 (4): 579–92

    View details for DOI 10.1214/19-STS733

    View details for Web of Science ID 000591728200002

  • Integration of mechanistic immunological knowledge into a machine learning pipeline improves predictions. Nature machine intelligence Culos, A., Tsai, A. S., Stanley, N., Becker, M., Ghaemi, M. S., McIlwain, D. R., Fallahzadeh, R., Tanada, A., Nassar, H., Espinosa, C., Xenochristou, M., Ganio, E., Peterson, L., Han, X., Stelzer, I. A., Ando, K., Gaudilliere, D., Phongpreecha, T., Marić, I., Chang, A. L., Shaw, G. M., Stevenson, D. K., Bendall, S., Davis, K. L., Fantl, W., Nolan, G. P., Hastie, T., Tibshirani, R., Angst, M. S., Gaudilliere, B., Aghaeepour, N. 2020; 2 (10): 619-628

    Abstract

    The dense network of interconnected cellular signalling responses that are quantifiable in peripheral immune cells provides a wealth of actionable immunological insights. Although high-throughput single-cell profiling techniques, including polychromatic flow and mass cytometry, have matured to a point that enables detailed immune profiling of patients in numerous clinical settings, the limited cohort size and high dimensionality of data increase the possibility of false-positive discoveries and model overfitting. We introduce a generalizable machine learning platform, the immunological Elastic-Net (iEN), which incorporates immunological knowledge directly into the predictive models. Importantly, the algorithm maintains the exploratory nature of the high-dimensional dataset, allowing for the inclusion of immune features with strong predictive capabilities even if not consistent with prior knowledge. In three independent studies our method demonstrates improved predictions for clinically relevant outcomes from mass cytometry data generated from whole blood, as well as a large simulated dataset. The iEN is available under an open-source licence.

    View details for DOI 10.1038/s42256-020-00232-8

    View details for PubMedID 33294774

    View details for PubMedCentralID PMC7720904

  • Transparency and reproducibility in artificial intelligence. Nature Haibe-Kains, B., Adam, G. A., Hosny, A., Khodakarami, F., Massive Analysis Quality Control (MAQC) Society Board of Directors, Waldron, L., Wang, B., McIntosh, C., Goldenberg, A., Kundaje, A., Greene, C. S., Broderick, T., Hoffman, M. M., Leek, J. T., Korthauer, K., Huber, W., Brazma, A., Pineau, J., Tibshirani, R., Hastie, T., Ioannidis, J. P., Quackenbush, J., Aerts, H. J., Shraddha, T., Kusko, R., Sansone, S., Tong, W., Wolfinger, R. D., Mason, C. E., Jones, W., Dopazo, J., Furlanello, C. 2020; 586 (7829): E14–E16

    View details for DOI 10.1038/s41586-020-2766-y

    View details for PubMedID 33057217

  • A Pliable Lasso. Journal of computational and graphical statistics : a joint publication of American Statistical Association, Institute of Mathematical Statistics, Interface Foundation of North America Tibshirani, R., Friedman, J. 2020; 29 (1): 215-225

    Abstract

    We propose a generalization of the lasso that allows the model coefficients to vary as a function of a general set of some prespecified modifying variables. These modifiers might be variables such as gender, age, or time. The paradigm is quite general, with each lasso coefficient modified by a sparse linear function of the modifying variables Z. The model is estimated in a hierarchical fashion to control the degrees of freedom and avoid overfitting. The modifying variables may be observed, observed only in the training set, or unobserved overall. There are connections of our proposal to varying coefficient models and high-dimensional interaction models. We present a computationally efficient algorithm for its optimization, with exact screening rules to facilitate application to large numbers of predictors. The method is illustrated on a number of different simulated and real examples. Supplementary materials for this article are available online.

    View details for DOI 10.1080/10618600.2019.1648271

    View details for PubMedID 36340327

    View details for PubMedCentralID PMC9631466
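
    As a reading aid, the model described in the abstract above can be written out in one plausible form. The notation is an assumption chosen for illustration, not quoted from the paper: $X_j$ is the $j$-th predictor column, $Z$ the matrix of modifying variables, $\circ$ an elementwise product, and $\theta_j$ the sparse vector that modifies the coefficient $\beta_j$.

```latex
\hat{y} \;=\; \beta_0 \mathbf{1} \;+\; Z\theta_0 \;+\; \sum_{j=1}^{p} X_j \circ \left( \beta_j \mathbf{1} + Z\theta_j \right)
```

    When every $\theta_j$ is zero, this reduces to an ordinary linear model with lasso coefficients $\beta_j$, which matches the abstract's description of the method as a generalization of the lasso.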

  • Post model-fitting exploration via a "Next-Door" analysis. The Canadian journal of statistics = Revue canadienne de statistique Guan, L., Tibshirani, R. 2020; 48 (3): 447-470

    Abstract

    We propose a simple method for evaluating the model that has been chosen by an adaptive regression procedure, our main focus being the lasso. This procedure deletes each chosen predictor and refits the lasso to get a set of models that are "close" to the chosen "base model," and compares the error rate of the base model with those of nearby models. If the deletion of a predictor leads to significant deterioration in the model's predictive power, the predictor is called indispensable; otherwise, the nearby model is called acceptable and can serve as a good alternative to the base model. This provides both an assessment of the predictive contribution of each variable and a set of alternative models that may be used in place of the chosen model. We call this procedure "Next-Door analysis" since it examines models "next" to the base model. It can be applied to supervised learning problems with ℓ1 penalization and stepwise procedures. We have implemented it in the R language as a library to accompany the well-known glmnet library.

    View details for DOI 10.1002/cjs.11542

    View details for PubMedID 36092475

    View details for PubMedCentralID PMC9454156

  • SARS-CoV-2 Antibody Responses Correlate with Resolution of RNAemia But Are Short-Lived in Patients with Mild Illness. medRxiv : the preprint server for health sciences Röltgen, K., Wirz, O. F., Stevens, B. A., Powell, A. E., Hogan, C. A., Najeeb, J., Hunter, M., Sahoo, M. K., Huang, C., Yamamoto, F., Manalac, J., Otrelo-Cardoso, A. R., Pham, T. D., Rustagi, A., Rogers, A. J., Shah, N. H., Blish, C. A., Cochran, J. R., Nadeau, K. C., Jardetzky, T. S., Zehnder, J. L., Wang, T. T., Kim, P. S., Gombar, S., Tibshirani, R., Pinsky, B. A., Boyd, S. D. 2020

    Abstract

    SARS-CoV-2-specific antibodies, particularly those preventing viral spike receptor binding domain (RBD) interaction with host angiotensin-converting enzyme 2 (ACE2) receptor, could offer protective immunity, and may affect clinical outcomes of COVID-19 patients. We analyzed 625 serial plasma samples from 40 hospitalized COVID-19 patients and 170 SARS-CoV-2-infected outpatients and asymptomatic individuals. Severely ill patients developed significantly higher SARS-CoV-2-specific antibody responses than outpatients and asymptomatic individuals. The development of plasma antibodies was correlated with decreases in viral RNAemia, consistent with potential humoral immune clearance of virus. Using a novel competition ELISA, we detected antibodies blocking RBD-ACE2 interactions in 68% of inpatients and 40% of outpatients tested. Cross-reactive antibodies recognizing SARS-CoV RBD were found almost exclusively in hospitalized patients. Outpatient and asymptomatic individuals' serological responses to SARS-CoV-2 decreased within 2 months, suggesting that humoral protection may be short-lived.

    View details for DOI 10.1101/2020.08.15.20175794

    View details for PubMedID 32839786

    View details for PubMedCentralID PMC7444305

  • Transcriptional changes in peanut-specific CD4+ T cells over the course of oral immunotherapy. Clinical immunology (Orlando, Fla.) Wang, W., Lyu, S., Ji, X., Gupta, S., Manohar, M., Dhondalay, G. K., Chinthrajah, S., Andorf, S., Boyd, S. D., Tibshirani, R., Galli, S. J., Nadeau, K. C., Maecker, H. T. 2020: 108568

    Abstract

    Oral immunotherapy (OIT) can successfully desensitize allergic individuals to offending foods such as peanut. Our recent clinical trial (NCT02103270) of peanut OIT allowed us to monitor peanut-specific CD4+ T cells, using MHC-peptide Dextramers, over the course of OIT. We used a single-cell targeted RNAseq assay to analyze these cells at 0, 12, 24, 52, and 104 weeks of OIT. We found a transient increase in TGFbeta-producing cells at 52 weeks in those with successful desensitization, which lasted until 117 weeks. We also performed clustering and identified 5 major clusters of Dextramer+ cells, which we tracked over time. One of these clusters appeared to be anergic, while another was consistent with recently described TFH13 cells. The other 3 clusters appeared to be Th2 cells by their coordinated production of IL-4 and IL-13, but they varied in their expression of STAT signaling proteins and other markers. A cluster with high expression of STAT family members also showed a possible transient increase at week 24 in those with successful desensitization. Single cell TCRalphabeta repertoire sequences were too diverse to track clones over time. Together with increased TGFbeta production, these changes may be mechanistic predictors of successful OIT that should be further investigated.

    View details for DOI 10.1016/j.clim.2020.108568

    View details for PubMedID 32783912

  • Metabolic Dynamics and Prediction of Gestational Age and Time to Delivery in Pregnant Women. Cell Liang, L., Rasmussen, M. H., Piening, B., Shen, X., Chen, S., Rost, H., Snyder, J. K., Tibshirani, R., Skotte, L., Lee, N. C., Contrepois, K., Feenstra, B., Zackriah, H., Snyder, M., Melbye, M. 2020; 181 (7): 1680

    Abstract

    Metabolism during pregnancy is a dynamic and precisely programmed process, the failure of which can bring devastating consequences to the mother and fetus. To define a high-resolution temporal profile of metabolites during healthy pregnancy, we analyzed the untargeted metabolome of 784 weekly blood samples from 30 pregnant women. Broad changes and a highly choreographed profile were revealed: 4,995 metabolic features (of 9,651 total), 460 annotated compounds (of 687 total), and 34 human metabolic pathways (of 48 total) were significantly changed during pregnancy. Using linear models, we built a metabolic clock with five metabolites that time gestational age in high accordance with ultrasound (R = 0.92). Furthermore, two to three metabolites can identify when labor occurs (time to delivery within two, four, and eight weeks, AUROC ≥ 0.85). Our study represents a weekly characterization of the human pregnancy metabolome, providing a high-resolution landscape for understanding pregnancy with potential clinical utilities.

    View details for DOI 10.1016/j.cell.2020.05.002

    View details for PubMedID 32589958

  • Molecular Transducers of Physical Activity Consortium (MoTrPAC): Mapping the Dynamic Responses to Exercise. Cell Sanford, J. A., Nogiec, C. D., Lindholm, M. E., Adkins, J. N., Amar, D., Dasari, S., Drugan, J. K., Fernandez, F. M., Radom-Aizik, S., Schenk, S., Snyder, M. P., Tracy, R. P., Vanderboom, P., Trappe, S., Walsh, M. J., Molecular Transducers of Physical Activity Consortium, Adkins, J. N., Amar, D., Dasari, S., Drugan, J. K., Evans, C. R., Fernandez, F. M., Li, Y., Lindholm, M. E., Nogiec, C. D., Radom-Aizik, S., Sanford, J. A., Schenk, S., Snyder, M. P., Tomlinson, L., Tracy, R. P., Trappe, S., Vanderboom, P., Walsh, M. J., Alekel, D. L., Bekirov, I., Boyce, A. T., Boyington, J., Fleg, J. L., Joseph, L. J., Laughlin, M. R., Maruvada, P., Morris, S. A., McGowan, J. A., Nierras, C., Pai, V., Peterson, C., Ramos, E., Roary, M. C., Williams, J. P., Xia, A., Cornell, E., Rooney, J., Miller, M. E., Ambrosius, W. T., Rushing, S., Stowe, C. L., Rejeski, W. J., Nicklas, B. J., Pahor, M., Lu, C., Trappe, T., Chambers, T., Raue, U., Lester, B., Bergman, B. C., Bessesen, D. H., Jankowski, C. M., Kohrt, W. M., Melanson, E. L., Moreau, K. L., Schauer, I. E., Schwartz, R. S., Kraus, W. E., Slentz, C. A., Huffman, K. M., Johnson, J. L., Willis, L. H., Kelly, L., Houmard, J. A., Dubis, G., Broskey, N., Goodpaster, B. H., Sparks, L. M., Coen, P. M., Cooper, D. M., Haddad, F., Rankinen, T., Ravussin, E., Johannsen, N., Harris, M., Jakicic, J. M., Newman, A. B., Forman, D. D., Kershaw, E., Rogers, R. J., Nindl, B. C., Page, L. C., Stefanovic-Racic, M., Barr, S. L., Rasmussen, B. B., Moro, T., Paddon-Jones, D., Volpi, E., Spratt, H., Musi, N., Espinoza, S., Patel, D., Serra, M., Gelfond, J., Burns, A., Bamman, M. M., Buford, T. W., Cutter, G. R., Bodine, S. C., Esser, K., Farrar, R. P., Goodyear, L. J., Hirshman, M. F., Albertson, B. G., Qian, W., Piehowski, P., Gritsenko, M. A., Monore, M. E., Petyuk, V. A., McDermott, J. E., Hansen, J. N., Hutchison, C., Moore, S., Gaul, D. A., Clish, C. B., Avila-Pacheco, J., Dennis, C., Kellis, M., Carr, S., Jean-Beltran, P. M., Keshishian, H., Mani, D. R., Clauser, K., Krug, K., Mundorff, C., Pearce, C., Ivanova, A. A., Ortlund, E. A., Maner-Smith, K., Uppal, K., Zhang, T., Sealfon, S. C., Zavlasky, E., Nair, V., Li, S., Jain, N., Ge, Y., Sun, Y., Nudelman, G., Ruf-Zamojski, F., Smith, G., Pincas, N., Rubenstein, A., Amper, M. A., Seenarine, N., Lappalainen, T., Lanza, I. R., Nair, K. S., Klaus, K., Montgomery, S. B., Smith, K. S., Gay, N. R., Zhao, B., Hung, C. J., Zebarjadi, N., Balliu, B., Fresard, L., Burant, C. F., Li, J. Z., Kachman, M., Soni, T., Raskind, A. B., Gerszten, R., Robbins, J., Ilkayeva, O., Muehlbauer, M. J., Newgard, C. B., Ashley, E. A., Wheeler, M. T., Jimenez-Morales, D., Raja, A., Dalton, K. P., Zhen, J., Kim, Y. S., Christle, J. W., Marwaha, S., Chin, E. T., Hershman, S. G., Hastie, T., Tibshirani, R., Rivas, M. A. 2020; 181 (7): 1464–74

    Abstract

    Exercise provides a robust physiological stimulus that evokes cross-talk among multiple tissues and that, when repeated regularly (i.e., training), improves physiological capacity, benefits numerous organ systems, and decreases the risk for premature mortality. However, a gap remains in identifying the detailed molecular signals induced by exercise that benefit health and prevent disease. The Molecular Transducers of Physical Activity Consortium (MoTrPAC) was established to address this gap and generate a molecular map of exercise. Preclinical and clinical studies will examine the systemic effects of endurance and resistance exercise across a range of ages and fitness levels by molecular probing of multiple tissues before and after acute and chronic exercise. From this multi-omic and bioinformatic analysis, a molecular map of exercise will be established. Altogether, MoTrPAC will provide a public database that is expected to enhance our understanding of the health benefits of exercise and to provide insight into how physical activity mitigates disease.

    View details for DOI 10.1016/j.cell.2020.06.004

    View details for PubMedID 32589957

  • Discussion of "Prediction, Estimation, and Attribution" by Bradley Efron JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Friedman, J., Hastie, T., Tibshirani, R. 2020; 115 (530): 665–66
  • Integrating genomic features for non-invasive early lung cancer detection. Nature Chabon, J. J., Hamilton, E. G., Kurtz, D. M., Esfahani, M. S., Moding, E. J., Stehr, H., Schroers-Martin, J., Nabet, B. Y., Chen, B., Chaudhuri, A. A., Liu, C. L., Hui, A. B., Jin, M. C., Azad, T. D., Almanza, D., Jeon, Y. J., Nesselbush, M. C., Co Ting Keh, L., Bonilla, R. F., Yoo, C. H., Ko, R. B., Chen, E. L., Merriott, D. J., Massion, P. P., Mansfield, A. S., Jen, J., Ren, H. Z., Lin, S. H., Costantino, C. L., Burr, R., Tibshirani, R., Gambhir, S. S., Berry, G. J., Jensen, K. C., West, R. B., Neal, J. W., Wakelee, H. A., Loo, B. W., Kunder, C. A., Leung, A. N., Lui, N. S., Berry, M. F., Shrager, J. B., Nair, V. S., Haber, D. A., Sequist, L. V., Alizadeh, A. A., Diehn, M. 2020; 580 (7802): 245-251

    Abstract

    Radiologic screening of high-risk adults reduces lung-cancer-related mortality; however, a small minority of eligible individuals undergo such screening in the United States. The availability of blood-based tests could increase screening uptake. Here we introduce improvements to cancer personalized profiling by deep sequencing (CAPP-Seq), a method for the analysis of circulating tumour DNA (ctDNA), to better facilitate screening applications. We show that, although levels are very low in early-stage lung cancers, ctDNA is present prior to treatment in most patients and its presence is strongly prognostic. We also find that the majority of somatic mutations in the cell-free DNA (cfDNA) of patients with lung cancer and of risk-matched controls reflect clonal haematopoiesis and are non-recurrent. Compared with tumour-derived mutations, clonal haematopoiesis mutations occur on longer cfDNA fragments and lack mutational signatures that are associated with tobacco smoking. Integrating these findings with other molecular features, we develop and prospectively validate a machine-learning method termed 'lung cancer likelihood in plasma' (Lung-CLiP), which can robustly discriminate early-stage lung cancer patients from risk-matched controls. This approach achieves performance similar to that of tumour-informed ctDNA detection and enables tuning of assay specificity in order to facilitate distinct clinical applications. Our findings establish the potential of cfDNA for lung cancer screening and highlight the importance of risk-matching cases and controls in cfDNA-based screening studies.

    View details for DOI 10.1038/s41586-020-2140-0

    View details for PubMedID 32269342

  • Post model-fitting exploration via a "Next-Door" analysis CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE Guan, L., Tibshirani, R. 2020

    View details for DOI 10.1002/cjs.11542

    View details for Web of Science ID 000561683100001

  • Dose-related Allergic Reactions Decrease Over Time During Peanut Oral Immunotherapy in a Large, Randomized, Double-blind, Placebo-controlled, Phase 2 Study Long, A., Purington, N., Andorf, S., O'Laughlin, K., Lyu, S., Sindher, S., Manohar, M., Boyd, S., Tibshirani, R., Maecker, H., Mukai, K., Tsai, M., Desai, M., Chinthrajah, S., Galli, S., Nadeau, K. MOSBY-ELSEVIER. 2020: AB134
  • Sustained outcomes in oral immunotherapy for peanut allergy (POISED study): a large, randomised, double-blind, placebo-controlled, phase 2 study Chinthrajah, S., Purington, N., Andorf, S., Long, A., O'Laughlin, K., Lyu, S., Manohar, M., Boyd, S., Tibshirani, R., Maecker, H., Mukai, K., Tsai, M., Desai, M., Galli, S., Nadeau, K. MOSBY-ELSEVIER. 2020: AB181
  • A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS genetics Qian, J., Tanigawa, Y., Du, W., Aguirre, M., Chang, C., Tibshirani, R., Rivas, M. A., Hastie, T. 2020; 16 (10): e1009141

    Abstract

    The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve the prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso, since its first proposal in statistics, has proved to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports ℓ1-penalized linear model, logistic regression, Cox model, and also extends to the elastic net with ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.

    View details for DOI 10.1371/journal.pgen.1009141

    View details for PubMedID 33095761
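
    The screen-fit-check idea summarized in the abstract above can be illustrated with a small sketch. This is not the snpnet implementation: it handles a single penalty value rather than a full path, uses a plain coordinate-descent solver written here for self-containment, and the names `lasso_cd`, `basil_like`, and `batch` are hypothetical. Each round adds the variables with the largest gradient magnitude that violate the optimality (KKT) condition, refits on that subset only, and stops once no excluded variable violates the condition.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, cols, n_iter=200):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1 over columns `cols`."""
    n = len(y)
    beta = {j: 0.0 for j in cols}
    resid = list(y)
    norms = {j: sum(X[i][j] ** 2 for i in range(n)) / n for j in cols}
    for _ in range(n_iter):
        for j in cols:
            if norms[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (column j added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + norms[j] * beta[j]
            new = soft_threshold(rho, lam) / norms[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return beta, resid


def basil_like(X, y, lam, batch=2, max_rounds=20):
    """Simplified batch-screening loop for one penalty value (illustrative only)."""
    n, p = len(y), len(X[0])
    active, beta, resid = [], {}, list(y)
    for _ in range(max_rounds):
        # Gradient magnitude of the squared-error loss for every variable.
        grad = [abs(sum(X[i][j] * resid[i] for i in range(n)) / n) for j in range(p)]
        # KKT check: an excluded variable is acceptable only if |gradient| <= lam.
        violators = [j for j in range(p) if j not in active and grad[j] > lam + 1e-9]
        if not violators:
            break
        violators.sort(key=lambda j: -grad[j])
        active.extend(violators[:batch])
        beta, resid = lasso_cd(X, y, lam, active)
    return [beta.get(j, 0.0) for j in range(p)]


if __name__ == "__main__":
    random.seed(0)
    X = [[random.gauss(0, 1) for _ in range(8)] for _ in range(40)]
    y = [2 * row[0] - 1.5 * row[3] + 0.1 * random.gauss(0, 1) for row in X]
    print(basil_like(X, y, lam=0.1))
```

    Because the loop stops only when every excluded variable satisfies the KKT condition, the subset fit coincides with the fit over all variables (up to solver tolerance), while each refit touches only a few columns. That property is what makes this style of screening attractive when the full design matrix does not fit in memory.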

  • Origins and clonal convergence of gastrointestinal IgE+ B cells in human peanut allergy. Science immunology Hoh, R. A., Joshi, S. A., Lee, J. Y., Martin, B. A., Varma, S., Kwok, S., Nielsen, S. C., Nejad, P., Haraguchi, E., Dixit, P. S., Shutthanandan, S. V., Roskin, K. M., Zhang, W., Tupa, D., Bunning, B. J., Manohar, M., Tibshirani, R., Fernandez-Becker, N. Q., Kambham, N., West, R. B., Hamilton, R. G., Tsai, M., Galli, S. J., Chinthrajah, R. S., Nadeau, K. C., Boyd, S. D. 2020; 5 (45)

    Abstract

    B cells in human food allergy have been studied predominantly in the blood. Little is known about IgE+ B cells or plasma cells in tissues exposed to dietary antigens. We characterized IgE+ clones in blood, stomach, duodenum, and esophagus of 19 peanut-allergic patients, using high-throughput DNA sequencing. IgE+ cells in allergic patients are enriched in stomach and duodenum, and have a plasma cell phenotype. Clonally related IgE+ and non-IgE-expressing cell frequencies in tissues suggest local isotype switching, including transitions between IgA and IgE isotypes. Highly similar antibody sequences specific for peanut allergen Ara h 2 are shared between patients, indicating that common immunoglobulin genetic rearrangements may contribute to pathogenesis. These data define the gastrointestinal tract as a reservoir of IgE+ B lineage cells in food allergy.

    View details for DOI 10.1126/sciimmunol.aay4209

    View details for PubMedID 32139586

  • Increased diversity of gut microbiota during active oral immunotherapy in peanut allergic adults. Allergy He, Z., Vadali, V. G., Szabady, R. L., Zhang, W., Norman, J. M., Roberts, B., Tibshirani, R., Desai, M., Chinthrajah, R. S., Galli, S. J., Andorf, S., Nadeau, K. C. 2020

    View details for DOI 10.1111/all.14540

    View details for PubMedID 32750160

  • Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank. Biostatistics (Oxford, England) Li, R., Chang, C., Justesen, J. M., Tanigawa, Y., Qiang, J., Hastie, T., Rivas, M. A., Tibshirani, R. 2020

    Abstract

    We develop a scalable and highly efficient algorithm to fit a Cox proportional hazard model by maximizing the $L^1$-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale and high-dimensional data that do not fit in the memory. The output of our algorithm is the full Lasso path, the parameter estimates at all predefined regularization parameters, as well as their validation accuracy measured using the concordance index (C-index) or the validation deviance. To demonstrate the effectiveness of our algorithm, we analyze a large genotype-survival time dataset across 306 disease outcomes from the UK Biobank (Sudlow and others, 2015). We provide a publicly available implementation of the proposed approach for genetics data on top of the PLINK2 package and name it snpnet-Cox.

    View details for DOI 10.1093/biostatistics/kxaa038

    View details for PubMedID 32989444
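
    The concordance index mentioned in the abstract above as a validation measure can be sketched as follows. This is a minimal quadratic-time version of Harrell's C under simple assumptions (tied event times are ignored, tied risk scores count one half); the function name `concordance_index` is hypothetical, and the exact variant computed by snpnet-Cox may differ.

```python
def concordance_index(times, events, risk):
    """Fraction of comparable pairs in which the earlier failure has the higher risk.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if the subject was censored
    risk   : predicted risk scores (higher means earlier expected failure)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored subject cannot be the earlier member of a comparable pair.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

    A value of 1.0 means the risk scores order every comparable pair correctly, 0.5 corresponds to uninformative scores, and 0.0 means the ordering is fully reversed.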

  • Defining the features and duration of antibody responses to SARS-CoV-2 infection associated with disease severity and outcome. Science immunology Röltgen, K., Powell, A. E., Wirz, O. F., Stevens, B. A., Hogan, C. A., Najeeb, J., Hunter, M., Wang, H., Sahoo, M. K., Huang, C., Yamamoto, F., Manohar, M., Manalac, J., Otrelo-Cardoso, A. R., Pham, T. D., Rustagi, A., Rogers, A. J., Shah, N. H., Blish, C. A., Cochran, J. R., Jardetzky, T. S., Zehnder, J. L., Wang, T. T., Narasimhan, B., Gombar, S., Tibshirani, R., Nadeau, K. C., Kim, P. S., Pinsky, B. A., Boyd, S. D. 2020; 5 (54)

    Abstract

    SARS-CoV-2-specific antibodies, particularly those preventing viral spike receptor binding domain (RBD) interaction with host angiotensin-converting enzyme 2 (ACE2) receptor, can neutralize the virus. It is, however, unknown which features of the serological response may affect clinical outcomes of COVID-19 patients. We analyzed 983 longitudinal plasma samples from 79 hospitalized COVID-19 patients and 175 SARS-CoV-2-infected outpatients and asymptomatic individuals. Within this cohort, 25 patients died of their illness. Higher ratios of IgG antibodies targeting S1 or RBD domains of spike compared to nucleocapsid antigen were seen in outpatients who had mild illness versus severely ill patients. Plasma antibody increases correlated with decreases in viral RNAemia, but antibody responses in acute illness were insufficient to predict inpatient outcomes. Pseudovirus neutralization assays and a scalable ELISA measuring antibodies blocking RBD-ACE2 interaction were well correlated with patient IgG titers to RBD. Outpatient and asymptomatic individuals' SARS-CoV-2 antibodies, including IgG, progressively decreased during observation up to five months post-infection.

    View details for DOI 10.1126/sciimmunol.abe0240

    View details for PubMedID 33288645

  • Identification of Diagnostic Metabolic Signatures in Clear Cell Renal Cell Carcinoma Using Mass Spectrometry Imaging. International journal of cancer Vijayalakshmi, K., Shankar, V., Bain, R. M., Nolley, R., Sonn, G. A., Kao, C., Zhao, H., Tibshirani, R., Zare, R. N., Brooks, J. D. 2019

    Abstract

    Clear cell renal cell carcinoma (ccRCC) is the most common and lethal subtype of kidney cancer. Intraoperative frozen section (IFS) analysis is used to confirm the diagnosis during partial nephrectomy (PN). However, surgical margin evaluation using IFS analysis is time consuming and unreliable, leading to relatively low utilization. In this study, we demonstrated the use of desorption electrospray ionization mass spectrometry imaging (DESI-MSI) as a molecular diagnostic and prognostic tool for ccRCC. DESI-MSI was conducted on 23 fresh-frozen normal-tumor paired nephrectomy specimens of ccRCC. An independent validation cohort of 17 normal-tumor pairs was analyzed. DESI-MSI provides two-dimensional molecular images of tissues with mass spectra representing small metabolites, fatty acids, and lipids. These tissues were subjected to histopathologic evaluation. A set of metabolites that distinguish ccRCC from normal kidney was identified by performing least absolute shrinkage and selection operator (Lasso) and log-ratio Lasso analysis. Lasso analysis with leave-one-patient-out cross validation selected 57 peaks from over 27,000 metabolic features across 37,608 pixels obtained using DESI-MSI of ccRCC and normal tissues. Baseline Lasso of metabolites predicted the class of each tissue to be normal or cancerous tissue with an accuracy of 94% and 76%, respectively. Combining the baseline Lasso with the ratio of glucose to arachidonic acid could potentially reduce scan time and improve accuracy to identify normal (82%) and ccRCC (88%) tissue. DESI-MSI allows rapid detection of metabolites associated with normal and ccRCC with high accuracy. As this technology advances, it could be used for rapid intraoperative assessment of surgical margin status.

    View details for DOI 10.1002/ijc.32843

    View details for PubMedID 31863456

  • Increased T Cell Differentiation and Cytolytic Function in Bangladeshi Compared to American Children. Frontiers in immunology Wagar, L. E., Bolen, C. R., Sigal, N., Lopez Angel, C. J., Guan, L., Kirkpatrick, B. D., Haque, R., Tibshirani, R. J., Parsonnet, J., Petri, W. A., Davis, M. M. 2019; 10: 2239

    Abstract

    During the first 5 years of life, children are especially vulnerable to infection-related morbidity and mortality. Conversely, the Hygiene Hypothesis suggests that a lack of exposure to infectious agents early in life could explain the increasing incidence of allergies and autoimmunity in high-income countries. Understanding these phenomena, however, is hampered by a lack of comprehensive, direct immune monitoring in children with differing degrees of microbial exposure. Using mass cytometry, we provide an in-depth profile of the peripheral blood mononuclear cells (PBMCs) of children in regions at the extremes of exposure: the San Francisco Bay Area, USA and an economically poor district of Dhaka, Bangladesh. Despite variability in clinical health, functional characteristics of PBMCs were similar in Bangladeshi and American children at 1 year of age. However, by 2-3 years of age, Bangladeshi children's immune cells often demonstrated altered activation and cytokine production profiles upon stimulation with PMA-ionomycin, with an overall immune trajectory more in line with American adults. Conversely, immune responses in children from the US remained steady. Using principal component analysis, donor location, ethnic background, and cytomegalovirus infection status were found to account for some of the variation identified among samples. Within Bangladeshi 1-year-olds, stunting (as measured by height-for-age z-scores) was found to be associated with IL-8 and TGFβ expression in PMA-ionomycin stimulated samples. Combined, these findings provide important insights into the immune systems of children in high vs. low microbial exposure environments and suggest an important role for IL-8 and TGFβ in mitigating the microbial challenges faced by the Bangladeshi children.

    View details for DOI 10.3389/fimmu.2019.02239

    View details for PubMedID 31620139

    View details for PubMedCentralID PMC6763580

  • Early detection of unilateral ureteral obstruction by desorption electrospray ionization mass spectrometry. Scientific reports Banerjee, S., Wong, A. C., Yan, X., Wu, B., Zhao, H., Tibshirani, R. J., Zare, R. N., Brooks, J. D. 2019; 9 (1): 11007

    Abstract

    Desorption electrospray ionization mass spectrometry (DESI-MS) is an emerging analytical tool for rapid in situ assessment of metabolomic profiles on tissue sections without tissue pretreatment or labeling. We applied DESI-MS to identify candidate metabolic biomarkers associated with kidney injury at the early stage. DESI-MS was performed on sections of kidneys from 80 mice over a time course following unilateral ureteral obstruction (UUO) and compared to sham controls. A predictive model of renal damage was constructed using the LASSO (least absolute shrinkage and selection operator) method. Levels of lipid and small metabolites were significantly altered and glycerophospholipids comprised a significant fraction of altered species. These changes correlate with altered expression of lipid metabolic genes, with most genes showing decreased expression. However, rapid upregulation of PG(22:6/22:6) level appeared to be a hitherto unknown feature of the metabolic shift observed in UUO. Using LASSO and SAM (significance analysis of microarrays), we identified a set of well-measured metabolites that accurately predicted UUO-induced renal damage that was detectable by 12 h after UUO, prior to apparent histological changes. Thus, DESI-MS could serve as a useful adjunct to histology in identifying renal damage and demonstrates early and broad changes in membrane-associated lipids.

    View details for DOI 10.1038/s41598-019-47396-x

    View details for PubMedID 31358807

  • Dynamic Risk Profiling Using Serial Tumor Biomarkers for Personalized Outcome Prediction. Cell Kurtz, D. M., Esfahani, M. S., Scherer, F., Soo, J., Jin, M. C., Liu, C. L., Newman, A. M., Duhrsen, U., Huttmann, A., Casasnovas, O., Westin, J. R., Ritgen, M., Bottcher, S., Langerak, A. W., Roschewski, M., Wilson, W. H., Gaidano, G., Rossi, D., Bahlo, J., Hallek, M., Tibshirani, R., Diehn, M., Alizadeh, A. A. 2019

    Abstract

    Accurate prediction of long-term outcomes remains a challenge in the care of cancer patients. Due to the difficulty of serial tumor sampling, previous prediction tools have focused on pretreatment factors. However, emerging non-invasive diagnostics have increased opportunities for serial tumor assessments. We describe the Continuous Individualized Risk Index (CIRI), a method to dynamically determine outcome probabilities for individual patients utilizing risk predictors acquired over time. Similar to "win probability" models in other fields, CIRI provides a real-time probability by integrating risk assessments throughout a patient's course. Applying CIRI to patients with diffuse large B cell lymphoma, we demonstrate improved outcome prediction compared to conventional risk models. We demonstrate CIRI's broader utility in analogous models of chronic lymphocytic leukemia and breast adenocarcinoma and perform a proof-of-concept analysis demonstrating how CIRI could be used to develop predictive biomarkers for therapy selection. We envision that dynamic risk assessment will facilitate personalized medicine and enable innovative therapeutic paradigms.

    View details for DOI 10.1016/j.cell.2019.06.011

    View details for PubMedID 31280963

  • Main Effects and Interactions in Mixed and Incomplete Data Frames JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Robin, G., Klopp, O., Josse, J., Moulines, E., Tibshirani, R. 2019
  • Log-ratio lasso: Scalable, sparse estimation for log-ratio models BIOMETRICS Bates, S., Tibshirani, R. 2019; 75 (2): 613–24

    View details for DOI 10.1111/biom.12995

    View details for Web of Science ID 000483730600028

  • Proliferation tracing with single-cell mass cytometry optimizes generation of stem cell memory-like T cells NATURE BIOTECHNOLOGY Good, Z., Borges, L., Gonzalez, N., Sahaf, B., Samusik, N., Tibshirani, R., Nolan, G. P., Bendall, S. C. 2019; 37 (3): 259-+
  • Shaping of infant B cell receptor repertoires by environmental factors and infectious disease. Science translational medicine Nielsen, S. C., Roskin, K. M., Jackson, K. J., Joshi, S. A., Nejad, P., Lee, J., Wagar, L. E., Pham, T. D., Hoh, R. A., Nguyen, K. D., Tsunemoto, H. Y., Patel, S. B., Tibshirani, R., Ley, C., Davis, M. M., Parsonnet, J., Boyd, S. D. 2019; 11 (481)

    Abstract

    Antigenic exposures at epithelial sites in infancy and early childhood are thought to influence the maturation of humoral immunity and modulate the risk of developing immunoglobulin E (IgE)-mediated allergic disease. How different kinds of environmental exposures influence B cell isotype switching to IgE, IgG, or IgA, and the somatic mutation maturation of these antibody pools, is not fully understood. We sequenced antibody repertoires in longitudinal blood samples in a birth cohort from infancy through the first 3 years of life and found that, whereas IgG and IgA show linear increases in mutational maturation with age, IgM and IgD mutations are more closely tied to pathogen exposure. IgE mutation frequencies are primarily increased in children with impaired skin barrier conditions such as eczema, suggesting that IgE affinity maturation could provide a mechanistic link between epithelial barrier failure and allergy development.

    View details for PubMedID 30814336

  • Reply to J. Wang et al. Journal of clinical oncology : official journal of the American Society of Clinical Oncology Kurtz, D. M., Scherer, F., Jin, M. C., Soo, J., Craig, A. F., Esfahani, M. S., Chabon, J. J., Stehr, H., Liu, C. L., Tibshirani, R., Maeda, L. S., Gupta, N. K., Khodadoust, M. S., Advani, R. H., Newman, A. M., Duhrsen, U., Huttmann, A., Meignan, M., Casasnovas, O., Westin, J. R., Roschewski, M., Wilson, W. H., Gaidano, G., Rossi, D., Diehn, M., Alizadeh, A. A. 2019: JCO1801907

    View details for PubMedID 30753108

  • Proliferation tracing with single-cell mass cytometry optimizes generation of stem cell memory-like T cells. Nature biotechnology Good, Z., Borges, L., Vivanco Gonzalez, N., Sahaf, B., Samusik, N., Tibshirani, R., Nolan, G. P., Bendall, S. C. 2019

    Abstract

    Selective differentiation of naive T cells into multipotent T cells is of great interest clinically for the generation of cell-based cancer immunotherapies. Cellular differentiation depends crucially on division state and time. Here we adapt a dye dilution assay for tracking cell proliferative history through mass cytometry and uncouple division, time and regulatory protein expression in single naive human T cells during their activation and expansion in a complex ex vivo milieu. Using 23 markers, we defined groups of proteins controlled predominantly by division state or time and found that undivided cells account for the majority of phenotypic diversity. We next built a map of cell state changes during naive T-cell expansion. By examining cell signaling on this map, we rationally selected ibrutinib, a BTK and ITK inhibitor, and administered it before T cell activation to direct differentiation toward a T stem cell memory (TSCM)-like phenotype. This method for tracing cell fate across division states and time can be broadly applied for directing cellular differentiation.

    View details for PubMedID 30742126

  • Desensitization rates to peanut protein during OIT among children, adolescents, and adults Long, A. J., Purington, N., Woch, M., O'Laughlin, K., Tan, T., Kost, L., Hijazi, S., Shojinaga, M., Raeber, O., Alvarez, A., Andorf, S., Tibshirani, R., Galli, S. J., Nadeau, K. C., Chinthrajah, R. MOSBY-ELSEVIER. 2019: AB245
  • An Approach to Explore for a Sweet-spot in Randomized Trials. Journal of clinical epidemiology Redelmeier, D. A., Tibshirani, R. J. 2019

    Abstract

    To demonstrate how a conventional randomized trial can be analyzed through a stratified or a matched approach to identify a potential sweet-spot where observed differences seem accentuated in the mid range of disease severity. We review a landmark randomized trial of heart failure patients that tested whether implantable defibrillators reduce mortality (n = 2,521). Overall, 22% (182 / 829) of the patients in the defibrillator group died compared to 29% (484 / 1,692) of patients in the control group. Proportional hazards analysis yielded a modest 25% survival benefit (hazard ratio = 0.75, 95% confidence interval: 0.63 to 0.89). Stratified analysis of the trial yielded a larger 52% survival benefit for those in the middle quintile of disease severity (hazard ratio = 0.48, 95% confidence interval: 0.29 to 0.79). In contrast, little of the survival benefit was explained by patients with the greatest disease severity (hazard ratio = 0.89, 95% confidence interval: 0.69 to 1.15). The discrepancy between crude and stratified analyses could be visualized by graphical displays and replicated with matched comparisons. Our approach for analyzing a randomized trial could help identify a potential sweet-spot of an accentuated treatment effect.

    View details for DOI 10.1016/j.jclinepi.2019.12.012

    View details for PubMedID 31874202

  • Mapping lung cancer epithelial-mesenchymal transition states and trajectories with single-cell resolution. Nature communications Karacosta, L. G., Anchang, B., Ignatiadis, N., Kimmey, S. C., Benson, J. A., Shrager, J. B., Tibshirani, R., Bendall, S. C., Plevritis, S. K. 2019; 10 (1): 5587

    Abstract

    Elucidating the spectrum of epithelial-mesenchymal transition (EMT) and mesenchymal-epithelial transition (MET) states in clinical samples promises insights on cancer progression and drug resistance. Using mass cytometry time-course analysis, we resolve lung cancer EMT states through TGFβ-treatment and identify, through TGFβ-withdrawal, a distinct MET state. We demonstrate significant differences between EMT and MET trajectories using a computational tool (TRACER) for reconstructing trajectories between cell states. In addition, we construct a lung cancer reference map of EMT and MET states referred to as the EMT-MET PHENOtypic STAte MaP (PHENOSTAMP). Using a neural net algorithm, we project clinical samples onto the EMT-MET PHENOSTAMP to characterize their phenotypic profile with single-cell resolution in terms of our in vitro EMT-MET analysis. In summary, we provide a framework to phenotypically characterize clinical samples in the context of in vitro EMT-MET findings which could help assess clinical relevance of EMT in cancer in future studies.

    View details for DOI 10.1038/s41467-019-13441-6

    View details for PubMedID 31811131

  • Sustained outcomes in oral immunotherapy for peanut allergy (POISED study): a large, randomised, double-blind, placebo-controlled, phase 2 study. Lancet (London, England) Chinthrajah, R. S., Purington, N., Andorf, S., Long, A., O'Laughlin, K. L., Lyu, S. C., Manohar, M., Boyd, S. D., Tibshirani, R., Maecker, H., Plaut, M., Mukai, K., Tsai, M., Desai, M., Galli, S. J., Nadeau, K. C. 2019

    Abstract

    Dietary avoidance is recommended for peanut allergies. We evaluated the sustained effects of peanut allergy oral immunotherapy (OIT) in a randomised long-term study in adults and children. In this randomised, double-blind, placebo-controlled, phase 2 study, we enrolled participants at the Sean N Parker Center for Allergy and Asthma Research at Stanford University (Stanford, CA, USA) with peanut allergy aged 7-55 years with a positive result from a double-blind, placebo-controlled, food challenge (DBPCFC; ≤500 mg of peanut protein), a positive skin-prick test (SPT) result (≥5 mm wheal diameter above the negative control), and peanut-specific immunoglobulin (Ig)E concentration of more than 4 kU/L. Participants were randomly assigned (2·4:1·4:1) in a two-by-two block design via a computerised system to be built up and maintained on 4000 mg peanut protein through to week 104 then discontinued on peanut (peanut-0 group), to be built up and maintained on 4000 mg peanut protein through to week 104 then to ingest 300 mg peanut protein daily (peanut-300 group) for 52 weeks, or to receive oat flour (placebo group). DBPCFCs to 4000 mg peanut protein were done at baseline and weeks 104, 117, 130, 143, and 156. The pharmacist assigned treatment on the basis of a randomised computer list. Peanut or placebo (oat) flour was administered orally and participants and the study team were masked throughout by use of oat flour that was similar in look and feel to the peanut flour and nose clips, as tolerated, to mask taste. The statistician was also masked. The primary endpoint was the proportion of participants who passed DBPCFCs to a cumulative dose of 4000 mg at both 104 and 117 weeks. The primary efficacy analysis was done in the intention-to-treat population. Safety was assessed in the intention-to-treat population. This trial is registered at ClinicalTrials.gov, NCT02103270. Between April 15, 2014, and March 2, 2016, of 152 individuals assessed, we enrolled 120 participants, who were randomly assigned to the peanut-0 (n=60), peanut-300 (n=35), and placebo groups (n=25). 21 (35%) of peanut-0 group participants and one (4%) placebo group participant passed the 4000 mg challenge at both 104 and 117 weeks (odds ratio [OR] 12·7, 95% CI 1·8-554·8; p=0·0024). Over the entire study, the most common adverse events were mild gastrointestinal symptoms, which were seen in 90 of 120 patients (50/60 in the peanut-0 group, 29/35 in the peanut-300 group, and 11/25 in the placebo group) and skin disorders, which were seen in 50/120 patients (26/60 in the peanut-0 group, 15/35 in the peanut-300 group, and 9/25 in the placebo group). Adverse events decreased over time in all groups. Two participants in the peanut groups had serious adverse events during the 3-year study. In the peanut-0 group, in which eight (13%) of 60 participants passed DBPCFCs at week 156, higher baseline peanut-specific IgG4 to IgE ratio and lower Ara h 2 IgE and basophil activation responses were associated with sustained unresponsiveness. No treatment-related deaths occurred. Our study suggests that peanut OIT could desensitise individuals with peanut allergy to 4000 mg peanut protein but discontinuation, or even reduction to 300 mg daily, could increase the likelihood of regaining clinical reactivity to peanut. Since baseline blood tests correlated with week 117 treatment outcomes, this study might aid in optimal patient selection for this therapy. Funding: National Institute of Allergy and Infectious Diseases.

    View details for DOI 10.1016/S0140-6736(19)31793-3

    View details for PubMedID 31522849

  • Preoperative metabolic classification of thyroid nodules using mass spectrometry imaging of fine-needle aspiration biopsies. Proceedings of the National Academy of Sciences of the United States of America DeHoog, R. J., Zhang, J., Alore, E., Lin, J. Q., Yu, W., Woody, S., Almendariz, C., Lin, M., Engelsman, A. F., Sidhu, S. B., Tibshirani, R., Suliburk, J., Eberlin, L. S. 2019

    Abstract

    Thyroid neoplasia is common and requires appropriate clinical workup with imaging and fine-needle aspiration (FNA) biopsy to evaluate for cancer. Yet, up to 20% of thyroid nodule FNA biopsies will be indeterminate in diagnosis based on cytological evaluation. Genomic approaches to characterize the malignant potential of nodules showed initial promise but have provided only modest improvement in diagnosis. Here, we describe a method using metabolic analysis by desorption electrospray ionization mass spectrometry (DESI-MS) imaging for direct analysis and diagnosis of follicular cell-derived neoplasia tissues and FNA biopsies. DESI-MS was used to analyze 178 tissue samples to determine the molecular signatures of normal, benign follicular adenoma (FTA), and malignant follicular carcinoma (FTC) and papillary carcinoma (PTC) thyroid tissues. Statistical classifiers, including benign thyroid versus PTC and benign thyroid versus FTC, were built and validated with 114,125 mass spectra, with accuracy assessed in correlation with clinical pathology. Clinical FNA smears were prospectively collected and analyzed using DESI-MS imaging, and the performance of the statistical classifiers was tested with 69 prospectively collected clinical FNA smears. High performance was achieved for both models when predicting on the FNA test set, which included 24 nodules with indeterminate preoperative cytology, with accuracies of 93% and 89%. Our results strongly suggest that DESI-MS imaging is a valuable technology for identification of malignant potential of thyroid nodules.

    View details for DOI 10.1073/pnas.1911333116

    View details for PubMedID 31591199

  • Genomic analysis of benign prostatic hyperplasia implicates cellular re-landscaping in disease pathogenesis. JCI insight Middleton, L. W., Shen, Z., Varma, S., Pollack, A. S., Gong, X., Zhu, S., Zhu, C., Foley, J. W., Vennam, S., Sweeney, R. T., Tu, K., Biscocho, J., Eminaga, O., Nolley, R., Tibshirani, R., Brooks, J. D., West, R. B., Pollack, J. R. 2019; 5

    Abstract

    Benign prostatic hyperplasia (BPH) is the most common cause of lower urinary tract symptoms in men. Current treatments target prostate physiology rather than BPH pathophysiology and are only partially effective. Here, we applied next-generation sequencing to gain new insight into BPH. By RNAseq, we uncovered transcriptional heterogeneity among BPH cases, where a 65-gene BPH stromal signature correlated with symptom severity. Stromal signaling molecules BMP5 and CXCL13 were enriched in BPH while estrogen regulated pathways were depleted. Notably, BMP5 addition to cultured prostatic myofibroblasts altered their expression profile towards a BPH profile that included the BPH stromal signature. RNAseq also suggested an altered cellular milieu in BPH, which we verified by immunohistochemistry and single-cell RNAseq. In particular, BPH tissues exhibited enrichment of myofibroblast subsets, whilst depletion of neuroendocrine cells and an estrogen receptor (ESR1)-positive fibroblast cell type residing near epithelium. By whole-exome sequencing, we uncovered somatic single-nucleotide variants (SNVs) in BPH, of uncertain pathogenic significance but indicative of clonal cell expansions. Thus, genomic characterization of BPH has identified a clinically-relevant stromal signature and new candidate disease pathways (including a likely role for BMP5 signaling), and reveals BPH to be not merely a hyperplasia, but rather a fundamental re-landscaping of cell types.

    View details for DOI 10.1172/jci.insight.129749

    View details for PubMedID 31094703

  • Multiomics modeling of the immunome, transcriptome, microbiome, proteome and metabolome adaptations during human pregnancy. Bioinformatics (Oxford, England) Ghaemi, M. S., DiGiulio, D. B., Contrepois, K., Callahan, B., Ngo, T. T., Lee-McMullen, B., Lehallier, B., Robaczewska, A., Mcilwain, D., Rosenberg-Hasson, Y., Wong, R. J., Quaintance, C., Culos, A., Stanley, N., Tanada, A., Tsai, A., Gaudilliere, D., Ganio, E., Han, X., Ando, K., McNeil, L., Tingle, M., Wise, P., Maric, I., Sirota, M., Wyss-Coray, T., Winn, V. D., Druzin, M. L., Gibbs, R., Darmstadt, G. L., Lewis, D. B., Partovi Nia, V., Agard, B., Tibshirani, R., Nolan, G., Snyder, M. P., Relman, D. A., Quake, S. R., Shaw, G. M., Stevenson, D. K., Angst, M. S., Gaudilliere, B., Aghaeepour, N. 2019; 35 (1): 95–103

    Abstract

    Motivation: Multiple biological clocks govern a healthy pregnancy. These biological mechanisms produce immunologic, metabolomic, proteomic, genomic and microbiomic adaptations during the course of pregnancy. Modeling the chronology of these adaptations during full-term pregnancy provides the frameworks for future studies examining deviations implicated in pregnancy-related pathologies including preterm birth and preeclampsia. Results: We performed a multiomics analysis of 51 samples from 17 pregnant women, delivering at term. The datasets included measurements from the immunome, transcriptome, microbiome, proteome and metabolome of samples obtained simultaneously from the same patients. Multivariate predictive modeling using the Elastic Net (EN) algorithm was used to measure the ability of each dataset to predict gestational age. Using stacked generalization, these datasets were combined into a single model. This model not only significantly increased predictive power by combining all datasets, but also revealed novel interactions between different biological modalities. Future work includes expansion of the cohort to preterm-enriched populations and in vivo analysis of immune-modulating interventions based on the mechanisms identified. Availability and implementation: Datasets and scripts for reproduction of results are available through: https://nalab.stanford.edu/multiomics-pregnancy/. Supplementary information: Supplementary data are available at Bioinformatics online.

    View details for PubMedID 30561547

  • Nuclear penalized multinomial regression with an application to predicting at bat outcomes in baseball STATISTICAL MODELLING Powers, S., Hastie, T., Tibshirani, R. 2018; 18 (5-6): 388–410
  • Found In Translation: a machine learning model for mouse-to-human inference. Nature methods Normand, R., Du, W., Briller, M., Gaujoux, R., Starosvetsky, E., Ziv-Kenet, A., Shalev-Malul, G., Tibshirani, R. J., Shen-Orr, S. S. 2018

    Abstract

    Cross-species differences form barriers to translational research that ultimately hinder the success of clinical trials, yet knowledge of species differences has yet to be systematically incorporated in the interpretation of animal models. Here we present Found In Translation (FIT; http://www.mouse2man.org), a statistical methodology that leverages public gene expression data to extrapolate the results of a new mouse experiment to expression changes in the equivalent human condition. We applied FIT to data from mouse models of 28 different human diseases and identified experimental conditions in which FIT predictions outperformed direct cross-species extrapolation from mouse results, increasing the overlap of differentially expressed genes by 20-50%. FIT predicted novel disease-associated genes, an example of which we validated experimentally. FIT highlights signals that may otherwise be missed and reduces false leads, with no experimental cost.

    View details for PubMedID 30478323

  • Analyzing Excess Risk from Matched Designs with Double Controls: Author response. Journal of clinical epidemiology Redelmeier, D., Tibshirani, R. J. 2018

    View details for PubMedID 30453039

  • Log-ratio Lasso: Scalable, Sparse Estimation for Log-ratio Models. Biometrics Bates, S., Tibshirani, R. 2018

    Abstract

    Positive-valued signal data is common in the biological and medical sciences, due to the prevalence of mass spectrometry and other imaging techniques. With such data, only the relative intensities of the raw measurements are meaningful. It is desirable to consider models consisting of the log-ratios of all pairs of the raw features, since log-ratios are the simplest meaningful derived features. In this case, however, the dimensionality of the predictor space becomes large, and computationally efficient estimation procedures are required. In this work, we introduce an embedding of the log-ratio parameter space into a space of much lower dimension and use this representation to develop an efficient penalized fitting procedure. This procedure serves as the foundation for a two-step fitting procedure that combines a convex filtering step with a second non-convex pruning step to yield highly sparse solutions. On a cancer proteomics data set, the proposed method fits a highly sparse model consisting of features of known biological relevance while greatly improving upon the predictive accuracy of less interpretable methods.

    View details for PubMedID 30387139
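
    The dimensionality point in the abstract above can be illustrated with a small sketch. It is not the authors' fitting procedure. It only checks a standard algebraic identity: any linear model in the p(p-1)/2 pairwise log-ratios equals a linear model in the p log-features whose coefficients sum to zero. Whether this is the exact embedding used in the paper is an assumption, and the function names and toy numbers are hypothetical.

```python
import math
from itertools import combinations


def log_ratio_features(x):
    """Return all pairwise log-ratios log(x[j] / x[k]) for j < k."""
    return {(j, k): math.log(x[j] / x[k])
            for j, k in combinations(range(len(x)), 2)}


def collapse_to_log_coefficients(theta, p):
    """Map pairwise coefficients theta[(j, k)] to p coefficients on log(x).

    Each pair contributes +theta to feature j and -theta to feature k,
    so the resulting coefficients always sum to zero.
    """
    beta = [0.0] * p
    for (j, k), t in theta.items():
        beta[j] += t
        beta[k] -= t
    return beta


# Toy positive-valued measurement and a sparse pairwise model (made-up values).
x = [2.0, 5.0, 0.5, 8.0]
theta = {(0, 1): 1.5, (2, 3): -0.7}

features = log_ratio_features(x)  # 4 raw features give 6 pairwise features
pairwise = sum(t * features[jk] for jk, t in theta.items())

beta = collapse_to_log_coefficients(theta, len(x))
linear = sum(b * math.log(v) for b, v in zip(beta, x))

print(len(features), abs(pairwise - linear) < 1e-12, abs(sum(beta)) < 1e-12)
```

    The pairwise feature count grows quadratically in the number of raw features, while the equivalent zero-sum representation has only one coefficient per raw feature.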

  • Multicenter Study Using Desorption-Electrospray-Ionization-Mass-Spectrometry Imaging for Breast-Cancer Diagnosis ANALYTICAL CHEMISTRY Porcari, A. M., Zhang, J., Garza, K. Y., Rodrigues-Peres, R. M., Lin, J. Q., Young, J. H., Tibshirani, R., Nagi, C., Paiva, G. R., Carter, S. A., Sarian, L., Eberlin, M. N., Eberlin, L. S. 2018; 90 (19): 11324–32

    Abstract

    The histological and molecular subtypes of breast cancer demand distinct therapeutic approaches. Invasive ductal carcinoma (IDC) is subtyped according to estrogen-receptor (ER), progesterone-receptor (PR), and HER2 status, among other markers. Desorption-electrospray-ionization-mass-spectrometry imaging (DESI-MSI) is an ambient-ionization MS technique that has been previously used to diagnose IDC. Aiming to investigate the robustness of ambient-ionization MS for IDC diagnosis and subtyping over diverse patient populations and interlaboratory use, we report a multicenter study using DESI-MSI to analyze samples from 103 patients independently analyzed in the United States and Brazil. The lipid profiles of IDC and normal breast tissues were consistent across different patient races and were unrelated to country of sample collection. Similar experimental parameters used in both laboratories yielded consistent mass-spectral data in mass-to-charge ratios (m/z) above 700, where complex lipids are observed. Statistical classifiers built using data acquired in the United States yielded 97.6% sensitivity, 96.7% specificity, and 97.6% accuracy for cancer diagnosis. Equivalent performance was observed for the intralaboratory validation set (99.2% accuracy) and, most remarkably, for the interlaboratory validation set independently acquired in Brazil (95.3% accuracy). Separate classification models built for ER and PR statuses as well as the status of their combined hormone receptor (HR) provided predictive accuracies (>89.0%), although low classification accuracies were achieved for HER2 status. Altogether, our multicenter study demonstrates that DESI-MSI is a robust and reproducible technology for rapid breast-cancer-tissue diagnosis and therefore is of value for clinical use.

    View details for PubMedID 30170496

  • Circulating Tumor DNA Measurements As Early Outcome Predictors in Diffuse Large B-Cell Lymphoma. Journal of clinical oncology : official journal of the American Society of Clinical Oncology Kurtz, D. M., Scherer, F., Jin, M. C., Soo, J., Craig, A. F., Esfahani, M. S., Chabon, J. J., Stehr, H., Liu, C. L., Tibshirani, R., Maeda, L. S., Gupta, N. K., Khodadoust, M. S., Advani, R. H., Levy, R., Newman, A. M., Duhrsen, U., Huttmann, A., Meignan, M., Casasnovas, R., Westin, J. R., Roschewski, M., Wilson, W. H., Gaidano, G., Rossi, D., Diehn, M., Alizadeh, A. A. 2018: JCO2018785246

    Abstract

    Purpose: Outcomes for patients with diffuse large B-cell lymphoma remain heterogeneous, with existing methods failing to consistently predict treatment failure. We examined the additional prognostic value of circulating tumor DNA (ctDNA) before and during therapy for predicting patient outcomes. Patients and Methods: We studied the dynamics of ctDNA from 217 patients treated at six centers, using a training and validation framework. We densely characterized early ctDNA dynamics during therapy using cancer personalized profiling by deep sequencing to define response-associated thresholds within a discovery set. These thresholds were assessed in two independent validation sets. Finally, we assessed the prognostic value of ctDNA in the context of established risk factors, including the International Prognostic Index and interim positron emission tomography/computed tomography scans. Results: Before therapy, ctDNA was detectable in 98% of patients; pretreatment levels were prognostic in both front-line and salvage settings. In the discovery set, ctDNA levels changed rapidly, with a 2-log decrease after one cycle (early molecular response [EMR]) and a 2.5-log decrease after two cycles (major molecular response [MMR]) stratifying outcomes. In the first validation set, patients receiving front-line therapy achieving EMR or MMR had superior outcomes at 24 months (EMR: EFS, 83% v 50%; P = .0015; MMR: EFS, 82% v 46%; P < .001). EMR also predicted superior 24-month outcomes in patients receiving salvage therapy in the first validation set (EFS, 100% v 13%; P = .011). The prognostic value of EMR and MMR was further confirmed in the second validation set. In multivariable analyses including International Prognostic Index and interim positron emission tomography/computed tomography scans across both cohorts, molecular response was independently prognostic of outcomes, including event-free and overall survival. Conclusion: Pretreatment ctDNA levels and molecular responses are independently prognostic of outcomes in aggressive lymphomas. These risk factors could potentially guide future personalized risk-directed approaches.

    View details for PubMedID 30125215

  • Development of plasma cell-free DNA (cfDNA) assays for early cancer detection: first insights from the Circulating Cell-Free Genome Atlas Study (CCGA) Aravanis, A. A., Oxnard, G. R., Maddala, T., Hubbell, E., Venn, O., Jamshidi, A., Shen, L., Amini, H., Beausang, J. A., Betts, C., Civello, D., Davydov, K., Fazullina, S., Filippova, D., Gnerre, S., Gross, S., Hou, C., Jiang, R., Jung, B., Kurtzman, K., Melton, C., Nautiyal, S., Newman, J., Newman, J., Nicolaou, C., Rava, R., Sakarya, O., Satya, R., Shojaee, S., Steffen, K., Valouev, A., Xu, H., Yue, J., Zhang, N., Baselga, J., Lapham, R., Davis, D. G., Smith, D., Richards, D., Seiden, M. V., Swanton, C., Yeatman, T. J., Tibshirani, R., Curtis, C., Plevritis, S. K., Williams, R., Klein, E., Hartman, A., Liu, M. C. AMER ASSOC CANCER RESEARCH. 2018
  • Supervised learning via the "hubNet" procedure. Statistica Sinica Guan, L., Fan, Z., Tibshirani, R. 2018; 28 (3): 1225-1243

    Abstract

    We propose a new method for supervised learning. The hubNet procedure fits a hub-based graphical model to the predictors, to estimate the amount of "connection" that each predictor has with other predictors. This yields a set of predictor weights that are then used in a regularized regression such as the lasso or elastic net. The resulting procedure is easy to implement, can often yield higher or competitive prediction accuracy with fewer features than the lasso, and can give insight into the underlying structure of the predictors. HubNet can be generalized seamlessly to supervised problems such as regularized logistic regression (and other GLMs), Cox's proportional hazards model, and nonlinear procedures such as random forests and boosting. We prove recovery results under a specialized model and illustrate the method on real and simulated data.

    View details for DOI 10.5705/ss.202016.0482

    View details for PubMedID 35677806

    View details for PubMedCentralID PMC9173714

  • Pharmacogenetics and progression to neovascular age-related macular degeneration-Evidence supporting practice change REPLY PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Vavvas, D. G., Small, K. W., Awh, C., Zanke, B. W., Tibshirani, R. J., Kustra, R. 2018; 115 (25): E5640–E5641

    View details for PubMedID 29880713

  • Noninvasive blood tests for fetal development predict gestational age and preterm delivery SCIENCE Ngo, T. M., Moufarrej, M. N., Rasmussen, M. H., Camunas-Soler, J., Pan, W., Okamoto, J., Neff, N. F., Liu, K., Wong, R. J., Downes, K., Tibshirani, R., Shaw, G. M., Skotte, L., Stevenson, D. K., Biggio, J. R., Elovitz, M. A., Melbye, M., Quake, S. R. 2018; 360 (6393): 1133–36

    Abstract

    Noninvasive blood tests that provide information about fetal development and gestational age could potentially improve prenatal care. Ultrasound, the current gold standard, is not always affordable in low-resource settings and does not predict spontaneous preterm birth, a leading cause of infant death. In a pilot study of 31 healthy pregnant women, we found that measurement of nine cell-free RNA (cfRNA) transcripts in maternal blood predicted gestational age with comparable accuracy to ultrasound but at substantially lower cost. In a related study of 38 women (23 full-term and 15 preterm deliveries), all at elevated risk of delivering preterm, we identified seven cfRNA transcripts that accurately classified women who delivered preterm up to 2 months in advance of labor. These tests hold promise for prenatal care in both the developed and developing worlds, although they require validation in larger, blinded clinical trials.

    View details for PubMedID 29880692

  • Methods for analyzing matched designs with double controls: excess risk is easily estimated and misinterpreted when evaluating traffic deaths JOURNAL OF CLINICAL EPIDEMIOLOGY Redelmeier, D. A., Tibshirani, R. J. 2018; 98: 117–22

    Abstract

    To demonstrate analytic approaches for matched studies where two controls are linked to each case and events are accumulating counts rather than binary outcomes. A secondary intent is to clarify the distinction between total risk and excess risk (unmatched vs. matched perspectives). We review past research testing whether elections can lead to increased traffic risks. The results are reinterpreted by analyzing both the total count of individuals in fatal crashes and the excess count of individuals in fatal crashes, each time accounting for the matched double controls. Overall, 1,546 individuals were in fatal crashes on the 10 election days (average = 155/d), and 2,593 individuals were in fatal crashes on the 20 control days (average = 130/d). Poisson regression of total counts yielded a relative risk of 1.19 (95% confidence interval: 1.12-1.27). Poisson regression of excess counts yielded a relative risk of 3.22 (95% confidence interval: 2.72-3.80). The discrepancy between analyses of total counts and excess counts replicated with alternative statistical models and was visualized in graphical displays. Available approaches provide methods for analyzing count data in matched designs with double controls and help clarify the distinction between increases in total risk and increases in excess risk.

    View details for PubMedID 29452220
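
    The total-count relative risk quoted in the abstract above can be checked by hand. The sketch below assumes the simplest setup, a Poisson model with one binary covariate (election day vs. control day) and a Wald interval on the log scale; the paper's actual model specification may differ, and the function name is hypothetical. Under this assumption the estimated rate ratio is the ratio of daily rates, and its log has standard error sqrt(1/a + 1/b) for event counts a and b. The excess-count analysis is not reproduced here because the abstract does not give enough detail.

```python
import math


def rate_ratio_with_ci(events_a, days_a, events_b, days_b, z=1.96):
    """Rate ratio of group A vs. group B with a Wald interval on the log scale."""
    rr = (events_a / days_a) / (events_b / days_b)
    se_log = math.sqrt(1.0 / events_a + 1.0 / events_b)
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Counts taken from the abstract: 1,546 individuals on 10 election days,
# 2,593 individuals on 20 control days.
rr, lower, upper = rate_ratio_with_ci(1546, 10, 2593, 20)
print(f"{rr:.2f} ({lower:.2f}-{upper:.2f})")  # prints "1.19 (1.12-1.27)"
```

    Under these assumptions the result agrees with the reported 1.19 (95% confidence interval: 1.12-1.27).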

  • Some methods for heterogeneous treatment effect estimation in high dimensions STATISTICS IN MEDICINE Powers, S., Qian, J., Jung, K., Schuler, A., Shah, N. H., Hastie, T., Tibshirani, R. 2018; 37 (11): 1767–87

    Abstract

    When devising a course of treatment for a patient, doctors often have little quantitative evidence on which to base their decisions, beyond their medical education and published clinical trials. Stanford Health Care alone has millions of electronic medical records that are only just recently being leveraged to inform better treatment recommendations. These data present a unique challenge because they are high dimensional and observational. Our goal is to make personalized treatment recommendations based on the outcomes for past patients similar to a new patient. We propose and analyze 3 methods for estimating heterogeneous treatment effects using observational data. Our methods perform well in simulations using a wide variety of treatment effect functions, and we present results of applying the 2 most promising methods to data from The SPRINT Data Analysis Challenge, from a large randomized trial of a treatment for high blood pressure.

    View details for PubMedID 29508417

    View details for PubMedCentralID PMC5938172

  • Single-cell developmental classification of B cell precursor acute lymphoblastic leukemia at diagnosis reveals predictors of relapse. Nature medicine Good, Z., Sarno, J., Jager, A., Samusik, N., Aghaeepour, N., Simonds, E. F., White, L., Lacayo, N. J., Fantl, W. J., Fazio, G., Gaipa, G., Biondi, A., Tibshirani, R., Bendall, S. C., Nolan, G. P., Davis, K. L. 2018; 24 (4): 474–83

    Abstract

    Insight into the cancer cell populations that are responsible for relapsed disease is needed to improve outcomes. Here we report a single-cell-based study of B cell precursor acute lymphoblastic leukemia at diagnosis that reveals hidden developmentally dependent cell signaling states that are uniquely associated with relapse. By using mass cytometry we simultaneously quantified 35 proteins involved in B cell development in 60 primary diagnostic samples. Each leukemia cell was then matched to its nearest healthy B cell population by a developmental classifier that operated at the single-cell level. Machine learning identified six features of expanded leukemic populations that were sufficient to predict patient relapse at diagnosis. These features implicated the pro-BII subpopulation of B cells with activated mTOR signaling, and the pre-BI subpopulation of B cells with activated and unresponsive pre-B cell receptor signaling, to be associated with relapse. This model, termed 'developmentally dependent predictor of relapse' (DDPR), significantly improves currently established risk stratification methods. DDPR features exist at diagnosis and persist at relapse. By leveraging a data-driven approach, we demonstrate the predictive value of single-cell 'omics' for patient stratification in a translational setting and provide a framework for its application to human cancer.

    View details for PubMedID 29505032

  • Post-selection inference for ℓ1-penalized likelihood models CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE Taylor, J., Tibshirani, R. 2018; 46 (1): 41–61

    View details for DOI 10.1002/cjs.11313

    View details for Web of Science ID 000425130100004

  • Post-Selection Inference for ℓ1-Penalized Likelihood Models. The Canadian journal of statistics = Revue canadienne de statistique Taylor, J., Tibshirani, R. 2018; 46 (1): 41-61

    Abstract

    We present a new method for post-selection inference for ℓ1 (lasso)-penalized likelihood models, including generalized regression models. Our approach generalizes the post-selection framework presented in Lee et al. (2013). The method provides p-values and confidence intervals that are asymptotically valid, conditional on the inherent selection done by the lasso. We present applications of this work to (regularized) logistic regression, Cox's proportional hazards model and the graphical lasso. We do not provide rigorous proofs here of the claimed results, but rather conceptual and theoretical sketches.

    View details for DOI 10.1002/cjs.11313

    View details for PubMedID 30127543

    View details for PubMedCentralID PMC6097808

  • Genomic Feature Selection by Coverage Design Optimization. Journal of applied statistics Reid, S., Newman, A. M., Diehn, M., Alizadeh, A. A., Tibshirani, R. 2018; 45 (14): 2658-2676

    Abstract

    We introduce a novel data reduction technique whereby we select a subset of tiles to "cover" maximally events of interest in large-scale biological datasets (e.g., genetic mutations), while minimizing the number of tiles. A tile is a genomic unit capturing one or more biological events, such as a sequence of base pairs that can be sequenced and observed simultaneously. The goal is to reduce significantly the number of tiles considered to those with areas of dense events in a cohort, thus saving on cost and enhancing interpretability. However, the reduction should not come at the cost of too much information, allowing for sensible statistical analysis after its application. We envisage application of our methods to a variety of high throughput data types, particularly those produced by next generation sequencing (NGS) experiments. The procedure is cast as a convex optimization problem, which is presented, along with methods of its solution. The method is demonstrated on a large dataset of somatic mutations spanning 5000+ patients, each having one of 29 cancer types. Applied to these data, our method dramatically reduces the number of gene locations required for broad coverage of patients and their mutations, giving subject specialists a more easily interpretable snapshot of recurrent mutational profiles in these cancers. The locations identified coincide with previously identified cancer genes. Finally, despite considerable data reduction, we show that our covering designs preserve the cancer discrimination ability of multinomial logistic regression models trained on all of the locations (> 1M).

    View details for DOI 10.1080/02664763.2018.1432577

    View details for PubMedID 30294060

    View details for PubMedCentralID PMC6173524

  • CFH and ARMS2 genetic risk determines progression to neovascular age-related macular degeneration after antioxidant and zinc supplementation PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Vavvas, D. G., Small, K. W., Awh, C. C., Zanke, B. W., Tibshirani, R. J., Kustra, R. 2018; 115 (4): E696–E704

    Abstract

    We evaluated the influence of an antioxidant and zinc nutritional supplement [the Age-Related Eye Disease Study (AREDS) formulation] on delaying or preventing progression to neovascular AMD (NV) in persons with age-related macular degeneration (AMD). AREDS subjects (n = 802) with category 3 or 4 AMD at baseline who had been treated with placebo or the AREDS formulation were evaluated for differences in the risk of progression to NV as a function of complement factor H (CFH) and age-related maculopathy susceptibility 2 (ARMS2) genotype groups. We used published genetic grouping: a two-SNP haplotype risk-calling algorithm to assess CFH, and either the single SNP rs10490924 or 372_815del443ins54 to mark ARMS2 risk. Progression risk was determined using the Cox proportional hazard model. Genetics-treatment interaction on NV risk was assessed using a multiiterative bootstrap validation analysis. We identified strong interaction of genetics with AREDS formulation treatment on the development of NV. Individuals with high CFH and no ARMS2 risk alleles and taking the AREDS formulation had increased progression to NV compared with placebo. Those with low CFH risk and high ARMS2 risk had decreased progression risk. Analysis of CFH and ARMS2 genotype groups from a validation dataset reinforces this conclusion. Bootstrapping analysis confirms the presence of a genetics-treatment interaction and suggests that individual treatment response to the AREDS formulation is largely determined by genetics. The AREDS formulation modifies the risk of progression to NV based on individual genetics. Its use should be based on patient-specific genotype.

    View details for PubMedID 29311295

  • A General Framework for Estimation and Inference From Clusters of Features JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Reid, S., Taylor, J., Tibshirani, R. 2018; 113 (521): 280–93
  • Distinguishing malignant from benign microscopic skin lesions using desorption electrospray ionization mass spectrometry imaging. Proceedings of the National Academy of Sciences of the United States of America Margulis, K., Chiou, A. S., Aasi, S. Z., Tibshirani, R. J., Tang, J. Y., Zare, R. N. 2018

    Abstract

    Detection of microscopic skin lesions presents a considerable challenge in diagnosing early-stage malignancies as well as in residual tumor interrogation after surgical intervention. In this study, we established the capability of desorption electrospray ionization mass spectrometry imaging (DESI-MSI) to distinguish between micrometer-sized tumor aggregates of basal cell carcinoma (BCC), a common skin cancer, and normal human skin. We analyzed 86 human specimens collected during Mohs micrographic surgery for BCC to cross-examine spatial distributions of numerous lipids and metabolites in BCC aggregates versus adjacent skin. Statistical analysis using the least absolute shrinkage and selection operator (Lasso) was employed to categorize each 200-µm-diameter picture element (pixel) of investigated skin tissue map as BCC or normal. Lasso identified 24 molecular ion signals, which are significant for pixel classification. These ion signals included lipids observed at m/z 200-1,200 and Krebs cycle metabolites observed at m/z < 200. Based on these features, Lasso yielded an overall 94.1% diagnostic accuracy pixel by pixel of the skin map compared with histopathological evaluation. We suggest that DESI-MSI/Lasso analysis can be employed as a complementary technique for delineation of microscopic skin tumors.

    View details for PubMedID 29866838

  • Food allergy and omics. The Journal of allergy and clinical immunology Dhondalay, G. K., Rael, E., Acharya, S., Zhang, W., Sampath, V., Galli, S. J., Tibshirani, R., Boyd, S. D., Maecker, H., Nadeau, K. C., Andorf, S. 2018; 141 (1): 20–29

    Abstract

    Food allergy (FA) prevalence has been increasing over the last few decades and is now a global health concern. Current diagnostic methods for FA result in a high number of false-positive results, and the standard of care is either allergen avoidance or use of epinephrine on accidental exposure, with no other treatments currently approved. The increasing prevalence of FA, lack of robust biomarkers, and inadequate treatments warrant further research into the mechanism underlying food allergies. Recent technological advances have made it possible to move beyond traditional biological techniques to more sophisticated high-throughput approaches. These technologies have created the burgeoning field of omics sciences, which permit a more systematic investigation of biological problems. Omics sciences, such as genomics, epigenomics, transcriptomics, proteomics, metabolomics, microbiomics, and exposomics, have enabled the construction of regulatory networks and biological pathway models. Parallel advances in bioinformatics and computational techniques have enabled the integration, analysis, and interpretation of these exponentially growing data sets and open the possibility of personalized or precision medicine for FA.

    View details for PubMedID 29307411

  • DRUG-NEM: Optimizing drug combinations using single-cell perturbation response to account for intratumoral heterogeneity. Proceedings of the National Academy of Sciences of the United States of America Anchang, B., Davis, K. L., Fienberg, H. G., Williamson, B. D., Bendall, S. C., Karacosta, L. G., Tibshirani, R., Nolan, G. P., Plevritis, S. K. 2018; 115 (18): E4294–E4303

    Abstract

    An individual malignant tumor is composed of a heterogeneous collection of single cells with distinct molecular and phenotypic features, a phenomenon termed intratumoral heterogeneity. Intratumoral heterogeneity poses challenges for cancer treatment, motivating the need for combination therapies. Single-cell technologies are now available to guide effective drug combinations by accounting for intratumoral heterogeneity through the analysis of the signaling perturbations of an individual tumor sample screened by a drug panel. In particular, Mass Cytometry Time-of-Flight (CyTOF) is a high-throughput single-cell technology that enables the simultaneous measurements of multiple (>40) intracellular and surface markers at the level of single cells for hundreds of thousands of cells in a sample. We developed a computational framework, entitled Drug Nested Effects Models (DRUG-NEM), to analyze CyTOF single-drug perturbation data for the purpose of individualizing drug combinations. DRUG-NEM optimizes drug combinations by choosing the minimum number of drugs that produce the maximal desired intracellular effects based on nested effects modeling. We demonstrate the performance of DRUG-NEM using single-cell drug perturbation data from tumor cell lines and primary leukemia samples.

    View details for PubMedID 29654148

  • SELECTING THE NUMBER OF PRINCIPAL COMPONENTS: ESTIMATION OF THE TRUE RANK OF A NOISY MATRIX ANNALS OF STATISTICS Choi, Y., Taylor, J., Tibshirani, R. 2017; 45 (6): 2590–2617

    View details for DOI 10.1214/16-AOS1536

    View details for Web of Science ID 000418371600011

  • KLHL6 Is Preferentially Expressed in Germinal Center-Derived B-Cell Lymphomas AMERICAN JOURNAL OF CLINICAL PATHOLOGY Kunder, C. A., Roncador, G., Advani, R. H., Gualco, G., Bacchi, C. E., Sabile, J. M., Lossos, I. S., Nie, K., Tibshirani, R., Green, M. R., Alizadeh, A. A., Natkunam, Y. 2017; 148 (6): 465–76

    Abstract

    KLHL6 is a recently described BTB-Kelch protein with selective expression in lymphoid tissues and is most strongly expressed in germinal center B cells. Using gene expression profiling as well as immunohistochemistry with an anti-KLHL6 monoclonal antibody, we have characterized the expression of this molecule in normal and neoplastic tissues. Protein expression was evaluated in 1,058 hematopoietic neoplasms. Consistent with its discovery as a germinal center marker, KLHL6 was positive mainly in B-cell neoplasms of germinal center derivation, including 95% of follicular lymphomas (106/112). B-cell lymphomas of non-germinal center derivation were generally negative (0/33 chronic lymphocytic leukemias/small lymphocytic lymphomas, 3/49 marginal zone lymphomas, and 2/66 mantle cell lymphomas). In addition to other germinal center markers, including BCL6, CD10, HGAL, and LMO2, KLHL6 immunohistochemistry may prove a useful adjunct in the diagnosis and future classification of B-cell lymphomas.

    View details for PubMedID 29140403

  • Big data modeling to predict platelet usage and minimize wastage in a tertiary care system PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Guan, L., Tian, X., Gombar, S., Zemek, A. J., Krishnan, G., Scott, R., Narasimhan, B., Tibshirani, R. J., Pham, T. D. 2017; 114 (43): 11368–73

    Abstract

    Maintaining a robust blood product supply is an essential requirement to guarantee optimal patient care in modern health care systems. However, daily blood product use is difficult to anticipate. Platelet products are the most variable in daily usage, have short shelf lives, and are also the most expensive to produce, test, and store. Due to the combination of absolute need, uncertain daily demand, and short shelf life, platelet products are frequently wasted due to expiration. Our aim is to build and validate a statistical model to forecast future platelet demand and thereby reduce wastage. We have investigated platelet usage patterns at our institution, and specifically interrogated the relationship between platelet usage and aggregated hospital-wide patient data over a recent consecutive 29-mo period. Using a convex statistical formulation, we have found that platelet usage is highly dependent on weekday/weekend pattern, number of patients with various abnormal complete blood count measurements, and location-specific hospital census data. We incorporated these relationships in a mathematical model to guide collection and ordering strategy. This model minimizes waste due to expiration while avoiding shortages; the number of remaining platelet units at the end of any day stays above 10 in our model during the same period. Compared with historical expiration rates during the same period, our model reduces the expiration rate from 10.5 to 3.2%. Extrapolating our results to the ∼2 million units of platelets transfused annually within the United States, if implemented successfully, our model can potentially save ∼80 million dollars in health care costs.

    View details for PubMedID 29073058

  • Post-selection point and interval estimation of signal sizes in Gaussian samples CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE Reid, S., Taylor, J., Tibshirani, R. 2017; 45 (2): 128-148

    View details for DOI 10.1002/cjs.11320

    View details for Web of Science ID 000400027400001

  • Metabolic Markers and Statistical Prediction of Serous Ovarian Cancer Aggressiveness by Ambient Ionization Mass Spectrometry Imaging. Cancer research Sans, M., Gharpure, K., Tibshirani, R., Zhang, J., Liang, L., Liu, J., Young, J. H., Dood, R. L., Sood, A. K., Eberlin, L. S. 2017; 77 (11): 2903-2913

    Abstract

    Ovarian high-grade serous carcinoma (HGSC) results in the highest mortality among gynecological cancers, developing rapidly and aggressively. Dissimilarly, serous borderline ovarian tumors (BOT) can progress into low-grade serous carcinomas and have relatively indolent clinical behavior. The underlying biological differences between HGSC and BOT call for accurate diagnostic methodologies and tailored treatment options, and identification of molecular markers of aggressiveness could provide valuable biochemical insights and improve disease management. Here, we used desorption electrospray ionization (DESI) mass spectrometry (MS) to image and chemically characterize the metabolic profiles of HGSC, BOT, and normal ovarian tissue samples. DESI-MS imaging enabled clear visualization of fine papillary branches in serous BOT and allowed for characterization of spatial features of tumor heterogeneity such as adjacent necrosis and stroma in HGSC. Predictive markers of cancer aggressiveness were identified, including various free fatty acids, metabolites, and complex lipids such as ceramides, glycerophosphoglycerols, cardiolipins, and glycerophosphocholines. Classification models built from a total of 89,826 individual pixels, acquired in positive and negative ion modes from 78 different tissue samples, enabled diagnosis and prediction of HGSC and all tumor samples in comparison with normal tissues, with overall agreements of 96.4% and 96.2%, respectively. HGSC and BOT discrimination was achieved with an overall accuracy of 93.0%. Interestingly, our classification model allowed identification of three BOT samples presenting unusual histologic features that could be associated with the development of low-grade carcinomas. Our results suggest DESI-MS as a powerful approach for rapid serous ovarian cancer diagnosis based on altered metabolic signatures. Cancer Res; 77(11); 2903-13. ©2017 AACR.

    View details for DOI 10.1158/0008-5472.CAN-16-3044

    View details for PubMedID 28416487

  • Chemical Space Mimicry for Drug Discovery JOURNAL OF CHEMICAL INFORMATION AND MODELING Yuan, W., Jiang, D., Nambiar, D. K., Liew, L. P., Hay, M. P., Bloomstein, J., Lu, P., Turner, B., Le, Q., Tibshirani, R., Khatri, P., Moloney, M. G., Koong, A. C. 2017; 57 (4): 875-882

    Abstract

    We describe a new library generation method, Machine-based Identification of Molecules Inside Characterized Space (MIMICS), that generates sets of molecules inspired by a text-based input. MIMICS-generated libraries were found to preserve distributions of properties while simultaneously increasing structural diversity. Newly identified MIMICS-generated compounds were found to be bioactive as inhibitors of specific components of the unfolded protein response (UPR) and the VEGFR2 pathway in cell-based assays, thus confirming the applicability of this methodology toward drug design applications. Wider application of MIMICS could facilitate the efficient utilization of chemical space.

    View details for DOI 10.1021/acs.jcim.6b00754

    View details for Web of Science ID 000400204900023

    View details for PubMedID 28257191

  • Diagnosis of prostate cancer by desorption electrospray ionization mass spectrometric imaging of small metabolites and lipids PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Banerjee, S., Zare, R. N., Tibshirani, R. J., Kunder, C. A., Nolley, R., Fan, R., Brooks, J. D., Sonn, G. A. 2017; 114 (13): 3334-3339

    Abstract

    Accurate identification of prostate cancer in frozen sections at the time of surgery can be challenging, limiting the surgeon's ability to best determine resection margins during prostatectomy. We performed desorption electrospray ionization mass spectrometry imaging (DESI-MSI) on 54 banked human cancerous and normal prostate tissue specimens to investigate the spatial distribution of a wide variety of small metabolites, carbohydrates, and lipids. In contrast to several previous studies, our method included Krebs cycle intermediates (m/z <200), which we found to be highly informative in distinguishing cancer from benign tissue. Malignant prostate cells showed marked metabolic derangements compared with their benign counterparts. Using the "Least absolute shrinkage and selection operator" (Lasso), we analyzed all metabolites from the DESI-MS data and identified parsimonious sets of metabolic profiles for distinguishing between cancer and normal tissue. In an independent set of samples, we could use these models to classify prostate cancer from benign specimens with nearly 90% accuracy per patient. Based on previous work in prostate cancer showing that glucose levels are high while citrate is low, we found that measurement of the glucose/citrate ion signal ratio accurately predicted cancer when this ratio exceeds 1.0 and normal prostate when the ratio is less than 0.5. After brief tissue preparation, the glucose/citrate ratio can be recorded on a tissue sample in 1 min or less, which is in sharp contrast to the 20 min or more required by histopathological examination of frozen tissue specimens.

    View details for DOI 10.1073/pnas.1700677114

    View details for Web of Science ID 000397607300049

    View details for PubMedID 28292895

    View details for PubMedCentralID PMC5380053
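
    The glucose/citrate rule stated in the abstract above can be sketched as a small function. This is only an illustration of the two thresholds the abstract reports (cancer above 1.0, normal below 0.5); the function name, the argument names, and the "indeterminate" label for ratios between the thresholds are assumptions, not part of the published method.

```python
def classify_by_glucose_citrate(glucose_signal, citrate_signal):
    """Label a tissue sample from its glucose/citrate ion signal ratio.

    Thresholds follow the abstract: ratio > 1.0 -> cancer, ratio < 0.5 -> normal.
    The "indeterminate" label for the range in between is an assumption;
    the abstract does not say how such samples are handled.
    """
    if citrate_signal <= 0:
        raise ValueError("citrate_signal must be positive to form a ratio")
    ratio = glucose_signal / citrate_signal
    if ratio > 1.0:
        return "cancer"
    if ratio < 0.5:
        return "normal"
    return "indeterminate"


if __name__ == "__main__":
    # Made-up signal intensities, for illustration only.
    print(classify_by_glucose_citrate(2.0, 1.0))
    print(classify_by_glucose_citrate(0.4, 1.0))
```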

  • Landscape of monoallelic DNA accessibility in mouse embryonic stem cells and neural progenitor cells. Nature genetics Xu, J., Carter, A. C., Gendrel, A., Attia, M., Loftus, J., Greenleaf, W. J., Tibshirani, R., Heard, E., Chang, H. Y. 2017; 49 (3): 377-386

    Abstract

    We developed an allele-specific assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) to genotype and profile active regulatory DNA across the genome. Using a mouse hybrid F1 system, we found that monoallelic DNA accessibility across autosomes was pervasive, developmentally programmed and composed of several patterns. Genetically determined accessibility was enriched at distal enhancers, but random monoallelically accessible (RAMA) elements were enriched at promoters and may act as gatekeepers of monoallelic mRNA expression. Allelic choice at RAMA elements was stable across cell generations and bookmarked through mitosis. RAMA elements in neural progenitor cells were biallelically accessible in embryonic stem cells but premarked with bivalent histone modifications; one allele was silenced during differentiation. Quantitative analysis indicated that allelic choice at the majority of RAMA elements is consistent with a stochastic process; however, up to 30% of RAMA elements may deviate from the expected pattern, suggesting a regulated or counting mechanism.

    View details for DOI 10.1038/ng.3769

    View details for PubMedID 28112738

  • Long-term course of patients with primary ocular adnexal MALT lymphoma: a large single-institution cohort study BLOOD Desai, A., Joag, M. G., Lekakis, L., Chapman, J. R., Vega, F., Tibshirani, R., Tse, D., Markoe, A., Lossos, I. S. 2017; 129 (3): 324-332

    Abstract

    While Primary Ocular Adnexal MALT Lymphoma (POAML) is the most common orbital tumor, there are large gaps in knowledge of its natural history. We conducted a retrospective analysis of the largest reported cohort, consisting of 182 patients with POAML, diagnosed or treated at our institution to analyze long-term outcome, response to treatment, incidence and localization of relapse and transformation. The majority of patients (80%) presented with stage I disease. Overall, 84% of treated patients achieved a complete response after first-line therapy. In patients with stage I disease treated with radiation therapy (RT), doses ≥ 30.6Gy were associated with significantly better complete response rate (p=0.04) and progression free survival (PFS) at 5 and 10-year (p<0.0001). Median overall survival and PFS for all patients were 250 months (95% CI: 222 - upper limit not reached) and 134 months (95% CI: 87 - 198), respectively. Kaplan-Meier estimates for the PFS at 1, 5, and 10 years were 91.5% (95% CI: 86.1% - 94.9%), 68.5% (95% CI: 60.4% - 75.6%), and 50.9% (95% CI: 40.5% - 61.6%), respectively. In univariate analysis, age > 60 years, radiation dose, bilateral ocular involvement at presentation and advanced stage were significantly correlated with shorter PFS (p=0.006, p=0.0001, p=0.002 and p=0.0001, respectively). Multivariate analysis showed that age >60 years (HR= 2.44) and RT<30.6Gy (HR=4.17) were the only factors correlated with shorter PFS (p=0.01 and p=0.0003, respectively). We demonstrate that POAMLs harbor a persistent and ongoing risk for relapses, including in central nervous system, and transformation to aggressive lymphoma (4%), requiring long-term follow up.

    View details for DOI 10.1182/blood-2016-05-714584

    View details for Web of Science ID 000396529800010

  • An immune clock of human pregnancy. Science immunology Aghaeepour, N., Ganio, E. A., Mcilwain, D., Tsai, A. S., Tingle, M., Van Gassen, S., Gaudilliere, D. K., Baca, Q., McNeil, L., Okada, R., Ghaemi, M. S., Furman, D., Wong, R. J., Winn, V. D., Druzin, M. L., El-Sayed, Y. Y., Quaintance, C., Gibbs, R., Darmstadt, G. L., Shaw, G. M., Stevenson, D. K., Tibshirani, R., Nolan, G. P., Lewis, D. B., Angst, M. S., Gaudilliere, B. 2017; 2 (15)

    Abstract

    The maintenance of pregnancy relies on finely tuned immune adaptations. We demonstrate that these adaptations are precisely timed, reflecting an immune clock of pregnancy in women delivering at term. Using mass cytometry, the abundance and functional responses of all major immune cell subsets were quantified in serial blood samples collected throughout pregnancy. Cell signaling-based Elastic Net, a regularized regression method adapted from the elastic net algorithm, was developed to infer and prospectively validate a predictive model of interrelated immune events that accurately captures the chronology of pregnancy. Model components highlighted existing knowledge and revealed previously unreported biology, including a critical role for the interleukin-2-dependent STAT5ab signaling pathway in modulating T cell function during pregnancy. These findings unravel the precise timing of immunological events occurring during a term pregnancy and provide the analytical framework to identify immunological deviations implicated in pregnancy-related pathologies.

    View details for PubMedID 28864494

  • A simple method for analyzing matched designs with double controls: McNemar's test can be extended. Journal of clinical epidemiology Redelmeier, D. A., Tibshirani, R. J. 2017; 81: 51-55.e2

    Abstract

    To introduce a new analytic approach for matched studies, where exactly two controls are linked to each case (double controls rather than solitary controls). The intent is to extend McNemar's test for one-to-two matching (instead of one-to-one matching) when evaluating binary predictors and outcomes. We review McNemar's approach for analyzing matched data, demonstrate the Mantel-Haenszel approach for integrating two overlapping McNemar's estimates, review conditional logistic regression as an alternative analytic approach, and introduce a new method that yields a visual display and easy verification. We illustrate the new approach with real data testing the association between overcast weather and the risk of a life-threatening traffic crash (n = 6,962). We show that results from the new approach agree closely with conditional logistic regression and are sufficiently simple as to be computed on a handheld calculator. We further validate the approach by conducting simulations when a positive association was predefined and when a null association was predefined. The new approach provides a feasible, simple, and efficient method for analyzing matched designs with double controls.

    View details for DOI 10.1016/j.jclinepi.2016.08.006

    View details for PubMedID 27565976
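
    The abstract above takes McNemar's one-to-one test as its starting point. As a reminder of that classic statistic only (not the double-control extension, whose details the abstract does not give), a minimal sketch follows; the function name and the example counts are hypothetical.

```python
import math


def mcnemar_test(b, c):
    """Classic McNemar test for one-to-one matched pairs (no continuity correction).

    b: number of discordant pairs where only the case is exposed
    c: number of discordant pairs where only the control is exposed
    Returns (chi-square statistic with 1 degree of freedom, two-sided p-value).
    """
    if b + c == 0:
        raise ValueError("at least one discordant pair is required")
    statistic = (b - c) ** 2 / (b + c)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical discordant-pair counts, for illustration only.
    stat, p = mcnemar_test(15, 5)
    print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

    Only discordant pairs enter the statistic, which is why the calculation is simple enough to do by hand.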

  • Long term course of patients with primary ocular adnexal malt lymphoma: a large single institution cohort study. Blood Desai, A., Joag, M. G., Lekakis, L., Chapman, J. R., Vega, F., Tibshirani, R., Tse, D., Markoe, A., Lossos, I. S. 2016

    View details for PubMedID 27789481

  • High-dimensional regression adjustments in randomized experiments. Proceedings of the National Academy of Sciences of the United States of America Wager, S., Du, W., Taylor, J., Tibshirani, R. J. 2016

    Abstract

    We study the problem of treatment effect estimation in randomized experiments with high-dimensional covariate information and show that essentially any risk-consistent regression adjustment can be used to obtain efficient estimates of the average treatment effect. Our results considerably extend the range of settings where high-dimensional regression adjustments are guaranteed to provide valid inference about the population average treatment effect. We then propose cross-estimation, a simple method for obtaining finite-sample-unbiased treatment effect estimates that leverages high-dimensional regression adjustments. Our method can be used when the regression model is estimated using the lasso, the elastic net, subset selection, etc. Finally, we extend our analysis to allow for adaptive specification search via cross-validation and flexible nonparametric regression adjustments with machine-learning methods such as random forests or neural networks.

    View details for PubMedID 27791165

  • An Ordered Lasso and Sparse Time-Lagged Regression. Technometrics : a journal of statistics for the physical, chemical, and engineering sciences Tibshirani, R., Suo, X. 2016; 58 (4): 415-423

    Abstract

    We consider regression scenarios where it is natural to impose an order constraint on the coefficients. We propose an order-constrained version of ℓ1-regularized regression (Lasso) for this problem, and show how to solve it efficiently using the well-known Pool Adjacent Violators Algorithm as its proximal operator. The main application of this idea is to time-lagged regression, where we predict an outcome at time t from features at the previous K time points. In this setting it is natural to assume that the coefficients decay as we move farther away from t, and hence the order constraint is reasonable. Potential application areas include financial time series and prediction of dynamic patient outcomes based on clinical measurements. We illustrate this idea on real and simulated data.

    View details for DOI 10.1080/00401706.2015.1079245

    View details for PubMedID 36909149

    View details for PubMedCentralID PMC10004099
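
    The Pool Adjacent Violators Algorithm named in the abstract above can be sketched in a few lines. This is a generic, equal-weight version that fits a non-increasing sequence by least squares; it is not the authors' implementation, and it leaves out the ℓ1 penalty and any other constraints of the ordered lasso itself. The function name is hypothetical.

```python
def pava_nonincreasing(y):
    """Least-squares fit of a non-increasing sequence to y (equal weights)."""
    # Each block stores [sum of values, count]; its fitted value is sum / count.
    blocks = []
    for value in y:
        blocks.append([float(value), 1])
        # A violation of the non-increasing order is a later block whose mean
        # exceeds the mean of the block before it: pool the two and re-check.
        while (
            len(blocks) > 1
            and blocks[-1][0] / blocks[-1][1] > blocks[-2][0] / blocks[-2][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Expand each pooled block back to one fitted value per input position.
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


if __name__ == "__main__":
    # The pair (1, 2) violates the ordering and is pooled to its mean.
    print(pava_nonincreasing([3, 1, 2, 0]))
```

    In the time-lagged setting described in the abstract, a fit of this shape matches the assumption that coefficients decay with increasing lag.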

  • Cardiolipins Are Biomarkers of Mitochondria-Rich Thyroid Oncocytic Tumors. Cancer research Zhang, J., Yu, W., Ryu, S. W., Lin, J., Buentello, G., Tibshirani, R., Suliburk, J., Eberlin, L. S. 2016

    Abstract

    Oncocytic tumors are characterized by an excessive eosinophilic, granular cytoplasm due to aberrant accumulation of mitochondria. Mutations in mitochondrial DNA occur in oncocytic thyroid tumors, but there is no information about their lipid composition, which might reveal candidate theranostic molecules. Here, we used desorption electrospray ionization mass spectrometry (DESI-MS) to image and chemically characterize the lipid composition of oncocytic thyroid tumors, as compared with nononcocytic thyroid tumors and normal thyroid samples. We identified a novel molecular signature of oncocytic tumors characterized by an abnormally high abundance and chemical diversity of cardiolipins (CL), including many oxidized species. DESI-MS imaging and IHC experiments confirmed that the spatial distribution of CLs overlapped with regions of accumulation of mitochondria-rich oncocytic cells. Fluorescent imaging and mitochondrial isolation showed that both mitochondrial accumulation and alteration in CL composition of mitochondria occurred in oncocytic tumor cells, thus contributing to the aberrant molecular signatures detected. A total of 219 molecular ions, including CLs, other glycerophospholipids, fatty acids, and metabolites, were found at increased or decreased abundance in oncocytic, nononcocytic, or normal thyroid tissues. Our findings suggest new candidate targets for clinical and therapeutic use against oncocytic tumors. Cancer Res; 76(22); 1-10. ©2016 AACR.

    View details for PubMedID 27659048

  • Data Shared Lasso: A Novel Tool to Discover Uplift. Computational statistics & data analysis Gross, S. M., Tibshirani, R. 2016; 101: 226-235

    Abstract

    A model is presented for the supervised learning problem where the observations come from a fixed number of pre-specified groups, and the regression coefficients may vary sparsely between groups. The model spans the continuum between individual models for each group and one model for all groups. The resulting algorithm is designed with a high dimensional framework in mind. The approach is applied to a sentiment analysis dataset to show its efficacy and interpretability. One particularly useful application is for finding sub-populations in a randomized trial for which an intervention (treatment) is beneficial, often called the uplift problem. Some new concepts are introduced that are useful for uplift analysis. The value is demonstrated in an application to a real world credit card promotion dataset. In this example, although sending the promotion has a very small average effect, by targeting a particular subgroup with the promotion one can obtain a 15% increase in the proportion of people who purchase the new credit card.

    View details for DOI 10.1016/j.csda.2016.02.015

    View details for PubMedID 29056802

    View details for PubMedCentralID PMC5650251

  • Pancreatic Cancer Surgical Resection Margins: Molecular Assessment by Mass Spectrometry Imaging. PLoS medicine Eberlin, L. S., Margulis, K., Planell-Mendez, I., Zare, R. N., Tibshirani, R., Longacre, T. A., Jalali, M., Norton, J. A., Poultsides, G. A. 2016; 13 (8)

    Abstract

    Surgical resection with microscopically negative margins remains the main curative option for pancreatic cancer; however, in practice intraoperative delineation of resection margins is challenging. Ambient mass spectrometry imaging has emerged as a powerful technique for chemical imaging and real-time diagnosis of tissue samples. We applied an approach combining desorption electrospray ionization mass spectrometry imaging (DESI-MSI) with the least absolute shrinkage and selection operator (Lasso) statistical method to diagnose pancreatic tissue sections and prospectively evaluate surgical resection margins from pancreatic cancer surgery. Our methodology was developed and tested using 63 banked pancreatic cancer samples and 65 samples (tumor and specimen margins) collected prospectively during 32 pancreatectomies from February 27, 2013, to January 16, 2015. In total, mass spectra for 254,235 individual pixels were evaluated. When cross-validation was employed in the training set of samples, 98.1% agreement with histopathology was obtained. Using an independent set of samples, 98.6% agreement was achieved. We used a statistical approach to evaluate 177,727 mass spectra from samples with complex, mixed histology, achieving an agreement of 81%. The developed method showed agreement with frozen section evaluation of specimen margins in 24 of 32 surgical cases prospectively evaluated. In the remaining eight patients, margins were found to be positive by DESI-MSI/Lasso, but negative by frozen section analysis. The median overall survival after resection was only 10 mo for these eight patients as opposed to 26 mo for patients with negative margins by both techniques. This observation suggests that our method (as opposed to the standard method to date) was able to detect tumor involvement at the margin in patients who developed early recurrence. Nonetheless, a larger cohort of samples is needed to validate the findings described in this study. Careful evaluation of the long-term benefits to patients of the use of DESI-MSI for surgical margin evaluation is also needed to determine its value in clinical practice. Our findings provide evidence that the molecular information obtained by DESI-MSI/Lasso from pancreatic tissue samples has the potential to transform the evaluation of surgical specimens. With further development, we believe the described methodology could be routinely used for intraoperative surgical margin assessment of pancreatic cancer.

    View details for DOI 10.1371/journal.pmed.1002108

    View details for PubMedID 27575375

  • Pathophysiological significance and therapeutic targeting of germinal center kinase in diffuse large B-cell lymphoma. Blood Matthews, J. M., Bhatt, S., Patricelli, M. P., Nomanbhoy, T. K., Jiang, X., Natkunam, Y., Gentles, A. J., Martinez, E., Zhu, D., Chapman, J. R., Cortizas, E., Shyam, R., Chinichian, S., Advani, R., Tan, L., Zhang, J., Choi, H. G., Tibshirani, R., Buhrlage, S. J., Gratzinger, D., Verdun, R., Gray, N. S., Lossos, I. S. 2016; 128 (2): 239-248

    Abstract

    Diffuse large B-cell lymphoma (DLBCL) is the most common subtype of non-Hodgkin lymphoma (NHL), yet 40-50% of patients will eventually succumb to their disease, demonstrating a pressing need for novel therapeutic options. Gene expression profiling has identified messenger RNAs that lead to transformation, but critical events transforming cells are normally executed by kinases. Therefore, we hypothesized that previously unrecognized kinases may contribute to DLBCL pathogenesis. We performed the first comprehensive analysis of global kinase activity in DLBCL, to identify novel therapeutic targets, and discovered that Germinal Center Kinase (GCK) was extensively activated. GCK RNA interference and small molecule inhibition induced cell cycle arrest and apoptosis in DLBCL cell lines and primary tumors in vitro and decreased the tumor growth rate in vivo, resulting in a significantly extended lifespan of mice bearing DLBCL xenografts. GCK expression was also linked to adverse clinical outcome in a cohort of 151 primary DLBCL patients. These studies demonstrate, for the first time, that GCK is a molecular therapeutic target in DLBCL tumors and that inhibiting GCK may significantly extend DLBCL patient survival. Since the majority of DLBCL tumors (~80%) exhibit activation of GCK, this therapy may be applicable to most patients.

    View details for DOI 10.1182/blood-2016-02-696856

    View details for PubMedID 27151888

  • Exact Post-Selection Inference for Sequential Regression Procedures JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Tibshirani, R. J., Taylor, J., Lockhart, R., Tibshirani, R. 2016; 111 (514): 600-614
  • INFERENCE IN ADAPTIVE REGRESSION VIA THE KAC-RICE FORMULA ANNALS OF STATISTICS Taylor, J. E., Loftus, J. R., Tibshirani, R. J. 2016; 44 (2): 743-770

    View details for DOI 10.1214/15-AOS1386

    View details for Web of Science ID 000372594300011

  • Sparse regression and marginal testing using cluster prototypes. Biostatistics Reid, S., Tibshirani, R. 2016; 17 (2): 364-376

    Abstract

    We propose a new approach for sparse regression and marginal testing, for data with correlated features. Our procedure first clusters the features, and then chooses as the cluster prototype the most informative feature in that cluster. Then we apply either sparse regression (lasso) or marginal significance testing to these prototypes. While this kind of strategy is not entirely new, a key feature of our proposal is its use of the post-selection inference theory of Taylor and others (2014, Exact post-selection inference for forward stepwise and least angle regression, Preprint, arXiv:1401.3889) and Lee and others (2014, Exact post-selection inference with the lasso, Preprint, arXiv:1311.6238v5) to compute exact p-values and confidence intervals that properly account for the selection of prototypes. We also apply the recent "knockoff" idea of Barber and Candès (2014, Controlling the false discovery rate via knockoffs, Preprint, arXiv:1404.5609) to provide exact finite sample control of the FDR of our regression procedure. We illustrate our proposals on both real and simulated data.

    View details for DOI 10.1093/biostatistics/kxv049

    View details for PubMedID 26614384

  • Successful immunotherapy induces previously unidentified allergen-specific CD4+ T-cell subsets. Proceedings of the National Academy of Sciences of the United States of America Ryan, J. F., Hovde, R., Glanville, J., Lyu, S., Ji, X., Gupta, S., Tibshirani, R. J., Jay, D. C., Boyd, S. D., Chinthrajah, R. S., Davis, M. M., Galli, S. J., Maecker, H. T., Nadeau, K. C. 2016; 113 (9): E1286-95

    Abstract

    Allergen immunotherapy can desensitize even subjects with potentially lethal allergies, but the changes induced in T cells that underpin successful immunotherapy remain poorly understood. In a cohort of peanut-allergic participants, we used allergen-specific T-cell sorting and single-cell gene expression to trace the transcriptional "roadmap" of individual CD4+ T cells throughout immunotherapy. We found that successful immunotherapy induces allergen-specific CD4+ T cells to expand and shift toward an "anergic" Th2 T-cell phenotype largely absent in both pretreatment participants and healthy controls. These findings show that sustained success, even after immunotherapy is withdrawn, is associated with the induction, expansion, and maintenance of immunotherapy-specific memory and naive T-cell phenotypes as early as 3 mo into immunotherapy. These results suggest an approach for immune monitoring participants undergoing immunotherapy to predict the success of future treatment and could have implications for immunotherapy targets in other diseases like cancer, autoimmune disease, and transplantation.

    View details for DOI 10.1073/pnas.1520180113

    View details for PubMedID 26811452

  • Sequential selection procedures and false discovery rate control JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY G'Sell, M. G., Wager, S., Chouldechova, A., Tibshirani, R. 2016; 78 (2): 423-444

    View details for DOI 10.1111/rssb.12122

    View details for Web of Science ID 000369136600005

  • A STUDY OF ERROR VARIANCE ESTIMATION IN LASSO REGRESSION STATISTICA SINICA Reid, S., Tibshirani, R., Friedman, J. 2016; 26 (1): 35-67
  • CUSTOMIZED TRAINING WITH AN APPLICATION TO MASS SPECTROMETRIC IMAGING OF CANCER TISSUE. The annals of applied statistics Powers, S., Hastie, T., Tibshirani, R. 2015; 9 (4): 1709-1725

    Abstract

    We introduce a simple, interpretable strategy for making predictions on test data when the features of the test data are available at the time of model fitting. Our proposal, customized training, clusters the data to find training points close to each test point and then fits an ℓ1-regularized model (lasso) separately in each training cluster. This approach combines the local adaptivity of k-nearest neighbors with the interpretability of the lasso. Although we use the lasso for the model fitting, any supervised learning method can be applied to the customized training sets. We apply the method to a mass-spectrometric imaging data set from an ongoing collaboration in gastric cancer detection, which demonstrates the power and interpretability of the technique. Our idea is simple but potentially useful in situations where the data have some underlying structure.

    View details for DOI 10.1214/15-AOAS866

    View details for Web of Science ID 000370445600001

    View details for PubMedID 30370000

    View details for PubMedCentralID PMC6200412

  • A Permutation Approach to Testing Interactions for Binary Response by Comparing Correlations Between Classes JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Simon, N., Tibshirani, R. 2015; 110 (512): 1707-1716
  • A component lasso CANADIAN JOURNAL OF STATISTICS-REVUE CANADIENNE DE STATISTIQUE Hussami, N., Tibshirani, R. J. 2015; 43 (4): 624-646

    View details for DOI 10.1002/cjs.11267

    View details for Web of Science ID 000367667700008

  • The Radiogenomic Risk Score: Construction of a Prognostic Quantitative, Noninvasive Image-based Molecular Assay for Renal Cell Carcinoma RADIOLOGY Jamshidi, N., Jonasch, E., Zapala, M., Korn, R. L., Aganovic, L., Zhao, H., Sitaram, R. T., Tibshirani, R. J., Banerjee, S., Brooks, J. D., Ljungberg, B., Kuo, M. D. 2015; 277 (1): 114-123

    Abstract

    Purpose To evaluate the feasibility of constructing radiogenomic-based surrogates of molecular assays (SOMAs) in patients with clear-cell renal cell carcinoma (CCRCC) by using data extracted from a single computed tomographic (CT) image. Materials and Methods In this institutional review board-approved study, gene expression profile data and contrast material-enhanced CT images from 70 patients with CCRCC in a training set were independently assessed by two radiologists for a set of predefined imaging features. A SOMA for a previously validated CCRCC-specific supervised principal component (SPC) risk score prognostic gene signature was constructed and termed the radiogenomic risk score (RRS). It uses the microarray data and a 28-trait image array to evaluate each CT image with multiple regression of gene expression analysis. The predictive power of the RRS SOMA was then prospectively validated in an independent dataset to confirm its relationship to the SPC gene signature (n = 70) and determination of patient outcome (n = 77). Data were analyzed by using multivariate linear regression-based methods and Cox regression modeling, and significance was assessed with receiver operator characteristic curves and Kaplan-Meier survival analysis. Results Our SOMA faithfully represents the tissue-based molecular assay it models. The RRS scaled with the SPC gene signature (R = 0.57, P < .001, classification accuracy 70.1%, P < .001) and predicted disease-specific survival (log rank P < .001). Independent validation confirmed the relationship between the RRS and the SPC gene signature (R = 0.45, P < .001, classification accuracy 68.6%, P < .001) and disease-specific survival (log-rank P < .001) and that it was independent of stage, grade, and performance status (multivariate Cox model P < .05, log-rank P < .001). Conclusion A SOMA for the CCRCC-specific SPC prognostic gene signature that is predictive of disease-specific survival and independent of stage was constructed and validated, confirming that SOMA construction is feasible. © RSNA, 2015 Online supplemental material is available for this article. An earlier incorrect version of this article appeared online. This article was corrected on August 24, 2015.

    View details for DOI 10.1148/radiol.2015150800

    View details for Web of Science ID 000368434000014

    View details for PubMedID 26402495

  • Fibromyalgia and the Risk of a Subsequent Motor Vehicle Crash. The Journal of rheumatology Redelmeier, D. A., Zung, J. D., Thiruchelvam, D., Tibshirani, R. J. 2015; 42 (8): 1502-10

    Abstract

    Motor vehicle crashes are a widespread contributor to mortality and morbidity, sometimes related to medically unfit motorists. We tested whether patients diagnosed with fibromyalgia (FM) have an increased risk of a subsequent serious motor vehicle crash. We conducted a population-based self-matched longitudinal cohort analysis to estimate the incidence rate ratio of crashes among patients diagnosed with FM relative to the population norm in Ontario, Canada. We included adults diagnosed from April 1, 2006, to March 31, 2012, excluding individuals younger than 18 years, living outside Ontario, lacking valid identifiers, or having only a single visit for the diagnosis. The primary outcome was an emergency department visit as a driver involved in a motor vehicle crash. The patients (n = 137,631) accounted for 738 crashes during the first year of follow-up after diagnosis, equal to an incidence rate ratio of 2.44 compared with the population norm (95% CI 2.27-2.63, p < 0.001). The crash rate was more than twice the population norm for those with a new or a persistent diagnosis. The increased risk included patients with diverse characteristics, approached the rate observed among other patients diagnosed with alcoholism, and was mitigated among those who received dedicated FM care or a physician warning for driving safety. A diagnosis of FM is associated with an increased risk of a subsequent motor vehicle crash that might justify medical interventions for traffic safety.

    View details for DOI 10.3899/jrheum.141315

    View details for PubMedID 25979716

  • Statistical learning and selective inference PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Taylor, J., Tibshirani, R. J. 2015; 112 (25): 7629-7634

    Abstract

    We describe the problem of "selective inference." This addresses the following challenge: Having mined a set of data to find potential associations, how do we properly assess the strength of these associations? The fact that we have "cherry-picked" (searched for the strongest associations) means that we must set a higher bar for declaring significant the associations that we see. This challenge becomes more important in the era of big data and complex statistical modeling. The cherry tree (dataset) can be very large and the tools for cherry picking (statistical learning methods) are now very sophisticated. We describe some recent new developments in selective inference and illustrate their use in forward stepwise regression, the lasso, and principal components analysis.

    View details for DOI 10.1073/pnas.1507583112

    View details for Web of Science ID 000356731300047

    View details for PubMedID 26100887

    View details for PubMedCentralID PMC4485109

  • Collaborative regression BIOSTATISTICS Gross, S. M., Tibshirani, R. 2015; 16 (2): 326-338

    Abstract

    We consider the scenario where one observes an outcome variable and sets of features from multiple assays, all measured on the same set of samples. One approach that has been proposed for dealing with this type of data is "sparse multiple canonical correlation analysis" (sparse mCCA). All of the current sparse mCCA techniques are biconvex and thus have no guarantees about reaching a global optimum. We propose a method for performing sparse supervised canonical correlation analysis (sparse sCCA), a specific case of sparse mCCA when one of the datasets is a vector. Our proposal for sparse sCCA is convex and thus does not face the same difficulties as the other methods. We derive efficient algorithms for this problem that can be implemented with off-the-shelf solvers, and illustrate their use on simulated and real data.

    View details for DOI 10.1093/biostatistics/kxu047

    View details for Web of Science ID 000354644900009

    View details for PubMedID 25406332

    View details for PubMedCentralID PMC4441100

  • CONVEX HIERARCHICAL TESTING OF INTERACTIONS ANNALS OF APPLIED STATISTICS Bien, J., Simon, N., Tibshirani, R. 2015; 9 (1): 27-42

    View details for DOI 10.1214/14-AOAS758

    View details for Web of Science ID 000358354400002

  • Molecular subtyping for clinically defined breast cancer subgroups BREAST CANCER RESEARCH Zhao, X., Rodland, E. A., Tibshirani, R., Plevritis, S. 2015; 17

    Abstract

    Breast cancer is commonly classified into intrinsic molecular subtypes. Standard gene centering is routinely done prior to molecular subtyping, but it can produce inaccurate classifications when the distribution of clinicopathological characteristics in the study cohort differs from that of the training cohort used to derive the classifier. We propose a subgroup-specific gene-centering method to perform molecular subtyping on a study cohort that has a skewed distribution of clinicopathological characteristics relative to the training cohort. On such a study cohort, we center each gene on a specified percentile, where the percentile is determined from a subgroup of the training cohort with clinicopathological characteristics similar to the study cohort. We demonstrate our method using the PAM50 classifier and its associated University of North Carolina (UNC) training cohort. We considered study cohorts with skewed clinicopathological characteristics, including subgroups composed of a single prototypic subtype of the UNC-PAM50 training cohort (n = 139), an external estrogen receptor (ER)-positive cohort (n = 48) and an external triple-negative cohort (n = 77). Subgroup-specific gene centering improved prediction performance with accuracies between 77% and 100%, compared to accuracies between 17% and 33% from standard gene centering, when applied to the prototypic tumor subsets of the PAM50 training cohort. It reduced classification error rates on the ER-positive (11% versus 28%; P = 0.0389), the ER-negative (5% versus 41%; P < 0.0001) and the triple-negative (11% versus 56%; P = 0.1336) subgroups of the PAM50 training cohort. In addition, it produced higher accuracy for subtyping study cohorts composed of varying proportions of ER-positive versus ER-negative cases. Finally, it increased the percentage of assigned luminal subtypes on the external ER-positive cohort and basal-like subtype on the external triple-negative cohort. Gene centering is often necessary to accurately apply a molecular subtype classifier. Compared with standard gene centering, our proposed subgroup-specific gene centering produced more accurate molecular subtype assignments in a study cohort with skewed clinicopathological characteristics relative to the training cohort.

    View details for DOI 10.1186/s13058-015-0520-4

    View details for Web of Science ID 000351829500001

    View details for PubMedID 25849221

    View details for PubMedCentralID PMC4365540

  • Pancancer analysis of DNA methylation-driven genes using MethylMix GENOME BIOLOGY Gevaert, O., Tibshirani, R., Plevritis, S. K. 2015; 16: 17

    Abstract

    Aberrant DNA methylation is an important mechanism that contributes to oncogenesis. Yet, few algorithms exist that exploit this vast dataset to identify hypo- and hypermethylated genes in cancer. We developed a novel computational algorithm called MethylMix to identify differentially methylated genes that are also predictive of transcription. We apply MethylMix to 12 individual cancer sites, and additionally combine all cancer sites in a pancancer analysis. We discover pancancer hypo- and hypermethylated genes and identify novel methylation-driven subgroups with clinical implications. MethylMix analysis on combined cancer sites reveals 10 pancancer clusters reflecting new similarities across malignantly transformed tissues.

    View details for DOI 10.1186/s13059-014-0579-8

    View details for Web of Science ID 000351817300001

    View details for PubMedID 25631659

    View details for PubMedCentralID PMC4365533

  • A Simple Method for Estimating Interactions Between a Treatment and a Large Number of Covariates JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Tian, L., Alizadeh, A. A., Gentles, A. J., Tibshirani, R. 2014; 109 (508): 1517-1532

    Abstract

    We consider a setting in which we have a treatment and a potentially large number of covariates for a set of observations, and wish to model their relationship with an outcome of interest. We propose a simple method for modeling interactions between the treatment and covariates. The idea is to modify the covariate in a simple way, and then fit a standard model using the modified covariates and no main effects. We show that coupled with an efficiency augmentation procedure, this method produces clinically meaningful estimators in a variety of settings. It can be useful for practicing personalized medicine: determining from a large set of biomarkers the subset of patients that can potentially benefit from a treatment. We apply the method to both simulated datasets and real trial data. The modified covariates idea can be used for other purposes, for example, large scale hypothesis testing for determining which of a set of covariates interact with a treatment variable.

    View details for DOI 10.1080/01621459.2014.951443

    View details for Web of Science ID 000346797000016

    View details for PubMedID 25729117

    View details for PubMedCentralID PMC4338439
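
    The modified-covariate idea in the abstract above can be illustrated with a toy single-covariate sketch. It assumes, as an assumption of this example rather than something stated in the abstract, that treatment is coded +1/-1 with balanced allocation, that the covariate is centered, and that the modified covariate is the product of covariate and treatment divided by 2. The function name and data are hypothetical, and the efficiency augmentation and high-dimensional fitting from the paper are not shown.

```python
def modified_covariate_fit(x, treatment, y):
    """No-intercept least squares of y on w = x * t / 2 for one covariate.

    Assumes treatment is coded +1 / -1 (an assumption of this sketch).
    Returns the estimated treatment-covariate interaction coefficient.
    """
    # Step 1: modify the covariate by multiplying it with the treatment code.
    w = [xi * ti / 2.0 for xi, ti in zip(x, treatment)]
    # Step 2: fit a standard model on w alone, with no main effects or intercept.
    num = sum(wi * yi for wi, yi in zip(w, y))
    den = sum(wi * wi for wi in w)
    return num / den


# Toy data: y = 3 + 2 * (x * t / 2), so the true interaction coefficient is 2.
x = [-1, 1, -1, 1]
t = [1, 1, -1, -1]
y = [2, 4, 4, 2]
gamma_hat = modified_covariate_fit(x, t, y)
```

    In this balanced toy design the constant term 3 is orthogonal to the modified covariate, which is why the fit recovers the interaction coefficient without modeling any main effect.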

  • Quantitative SD-OCT imaging biomarkers as indicators of age-related macular degeneration progression. Investigative ophthalmology & visual science de Sisternes, L., Simon, N., Tibshirani, R., Leng, T., Rubin, D. L. 2014; 55 (11): 7093-7103

    Abstract

    Purpose: We developed a statistical model based on quantitative characteristics of drusen to estimate the likelihood of conversion from early and intermediate age-related macular degeneration (AMD) to its advanced exudative form (AMD progression) in the short term (less than 5 years), a crucial task to enable early intervention and improve outcomes. Methods: Image features of drusen quantifying their number, morphology, and reflectivity properties, as well as the longitudinal evolution in these characteristics, were automatically extracted from 2146 spectral domain optical coherence tomography (SD-OCT) scans of 330 AMD eyes in 244 patients collected over a period of 5 years, with 36 eyes showing progression during clinical follow-up. We developed and evaluated a statistical model to predict the likelihood of progression at pre-determined times using clinical and image features as predictors. Results: Area, volume, height, and reflectivity of drusen were informative features distinguishing between progressing and non-progressing cases. Discerning progression at follow-up (mean 6.16 months) resulted in a mean area under the receiver operating characteristic curve (AUC) of 0.74 (95% confidence interval [CI], 0.58 to 0.85). The maximum predictive performance was observed at 11 months after a patient's first early AMD diagnosis, with mean AUC 0.92 (95% CI, 0.83 to 0.98). Those eyes predicted to progress showed a much higher progression rate than those predicted not to progress at any given time from the initial visit. Conclusions: Our results demonstrate the potential ability of our model to identify those AMD patients at risk of progressing to exudative AMD from an early or intermediate stage.

    View details for DOI 10.1167/iovs.14-14918

    View details for PubMedID 25301882

  • Alteration of the lipid profile in lymphomas induced by MYC overexpression. Proceedings of the National Academy of Sciences of the United States of America Eberlin, L. S., Gabay, M., Fan, A. C., Gouw, A. M., Tibshirani, R. J., Felsher, D. W., Zare, R. N. 2014; 111 (29): 10450-10455

    Abstract

    Overexpression of the v-myc avian myelocytomatosis viral oncogene homolog (MYC) oncogene is one of the most commonly implicated causes of human tumorigenesis. MYC is known to regulate many aspects of cellular biology including glucose and glutamine metabolism. Little is known about the relationship between MYC and the appearance and disappearance of specific lipid species. We use desorption electrospray ionization mass spectrometry imaging (DESI-MSI), statistical analysis, and conditional transgenic animal models and cell samples to investigate changes in lipid profiles in MYC-induced lymphoma. We have detected a lipid signature distinct from that observed in normal tissue and in rat sarcoma-induced lymphoma cells. We found 104 distinct molecular ions that have an altered abundance in MYC lymphoma compared with normal control tissue by statistical analysis with a false discovery rate of less than 5%. Of these, 86 molecular ions were specifically identified as complex phospholipids. To evaluate whether the lipid signature could also be observed in human tissue, we examined 15 human lymphoma samples with varying expression levels of MYC oncoprotein. Distinct lipid profiles in lymphomas with high and low MYC expression were observed, including many of the lipid species identified as significant for MYC-induced animal lymphoma tissue. Our results suggest a relationship between the appearance of specific lipid species and the overexpression of MYC in lymphomas.

    View details for DOI 10.1073/pnas.1409778111

    View details for PubMedID 24994904

  • Automated identification of stratifying signatures in cellular subpopulations PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Bruggner, R. V., Bodenmiller, B., Dill, D. L., Tibshirani, R. J., Nolan, G. P. 2014; 111 (26): E2770-E2777

    Abstract

    Elucidation and examination of cellular subpopulations that display condition-specific behavior can play a critical contributory role in understanding disease mechanism, as well as provide a focal point for development of diagnostic criteria linking such a mechanism to clinical prognosis. Despite recent advancements in single-cell measurement technologies, the identification of relevant cell subsets through manual efforts remains standard practice. As new technologies such as mass cytometry increase the parameterization of single-cell measurements, the limited scalability and the subjectivity inherent in manual analyses slow both analysis and progress. We therefore developed Citrus (cluster identification, characterization, and regression), a data-driven approach for the identification of stratifying subpopulations in multidimensional cytometry datasets. The methodology of Citrus is demonstrated through the identification of known and unexpected pathway responses in a dataset of stimulated peripheral blood mononuclear cells measured by mass cytometry. Additionally, the performance of Citrus is compared with that of existing methods through the analysis of several publicly available datasets. As the complexity of flow cytometry datasets continues to increase, methods such as Citrus will be needed to aid investigators in the performance of unbiased, and potentially more thorough, correlation-based mining and inspection of cell subsets nested within high-dimensional datasets.

    View details for DOI 10.1073/pnas.1408792111

    View details for Web of Science ID 000338118900020

    View details for PubMedID 24979804

    View details for PubMedCentralID PMC4084463

  • Active idiotypic vaccination versus control immunotherapy for follicular lymphoma. Journal of clinical oncology Levy, R., Ganjoo, K. N., Leonard, J. P., Vose, J. M., Flinn, I. W., Ambinder, R. F., Connors, J. M., Berinstein, N. L., Belch, A. R., Bartlett, N. L., Nichols, C., Emmanouilides, C. E., Timmerman, J. M., Gregory, S. A., Link, B. K., Inwards, D. J., Freedman, A. S., Matous, J. V., Robertson, M. J., Kunkel, L. A., Ingolia, D. E., Gentles, A. J., Liu, C. L., Tibshirani, R., Alizadeh, A. A., Denney, D. W. 2014; 32 (17): 1797-1803

    Abstract

    Idiotypes (Ids), the unique portions of tumor immunoglobulins, can serve as targets for passive and active immunotherapies for lymphoma. We performed a multicenter, randomized trial comparing a specific vaccine (MyVax), comprising Id chemically coupled to keyhole limpet hemocyanin (KLH) plus granulocyte macrophage colony-stimulating factor (GM-CSF) to a control immunotherapy with KLH plus GM-CSF. Patients with previously untreated advanced-stage follicular lymphoma (FL) received eight cycles of chemotherapy with cyclophosphamide, vincristine, and prednisone. Those achieving sustained partial or complete remission (n=287 [44%]) were randomly assigned at a ratio of 2:1 to receive one injection per month for 7 months of MyVax or control immunotherapy. Anti-Id antibody responses (humoral immune responses [IRs]) were measured before each immunization. The primary end point was progression-free survival (PFS). Secondary end points included IR and time to subsequent antilymphoma therapy. At a median follow-up of 58 months, no significant difference was observed in either PFS or time to next therapy between the two arms. In the MyVax group (n=195), anti-Id IRs were observed in 41% of patients, with a median PFS of 40 months, significantly exceeding the median PFS observed in patients without such Id-induced IRs and in those receiving control immunotherapy. This trial failed to demonstrate clinical benefit of specific immunotherapy. The subset of vaccinated patients mounting specific anti-Id responses had superior outcomes. Whether this reflects a therapeutic benefit or is a marker for more favorable underlying prognosis requires further study.

    View details for DOI 10.1200/JCO.2012.43.9273

    View details for PubMedID 24799467

  • Regularization Paths for Conditional Logistic Regression: The clogitL1 Package JOURNAL OF STATISTICAL SOFTWARE Reid, S., Tibshirani, R. 2014; 58 (12): 1-23
  • Sensitivity analysis for inference with partially identifiable covariance matrices COMPUTATIONAL STATISTICS G'Sell, M. G., Shen-Orr, S. S., Tibshirani, R. 2014; 29 (3-4): 529-546
  • LMO2 and BCL6 are associated with improved survival in primary central nervous system lymphoma BRITISH JOURNAL OF HAEMATOLOGY Lossos, C., Bayraktar, S., Weinzierl, E., Younes, S. F., Hosein, P. J., Tibshirani, R. J., Posthumus, J. S., DeAngelis, L. M., Raizer, J., Schiff, D., Abrey, L., Natkunam, Y., Lossos, I. S. 2014; 165 (5): 640-648

    Abstract

    Primary central nervous system lymphoma (PCNSL) is an aggressive sub-variant of non-Hodgkin lymphoma (NHL) with morphological similarities to diffuse large B-cell lymphoma (DLBCL). While methotrexate (MTX)-based therapies have improved patient survival, the disease remains incurable in most cases and its pathogenesis is poorly understood. We evaluated 69 cases of PCNSL for the expression of HGAL (also known as GCSAM), LMO2 and BCL6 - genes associated with DLBCL prognosis and pathobiology, and analysed their correlation to survival in 49 PCNSL patients receiving MTX-based therapy. We demonstrate that PCNSL expresses LMO2, HGAL (also known as GCSAM) and BCL6 proteins in 52%, 65% and 56% of tumours, respectively. BCL6 protein expression was associated with longer progression-free survival (P = 0·006) and overall survival (OS, P = 0·05), while expression of LMO2 protein was associated with longer OS (P = 0·027). Further research is needed to elucidate the function of BCL6 and LMO2 in PCNSL.

    View details for DOI 10.1111/bjh.12801

    View details for Web of Science ID 000335826500008

    View details for PubMedID 24571259

    View details for PubMedCentralID PMC4123533

  • A multicentre study of primary breast diffuse large B-cell lymphoma in the rituximab era BRITISH JOURNAL OF HAEMATOLOGY Hosein, P. J., Maragulia, J. C., Salzberg, M. P., Press, O. W., Habermann, T. M., Vose, J. M., Bast, M., Advani, R. H., Tibshirani, R., Evens, A. M., Islam, N., Leonard, J. P., Martin, P., Zelenetz, A. D., Lossos, I. S. 2014; 165 (3): 358-363

    Abstract

    Primary breast diffuse large B-cell lymphoma (DLBCL) is a rare subtype of non-Hodgkin lymphoma (NHL) with limited data on pathology and outcome. A multicentre retrospective study was undertaken to determine prognostic factors and the incidence of central nervous system (CNS) relapses. Data were retrospectively collected on patients from 8 US academic centres. Only patients with stage I/II disease (involvement of breast and localized lymph nodes) were included. Histologies apart from primary DLBCL were excluded. Between 1992 and 2012, 76 patients met the eligibility criteria. Most patients (86%) received chemotherapy, and 69% received immunochemotherapy with rituximab; 65% received radiation therapy and 9% received prophylactic CNS chemotherapy. After a median follow-up of 4·5 years (range 0·6-20·6 years), the Kaplan-Meier estimated median progression-free survival was 10·4 years (95% confidence interval [CI] 5·8-14·9 years), and the median overall survival was 14·6 years (95% CI 10·2-19 years). Twelve patients (16%) had CNS relapse. A low stage-modified International Prognostic Index (IPI) was associated with longer overall survival. Rituximab use was not associated with a survival advantage. Primary breast DLBCL has a high rate of CNS relapse. The stage-modified IPI score is associated with survival.

    View details for DOI 10.1111/bjh.12753

    View details for Web of Science ID 000334031000011

    View details for PubMedID 24467658

    View details for PubMedCentralID PMC3990235

  • A SIGNIFICANCE TEST FOR THE LASSO ANNALS OF STATISTICS Lockhart, R., Taylor, J., Tibshirani, R. J., Tibshirani, R. 2014; 42 (2): 413-468

    Abstract

    In the sparse linear regression setting, we consider testing the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path. We propose a simple test statistic based on lasso fitted values, called the covariance test statistic, and show that when the true model is linear, this statistic has an Exp(1) asymptotic distribution under the null hypothesis (the null being that all truly active variables are contained in the current lasso model). Our proof of this result for the special case of the first predictor to enter the model (i.e., testing for a single significant predictor variable against the global null) requires only weak assumptions on the predictor matrix X. On the other hand, our proof for a general step in the lasso path places further technical assumptions on X and the generative model, but still allows for the important high-dimensional case p > n, and does not necessarily require that the current lasso model achieves perfect recovery of the truly active variables. Of course, for testing the significance of an additional variable between two nested linear models, one typically uses the chi-squared test, comparing the drop in residual sum of squares (RSS) to a $\chi^2_1$ distribution. But when this additional variable is not fixed, and has been chosen adaptively or greedily, this test is no longer appropriate: adaptivity makes the drop in RSS stochastically much larger than $\chi^2_1$ under the null hypothesis. Our analysis explicitly accounts for adaptivity, as it must, since the lasso builds an adaptive sequence of linear models as the tuning parameter λ decreases. In this analysis, shrinkage plays a key role: though additional variables are chosen adaptively, the coefficients of lasso active variables are shrunken due to the $\ell_1$ penalty. Therefore, the test statistic (which is based on lasso fitted values) is in a sense balanced by these two opposing properties (adaptivity and shrinkage), and its null distribution is tractable and asymptotically Exp(1).

    View details for DOI 10.1214/13-AOS1175

    View details for Web of Science ID 000336888400001

    View details for PubMedID 25574062

    View details for PubMedCentralID PMC4285373
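
    A small simulation sketch of the first step of the covariance test described in the abstract above. It assumes an orthonormal predictor matrix and a known noise level sigma, in which case the lasso knots are the sorted absolute inner products |x_j'y| and the first-step statistic reduces to lambda1 * (lambda1 - lambda2) / sigma^2. The function names, the choice p = 100 and the number of replications are illustrative assumptions, not values from the paper.

```python
import random


def first_step_statistic(u, sigma=1.0):
    """Covariance statistic for the first variable to enter the lasso path.

    u holds the inner products x_j'y for an orthonormal design, so the two
    largest absolute values are the first two knots lambda1 >= lambda2.
    """
    top = sorted((abs(v) for v in u), reverse=True)
    lam1, lam2 = top[0], top[1]
    return lam1 * (lam1 - lam2) / sigma ** 2


def simulate_null(p=100, reps=4000, seed=1):
    """Draw the statistic under the global null (no active predictors)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        # Under the null with orthonormal X and sigma = 1, X'y is i.i.d. N(0, 1).
        u = [rng.gauss(0.0, 1.0) for _ in range(p)]
        stats.append(first_step_statistic(u))
    return stats


STATS = simulate_null()
MEAN_T1 = sum(STATS) / len(STATS)
print("simulated mean of T1:", round(MEAN_T1, 3))  # Exp(1) has mean 1
```

    Comparing the simulated values with Exp(1) quantiles, rather than only the mean, would give a fuller check of the null distribution.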

  • Molecular assessment of surgical-resection margins of gastric cancer by mass-spectrometric imaging. Proceedings of the National Academy of Sciences of the United States of America Eberlin, L. S., Tibshirani, R. J., Zhang, J., Longacre, T. A., Berry, G. J., Bingham, D. B., Norton, J. A., Zare, R. N., Poultsides, G. A. 2014; 111 (7): 2436-2441

    Abstract

    Surgical resection is the main curative option for gastrointestinal cancers. The extent of cancer resection is commonly assessed during surgery by pathologic evaluation of (frozen sections of) the tissue at the resected specimen margin(s) to verify whether cancer is present. We compare this method to an alternative procedure, desorption electrospray ionization mass spectrometric imaging (DESI-MSI), for 62 banked human cancerous and normal gastric-tissue samples. In DESI-MSI, microdroplets strike the tissue sample, the resulting splash enters a mass spectrometer, and a statistical analysis, here, the Lasso method (which stands for least absolute shrinkage and selection operator and which is a multiclass logistic regression with L1 penalty), is applied to classify tissues based on the molecular information obtained directly from DESI-MSI. The methodology developed with 28 frozen training samples of clear histopathologic diagnosis showed an overall accuracy value of 98% for the 12,480 pixels evaluated in cross-validation (CV), and 97% when a completely independent set of samples was tested. By applying an additional spatial smoothing technique, the accuracy for both CV and the independent set of samples was 99% compared with histological diagnoses. To test our method for clinical use, we applied it to a total of 21 tissue-margin samples prospectively obtained from nine gastric-cancer patients. The results obtained suggest that DESI-MSI/Lasso may be valuable for routine intraoperative assessment of the specimen margins during gastric-cancer surgery.

    View details for DOI 10.1073/pnas.1400274111

    View details for PubMedID 24550265

  • Systems analysis of sex differences reveals an immunosuppressive role for testosterone in the response to influenza vaccination. Proceedings of the National Academy of Sciences of the United States of America Furman, D., Hejblum, B. P., Simon, N., Jojic, V., Dekker, C. L., Thiébaut, R., Tibshirani, R. J., Davis, M. M. 2014; 111 (2): 869-874

    Abstract

    Females generally have more robust immune responses than males, for reasons that are not well understood. Here we used a systems analysis to investigate these differences by analyzing the neutralizing antibody response to a trivalent inactivated seasonal influenza vaccine (TIV) and a large number of immune system components, including serum cytokines and chemokines, blood cell subset frequencies, genome-wide gene expression, and cellular responses to diverse in vitro stimuli, in 53 females and 34 males of different ages. We found elevated antibody responses to TIV and expression of inflammatory cytokines in the serum of females compared with males regardless of age. This inflammatory profile correlated with the levels of phosphorylated STAT3 proteins in monocytes but not with the serological response to the vaccine. In contrast, using a machine learning approach, we identified a cluster of genes involved in lipid biosynthesis and previously shown to be up-regulated by testosterone that correlated with poor virus-neutralizing activity in men. Moreover, men with elevated serum testosterone levels and associated gene signatures exhibited the lowest antibody responses to TIV. These results demonstrate a strong association between androgens and genes involved in lipid metabolism, suggesting that these could be important drivers of the differences in immune responses between males and females.

    View details for DOI 10.1073/pnas.1321060111

    View details for PubMedID 24367114

    View details for PubMedCentralID PMC3896147

  • Increasing value and reducing waste in research design, conduct, and analysis. Lancet Ioannidis, J. P., Greenland, S., Hlatky, M. A., Khoury, M. J., Macleod, M. R., Moher, D., Schulz, K. F., Tibshirani, R. 2014; 383 (9912): 166-175

    Abstract

    Correctable weaknesses in the design, conduct, and analysis of biomedical and public health research studies can produce misleading results and waste valuable resources. Small effects can be difficult to distinguish from bias introduced by study design and analyses. An absence of detailed written protocols and poor documentation of research is common. Information obtained might not be useful or important, and statistical precision or power is often too low or used in a misleading way. Insufficient consideration might be given to both previous and continuing studies. Arbitrary choice of analyses and an overemphasis on random extremes might affect the reported findings. Several problems relate to the research workforce, including failure to involve experienced statisticians and methodologists, failure to train clinical researchers and laboratory scientists in research methods and design, and the involvement of stakeholders with conflicts of interest. Inadequate emphasis is placed on recording of research decisions and on reproducibility of research. Finally, reward systems incentivise quantity more than quality, and novelty more than reliability. We propose potential solutions for these problems, including improvements in protocols and documentation, consideration of evidence from studies in progress, standardisation of research efforts, optimisation and training of an experienced and non-conflicted scientific workforce, and reconsideration of scientific reward systems.

    View details for DOI 10.1016/S0140-6736(13)62227-8

    View details for PubMedID 24411645

  • A shared transcriptional program in early breast neoplasias despite genetic and clinical distinctions GENOME BIOLOGY Brunner, A. L., Li, J., Guo, X., Sweeney, R. T., Varma, S., Zhu, S. X., Li, R., Tibshirani, R., West, R. B. 2014; 15 (5)

    Abstract

    The earliest recognizable stages of breast neoplasia are lesions that represent a heterogeneous collection of epithelial proliferations currently classified based on morphology. Their role in the development of breast cancer is not well understood but insight into the critical events at this early stage will improve efforts in breast cancer detection and prevention. These microscopic lesions are technically difficult to study so very little is known about their molecular alterations. To characterize the transcriptional changes of early breast neoplasia, we sequenced 3'-end enriched RNAseq libraries from formalin-fixed paraffin-embedded tissue of early neoplasia samples and matched normal breast and carcinoma samples from 25 patients. We find that gene expression patterns within early neoplasias are distinct from both normal and breast cancer patterns and identify a pattern of pro-oncogenic changes, including elevated transcription of ERBB2, FOXA1, and GATA3 at this early stage. We validate these findings on a second independent gene expression profile data set generated by whole transcriptome sequencing. Measurements of protein expression by immunohistochemistry on an independent set of early neoplasias confirm that ER pathway regulators FOXA1 and GATA3, as well as ER itself, are consistently upregulated at this early stage. The early neoplasia samples also demonstrate coordinated changes in long non-coding RNA expression and microenvironment stromal gene expression patterns. This study is the first examination of global gene expression in early breast neoplasia, and the genes identified here represent candidate participants in the earliest molecular events in the development of breast cancer.

    View details for DOI 10.1186/gb-2014-15-5-r71

    View details for Web of Science ID 000338981700005

    View details for PubMedCentralID PMC4072957

  • Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-Seq data STATISTICAL METHODS IN MEDICAL RESEARCH Li, J., Tibshirani, R. 2013; 22 (5): 519-536

    Abstract

    We discuss the identification of features that are associated with an outcome in RNA-Sequencing (RNA-Seq) and other sequencing-based comparative genomic experiments. RNA-Seq data takes the form of counts, so models based on the normal distribution are generally unsuitable. The problem is especially challenging because different sequencing experiments may generate quite different total numbers of reads, or 'sequencing depths'. Existing methods for this problem are based on Poisson or negative binomial models: they are useful but can be heavily influenced by 'outliers' in the data. We introduce a simple, non-parametric method with resampling to account for the different sequencing depths. The new method is more robust than parametric methods. It can be applied to data with quantitative, survival, two-class or multiple-class outcomes. We compare our proposed method to Poisson and negative binomial-based methods in simulated and real data sets, and find that our method discovers more consistent patterns than competing methods.

    View details for DOI 10.1177/0962280211428386

    View details for Web of Science ID 000325863700005

    View details for PubMedID 22127579

    View details for PubMedCentralID PMC4605138

  • Identification of gene microarray expression profiles in patients with chronic graft-versus-host disease following allogeneic hematopoietic cell transplantation. Clinical immunology Kohrt, H. E., Tian, L., Li, L., Alizadeh, A. A., Hsieh, S., Tibshirani, R. J., Strober, S., Sarwal, M., Lowsky, R. 2013; 148 (1): 124-135

    Abstract

    Chronic graft-versus-host disease (GVHD) results in significant morbidity and mortality, limiting the benefit of allogeneic hematopoietic cell transplantation (HCT). Peripheral blood gene expression profiling of the donor immune repertoire following HCT may identify associated genes and pathways, thereby improving the pathophysiologic understanding of chronic GVHD. We profiled 70 patients and identified candidate genes that provided mechanistic insight into the biologic pathways that underlie chronic GVHD. Our data revealed that the dominant gene signature in patients with chronic GVHD represented compensatory responses that control inflammation and included the interleukin-1 decoy receptor, IL-1 receptor type II, and genes that were profibrotic and associated with the IL-4, IL-6 and IL-10 signaling pathways. In addition, we identified three genes that were important regulators of extracellular matrix. Validation of this discovery phase study will determine if the identified genes have diagnostic, prognostic or therapeutic implications.

    View details for DOI 10.1016/j.clim.2013.04.013

    View details for PubMedID 23685278

  • A LASSO FOR HIERARCHICAL INTERACTIONS ANNALS OF STATISTICS Bien, J., Taylor, J., Tibshirani, R. 2013; 41 (3): 1111-1141

    Abstract

    We add a set of convex constraints to the lasso to produce sparse interaction models that honor the hierarchy restriction that an interaction only be included in a model if one or both variables are marginally important. We give a precise characterization of the effect of this hierarchy constraint, prove that hierarchy holds with probability one and derive an unbiased estimate for the degrees of freedom of our estimator. A bound on this estimate reveals the amount of fitting "saved" by the hierarchy constraint. We distinguish between parameter sparsity (the number of nonzero coefficients) and practical sparsity (the number of raw variables one must measure to make a new prediction). Hierarchy focuses on the latter, which is more closely tied to important data collection concerns such as cost, time and effort. We develop an algorithm, available in the R package hierNet, and perform an empirical study of our method.

    View details for DOI 10.1214/13-AOS1096

    View details for Web of Science ID 000321847600003

    View details for PubMedID 26257447

    View details for PubMedCentralID PMC4527358
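
    The distinction between parameter sparsity and practical sparsity drawn in the abstract above can be illustrated with a short sketch. It takes a fitted interaction model as two plain dictionaries and reads "marginally important" as "has a nonzero main-effect coefficient", which is an assumption of this sketch. All function names and coefficient values are hypothetical; this is not the hierNet interface.

```python
def parameter_sparsity(main, inter, tol=0.0):
    """Number of nonzero coefficients (main effects plus interactions)."""
    n_main = sum(1 for b in main.values() if abs(b) > tol)
    n_inter = sum(1 for b in inter.values() if abs(b) > tol)
    return n_main + n_inter


def practical_sparsity(main, inter, tol=0.0):
    """Number of raw variables that must be measured to make a prediction."""
    used = {j for j, b in main.items() if abs(b) > tol}
    for (j, k), b in inter.items():
        if abs(b) > tol:
            used.update((j, k))
    return len(used)


def honors_hierarchy(main, inter, strong=False, tol=0.0):
    """Check that each nonzero interaction has one (weak) or both (strong)
    of its variables present as nonzero main effects."""
    active = {j for j, b in main.items() if abs(b) > tol}
    needed = 2 if strong else 1
    for (j, k), b in inter.items():
        if abs(b) > tol and (j in active) + (k in active) < needed:
            return False
    return True


# Hypothetical fitted coefficients.
MAIN = {"x1": 1.2, "x2": 0.0, "x3": -0.4, "x4": 0.0}
INTER = {("x1", "x2"): 0.7, ("x1", "x3"): 0.5, ("x2", "x4"): 0.0}

print(parameter_sparsity(MAIN, INTER))              # 4 nonzero coefficients
print(practical_sparsity(MAIN, INTER))              # 3 variables: x1, x2, x3
print(honors_hierarchy(MAIN, INTER))                # True (weak hierarchy)
print(honors_hierarchy(MAIN, INTER, strong=True))   # False: x2 has no main effect
```

    In this toy model the interaction between x1 and x2 forces x2 to be measured even though its main effect is zero, so the two notions of sparsity differ.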

  • A Sparse-Group Lasso JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Simon, N., Friedman, J., Hastie, T., Tibshirani, R. 2013; 22 (2): 231-245
  • Classification of patients from time-course gene expression BIOSTATISTICS Zhang, Y., Tibshirani, R., Davis, R. 2013; 14 (1): 87-98

    Abstract

    Classifying patients into different risk groups based on their genomic measurements can help clinicians design appropriate clinical treatment plans. To produce such a classification, gene expression data were collected on a cohort of burn patients, who were monitored across multiple time points. This led us to develop a new classification method using time-course gene expressions. Our results showed that making good use of time-course information of gene expression improved the performance of classification compared with using gene expression from individual time points only. Our method is implemented in an R package: time-course prediction analysis using microarray.

    View details for DOI 10.1093/biostatistics/kxs027

    View details for Web of Science ID 000312636300007

    View details for PubMedID 22926914

    View details for PubMedCentralID PMC3520502

  • Scientific research in the age of omics: the good, the bad, and the sloppy JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Witten, D. M., Tibshirani, R. 2013; 20 (1): 125-127

    Abstract

    It has been claimed that most research findings are false, and it is known that large-scale studies involving omics data are especially prone to errors in design, execution, and analysis. The situation is alarming because taxpayer dollars fund a substantial amount of biomedical research, and because the publication of a research article that is later determined to be flawed can erode the credibility of an entire field, resulting in a severe and negative impact for years to come. Here, we urge the development of an online, open-access, postpublication, peer review system that will increase the accountability of scientists for the quality of their research and the ability of readers to distinguish good from sloppy science.

    View details for DOI 10.1136/amiajnl-2012-000972

    View details for Web of Science ID 000313512900020

    View details for PubMedID 23037799

  • Coronary risk assessment among intermediate risk patients using a clinical and biomarker based algorithm developed and validated in two population cohorts CURRENT MEDICAL RESEARCH AND OPINION Cross, D. S., McCarty, C. A., Hytopoulos, E., Beggs, M., Nolan, N., Harrington, D. S., Hastie, T., Tibshirani, R., Tracy, R. P., Psaty, B. M., McClelland, R., Tsao, P. S., Quertermous, T. 2012; 28 (11): 1819-1830

    Abstract

    Many coronary heart disease (CHD) events occur in individuals classified as intermediate risk by commonly used assessment tools. Over half the individuals presenting with a severe cardiac event, such as myocardial infarction (MI), have at most one risk factor as included in the widely used Framingham risk assessment. Individuals classified as intermediate risk, who are actually at high risk, may not receive guideline recommended treatments. A clinically useful method for accurately predicting 5-year CHD risk among intermediate risk patients remains an unmet medical need. This study sought to develop a CHD Risk Assessment (CHDRA) model that improves 5-year risk stratification among intermediate risk individuals. Assay panels for biomarkers associated with atherosclerosis biology (inflammation, angiogenesis, apoptosis, chemotaxis, etc.) were optimized for measuring baseline serum samples from 1084 initially CHD-free Marshfield Clinic Personalized Medicine Research Project (PMRP) individuals. A multivariable Cox regression model was fit using the most powerful risk predictors within the clinical and protein variables identified by repeated cross-validation. The resulting CHDRA algorithm was validated in a Multi-Ethnic Study of Atherosclerosis (MESA) case-cohort sample. A CHDRA algorithm of age, sex, diabetes, and family history of MI, combined with serum levels of seven biomarkers (CTACK, Eotaxin, Fas Ligand, HGF, IL-16, MCP-3, and sFas) yielded a clinical net reclassification index of 42.7% (p < 0.001) for MESA patients with a recalibrated Framingham 5-year intermediate risk level. Across all patients, the model predicted acute coronary events (hazard ratio = 2.17, p < 0.001), and remained an independent predictor after Framingham risk factor adjustments. Limitations include the slightly different event definition with the MESA samples and inability to include PMRP fatal CHD events. A novel risk score of serum protein levels plus clinical risk factors, developed and validated in independent cohorts, demonstrated clinical utility for assessing the true risk of CHD events in intermediate risk patients. Improved accuracy in cardiovascular risk classification could lead to improved preventive care and fewer deaths.

    View details for DOI 10.1185/03007995.2012.742878

    View details for Web of Science ID 000310985600009

    View details for PubMedID 23092312

    View details for PubMedCentralID PMC3666558

  • Genome-wide Measurement of RNA Folding Energies MOLECULAR CELL Wan, Y., Qu, K., Ouyang, Z., Kertesz, M., Li, J., Tibshirani, R., Makino, D. L., Nutter, R. C., Segal, E., Chang, H. Y. 2012; 48 (2): 169-181

    Abstract

    RNA structural transitions are important in the function and regulation of RNAs. Here, we reveal a layer of transcriptome organization in the form of RNA folding energies. By probing yeast RNA structures at different temperatures, we obtained relative melting temperatures (Tm) for RNA structures in over 4000 transcripts. Specific signatures of RNA Tm demarcated the polarity of mRNA open reading frames and highlighted numerous candidate regulatory RNA motifs in 3' untranslated regions. RNA Tm distinguished noncoding versus coding RNAs and identified mRNAs with distinct cellular functions. We identified thousands of putative RNA thermometers, and their presence is predictive of the pattern of RNA decay in vivo during heat shock. The exosome complex recognizes unpaired bases during heat shock to degrade these RNAs, coupling intrinsic structural stabilities to gene regulation. Thus, genome-wide structural dynamics of RNA can parse functional elements of the transcriptome and reveal diverse biological insights.

    View details for DOI 10.1016/j.molcel.2012.08.008

    View details for PubMedID 22981864

  • Inference with Transposable Data: Modeling the Effects of Row and Column Correlations. Journal of the Royal Statistical Society. Series B, Statistical methodology Allen, G. I., Tibshirani, R. 2012; 74 (4): 721-743

    Abstract

    We consider the problem of large-scale inference on the row or column variables of data in the form of a matrix. Many of these data matrices are transposable meaning that neither the row variables nor the column variables can be considered independent instances. An example of this scenario is detecting significant genes in microarrays when the samples may be dependent due to latent variables or unknown batch effects. By modeling this matrix data using the matrix-variate normal distribution, we study and quantify the effects of row and column correlations on procedures for large-scale inference. We then propose a simple solution to the myriad of problems presented by unanticipated correlations: We simultaneously estimate row and column covariances and use these to sphere or de-correlate the noise in the underlying data before conducting inference. This procedure yields data with approximately independent rows and columns so that test statistics more closely follow null distributions and multiple testing procedures correctly control the desired error rates. Results on simulated models and real microarray data demonstrate major advantages of this approach: (1) increased statistical power, (2) less bias in estimating the false discovery rate, and (3) reduced variance of the false discovery rate estimators.

    View details for DOI 10.1111/j.1467-9868.2011.01027.x

    View details for PubMedID 34880705

    View details for PubMedCentralID PMC8649963

  • Normalization, testing, and false discovery rate estimation for RNA-sequencing data BIOSTATISTICS Li, J., Witten, D. M., Johnstone, I. M., Tibshirani, R. 2012; 13 (3): 523-538

    Abstract

    We discuss the identification of genes that are associated with an outcome in RNA sequencing and other sequence-based comparative genomic experiments. RNA-sequencing data take the form of counts, so models based on the Gaussian distribution are unsuitable. Moreover, normalization is challenging because different sequencing experiments may generate quite different total numbers of reads. To overcome these difficulties, we use a log-linear model with a new approach to normalization. We derive a novel procedure to estimate the false discovery rate (FDR). Our method can be applied to data with quantitative, two-class, or multiple-class outcomes, and the computation is fast even for large data sets. We study the accuracy of our approaches for significance calculation and FDR estimation, and we demonstrate that our method has potential advantages over existing methods that are based on a Poisson or negative binomial model. In summary, this work provides a pipeline for the significance analysis of sequencing data.

    View details for DOI 10.1093/biostatistics/kxr031

    View details for Web of Science ID 000305420000012

    View details for PubMedID 22003245

    View details for PubMedCentralID PMC3372940

  • STANDARDIZATION AND THE GROUP LASSO PENALTY. Statistica Sinica Simon, N., Tibshirani, R. 2012; 22 (3): 983-1001

    Abstract

    We re-examine the original Group Lasso paper of Yuan and Lin (2007). The form of penalty in that paper seems to be designed for problems with uncorrelated features, but the statistical community has adopted it for general problems with correlated features. We show that for this general situation, a Group Lasso with a different choice of penalty matrix is generally more effective. We give insight into this formulation and show that it is intimately related to the uniformly most powerful invariant test for inclusion of a group. We demonstrate the efficacy of this method, the "standardized Group Lasso", over the usual group lasso on real and simulated data sets. We also extend this to the Ridged Group Lasso to provide within-group regularization as needed. We discuss a simple algorithm based on group-wise coordinate descent to fit both this standardized Group Lasso and Ridged Group Lasso.

    View details for DOI 10.5705/ss.2011.075

    View details for PubMedID 26257503

    View details for PubMedCentralID PMC4527185

  • Autoantibody Epitope Spreading in the Pre-Clinical Phase Predicts Progression to Rheumatoid Arthritis PLOS ONE Sokolove, J., Bromberg, R., Deane, K. D., Lahey, L. J., Derber, L. A., Chandra, P. E., Edison, J. D., Gilliland, W. R., Tibshirani, R. J., Norris, J. M., Holers, V. M., Robinson, W. H. 2012; 7 (5)

    Abstract

    Rheumatoid arthritis (RA) is a prototypical autoimmune arthritis affecting nearly 1% of the world population and is a significant cause of worldwide disability. Though prior studies have demonstrated the appearance of RA-related autoantibodies years before the onset of clinical RA, the pattern of immunologic events preceding the development of RA remains unclear. To characterize the evolution of the autoantibody response in the preclinical phase of RA, we used a novel multiplex autoantigen array to evaluate development of the anti-citrullinated protein antibodies (ACPA) and to determine if epitope spread correlates with rise in serum cytokines and imminent onset of clinical RA. To do so, we utilized a cohort of 81 patients with clinical RA for whom stored serum was available from 1-12 years prior to disease onset. We evaluated the accumulation of ACPA subtypes over time and correlated this accumulation with elevations in serum cytokines. We then used logistic regression to identify a profile of biomarkers which predicts the imminent onset of clinical RA (defined as within 2 years of testing). We observed a time-dependent expansion of ACPA specificity with the number of ACPA subtypes. At the earliest timepoints, we found autoantibodies targeting several innate immune ligands including citrullinated histones, fibrinogen, and biglycan, thus providing insights into the earliest autoantigen targets and potential mechanisms underlying the onset and development of autoimmunity in RA. Additionally, expansion of the ACPA response strongly predicted elevations in many inflammatory cytokines including TNF-α, IL-6, IL-12p70, and IFN-γ. Thus, we observe that the preclinical phase of RA is characterized by an accumulation of multiple autoantibody specificities reflecting the process of epitope spread. Epitope expansion is closely correlated with the appearance of preclinical inflammation, and we identify a biomarker profile including autoantibodies and cytokines which predicts the imminent onset of clinical arthritis.

    View details for DOI 10.1371/journal.pone.0035296

    View details for PubMedID 22662108

  • DEGREES OF FREEDOM IN LASSO PROBLEMS ANNALS OF STATISTICS Tibshirani, R. J., Taylor, J. 2012; 40 (2): 1198-1232

    View details for DOI 10.1214/12-AOS1003

    View details for Web of Science ID 000307608000021

  • Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society. Series B, Statistical methodology Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., Tibshirani, R. J. 2012; 74 (2): 245-266

    Abstract

    We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui and his colleagues have proposed 'SAFE' rules, based on univariate inner products between each predictor and the outcome, which guarantee that a coefficient will be 0 in the solution vector. This provides a reduction in the number of variables that need to be entered into the optimization. We propose strong rules that are very simple and yet screen out far more predictors than the SAFE rules. This great practical improvement comes at a price: the strong rules are not foolproof and can mistakenly discard active predictors, i.e. predictors that have non-zero coefficients in the solution. We therefore combine them with simple checks of the Karush-Kuhn-Tucker conditions to ensure that the exact solution to the convex problem is delivered. Of course, any (approximate) screening method can be combined with the Karush-Kuhn-Tucker conditions to ensure the exact solution; the strength of the strong rules lies in the fact that, in practice, they discard a very large number of the inactive predictors and almost never commit mistakes. We also derive conditions under which they are foolproof. Strong rules provide substantial savings in computational time for a variety of statistical optimization problems.

    View details for DOI 10.1111/j.1467-9868.2011.01004.x

    View details for PubMedID 25506256

    View details for PubMedCentralID PMC4262615

  • In situ vaccination against mycosis fungoides by intratumoral injection of a TLR9 agonist combined with radiation: a phase 1/2 study BLOOD Kim, Y. H., Gratzinger, D., Harrison, C., Brody, J. D., Czerwinski, D. K., Ai, W. Z., Morales, A., Abdulla, F., Xing, L., Navi, D., Tibshirani, R. J., Advani, R. H., Lingala, B., Shah, S., Hoppe, R. T., Levy, R. 2012; 119 (2): 355-363

    Abstract

    We have developed and previously reported on a therapeutic vaccination strategy for indolent B-cell lymphoma that combines local radiation to enhance tumor immunogenicity with the injection into the tumor of a TLR9 agonist. As a result, antitumor CD8(+) T cells are induced, and systemic tumor regression was documented. Because the vaccination occurs in situ, there is no need to manufacture a vaccine product. We have now explored this strategy in a second disease: mycosis fungoides (MF). We treated 15 patients. Clinical responses were assessed at the distant, untreated sites as a measure of systemic antitumor activity. Five clinically meaningful responses were observed. The procedure was well tolerated and adverse effects consisted mostly of mild and transient injection site or flu-like symptoms. The immunized sites showed a significant reduction of CD25(+), Foxp3(+) T cells that could be either MF cells or tissue regulatory T cells and a similar reduction in S100(+), CD1a(+) dendritic cells. There was a trend toward greater reduction of CD25(+) T cells and skin dendritic cells in clinical responders versus nonresponders. Our in situ vaccination strategy is feasible also in MF and the clinical responses that occurred in a subset of patients warrant further study with modifications to augment these therapeutic effects. This study is registered at www.clinicaltrials.gov as NCT00226993.

    View details for DOI 10.1182/blood-2011-05-355222

    View details for PubMedID 22045986

  • Strong rules for discarding predictors in lasso-type problems JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., Tibshirani, R. J. 2012; 74: 245-266

    Abstract

    We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui and his colleagues have proposed 'SAFE' rules, based on univariate inner products between each predictor and the outcome, which guarantee that a coefficient will be 0 in the solution vector. This provides a reduction in the number of variables that need to be entered into the optimization. We propose strong rules that are very simple and yet screen out far more predictors than the SAFE rules. This great practical improvement comes at a price: the strong rules are not foolproof and can mistakenly discard active predictors, i.e. predictors that have non-zero coefficients in the solution. We therefore combine them with simple checks of the Karush-Kuhn-Tucker conditions to ensure that the exact solution to the convex problem is delivered. Of course, any (approximate) screening method can be combined with the Karush-Kuhn-Tucker conditions to ensure the exact solution; the strength of the strong rules lies in the fact that, in practice, they discard a very large number of the inactive predictors and almost never commit mistakes. We also derive conditions under which they are foolproof. Strong rules provide substantial savings in computational time for a variety of statistical optimization problems.

    View details for DOI 10.1111/j.1467-9868.2011.01004.x

    View details for Web of Science ID 000301286200004

    View details for PubMedCentralID PMC4262615

  • Transcriptional profiling of long non-coding RNAs and novel transcribed regions across a diverse panel of archived human cancers GENOME BIOLOGY Brunner, A. L., Beck, A. H., Edris, B., Sweeney, R. T., Zhu, S. X., Li, R., Montgomery, K., Varma, S., Gilks, T., Guo, X., Foley, J. W., Witten, D. M., Giacomini, C. P., Flynn, R. A., Pollack, J. R., Tibshirani, R., Chang, H. Y., van de Rijn, M., West, R. B. 2012; 13 (8)

    Abstract

    BACKGROUND: Molecular characterization of tumors has been critical for identifying important genes in cancer biology and for improving tumor classification and diagnosis. Long non-coding RNAs, as a new, relatively unstudied class of transcripts, provide a rich opportunity to identify both functional drivers and cancer-type-specific biomarkers. However, despite the potential importance of long non-coding RNAs to the cancer field, no comprehensive survey of long non-coding RNA expression across various cancers has been reported. RESULTS: We performed a sequencing-based transcriptional survey of both known long non-coding RNAs and novel intergenic transcripts across a panel of 64 archival tumor samples comprising 17 diagnostic subtypes of adenocarcinomas, squamous cell carcinomas and sarcomas. We identified hundreds of transcripts from among the known 1,065 long non-coding RNAs surveyed that showed variability in transcript levels between the tumor types and are therefore potential biomarker candidates. We discovered 1,071 novel intergenic transcribed regions and demonstrate that these show similar patterns of variability between tumor types. We found that many of these differentially expressed cancer transcripts are also expressed in normal tissues. One such novel transcript specifically expressed in breast tissue was further evaluated using RNA in situ hybridization on a panel of breast tumors. It was shown to correlate with low tumor grade and estrogen receptor expression, thereby representing a potentially important new breast cancer biomarker. CONCLUSIONS: This study provides the first large survey of long non-coding RNA expression within a panel of solid cancers and also identifies a number of novel transcribed regions differentially expressed across distinct cancer types that represent candidate biomarkers for future research.

    View details for Web of Science ID 000315867500009

  • Sparse estimation of a covariance matrix BIOMETRIKA Bien, J., Tibshirani, R. J. 2011; 98 (4): 807-820

    Abstract

    We suggest a method for estimating a covariance matrix on the basis of a sample of vectors drawn from a multivariate normal distribution. In particular, we penalize the likelihood with a lasso penalty on the entries of the covariance matrix. This penalty plays two important roles: it reduces the effective number of parameters, which is important even when the dimension of the vectors is smaller than the sample size since the number of parameters grows quadratically in the number of variables, and it produces an estimate which is sparse. In contrast to sparse inverse covariance estimation, our method's close relative, the sparsity attained here is in the covariance matrix itself rather than in the inverse matrix. Zeros in the covariance matrix correspond to marginal independencies; thus, our method performs model selection while providing a positive definite estimate of the covariance. The proposed penalized maximum likelihood problem is not convex, so we use a majorize-minimize approach in which we iteratively solve convex approximations to the original nonconvex problem. We discuss tuning parameter selection and demonstrate on a flow-cytometry dataset how our method produces an interpretable graphical display of the relationship between variables. We perform simulations that suggest that simple elementwise thresholding of the empirical covariance matrix is competitive with our method for identifying the sparsity structure. Additionally, we show how our method can be used to solve a previously studied special case in which a desired sparsity pattern is prespecified.

    View details for DOI 10.1093/biomet/asr054

    View details for Web of Science ID 000297366000004

    View details for PubMedCentralID PMC3413177

  • Sparse estimation of a covariance matrix. Biometrika Bien, J., Tibshirani, R. J. 2011; 98 (4): 807-820

    Abstract

    We suggest a method for estimating a covariance matrix on the basis of a sample of vectors drawn from a multivariate normal distribution. In particular, we penalize the likelihood with a lasso penalty on the entries of the covariance matrix. This penalty plays two important roles: it reduces the effective number of parameters, which is important even when the dimension of the vectors is smaller than the sample size since the number of parameters grows quadratically in the number of variables, and it produces an estimate which is sparse. In contrast to sparse inverse covariance estimation, our method's close relative, the sparsity attained here is in the covariance matrix itself rather than in the inverse matrix. Zeros in the covariance matrix correspond to marginal independencies; thus, our method performs model selection while providing a positive definite estimate of the covariance. The proposed penalized maximum likelihood problem is not convex, so we use a majorize-minimize approach in which we iteratively solve convex approximations to the original nonconvex problem. We discuss tuning parameter selection and demonstrate on a flow-cytometry dataset how our method produces an interpretable graphical display of the relationship between variables. We perform simulations that suggest that simple elementwise thresholding of the empirical covariance matrix is competitive with our method for identifying the sparsity structure. Additionally, we show how our method can be used to solve a previously studied special case in which a desired sparsity pattern is prespecified.

    View details for DOI 10.1093/biomet/asr054

    View details for PubMedID 23049130

    View details for PubMedCentralID PMC3413177

  • PROTOTYPE SELECTION FOR INTERPRETABLE CLASSIFICATION ANNALS OF APPLIED STATISTICS Bien, J., Tibshirani, R. 2011; 5 (4): 2403-2424

    View details for DOI 10.1214/11-AOAS495

    View details for Web of Science ID 000300382800008

  • A fused lasso latent feature model for analyzing multi-sample aCGH data BIOSTATISTICS Nowak, G., Hastie, T., Pollack, J. R., Tibshirani, R. 2011; 12 (4): 776-791

    Abstract

    Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Some simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.

    View details for DOI 10.1093/biostatistics/kxr012

    View details for Web of Science ID 000294806800014

    View details for PubMedID 21642389

  • Hierarchical Clustering With Prototypes via Minimax Linkage JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Bien, J., Tibshirani, R. 2011; 106 (495): 1075-1084
  • Prediction of survival in diffuse large B-cell lymphoma based on the expression of 2 genes reflecting tumor and microenvironment BLOOD Alizadeh, A. A., Gentles, A. J., Alencar, A. J., Liu, C. L., Kohrt, H. E., Houot, R., Goldstein, M. J., Zhao, S., Natkunam, Y., Advani, R. H., Gascoyne, R. D., Briones, J., Tibshirani, R. J., Myklebust, J. H., Plevritis, S. K., Lossos, I. S., Levy, R. 2011; 118 (5): 1350-1358

    Abstract

    Several gene-expression signatures predict survival in diffuse large B-cell lymphoma (DLBCL), but the lack of practical methods for genome-scale analysis has limited translation to clinical practice. We built and validated a simple model using one gene expressed by tumor cells and another expressed by host immune cells, assessing added prognostic value to the clinical International Prognostic Index (IPI). LIM domain only 2 (LMO2) was validated as an independent predictor of survival and the "germinal center B cell-like" subtype. Expression of tumor necrosis factor receptor superfamily member 9 (TNFRSF9) from the DLBCL microenvironment was the best gene in bivariate combination with LMO2. Study of TNFRSF9 tissue expression in 95 patients with DLBCL showed expression limited to infiltrating T cells. A model integrating these 2 genes was independent of "cell-of-origin" classification, "stromal signatures," IPI, and added to the predictive power of the IPI. A composite score integrating these genes with IPI performed well in 3 independent cohorts of 545 DLBCL patients, as well as in a simple assay of routine formalin-fixed specimens from a new validation cohort of 147 patients with DLBCL. We conclude that the measurement of a single gene expressed by tumor cells (LMO2) and a single gene expressed by the immune microenvironment (TNFRSF9) powerfully predicts overall survival in patients with DLBCL.

    View details for DOI 10.1182/blood-2011-03-345272

    View details for PubMedID 21670469

  • NOVEL CELL-TYPE SPECIFIC DECONVOLUTION OF WHOLE-BLOOD GENE EXPRESSION PROFILES IN RENAL ACUTE REJECTION Khatri, P., Shen-Orr, S., Tibshirani, R., Butte, A., Sarwal, M. WILEY-BLACKWELL. 2011: 79–80
  • MicroRNAs Are Independent Predictors of Outcome in Diffuse Large B-Cell Lymphoma Patients Treated with R-CHOP CLINICAL CANCER RESEARCH Alencar, A. J., Malumbres, R., Kozloski, G. A., Advani, R., Talreja, N., Chinichian, S., Briones, J., Natkunam, Y., Sehn, L. H., Gascoyne, R. D., Tibshirani, R., Lossos, I. S. 2011; 17 (12): 4125-4135

    Abstract

    Diffuse large B-cell lymphoma (DLBCL) heterogeneity has prompted investigations for new biomarkers that can accurately predict survival. A previously reported 6-gene model combined with the International Prognostic Index (IPI) could predict patients' outcome. However, even these predictors are not capable of unambiguously identifying outcome, suggesting that additional biomarkers might improve their predictive power. We studied expression of 11 microRNAs (miRNA) that had previously been reported to have variable expression in DLBCL tumors. We measured the expression of each miRNA by quantitative real-time PCR analyses in 176 samples from uniformly treated DLBCL patients and correlated the results to survival. In a univariate analysis, the expression of miR-18a correlated with overall survival (OS), whereas the expression of miR-181a and miR-222 correlated with progression-free survival (PFS). A multivariate Cox regression analysis including the IPI, the 6-gene model-derived mortality predictor score and expression of the miR-18a, miR-181a, and miR-222, revealed that all variables were independent predictors of survival except the expression of miR-222 for OS and the expression of miR-18a for PFS. The expression of specific miRNAs may be useful for DLBCL survival prediction and their role in the pathogenesis of this disease should be examined further.

    View details for DOI 10.1158/1078-0432.CCR-11-0224

    View details for Web of Science ID 000291644700029

    View details for PubMedID 21525173

    View details for PubMedCentralID PMC3117929

  • THE SOLUTION PATH OF THE GENERALIZED LASSO ANNALS OF STATISTICS Tibshirani, R. J., Taylor, J. 2011; 39 (3): 1335-1371

    View details for DOI 10.1214/11-AOS878

    View details for Web of Science ID 000293716500001

  • Regularization Paths for Cox's Proportional Hazards Model via Coordinate Descent JOURNAL OF STATISTICAL SOFTWARE Simon, N., Friedman, J., Hastie, T., Tibshirani, R. 2011; 39 (5): 1-13

    Abstract

    We introduce a pathwise algorithm for the Cox proportional hazards model, regularized by convex combinations of ℓ1 and ℓ2 penalties (elastic net). Our algorithm fits via cyclical coordinate descent, and employs warm starts to find a solution along a regularization path. We demonstrate the efficacy of our algorithm on real and simulated data sets, and find considerable speedup between our algorithm and competing methods.

    View details for Web of Science ID 000288204000001

    View details for PubMedCentralID PMC4824408

  • Human transcriptome array for high-throughput clinical studies PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Xu, W., Seok, J., Mindrinos, M. N., Schweitzer, A. C., Jiang, H., Wilhelmy, J., Clark, T. A., Kapur, K., Xing, Y., Faham, M., Storey, J. D., Moldawer, L. L., Maier, R. V., Tompkins, R. G., Wong, W. H., Davis, R. W., Xiao, W. 2011; 108 (9): 3707-3712

    Abstract

    A 6.9 million-feature oligonucleotide array of the human transcriptome [Glue Grant human transcriptome (GG-H array)] has been developed for high-throughput and cost-effective analyses in clinical studies. This array allows comprehensive examination of gene expression and genome-wide identification of alternative splicing as well as detection of coding SNPs and noncoding transcripts. The performance of the array was examined and compared with mRNA sequencing (RNA-Seq) results over multiple independent replicates of liver and muscle samples. Compared with RNA-Seq of 46 million uniquely mappable reads per replicate, the GG-H array is highly reproducible in estimating gene and exon abundance. Although both platforms detect similar expression changes at the gene level, the GG-H array is more sensitive at the exon level. Deeper sequencing is required to adequately cover low-abundance transcripts. The array has been implemented in a multicenter clinical program and has generated high-quality, reproducible data. Considering the clinical trial requirements of cost, sample availability, and throughput, the GG-H array has a wide range of applications. An emerging approach for large-scale clinical genomic studies is to first use RNA-Seq to the sufficient depth for the discovery of transcriptome elements relevant to the disease process followed by high-throughput and reliable screening of these elements on thousands of patient samples using custom-designed arrays.

    View details for DOI 10.1073/pnas.1019753108

    View details for Web of Science ID 000287844400051

    View details for PubMedID 21317363

    View details for PubMedCentralID PMC3048146

  • The Prognostic Value of Tumor-Associated Macrophages in Leiomyosarcoma: A Single Institution Study AMERICAN JOURNAL OF CLINICAL ONCOLOGY-CANCER CLINICAL TRIALS Ganjoo, K. N., Witten, D., Patel, M., Espinosa, I., La, T., Tibshirani, R., van de Rijn, M., Jacobs, C., West, R. B. 2011; 34 (1): 82-86

    Abstract

    High numbers of tumor-associated macrophages (TAMs) have been associated with poor outcome in several solid tumors. In 2 previous studies, we showed that colony stimulating factor-1 (CSF1) is secreted by leiomyosarcoma (LMS) and that the increase in macrophages and CSF1 associated proteins are markers for poor prognosis in both gynecologic and nongynecologic LMS in a multicentered study. The purpose of this study is to evaluate the outcome of patients with LMS from a single institution according to the number of TAMs evaluated through 3 CSF1 associated proteins. Patients with LMS treated at Stanford University with adequate archived tissue and clinical data were eligible for this retrospective study. Data from chart reviews included tumor site, size, grade, stage, treatment, and disease status at the time of last follow-up. The 3 CSF1 associated proteins (CD163, CD16, and cathepsin L) were evaluated by immunohistochemistry on tissue microarrays. Kaplan-Meier survival curves and univariate Cox proportional hazards models were fit to assess the association of clinical predictors as well as CSF1 associated proteins with overall survival. A total of 52 patients diagnosed from 1983 to 2007 were evaluated. Univariate Cox proportional hazards models were fit to assess the significance of grade, size, stage, and the 3 CSF1 associated proteins in predicting OS. Grade, size, and stage were not significantly associated with survival in the full patient cohort, but grade and stage were significant predictors of survival in the gynecologic (GYN) LMS samples (P = 0.038 and P = 0.0164, respectively). Increased cathepsin L was associated with a worse outcome in GYN LMS (P = 0.049). Similar findings were seen with CD16 (P < 0.0001). In addition, CSF1 response enriched (all 3 stains positive) GYN LMS had a poor overall survival when compared with CSF1 response poor tumors (P = 0.001). These results were not seen in non-GYN LMS. Our data form an independent confirmation of the prognostic significance of TAMs and the CSF1 associated proteins in LMS. More aggressive or targeted therapies could be considered in the subset of LMS patients that highly express these markers.

    View details for DOI 10.1097/COC.0b013e3181d26d5e

    View details for PubMedID 23781555

  • Nearly-Isotonic Regression TECHNOMETRICS Tibshirani, R. J., Hoefling, H., Tibshirani, R. 2011; 53 (1): 54-61
  • Adaptive index models for marker-based risk stratification BIOSTATISTICS Tian, L., Tibshirani, R. 2011; 12 (1): 68-86

    Abstract

    We use the term "index predictor" to denote a score that consists of K binary rules such as "age > 60" or "blood pressure > 120 mm Hg." The index predictor is the sum of these binary scores, yielding a value from 0 to K. Such indices are often used in clinical studies to stratify population risk: they are usually derived from subject area considerations. In this paper, we propose a fast data-driven procedure for automatically constructing such indices for linear, logistic, and Cox regression models. We also extend the procedure to create indices for detecting treatment-marker interactions. The methods are illustrated on a study with protein biomarkers as well as a large microarray gene expression study.

    View details for DOI 10.1093/biostatistics/kxq047

    View details for Web of Science ID 000285625800005

    View details for PubMedID 20663850

    View details for PubMedCentralID PMC3006126

  • Regression shrinkage and selection via the lasso: a retrospective JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Tibshirani, R. 2011; 73: 273-282
  • Penalized classification using Fisher's linear discriminant JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Witten, D. M., Tibshirani, R. 2011; 73: 753-772
  • Bayesian gene set analysis for identifying significant biological pathways JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES C-APPLIED STATISTICS Shahbaba, B., Tibshirani, R., Shachaf, C. M., Plevritis, S. K. 2011; 60: 541-557

    Abstract

    We propose a hierarchical Bayesian model for analyzing gene expression data to identify pathways differentiating between two biological states (e.g., cancer vs. non-cancer and mutant vs. normal). Finding significant pathways can improve our understanding of biological processes. When the biological process of interest is related to a specific disease, eliciting a better understanding of the underlying pathways can lead to designing a more effective treatment. We apply our method to data obtained by interrogating the mutational status of p53 in 50 cancer cell lines (33 mutated and 17 normal). We identify several significant pathways with strong biological connections. We show that our approach provides a natural framework for incorporating prior biological information, and it has the best overall performance in terms of correctly identifying significant pathways compared to several alternative methods.

    View details for DOI 10.1111/j.1467-9876.2011.00765.x

    View details for Web of Science ID 000293235800004

    View details for PubMedCentralID PMC3156489

  • Supervised multidimensional scaling for visualization, classification, and bipartite ranking COMPUTATIONAL STATISTICS & DATA ANALYSIS Witten, D. M., Tibshirani, R. 2011; 55 (1): 789-801
  • A statistician plays darts JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES A-STATISTICS IN SOCIETY Tibshirani, R. J., Price, A., Taylor, J. 2011; 174: 213-226
  • In Situ Vaccination with TLR9 Agonist Combined with Local Radiation In Mycosis Fungoides: Analysis of Phase I/II Study 52nd Annual Meeting and Exposition of the American-Society-of-Hematology (ASH) Kim, Y. H., Gratzinger, D., Harrison, C., Brody, J., Czerwinski, D., Xing, L., Morales, A., Ai, W., Abdulla, F., Navi, D., Tibshirani, R. J., Advani, R., Natkunam, Y., Hoppe, R. T., Levy, R. AMER SOC HEMATOLOGY. 2010: 130–30
  • Prediction of Survival In Diffuse Large B-Cell Lymphoma Based On the Expression of Two Genes Reflecting Tumor and Microenvironment 52nd Annual Meeting and Exposition of the American-Society-of-Hematology (ASH) Alizadeh, A. A., Gentles, A. J., Alencar, A. J., Kohrt, H. E., Houot, R., Goldstein, M. J., Zhao, S., Natkunam, Y., Advani, R., Gascoyne, R. D., Briones, J., Tibshirani, R. J., Myklebust, J. H., Plevritis, S. K., Lossos, I. S., Levy, R. AMER SOC HEMATOLOGY. 2010: 836–37
  • In Situ Vaccination With a TLR9 Agonist Induces Systemic Lymphoma Regression: A Phase I/II Study JOURNAL OF CLINICAL ONCOLOGY Brody, J. D., Ai, W. Z., Czerwinski, D. K., Torchia, J. A., Levy, M., Advani, R. H., Kim, Y. H., Hoppe, R. T., Knox, S. J., Shin, L. K., Wapnir, I., Tibshirani, R. J., Levy, R. 2010; 28 (28): 4324-4332

    Abstract

    Combining tumor antigens with an immunostimulant can induce the immune system to specifically eliminate cancer cells. Generally, this combination is accomplished in an ex vivo, customized manner. In a preclinical lymphoma model, intratumoral injection of a Toll-like receptor 9 (TLR9) agonist induced systemic antitumor immunity and cured large, disseminated tumors. We treated 15 patients with low-grade B-cell lymphoma using low-dose radiotherapy to a single tumor site and, at that same site, injected the C-G enriched, synthetic oligodeoxynucleotide (also referred to as CpG) TLR9 agonist PF-3512676. Clinical responses were assessed at distant, untreated tumor sites. Immune responses were evaluated by measuring T-cell activation after in vitro restimulation with autologous tumor cells. This in situ vaccination maneuver was well-tolerated with only grade 1 to 2 local or systemic reactions and no treatment-limiting adverse events. One patient had a complete clinical response, three others had partial responses, and two patients had stable but continually regressing disease for periods significantly longer than that achieved with prior therapies. Vaccination induced tumor-reactive memory CD8 T cells. Some patients' tumors were able to induce a suppressive, regulatory phenotype in autologous T cells in vitro; these patients tended to have a shorter time to disease progression. One clinically responding patient received a second course of vaccination after relapse resulting in a second, more rapid clinical response. In situ tumor vaccination with a TLR9 agonist induces systemic antilymphoma clinical responses. This maneuver is clinically feasible and does not require the production of a customized vaccine product.

    View details for DOI 10.1200/JCO.2010.28.9793

    View details for Web of Science ID 000282272700032

    View details for PubMedID 20697067

    View details for PubMedCentralID PMC2954133

  • Spectral Regularization Algorithms for Learning Large Incomplete Matrices JOURNAL OF MACHINE LEARNING RESEARCH Mazumder, R., Hastie, T., Tibshirani, R. 2010; 11: 2287-2322

    Abstract

    We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm Soft-Impute iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices: for example it can obtain a rank-80 approximation of a 10^6 × 10^6 incomplete matrix with 10^5 observed entries in 2.5 hours, and can fit a rank 40 approximation to the full Netflix training set in 6.6 hours. Our methods show very good performance both in training and test error when compared to other competitive state-of-the-art techniques.
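
    The iteration described in the abstract can be written compactly. The notation below is an assumption for illustration and does not appear in the abstract: X is the partially observed matrix, Ω the set of observed entries, P_Ω the projection that keeps observed entries and zeroes the rest, and λ the regularization parameter. The problem is shown in its Lagrangian form, which corresponds to the nuclear-norm bound mentioned in the abstract.

```latex
% Nuclear-norm regularized matrix completion (Lagrangian form)
\min_{Z}\; \tfrac{1}{2}\,\bigl\| P_\Omega(X) - P_\Omega(Z) \bigr\|_F^2 \;+\; \lambda\,\| Z \|_*

% Soft-Impute update: fill the missing entries with the current estimate,
% then apply a soft-thresholded SVD
Z^{\mathrm{new}} \;\leftarrow\; S_\lambda\!\bigl( P_\Omega(X) + P_\Omega^{\perp}(Z^{\mathrm{old}}) \bigr)

% Soft-thresholded SVD of a matrix W = U \,\mathrm{diag}(d_1,\dots,d_r)\, V^{\top}
S_\lambda(W) \;=\; U \,\mathrm{diag}\bigl( (d_1-\lambda)_+,\dots,(d_r-\lambda)_+ \bigr)\, V^{\top}
```

    Starting each value of λ from the solution at the previous grid value gives the warm starts referred to in the abstract.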

    View details for Web of Science ID 000282523300010

    View details for PubMedCentralID PMC3087301

  • Analysis of factorial time-course microarrays with application to a clinical study of burn injury. Proceedings of the National Academy of Sciences of the United States of America Zhou, B., Xu, W., Herndon, D., Tompkins, R., Davis, R., Xiao, W., Wong, W. H., Toner, M., Warren, H. S., Schoenfeld, D. A., Rahme, L., McDonald-Smith, G. P., Hayden, D., Mason, P., Fagan, S., Yu, Y., Cobb, J. P., Remick, D. G., Mannick, J. A., Lederer, J. A., Gamelli, R. L., Silver, G. M., West, M. A., Shapiro, M. B., Smith, R., Camp, D. G., Qian, W., Storey, J., Mindrinos, M., Tibshirani, R., Lowry, S., Calvano, S., Chaudry, I., West, M. A., Cohen, M., Moore, E. E., Johnson, J., Moldawer, L. L., Baker, H. V., Efron, P. A., Balis, U. G., Billiar, T. R., Ochoa, J. B., Sperry, J. L., Miller-Graziano, C. L., De, A. K., Bankey, P. E., Finnerty, C. C., Jeschke, M. G., Minei, J. P., Arnoldo, B. D., Hunt, J. L., Horton, J., Cobb, J. P., Brownstein, B., Freeman, B., Maier, R. V., Nathens, A. B., Cuschieri, J., Gibran, N., Klein, M., O'Keefe, G. 2010; 107 (22): 9923-9928

    Abstract

    Time-course microarray experiments are capable of capturing dynamic gene expression profiles. It is important to study how these dynamic profiles depend on the multiple factors that characterize the experimental condition under which the time course is observed. Analytic methods are needed to simultaneously handle the time course and factorial structure in the data. We developed a method to evaluate factor effects by pooling information across the time course while accounting for multiple testing and nonnormality of the microarray data. The method effectively extracts gene-specific response features and models their dependency on the experimental factors. Both longitudinal and cross-sectional time-course data can be handled by our approach. The method was used to analyze the impact of age on the temporal gene response to burn injury in a large-scale clinical study. Our analysis reveals that 21% of the genes responsive to burn are age-specific, among which expressions of mitochondria and immunoglobulin genes are differentially perturbed in pediatric and adult patients by burn injury. These new findings in the body's response to burn injury between children and adults support further investigations of therapeutic options targeting specific age groups. The methodology proposed here has been implemented in R package "TANOVA" and submitted to the Comprehensive R Archive Network at http://www.r-project.org/. It is also available for download at http://gluegrant1.stanford.edu/TANOVA/.

    View details for DOI 10.1073/pnas.1002757107

    View details for PubMedID 20479259

    View details for PubMedCentralID PMC2890487

  • TRANSPOSABLE REGULARIZED COVARIANCE MODELS WITH AN APPLICATION TO MISSING DATA IMPUTATION. The annals of applied statistics Allen, G. I., Tibshirani, R. 2010; 4 (2): 764-790

    Abstract

    Missing data estimation is an important challenge with high-dimensional data arranged in the form of a matrix. Typically this data matrix is transposable, meaning that either the rows, columns or both can be treated as features. To model transposable data, we present a modification of the matrix-variate normal, the mean-restricted matrix-variate normal, in which the rows and columns each have a separate mean vector and covariance matrix. By placing additive penalties on the inverse covariance matrices of the rows and columns, these so-called transposable regularized covariance models allow for maximum likelihood estimation of the mean and non-singular covariance matrices. Using these models, we formulate EM-type algorithms for missing data imputation in both the multivariate and transposable frameworks. We present theoretical results exploiting the structure of our transposable models that allow these models and imputation methods to be applied to high-dimensional data. Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility.

    View details for DOI 10.1214/09-AOAS314

    View details for PubMedID 26877823

    View details for PubMedCentralID PMC4751046

  • A Framework for Feature Selection in Clustering JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Witten, D. M., Tibshirani, R. 2010; 105 (490): 713-726
  • TRANSPOSABLE REGULARIZED COVARIANCE MODELS WITH AN APPLICATION TO MISSING DATA IMPUTATION ANNALS OF APPLIED STATISTICS Allen, G. I., Tibshirani, R. 2010; 4 (2): 764-790

    View details for DOI 10.1214/09-AOAS314

    View details for Web of Science ID 000283528500011

  • Ultra-high throughput sequencing-based small RNA discovery and discrete statistical biomarker analysis in a collection of cervical tumours and matched controls BMC BIOLOGY Witten, D., Tibshirani, R., Gu, S. G., Fire, A., Lui, W. 2010; 8

    Abstract

    Ultra-high throughput sequencing technologies provide opportunities both for discovery of novel molecular species and for detailed comparisons of gene expression patterns. Small RNA populations are particularly well suited to this analysis, as many different small RNAs can be completely sequenced in a single instrument run. We prepared small RNA libraries from 29 tumour/normal pairs of human cervical tissue samples. Analysis of the resulting sequences (42 million in total) defined 64 new human microRNA (miRNA) genes. Both arms of the hairpin precursor were observed in twenty-three of the newly identified miRNA candidates. We tested several computational approaches for the analysis of class differences between high throughput sequencing datasets and describe a novel application of a log linear model that has provided the most effective analysis for this data. This method resulted in the identification of 67 miRNAs that were differentially expressed between the tumour and normal samples at a false discovery rate less than 0.001. This approach can potentially be applied to any kind of RNA sequencing data for analysing differential sequence representation between biological sample sets.

    View details for DOI 10.1186/1741-7007-8-58

    View details for Web of Science ID 000279780700001

    View details for PubMedID 20459774

    View details for PubMedCentralID PMC2880020

  • Cell type-specific gene expression differences in complex tissues NATURE METHODS Shen-Orr, S. S., Tibshirani, R., Khatri, P., Bodian, D. L., Staedtler, F., Perry, N. M., Hastie, T., Sarwal, M. M., Davis, M. M., Butte, A. J. 2010; 7 (4): 287-289

    Abstract

    We describe cell type-specific significance analysis of microarrays (csSAM) for analyzing differential gene expression for each cell type in a biological sample from microarray data and relative cell-type frequencies. First, we validated csSAM with predesigned mixtures and then applied it to whole-blood gene expression datasets from stable post-transplant kidney transplant recipients and those experiencing acute transplant rejection, which revealed hundreds of differentially expressed genes that were otherwise undetectable.

    View details for DOI 10.1038/NMETH.1439

    View details for Web of Science ID 000276150600017

    View details for PubMedID 20208531

  • Novel Cell-Type Specific Deconvolution of Whole-Blood Gene Expression Profiles in Renal Acute Rejection 10th American Transplant Congress Khatri, P., Shen-Orr, S., Tibshirani, R., Butte, A. J., Sarwal, M. M. WILEY-BLACKWELL. 2010: 294–294
  • C-C Chemokine Receptor 1 Expression in Human Hematolymphoid Neoplasia AMERICAN JOURNAL OF CLINICAL PATHOLOGY Anderson, M. W., Zhao, S., Ai, W. Z., Tibshirani, R., Levy, R., Lossos, I. S., Natkunam, Y. 2010; 133 (3): 473-483

    Abstract

    Chemokine receptor 1 (CCR1) is a G protein-coupled receptor that binds to members of the C-C chemokine family. Recently, CCL3 (MIP-1alpha), a high-affinity CCR1 ligand, was identified as part of a model that independently predicts survival in patients with diffuse large B-cell lymphoma (DLBCL). However, the role of chemokine signaling in the pathogenesis of human lymphomas is unclear. In normal human hematopoietic tissues, we found CCR1 expression in intraepithelial B cells of human tonsil and granulocytic/monocytic cells in the bone marrow. Immunohistochemical analysis of 944 cases of hematolymphoid neoplasia identified CCR1 expression in a subset of B- and T-cell lymphomas, plasma cell myeloma, acute myeloid leukemia, and classical Hodgkin lymphoma. CCR1 expression correlated with the non-germinal center subtype of DLBCL but did not predict overall survival in follicular lymphoma. These data suggest that CCR1 may be useful for lymphoma classification and support a role for chemokine signaling in the pathogenesis of hematolymphoid neoplasia.

    View details for DOI 10.1309/AJCP1TA3FLOQTMHF

    View details for Web of Science ID 000274687800016

    View details for PubMedID 20154287

  • Spectral Regularization Algorithms for Learning Large Incomplete Matrices. Journal of machine learning research : JMLR Mazumder, R., Hastie, T., Tibshirani, R. 2010; 11: 2287-2322

    View details for PubMedID 21552465

    View details for PubMedCentralID PMC3087301

  • Discovery of molecular subtypes in leiomyosarcoma through integrative molecular profiling ONCOGENE Beck, A. H., Lee, C., Witten, D. M., Gleason, B. C., Edris, B., Espinosa, I., Zhu, S., Li, R., Montgomery, K. D., Marinelli, R. J., Tibshirani, R., Hastie, T., Jablons, D. M., Rubin, B. P., Fletcher, C. D., West, R. B., van de Rijn, M. 2010; 29 (6): 845-854

    Abstract

    Leiomyosarcoma (LMS) is a soft tissue tumor with a significant degree of morphologic and molecular heterogeneity. We used integrative molecular profiling to discover and characterize molecular subtypes of LMS. Gene expression profiling was performed on 51 LMS samples. Unsupervised clustering showed three reproducible LMS clusters. Array comparative genomic hybridization (aCGH) was performed on 20 LMS samples and showed that the molecular subtypes defined by gene expression showed distinct genomic changes. Tumors from the 'muscle-enriched' cluster showed significantly increased copy number changes (P=0.04). A majority of the muscle-enriched cases showed loss at 16q24, which contains Fanconi anemia, complementation group A, known to have an important role in DNA repair, and loss at 1p36, which contains PRDM16, of which loss promotes muscle differentiation. Immunohistochemistry (IHC) was performed on LMS tissue microarrays (n=377) for five markers with high levels of messenger RNA in the muscle-enriched cluster (ACTG2, CASQ2, SLMAP, CFL2 and MYLK) and showed significantly correlated expression of the five proteins (all pairwise P<0.005). Expression of the five markers was associated with improved disease-specific survival in a multivariate Cox regression analysis (P<0.04). In this analysis that combined gene expression profiling, aCGH and IHC, we characterized distinct molecular LMS subtypes, provided insight into their pathogenesis, and identified prognostic biomarkers.

    View details for DOI 10.1038/onc.2009.381

    View details for Web of Science ID 000274397800007

    View details for PubMedID 19901961

    View details for PubMedCentralID PMC2820592

  • Survival analysis with high-dimensional covariates STATISTICAL METHODS IN MEDICAL RESEARCH Witten, D. M., Tibshirani, R. 2010; 19 (1): 29-51

    Abstract

    In recent years, breakthroughs in biomedical technology have led to a wealth of data in which the number of features (for instance, genes on which expression measurements are available) exceeds the number of observations (e.g. patients). Sometimes survival outcomes are also available for those same observations. In this case, one might be interested in (a) identifying features that are associated with survival (in a univariate sense), and (b) developing a multivariate model for the relationship between the features and survival that can be used to predict survival in a new observation. Due to the high dimensionality of this data, most classical statistical methods for survival analysis cannot be applied directly. Here, we review a number of methods from the literature that address these two problems.

    View details for DOI 10.1177/0962280209105024

    View details for Web of Science ID 000274317100003

    View details for PubMedID 19654171

  • CD81 protein is expressed at high levels in normal germinal center B cells and in subtypes of human lymphomas HUMAN PATHOLOGY Luo, R. F., Zhao, S., Tibshirani, R., Myklebust, J. H., Sanyal, M., Fernandez, R., Gratzinger, D., Marinelli, R. J., Lu, Z. S., Wong, A., Levy, R., Levy, S., Natkunam, Y. 2010; 41 (2): 271-280

    Abstract

    CD81 is a tetraspanin cell surface protein that regulates CD19 expression in B lymphocytes and enables hepatitis C virus infection of human cells. Immunohistologic analysis in normal hematopoietic tissue showed strong staining for CD81 in normal germinal center B cells, a cell type in which its increased expression has not been previously recognized. High-dimensional flow cytometry analysis of normal hematopoietic tissue confirmed that among B- and T-cell subsets, germinal center B cells showed the highest level of CD81 expression. In more than 800 neoplastic tissue samples, its expression was also found in most non-Hodgkin lymphomas. Staining for CD81 was rarely seen in multiple myeloma, Hodgkin lymphoma, or myeloid leukemia. In hierarchical cluster analysis of diffuse large B-cell lymphoma, staining for CD81 was most similar to other germinal center B cell-associated markers, particularly LMO2. By flow cytometry, CD81 was expressed in diffuse large B-cell lymphoma cells independent of the presence or absence of CD10, another germinal center B-cell marker. The detection of CD81 in routine biopsy samples and its differential expression in lymphoma subtypes, particularly diffuse large B-cell lymphoma, warrant further study to assess CD81 expression and its role in the risk stratification of patients with diffuse large B-cell lymphoma.

    View details for DOI 10.1016/j.humpath.2009.07.022

    View details for Web of Science ID 000276493600015

    View details for PubMedID 20004001

    View details for PubMedCentralID PMC2813949

  • DR-Integrator: a new analytic tool for integrating DNA copy number and gene expression data BIOINFORMATICS Salari, K., Tibshirani, R., Pollack, J. R. 2010; 26 (3): 414-416

    Abstract

    DNA copy number alterations (CNA) frequently underlie gene expression changes by increasing or decreasing gene dosage. However, only a subset of genes with altered dosage exhibit concordant changes in gene expression. This subset is likely to be enriched for oncogenes and tumor suppressor genes, and can be identified by integrating these two layers of genome-scale data. We introduce DNA/RNA-Integrator (DR-Integrator), a statistical software tool to perform integrative analyses on paired DNA copy number and gene expression data. DR-Integrator identifies genes with significant correlations between DNA copy number and gene expression, and implements a supervised analysis that captures genes with significant alterations in both DNA copy number and gene expression between two sample classes. DR-Integrator is freely available for non-commercial use from the Pollack Lab at http://pollacklab.stanford.edu/ and can be downloaded as a plug-in application to Microsoft Excel and as a package for the R statistical computing environment. The R package is available under the name 'DRI' at http://cran.r-project.org/. An example analysis using DR-Integrator is included as supplemental material. Supplementary data are available at Bioinformatics online.
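
    The core idea of pairing copy number with expression gene by gene can be illustrated with a plain Pearson correlation. This is only a sketch of the general idea: it is not DR-Integrator's code, it is not known from the abstract which correlation statistic or significance procedure the tool uses, and the gene names and values below are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation between two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired measurements across the same 4 samples:
# gene -> (DNA copy number, gene expression).
paired: Dict[str, Tuple[List[float], List[float]]] = {
    "GENE_A": ([1, 2, 3, 4], [2, 4, 6, 8]),  # expression tracks dosage
    "GENE_B": ([1, 2, 3, 4], [5, 3, 6, 4]),  # expression unrelated to dosage
}

correlations = {gene: pearson(cn, expr) for gene, (cn, expr) in paired.items()}

# Rank genes from most to least concordant.
ranked = sorted(correlations, key=correlations.get, reverse=True)
print(ranked)
```

    In a real analysis the correlations would also need a significance assessment with multiple-testing control, which this sketch omits.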

    View details for DOI 10.1093/bioinformatics/btp702

    View details for Web of Science ID 000274342800021

    View details for PubMedID 20031972

    View details for PubMedCentralID PMC2815664

  • Regularization Paths for Generalized Linear Models via Coordinate Descent JOURNAL OF STATISTICAL SOFTWARE Friedman, J., Hastie, T., Tibshirani, R. 2010; 33 (1): 1-22

    Abstract

    We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.
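
    As a rough sketch of cyclical coordinate descent for the linear-regression case, the code below updates one coefficient at a time by soft-thresholding. It is a plain-Python illustration under stated assumptions, not the authors' implementation: there is no intercept (the data are assumed centered), no column is identically zero, a single penalty value is used instead of a path, and the names `elastic_net_cd`, `lam`, and `alpha` are chosen here for illustration.

```python
from typing import List


def soft_threshold(z: float, gamma: float) -> float:
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X: List[List[float]], y: List[float], lam: float,
                   alpha: float = 1.0, n_sweeps: int = 200) -> List[float]:
    """Cyclical coordinate descent for the objective

        (1 / 2N) * sum_i (y_i - x_i . beta)^2
            + lam * (alpha * |beta|_1 + (1 - alpha) / 2 * |beta|_2^2)

    alpha = 1 gives the lasso, alpha = 0 gives ridge regression.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Inner product of feature j with the partial residual
            # (the residual with feature j's own contribution added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new_bj = soft_threshold(rho, lam * alpha) / (col_ss[j] + lam * (1.0 - alpha))
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new_bj
    return beta


# Toy centered data with orthogonal columns: y = 2 * x1 + 0.1 * x2.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [2.1, 1.9, -1.9, -2.1]

beta_lasso = elastic_net_cd(X, y, lam=0.5, alpha=1.0)
beta_ols = elastic_net_cd(X, y, lam=0.0, alpha=1.0)
print(beta_lasso)  # roughly [1.5, 0.0]: the small coefficient is zeroed out
print(beta_ols)    # roughly [2.0, 0.1]: no penalty recovers least squares
```

    Keeping the residual vector up to date makes each coordinate update cost O(N). Computing a whole regularization path would amount to calling the same loop over a decreasing grid of `lam` values, starting each fit from the previous solution.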

    View details for Web of Science ID 000275203200001

    View details for PubMedCentralID PMC2929880

  • 3'-End Sequencing for Expression Quantification (3SEQ) from Archival Tumor Samples PLOS ONE Beck, A. H., Weng, Z., Witten, D. M., Zhu, S., Foley, J. W., Lacroute, P., Smith, C. L., Tibshirani, R., van de Rijn, M., Sidow, A., West, R. B. 2010; 5 (1)

    Abstract

    Gene expression microarrays are the most widely used technique for genome-wide expression profiling. However, microarrays do not perform well on formalin fixed paraffin embedded tissue (FFPET). Consequently, microarrays cannot be effectively utilized to perform gene expression profiling on the vast majority of archival tumor samples. To address this limitation of gene expression microarrays, we designed a novel procedure (3'-end sequencing for expression quantification (3SEQ)) for gene expression profiling from FFPET using next-generation sequencing. We performed gene expression profiling by 3SEQ and microarray on both frozen tissue and FFPET from two soft tissue tumors (desmoid type fibromatosis (DTF) and solitary fibrous tumor (SFT)) (total n = 23 samples, which were each profiled by at least one of the four platform-tissue preparation combinations). Analysis of 3SEQ data revealed many genes differentially expressed between the tumor types (FDR<0.01) on both the frozen tissue (approximately 9.6K genes) and FFPET (approximately 8.1K genes). Analysis of microarray data from frozen tissue revealed fewer differentially expressed genes (approximately 4.64K), and analysis of microarray data on FFPET revealed very few (69) differentially expressed genes. Functional gene set analysis of 3SEQ data from both frozen tissue and FFPET identified biological pathways known to be important in DTF and SFT pathogenesis and suggested several additional candidate oncogenic pathways in these tumors. These findings demonstrate that 3SEQ is an effective technique for gene expression profiling from archival tumor samples and may facilitate significant advances in translational cancer research.

    View details for DOI 10.1371/journal.pone.0008768

    View details for PubMedID 20098735

  • Predicting Patient Survival from Longitudinal Gene Expression STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY Zhang, Y., Tibshirani, R. J., Davis, R. W. 2010; 9 (1)

    Abstract

    Characterizing dynamic gene expression pattern and predicting patient outcome is now significant and will be of more interest in the future with large scale clinical investigation of microarrays. However, there is currently no method that has been developed for prediction of patient outcome using longitudinal gene expression, where gene expression of patients is being monitored across time. Here, we propose a novel prediction approach for patient survival time that makes use of time course structure of gene expression. This method is applied to a burn study. The genes involved in the final predictors are enriched in the inflammatory response and immune system related pathways. Moreover, our method is consistently better than prediction methods using individual time point gene expression or simply pooling gene expression from each time point.

    View details for DOI 10.2202/1544-6115.1617

    View details for Web of Science ID 000284905500002

    View details for PubMedID 21126232

    View details for PubMedCentralID PMC3004784

  • Lymphoma cell VEGFR2 expression detected by immunohistochemistry predicts poor overall survival in diffuse large B cell lymphoma treated with immunochemotherapy (R-CHOP) BRITISH JOURNAL OF HAEMATOLOGY Gratzinger, D., Advani, R., Zhao, S., Talreja, N., Tibshirani, R. J., Shyam, R., Horning, S., Sehn, L. H., Farinha, P., Briones, J., Lossos, I. S., Gascoyne, R. D., Natkunam, Y. 2010; 148 (2): 235-244

    Abstract

    Diffuse large B cell lymphoma (DLBCL) is clinically and biologically heterogeneous. In most cases of DLBCL, lymphoma cells co-express vascular endothelial growth factor (VEGF) and its receptors VEGFR1 and VEGFR2, suggesting autocrine in addition to angiogenic effects. We enumerated microvessel density and scored lymphoma cell expression of VEGF, VEGFR1, VEGFR2 and phosphorylated VEGFR2 in 162 de novo DLBCL patients treated with R-CHOP (rituximab, cyclophosphamide, vincristine, doxorubicin and prednisone)-like regimens. VEGFR2 expression correlated with shorter overall survival (OS) independent of International Prognostic Index (IPI) (P = 0.0028). Phosphorylated VEGFR2 (detected in 13% of cases) correlated with shorter progression-free survival (PFS, P = 0.044) and trended toward shorter OS on univariate analysis. VEGFR1 was not predictive of survival on univariate analysis, but it did correlate with better OS on multivariate analysis with VEGF, VEGFR2 and IPI (P = 0.036); in patients with weak VEGFR2, lack of VEGFR1 coexpression was significantly correlated with poor OS independent of IPI (P = 0.01). These results are concordant with our prior finding of an association of VEGFR1 with longer OS in DLBCL treated with chemotherapy alone. We postulate that VEGFR1 may oppose autocrine VEGFR2 signalling in DLBCL by competing for VEGF binding. In contrast to our prior results with chemotherapy alone, microvessel density was not prognostic of PFS or OS with R-CHOP-like therapy.

    View details for DOI 10.1111/j.1365-2141.2009.07942.x

    View details for PubMedID 19821819

  • Local false discovery rate facilitates comparison of different microarray experiments NUCLEIC ACIDS RESEARCH Hong, W., Tibshirani, R., Chu, G. 2009; 37 (22): 7483-7497

    Abstract

    The local false discovery rate (LFDR) estimates the probability of falsely identifying specific genes with changes in expression. In computer simulations, LFDR <10% successfully identified genes with changes in expression, while LFDR >90% identified genes without changes. We used LFDR to compare different microarray experiments quantitatively: (i) Venn diagrams of genes with and without changes in expression, (ii) scatter plots of the genes, (iii) correlation coefficients in the scatter plots and (iv) distributions of gene function. To illustrate, we compared three methods for pre-processing microarray data. Correlations between methods were high (r = 0.84-0.92). However, responses were often different in magnitude, and sometimes discordant, even though the methods used the same raw data. LFDR complements functional assessments like gene set enrichment analysis. To illustrate, we compared responses to ultraviolet radiation (UV), ionizing radiation (IR) and tobacco smoke. Compared to unresponsive genes, genes responsive to both UV and IR were enriched for cell cycle, mitosis, and DNA repair functions. Genes responsive to UV but not IR were depleted for cell adhesion functions. Genes responsive to tobacco smoke were enriched for detoxification functions. Thus, LFDR reveals differences and similarities among experiments.
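
    For readers unfamiliar with the quantity, the local false discovery rate is commonly defined through a two-group mixture: a gene's test statistic z comes from a null density with prior probability pi0 or from a non-null density otherwise, and LFDR(z) is the posterior probability of the null given z. The sketch below assumes both densities are known normals, which is a simplification made here for illustration. In practice they are estimated from the data, and the abstract does not say how this paper estimates them. All parameter values are hypothetical.

```python
import math


def normal_pdf(z: float, mean: float = 0.0, sd: float = 1.0) -> float:
    """Density of a normal distribution at z."""
    u = (z - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))


def local_fdr(z: float, pi0: float, alt_mean: float, alt_sd: float = 1.0) -> float:
    """LFDR(z) = pi0 * f0(z) / (pi0 * f0(z) + (1 - pi0) * f1(z)).

    f0 is the standard normal null density, f1 an assumed normal
    density for genes with real changes in expression.
    """
    null_part = pi0 * normal_pdf(z)
    alt_part = (1.0 - pi0) * normal_pdf(z, alt_mean, alt_sd)
    return null_part / (null_part + alt_part)


# Hypothetical mixture: 90% of genes unchanged, changed genes centered at z = 3.
for z in (0.0, 2.0, 4.0):
    print(f"z={z}: LFDR={local_fdr(z, pi0=0.9, alt_mean=3.0):.3f}")
```

    Under these assumed parameters, a statistic near 0 has an LFDR above 90% and a statistic of 4 has an LFDR below 10%, matching the two thresholds the abstract uses to call genes unchanged or changed.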

    View details for DOI 10.1093/nar/gkp813

    View details for PubMedID 19825981

  • Relationship of differential gene expression profiles in CD34(+) myelodysplastic syndrome marrow cells to disease subtype and progression BLOOD Sridhar, K., Ross, D. T., Tibshirani, R., Butte, A. J., Greenberg, P. L. 2009; 114 (23): 4847-4858

    Abstract

    Microarray analysis with 40 000 cDNA gene chip arrays determined differential gene expression profiles (GEPs) in CD34(+) marrow cells from myelodysplastic syndrome (MDS) patients compared with healthy persons. Using focused bioinformatics analyses, we found 1175 genes significantly differentially expressed by MDS versus normal, requiring a minimum of 39 genes to separately classify these patients. Major GEP differences were demonstrated between healthy and MDS patients and between several MDS subgroups: (1) those whose disease remained stable and those who subsequently transformed (tMDS) to acute myeloid leukemia; (2) between del(5q) and other MDS patients. A 6-gene "poor risk" signature was defined, which was associated with acute myeloid leukemia transformation and provided additive prognostic information for International Prognostic Scoring System Intermediate-1 patients. Overexpression of genes generating ribosomal proteins and for other signaling pathways was demonstrated in the tMDS patients. Comparison of del(5q) with the remaining MDS patients showed 1924 differentially expressed genes, with underexpression of 1014 genes, 11 of which were within the 5q31-32 commonly deleted region. These data demonstrated (1) GEPs distinguishing MDS patients from healthy and between those with differing clinical outcomes (tMDS vs those whose disease remained stable) and cytogenetics [eg, del(5q)]; and (2) molecular criteria refining prognostic categorization and associated biologic processes in MDS.

    View details for DOI 10.1182/blood-2009-08-236422

    View details for PubMedID 19801443

  • Disease signatures are robust across tissues and experiments MOLECULAR SYSTEMS BIOLOGY Dudley, J. T., Tibshirani, R., Deshpande, T., Butte, A. J. 2009; 5

    Abstract

    Meta-analyses combining gene expression microarray experiments offer new insights into the molecular pathophysiology of disease not evident from individual experiments. Although the established technical reproducibility of microarrays serves as a basis for meta-analysis, pathophysiological reproducibility across experiments is not well established. In this study, we carried out a large-scale analysis of disease-associated experiments obtained from NCBI GEO, and evaluated their concordance across a broad range of diseases and tissue types. On evaluating 429 experiments, representing 238 diseases and 122 tissues from 8435 microarrays, we find evidence for a general, pathophysiological concordance between experiments measuring the same disease condition. Furthermore, we find that the molecular signature of disease across tissues is overall more prominent than the signature of tissue expression across diseases. The results offer new insight into the quality of public microarray data using pathophysiological metrics, and support new directions in meta-analysis that include characterization of the commonalities of disease irrespective of tissue, as well as the creation of multi-tissue systems models of disease pathology using public data.

    View details for DOI 10.1038/msb.2009.66

    View details for Web of Science ID 000270456400006

    View details for PubMedID 19756046

    View details for PubMedCentralID PMC2758720

  • A Network Model of a Cooperative Genetic Landscape in Brain Tumors JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION Bredel, M., Scholtens, D. M., Harsh, G. R., Bredel, C., Chandler, J. P., Renfrow, J. J., Yadav, A. K., Vogel, H., Scheck, A. C., Tibshirani, R., Sikic, B. I. 2009; 302 (3): 261-275

    Abstract

    Gliomas, particularly glioblastomas, are among the deadliest of human tumors. Gliomas emerge through the accumulation of recurrent chromosomal alterations, some of which target yet-to-be-discovered cancer genes. A persistent question concerns the biological basis for the coselection of these alterations during gliomagenesis. To describe a network model of a cooperative genetic landscape in gliomas and to evaluate its clinical relevance. Multidimensional genomic profiles and clinical profiles of 501 patients with gliomas (45 tumors in an initial discovery set collected between 2001 and 2004 and 456 tumors in validation sets made public between 2006 and 2008) from multiple academic centers in the United States and The Cancer Genome Atlas Pilot Project (TCGA). Identification of genes with coincident genetic alterations, correlated gene dosage and gene expression, and multiple functional interactions; association between those genes and patient survival. Gliomas select for a nonrandom genetic landscape (a consistent pattern of chromosomal alterations) that involves altered regions ("territories") on chromosomes 1p, 7, 8q, 9p, 10, 12q, 13q, 19q, 20, and 22q (false-discovery rate-corrected P<.05). A network model shows that these territories harbor genes with putative synergistic, tumor-promoting relationships. The coalteration of the most interactive of these genes in glioblastoma is associated with unfavorable patient survival. A multigene risk scoring model based on 7 landscape genes (POLD2, CYCS, MYC, AKR1C3, YME1L1, ANXA7, and PDCD4) is associated with the duration of overall survival in 189 glioblastoma samples from TCGA (global log-rank P = .02 comparing 3 survival curves for patients with 0-2, 3-4, and 5-7 dosage-altered genes). Groups of patients with 0 to 2 (low-risk group) and 5 to 7 (high-risk group) dosage-altered genes experienced 49.24 and 79.56 deaths per 100 person-years (hazard ratio [HR], 1.63; 95% confidence interval [CI], 1.10-2.40; Cox regression model P = .02), respectively. These associations with survival are validated using gene expression data in 3 independent glioma studies, comprising 76 (global log-rank P = .003; 47.89 vs 15.13 deaths per 100 person-years for high risk vs low risk; Cox model HR, 3.04; 95% CI, 1.49-6.20; P = .002) and 70 (global log-rank P = .008; 83.43 vs 16.14 deaths per 100 person-years for high risk vs low risk; HR, 3.86; 95% CI, 1.59-9.35; P = .003) high-grade gliomas and 191 glioblastomas (global log-rank P = .002; 83.23 vs 34.16 deaths per 100 person-years for high risk vs low risk; HR, 2.27; 95% CI, 1.44-3.58; P<.001). The alteration of multiple networking genes by recurrent chromosomal aberrations in gliomas deregulates critical signaling pathways through multiple, cooperative mechanisms. These mutations, which are likely due to nonrandom selection of a distinct genetic landscape during gliomagenesis, are associated with patient prognosis.

    View details for Web of Science ID 000267948100020

    View details for PubMedID 19602686

  • A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis BIOSTATISTICS Witten, D. M., Tibshirani, R., Hastie, T. 2009; 10 (3): 515-534

    Abstract

    We present a penalized matrix decomposition (PMD), a new framework for computing a rank-K approximation for a matrix. We approximate the matrix $X$ as $\hat{X} = \sum_{k=1}^{K} d_k u_k v_k^T$, where $d_k$, $u_k$, and $v_k$ minimize the squared Frobenius norm of $X - \hat{X}$, subject to penalties on $u_k$ and $v_k$. This results in a regularized version of the singular value decomposition. Of particular interest is the use of $L_1$-penalties on $u_k$ and $v_k$, which yields a decomposition of $X$ using sparse vectors. We show that when the PMD is applied using an $L_1$-penalty on $v_k$ but not on $u_k$, a method for sparse principal components results. In fact, this yields an efficient algorithm for the "SCoTLASS" proposal (Jolliffe and others 2003) for obtaining sparse principal components. This method is demonstrated on a publicly available gene expression data set. We also establish connections between the SCoTLASS method for sparse principal component analysis and the method of Zou and others (2006). In addition, we show that when the PMD is applied to a cross-products matrix, it results in a method for penalized canonical correlation analysis (CCA). We apply this penalized CCA method to simulated data and to a genomic data set consisting of gene expression and DNA copy number measurements on the same set of samples.

    View details for DOI 10.1093/biostatistics/kxp008

    View details for Web of Science ID 000267213700010

    View details for PubMedID 19377034

    View details for PubMedCentralID PMC2697346
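    As a rough illustration of the rank-1 case described in the abstract above (a penalty on v but not on u), the sketch below alternates a normalized update of u with a soft-thresholded, normalized update of v. It is a simplified sketch, not the authors' implementation: the threshold delta is fixed by hand as an assumption, where a full implementation would typically tune it to meet a bound on the L1 norm of v, and all function names and the toy matrix are hypothetical.

```python
import math

def soft_threshold(a, delta):
    """Shrink each entry toward zero by delta; entries within delta become zero."""
    return [math.copysign(max(abs(x) - delta, 0.0), x) for x in a]

def normalize(a):
    """Scale a vector to unit L2 norm (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in a))
    return [x / norm for x in a] if norm > 0 else list(a)

def matvec(X, v):
    """Compute X v for a matrix stored as a list of rows."""
    return [sum(x_ij * v_j for x_ij, v_j in zip(row, v)) for row in X]

def rmatvec(X, u):
    """Compute X^T u for a matrix stored as a list of rows."""
    return [sum(X[i][j] * u[i] for i in range(len(X))) for j in range(len(X[0]))]

def pmd_rank1(X, delta, n_iter=100):
    """Rank-1 penalized decomposition with a sparse v (fixed threshold delta, an assumption)."""
    v = normalize([1.0] * len(X[0]))
    for _ in range(n_iter):
        u = normalize(matvec(X, v))                         # unpenalized update of u
        v = normalize(soft_threshold(rmatvec(X, u), delta))  # sparse update of v
    d = sum(u_i * xv_i for u_i, xv_i in zip(u, matvec(X, v)))  # d = u^T X v
    return d, u, v

# Toy matrix: the third column carries almost no signal.
X = [[3.0, 2.0, 0.1],
     [6.0, 4.0, -0.1],
     [-3.0, -2.0, 0.1]]
d, u, v = pmd_rank1(X, delta=0.5)
```

    On this toy matrix the third entry of v is driven exactly to zero while the first two remain, which is the sparsity behavior the L1 penalty is meant to produce.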

  • Alteration of Gene Expression Signatures of Cortical Differentiation and Wound Response in Lethal Clear Cell Renal Cell Carcinomas PLOS ONE Zhao, H., Ma, Z., Tibshirani, R., Higgins, J. P., Ljungberg, B., Brooks, J. D. 2009; 4 (6)

    Abstract

    Clear cell renal cell carcinoma (ccRCC) is the most common malignancy of the adult kidney and displays heterogeneity in clinical outcomes. Through comprehensive gene expression profiling, we have identified previously a set of transcripts that predict survival following nephrectomy independent of tumor stage, grade, and performance status. These transcripts, designated as the SPC (supervised principal components) gene set, show no apparent biological or genetic features that provide insight into renal carcinogenesis or tumor progression. We explored the relationship of this gene list to a set of genes expressed in different anatomical segments of the normal kidney including the cortex (cortex gene set) and the glomerulus (glomerulus gene set), and a gene set expressed after serum stimulation of quiescent fibroblasts (the core serum response or CSR gene set). Interestingly, the normal cortex, glomerulus (part of the normal renal cortex), and CSR gene sets captured more than 1/5 of the genes in the highly prognostic SPC gene set. Based on gene expression patterns alone, the SPC gene set could be used to sort samples from normal adult kidneys by the anatomical regions from which they were dissected. Tumors whose gene expression profiles most resembled the normal renal cortex or glomerulus showed better survival than those that did not, and those with expression features more similar to CSR showed poorer survival. While the cortex, glomerulus, and CSR signatures predicted survival independent of traditional clinical parameters, they were not independent of the SPC gene list. Our findings suggest that critical biological features of lethal ccRCC include loss of normal cortical differentiation and activation of programs associated with wound healing.

    View details for DOI 10.1371/journal.pone.0006039

    View details for Web of Science ID 000267356900003

    View details for PubMedID 19557179

    View details for PubMedCentralID PMC2698218

  • Anti-idiotype antibody response after vaccination correlates with better overall survival in follicular lymphoma BLOOD Ai, W. Z., Tibshirani, R., Taidi, B., Czerwinski, D., Levy, R. 2009; 113 (23): 5743-5746

    Abstract

    Previous studies demonstrated that vaccination-induced tumor-specific immune response is associated with superior clinical outcome in patients with follicular lymphoma. Here, we investigated whether this positive correlation extends to overall survival (OS). We analyzed 91 untreated patients who received CVP chemotherapy (cyclophosphamide, vincristine, and prednisone) followed by idiotype vaccination. Idiotype proteins were produced either by the hybridoma method or by expression of recombinant idiotype-encoding sequences in mammalian or plant-based expression systems. We found that achieving a complete response/complete response unconfirmed (CR/CRu) to CVP and making an anti-idiotype antibody are 2 independent factors that each correlated with longer OS at 10 years (89% vs 68% with or without a CR/CRu, P = .024; 90% vs 69% with or without tumor-specific antibody production; P = .027). In the subset of patients who received hybridoma-generated vaccines, we found that anti-idiotype production was even more highly associated with superior OS (P < .002); this was the case even in patients with a partial response (PR) to CVP (P < .001).

    View details for DOI 10.1182/blood-2009-01-201988

    View details for Web of Science ID 000266656100013

    View details for PubMedID 19346494

    View details for PubMedCentralID PMC2700314

  • A BIAS CORRECTION FOR THE MINIMUM ERROR RATE IN CROSS-VALIDATION ANNALS OF APPLIED STATISTICS Tibshirani, R. J., Tibshirani, R. 2009; 3 (2): 822-829

    View details for DOI 10.1214/08-AOAS224

    View details for Web of Science ID 000271979600014

  • Prognostic significance of vascular endothelial growth factor (VEGF), VEGF receptors (VEGFR), and vascularity in diffuse large B-cell lymphoma treated with immunochemotherapy (R-CHOP) 45th Annual Meeting of the American-Society-of-Clinical-Oncology (ASCO) Gratzinger, D., Advani, R., Zhao, S., Talreja, N., Tibshirani, R. J., Horning, S. J., Levy, R., Lossos, I. S., Gascoyne, R. D., Natkunam, Y. AMER SOC CLINICAL ONCOLOGY. 2009
  • Correlation of RRM1 expression in muscle invasive locally advanced urothelial cancer with age 45th Annual Meeting of the American-Society-of-Clinical-Oncology (ASCO) Harshman, L. C., Bepler, G., Zheng, Z., Higgins, J. P., Allen, G. I., Tibshirani, R., Srinivas, S. AMER SOC CLINICAL ONCOLOGY. 2009
  • Differentiation stage-specific expression of microRNAs in B lymphocytes and diffuse large B-cell lymphomas BLOOD Malumbres, R., Sarosiek, K. A., Cubedo, E., Ruiz, J. W., Jiang, X., Gascoyne, R. D., Tibshirani, R., Lossos, I. S. 2009; 113 (16): 3754-3764

    Abstract

    miRNAs are small RNA molecules binding to partially complementary sites in the 3'-UTR of target transcripts and repressing their expression. miRNAs orchestrate multiple cellular functions and play critical roles in cell differentiation and cancer development. We analyzed miRNA profiles in B-cell subsets during peripheral B-cell differentiation as well as in diffuse large B-cell lymphoma (DLBCL) cells. Our results show temporal changes in the miRNA expression during B-cell differentiation with a highly unique miRNA profile in germinal center (GC) lymphocytes. We provide experimental evidence that these changes may be physiologically relevant by demonstrating that GC-enriched hsa-miR-125b down-regulates the expression of IRF4 and PRDM1/BLIMP1, and memory B cell-enriched hsa-miR-223 down-regulates the expression of LMO2. We further demonstrate that although an important component of the biology of a malignant cell is inherited from its nontransformed cellular progenitor (GC centroblasts), aberrant miRNA expression is acquired upon cell transformation. A 9-miRNA signature was identified that could precisely differentiate the 2 major subtypes of DLBCL. Finally, expression of some of the miRNAs in this signature is correlated with clinical outcome of uniformly treated DLBCL patients.

    View details for DOI 10.1182/blood-2008-10-184077

    View details for Web of Science ID 000265445900016

    View details for PubMedID 19047678

  • Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods JOURNAL OF MACHINE LEARNING RESEARCH Hoefling, H., Tibshirani, R. 2009; 10: 883-906

    Abstract

    We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudo-likelihood criterion so that each additional iteration moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate.

    View details for Web of Science ID 000270824600003

    View details for PubMedCentralID PMC3157941
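    To make the pseudo-likelihood idea in the abstract above concrete, the sketch below evaluates an L1-penalized negative log pseudo-likelihood for a binary pairwise Markov network. It assumes 0/1 coding and a symmetric parameter matrix theta whose diagonal holds the node terms; these conventions, the function names, and the toy samples are assumptions for illustration and are not taken from the paper. It only evaluates the criterion and does not implement the paper's exact algorithm or any optimizer.

```python
import math

def log_sigmoid(a):
    """Numerically stable log(1 / (1 + exp(-a)))."""
    if a >= 0:
        return -math.log1p(math.exp(-a))
    return a - math.log1p(math.exp(a))

def neg_log_pseudo_likelihood(theta, samples, lam=0.0):
    """Sum over samples and nodes of -log P(x_j | all other nodes), plus an L1 penalty.

    Assumed model: p(x) proportional to exp(sum_j theta[j][j] x_j + sum_{j<k} theta[j][k] x_j x_k),
    with x_j in {0, 1}, so each conditional is a logistic regression on the other nodes.
    """
    p = len(theta)
    total = 0.0
    for x in samples:
        for j in range(p):
            eta = theta[j][j] + sum(theta[j][k] * x[k] for k in range(p) if k != j)
            total += log_sigmoid(eta if x[j] == 1 else -eta)
    # Penalize only the pairwise (off-diagonal) terms, counting each edge once.
    penalty = lam * sum(abs(theta[j][k]) for j in range(p) for k in range(j + 1, p))
    return -total + penalty

# Toy data in which nodes 0 and 1 always agree.
samples = [[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
independent = [[0.0] * 3 for _ in range(3)]
coupled = [[0.0, 2.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

nlpl_independent = neg_log_pseudo_likelihood(independent, samples)
nlpl_coupled = neg_log_pseudo_likelihood(coupled, samples)
```

    Because each conditional is an ordinary logistic model, the criterion factorizes over nodes and avoids the partition function, which is why the approximate approach is cheap to evaluate.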

  • Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods. Journal of machine learning research : JMLR Höfling, H., Tibshirani, R. 2009; 10: 883-906

    Abstract

    We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudo-likelihood criterion so that each additional iteration moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate.

    View details for PubMedID 21857799

    View details for PubMedCentralID PMC3157941

  • Covariance-regularized regression and classification for high-dimensional problems. Journal of the Royal Statistical Society. Series B, Statistical methodology Witten, D. M., Tibshirani, R. 2009; 71 (3): 615-636

    Abstract

    In recent years, many methods have been developed for regression in high-dimensional settings. We propose covariance-regularized regression, a family of methods that use a shrunken estimate of the inverse covariance matrix of the features in order to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing its log likelihood, under a multivariate normal model, subject to a constraint on its elements; this estimate is then used to estimate coefficients for the regression of the response onto the features. We show that ridge regression, the lasso, and the elastic net are special cases of covariance-regularized regression, and we demonstrate that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations. The covariance-regularized regression framework is extended to generalized linear models and linear discriminant analysis, and is used to analyze gene expression data sets with multiple class and survival outcomes.

    View details for DOI 10.1111/j.1467-9868.2009.00699.x

    View details for PubMedID 20084176

    View details for PubMedCentralID PMC2806603

  • Temporal Changes in Gene Expression Induced by Sulforaphane in Human Prostate Cancer Cells PROSTATE Bhamre, S., Sahoo, D., Tibshirani, R., Dill, D. L., Brooks, J. D. 2009; 69 (2): 181-190

    Abstract

    Prostate cancer is thought to arise as a result of oxidative stresses and induction of antioxidant electrophile defense (phase 2) enzymes has been proposed as a prostate cancer prevention strategy. The isothiocyanate sulforaphane, derived from cruciferous vegetables like broccoli, potently induces surrogate markers of phase 2 enzyme activity in prostate cells in vitro and in vivo. To better understand the temporal effects of sulforaphane and broccoli sprouts on gene expression in prostate cells, we carried out comprehensive transcriptome analysis using cDNA microarrays. Transcripts significantly modulated by sulforaphane over time were identified using StepMiner analysis. Ingenuity Pathway Analysis (IPA) was used to identify biological pathways, networks, and functions significantly altered by sulforaphane treatment. StepMiner and IPA revealed significant changes in many transcripts associated with cell growth and cell cycle, as well as a significant number associated with cellular response to oxidative damage and stress. Comparison to an existing dataset suggested that sulforaphane blocked cell growth by inducing G2/M arrest. Cell growth assays and flow cytometry analysis confirmed that sulforaphane inhibited cell growth and induced cell cycle arrest. Our data suggest that in prostate cells sulforaphane primarily induces cellular defenses and inhibits cell growth by causing G2/M phase arrest. Furthermore, based on the striking similarities in the gene expression patterns induced across experiments in these cells, sulforaphane appears to be the primary bioactive compound present in broccoli sprouts, suggesting that broccoli sprouts can serve as a suitable source for sulforaphane in intervention trials.

    View details for DOI 10.1002/pros.20869

    View details for Web of Science ID 000262701200008

    View details for PubMedID 18973173

    View details for PubMedCentralID PMC2612096

  • Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY Witten, D. M., Tibshirani, R. J. 2009; 8 (1)

    Abstract

    In recent work, several authors have introduced methods for sparse canonical correlation analysis (sparse CCA). Suppose that two sets of measurements are available on the same set of observations. Sparse CCA is a method for identifying sparse linear combinations of the two sets of variables that are highly correlated with each other. It has been shown to be useful in the analysis of high-dimensional genomic data, when two sets of assays are available on the same set of samples. In this paper, we propose two extensions to the sparse CCA methodology. (1) Sparse CCA is an unsupervised method; that is, it does not make use of outcome measurements that may be available for each observation (e.g., survival time or cancer subtype). We propose an extension to sparse CCA, which we call sparse supervised CCA, which results in the identification of linear combinations of the two sets of variables that are correlated with each other and associated with the outcome. (2) It is becoming increasingly common for researchers to collect data on more than two assays on the same set of samples; for instance, SNP, gene expression, and DNA copy number measurements may all be available. We develop sparse multiple CCA in order to extend the sparse CCA methodology to the case of more than two data sets. We demonstrate these new methods on simulated data and on a recently published and publicly available diffuse large B-cell lymphoma data set.

    View details for DOI 10.2202/1544-6115.1470

    View details for Web of Science ID 000267601500008

    View details for PubMedID 19572827

    View details for PubMedCentralID PMC2861323

  • CD81 Protein Is Expressed in Normal Germinal Center B-Cells and in Subtypes of Human Non-Hodgkin Lymphomas 98th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology Luo, R. F., Zhao, S., Tibshirani, R., Lossos, I. S., Advani, R., Gratzinger, D., Wong, A., Talreja, N., Levy, R., Levy, S., Natkunam, Y. NATURE PUBLISHING GROUP. 2009: 275A–275A
  • Discovery of Molecular Subtypes in Leiomyosarcoma through Integrative Molecular Profiling 98th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology Beck, A. H., Lee, C. H., Witten, D. M., Zhou, S., Montgomery, K., Tibshirani, R., Hastie, T., West, R. B., van de Rijn, M. NATURE PUBLISHING GROUP. 2009: 368A–368A
  • Covariance-regularized regression and classification for high dimensional problems JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Witten, D. M., Tibshirani, R. 2009; 71: 615-636

    Abstract

    In recent years, many methods have been developed for regression in high-dimensional settings. We propose covariance-regularized regression, a family of methods that use a shrunken estimate of the inverse covariance matrix of the features in order to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing its log likelihood, under a multivariate normal model, subject to a constraint on its elements; this estimate is then used to estimate coefficients for the regression of the response onto the features. We show that ridge regression, the lasso, and the elastic net are special cases of covariance-regularized regression, and we demonstrate that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations. The covariance-regularized regression framework is extended to generalized linear models and linear discriminant analysis, and is used to analyze gene expression data sets with multiple class and survival outcomes.

    View details for DOI 10.1111/j.1467-9868.2009.00699.x

    View details for Web of Science ID 000266602200003

    View details for PubMedCentralID PMC2806603

  • Blood autoantibody and cytokine profiles predict response to anti-tumor necrosis factor therapy in rheumatoid arthritis ARTHRITIS RESEARCH & THERAPY Hueber, W., Tomooka, B. H., Batliwalla, F., Li, W., Monach, P. A., Tibshirani, R. J., Van Vollenhoven, R. F., Lampa, J., Saito, K., Tanaka, Y., Genovese, M. C., Klareskog, L., Gregersen, P. K., Robinson, W. H. 2009; 11 (3)

    Abstract

    Anti-TNF therapies have revolutionized the treatment of rheumatoid arthritis (RA), a common systemic autoimmune disease involving destruction of the synovial joints. However, in the practice of rheumatology approximately one-third of patients demonstrate no clinical improvement in response to treatment with anti-TNF therapies, while another third demonstrate a partial response, and one-third an excellent and sustained response. Since no clinical or laboratory tests are available to predict response to anti-TNF therapies, great need exists for predictive biomarkers. Here we present a multi-step proteomics approach using arthritis antigen arrays, a multiplex cytokine assay, and conventional ELISA, with the objective to identify a biomarker signature in three ethnically diverse cohorts of RA patients treated with the anti-TNF therapy etanercept. We identified a 24-biomarker signature that enabled prediction of a positive clinical response to etanercept in all three cohorts (positive predictive values 58 to 72%; negative predictive values 63 to 78%). We identified a multi-parameter protein biomarker that enables pretreatment classification and prediction of etanercept responders, and tested this biomarker using three independent cohorts of RA patients. Although further validation in prospective and larger cohorts is needed, our observations demonstrate that multiplex characterization of autoantibodies and cytokines provides clinical utility for predicting response to the anti-TNF therapy etanercept in RA patients.

    View details for DOI 10.1186/ar2706

    View details for PubMedID 19460157

  • Univariate Shrinkage in the Cox Model for High Dimensional Data STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY Tibshirani, R. J. 2009; 8 (1)

    Abstract

    We propose a method for prediction in Cox's proportional hazards model, when the number of features (regressors), p, exceeds the number of observations, n. The method assumes that the features are independent in each risk set, so that the partial likelihood factors into a product. As such, it is analogous to univariate thresholding in linear regression and nearest shrunken centroids in classification. We call the procedure Cox univariate shrinkage and demonstrate its usefulness on real and simulated data. The method has the attractive property of being essentially univariate in its operation: the features are entered into the model based on the size of their Cox score statistics. We illustrate the new method on real and simulated data, and compare it to other proposed methods for survival prediction with a large number of predictors.

    View details for Web of Science ID 000265689500003

    View details for PubMedID 19409065
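    The abstract above notes that features enter the model according to the size of their Cox score statistics. The sketch below computes a standardized univariate Cox score statistic at beta = 0 for each feature and ranks features by its absolute value. It covers only this ranking step, not the shrinkage fit itself; the function names and toy data are hypothetical, and tied event times are handled in a simple way chosen for brevity.

```python
import math

def cox_score_statistic(times, events, x):
    """Standardized Cox partial-likelihood score for one feature, evaluated at beta = 0."""
    score, variance = 0.0, 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through the risk sets
        risk = [x[k] for k, t_k in enumerate(times) if t_k >= t_i]
        mean = sum(risk) / len(risk)
        score += x[i] - mean                                        # observed minus risk-set mean
        variance += sum((r - mean) ** 2 for r in risk) / len(risk)  # risk-set variance
    return score / math.sqrt(variance) if variance > 0 else 0.0

def rank_features(times, events, features):
    """Order feature indices by decreasing absolute score statistic."""
    scores = [cox_score_statistic(times, events, x) for x in features]
    order = sorted(range(len(features)), key=lambda j: -abs(scores[j]))
    return order, scores

# Toy data: six subjects, all events observed.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
features = [
    [5, 4, 3, 2, 1, 0],  # high values go with early events
    [1, 0, 1, 0, 0, 1],  # little relation to survival time
]
order, scores = rank_features(times, events, features)
```

    Because each statistic uses one feature at a time, the ranking costs a single pass per feature, which is what makes the approach practical when p greatly exceeds n.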

  • Lymphoma-Expressed VEGF-a,VEGFR-1, VEGFR-2, and Microvessel Density Are Not Predictive of Overall Survival in Follicular Lymphoma. 50th Annual Meeting of the American-Society-of-Hematology/ASH/ASCO Joint Symposium Gratzinger, D., Zhao, S., Ai, W., Tibshirani, R., Levy, R., Natkunam, Y. AMER SOC HEMATOLOGY. 2008: 1290–90
  • Differentiation-Stage-Specific Expression of MicroRNAs in B-Lymphocytes and Diffuse Large B-Cell Lymphomas (DLBCL) 50th Annual Meeting of the American-Society-of-Hematology/ASH/ASCO Joint Symposium Malumbres, R., Tibshirani, R., Cubedo, E., Sarosiek, K. A., Jiang, X., Ruiz, J., Lossos, I. AMER SOC HEMATOLOGY. 2008: 299–99
  • LMO2 Protein Expression Predicts Survival in Patients with Diffuse Large B-Cell Lymphoma Treated with Immunochemotherapy (RCHOP): A Multicenter Validation Study. 50th Annual Meeting of the American-Society-of-Hematology/ASH/ASCO Joint Symposium Advani, R., Talreja, N., Tibshirani, R., Zhao, S., Alizadeh, A., Briones, J., Bordes, R., Cohen, J., Horning, S., Levy, R., Lossos, I. S., Natkunam, Y. AMER SOC HEMATOLOGY. 2008: 1291–91
  • Neither CD68+Nor CD163+Macrophages Are Associated with Decreased Survival in Follicular Lymphoma 50th Annual Meeting of the American-Society-of-Hematology/ASH/ASCO Joint Symposium Gratzinger, D., Ai, W., Tibshirani, R., Levy, R., Natkunam, Y. AMER SOC HEMATOLOGY. 2008: 1284–84
  • TESTING SIGNIFICANCE OF FEATURES BY LASSOED PRINCIPAL COMPONENTS ANNALS OF APPLIED STATISTICS Witten, D. M., Tibshirani, R. 2008; 2 (3): 986-1012

    Abstract

    We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially-expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit simple) approach might be to compute a two-sample t-statistic for each gene. The LPC method involves projecting these conventional gene scores onto the eigenvectors of the gene expression data covariance matrix and then applying an $L_1$ penalty in order to de-noise the resulting projections. We present a theoretical framework under which LPC is the logical choice for identifying significant genes, and we show that LPC can provide a marked reduction in false discovery rates over the conventional methods on both real and simulated data. Moreover, this flexible procedure can be applied to a variety of types of data and can be used to improve many existing methods for the identification of significant features.

    View details for DOI 10.1214/08-AOAS182

    View details for Web of Science ID 000261057900009

    View details for PubMedCentralID PMC2743444

  • TESTING SIGNIFICANCE OF FEATURES BY LASSOED PRINCIPAL COMPONENTS. The annals of applied statistics Witten, D. M., Tibshirani, R. 2008; 2 (3): 986-1012

    Abstract

    We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially-expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit simple) approach might be to compute a two-sample t-statistic for each gene. The LPC method involves projecting these conventional gene scores onto the eigenvectors of the gene expression data covariance matrix and then applying an $L_1$ penalty in order to de-noise the resulting projections. We present a theoretical framework under which LPC is the logical choice for identifying significant genes, and we show that LPC can provide a marked reduction in false discovery rates over the conventional methods on both real and simulated data. Moreover, this flexible procedure can be applied to a variety of types of data and can be used to improve many existing methods for the identification of significant features.

    View details for DOI 10.1214/08-AOAS182SUPP

    View details for PubMedID 19756232

    View details for PubMedCentralID PMC2743444

  • "Preconditioning" for feature selection and regression in high-dimensional problems ANNALS OF STATISTICS Paul, D., Bair, E., Hastie, T., Tibshirani, R. 2008; 36 (4): 1595-1618
  • Complementary hierarchical clustering BIOSTATISTICS Nowak, G., Tibshirani, R. 2008; 9 (3): 467-483

    Abstract

    When applying hierarchical clustering algorithms to cluster patient samples from microarray data, the clustering patterns generated by most algorithms tend to be dominated by groups of highly differentially expressed genes that have closely related expression patterns. Sometimes, these genes may not be relevant to the biological process under study or their functions may already be known. The problem is that these genes can potentially drown out the effects of other genes that are relevant or have novel functions. We propose a procedure called complementary hierarchical clustering that is designed to uncover the structures arising from these novel genes that are not as highly expressed. Simulation studies show that the procedure is effective when applied to a variety of examples. We also define a concept called relative gene importance that can be used to identify the influential genes in a given clustering. Finally, we analyze a microarray data set from 295 breast cancer patients, using clustering with the correlation-based distance measure. The complementary clustering reveals a grouping of the patients which is uncorrelated with a number of known prognostic signatures and significantly differing distant metastasis-free probabilities.

    View details for DOI 10.1093/biostatistics/kxm046

    View details for Web of Science ID 000256977000008

    View details for PubMedID 18093965

    View details for PubMedCentralID PMC3294318

  • Sparse inverse covariance estimation with the graphical lasso BIOSTATISTICS Friedman, J., Hastie, T., Tibshirani, R. 2008; 9 (3): 432-441

    Abstract

    We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop a simple algorithm, the graphical lasso, that is remarkably fast: it solves a 1000-node problem (approximately 500,000 parameters) in at most a minute and is 30-4000 times faster than competing methods. It also provides a conceptual link between the exact problem and the approximation suggested by Meinshausen and Bühlmann (2006). We illustrate the method on some cell-signaling data from proteomics.

    View details for DOI 10.1093/biostatistics/kxm045

    View details for Web of Science ID 000256977000005

    View details for PubMedID 18079126

    View details for PubMedCentralID PMC3019769
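
    The "approximately 500,000 parameters" figure for a 1000-node problem is consistent with counting the free entries of a symmetric p x p matrix. The short check below is only an arithmetic illustration; the function name is hypothetical, and whether the abstract counts the diagonal is an assumption (both counts round to the same figure).

```python
def symmetric_param_count(p, include_diagonal=True):
    """Number of free entries in a symmetric p x p matrix."""
    off_diagonal = p * (p - 1) // 2
    return off_diagonal + p if include_diagonal else off_diagonal


if __name__ == "__main__":
    print(symmetric_param_count(1000))                          # 500500
    print(symmetric_param_count(1000, include_diagonal=False))  # 499500
```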

  • Paraffin-based 6-gene model predicts outcome in diffuse large B-cell lymphoma patients treated with R-CHOP BLOOD Malumbres, R., Chen, J., Tibshirani, R., Johnson, N. A., Sehn, L. H., Natkunam, Y., Briones, J., Advani, R., Connors, J. M., Byrne, G. E., Levy, R., Gascoyne, R. D., Lossos, I. S. 2008; 111 (12): 5509-5514

    Abstract

    Diffuse large B-cell lymphoma (DLBCL) is a heterogeneous disease characterized by variable clinical outcomes. Outcome prediction at the time of diagnosis is of paramount importance. Previously, we constructed a 6-gene model for outcome prediction of DLBCL patients treated with anthracycline-based chemotherapies. However, the standard therapy has evolved into rituximab, cyclophosphamide, doxorubicin, vincristine and prednisone (R-CHOP). Herein, we evaluated the predictive power of a paraffin-based 6-gene model in R-CHOP-treated DLBCL patients. RNA was successfully extracted from 132 formalin-fixed paraffin-embedded (FFPE) specimens. Expression of the 6 genes comprising the model was measured and the mortality predictor score was calculated for each patient. The mortality predictor score divided patients into low-risk (below median) and high-risk (above median) subgroups with significantly different overall survival (OS; P = .002) and progression-free survival (PFS; P = .038). The model also predicted OS and PFS when the mortality predictor score was considered as a continuous variable (P = .002 and .010, respectively) and was independent of the IPI for prediction of OS (P = .008). These findings demonstrate that the prognostic value of the 6-gene model remains significant in the era of R-CHOP treatment and that the model can be applied to routine FFPE tissue from initial diagnostic biopsies.

    View details for DOI 10.1182/blood-2008-02-136374

    View details for Web of Science ID 000256786500021

    View details for PubMedID 18445689

    View details for PubMedCentralID PMC2424149
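
    The median split described in the abstract above (low-risk below the median score, high-risk above it) can be sketched as follows. The patient identifiers and score values are made up for illustration, and leaving scores exactly equal to the median unassigned is an assumption, since the abstract does not say how ties were handled.

```python
from statistics import median


def split_by_median(scores):
    """Split patients into low- and high-risk groups around the median score.

    scores: dict mapping a patient id to a predictor score (hypothetical data).
    Scores equal to the median are left unassigned (assumption).
    """
    cutoff = median(scores.values())
    groups = {"low": [], "high": []}
    for patient, score in scores.items():
        if score < cutoff:
            groups["low"].append(patient)
        elif score > cutoff:
            groups["high"].append(patient)
    return cutoff, groups


if __name__ == "__main__":
    toy = {"p1": 0.2, "p2": 1.5, "p3": -0.7, "p4": 0.9}
    print(split_by_median(toy))
```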

  • A STUDY OF PRE-VALIDATION ANNALS OF APPLIED STATISTICS Hoefling, H., Tibshirani, R. 2008; 2 (2): 643-664

    View details for DOI 10.1214/07-AOAS152

    View details for Web of Science ID 000261057800015

  • An FLT3 gene-expression signature predicts clinical outcome in normal karyotype AML BLOOD Bullinger, L., Doehner, K., Kranz, R., Stirner, C., Froeling, S., Scholl, C., Kim, Y. H., Schlenk, R. F., Tibshirani, R., Doehner, H., Pollack, J. R. 2008; 111 (9): 4490-4495

    Abstract

    Acute myeloid leukemia with normal karyotype (NK-AML) represents a cytogenetic grouping with intermediate prognosis but substantial molecular and clinical heterogeneity. Within this subgroup, presence of FLT3 (FMS-like tyrosine kinase 3) internal tandem duplication (ITD) mutation predicts less favorable outcome. The goal of our study was to discover gene-expression patterns correlated with FLT3-ITD mutation and to evaluate the utility of a FLT3 signature for prognostication. DNA microarrays were used to profile gene expression in a training set of 65 NK-AML cases, and supervised analysis, using the Prediction Analysis of Microarrays method, was applied to build a gene expression-based predictor of FLT3-ITD mutation status. The optimal predictor, composed of 20 genes, was then evaluated by classifying expression profiles from an independent test set of 72 NK-AML cases. The predictor exhibited modest performance (73% sensitivity; 85% specificity) in classifying FLT3-ITD status. Remarkably, however, the signature outperformed FLT3-ITD mutation status in predicting clinical outcome. The signature may better define clinically relevant FLT3 signaling and/or alternative changes that phenocopy FLT3-ITD, whereas the signature genes provide a starting point to dissect these pathways. Our findings support the potential clinical utility of a gene expression-based measure of FLT3 pathway activation in AML.

    View details for DOI 10.1182/blood-2007-09-115055

    View details for Web of Science ID 000255387400016

    View details for PubMedID 18309032

  • IRF9 and STAT1 are required for IgG autoantibody production and B cell expression of TLR7 in mice JOURNAL OF CLINICAL INVESTIGATION Thibault, D. L., Chu, A. D., Graham, K. L., Balboni, I., Lee, L. Y., Kohlmoos, C., Landrigan, A., Higgins, J. P., Tibshirani, R., Utz, P. J. 2008; 118 (4): 1417-1426

    Abstract

    A hallmark of SLE is the production of high-titer, high-affinity, isotype-switched IgG autoantibodies directed against nucleic acid-associated antigens. Several studies have established a role for both type I IFN (IFN-I) and the activation of TLRs by nucleic acid-associated autoantigens in the pathogenesis of this disease. Here, we demonstrate that 2 IFN-I signaling molecules, IFN regulatory factor 9 (IRF9) and STAT1, were required for the production of IgG autoantibodies in the pristane-induced mouse model of SLE. In addition, levels of IgM autoantibodies were increased in pristane-treated Irf9 -/- mice, suggesting that IRF9 plays a role in isotype switching in response to self antigens. Upregulation of TLR7 by IFN-alpha was greatly reduced in Irf9 -/- and Stat1 -/- B cells. Irf9 -/- B cells were incapable of being activated through TLR7, and Stat1 -/- B cells were impaired in activation through both TLR7 and TLR9. These data may reveal a novel role for IFN-I signaling molecules in both TLR-specific B cell responses and production of IgG autoantibodies directed against nucleic acid-associated autoantigens. Our results suggest that IFN-I is upstream of TLR signaling in the activation of autoreactive B cells in SLE.

    View details for DOI 10.1172/JCI30065

    View details for Web of Science ID 000254588600035

    View details for PubMedID 18340381

    View details for PubMedCentralID PMC2267033

  • Multiplexed proximity ligation assays to profile putative plasma biomarkers relevant to pancreatic and ovarian cancer CLINICAL CHEMISTRY Fredriksson, S., Horecka, J., Brustugun, O. T., Schlingemann, J., Koong, A. C., Tibshirani, R., Davis, R. W. 2008; 54 (3): 582-589

    Abstract

    Sensitive methods are needed for biomarker discovery and validation. We tested one promising technology, multiplex proximity ligation assay (PLA), in a pilot study profiling plasma biomarkers in pancreatic and ovarian cancer. We used 4 panels of 6- and 7-plex PLAs to detect biomarkers, with each assay consuming 1 µL plasma and using either matched monoclonal antibody pairs or single batches of polyclonal antibody. Protein analytes were converted to unique DNA amplicons by proximity ligation and subsequently detected by quantitative PCR. We profiled 18 pancreatic cancer cases and 19 controls and 19 ovarian cancer cases and 20 controls for the following proteins: a disintegrin and metalloprotease 8, CA-125, CA 19-9, carboxypeptidase A1, carcinoembryonic antigen, connective tissue growth factor, epidermal growth factor receptor, epithelial cell adhesion molecule, Her2, galectin-1, insulin-like growth factor 2, interleukin-1alpha, interleukin-7, mesothelin, macrophage migration inhibitory factor, osteopontin, secretory leukocyte peptidase inhibitor, tumor necrosis factor alpha, vascular endothelial growth factor, and chitinase 3-like 1. Probes for CA-125 were present in 3 of the multiplex panels. We measured plasma concentrations of the CA-125-mesothelin complex by use of a triple-specific PLA with 2 ligation events among 3 probes. The assays displayed consistent measurements of CA-125 independent of which other markers were simultaneously detected and showed good correlation with Luminex data. In comparison to literature reports, we achieved expected results for other putative markers. Multiplex PLA using either matched monoclonal antibodies or single batches of polyclonal antibody should prove useful for identifying and validating sets of putative disease biomarkers and finding multimarker panels.

    View details for DOI 10.1373/clinchem.2007.093195

    View details for Web of Science ID 000253570400019

    View details for PubMedID 18171715

  • hCAP-D3 expression marks a prostate cancer subtype with favorable clinical behavior and androgen signaling signature AMERICAN JOURNAL OF SURGICAL PATHOLOGY Lapointe, J., Malhotra, S., Higgins, J. P., Bair, E., Thompson, M., Salari, K., Giacomini, C. P., Ferrari, M., Montgomery, K., Tibshirani, R., van de Rijn, M., Brooks, J. D., Pollack, J. R. 2008; 32 (2): 205-209

    Abstract

    Growing evidence suggests that only a fraction of prostate cancers detected clinically are potentially lethal. An important clinical issue is identifying men with indolent cancer who might be spared aggressive therapies with associated morbidities. Previously, using microarray analysis we defined 3 molecular subtypes of prostate cancer with different gene-expression patterns. One, subtype-1, displayed features consistent with more indolent behavior, where an immunohistochemical marker (AZGP1) for subtype-1 predicted favorable outcome after radical prostatectomy. Here we characterize a second candidate tissue biomarker, hCAP-D3, expressed in subtype-1 prostate tumors. hCAP-D3 expression, assayed by RNA in situ hybridization on a tissue microarray comprising 225 cases, was associated with decreased tumor recurrence after radical prostatectomy (P=0.004), independent of pathologic tumor stage, Gleason grade, and preoperative prostate-specific antigen levels. Simultaneous assessment of hCAP-D3 and AZGP1 expression in this tumor set improved outcome prediction. We have previously demonstrated that hCAP-D3 is induced by androgen in prostate cells. Extending this finding, Gene Set Enrichment Analysis revealed enrichment of androgen-responsive genes in subtype-1 tumors (P=0.019). Our findings identify hCAP-D3 as a new biomarker for subtype-1 tumors that improves prognostication, and reveal androgen signaling as an important biologic feature of this potentially clinically favorable molecular subtype.

    View details for PubMedID 18223322

  • LMO2 protein expression predicts survival in patients with diffuse large B-Cell lymphoma treated with anthracycline-based chemotherapy with and without rituximab JOURNAL OF CLINICAL ONCOLOGY Natkunam, Y., Farinha, P., Hsi, E. D., Hans, C. P., Tibshirani, R., Sehn, L. H., Connors, J. M., Gratzinger, D., Rosado, M., Zhao, S., Pohlman, B., Wongchaowart, N., Bast, M., Avigdor, A., Schiby, G., Nagler, A., Byrne, G. E., Levy, R., Gascoyne, R. D., Lossos, I. S. 2008; 26 (3): 447-454

    Abstract

    The heterogeneity of diffuse large B-cell lymphoma (DLBCL) has prompted the search for new markers that can accurately separate prognostic risk groups. We previously showed in a multivariate model that LMO2 mRNA was a strong predictor of superior outcome in DLBCL patients. Here, we tested the prognostic impact of LMO2 protein expression in DLBCL patients treated with anthracycline-based chemotherapy with or without rituximab. DLBCL patients treated with anthracycline-based chemotherapy alone (263 patients) or with the addition of rituximab (80 patients) were studied using immunohistochemistry for LMO2 on tissue microarrays of original biopsies. Staining results were correlated with outcome. In anthracycline-treated patients, LMO2 protein expression was significantly correlated with improved overall survival (OS) and progression-free survival (PFS) in univariate analyses (OS, P = .018; PFS, P = .010) and was a significant predictor independent of the clinical International Prognostic Index (IPI) in multivariate analysis. Similarly, in patients treated with the combination of anthracycline-containing regimens and rituximab, LMO2 protein expression was also significantly correlated with improved OS and PFS (OS, P = .005; PFS, P = .009) and was a significant predictor independent of the IPI in multivariate analysis. We conclude that LMO2 protein expression is a prognostic marker in DLBCL patients treated with anthracycline-based regimens alone or in combination with rituximab. After further validation, immunohistologic analysis of LMO2 protein expression may become a practical assay for newly diagnosed DLBCL patients to optimize their clinical management.

    View details for DOI 10.1200/JCO.2007.13.0690

    View details for Web of Science ID 000254177200020

    View details for PubMedID 18086797

  • Boolean implication networks derived from large scale, whole genome microarray datasets GENOME BIOLOGY Sahoo, D., Dill, D. L., Gentles, A. J., Tibshirani, R., Plevritis, S. K. 2008; 9 (10)

    Abstract

    We describe a method for extracting Boolean implications (if-then relationships) in very large amounts of gene expression microarray data. A meta-analysis of data from thousands of microarrays for humans, mice, and fruit flies finds millions of implication relationships between genes that would be missed by other methods. These relationships capture gender differences, tissue differences, development, and differentiation. New relationships are discovered that are preserved across all three species.

    View details for PubMedID 18973690
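
    A Boolean implication of the form "if gene A is high, then gene B is high" can be illustrated with a simplified check: binarize both genes and verify that the (A high, B low) combination is rare. This is only a sketch under stated assumptions, not the paper's statistical procedure; the cutoffs, the tolerated violation rate, and the function name are hypothetical.

```python
def implies_high_high(a, b, a_cut, b_cut, max_violation_rate=0.05):
    """Check 'A high => B high' across paired samples.

    a, b         : expression values for genes A and B over the same arrays
    a_cut, b_cut : thresholds separating low from high (assumed given)
    The implication is accepted when few A-high samples are B-low.
    """
    a_high = [x > a_cut for x in a]
    b_high = [y > b_cut for y in b]
    n_a_high = sum(a_high)
    if n_a_high == 0:
        return False  # no A-high samples, so no evidence either way
    violations = sum(1 for ah, bh in zip(a_high, b_high) if ah and not bh)
    return violations / n_a_high <= max_violation_rate


if __name__ == "__main__":
    gene_a = [1, 2, 8, 9, 7, 1]
    gene_b = [1, 9, 8, 9, 7, 2]
    print(implies_high_high(gene_a, gene_b, 5, 5))  # A high => B high
    print(implies_high_high(gene_b, gene_a, 5, 5))  # B high => A high
```

    The toy data show why such relationships are asymmetric: every A-high sample is also B-high, but one B-high sample is A-low, so only the first implication holds.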

  • LMO2 protein expression predicts survival in patients with diffuse large B-cell lymphoma treated with anthracycline-based chemotherapy with or without rituximab 97th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology Natkunam, Y., Farinha, P., Hsi, E. D., Hans, C. P., Tibshirani, R., Sehn, L. H., Connors, J. M., Gratzinger, D., Zhan, S., Pohlman, B., Nagler, A., Levy, R., Gascoyne, R. D., Lossos, I. S. NATURE PUBLISHING GROUP. 2008: 267A–267A
  • Prognostic significance of VEGF, VEGF receptors, and microvessel density in diffuse large B cell lymphoma treated with anthracycline-based chemotherapy LABORATORY INVESTIGATION Gratzinger, D., Zhao, S., Tibshirani, R. J., Hsi, E. D., Hans, C. P., Pohlman, B., Bast, M., Avigdor, A., Schiby, G., Nagler, A., Byrne, G. E., Lossos, I. S., Natkunam, Y. 2008; 88 (1): 38-47

    Abstract

    Vascular endothelial growth factor-mediated signaling has at least two potential roles in diffuse large B cell lymphoma: potentiation of angiogenesis, and potentiation of lymphoma cell proliferation and/or survival induced by autocrine vascular endothelial growth factor receptor-mediated signaling. We have recently shown that diffuse large B cell lymphomas expressing high levels of vascular endothelial growth factor protein also express high levels of vascular endothelial growth factor receptor-1 and vascular endothelial growth factor receptor-2. We have now assessed a larger multi-institutional cohort of patients with de novo diffuse large B cell lymphoma treated with anthracycline-based therapy to address whether tumor vascularity, or expression of vascular endothelial growth factor protein and its receptors, contribute to patient outcomes. Our results show that increased tumor vascularity is associated with poor overall survival (P=0.047), and is independent of the international prognostic index. High expression of vascular endothelial growth factor receptor-1 by lymphoma cells by contrast is associated with improved overall survival (P=0.044). The combination of high vascular endothelial growth factor and vascular endothelial growth factor receptor-1 protein expression by lymphoma cells identifies a subgroup of patients with improved overall (P=0.003) and progression-free (P=0.026) survival; these findings are also independent of the international prognostic index. The prognostic significance of overexpression of this ligand-receptor pair suggests that autocrine signaling via vascular endothelial growth factor receptor-1 may represent a survival or proliferation pathway in diffuse large B cell lymphoma. Dependence on autocrine vascular endothelial growth factor receptor-1-mediated signaling may render a subset of diffuse large B-cell lymphomas susceptible to anthracycline-based therapy.

    View details for DOI 10.1038/labinvest.3700697

    View details for Web of Science ID 000251820600004

    View details for PubMedID 17998899

  • Spatial smoothing and hot spot detection for CGH data using the fused lasso BIOSTATISTICS Tibshirani, R., Wang, P. 2008; 9 (1): 18-29

    Abstract

    We apply the "fused lasso" regression method of (TSRZ2004) to the problem of "hot-spot detection", in particular, detection of regions of gain or loss in comparative genomic hybridization (CGH) data. The fused lasso criterion leads to a convex optimization problem, and we provide a fast algorithm for its solution. Estimates of false-discovery rate are also provided. Our studies show that the new method generally outperforms competing methods for calling gains and losses in CGH data.

    View details for DOI 10.1093/biostatistics/kxm013

    View details for Web of Science ID 000251679400002

    View details for PubMedID 17513312
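
    The abstract above does not spell out the fused lasso criterion. A common way to write it for a one-dimensional signal is a squared-error fit plus an L1 penalty on the coefficients and an L1 penalty on successive differences. The sketch below only evaluates that objective for a candidate fit; the 1/2 scaling, the names lam1 and lam2, and the toy data are assumptions, and the paper's fast solver is not reproduced here.

```python
def fused_lasso_objective(y, beta, lam1, lam2):
    """Evaluate a fused lasso criterion for a 1-D signal.

    y    : observed values (e.g. log-ratios along the genome)
    beta : candidate fitted values, same length as y
    lam1 : weight on sum |beta_i|              (encourages zeros)
    lam2 : weight on sum |beta_i - beta_{i-1}| (encourages flat segments)
    """
    if len(y) != len(beta):
        raise ValueError("y and beta must have the same length")
    fit = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    sparsity = lam1 * sum(abs(b) for b in beta)
    smoothness = lam2 * sum(abs(beta[i] - beta[i - 1]) for i in range(1, len(beta)))
    return fit + sparsity + smoothness


if __name__ == "__main__":
    y = [0.0, 0.0, 2.0, 2.0, 0.0]
    beta = [0.0, 0.0, 1.5, 1.5, 0.0]  # a piecewise-constant "gain" region
    print(fused_lasso_objective(y, beta, lam1=1.0, lam2=1.0))
```

    Each term is convex in beta, so their sum is convex, which matches the abstract's remark that the criterion leads to a convex optimization problem.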

  • Polymorphisms in hypoxia inducible factor 1 and the initial clinical presentation of coronary disease AMERICAN HEART JOURNAL Hlatky, M. A., Quertermous, T., Boothroyd, D. B., Priest, J. R., Glassford, A. J., Myers, R. M., Fortmann, S. P., Iribarren, C., Tabor, H. K., Assimes, T. L., Tibshirani, R. J., Go, A. S. 2007; 154 (6): 1035-1042

    Abstract

    Only some patients with coronary artery disease (CAD) develop acute myocardial infarction (MI), and emerging evidence suggests vulnerability to MI varies systematically among patients and may have a genetic component. The goal of this study was to assess whether polymorphisms in genes encoding elements of pathways mediating the response to ischemia affect vulnerability to MI among patients with underlying CAD. We prospectively identified patients at the time of their initial clinical presentation of CAD who had either an acute MI or stable exertional angina. We collected clinical data and genotyped 34 polymorphisms in 6 genes (ANGPT1, HIF1A, THBS1, VEGFA, VEGFC, VEGFR2). The 909 patients with acute MI were significantly more likely than the 466 patients with stable angina to be male, current smokers, and hypertensive, and less likely to be taking beta-blockers or statins. Three polymorphisms in HIF1A (Pro582Ser, rs11549465; rs1087314; and Thr418Ile, rs41508050) were significantly more common in patients who presented with stable exertional angina rather than acute MI, even after statistical adjustment for cardiac risk factors and medications. The HIF-mediated transcriptional activity was significantly lower when HIF1A null fibroblasts were transfected with variant HIF1A alleles than with wild-type HIF1A alleles. Polymorphisms in HIF1A were associated with development of stable exertional angina rather than acute MI as the initial clinical presentation of CAD.

    View details for DOI 10.1016/j.ahj.2007.07.042

    View details for Web of Science ID 000251396200006

    View details for PubMedID 18035072

  • PATHWISE COORDINATE OPTIMIZATION ANNALS OF APPLIED STATISTICS Friedman, J., Hastie, T., Hoefling, H., Tibshirani, R. 2007; 1 (2): 302-332

    View details for DOI 10.1214/07-AOAS131

    View details for Web of Science ID 000261057600003

  • Anti-idiotype antibody response after vaccination correlates with better overall survival in follicular lymphoma 49th Annual Meeting of the American-Society-of-Hematology Ai, W. Z., Tibshirani, R., Taidi, B., Czerwinski, D., Levy, R. AMER SOC HEMATOLOGY. 2007: 199A–199A
  • Survival in follicular lymphoma: The Stanford experience, 1960-2003. 49th Annual Meeting of the American-Society-of-Hematology Tan, D., Rosenberg, S. A., Levy, R., Lavori, P., Tibshirani, R., Hoppe, R. T., Warnke, R., Advani, R., Natkunam, Y., Yuen, A., Horning, S. J. AMER SOC HEMATOLOGY. 2007: 1005A–1005A
  • LMO2 protein expression predicts survival in patients with diffuse large B-cell lymphoma in the pre- and post-rituximab treatment eras 49th Annual Meeting of the American-Society-of-Hematology Natkunam, Y., Farinha, P., Hsi, E. D., Hans, C. P., Tibshirani, R., Sehn, L. H., Connors, J. M., Zhao, S., Pohlman, B., Spinelli, J., Bast, M., Nagler, A., Levy, R., Gascoyne, R. D., Lossos, I. S. AMER SOC HEMATOLOGY. 2007: 24A–24A
  • Major histocompatibility class II (MHCII) and germinal center associated gene expression correlate with overall survival in rituximab and CHOP-like treated diffuse large B-cell lymphoma (DLBCL) patients, using 49th Annual Meeting of the American-Society-of-Hematology Malumbres, R., Johnson, N. A., Sehn, L. H., Natkunam, Y., Tibshirani, R., Briones, J., Connors, J. M., Levy, R., Gascoyne, R. D., Lossos, I. S. AMER SOC HEMATOLOGY. 2007: 23A–23A
  • Classification and prediction of clinical Alzheimer's diagnosis based on plasma signaling proteins NATURE MEDICINE Ray, S., Britschgi, M., Herbert, C., Takeda-Uchimura, Y., Boxer, A., Blennow, K., Friedman, L. F., Galasko, D. R., Jutel, M., Karydas, A., Kaye, J. A., Leszek, J., Miller, B. L., Minthon, L., Quinn, J. F., Rabinovici, G. D., Robinson, W. H., Sabbagh, M. N., So, Y. T., Sparks, D. L., Tabaton, M., Tinklenberg, J., Yesavage, J. A., Tibshirani, R., Wyss-Coray, T. 2007; 13 (11): 1359-1362

    Abstract

    A molecular test for Alzheimer's disease could lead to better treatment and therapies. We found 18 signaling proteins in blood plasma that can be used to classify blinded samples from Alzheimer's and control subjects with close to 90% accuracy and to identify patients who had mild cognitive impairment that progressed to Alzheimer's disease 2-6 years later. Biological analysis of the 18 proteins points to systemic dysregulation of hematopoiesis, immune responses, apoptosis and neuronal support in presymptomatic Alzheimer's disease.

    View details for DOI 10.1038/nm1653

    View details for Web of Science ID 000250736900029

    View details for PubMedID 17934472

  • On the "degrees of freedom" of the lasso ANNALS OF STATISTICS Zou, H., Hastie, T., Tibshirani, R. 2007; 35 (5): 2173-2192
  • Expression and prognostic significance of a panel of tissue hypoxia markers in head-and-neck squamous cell carcinomas 48th Annual Meeting of the American-Society-for-Therapeutic-Radiology-and-Oncology (ASTRO) Le, Q., Kong, C., Lavori, P. W., O'Byrne, K., Erler, J. T., Huang, X., Chen, Y., Cao, H., Tibshirani, R., Denko, N., Giaccia, A. J., Koong, A. C. ELSEVIER SCIENCE INC. 2007: 167–75

    Abstract

    To investigate the expression pattern of hypoxia-induced proteins identified as being involved in malignant progression of head-and-neck squamous cell carcinoma (HNSCC) and to determine their relationship to tumor pO(2) and prognosis. We performed immunohistochemical staining of hypoxia-induced proteins (carbonic anhydrase IX [CA IX], BNIP3L, connective tissue growth factor, osteopontin, ephrin A1, hypoxia inducible gene-2, dihydrofolate reductase, galectin-1, IkappaB kinase beta, and lysyl oxidase) on tumor tissue arrays of 101 HNSCC patients with pretreatment pO(2) measurements. Analysis of variance and Fisher's exact tests were used to evaluate the relationship between marker expression, tumor pO(2), and CA IX staining. Cox proportional hazard model and log-rank tests were used to determine the relationship between markers and prognosis. Osteopontin expression correlated with tumor pO(2) (Eppendorf measurements) (p = 0.04). However, there was a strong correlation between lysyl oxidase, ephrin A1, and galectin-1 and CA IX staining. These markers also predicted for cancer-specific survival and overall survival on univariate analysis. A hypoxia score of 0-5 was assigned to each patient, on the basis of the presence of strong staining for these markers, whereby a higher score signifies increased marker expression. On multivariate analysis, increasing hypoxia score was an independent prognostic factor for cancer-specific survival (p = 0.015) and was borderline significant for overall survival (p = 0.057) when adjusted for other independent predictors of outcomes (hemoglobin and age). We identified a panel of hypoxia-related tissue markers that correlates with treatment outcomes in HNSCC. Validation of these markers will be needed to determine their utility in identifying patients for hypoxia-targeted therapy.

    View details for DOI 10.1016/j.ijrobp.2007.01.071

    View details for PubMedID 17707270

  • Notch signals positively regulate activity of the mTOR pathway in T-cell acute lymphoblastic leukemia BLOOD Chan, S. M., Weng, A. P., Tibshirani, R., Aster, J. C., Utz, P. J. 2007; 110 (1): 278-286

    Abstract

    Constitutive Notch activation is required for the proliferation of a subgroup of T-cell acute lymphoblastic leukemia (T-ALL). Downstream pathways that transmit pro-oncogenic signals are not well characterized. To identify these pathways, protein microarrays were used to profile the phosphorylation state of 108 epitopes on 82 distinct signaling proteins in a panel of 13 T-cell leukemia cell lines treated with a gamma-secretase inhibitor (GSI) to inhibit Notch signals. The microarray screen detected GSI-induced hypophosphorylation of multiple signaling proteins in the mTOR pathway. This effect was rescued by expression of the intracellular domain of Notch and mimicked by dominant negative MAML1, confirming Notch specificity. Withdrawal of Notch signals prevented stimulation of the mTOR pathway by mitogenic factors. These findings collectively suggest that the mTOR pathway is positively regulated by Notch in T-ALL cells. The effect of GSI on the mTOR pathway was independent of changes in phosphatidylinositol-3 kinase and Akt activity, but was rescued by expression of c-Myc, a direct transcriptional target of Notch, implicating c-Myc as an intermediary between Notch and mTOR. T-ALL cell growth was suppressed in a highly synergistic manner by simultaneous treatment with the mTOR inhibitor rapamycin and GSI, which represents a rational drug combination for treating this aggressive human malignancy.

    View details for DOI 10.1182/blood-2006-08-039883

    View details for Web of Science ID 000247611000041

    View details for PubMedID 17363738

    View details for PubMedCentralID PMC1896117

  • Extracting binary signals from microarray time-course data NUCLEIC ACIDS RESEARCH Sahoo, D., Dill, D. L., Tibshirani, R., Plevritis, S. K. 2007; 35 (11): 3705-3712

    Abstract

    This article presents a new method for analyzing microarray time courses by identifying genes that undergo abrupt transitions in expression level, and the time at which the transitions occur. The algorithm matches the sequence of expression levels for each gene against temporal patterns having one or two transitions between two expression levels. The algorithm reports a P-value for the matching pattern of each gene, and a global false discovery rate can also be computed. After matching, genes can be sorted by the direction and time of transitions. Genes can be partitioned into sets based on the direction and time of change for further analysis, such as comparison with Gene Ontology annotations or binding site motifs. The method is evaluated on simulated and actual time-course data. On microarray data for budding yeast, it is shown that the groups of genes that change in similar ways and at similar times have significant and relevant Gene Ontology annotations.

    View details for DOI 10.1093/nar/gkm284

    View details for PubMedID 17517782
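
    The pattern matching described in the abstract above can be illustrated for the simplest case, a single transition between two expression levels. The sketch tries every possible transition time and keeps the step function with the smallest squared error. It is a simplified illustration under stated assumptions: two-transition patterns and the P-value computation are omitted, and the function name is hypothetical.

```python
def best_single_step(y):
    """Return (k, left_mean, right_mean, sse) for the best one-step fit.

    The fitted pattern equals left_mean for y[:k] and right_mean for y[k:],
    so k is the index of the first time point after the transition.
    """
    if len(y) < 2:
        raise ValueError("need at least two time points")
    best = None
    for k in range(1, len(y)):
        left, right = y[:k], y[k:]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((v - left_mean) ** 2 for v in left)
               + sum((v - right_mean) ** 2 for v in right))
        if best is None or sse < best[3]:
            best = (k, left_mean, right_mean, sse)
    return best


if __name__ == "__main__":
    series = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
    k, low, high, sse = best_single_step(series)
    direction = "up" if high > low else "down"
    print(k, direction)
```

    Once each gene has a fitted transition time and direction, genes can be sorted or grouped by those two values, as the abstract describes.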

  • ON TESTING THE SIGNIFICANCE OF SETS OF GENES ANNALS OF APPLIED STATISTICS Efron, B., Tibshirani, R. 2007; 1 (1): 107-129

    View details for DOI 10.1214/07-AOAS101

    View details for Web of Science ID 000261050400006

  • Oncogenic regulators and substrates of the anaphase promoting complex/cyclosome are frequently overexpressed in malignant tumors AMERICAN JOURNAL OF PATHOLOGY Lehman, N. L., Tibshirani, R., Hsu, J. Y., Natkunam, Y., Harris, B. T., West, R. B., Masek, M. A., Montgomery, K., van de Rijn, M., Jackson, P. K. 2007; 170 (5): 1793-1805

    Abstract

    The fidelity of cell division is dependent on the accumulation and ordered destruction of critical protein regulators. By triggering the appropriately timed, ubiquitin-dependent proteolysis of the mitotic regulatory proteins securin, cyclin B, aurora A kinase, and polo-like kinase 1, the anaphase promoting complex/cyclosome (APC/C) ubiquitin ligase plays an essential role in maintaining genomic stability. Misexpression of these APC/C substrates, individually, has been implicated in genomic instability and cancer. However, no comprehensive survey of the extent of their misregulation in tumors has been performed. Here, we analyzed more than 1600 benign and malignant tumors by immunohistochemical staining of tissue microarrays and found frequent overexpression of securin, polo-like kinase 1, aurora A, and Skp2 in malignant tumors. Positive and negative APC/C regulators, Cdh1 and Emi1, respectively, were also more strongly expressed in malignant versus benign tumors. Clustering and statistical analysis supports the finding that malignant tumors generally show broad misregulation of mitotic APC/C substrates not seen in benign tumors, suggesting that a "mitotic profile" in tumors may result from misregulation of the APC/C destruction pathway. This profile of misregulated mitotic APC/C substrates and regulators in malignant tumors suggests that analysis of this pathway may be diagnostically useful and represent a potentially important therapeutic target.

    View details for DOI 10.2353/ajpath.2007.060767

    View details for PubMedID 17456782

  • Disease-specific genomic analysis: identifying the signature of pathologic biology BIOINFORMATICS Nicolau, M., Tibshirani, R., Borresen-Dale, A., Jeffrey, S. S. 2007; 23 (8): 957-965

    Abstract

    Genomic high-throughput technology generates massive data, providing opportunities to understand countless facets of the functioning genome. It also raises profound issues in identifying data relevant to the biology being studied. We introduce a method for the analysis of pathologic biology that unravels the disease characteristics of high dimensional data. The method, disease-specific genomic analysis (DSGA), is intended to precede standard techniques like clustering or class prediction, and enhance their performance and ability to detect disease. DSGA measures the extent to which the disease deviates from a continuous range of normal phenotypes, and isolates the aberrant component of data. In several microarray cancer datasets, we show that DSGA outperforms standard methods. We then use DSGA to highlight a novel subdivision of an important class of genes in breast cancer, the estrogen receptor (ER) cluster. We also identify new markers distinguishing ductal and lobular breast cancers. Although our examples focus on microarrays, DSGA generalizes to any high dimensional genomic/proteomic data.

    View details for DOI 10.1093/bioinformatics/btm033

    View details for Web of Science ID 000246293000006

    View details for PubMedID 17277331

  • Averaged gene expressions for regression BIOSTATISTICS Park, M. Y., Hastie, T., Tibshirani, R. 2007; 8 (2): 212-227

    Abstract

    Although averaging is a simple technique, it plays an important role in reducing variance. We use this essential property of averaging in regression of the DNA microarray data, which poses the challenge of having far more features than samples. In this paper, we introduce a two-step procedure that combines (1) hierarchical clustering and (2) Lasso. By averaging the genes within the clusters obtained from hierarchical clustering, we define supergenes and use them to fit regression models, thereby attaining concise interpretation and accuracy. Our methods are supported with theoretical justifications and demonstrated on simulated and real data sets.

    View details for DOI 10.1093/biostatistics/kxl002

    View details for Web of Science ID 000245512000004

    View details for PubMedID 16698769

  • Microvessel density and expression of vascular endothelial growth factor and its receptors in diffuse large B-cell lymphoma subtypes AMERICAN JOURNAL OF PATHOLOGY Gratzinger, D., Zhao, S., Marinelli, R. J., Kapp, A. V., Tibshirani, R. J., Hammer, A. S., Hamilton-Dutoit, S., Natkunam, Y. 2007; 170 (4): 1362-1369

    Abstract

    Angiogenesis is known to play a major role in neoplasia, including hematolymphoid neoplasia. We assessed the relationships among angiogenesis and expression of vascular endothelial growth factor and its receptors in the context of clinically and biologically relevant subtypes of diffuse large B-cell lymphoma using immunohistochemical evaluation of tissue microarrays. We found that diffuse large B-cell lymphoma specimens showing higher local vascular endothelial growth factor expression showed correspondingly higher microvessel density, implying that lymphoma cells induce local tumor angiogenesis. In addition, local vascular endothelial growth factor expression was higher in those specimens showing higher expression of the receptors of the growth factor, suggesting an autocrine growth-promoting feedback loop. The germinal center-like and nongerminal center-like subtypes of diffuse large B-cell lymphoma were biologically and prognostically distinct. Interestingly, only in the more clinically aggressive nongerminal center-like subtype were microvessel densities significantly higher in specimens showing higher vascular endothelial growth factor expression; the same was true for the finding of higher vascular endothelial growth factor receptor-1 expression in conjunction with higher vascular endothelial growth factor expression. These differences may have important implications for the responsiveness of the two diffuse large B-cell lymphoma subtypes to anti-vascular endothelial growth factor and anti-angiogenic therapies.

    View details for DOI 10.2353/ajpath.2007.060901

    View details for Web of Science ID 000245233000022

    View details for PubMedID 17392174

    View details for PubMedCentralID PMC1829468

  • Margin trees for high-dimensional classification JOURNAL OF MACHINE LEARNING RESEARCH Tibshirani, R., Hastie, T. 2007; 8: 637-652
  • Outlier sums for differential gene expression analysis BIOSTATISTICS Tibshirani, R., Hastie, T. 2007; 8 (1): 2-8

    Abstract

    We propose a method for detecting genes that, in a disease group, exhibit unusually high gene expression in some but not all samples. This can be particularly useful in cancer studies, where mutations that can amplify or turn off gene expression often occur in only a minority of samples. In real and simulated examples, the new method often exhibits lower false discovery rates than simple t-statistic thresholding. We also compare our approach to the recent cancer profile outlier analysis proposal of Tomlins and others (2005).

    View details for DOI 10.1093/biostatistics/kxl005

    View details for Web of Science ID 000242715400001

    View details for PubMedID 16702229

  • Forward stagewise regression and the monotone lasso ELECTRONIC JOURNAL OF STATISTICS Hastie, T., Taylor, J., Tibshirani, R., Walther, G. 2007; 1: 1-29

    View details for DOI 10.1214/07-EJS004

    View details for Web of Science ID 000207854200001

  • Regularized linear discriminant analysis and its application in microarrays BIOSTATISTICS Guo, Y., Hastie, T., Tibshirani, R. 2007; 8 (1): 86-100

    Abstract

    In this paper, we introduce a modified version of linear discriminant analysis, called the "shrunken centroids regularized discriminant analysis" (SCRDA). This method generalizes the idea of the "nearest shrunken centroids" (NSC) (Tibshirani and others, 2003) into the classical discriminant analysis. The SCRDA method is specially designed for classification problems in high dimension low sample size situations, for example, microarray data. Through both simulated data and real life data, it is shown that this method performs very well in multivariate classification problems, often outperforms the PAM method (using the NSC algorithm) and can be as competitive as the support vector machines classifiers. It is also suitable for feature elimination purpose and can be used as gene selection method. The open source R package for this method (named "rda") is available on CRAN (http://www.r-project.org) for download and testing.

    View details for DOI 10.1093/biostatistics/kxj035

    View details for Web of Science ID 000242715400006

    View details for PubMedID 16603682

  • Are clusters found in one dataset present in another dataset? BIOSTATISTICS Kapp, A. V., Tibshirani, R. 2007; 8 (1): 9-31

    Abstract

    In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be "reproducible" and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the "in-group proportion" (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare the four versions of the validation procedure which all use the IGP, but differ in the way in which the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and one version of the validation procedure is the more widely applicable than the other three. An implementation of this algorithm is in a package called "clusterRepro" available through The Comprehensive R Archive Network (http://cran.r-project.org).

    View details for DOI 10.1093/biostatistics/kxj029

    View details for Web of Science ID 000242715400002

    View details for PubMedID 16613834

  • Tumor-infiltrating T cells are not predictive of clinical outcome in follicular lymphoma. 48th Annual Meeting of the American-Society-of-Hematology Ai, W. Y., Czerwinski, D., Horning, S. J., Allen, J., Tibshirani, R., Levy, R. AMER SOC HEMATOLOGY. 2006: 247A–248A
  • Preliminary report on a phase I/II study of intratumoral injection of PF-3512676 (CpG 7909), a TLR9 agonist, combined with radiation in recurrent low-grade lymphomas. 48th Annual Meeting of the American-Society-of-Hematology Ai, W. Y., Kim, Y., Hoppe, R. T., Shah, S., Horning, S. J., Tibshirani, R., Levy, R. AMER SOC HEMATOLOGY. 2006: 767A–768A
  • Distinct patterns of DNA copy number alteration are associated with different clinicopathological features and gene-expression subtypes of breast cancer GENES CHROMOSOMES & CANCER Bergamaschi, A., Kim, Y. H., Wang, P., Sorlie, T., Hernandez-Boussard, T., Lonning, P. E., Tibshirani, R., Borresen-Dale, A., Pollack, J. R. 2006; 45 (11): 1033-1040

    Abstract

    Breast cancer is a leading cause of cancer-death among women, where the clinicopathological features of tumors are used to prognosticate and guide therapy. DNA copy number alterations (CNAs), which occur frequently in breast cancer and define key pathogenetic events, are also potentially useful prognostic or predictive factors. Here, we report a genome-wide array-based comparative genomic hybridization (array CGH) survey of CNAs in 89 breast tumors from a patient cohort with locally advanced disease. Statistical analysis links distinct cytoband loci harboring CNAs to specific clinicopathological parameters, including tumor grade, estrogen receptor status, presence of TP53 mutation, and overall survival. Notably, distinct spectra of CNAs also underlie the different subtypes of breast cancer recently defined by expression-profiling, implying these subtypes develop along distinct genetic pathways. In addition, higher numbers of gains/losses are associated with the "basal-like" tumor subtype, while high-level DNA amplification is more frequent in "luminal-B" subtype tumors, suggesting also that distinct mechanisms of genomic instability might underlie their pathogenesis. The identified CNAs may provide a basis for improved patient prognostication, as well as a starting point to define important genes to further our understanding of the pathobiology of breast cancer. This article contains Supplementary Material available at http://www.interscience.wiley.com/jpages/1045-2257/suppmat

    View details for DOI 10.1002/gcc.20366

    View details for Web of Science ID 000240601400005

    View details for PubMedID 16897746

  • Discovery and validation of breast cancer subtypes BMC GENOMICS Kapp, A. V., Jeffrey, S. S., Langerod, A., Borresen-Dale, A., Han, W., Noh, D., Bukholm, I. R., Nicolau, M., Brown, P. O., Tibshirani, R. 2006; 7

    Abstract

    Previous studies demonstrated breast cancer tumor tissue samples could be classified into different subtypes based upon DNA microarray profiles. The most recent study presented evidence for the existence of five different subtypes: normal breast-like, basal, luminal A, luminal B, and ERBB2+. Based upon the analysis of 599 microarrays (five separate cDNA microarray datasets) using a novel approach, we present evidence in support of the most consistently identifiable subtypes of breast cancer tumor tissue microarrays being: ESR1+/ERBB2-, ESR1-/ERBB2-, and ERBB2+ (collectively called the ESR1/ERBB2 subtypes). We validate all three subtypes statistically and show the subtype to which a sample belongs is a significant predictor of overall survival and distant-metastasis free probability. As a consequence of the statistical validation procedure we have a set of centroids which can be applied to any microarray (indexed by UniGene Cluster ID) to classify it to one of the ESR1/ERBB2 subtypes. Moreover, the method used to define the ESR1/ERBB2 subtypes is not specific to the disease. The method can be used to identify subtypes in any disease for which there are at least two independent microarray datasets of disease samples.

    View details for DOI 10.1186/1471-2164-7-231

    View details for Web of Science ID 000240732900001

    View details for PubMedID 16965636

    View details for PubMedCentralID PMC1574316

  • Global transcriptional response to interferon is a determinant of HCV treatment outcome and is modified by race HEPATOLOGY He, X., Ji, X., Hale, M. B., Cheung, R., Ahmed, A., Guo, Y., Nolan, G. P., Pfeffer, L. M., Wright, T. L., Risch, N., Tibshirani, R., Greenberg, H. B. 2006; 44 (2): 352-359

    Abstract

    Interferon (IFN)-alpha-based therapy for chronic hepatitis C is effective in fewer than 50% of all treated patients, with a substantially lower response rate in black patients. The goal of this study was to investigate the underlying host transcriptional response associated with interferon treatment outcomes. We collected peripheral blood mononuclear cells from chronic hepatitis C patients before initiation of IFN-alpha therapy and incubated the cells with or without IFN-alpha for 6 hours, followed by microarray assay to identify IFN-induced gene transcription. The microarray datasets were analyzed statistically according to the patients' race and virological responses to subsequent IFN-alpha treatment. The global induction of IFN-stimulated genes (ISGs) was significantly greater in sustained virological responders compared with nonresponders and in white patients compared with black patients. In addition, a significantly greater global induction of ISGs was observed in sustained virological responders compared with nonresponders within the group of white patients. The level of IFN-induced signal transducer and activator of transcription (STAT) 1 activation, a key component of the Janus kinase (JAK)-STAT signaling pathway, correlated with the global induction of ISGs and was significantly higher in white patients than in black patients. In conclusion, both treatment outcome and race are associated with different transcriptional responses to IFN-alpha. Because this difference is evident in the global induction of ISGs rather than a selective effect on a subset of such genes, key factors affecting the outcome of IFN-alpha therapy are likely to act at the JAK-STAT pathway that controls transcription of downstream ISGs.

    View details for DOI 10.1002/hep.21267

    View details for PubMedID 16871572

  • Sparse principal component analysis JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Zou, H., Hastie, T., Tibshirani, R. 2006; 15 (2): 265-286
  • A tail strength measure for assessing the overall univariate significance in a dataset BIOSTATISTICS Taylor, J., Tibshirani, R. 2006; 7 (2): 167-181

    Abstract

    We propose an overall measure of significance for a set of hypothesis tests. The 'tail strength' is a simple function of the p-values computed for each of the tests. This measure is useful, for example, in assessing the overall univariate strength of a large set of features in microarray and other genomic and biomedical studies. It also has a simple relationship to the false discovery rate of the collection of tests. We derive the asymptotic distribution of the tail strength measure, and illustrate its use on a number of real datasets.

    View details for DOI 10.1093/biostatistics/kxj009

    View details for Web of Science ID 000236436300001

    View details for PubMedID 16332926

  • Hybrid hierarchical clustering with applications to microarray data BIOSTATISTICS Chipman, H., Tibshirani, R. 2006; 7 (2): 286-301

    Abstract

    In this paper, we propose a hybrid clustering method that combines the strengths of bottom-up hierarchical clustering with that of top-down clustering. The first method is good at identifying small clusters but not large ones; the strengths are reversed for the second method. The hybrid method is built on the new idea of a mutual cluster: a group of points closer to each other than to any other points. Theoretical connections between mutual clusters and bottom-up clustering methods are established, aiding in their interpretation and providing an algorithm for identification of mutual clusters. We illustrate the technique on simulated and real microarray datasets.

    View details for DOI 10.1093/biostatistics/kxj007

    View details for Web of Science ID 000236436300009

    View details for PubMedID 16301308

  • A simple method for assessing sample sizes in microarray experiments BMC BIOINFORMATICS Tibshirani, R. 2006; 7

    Abstract

    In this short article, we discuss a simple method for assessing sample size requirements in microarray experiments. Our method starts with the output from a permutation-based analysis for a set of pilot data, e.g. from the SAM package. Then for a given hypothesized mean difference and various sample sizes, we estimate the false discovery rate and false negative rate of a list of genes; these are also interpretable as per gene power and type I error. We also discuss application of our method to other kinds of response variables, for example survival outcomes. Our method seems to be useful for sample size assessment in microarray experiments.

    View details for DOI 10.1186/1471-2105-7-106

    View details for Web of Science ID 000237138600001

    View details for PubMedID 16512900

    View details for PubMedCentralID PMC1450307

  • An evaluation of tumor oxygenation and gene expression in patients with early stage non-small cell lung cancers CLINICAL CANCER RESEARCH Le, Q. T., Chen, E., Salim, A., Cao, H. B., Kong, C. S., Whyte, R., Donington, J., Cannon, W., Wakelee, H., Tibshirani, R., Mitchell, J. D., Richardson, D., O'Byrne, K. J., Koong, A. C., Giaccia, A. J. 2006; 12 (5): 1507-1514

    Abstract

    To directly assess tumor oxygenation in resectable non-small cell lung cancers (NSCLC) and to correlate tumor pO2 and the selected gene and protein expression to treatment outcomes. Twenty patients with resectable NSCLC were enrolled. Intraoperative measurements of normal lung and tumor pO2 were done with the Eppendorf polarographic electrode. All patients had plasma osteopontin measurements by ELISA. Carbonic anhydrase-IX (CA IX) staining of tumor sections was done in the majority of patients (n = 16), as was gene expression profiling (n = 12) using cDNA microarrays. Tumor pO2 was correlated with CA IX staining, osteopontin levels, and treatment outcomes. The median tumor pO2 ranged from 0.7 to 46 mm Hg (median, 16.6) and was lower than normal lung pO2 in all but one patient. Because both variables were affected by the completeness of lung deflation during measurement, we used the ratio of tumor/normal lung (T/L) pO2 as a reflection of tumor oxygenation. The median T/L pO2 was 0.13. T/L pO2 correlated significantly with plasma osteopontin levels (r = 0.53, P = 0.02) and CA IX expression (P = 0.006). Gene expression profiling showed that high CD44 expression was a predictor for relapse, which was confirmed by tissue staining of CD44 variant 6 protein. Other variables associated with the risk of relapse were T stage (P = 0.02), T/L pO2 (P = 0.04), and osteopontin levels (P = 0.001). Tumor hypoxia exists in resectable NSCLC and is associated with elevated expression of osteopontin and CA IX. Tumor hypoxia and elevated osteopontin levels and CD44 expression correlated with poor prognosis. A larger study is needed to confirm the prognostic significance of these factors.

    View details for DOI 10.1158/1078-0432.CCR-05-2049

    View details for PubMedID 16533775

  • Prediction by supervised principal components JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Bair, E., Hastie, T., Paul, D., Tibshirani, R. 2006; 101 (473): 119-137
  • Changes of gene expression in gastric preneoplasia following Helicobacter pylori eradication therapy CANCER EPIDEMIOLOGY BIOMARKERS & PREVENTION Tsai, C. J., Herrera-Goepfert, R., Tibshirani, R. J., Yang, S. F., Mohar, A., Guarner, J., Parsonnet, J. 2006; 15 (2): 272-280

    Abstract

    Helicobacter pylori causes gastric preneoplasia and neoplasia. Eradicating H. pylori can result in partial regression of preneoplastic lesions; however, the molecular underpinning of this change is unknown. To identify molecular changes in the gastric mucosa following H. pylori eradication, we used cDNA microarrays (with each array containing approximately 30,300 genes) to analyze 54 gastric biopsies from a randomized, placebo-controlled trial of H. pylori therapy. The 54 biopsies were obtained from 27 subjects (13 from the treatment and 14 from the placebo group) with chronic gastritis, atrophy, and/or intestinal metaplasia. Each subject contributed one biopsy before and another biopsy 1 year after the intervention. Significance analysis of microarrays (SAM) was used to compare the gene expression profiles of pre-intervention and post-intervention biopsies. In the treatment group, SAM identified 30 genes whose expression changed significantly from baseline to 1 year after treatment (0 up-regulated and 30 down-regulated). In the placebo group, the expression of 55 genes differed significantly over the 1-year period (32 up-regulated and 23 down-regulated). Five genes involved in cell-cell adhesion and lining (TACSTD1 and MUC13), cell cycle differentiation (S100A10), and lipid metabolism and transport (FABP1 and MTP) were down-regulated over time in the treatment group but up-regulated in the placebo group. Immunohistochemistry for one of these differentially expressed genes (FABP1) confirmed the changes in gene expression observed by microarray. In conclusion, H. pylori eradication may stop or reverse ongoing molecular processes in the stomach. Further studies are needed to evaluate the use of these genes as markers for gastric cancer risk.

    View details for DOI 10.1158/1055-9965.EPI-05-0362

    View details for Web of Science ID 000235587200012

    View details for PubMedID 16492915

  • Gene expression profiling predicts survival in conventional renal cell carcinoma PLOS MEDICINE Zhao, H. J., Ljungberg, B., Grankvist, K., Rasmuson, T., Tibshirani, R., Brooks, J. D. 2006; 3 (1): 115-124

    Abstract

    Conventional renal cell carcinoma (cRCC) accounts for most of the deaths due to kidney cancer. Tumor stage, grade, and patient performance status are used currently to predict survival after surgery. Our goal was to identify gene expression features, using comprehensive gene expression profiling, that correlate with survival. Gene expression profiles were determined in 177 primary cRCCs using DNA microarrays. Unsupervised hierarchical clustering analysis segregated cRCC into five gene expression subgroups. Expression subgroup was correlated with survival in long-term follow-up and was independent of grade, stage, and performance status. The tumors were then divided evenly into training and test sets that were balanced for grade, stage, performance status, and length of follow-up. A semisupervised learning algorithm (supervised principal components analysis) was applied to identify transcripts whose expression was associated with survival in the training set, and the performance of this gene expression-based survival predictor was assessed using the test set. With this method, we identified 259 genes that accurately predicted disease-specific survival among patients in the independent validation group (p < 0.001). In multivariate analysis, the gene expression predictor was a strong predictor of survival independent of tumor stage, grade, and performance status (p < 0.001). cRCC displays molecular heterogeneity and can be separated into gene expression subgroups that correlate with survival after surgery. We have identified a set of 259 genes that predict survival after surgery independent of clinical prognostic factors.

    View details for DOI 10.1371/journal.pmed.0030013

    View details for Web of Science ID 000236342700020

    View details for PubMedID 16318415

    View details for PubMedCentralID PMC1298943

  • Autoantibody profiling of lupus mice deficient for interferon signaling components. 6th Annual Meeting of the Federation-of-Clinical-Immunology-Societies Thibault, D., Graham, K., Balboni, I., Lee, L., Kohlmoos, C., Tibshirani, R., Utz, P. ACADEMIC PRESS INC ELSEVIER SCIENCE. 2006: S72–S73
  • Combined microarray analysis of small cell lung cancer reveals altered apoptotic balance and distinct expression signatures of MYC family gene amplification ONCOGENE Kim, Y. H., Girard, L., Giacomini, C. P., Wang, P., Hernandez-Boussard, T., Tibshirani, R., Minna, J. D., Pollack, J. R. 2006; 25 (1): 130-138

    Abstract

    DNA amplifications and deletions frequently contribute to the development and progression of lung cancer. To identify such novel alterations in small cell lung cancer (SCLC), we performed comparative genomic hybridization on a set of 24 SCLC cell lines, using cDNA microarrays representing approximately 22,000 human genes (providing an average mapping resolution of <70 kb). We identified localized DNA amplifications corresponding to oncogenes known to be amplified in SCLC, including MYC (8q24), MYCN (2p24) and MYCL1 (1p34). Additional highly localized DNA amplifications suggested candidate oncogenes not previously identified as amplified in SCLC, including the antiapoptotic genes TNFRSF4 (1p36), DAD1 (14q11), BCL2L1 (20q11) and BCL2L2 (14q11). Likewise, newly discovered PCR-validated homozygous deletions suggested candidate tumor-suppressor genes, including the proapoptotic genes MAPK10 (4q21) and TNFRSF6 (10q23). To characterize the effect of DNA amplification on gene expression patterns, we performed expression profiling using the same microarray platform. Among our findings, we identified sets of genes whose expression correlated with MYC, MYCN or MYCL1 amplification, with surprisingly little overlap among gene sets. While both MYC and MYCN amplification were associated with increased and decreased expression of known MYC upregulated and downregulated targets, respectively, MYCL1 amplification was associated only with the latter. Our findings support a role of altered apoptotic balance in the pathogenesis of SCLC, and suggest that MYC family genes might affect oncogenesis through distinct sets of targets, in particular implicating the importance of transcriptional repression.

    View details for DOI 10.1038/sj.onc.1208997

    View details for Web of Science ID 000234406400014

    View details for PubMedID 16116477

  • Gene expression profiling differentiates germ cell tumors from other cancers and defines subtype-specific signatures PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Juric, D., Sale, S., Hromas, R. A., Yu, R., Wang, Y., Duran, G. E., Tibshirani, R., Einhorn, L. H., Sikic, B. I. 2005; 102 (49): 17763-17768

    Abstract

    Germ cell tumors (GCTs) of the testis are the predominant cancer among young men. We analyzed gene expression profiles of 50 GCTs of various subtypes, and we compared them with 443 other common malignant tumors of epithelial, mesenchymal, and lymphoid origins. Significant differences in gene expression were found among major histological subtypes of GCTs, and between them and other malignancies. We identified 511 genes, belonging to several critical functional groups such as cell cycle progression, cell proliferation, and apoptosis, to be significantly differentially expressed in GCTs compared with other tumor types. Sixty-five genes were sufficient for the construction of a GCT class predictor of high predictive accuracy (100% training set, 96% test set), which might be useful in the diagnosis of tumors of unknown primary origin. Previously described diagnostic and prognostic markers were found to be expressed by the appropriate GCT subtype (AFP, POU5F1, POV1, CCND2, and KIT). Several additional differentially expressed genes were identified in teratomas (EGR1 and MMP7), yolk sac tumors (PTPN13 and FN1), and seminomas (NR6A1, DPPA4, and IRX1). Dynamic computation of interaction networks and mapping to existing pathways knowledge databases revealed a potential role of EGR1 in p21-induced cell cycle arrest and intrinsic chemotherapy resistance of mature teratomas.

    View details for DOI 10.1073/pnas.0509082102

    View details for PubMedID 16306258

  • Differential gene expression profiles in CD34+myelodysplastic syndrome marrow cells. 47th Annual Meeting of the American-Society-of-Hematology Sridhar, K., Brown, P. O., Tibshirani, R., Jamieson, C., Weissman, I., Ross, D. T., Greenberg, P. L. AMER SOC HEMATOLOGY. 2005: 956A–956A
  • Gene expression profiling and FLT3 status correlate with outcome in de novo acute myeloid leukemia (AML) with normal karyotype: Results of children's oncology group (COG) study POG #9421. 47th Annual Meeting of the American-Society-of-Hematology Lacayo, N., Meshinchi, S., Raimondi, S., Saraiya, C., O'Brien, M., Yu, R., Juric, D., Chang, M., Willman, C., Tibshirani, R., Ravindranath, Y., Sikic, B., Weinstein, H., Dahl, G. V. AMER SOC HEMATOLOGY. 2005: 667A–667A
  • Cluster validation by prediction strength JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Tibshirani, R., Walther, G. 2005; 14 (3): 511-528
  • Signature patterns of gene expression in mouse atherosclerosis and their correlation to human coronary disease PHYSIOLOGICAL GENOMICS Tabibiazar, R., Wagner, R. A., Ashley, E. A., King, J. Y., Ferrara, R., Spin, J. M., Sanan, D. A., Narasimhan, B., Tibshirani, R., Tsao, P. S., Efron, B., Quertermous, T. 2005; 22 (2): 213-226

    Abstract

    The propensity for developing atherosclerosis is dependent on underlying genetic risk and varies as a function of age and exposure to environmental risk factors. Employing three mouse models with different disease susceptibility, two diets, and a longitudinal experimental design, it was possible to manipulate each of these factors to focus analysis on genes most likely to have a specific disease-related function. To identify differences in longitudinal gene expression patterns of atherosclerosis, we have developed and employed a statistical algorithm that relies on generalized regression and permutation analysis. Comprehensive annotation of the array with ontology and pathway terms has allowed rigorous identification of molecular and biological processes that underlie disease pathophysiology. The repertoire of atherosclerosis-related immunomodulatory genes has been extended, and additional fundamental pathways have been identified. This highly disease-specific group of mouse genes was combined with an extensive human coronary artery data set to identify a shared group of genes differentially regulated among atherosclerotic tissues from different species and different vascular beds. A small core subset of these differentially regulated genes was sufficient to accurately classify various stages of the disease in mouse. The same gene subset was also found to accurately classify human coronary lesion severity. In addition, this classifier gene set was able to distinguish with high accuracy atherectomy specimens from native coronary artery disease vs. those collected from in-stent restenosis lesions, thus identifying molecular differences between these two processes. These studies significantly focus efforts aimed at identifying central gene regulatory pathways that mediate atherosclerotic disease, and the identification of classification gene sets offers unique insights into potential diagnostic and therapeutic strategies in atherosclerotic disease.

    View details for DOI 10.1152/physiolgenomics.00001.2005

    View details for Web of Science ID 000230987900011

    View details for PubMedID 15870398

  • Array-based comparative genomic hybridization identifies localized DNA amplifications and homozygous deletions in pancreatic cancer NEOPLASIA Bashyam, M. D., Bair, R., Kim, Y. H., Wang, P., Hernandez-Boussard, T., Karikari, C. A., Tibshirani, R., Maitra, A., Pollack, J. R. 2005; 7 (6): 556-562

    Abstract

    Pancreatic cancer, the fourth leading cause of cancer death in the United States, is frequently associated with the amplification and deletion of specific oncogenes and tumor-suppressor genes (TSGs), respectively. To identify such novel alterations and to discover the underlying genes, we performed comparative genomic hybridization on a set of 22 human pancreatic cancer cell lines, using cDNA microarrays measuring approximately 26,000 human genes (thereby providing an average mapping resolution of <60 kb). To define the subset of amplified and deleted genes with correspondingly altered expression, we also profiled mRNA levels in parallel using the same cDNA microarray platform. In total, we identified 14 high-level amplifications (38-4934 kb in size) and 15 homozygous deletions (46-725 kb). We discovered novel localized amplicons, suggesting previously unrecognized candidate oncogenes at 6p21, 7q21 (SMURF1, TRRAP), 11q22 (BIRC2, BIRC3), 12p12, 14q24 (TGFB3), 17q12, and 19q13. Likewise, we identified novel polymerase chain reaction-validated homozygous deletions indicating new candidate TSGs at 6q25, 8p23, 8p22 (TUSC3), 9q33 (TNC, TNFSF15), 10q22, 10q24 (CHUK), 11p15 (DKK3), 16q23, 18q23, 21q22 (PRDM15, ANKRD3), and Xp11. Our findings suggest candidate genes and pathways, which may contribute to the development or progression of pancreatic cancer.

    View details for DOI 10.1593/neo.04586

    View details for Web of Science ID 000230209600002

    View details for PubMedID 16036106

    View details for PubMedCentralID PMC1501288

  • Genome-wide characterization of gene expression variations and DNA copy number changes in prostate cancer cell lines PROSTATE Zhao, H. J., Kim, Y., Wang, P., Lapointe, J., Tibshirani, R., Pollack, J. R., Brooks, J. D. 2005; 63 (2): 187-197

    Abstract

    The aim of this study was to characterize gene expression and DNA copy number profiles in androgen sensitive (AS) and androgen insensitive (AI) prostate cancer cell lines on a genome-wide scale. Gene expression profiles and DNA copy number changes were examined using DNA microarrays in eight commonly used prostate cancer cell lines. Chromosomal regions with DNA copy number changes were identified using cluster along chromosome (CLAC). There were discrete differences in gene expression patterns between AS and AI cells that were not limited to androgen-responsive genes. AI cells displayed more DNA copy number changes, especially amplifications, than AS cells. The gene expression profiles of cell lines showed limited similarities to prostate tumors harvested at surgery. AS and AI cell lines are different in their transcriptional programs and degree of DNA copy number alterations. This dataset provides a context for the use of prostate cancer cell lines as models for clinical cancers.

    View details for DOI 10.1002/pros.20158

    View details for PubMedID 15486987

  • Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Chang, H. Y., Nuyten, D. S., Sneddon, J. B., Hastie, T., Tibshirani, R., Sorlie, T., Dai, H. Y., He, Y. D., Van't Veer, L. J., Bartelink, H., van de Rijn, M., Brown, P. O., van de Vijver, M. J. 2005; 102 (10): 3738-3743

    Abstract

    Based on the hypothesis that features of the molecular program of normal wound healing might play an important role in cancer metastasis, we previously identified consistent features in the transcriptional response of normal fibroblasts to serum, and used this "wound-response signature" to reveal links between wound healing and cancer progression in a variety of common epithelial tumors. Here, in a consecutive series of 295 early breast cancer patients, we show that both overall survival and distant metastasis-free survival are markedly diminished in patients whose tumors expressed this wound-response signature compared to tumors that did not express this signature. A gene expression centroid of the wound-response signature provides a basis for prospectively assigning a prognostic score that can be scaled to suit different clinical purposes. The wound-response signature improves risk stratification independently of known clinico-pathologic risk factors and previously established prognostic signatures based on unsupervised hierarchical clustering ("molecular subtypes") or supervised predictors of metastasis ("70-gene prognosis signature").

    View details for DOI 10.1073/pnas.0409462102

    View details for PubMedID 15701700
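
The abstract above describes scoring a tumor against a gene expression centroid of the signature. The sketch below is one way such a score could look, assuming the score is the Pearson correlation between a tumor's expression values and the centroid over the signature genes. The function names, the toy values, the class labels, and the cutoff of 0.0 are all hypothetical and are not taken from the article.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def centroid(profiles):
    """Gene-wise mean of several expression profiles (rows = samples)."""
    n_genes = len(profiles[0])
    return [sum(row[g] for row in profiles) / len(profiles) for g in range(n_genes)]


def signature_score(tumor, signature_centroid):
    """Continuous score: correlation of the tumor profile with the centroid."""
    return pearson(tumor, signature_centroid)


def classify(score, cutoff=0.0):
    """Turn the score into a call; the cutoff is a hypothetical tuning knob."""
    return "signature-positive" if score > cutoff else "signature-negative"


# Toy reference profiles over four hypothetical signature genes.
reference = [[1.0, -1.0, 0.5, -0.5], [1.2, -0.8, 0.3, -0.7]]
ref_centroid = centroid(reference)

tumor_a = [2.0, -1.5, 1.0, -1.0]   # follows the reference pattern
tumor_b = [-1.0, 1.0, -0.2, 0.4]   # opposes the reference pattern
score_a = signature_score(tumor_a, ref_centroid)
score_b = signature_score(tumor_b, ref_centroid)
print(classify(score_a), classify(score_b))
```

Because the score is continuous, raising or lowering the cutoff trades sensitivity against specificity, which is one reading of a score that can be scaled to suit different clinical purposes.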

  • Mouse strain-specific differences in vascular wall gene expression and their relationship to vascular disease ARTERIOSCLEROSIS THROMBOSIS AND VASCULAR BIOLOGY Tabibiazar, R., Wagner, R. A., Spin, J. M., Ashley, E. A., Narasimhan, B., Rubin, E. M., Efron, B., Tsao, P. S., Tibshirani, R., Quertermous, T. 2005; 25 (2): 302-308

    Abstract

    Different strains of inbred mice exhibit different susceptibility to the development of atherosclerosis. The C3H/HeJ and C57Bl/6 mice have been used in several studies aimed at understanding the genetic basis of atherosclerosis. Under controlled environmental conditions, variations in susceptibility to atherosclerosis reflect differences in genetic makeup, and these differences must be reflected in gene expression patterns that are temporally related to the development of disease. In this study, we sought to identify the genetic pathways that are differentially activated in the aortas of these mice. We performed genome-wide transcriptional profiling of aortas from C3H/HeJ and C57Bl/6 mice. Differences in gene expression were identified at baseline as well as during normal aging and longitudinal exposure to high-fat diet. The significance of these genes to the development of atherosclerosis was evaluated by observing their temporal pattern of expression in the well-studied apolipoprotein E model of atherosclerosis. Gene expression differences between the 2 strains suggest that aortas of C57Bl/6 mice have a higher genetic propensity to develop inflammation in response to appropriate atherogenic stimuli. This study expands the repertoire of factors in known disease-related signaling pathways and identifies novel candidate genes for future study. To gain insights into the molecular pathways that are differentially activated in strains of mice with varied susceptibility to atherosclerosis, we performed comprehensive transcriptional profiling of their vascular wall. Genes identified through these studies expand the repertoire of factors in disease-related signaling pathways and identify novel candidate genes in atherosclerosis.

    View details for DOI 10.1161/01.ATV.0000151372.86863.a5

    View details for Web of Science ID 000226594000009

    View details for PubMedID 15550693

  • The 'miss rate' for the analysis of gene expression data BIOSTATISTICS Taylor, J., Tibshirani, R., Efron, B. 2005; 6 (1): 111-117

    Abstract

    Multiple testing issues are important in gene expression studies, where typically thousands of genes are compared over two or more experimental conditions. The false discovery rate has become a popular measure in this setting. Here we discuss a complementary measure, the 'miss rate', and show how to estimate it in practice.

    View details for DOI 10.1093/biostatistics/kxh021

    View details for Web of Science ID 000226346300009

    View details for PubMedID 15618531

  • Early detection of breast cancer based on gene-expression patterns in peripheral blood cells BREAST CANCER RESEARCH Sharma, P., Sahni, N. S., Tibshirani, R., Skaane, P., Urdal, P., Berghagen, H., Jensen, M., Kristiansen, L., Moen, C., Sharma, P., Zaka, A., Arnes, J., Sauer, T., Akslen, L. A., Schlichting, E., Borresen-Dale, A. L., Lonneborg, A. 2005; 7 (5): R634-R644

    Abstract

    Existing methods to detect breast cancer in asymptomatic patients have limitations, and there is a need to develop more accurate and convenient methods. In this study, we investigated whether early detection of breast cancer is possible by analyzing gene-expression patterns in peripheral blood cells. Using macroarrays and the nearest-shrunken-centroid method, we analyzed the expression pattern of 1,368 genes in peripheral blood cells of 24 women with breast cancer and 32 women with no signs of this disease. The results were validated using a standard leave-one-out cross-validation approach. We identified a set of 37 genes that correctly predicted the diagnostic class in at least 82% of the samples. The majority of these genes had a decreased expression in samples from breast cancer patients, and predominantly encoded proteins implicated in ribosome production and translation control. In contrast, the expression of some defense-related genes was increased in samples from breast cancer patients. The results show that a blood-based gene-expression test can be developed to detect breast cancer early in asymptomatic patients. Additional studies with a large sample size, from women both with and without the disease, are warranted to confirm or refute this finding.

    View details for DOI 10.1186/bcr1203

    View details for Web of Science ID 000232332200021

    View details for PubMedID 16168108

    View details for PubMedCentralID PMC1242124
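
The abstract above combines the nearest-shrunken-centroid method with leave-one-out cross-validation. The sketch below illustrates the general idea on toy data. It is deliberately simplified: the published method also standardizes each gene by a pooled within-class standard deviation, which is omitted here. The data, function names, and the threshold `delta` are hypothetical.

```python
from statistics import mean


def soft_threshold(value, delta):
    """Move value toward zero by delta; anything within delta of zero becomes 0."""
    if value > delta:
        return value - delta
    if value < -delta:
        return value + delta
    return 0.0


def fit_shrunken_centroids(samples, labels, delta):
    """Return {label: centroid}, each class centroid shrunk toward the overall centroid."""
    n_genes = len(samples[0])
    overall = [mean(row[g] for row in samples) for g in range(n_genes)]
    centroids = {}
    for label in sorted(set(labels)):
        rows = [row for row, lab in zip(samples, labels) if lab == label]
        class_mean = [mean(row[g] for row in rows) for g in range(n_genes)]
        centroids[label] = [
            overall[g] + soft_threshold(class_mean[g] - overall[g], delta)
            for g in range(n_genes)
        ]
    return centroids


def predict(centroids, sample):
    """Assign the sample to the class with the closest shrunken centroid."""
    def squared_distance(center):
        return sum((x - m) ** 2 for x, m in zip(sample, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def loocv_accuracy(samples, labels, delta):
    """Leave-one-out cross-validation: refit without sample i, then predict it."""
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit_shrunken_centroids(train_x, train_y, delta)
        hits += predict(model, samples[i]) == labels[i]
    return hits / len(samples)


# Toy data: gene 0 separates the classes, genes 1 and 2 carry no class signal.
samples = [
    [2.0, 0.1, 5.0], [2.2, -0.1, 5.1], [1.8, 0.0, 4.9],
    [0.0, 0.1, 5.0], [0.2, -0.1, 5.1], [-0.2, 0.0, 4.9],
]
labels = ["cancer"] * 3 + ["control"] * 3

model = fit_shrunken_centroids(samples, labels, delta=0.5)
accuracy = loocv_accuracy(samples, labels, delta=0.5)
print(model, accuracy)
```

Genes whose class centroids coincide after shrinkage no longer influence the prediction, which is how the approach arrives at a small gene set.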

  • Sparsity and smoothness via the fused lasso JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K. 2005; 67: 91-108
  • CSF1 expression signature identifies a subset of breast carcinomas and influences outcome. 28th Annual San Antonio Breast Cancer Symposium West, R. B., Horlings, H., Nuyten, D. S., Subramanian, S., Zhu, S. X., Miller, M., Rubin, B. P., Nielsen, T. O., Gilks, C. B., Huntsman, D. G., Tibshirani, R., van De Vijver, M., van de Rijn, M. SPRINGER. 2005: S135–S135
  • A method for calling gains and losses in array CGH data BIOSTATISTICS Wang, P., Kim, Y., Pollack, J., Narasimhan, B., Tibshirani, R. 2005; 6 (1): 45-58

    Abstract

    Array CGH is a powerful technique for genomic studies of cancer. It enables one to carry out genome-wide screening for regions of genetic alterations, such as chromosome gains and losses, or localized amplifications and deletions. In this paper, we propose a new algorithm 'Cluster along chromosomes' (CLAC) for the analysis of array CGH data. CLAC builds hierarchical clustering-style trees along each chromosome arm (or chromosome), and then selects the 'interesting' clusters by controlling the False Discovery Rate (FDR) at a certain level. In addition, it provides a consensus summary across a set of arrays, as well as an estimate of the corresponding FDR. We illustrate the method using an application of CLAC on a lung cancer microarray CGH data set as well as a BAC array CGH data set of aneuploid cell strains.

    View details for DOI 10.1093/biostatistics/kxh017

    View details for Web of Science ID 000226346300005

    View details for PubMedID 15618527

  • Sample classification from protein mass spectrometry, by 'peak probability contrasts' BIOINFORMATICS Tibshirani, R., Hastie, T., Narasimhan, B., Soltys, S., Shi, G. Y., Koong, A., Le, Q. T. 2004; 20 (17): 3034-3044

    Abstract

    Early cancer detection has always been a major research focus in solid tumor oncology. Early tumor detection can theoretically result in lower stage tumors, more treatable diseases and ultimately higher cure rates with less treatment-related morbidities. Protein mass spectrometry is a potentially powerful tool for early cancer detection. We propose a novel method for sample classification from protein mass spectrometry data. When applied to spectra from both diseased and healthy patients, the 'peak probability contrast' technique provides a list of all common peaks among the spectra, their statistical significance and their relative importance in discriminating between the two groups. We illustrate the method on matrix-assisted laser desorption and ionization mass spectrometry data from a study of ovarian cancers. Compared to other statistical approaches for class prediction, the peak probability contrast method performs as well or better than several methods that require the full spectra, rather than just labelled peaks. It is also much more interpretable biologically. The peak probability contrast method is a potentially useful tool for sample classification from protein mass spectrometry data.

    View details for DOI 10.1093/bioinformatics/bth357

    View details for Web of Science ID 000225361400017

    View details for PubMedID 15226172

  • The percentage of tumor-infiltrating T cells is not correlated with overall survival in follicular B-cell lymphomas 46th Annual Meeting of the American-Society-of-Hematology Ai, W. Y., Czerwinski, D. K., Tibshirani, R., Horning, S. J., Levy, R. AMER SOC HEMATOLOGY. 2004: 891A–891A
  • Gene expression profiles at diagnosis in de novo childhood AML patients identify FLT3 mutations with good clinical outcomes BLOOD Lacayo, N. J., Meshinchi, S., Kinnunen, P., Yu, R., Wang, Y., Stuber, C. M., Douglas, L., Wahab, R., Becton, D. L., Weinstein, H., Chang, M. N., Willman, C. L., Radich, J. P., Tibshirani, R., Ravindranath, Y., Sikic, B. I., Dahl, G. V. 2004; 104 (9): 2646-2654

    Abstract

    Fms-like tyrosine kinase 3 (FLT3) mutations are associated with unfavorable outcomes in children with acute myeloid leukemia (AML). We used DNA microarrays to identify gene expression profiles related to FLT3 status and outcome in childhood AML. Among 81 diagnostic specimens, 36 had FLT3 mutations (FLT3-MUs), 24 with internal tandem duplications (ITDs) and 12 with activating loop mutations (ALMs). In addition, 8 of 19 specimens from patients with relapses had FLT3-MUs. Predictive analysis of microarrays (PAM) identified genes that differentiated FLT3-ITD from FLT3-ALM and FLT3 wild-type (FLT3-WT) cases. Among the 42 specimens with FLT3-MUs, PAM identified 128 genes that correlated with clinical outcome. Event-free survival (EFS) in FLT3-MU patients with a favorable signature was 45% versus 5% for those with an unfavorable signature (P = .018). Among FLT3-MU specimens, high expression of the RUNX3 gene and low expression of the ATRX gene were associated with inferior outcome. The ratio of RUNX3 to ATRX expression was used to classify FLT3-MU cases into 3 EFS groups: 70%, 37%, and 0% for low, intermediate, and high ratios, respectively (P < .0001). Thus, gene expression profiling identified AML patients with divergent prognoses within the FLT3-MU group, and the RUNX3 to ATRX expression ratio should be a useful prognostic indicator in these patients.

    View details for DOI 10.1182/blood-2003-12-4449

    View details for PubMedID 15251987

  • The entire regularization path for the support vector machine JOURNAL OF MACHINE LEARNING RESEARCH Hastie, T., Rosset, S., Tibshirani, R., Zhu, J. 2004; 5: 1391-1415
  • Developmental response to hypoxia FASEB JOURNAL Huang, S. T., Vo, K. C., Lyell, D. J., Faessen, G. H., Tulac, S., Tibshirani, R., Giaccia, A. J., Giudice, L. C. 2004; 18 (12): 1348-1365

    Abstract

    Molecular mechanisms underlying fetal growth restriction due to placental insufficiency and in utero hypoxia are not well understood. In the current study, time-dependent (3 h-11 days) changes in fetal tissue gene expression in a rat model of in utero hypoxia compared with normoxic controls were investigated as an initial approach to understand molecular events underlying fetal development in response to hypoxia. Under hypoxic conditions, litter size was reduced and IGFBP-1 was up-regulated in maternal serum and in fetal liver and heart. Tissue-specific, distinct regulatory patterns of gene expression were observed under acute vs. chronic hypoxic conditions. Induction of glycolytic enzymes was an early event in response to hypoxia during organ development; consistently, tissue-specific induction of calcium homeostasis-related genes and suppression of growth-related genes were observed, suggesting mechanisms underlying hypoxia-related fetal growth restriction. Furthermore, induction of inflammation-related genes in placentas exposed to long-term hypoxia (11 days) suggests a mechanism for placental dysfunction and impaired pregnancy outcome accompanying in utero hypoxia.

    View details for DOI 10.1096/fj.03-1377com

    View details for Web of Science ID 000224243200054

    View details for PubMedID 15333578

  • Toxicity from radiation therapy associated with abnormal transcriptional responses to DNA damage Annual Scientific Meeting on Exploring Genomics in Radiation Oncology Rieger, K. E., Hong, W. J., Tusher, V. G., Tang, J., Tibshirani, R., Chu, G. ELSEVIER IRELAND LTD. 2004: S29–S29
  • The use of plasma surface-enhanced laser desorption/ionization time-of-flight mass spectrometry proteomic patterns for detection of head and neck squamous cell cancers 45th Annual Meeting of the American-Society-for-Therapeutic-Radiology-and-Oncology (ASTRO) Soltys, S. G., Le, Q. T., Shi, G. Y., Tibshirani, R., Giaccia, A. J., Koong, A. C. AMER ASSOC CANCER RESEARCH. 2004: 4806–12

    Abstract

    Our study was undertaken to determine the utility of plasma proteomic profiling using surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) mass spectrometry for the detection of head and neck squamous cell carcinomas (HNSCCs). Pretreatment plasma samples from HNSCC patients or controls without known neoplastic disease were analyzed on the Protein Biology System IIc SELDI-TOF mass spectrometer (Ciphergen Biosystems, Fremont, CA). Proteomic spectra of mass:charge ratio (m/z) were generated by the application of plasma to immobilized metal-affinity-capture (IMAC) ProteinChip arrays activated with copper. A total of 37356 data points were generated for each sample. A training set of spectra from 56 cancer patients and 52 controls were applied to the "Lasso" technique to identify protein profiles that can distinguish cancer from noncancer, and cross-validation was used to determine test errors in this training set. The discovery pattern was then used to classify a separate masked test set of 57 cancer and 52 controls. In total, we analyzed the proteomic spectra of 113 cancer patients and 104 controls. The Lasso approach identified 65 significant data points for the discrimination of normal from cancer profiles. The discriminatory pattern correctly identified 39 of 57 HNSCC patients and 40 of 52 noncancer controls in the masked test set. These results yielded a sensitivity of 68% and specificity of 73%. Subgroup analyses in the test set of four different demographic factors (age, gender, and cigarette and alcohol use) that can potentially confound the interpretation of the results suggest that this model tended to overpredict cancer in control smokers. Plasma proteomic profiling with SELDI-TOF mass spectrometry provides moderate sensitivity and specificity in discriminating HNSCC. Further improvement and validation of this approach is needed to determine its usefulness in screening for this disease.

    View details for Web of Science ID 000222840700027

    View details for PubMedID 15269156

  • Efficient quadratic regularization for expression arrays BIOSTATISTICS Hastie, T., Tibshirani, R. 2004; 5 (3): 329-340

    Abstract

    Gene expression arrays typically have 50 to 100 samples and 1000 to 20,000 variables (genes). There have been many attempts to adapt statistical models for regression and classification to these data, and in many cases these attempts have challenged the computational resources. In this article we expose a class of techniques based on quadratic regularization of linear models, including regularized (ridge) regression, logistic and multinomial regression, linear and mixture discriminant analysis, the Cox model and neural networks. For all of these models, we show that dramatic computational savings are possible over naive implementations, using standard transformations in numerical linear algebra.

    View details for DOI 10.1093/biostatistics/kxh010

    View details for Web of Science ID 000222723600001

    View details for PubMedID 15208198
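
The abstract above notes that quadratically regularized models can be fit far more cheaply than naive implementations when there are many more genes than samples. The sketch below illustrates that kind of saving for ridge regression, using the standard identity (X'X + λI)⁻¹X'y = X'(XX' + λI)⁻¹y, so that only an n × n system is solved instead of a p × p one. This identity is a stand-in chosen for illustration and is not necessarily the transformation used in the article. The toy data and function names are hypothetical.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge_primal(x, y, lam):
    """Naive route: solve the p x p system (X'X + lam I) beta = X'y."""
    n, p = len(x), len(x[0])
    gram = [[sum(x[i][j] * x[i][k] for i in range(n)) + (lam if j == k else 0.0)
             for k in range(p)] for j in range(p)]
    xty = [sum(x[i][j] * y[i] for i in range(n)) for j in range(p)]
    return solve(gram, xty)


def ridge_dual(x, y, lam):
    """Cheaper route when p >> n: solve (XX' + lam I) alpha = y, then beta = X' alpha."""
    n, p = len(x), len(x[0])
    kernel = [[sum(x[i][j] * x[l][j] for j in range(p)) + (lam if i == l else 0.0)
               for l in range(n)] for i in range(n)]
    alpha = solve(kernel, y)
    return [sum(x[i][j] * alpha[i] for i in range(n)) for j in range(p)]


# Toy data: 3 samples, 6 genes (more variables than samples).
x = [
    [1.0, 0.5, -0.2, 0.3, 0.0, 1.1],
    [0.2, -1.0, 0.4, 0.8, -0.5, 0.0],
    [-0.7, 0.3, 1.2, -0.4, 0.9, 0.6],
]
y = [1.0, -0.5, 0.25]

beta_primal = ridge_primal(x, y, lam=1.0)
beta_dual = ridge_dual(x, y, lam=1.0)
print(beta_dual)
```

With thousands of genes and under a hundred samples, the dual route replaces a system with thousands of unknowns by one with under a hundred, while returning the same coefficients.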

  • Different gene expression patterns in invasive lobular and ductal carcinomas of the breast MOLECULAR BIOLOGY OF THE CELL Zhao, H. J., Langerod, A., Ji, Y., Nowels, K. W., Nesland, J. M., Tibshirani, R., Bukholm, I. K., Karesen, R., Botstein, D., Borresen-Dale, A. L., Jeffrey, S. S. 2004; 15 (6): 2523-2536

    Abstract

    Invasive ductal carcinoma (IDC) and invasive lobular carcinoma (ILC) are the two major histological types of breast cancer worldwide. Whereas IDC incidence has remained stable, ILC is the most rapidly increasing breast cancer phenotype in the United States and Western Europe. It is not clear whether IDC and ILC represent molecularly distinct entities and what genes might be involved in the development of these two phenotypes. We conducted comprehensive gene expression profiling studies to address these questions. Total RNA from 21 ILCs, 38 IDCs, two lymph node metastases, and three normal tissues were amplified and hybridized to approximately 42,000 clone cDNA microarrays. Data were analyzed using hierarchical clustering algorithms and statistical analyses that identify differentially expressed genes (significance analysis of microarrays) and minimal subsets of genes (prediction analysis for microarrays) that succinctly distinguish ILCs and IDCs. Eleven of 21 (52%) of the ILCs ("typical" ILCs) clustered together and displayed different gene expression profiles from IDCs, whereas the other ILCs ("ductal-like" ILCs) were distributed between different IDC subtypes. Many of the differentially expressed genes between ILCs and IDCs code for proteins involved in cell adhesion/motility, lipid/fatty acid transport and metabolism, immune/defense response, and electron transport. Many genes that distinguish typical and ductal-like ILCs are involved in regulation of cell growth and immune response. Our data strongly suggest that over half the ILCs differ from IDCs not only in histological and clinical features but also in global transcription programs. The remaining ILCs closely resemble IDCs in their transcription patterns. Further studies are needed to explore the differences between ILC molecular subtypes and to determine whether they require different therapeutic strategies.

    View details for DOI 10.1091/mbc.E03-11-0786

    View details for Web of Science ID 000221778300001

    View details for PubMedID 15034139

    View details for PubMedCentralID PMC420079

  • Prediction of survival in diffuse large-B-cell lymphoma based on the expression of six genes NEW ENGLAND JOURNAL OF MEDICINE Lossos, I. S., Czerwinski, D. K., Alizadeh, A. A., Wechser, M. A., Tibshirani, R., Botstein, D., Levy, R. 2004; 350 (18): 1828-1837

    Abstract

    Several gene-expression signatures can be used to predict the prognosis in diffuse large-B-cell lymphoma, but the lack of practical tests for a genome-scale analysis has restricted the use of this method. We studied 36 genes whose expression had been reported to predict survival in diffuse large-B-cell lymphoma. We measured the expression of each of these genes in independent samples of lymphoma from 66 patients by quantitative real-time polymerase-chain-reaction analyses and related the results to overall survival. In a univariate analysis, genes were ranked on the basis of their ability to predict survival. The genes that were the strongest predictors were LMO2, BCL6, FN1, CCND2, SCYA3, and BCL2. We developed a multivariate model that was based on the expression of these six genes, and we validated the model in two independent microarray data sets. The model was independent of the International Prognostic Index and added to its predictive power. Measurement of the expression of six genes is sufficient to predict overall survival in diffuse large-B-cell lymphoma.

    View details for Web of Science ID 000221080300006

    View details for PubMedID 15115829

  • Toxicity from radiation therapy associated with abnormal transcriptional responses to DNA damage PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Rieger, K. E., Hong, W. J., Tusher, V. G., Tang, J., Tibshirani, R., Chu, G. 2004; 101 (17): 6635-6640

    Abstract

    Toxicity from radiation therapy is a grave problem for cancer patients. We hypothesized that some cases of toxicity are associated with abnormal transcriptional responses to radiation. We used microarrays to measure responses to ionizing and UV radiation in lymphoblastoid cells derived from 14 patients with acute radiation toxicity. The analysis used heterogeneity-associated transformation of the data to account for a clinical outcome arising from more than one underlying cause. To compute the risk of toxicity for each patient, we applied nearest shrunken centroids, a method that identifies and cross-validates predictive genes. Transcriptional responses in 24 genes predicted radiation toxicity in 9 of 14 patients with no false positives among 43 controls (P = 2.2 × 10^-7). The responses of these nine patients displayed significant heterogeneity. Of the five patients with toxicity and normal responses, two were treated with protocols that proved to be highly toxic. These results may enable physicians to predict toxicity and tailor treatment for individual patients.

    View details for DOI 10.1073/pnas.0307761101

    View details for PubMedID 15096622

  • Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia NEW ENGLAND JOURNAL OF MEDICINE Bullinger, L., Dohner, K., Bair, E., Frohling, S., Schlenk, R. F., Tibshirani, R., Dohner, H., Pollack, J. R. 2004; 350 (16): 1605-1616

    Abstract

    In patients with acute myeloid leukemia (AML), the presence or absence of recurrent cytogenetic aberrations is used to identify the appropriate therapy. However, the current classification system does not fully reflect the molecular heterogeneity of the disease, and treatment stratification is difficult, especially for patients with intermediate-risk AML with a normal karyotype. We used complementary-DNA microarrays to determine the levels of gene expression in peripheral-blood samples or bone marrow samples from 116 adults with AML (including 45 with a normal karyotype). We used unsupervised hierarchical clustering analysis to identify molecular subgroups with distinct gene-expression signatures. Using a training set of samples from 59 patients, we applied a novel supervised learning algorithm to devise a gene-expression-based clinical-outcome predictor, which we then tested using an independent validation group comprising the 57 remaining patients. Unsupervised analysis identified new molecular subtypes of AML, including two prognostically relevant subgroups in AML with a normal karyotype. Using the supervised learning algorithm, we constructed an optimal 133-gene clinical-outcome predictor, which accurately predicted overall survival among patients in the independent validation group (P=0.006), including the subgroup of patients with AML with a normal karyotype (P=0.046). In multivariate analysis, the gene-expression predictor was a strong independent prognostic factor (odds ratio, 8.8; 95 percent confidence interval, 2.6 to 29.3; P<0.001). The use of gene-expression profiling improves the molecular classification of adult AML.

    View details for Web of Science ID 000220819800005

    View details for PubMedID 15084693

  • Semi-supervised methods to predict patient survival from gene expression data. PLoS biology Bair, E., Tibshirani, R. 2004; 2 (4): E108-?

    Abstract

    An important goal of DNA microarray research is to develop tools to diagnose cancer more accurately based on the genetic profile of a tumor. There are several existing techniques in the literature for performing this type of diagnosis. Unfortunately, most of these techniques assume that different subtypes of cancer are already known to exist. Their utility is limited when such subtypes have not been previously identified. Although methods for identifying such subtypes exist, these methods do not work well for all datasets. It would be desirable to develop a procedure to find such subtypes that is applicable in a wide variety of circumstances. Even if no information is known about possible subtypes of a certain form of cancer, clinical information about the patients, such as their survival time, is often available. In this study, we develop some procedures that utilize both the gene expression data and the clinical data to identify subtypes of cancer and use this knowledge to diagnose future patients. These procedures were successfully applied to several publicly available datasets. We present diagnostic procedures that accurately predict the survival of future patients based on the gene expression profile and survival times of previous patients. This has the potential to be a powerful tool for diagnosing and treating cancer.

    View details for PubMedID 15094809

  • Semi-supervised methods to predict patient survival from gene expression data PLOS BIOLOGY Bair, E., Tibshirani, R. 2004; 2 (4): 511-522
  • Least angle regression ANNALS OF STATISTICS Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. 2004; 32 (2): 407-451
  • Cancer characterization and feature set extraction by discriminative margin clustering BMC BIOINFORMATICS Munagala, K., Tibshirani, R., Brown, P. O. 2004; 5

    Abstract

    A central challenge in the molecular diagnosis and treatment of cancer is to define a set of molecular features that, taken together, distinguish a given cancer, or type of cancer, from all normal cells and tissues. Discriminative margin clustering is a new technique for analyzing high dimensional quantitative datasets, specially applicable to gene expression data from microarray experiments related to cancer. The goal of the analysis is to find highly specialized sub-types of a tumor type which are similar in having a small combination of genes which together provide a unique molecular portrait for distinguishing the sub-type from any normal cell or tissue. Detection of the products of these genes can then, in principle, provide a basis for detection and diagnosis of a cancer, and a therapy directed specifically at the distinguishing constellation of molecular features can, in principle, provide a way to eliminate the cancer cells, while minimizing toxicity to any normal cell. The new methodology yields highly specialized tumor subtypes which are similar in terms of potential diagnostic markers.

    View details for Web of Science ID 000220984700002

    View details for PubMedID 15070405

  • Guidelines - Expression profiling - best practices for data generation and interpretation in clinical trials NATURE REVIEWS GENETICS Hoffman, E. P., Awad, T., Palma, J., Webster, T., Hubbell, E., Warrington, J. A., Spirais, A., Wright, G., Buckley, J., Triche, T., Davis, R., Tibshirani, R., Xiao, W. H., Jones, W., Tompkins, R., West, M. 2004; 5 (3): 229-237

    View details for DOI 10.1038/nrg1297

    View details for Web of Science ID 000189334500018

  • Gene expression profiling identifies clinically relevant subtypes of prostate cancer PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Lapointe, J., Li, C., Higgins, J. P., van de Rijn, M., Bair, E., Montgomery, K., Ferrari, M., Egevad, L., Rayford, W., Bergerheim, U., Ekman, P., DeMarzo, A. M., Tibshirani, R., Botstein, D., Brown, P. O., Brooks, J. D., Pollack, J. R. 2004; 101 (3): 811-816

    Abstract

    Prostate cancer, a leading cause of cancer death, displays a broad range of clinical behavior from relatively indolent to aggressive metastatic disease. To explore potential molecular variation underlying this clinical heterogeneity, we profiled gene expression in 62 primary prostate tumors, as well as 41 normal prostate specimens and nine lymph node metastases, using cDNA microarrays containing approximately 26,000 genes. Unsupervised hierarchical clustering readily distinguished tumors from normal samples, and further identified three subclasses of prostate tumors based on distinct patterns of gene expression. High-grade and advanced stage tumors, as well as tumors associated with recurrence, were disproportionately represented among two of the three subtypes, one of which also included most lymph node metastases. To further characterize the clinical relevance of tumor subtypes, we evaluated as surrogate markers two genes differentially expressed among tumor subgroups by using immunohistochemistry on tissue microarrays representing an independent set of 225 prostate tumors. Positive staining for MUC1, a gene highly expressed in the subgroups with "aggressive" clinicopathological features, was associated with an elevated risk of recurrence (P = 0.003), whereas strong staining for AZGP1, a gene highly expressed in the other subgroup, was associated with a decreased risk of recurrence (P = 0.0008). In multivariate analysis, MUC1 and AZGP1 staining were strong predictors of tumor recurrence independent of tumor grade, stage, and preoperative prostate-specific antigen levels. Our results suggest that prostate tumors can be usefully classified according to their gene expression patterns, and these tumor subtypes may provide a basis for improved prognostication and treatment stratification.

    View details for DOI 10.1073/pnas.0304146101

    View details for PubMedID 14711987

  • Central carbon metabolism genes that predict disease-free survival in hormone receptor negative tumors. 27th Annual San Antonio Breast Cancer Symposium Funari, V. A., Tibshirani, R., Ji, Y., Nicolau, M., Carlson, R. W., Brown, P. O., Noh, D. Y., Jeffrey, S. S. SPRINGER. 2004: S115–S115
  • 1-norm support vector machines 17th Annual Conference on Neural Information Processing Systems (NIPS) Zhu, J., Rosset, S., Hastie, T., Tibshirani, R. M I T PRESS. 2004: 49–56
  • Boosted PRIM with application to searching for oncogenic pathway of lung cancer IEEE Computational Systems Bioinformatics Conference (CSB 2004) Wang, P., Kim, Y., Pollack, J., Tibshirani, R. IEEE COMPUTER SOC. 2004: 604–609
  • Gene expression patterns in ovarian carcinomas MOLECULAR BIOLOGY OF THE CELL Schaner, M. E., Ross, D. T., Ciaravino, G., Sorlie, T., Troyanskaya, O., Diehn, M., Wang, Y. C., Duran, G. E., Sikic, T. L., Caldeira, S., Skomedal, H., Tu, I. P., Hernandez-Boussard, T., Johnson, S. W., O'Dwyer, P. J., Fero, M. J., Kristensen, G. B., Borresen-Dale, A. L., Hastie, T., Tibshirani, R., van de Rijn, M., Teng, N. N., Longacre, T. A., Botstein, D., Brown, P. O., Sikic, B. I. 2003; 14 (11): 4376-4386

    Abstract

    We used DNA microarrays to characterize the global gene expression patterns in surface epithelial cancers of the ovary. We identified groups of genes that distinguished the clear cell subtype from other ovarian carcinomas, grade I and II from grade III serous papillary carcinomas, and ovarian from breast carcinomas. Six clear cell carcinomas were distinguished from 36 other ovarian carcinomas (predominantly serous papillary) based on their gene expression patterns. The differences may yield insights into the worse prognosis and therapeutic resistance associated with clear cell carcinomas. A comparison of the gene expression patterns in the ovarian cancers to published data of gene expression in breast cancers revealed a large number of differentially expressed genes. We identified a group of 62 genes that correctly classified all 125 breast and ovarian cancer specimens. Among the best discriminators more highly expressed in the ovarian carcinomas were PAX8 (paired box gene 8), mesothelin, and ephrin-B1 (EFNB1). Although estrogen receptor was expressed in both the ovarian and breast cancers, genes that are coregulated with the estrogen receptor in breast cancers, including GATA-3, LIV-1, and X-box binding protein 1, did not show a similar pattern of coexpression in the ovarian cancers.

    View details for PubMedID 12960427

  • Changes in gene expression in intermediate endpoints of gastric cancer: A randomized, placebo-controlled trial of Helicobacter pylori eradication therapy. 2nd Annual Conference on Frontiers in Cancer Prevention Research Tsai, C. J., Yang, S. F., Tibshirani, R. J., Guarner, J., Mohar, A., Herrera-Goepfert, R., Parsonnet, J. AMER ASSOC CANCER RESEARCH. 2003: 1280S–1280S
  • Characterization of variant patterns of nodular lymphocyte predominant Hodgkin lymphoma with immunohistologic and clinical correlation AMERICAN JOURNAL OF SURGICAL PATHOLOGY Fan, Z., Natkunam, Y., Bair, E., Tibshirani, R., Warnke, R. A. 2003; 27 (10): 1346-1356

    Abstract

    Nodular lymphocyte predominant Hodgkin lymphoma (NLPHL) has traditionally been recognized as having two morphologic patterns, nodular and diffuse, and the current WHO definition of NLPHL requires at least a partial nodular pattern. Variant patterns have not been well documented. We analyzed retrospectively the morphologic and immunophenotypic patterns of NLPHL from 118 patients (total of 137 biopsy samples). Histology plus antibodies directed against CD20, CD3, and CD21 were used to evaluate the immunoarchitecture. We identified six distinct immunoarchitectural patterns in our cases of NLPHL: "classic" (B-cell-rich) nodular, serpiginous/interconnected nodular, nodular with prominent extranodular L&H cells, T-cell-rich nodular, diffuse with a T-cell-rich background (T-cell-rich B-cell lymphoma [TCRBCL]-like), and a (diffuse) B-cell-rich pattern. Small germinal centers within neoplastic nodules were found in approximately 15% of cases, a finding not previously emphasized in NLPHL. Prominent sclerosis was identified in approximately 20% of cases and was frequently seen in recurrent disease. Clinical follow-up was obtained on 56 patients, including 26 patients who had not had recurrence of disease and 30 patients who had recurrence. The follow-up period was 5 months to 16 years (median 2.5 years). The presence of a diffuse (TCRBCL-like) pattern was significantly more common in patients with recurrent disease than those without recurrence. Furthermore, the presence of a diffuse pattern (TCRBCL-like) was shown to be an independent predictor of recurrent disease (P = 0.00324). In addition, there is a tendency for progression to an increasingly more diffuse pattern over time. Analysis of sequential biopsies from patients with recurrent disease suggests that the presence of prominent extranodular L&H cells might represent early evolution to a diffuse (TCRBCL-like) pattern. We also report three patients who presented initially with diffuse large B-cell lymphoma and later developed NLPHL.

    View details for Web of Science ID 000185584800007

    View details for PubMedID 14508396

  • Statistical significance for genomewide studies PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Storey, J. D., Tibshirani, R. 2003; 100 (16): 9440-9445

    Abstract

    With the increase in genomewide experiments and the sequencing of multiple genomes, the analysis of large data sets has become commonplace in biology. It is often the case that thousands of features in a genomewide data set are tested against some null hypothesis, where a number of features are expected to be significant. Here we propose an approach to measuring statistical significance in these genomewide studies based on the concept of the false discovery rate. This approach offers a sensible balance between the number of true and false positives that is automatically calibrated and easily interpreted. In doing so, a measure of statistical significance called the q value is associated with each tested feature. The q value is similar to the well known p value, except it is a measure of significance in terms of the false discovery rate rather than the false positive rate. Our approach avoids a flood of false positive results, while offering a more liberal criterion than what has been used in genome scans for linkage.

    View details for DOI 10.1073/pnas.1530509100

    View details for Web of Science ID 000184620000062

    View details for PubMedID 12883005
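
    The q value described in the abstract above attaches a false discovery rate to each tested feature. The following Python sketch illustrates one way such values can be computed from a list of p values. It is a simplified illustration under stated assumptions, not the authors' implementation: the function name `q_values`, the null-proportion estimate `pi0`, and the fixed tuning value `lam=0.5` are choices made for this example, and the published procedure may estimate the null proportion differently.

```python
def q_values(p_values, lam=0.5):
    """Illustrative q-value computation from p values (simplified sketch).

    Assumption of this sketch: the proportion of true null features (pi0)
    is estimated from the fraction of p values above a fixed cutoff `lam`.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Null p values are roughly uniform, so the density of p values above
    # `lam` gives a (conservative) estimate of the null proportion.
    pi0 = sum(1 for p in p_values if p > lam) / (m * (1.0 - lam))
    pi0 = min(1.0, pi0)

    # Indices of the p values in increasing order.
    order = sorted(range(m), key=lambda i: p_values[i])

    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, so that each q value is the
    # smallest estimated FDR over all thresholds at or above that p value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        estimated_fdr = pi0 * m * p_values[i] / rank
        running_min = min(running_min, estimated_fdr)
        q[i] = running_min
    return q
```

    Calling a feature significant when its q value is at most 0.05, for example, aims to keep the expected proportion of false positives among the called features near 5%, rather than controlling the false positive rate of each individual test.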

  • Repeated observation of breast tumor subtypes in independent gene expression data sets PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Sorlie, T., Tibshirani, R., Parker, J., Hastie, T., Marron, J. S., Nobel, A., Deng, S., Johnsen, H., Pesich, R., Geisler, S., Demeter, J., Perou, C. M., Lonning, P. E., Brown, P. O., Borresen-Dale, A. L., Botstein, D. 2003; 100 (14): 8418-8423

    Abstract

    Characteristic patterns of gene expression measured by DNA microarrays have been used to classify tumors into clinically relevant subgroups. In this study, we have refined the previously defined subtypes of breast tumors that could be distinguished by their distinct patterns of gene expression. A total of 115 malignant breast tumors were analyzed by hierarchical clustering based on patterns of expression of 534 "intrinsic" genes and shown to subdivide into one basal-like, one ERBB2-overexpressing, two luminal-like, and one normal breast tissue-like subgroup. The genes used for classification were selected based on their similar expression levels between pairs of consecutive samples taken from the same tumor separated by 15 weeks of neoadjuvant treatment. Similar cluster analyses of two published, independent data sets representing different patient cohorts from different laboratories, uncovered some of the same breast cancer subtypes. In the one data set that included information on time to development of distant metastasis, subtypes were associated with significant differences in this clinical feature. By including a group of tumors from BRCA1 carriers in the analysis, we found that this genotype predisposes to the basal tumor subtype. Our results strongly support the idea that many of these breast tumor subtypes represent biologically distinct disease entities.

    View details for DOI 10.1073/pnas.0932692100

    View details for Web of Science ID 000184222500069

    View details for PubMedID 12829800

    View details for PubMedCentralID PMC166244

  • Note on "Comparison of model selection for regression" by Vladimir Cherkassky and Yunqian Ma NEURAL COMPUTATION Hastie, T., Tibshirani, R., Friedman, J. 2003; 15 (7): 1477-1480

    Abstract

    While Cherkassky and Ma (2003) raise some interesting issues in comparing techniques for model selection, their article appears to be written largely in protest of comparisons made in our book, Elements of Statistical Learning (2001). Cherkassky and Ma feel that we falsely represented the structural risk minimization (SRM) method, which they defend strongly here. In a two-page section of our book (pp. 212-213), we made an honest attempt to compare the SRM method with two related techniques, Akaike information criterion (AIC) and Bayesian information criterion (BIC). Apparently, we did not apply SRM in the optimal way. We are also accused of using contrived examples, designed to make SRM look bad. Alas, we did introduce some careless errors in our original simulation--errors that were corrected in the second and subsequent printings. Some of these errors were pointed out to us by Cherkassky and Ma (we supplied them with our source code), and as a result we replaced the assessment "SRM performs poorly overall" with a more moderate "the performance of SRM is mixed" (p. 212).

    View details for Web of Science ID 000183421400002

    View details for PubMedID 12816562

  • Class prediction by nearest shrunken centroids, with applications to DNA microarrays STATISTICAL SCIENCE Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G. 2003; 18 (1): 104-117
  • HGAL is a novel interleukin-4-inducible gene that strongly predicts survival in diffuse large B-cell lymphoma BLOOD Lossos, I. S., Alizadeh, A. A., Rajapaksa, R., Tibshirani, R., Levy, R. 2003; 101 (2): 433-440

    Abstract

    We have cloned and characterized a novel human gene, HGAL (human germinal center-associated lymphoma), which predicts outcome in patients with diffuse large B-cell lymphoma (DLBCL). The HGAL gene comprises 6 exons and encodes a cytoplasmic protein of 178 amino acids that contains an immunoreceptor tyrosine-based activation motif (ITAM). It is highly expressed in germinal center (GC) lymphocytes and GC-derived lymphomas and is homologous to the mouse GC-specific gene M17. Expression of the HGAL gene is specifically induced in B cells by interleukin-4 (IL-4). Patients with DLBCL expressing high levels of HGAL mRNA demonstrate significantly longer overall survival than do patients with low HGAL expression. This association was independent of the clinical international prognostic index. High HGAL mRNA expression should be used as a prognostic factor in DLBCL.

    View details for Web of Science ID 000180384800010

    View details for PubMedID 12509382

  • Statistical methods for identifying differentially expressed genes in DNA microarrays. Methods in molecular biology (Clifton, N.J.) Storey, J. D., Tibshirani, R. 2003; 224: 149-157

    View details for PubMedID 12710672

  • Expression of cytokeratins 17 and 5 identifies a group of breast carcinomas with poor clinical outcome AMERICAN JOURNAL OF PATHOLOGY van de Rijn, M., Perou, C. M., Tibshirani, R., Haas, P., Kallioniemi, C., Kononen, J., Torhorst, J., Sauter, G., Zuber, M., Kochli, O. R., Mross, F., Dieterich, H., Seitz, R., Ross, D., Botstein, D., Brown, P. 2002; 161 (6): 1991-1996

    Abstract

    While several prognostic factors have been identified in breast carcinoma, the clinical outcome remains hard to predict for individual patients. Better predictive markers are needed to help guide difficult treatment decisions. In a previous study of 78 breast carcinoma specimens, we noted an association between poor clinical outcome and the expression of cytokeratin 17 and/or cytokeratin 5 mRNAs. Here we describe the results of immunohistochemistry studies using monoclonal antibodies against these markers to analyze more than 600 paraffin-embedded breast tumors in tissue microarrays. We found that expression of cytokeratin 17 and/or cytokeratin 5/6 in tumor cells was associated with a poor clinical outcome. Moreover, multivariate analysis showed that in node-negative breast carcinoma, expression of these cytokeratins was a prognostic factor independent of tumor size and tumor grade.

    View details for PubMedID 12466114

  • Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Pollack, J. R., Sorlie, T., Perou, C. M., Rees, C. A., Jeffrey, S. S., Lonning, P. E., Tibshirani, R., Botstein, D., Borresen-Dale, A. L., Brown, P. O. 2002; 99 (20): 12963-12968

    Abstract

    Genomic DNA copy number alterations are key genetic events in the development and progression of human cancers. Here we report a genome-wide microarray comparative genomic hybridization (array CGH) analysis of DNA copy number variation in a series of primary human breast tumors. We have profiled DNA copy number alteration across 6,691 mapped human genes, in 44 predominantly advanced, primary breast tumors and 10 breast cancer cell lines. While the overall patterns of DNA amplification and deletion corroborate previous cytogenetic studies, the high-resolution (gene-by-gene) mapping of amplicon boundaries and the quantitative analysis of amplicon shape provide significant improvement in the localization of candidate oncogenes. Parallel microarray measurements of mRNA levels reveal the remarkable degree to which variation in gene copy number contributes to variation in gene expression in tumor cells. Specifically, we find that 62% of highly amplified genes show moderately or highly elevated expression, that DNA copy number influences gene expression across a wide range of DNA copy number alterations (deletion, low-, mid- and high-level amplification), that on average, a 2-fold change in DNA copy number is associated with a corresponding 1.5-fold change in mRNA levels, and that overall, at least 12% of all the variation in gene expression among the breast tumors is directly attributable to underlying variation in gene copy number. These findings provide evidence that widespread DNA copy number alteration can lead directly to global deregulation of gene expression, which may contribute to the development or progression of cancer.

    View details for DOI 10.1073/pnas.162471999

    View details for Web of Science ID 000178391700085

    View details for PubMedID 12297621

    View details for PubMedCentralID PMC130569

  • Empirical Bayes methods and false discovery rates for microarrays GENETIC EPIDEMIOLOGY Efron, B., Tibshirani, R. 2002; 23 (1): 70-86

    Abstract

    In a classic two-sample problem, one might use Wilcoxon's statistic to test for a difference between treatment and control subjects. The analogous microarray experiment yields thousands of Wilcoxon statistics, one for each gene on the array, and confronts the statistician with a difficult simultaneous inference situation. We will discuss two inferential approaches to this problem: an empirical Bayes method that requires very little a priori Bayesian modeling, and the frequentist method of "false discovery rates" proposed by Benjamini and Hochberg in 1995. It turns out that the two methods are closely related and can be used together to produce sensible simultaneous inferences.

    View details for DOI 10.1002/gepi.01124

    View details for Web of Science ID 000176697800006

    View details for PubMedID 12112249

  • Diagnosis of multiple cancer types by shrunken centroids of gene expression PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G. 2002; 99 (10): 6567-6572

    Abstract

    We have devised an approach to cancer class prediction from gene expression profiling, based on an enhancement of the simple nearest prototype (centroid) classifier. We shrink the prototypes and hence obtain a classifier that is often more accurate than competing methods. Our method of "nearest shrunken centroids" identifies subsets of genes that best characterize each class. The technique is general and can be used in many other classification problems. To demonstrate its effectiveness, we show that the method was highly efficient in finding genes for classifying small round blue cell tumors and leukemias.

    View details for Web of Science ID 000175637300012

    View details for PubMedID 12011421
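
    The idea of shrinking class prototypes, as summarized in the abstract above, can be illustrated with a small Python sketch. It is a simplified illustration, not the authors' software: the function names, the soft-thresholding rule, and the scaling constants (`s0` as the median pooled standard deviation, `m_k` as a class-size factor) are assumptions of this example and may differ from the published method.

```python
import math
from statistics import median


def soft_threshold(value, delta):
    """Move `value` toward zero by `delta`, stopping at zero."""
    return math.copysign(max(abs(value) - delta, 0.0), value)


def fit_shrunken_centroids(samples, labels, delta):
    """samples: list of feature lists; labels: one class label per sample."""
    n, p = len(samples), len(samples[0])
    classes = sorted(set(labels))
    members = {k: [r for r, y in zip(samples, labels) if y == k] for k in classes}

    overall = [sum(r[i] for r in samples) / n for i in range(p)]
    centroids = {k: [sum(r[i] for r in rows) / len(rows) for i in range(p)]
                 for k, rows in members.items()}

    # Pooled within-class standard deviation of each feature.
    s = []
    for i in range(p):
        ss = sum((r[i] - centroids[y][i]) ** 2 for r, y in zip(samples, labels))
        s.append(math.sqrt(ss / (n - len(classes))))
    s0 = median(s)  # assumed offset that guards against tiny variances

    shrunken = {}
    for k in classes:
        m_k = math.sqrt(1.0 / len(members[k]) + 1.0 / n)  # assumed scaling
        shrunken[k] = []
        for i in range(p):
            scale = m_k * (s[i] + s0)
            d = (centroids[k][i] - overall[i]) / scale
            # Features whose standardized difference is below delta
            # collapse onto the overall centroid and stop contributing.
            shrunken[k].append(overall[i] + scale * soft_threshold(d, delta))

    return {
        "centroids": shrunken,
        "overall": overall,
        "scale": [si + s0 for si in s],
        "priors": {k: len(members[k]) / n for k in classes},
    }


def predict(model, x):
    """Assign x to the class with the nearest shrunken centroid."""
    def score(k):
        dist = sum(((xi - ci) / sc) ** 2
                   for xi, ci, sc in zip(x, model["centroids"][k], model["scale"]))
        return dist - 2.0 * math.log(model["priors"][k])
    return min(model["centroids"], key=score)
```

    In this sketch, a feature whose shrunken centroid equals the overall centroid in every class has no influence on the prediction, which is how the shrinkage amount `delta` selects a subset of features.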

  • Precision and functional specificity in mRNA decay PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Wang, Y. L., Liu, C. L., Storey, J. D., Tibshirani, R. J., Herschlag, D., Brown, P. O. 2002; 99 (9): 5860-5865

    Abstract

    Posttranscriptional processing of mRNA is an integral component of the gene expression program. By using DNA microarrays, we precisely measured the decay of each yeast mRNA, after thermal inactivation of a temperature-sensitive RNA polymerase II. The half-lives varied widely, ranging from approximately 3 min to more than 90 min. We found no simple correlation between mRNA half-lives and ORF size, codon bias, ribosome density, or abundance. However, the decay rates of mRNAs encoding groups of proteins that act together in stoichiometric complexes were generally closely matched, and other evidence pointed to a more general relationship between physiological function and mRNA turnover rates. The results provide strong evidence that precise control of the decay of each mRNA is a fundamental feature of the gene expression program in yeast.

    View details for DOI 10.1073/pnas.092538799

    View details for Web of Science ID 000175377800023

    View details for PubMedID 11972065

    View details for PubMedCentralID PMC122867

  • Pre-validation and inference in microarrays. Statistical applications in genetics and molecular biology Tibshirani, R. J., Efron, B. 2002; 1: Article 1

    Abstract

    In microarray studies, an important problem is to compare a predictor of disease outcome derived from gene expression levels to standard clinical predictors. Comparing them on the same dataset that was used to derive the microarray predictor can lead to results strongly biased in favor of the microarray predictor. We propose a new technique called "pre-validation" for making a fairer comparison between the two sets of predictors. We study the method analytically and explore its application in a recent study on breast cancer.

    View details for PubMedID 16646777

  • Supervised learning from microarray data 15th Biannual Conference on Computational Statistics (COMPSTAT) Hastie, T., Tibshirani, R., Narasimhan, B., Chu, G. PHYSICA-VERLAG GMBH & CO. 2002: 67–77
  • Exploratory screening of genes and clusters from microarray experiments STATISTICA SINICA Tibshirani, R., Hastie, T., Narasimhan, B., Eisen, M., Sherlock, G., Brown, P., Botstein, D. 2002; 12 (1): 47-59
  • Transcriptional programs activated by exposure of human prostate cancer cells to androgen GENOME BIOLOGY DePrimo, S. E., Diehn, M., Nelson, J. B., Reiter, R. E., Matese, J., Fero, M., Tibshirani, R., Brown, P. O., Brooks, J. D. 2002; 3 (7)

    Abstract

    Androgens are required for both normal prostate development and prostate carcinogenesis. We used DNA microarrays, representing approximately 18,000 genes, to examine the temporal program of gene expression following treatment of the human prostate cancer cell line LNCaP with a synthetic androgen. We observed statistically significant changes in levels of transcripts of more than 500 genes. Many of these genes were previously reported androgen targets, but most were not previously known to be regulated by androgens. The androgen-induced expression programs in three additional androgen-responsive human prostate cancer cell lines, and in four androgen-independent subclones derived from LNCaP, shared many features with those observed in LNCaP, but some differences were observed. A remarkable fraction of the genes induced by androgen appeared to be related to production of seminal fluid and these genes included many with roles in protein folding, trafficking, and secretion. Prostate cancer cell lines retain features of androgen responsiveness that reflect normal prostatic physiology. These results provide a broad view of the effect of androgen signaling on the transcriptional program in these cancer cells, and a foundation for further studies of androgen action.

    View details for Web of Science ID 000207581200008

    View details for PubMedID 12184806

  • Empirical Bayes analysis of a microarray experiment 160th Annual Meeting of the American Statistical Association Efron, B., Tibshirani, R., Storey, J. D., Tusher, V. AMER STATISTICAL ASSOC. 2001: 1151–1160
  • Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Sorlie, T., Perou, C. M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., Thorsen, T., Quist, H., Matese, J. C., Brown, P. O., Botstein, D., Lonning, P. E., Borresen-Dale, A. L. 2001; 98 (19): 10869-10874

    Abstract

    The purpose of this study was to classify breast carcinomas based on variations in gene expression patterns derived from cDNA microarrays and to correlate tumor characteristics to clinical outcome. A total of 85 cDNA microarray experiments representing 78 cancers, three fibroadenomas, and four normal breast tissues were analyzed by hierarchical clustering. As reported previously, the cancers could be classified into a basal epithelial-like group, an ERBB2-overexpressing group and a normal breast-like group based on variations in gene expression. A novel finding was that the previously characterized luminal epithelial/estrogen receptor-positive group could be divided into at least two subgroups, each with a distinctive expression profile. These subtypes proved to be reasonably robust by clustering using two different gene sets: first, a set of 456 cDNA clones previously selected to reflect intrinsic properties of the tumors and, second, a gene set that highly correlated with patient outcome. Survival analyses on a subcohort of patients with locally advanced breast cancer uniformly treated in a prospective study showed significantly different outcomes for the patients belonging to the various groups, including a poor prognosis for the basal-like subtype and a significant difference in outcome for the two estrogen receptor-positive groups.

    View details for Web of Science ID 000170966800067

    View details for PubMedID 11553815

    View details for PubMedCentralID PMC58566

  • Expression of a single gene, BCL-6, strongly predicts survival in patients with diffuse large B-cell lymphoma BLOOD Lossos, I. S., Jones, C. D., Warnke, R., Natkunam, Y., Kaizer, H., Zehnder, J. L., Tibshirani, R., Levy, R. 2001; 98 (4): 945-951

    Abstract

    Diffuse large B-cell lymphoma (DLBCL) is characterized by a marked degree of morphologic and clinical heterogeneity. Establishment of parameters that can predict outcome could help to identify patients who may benefit from risk-adjusted therapies. BCL-6 is a proto-oncogene commonly implicated in DLBCL pathogenesis. A real-time reverse transcription-polymerase chain reaction assay was established for accurate and reproducible determination of BCL-6 mRNA expression. The method was applied to evaluate the prognostic significance of BCL-6 expression in DLBCL. BCL-6 mRNA expression was assessed in tumor specimens obtained at the time of diagnosis from 22 patients with primary DLBCL. All patients were subsequently treated with anthracycline-based chemotherapy regimens. These patients could be divided into 2 DLBCL subgroups, one with high BCL-6 gene expression whose median overall survival (OS) time was 171 months and the other with low BCL-6 gene expression whose median OS was 24 months (P = .007). BCL-6 gene expression also predicted OS in an independent validation set of 39 patients with primary DLBCL (P = .01). BCL-6 protein expression, assessed by immunohistochemistry, also predicted longer OS in patients with DLBCL. BCL-6 gene expression was an independent survival predicting factor in multivariate analysis together with the elements of the International Prognostic Index (IPI) (P = .038). By contrast, the aggregate IPI score did not add further prognostic information to the patients' stratification by BCL-6 gene expression. High BCL-6 mRNA expression should be considered a new favorable prognostic factor in DLBCL and should be used in the stratification and the design of risk-adjusted therapies for patients with DLBCL. (Blood. 2001;98:945-951)

    View details for Web of Science ID 000170364100008

    View details for PubMedID 11493437

  • Missing value estimation methods for DNA microarrays BIOINFORMATICS Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R. B. 2001; 17 (6): 520-525

    Abstract

    Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1-20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.

    View details for Web of Science ID 000169404700005

    View details for PubMedID 11395428
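
    The weighted K-nearest neighbors approach named in the abstract above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions rather than the published KNNimpute tool: the function name `knn_impute`, the use of `None` for missing entries, the root-mean-square distance over shared columns, and the inverse-distance weights are all choices made for this illustration.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries of a genes-by-arrays matrix (illustrative sketch).

    Each missing value is replaced by a weighted average of the values that
    the k most similar genes have in the same column.
    """
    filled = [row[:] for row in matrix]  # leave the input untouched
    n_cols = len(matrix[0])

    for g, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue

            candidates = []
            for h, other in enumerate(matrix):
                # A neighbor must be a different gene with column j observed.
                if h == g or other[j] is None:
                    continue
                shared = [c for c in range(n_cols)
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                # Root-mean-square difference over columns both genes have.
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))

            if not candidates:
                continue  # nothing to borrow from; the entry stays None

            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            # Assumed weighting: closer genes count more (inverse distance).
            weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]
            filled[g][j] = (
                sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
            )
    return filled
```

    Distances are computed from the original matrix, so values imputed earlier in the loop do not influence later imputations. The row average method mentioned in the abstract would instead replace each missing entry with the mean of the observed values in its own row.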

  • Significance analysis of microarrays applied to the ionizing radiation response PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Tusher, V. G., Tibshirani, R., Chu, G. 2001; 98 (9): 5116-5121

    Abstract

    Microarrays can measure the expression of thousands of genes to identify changes in expression between different biological states. Methods are needed to determine the significance of these changes while accounting for the enormous number of genes. We describe a method, Significance Analysis of Microarrays (SAM), that assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements. For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR). When the transcriptional response of human cells to ionizing radiation was measured by microarrays, SAM identified 34 genes that changed at least 1.5-fold with an estimated FDR of 12%, compared with FDRs of 60 and 84% by using conventional methods of analysis. Of the 34 genes, 19 were involved in cell cycle regulation and 3 in apoptosis. Surprisingly, four nucleotide excision repair genes were induced, suggesting that this repair pathway for UV-damaged DNA might play a previously unrecognized role in repairing DNA damaged by ionizing radiation.

    View details for Web of Science ID 000168311500058

    View details for PubMedID 11309499
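
    The two steps summarized in the abstract above, a per-gene score scaled by the variability of repeated measurements and a permutation estimate of the false discovery rate, can be sketched in Python as follows. This is a simplified illustration and not the SAM software: the function names, the small positive constant `s0` added to the denominator, and the single symmetric `threshold` on the absolute score are assumptions of this example.

```python
import math
import random


def relative_difference(values, labels, s0):
    """Score one gene: difference in group means over (standard error + s0)."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
    pooled = math.sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
    # s0 (an assumed constant) keeps genes with tiny variance from dominating.
    return (mean_b - mean_a) / (pooled + s0)


def estimate_fdr(data, labels, threshold, s0=0.1, n_perm=200, seed=0):
    """Return (indices of called genes, permutation-based FDR estimate).

    data: one list of measurements per gene; labels: 0/1 group per sample.
    """
    rng = random.Random(seed)
    called = [g for g, values in enumerate(data)
              if abs(relative_difference(values, labels, s0)) > threshold]

    # Shuffling the labels breaks any real group effect, so genes that still
    # exceed the threshold approximate the number identified by chance.
    shuffled = list(labels)
    false_counts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        false_counts.append(sum(
            1 for values in data
            if abs(relative_difference(values, shuffled, s0)) > threshold
        ))

    expected_false = sum(false_counts) / n_perm
    fdr = min(1.0, expected_false / len(called)) if called else 0.0
    return called, fdr
```

    Raising `threshold` calls fewer genes and typically lowers the estimated FDR, which corresponds to the adjustable threshold mentioned in the abstract.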

  • Supervised harvesting of expression trees GENOME BIOLOGY Hastie, T., Tibshirani, R., Botstein, D., Brown, P. 2001; 2 (1)

    Abstract

    We propose a new method for supervised learning from gene expression data. We call it 'tree harvesting'. This technique starts with a hierarchical clustering of genes, then models the outcome variable as a sum of the average expression profiles of chosen clusters and their products. It can be applied to many different kinds of outcome measures such as censored survival times, or a response falling in two or more classes (for example, cancer classes). The method can discover genes that have strong effects on their own, and genes that interact with other genes. We illustrate the method on data from a lymphoma study, and on a dataset containing samples from eight different cancers. It identified some potentially interesting gene clusters. In simulation studies we found that the procedure may require a large number of experimental samples to successfully discover interactions. Tree harvesting is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worthy of further investigation.

    View details for Web of Science ID 000207583500011

    View details for PubMedID 11178280

  • Estimating the number of clusters in a data set via the gap statistic JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Tibshirani, R., Walther, G., Hastie, T. 2001; 63: 411-423
  • The inference of antigen selection on Ig genes JOURNAL OF IMMUNOLOGY Lossos, I. S., Tibshirani, R., Narasimhan, B., Levy, R. 2000; 165 (9): 5122-5126

    Abstract

    Analysis of somatic mutations in V regions of Ig genes is important for understanding various biological processes. It is customary to estimate Ag selection on Ig genes by assessment of replacement (R) as opposed to silent (S) mutations in the complementarity-determining regions and S as opposed to R mutations in the framework regions. In the past such an evaluation was performed using a binomial distribution model equation, which is inappropriate for Ig genes in which mutations have four different distribution possibilities (R and S mutations in the complementarity-determining region and/or framework regions of the gene). In the present work, we propose a multinomial distribution model for assessment of Ag selection. Side-by-side application of multinomial and binomial models on 86 previously established Ig sequences disclosed 8 discrepancies, leading to opposite statistical conclusions about Ag selection. We suggest the use of the multinomial model for all future analysis of Ag selection.

    View details for Web of Science ID 000090076000047

    View details for PubMedID 11046043

  • Bayesian backfitting STATISTICAL SCIENCE Hastie, T., Tibshirani, R. 2000; 15 (3): 196-213
  • Bayesian backfitting - Comments and rejoinder STATISTICAL SCIENCE Cook, R. D., Pardoe, L., Gelfand, A. E., Green, P. J., Hastie, T., Tibshirani, R. 2000; 15 (3): 213-223
  • Additive logistic regression: A statistical view of boosting ANNALS OF STATISTICS Friedman, J., Hastie, T., Tibshirani, R. 2000; 28 (2): 337-374
  • Molecular analysis of immunoglobulin genes in diffuse large B-cell lymphomas BLOOD Lossos, I. S., Okada, C. Y., Tibshirani, R., Warnke, R., Vose, J. M., Greiner, T. C., Levy, R. 2000; 95 (5): 1797-1803

    Abstract

    Diffuse large B-cell lymphoma (DLBCL) is a common type of non-Hodgkin's lymphoma (NHL) that is highly heterogeneous from both clinical and histopathologic viewpoints. The immunoglobulin (Ig) heavy (H) chain variable region genes were examined in 71 patients with untreated primary DLBCL. Fifty-eight potentially functional V(H) genes were detected in 53 DLBCL cases; V(H) genes were nonfunctional in 9 cases and were not detected in an additional 9 cases. The use of V(H) gene families by DLBCL tumors was unbiased without overrepresentation of any particular V(H) gene or gene family. Analysis of Ig mutations in comparison to the most closely related germline gene disclosed mutated V(H) genes in all but 1 DLBCL case. More than 2% difference from the most similar germline sequence was detected in 52 potentially functional and the 8 nonfunctional V(H) gene sequences, whereas less than 2% difference from the germline sequence was observed in 3 V(H) gene isolates. Only 3 V(H) gene isolates were unmutated. No correlation was found between V(H) gene use, mutation level, and International Prognostic Index (IPI) or survival. Six of 8 tested tumors showed evidence of ongoing somatic mutations. Evidence for positive or negative antigen selection pressure was observed in 65% of mutated DLBCL cases. Our findings indicate that the etiology and the driving forces for clonal expansion are heterogeneous, which may explain the well-known clinical and pathologic heterogeneity of DLBCL. (Blood. 2000;95:1797-1803)

    View details for Web of Science ID 000085564700037

    View details for PubMedID 10688840

  • Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling NATURE Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. G., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L. M., Marti, G. E., Moore, T., Hudson, J., Lu, L. S., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O., Staudt, L. M. 2000; 403 (6769): 503-511

    Abstract

    Diffuse large B-cell lymphoma (DLBCL), the most common subtype of non-Hodgkin's lymphoma, is clinically heterogeneous: 40% of patients respond well to current therapy and have prolonged survival, whereas the remainder succumb to the disease. We proposed that this variability in natural history reflects unrecognized molecular heterogeneity in the tumours. Using DNA microarrays, we have conducted a systematic characterization of gene expression in B-cell malignancies. Here we show that there is diversity in gene expression among the tumours of DLBCL patients, apparently reflecting the variation in tumour proliferation rate, host response and differentiation state of the tumour. We identified two molecularly distinct forms of DLBCL which had gene expression patterns indicative of different stages of B-cell differentiation. One type expressed genes characteristic of germinal centre B cells ('germinal centre B-like DLBCL'); the second type expressed genes normally induced during in vitro activation of peripheral blood B cells ('activated B-like DLBCL'). Patients with germinal centre B-like DLBCL had a significantly better overall survival than those with activated B-like DLBCL. The molecular classification of tumours on the basis of gene expression can thus identify previously undetected and clinically significant subtypes of cancer.

    View details for Web of Science ID 000085227300039

    View details for PubMedID 10676951

  • 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome biology Hastie, T., Tibshirani, R., Eisen, M. B., Alizadeh, A., Levy, R., Staudt, L., Chan, W. C., Botstein, D., Brown, P. 2000; 1 (2): RESEARCH0003-?

    Abstract

    Large gene expression studies, such as those conducted using DNA arrays, often provide millions of different pieces of data. To address the problem of analyzing such data, we describe a statistical method, which we have called 'gene shaving'. The method identifies subsets of genes with coherent expression patterns and large variation across conditions. Gene shaving differs from hierarchical clustering and other widely used methods for analyzing gene expression studies in that genes may belong to more than one cluster, and the clustering may be supervised by an outcome measure. The technique can be 'unsupervised', that is, the genes and samples are treated as unlabeled, or partially or fully supervised by using known properties of the genes or samples to assist in finding meaningful groupings. We illustrate the use of the gene shaving method to analyze gene expression measurements made on samples from patients with diffuse large B-cell lymphoma. The method identifies a small cluster of genes whose expression is highly predictive of survival. The gene shaving method is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worth further investigation.

    View details for PubMedID 11178228

  • Model search by bootstrap "bumping" JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Tibshirani, R., Knight, K. 1999; 8 (4): 671-686
  • Statistical measures for the computer-aided diagnosis of mammographic masses JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Hastie, T., Ikeda, D., Tibshirani, R. 1999; 8 (3): 531-543
  • The covariance inflation criterion for adaptive model selection JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Tibshirani, R., Knight, K. 1999; 61: 529-546
  • The problem of regions ANNALS OF STATISTICS Efron, B., Tibshirani, R. 1998; 26 (5): 1687-1718
  • Classification by pairwise coupling ANNALS OF STATISTICS Hastie, T., Tibshirani, R. 1998; 26 (2): 451-471
  • Classification by pairwise coupling 11th Annual Conference on Neural Information Processing Systems (NIPS) Hastie, T., Tibshirani, R. MIT PRESS. 1998: 507–513
  • Improvements on cross-validation: The .632+ bootstrap method JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Efron, B., Tibshirani, R. 1997; 92 (438): 548-560
  • The lasso method for variable selection in the cox model STATISTICS IN MEDICINE Tibshirani, R. 1997; 16 (4): 385-395

    Abstract

    I propose a new method for variable selection and shrinkage in Cox's proportional hazards model. My proposal minimizes the log partial likelihood subject to the sum of the absolute values of the parameters being bounded by a constant. Because of the nature of this constraint, it shrinks coefficients and produces some coefficients that are exactly zero. As a result it reduces the estimation variance while providing an interpretable final model. The method is a variation of the 'lasso' proposal of Tibshirani, designed for the linear regression context. Simulations indicate that the lasso can be more accurate than stepwise selection in this setting.

    View details for Web of Science ID A1997WK01900006

    View details for PubMedID 9044528
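
    The constrained problem described in the abstract above can be written out as follows. This is a sketch in standard notation rather than the paper's own: the symbols ($x_i$ for covariates, $\delta_i$ for the event indicator, $R_i$ for the risk set at the $i$-th event time, $s$ for the bound) are assumptions, ties are ignored, and the objective is written as maximizing the log partial likelihood, which is equivalent to minimizing its negative.

```latex
\hat{\beta} \;=\; \arg\max_{\beta}\; \ell(\beta)
\quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le s,
\qquad
\ell(\beta) \;=\; \sum_{i:\,\delta_i = 1}
\Bigl[\, x_i^{\top}\beta \;-\; \log \sum_{k \in R_i} \exp\bigl(x_k^{\top}\beta\bigr) \Bigr].
```

    For small enough $s$ the constraint is active, so some $\hat{\beta}_j$ are exactly zero, which is the variable-selection behaviour the abstract describes.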

  • Association between cellular phones and car collisions New Engl. J. Med. Tibshirani, R., Redelmeier, D. 1997
  • Using specially designed exponential families for density estimation ANNALS OF STATISTICS Efron, B., Tibshirani, R. 1996; 24 (6): 2431-2461
  • Discriminant adaptive nearest neighbor classification and regression 9th Annual Conference on Neural Information Processing Systems (NIPS) Hastie, T., Tibshirani, R. MIT PRESS. 1996: 409–415
  • Generalized additive models for medical research. Statistical methods in medical research Hastie, T., Tibshirani, R. 1995; 4 (3): 187-196

    Abstract

    This article reviews flexible statistical methods that are useful for characterizing the effect of potential prognostic factors on disease endpoints. Applications to survival models and binary outcome models are illustrated.

    View details for PubMedID 8548102

  • Flexible discriminant analysis J. Amer. Statist. Assoc. Tibshirani, R., Hastie, T., Buja, A. 1994
  • An Introduction to the Bootstrap Chapman and Hall, New York and London. Tibshirani, R., Efron, B. 1993
  • Generalized additive models Chapman and Hall, London Tibshirani, R., Hastie, T. 1990