Administrative Appointments

  • Chair, Department of Statistics (2006 - 2009)

Honors & Awards

  • Fellow, Royal Statistical Society (1979)
  • Fellow, Institute of Mathematical Statistics (1996)
  • Elected Member, International Statistical Institute (1994)
  • Fellow, American Statistical Association (1998)
  • Craig Award, University of Iowa (1996)
  • Myrto Lefkopoulou award, Harvard School of Public Health (1996)
  • Fellow, South African Statistical Association (2011)
  • The Emanuel and Carol Parzen prize for Statistical Innovation, Texas A&M University (2014)
  • Bernard G. Greenberg distinguished lecturer, Department of Biostatistics, University of North Carolina (2013)

Professional Education

  • Ph.D., Stanford University, Statistics (1984)
  • M.Sc, University of Cape Town, Statistics (1979)
  • B.Sc (hons), Rhodes University, Statistics (1976)

Current Research and Scholarly Interests

Trevor Hastie specializes in applied nonparametric regression and
classification, and he has written three books in this area:
"Generalized Additive Models" (with R. Tibshirani, Chapman and Hall,
1991), "Elements of Statistical Learning (second edition)"
(with R. Tibshirani and J. Friedman, Springer 2009), and
"An Introduction to Statistical Learning" (with G. James, D. Witten and
R. Tibshirani, Springer 2013). He has also made contributions to
statistical computing, co-editing (with J. Chambers) a large software
library of modeling tools in the S language used in R and Splus
("Statistical Models in S", Wadsworth, 1992). His current research
focuses on applied problems in biology and genomics, medicine and
industry, in particular data mining, prediction and classification.

2016-17 Courses

Graduate and Fellowship Programs

  • Biomedical Informatics (PhD Program)

All Publications

  • Learning Interactions via Hierarchical Group-Lasso Regularization JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Lim, M., Hastie, T. 2015; 24 (3): 627-654
  • Bias correction in species distribution models: pooling survey and collection data for multiple species METHODS IN ECOLOGY AND EVOLUTION Fithian, W., Elith, J., Hastie, T., Keith, D. A. 2015; 6 (4): 424-438
  • Learning the Structure of Mixed Graphical Models JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Lee, J. D., Hastie, T. J. 2015; 24 (1): 230-253
  • Elements of Statistical Learning: datamining, inference, and prediction (second edition) Springer T. Hastie, R. Tibshirani, J. Friedman 2009
  • PATHWISE COORDINATE OPTIMIZATION ANNALS OF APPLIED STATISTICS Friedman, J., Hastie, T., Hoefling, H., Tibshirani, R. 2007; 1 (2): 302-332

    View details for DOI 10.1214/07-AOAS131

    View details for Web of Science ID 000261057600003

  • Regularization and variable selection via the elastic net JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Zou, H., Hastie, T. 2005; 67: 301-320
  • The entire regularization path for the support vector machine JOURNAL OF MACHINE LEARNING RESEARCH Hastie, T., Rosset, S., Tibshirani, R., Zhu, J. 2004; 5: 1391-1415
  • Boosting as a regularized path to a maximum margin classifier JOURNAL OF MACHINE LEARNING RESEARCH Rosset, S., Ji, Z., Hastie, T. 2004; 5: 941-973
  • Least angle regression ANNALS OF STATISTICS Efron, B., Hastie, T., Johnstone, I., Tibshirani, R. 2004; 32 (2): 407-451
  • Bayesian backfitting - Comments and rejoinder STATISTICAL SCIENCE Cook, R. D., Pardoe, L., Gelfand, A. E., Green, P. J., Hastie, T., Tibshirani, R. 2000; 15 (3): 213-223
  • Additive logistic regression: A statistical view of boosting ANNALS OF STATISTICS Friedman, J., Hastie, T., Tibshirani, R. 2000; 28 (2): 337-374
  • Discriminant analysis by Gaussian mixtures JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL Hastie, T., Tibshirani, R. 1996; 58 (1): 155-176
  • Generalized Additive Models Chapman and Hall T. Hastie, R. Tibshirani 1990
  • ZeitZeiger: supervised learning for high-dimensional data from an oscillatory system NUCLEIC ACIDS RESEARCH Hughey, J. J., Hastie, T., Butte, A. J. 2016; 44 (8)


    Numerous biological systems oscillate over time or space. Despite these oscillators' importance, data from an oscillatory system is problematic for existing methods of regularized supervised learning. We present ZeitZeiger, a method to predict a periodic variable (e.g. time of day) from a high-dimensional observation. ZeitZeiger learns a sparse representation of the variation associated with the periodic variable in the training observations, then uses maximum-likelihood to make a prediction for a test observation. We applied ZeitZeiger to a comprehensive dataset of genome-wide gene expression from the mammalian circadian oscillator. Using the expression of 13 genes, ZeitZeiger predicted circadian time (internal time of day) in each of 12 mouse organs to within ∼1 h, resulting in a multi-organ predictor of circadian time. Compared to the state-of-the-art approach, ZeitZeiger was faster, more accurate and used fewer genes. We then validated the multi-organ predictor on 20 additional datasets comprising nearly 800 samples. Our results suggest that ZeitZeiger not only makes accurate predictions, but also gives insight into the behavior and structure of the oscillator from which the data originated. As our ability to collect high-dimensional data from various biological oscillators increases, ZeitZeiger should enhance efforts to convert these data to knowledge.

    View details for DOI 10.1093/nar/gkw030

    View details for Web of Science ID 000376389000011

    View details for PubMedID 26819407

  • Effect of long-term antibiotic use on weight in adolescents with acne. Journal of Antimicrobial Chemotherapy Contopoulos-Ioannidis, D. G., Ley, C., Wang, W., Ma, T., Olson, C., Shi, X., Luft, H. S., Hastie, T., Parsonnet, J. 2016; 71 (4): 1098-1105


    Antibiotics increase weight in farm animals and may cause weight gain in humans. We used electronic health records from a large primary care organization to determine the effect of antibiotics on weight and BMI in healthy adolescents with acne. We performed a retrospective cohort study of adolescents with acne prescribed ≥4 weeks of oral antibiotics with weight measurements within 18 months pre-antibiotics and 12 months post-antibiotics. We compared within-individual changes in weight-for-age Z-scores (WAZs) and BMI-for-age Z-scores (BMIZs). We used: (i) paired t-tests to analyse changes between the last pre-antibiotics versus the first post-antibiotic measurements; (ii) piecewise-constant-mixed models to capture changes between mean measurements pre- versus post-antibiotics; (iii) piecewise-linear-mixed models to capture changes in trajectory slopes pre- versus post-antibiotics; and (iv) χ² tests to compare proportions of adolescents with ≥0.2 Z-scores WAZ or BMIZ increase or decrease. Our cohort included 1012 adolescents with WAZs; 542 also had BMIZs. WAZs decreased post-antibiotics in all analyses [change between last WAZ pre-antibiotics versus first WAZ post-antibiotics = -0.041 Z-scores (P < 0.001); change between mean WAZ pre- versus post-antibiotics = -0.050 Z-scores (P < 0.001); change in WAZ trajectory slopes pre- versus post-antibiotics = -0.025 Z-scores/6 months (P = 0.002)]. More adolescents had a WAZ decrease post-antibiotics ≥0.2 Z-scores than an increase (26% versus 18%; P < 0.001). Trends were similar, though not statistically significant, for BMIZ changes. Contrary to original expectations, long-term antibiotic use in healthy adolescents with acne was not associated with weight gain. This finding, which was consistent across all analyses, does not support a weight-promoting effect of antibiotics in adolescents.

    View details for DOI 10.1093/jac/dkv455

    View details for PubMedID 26782773


    View details for DOI 10.1214/15-AOAS866

    View details for Web of Science ID 000370445600001

  • Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares JOURNAL OF MACHINE LEARNING RESEARCH Hastie, T., Mazumder, R., Lee, J. D., Zadeh, R. 2015; 16: 3367-3402
  • Clinically Relevant Molecular Subtypes in Leiomyosarcoma. Clinical Cancer Research Guo, X., Jo, V. Y., Mills, A. M., Zhu, S. X., Lee, C., Espinosa, I., Nucci, M. R., Varma, S., Forgó, E., Hastie, T., Anderson, S., Ganjoo, K., Beck, A. H., West, R. B., Fletcher, C. D., van de Rijn, M. 2015; 21 (15): 3501-3511


    Leiomyosarcoma is a malignant neoplasm with smooth muscle differentiation. Little is known about its molecular heterogeneity and no targeted therapy currently exists for leiomyosarcoma. Recognition of different molecular subtypes is necessary to evaluate novel therapeutic options. In a previous study on 51 leiomyosarcomas, we identified three molecular subtypes in leiomyosarcoma. The current study was performed to determine whether the existence of these subtypes could be confirmed in independent cohorts. Ninety-nine cases of leiomyosarcoma were expression profiled with 3'end RNA-Sequencing (3SEQ). Consensus clustering was conducted to determine the optimal number of subtypes. We identified 3 leiomyosarcoma molecular subtypes and confirmed this finding by analyzing publicly available data on 82 leiomyosarcoma from The Cancer Genome Atlas (TCGA). We identified two new formalin-fixed, paraffin-embedded tissue-compatible diagnostic immunohistochemical markers: LMOD1 for subtype I leiomyosarcoma and ARL4C for subtype II leiomyosarcoma. A leiomyosarcoma tissue microarray with known clinical outcome was used to show that subtype I leiomyosarcoma is associated with good outcome in extrauterine leiomyosarcoma while subtype II leiomyosarcoma is associated with poor prognosis in both uterine and extrauterine leiomyosarcoma. The leiomyosarcoma subtypes showed significant differences in expression levels for genes for which novel targeted therapies are being developed, suggesting that leiomyosarcoma subtypes may respond differentially to these targeted therapies. We confirm the existence of 3 molecular subtypes in leiomyosarcoma using two independent datasets and show that the different molecular subtypes are associated with distinct clinical outcomes. The findings offer an opportunity for treating leiomyosarcoma in a subtype-specific targeted approach.

    View details for DOI 10.1158/1078-0432.CCR-14-3141

    View details for PubMedID 25896974

  • Effective degrees of freedom: a flawed metaphor BIOMETRIKA Janson, L., Fithian, W., Hastie, T. J. 2015; 102 (2): 479-485
  • Point process models for presence-only analysis METHODS IN ECOLOGY AND EVOLUTION Renner, I. W., Elith, J., Baddeley, A., Fithian, W., Hastie, T., Phillips, S. J., Popovic, G., Warton, D. I. 2015; 6 (4): 366-379
  • CATS regression - a model-based approach to studying trait-based community assembly METHODS IN ECOLOGY AND EVOLUTION Warton, D. I., Shipley, B., Hastie, T. 2015; 6 (4): 389-398
  • Bias correction in species distribution models: pooling survey and collection data for multiple species METHODS IN ECOLOGY AND EVOLUTION Fithian, W., Elith, J., Hastie, T., Keith, D. A. 2015

    View details for DOI 10.1111/2041-210X.12242


    View details for DOI 10.1214/14-AOS1220

    View details for Web of Science ID 000344632400001

  • Assessing the significance of global and local correlations under spatial autocorrelation: A nonparametric approach BIOMETRICS Viladomat, J., Mazumder, R., McInturff, A., McCauley, D. J., Hastie, T. 2014; 70 (2): 409-418


    We propose a method to test the correlation of two random fields when they are both spatially autocorrelated. In this scenario, the assumption of independence for the pair of observations in the standard test does not hold, and as a result we reject in many cases where there is no effect (the precision of the null distribution is overestimated). Our method recovers the null distribution taking into account the autocorrelation. It uses Monte-Carlo methods, and focuses on permuting, and then smoothing and scaling one of the variables to destroy the correlation with the other, while maintaining at the same time the initial autocorrelation. With this simulation model, any test based on the independence of two (or more) random fields can be constructed. This research was motivated by a project in biodiversity and conservation in the Biology Department at Stanford University.

    View details for DOI 10.1111/biom.12139

    View details for Web of Science ID 000337621000016

    View details for PubMedID 24571609

  • Boosted Varying-Coefficient Regression Models for Product Demand Prediction JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Wang, J. C., Hastie, T. 2014; 23 (2): 361-382
  • Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife JOURNAL OF MACHINE LEARNING RESEARCH Wager, S., Hastie, T., Efron, B. 2014; 15: 1625-1651
  • Bias correction in species distribution models: pooling survey and collection data for multiple species METHODS IN ECOLOGY AND EVOLUTION Fithian, W., Elith, J., Hastie, T., Keith, D. 2014; 6 (4): 424-438

    View details for DOI 10.1111/2041-210X.12242

  • Learning the Structure of Mixed Graphical Models Journal of Computational and Graphical Statistics Lee, J. D., Hastie, T. J. 2014; 24 (1): 230-253
  • CATS regression–a model‐based approach to studying trait‐based community assembly Methods & Statistics in Ecology: Methods in Ecology and Evolution Warton, D., Shipley, B., Hastie, T. 2014; 6 (4): 389-398

    View details for DOI 10.1111/2041-210X.12280

  • Learning interactions via hierarchical group-lasso regularization Journal of Computational and Graphical Statistics Lim, M., Hastie, T. 2014
  • Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares Technical Report, Statistics Department, Stanford University Hastie, T., Mazumder, R., Lee, J., Zadeh, R. 2014

    View details for DOI 10.1214/13-AOAS667

    View details for Web of Science ID 000330044900011

  • Inference from presence-only data; the ongoing controversy ECOGRAPHY Hastie, T., Fithian, W. 2013; 36 (8): 864-867
  • A Sparse-Group Lasso JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Simon, N., Friedman, J., Hastie, T., Tibshirani, R. 2013; 22 (2): 231-245
  • Boosted Varying-Coefficient Regression Models for Product Demand Prediction Journal of Computational and Graphical Statistics Wang, J., Hastie, T. 2013; 23 (2): 361-382
  • Effective degrees of freedom: a flawed metaphor Technical Report, Statistics Department, Stanford University Janson, L., Fithian, W., Hastie, T. 2013
  • A blockwise descent algorithm for group-penalized multiresponse and multinomial regression. Technical Report, Statistics Department, Stanford University Simon, N., Friedman, J., Hastie, T. 2013
  • Compressive Feature Learning Curran Associates, Inc., Paskov, H. S., West, R., Mitchell, J. C., Hastie, T. 2013: 2931–39
  • Structure Learning of Mixed Graphical Models Proceedings of the 16th International Conference on Artificial Intelligence and Statistics Lee, J., Hastie, T. 2013: 388–396
  • An Introduction to Statistical Learning with Applications in R James, G., Witten, D., Hastie, T., Tibshirani, R. Springer Texts in Statistics. 2013
  • Coronary risk assessment among intermediate risk patients using a clinical and biomarker based algorithm developed and validated in two population cohorts CURRENT MEDICAL RESEARCH AND OPINION Cross, D. S., McCarty, C. A., Hytopoulos, E., Beggs, M., Nolan, N., Harrington, D. S., Hastie, T., Tibshirani, R., Tracy, R. P., Psaty, B. M., McClelland, R., Tsao, P. S., Quertermous, T. 2012; 28 (11): 1819-1830


    Many coronary heart disease (CHD) events occur in individuals classified as intermediate risk by commonly used assessment tools. Over half the individuals presenting with a severe cardiac event, such as myocardial infarction (MI), have at most one risk factor as included in the widely used Framingham risk assessment. Individuals classified as intermediate risk, who are actually at high risk, may not receive guideline recommended treatments. A clinically useful method for accurately predicting 5-year CHD risk among intermediate risk patients remains an unmet medical need. This study sought to develop a CHD Risk Assessment (CHDRA) model that improves 5-year risk stratification among intermediate risk individuals. Assay panels for biomarkers associated with atherosclerosis biology (inflammation, angiogenesis, apoptosis, chemotaxis, etc.) were optimized for measuring baseline serum samples from 1084 initially CHD-free Marshfield Clinic Personalized Medicine Research Project (PMRP) individuals. A multivariable Cox regression model was fit using the most powerful risk predictors within the clinical and protein variables identified by repeated cross-validation. The resulting CHDRA algorithm was validated in a Multi-Ethnic Study of Atherosclerosis (MESA) case-cohort sample. A CHDRA algorithm of age, sex, diabetes, and family history of MI, combined with serum levels of seven biomarkers (CTACK, Eotaxin, Fas Ligand, HGF, IL-16, MCP-3, and sFas) yielded a clinical net reclassification index of 42.7% (p < 0.001) for MESA patients with a recalibrated Framingham 5-year intermediate risk level. Across all patients, the model predicted acute coronary events (hazard ratio = 2.17, p < 0.001), and remained an independent predictor after Framingham risk factor adjustments. Limitations include the slightly different event definition with the MESA samples and the inability to include PMRP fatal CHD events. A novel risk score of serum protein levels plus clinical risk factors, developed and validated in independent cohorts, demonstrated clinical utility for assessing the true risk of CHD events in intermediate risk patients. Improved accuracy in cardiovascular risk classification could lead to improved preventive care and fewer deaths.

    View details for DOI 10.1185/03007995.2012.742878

    View details for Web of Science ID 000310985600009

    View details for PubMedID 23092312

  • No increased mortality with early aortic aneurysm disease 26th Annual Meeting of the Western-Vascular-Society Mell, M., White, J. J., Hill, B. B., Hastie, T., Dalman, R. L. MOSBY-ELSEVIER. 2012: 1246–51


    In addition to increased risks for aneurysm-related death, previous studies have determined that all-cause mortality in abdominal aortic aneurysm (AAA) patients is excessive and equivalent to that associated with coronary heart disease. These studies largely preceded the current era of coronary heart disease risk factor management, however, and no recent study has examined contemporary mortality associated with early AAA disease (aneurysm diameter between 3 and 5 cm). As part of an ongoing natural history study of AAA, we report the mortality risk associated with presence of early disease. Participants were recruited from three distinct health care systems in Northern California between 2006 and 2011. Aneurysm diameter, demographic information, comorbidities, medication history, and plasma for biomarker analysis were collected at study entry. Survival status was determined at follow-up. Data were analyzed with t-tests or χ² tests where appropriate. Freedom from death was calculated via Cox proportional hazards modeling; the relevance of individual predictors on mortality was determined by log-rank test. The study enrolled 634 AAA patients; age 76.4 ± 8.0 years, aortic diameter 3.86 ± 0.7 cm. Participants were mostly male (88.8%), not current smokers (81.6%), and taking statins (76.7%). Mean follow-up was 2.1 ± 1.0 years. Estimated 1- and 3-year survival was 98.2% and 90.9%, respectively. Factors independently associated with mortality included larger aneurysm size (hazard ratio, 2.12; 95% confidence interval, 1.26-3.57 for diameter >4.0 cm) and diabetes (hazard ratio, 2.24; 95% confidence interval, 1.12-4.47). After adjusting for patient-level factors, health care system independently predicted mortality. Contemporary all-cause mortality for patients with early AAA disease is lower than that previously reported. Further research is warranted to determine important factors that contribute to improved survival in early AAA disease.

    View details for DOI 10.1016/j.jvs.2012.04.023

    View details for Web of Science ID 000310428200007

    View details for PubMedID 22832264

  • Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso JOURNAL OF MACHINE LEARNING RESEARCH Mazumder, R., Hastie, T. 2012; 13: 781-794
  • Strong rules for discarding predictors in lasso-type problems JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., Tibshirani, R. J. 2012; 74: 245-266
  • The graphical lasso: New insights and alternatives ELECTRONIC JOURNAL OF STATISTICS Mazumder, R., Hastie, T. 2012; 6: 2125-2149

    View details for DOI 10.1214/12-EJS740

    View details for Web of Science ID 000321016800001

  • Improved coronary risk assessment among intermediate risk patients using a clinical and biomarker based algorithm developed and validated in two population cohorts Current Medical Research and Opinion Cross, D. S., McCarty, C. A., Hytopoulos, E., Beggs, M., Nolan, N., Harrington, D. S., Hastie, T., Tibshirani, R., Tracy, R. P., Psaty, B. M., McClelland, R., Tsao, P. S., Quertermous, T. 2012
  • Sparse Discriminant Analysis TECHNOMETRICS Clemmensen, L., Hastie, T., Witten, D., Ersboll, B. 2011; 53 (4): 406-413
  • A fused lasso latent feature model for analyzing multi-sample aCGH data BIOSTATISTICS Nowak, G., Hastie, T., Pollack, J. R., Tibshirani, R. 2011; 12 (4): 776-791


    Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Some simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.

    View details for DOI 10.1093/biostatistics/kxr012

    View details for Web of Science ID 000294806800014

    View details for PubMedID 21642389

  • SparseNet: Coordinate Descent With Nonconvex Penalties JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Mazumder, R., Friedman, J. H., Hastie, T. 2011; 106 (495): 1125-1138
  • Regularization Paths for Cox's Proportional Hazards Model via Coordinate Descent JOURNAL OF STATISTICAL SOFTWARE Simon, N., Friedman, J., Hastie, T., Tibshirani, R. 2011; 39 (5): 1-13
  • A statistical explanation of MaxEnt for ecologists DIVERSITY AND DISTRIBUTIONS Elith, J., Phillips, S. J., Hastie, T., Dudik, M., Chee, Y. E., Yates, C. J. 2011; 17 (1): 43-57
  • Spectral Regularization Algorithms for Learning Large Incomplete Matrices JOURNAL OF MACHINE LEARNING RESEARCH Mazumder, R., Hastie, T., Tibshirani, R. 2010; 11: 2287-2322


    We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm Soft-Impute iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices: for example it can obtain a rank-80 approximation of a 10^6 × 10^6 incomplete matrix with 10^5 observed entries in 2.5 hours, and can fit a rank 40 approximation to the full Netflix training set in 6.6 hours. Our methods show very good performance both in training and test error when compared to other competitive state-of-the-art techniques.

    View details for Web of Science ID 000282523300010

  • Dynamic visualization of statistical learning in the context of high-dimensional textual data JOURNAL OF WEB SEMANTICS Greenacre, M., Hastie, T. 2010; 8 (2-3): 163-168
  • Likelihood-Based Sufficient Dimension Reduction JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Zhu, M., Hastie, T. J. 2010; 105 (490): 880-880
  • Cell type-specific gene expression differences in complex tissues NATURE METHODS Shen-Orr, S. S., Tibshirani, R., Khatri, P., Bodian, D. L., Staedtler, F., Perry, N. M., Hastie, T., Sarwal, M. M., Davis, M. M., Butte, A. J. 2010; 7 (4): 287-289


    We describe cell type-specific significance analysis of microarrays (csSAM) for analyzing differential gene expression for each cell type in a biological sample from microarray data and relative cell-type frequencies. First, we validated csSAM with predesigned mixtures and then applied it to whole-blood gene expression datasets from stable post-transplant kidney transplant recipients and those experiencing acute transplant rejection, which revealed hundreds of differentially expressed genes that were otherwise undetectable.

    View details for DOI 10.1038/NMETH.1439

    View details for Web of Science ID 000276150600017

    View details for PubMedID 20208531

  • Discovery of molecular subtypes in leiomyosarcoma through integrative molecular profiling ONCOGENE Beck, A. H., Lee, C., Witten, D. M., Gleason, B. C., Edris, B., Espinosa, I., Zhu, S., Li, R., Montgomery, K. D., Marinelli, R. J., Tibshirani, R., Hastie, T., Jablons, D. M., Rubin, B. P., Fletcher, C. D., West, R. B., van de Rijn, M. 2010; 29 (6): 845-854


    Leiomyosarcoma (LMS) is a soft tissue tumor with a significant degree of morphologic and molecular heterogeneity. We used integrative molecular profiling to discover and characterize molecular subtypes of LMS. Gene expression profiling was performed on 51 LMS samples. Unsupervised clustering showed three reproducible LMS clusters. Array comparative genomic hybridization (aCGH) was performed on 20 LMS samples and showed that the molecular subtypes defined by gene expression showed distinct genomic changes. Tumors from the 'muscle-enriched' cluster showed significantly increased copy number changes (P=0.04). A majority of the muscle-enriched cases showed loss at 16q24, which contains Fanconi anemia, complementation group A, known to have an important role in DNA repair, and loss at 1p36, which contains PRDM16, of which loss promotes muscle differentiation. Immunohistochemistry (IHC) was performed on LMS tissue microarrays (n=377) for five markers with high levels of messenger RNA in the muscle-enriched cluster (ACTG2, CASQ2, SLMAP, CFL2 and MYLK) and showed significantly correlated expression of the five proteins (all pairwise P<0.005). Expression of the five markers was associated with improved disease-specific survival in a multivariate Cox regression analysis (P<0.04). In this analysis that combined gene expression profiling, aCGH and IHC, we characterized distinct molecular LMS subtypes, provided insight into their pathogenesis, and identified prognostic biomarkers.

    View details for DOI 10.1038/onc.2009.381

    View details for Web of Science ID 000274397800007

    View details for PubMedID 19901961

    View details for PubMedCentralID PMC2820592

  • Regularization Paths for Generalized Linear Models via Coordinate Descent JOURNAL OF STATISTICAL SOFTWARE Friedman, J., Hastie, T., Tibshirani, R. 2010; 33 (1): 1-22


    We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.

    View details for Web of Science ID 000275203200001
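
The cyclical coordinate descent idea in the abstract above can be sketched for the simplest case, linear regression with an elastic-net penalty. This is a minimal illustration in plain Python, not the authors' software: the function names, the fixed number of sweeps, the absence of an intercept and the assumption that no column of X is identically zero are all choices made here for brevity.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, lam, alpha, n_sweeps=200):
    """Cyclical coordinate descent (illustrative sketch) for
        (1/2n) * sum_i (y_i - x_i . beta)^2
        + lam * ((1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1).
    X is a list of rows. No intercept is fitted, so the data are assumed
    centered, and every column is assumed to be non-zero."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    # Mean square of each column; equals 1 for standardized predictors.
    col_ms = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Inner product of column j with the partial residual,
            # i.e. the residual with coordinate j's contribution added back.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ms[j] * beta[j]
            new_bj = soft_threshold(rho, lam * alpha) / (col_ms[j] + lam * (1.0 - alpha))
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta
```

Setting alpha to 1 gives the lasso and alpha to 0 gives ridge regression. A regularization path would be obtained by calling the routine over a decreasing grid of lam values and starting each fit from the previous solution; the sketch omits that warm-start step.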

  • Network-Based Elucidation of Human Disease Similarities Reveals Common Functional Modules Enriched for Pluripotent Drug Targets PLOS COMPUTATIONAL BIOLOGY Suthram, S., Dudley, J. T., Chiang, A. P., Chen, R., Hastie, T. J., Butte, A. J. 2010; 6 (2)


    Current work in elucidating relationships between diseases has largely been based on pre-existing knowledge of disease genes. Consequently, these studies are limited in their discovery of new and unknown disease relationships. We present the first quantitative framework to compare and contrast diseases by an integrated analysis of disease-related mRNA expression data and the human protein interaction network. We identified 4,620 functional modules in the human protein network and provided a quantitative metric to record their responses in 54 diseases leading to 138 significant similarities between diseases. Fourteen of the significant disease correlations also shared common drugs, supporting the hypothesis that similar diseases can be treated by the same drugs, allowing us to make predictions for new uses of existing drugs. Finally, we also identified 59 modules that were dysregulated in at least half of the diseases, representing a common disease-state "signature". These modules were significantly enriched for genes that are known to be drug targets. Interestingly, drugs known to target these genes/proteins are already known to treat significantly more diseases than drugs targeting other genes/proteins, highlighting the importance of these core modules as prime therapeutic opportunities.

    View details for DOI 10.1371/journal.pcbi.1000662

    View details for Web of Science ID 000275260000026

    View details for PubMedID 20140234

    View details for PubMedCentralID PMC2816673

  • A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis BIOSTATISTICS Witten, D. M., Tibshirani, R., Hastie, T. 2009; 10 (3): 515-534


    We present a penalized matrix decomposition (PMD), a new framework for computing a rank-K approximation for a matrix. We approximate the matrix X as $\hat{X} = \sum_{k=1}^{K} d_k u_k v_k^T$, where $d_k$, $u_k$, and $v_k$ minimize the squared Frobenius norm of $X - \hat{X}$, subject to penalties on $u_k$ and $v_k$. This results in a regularized version of the singular value decomposition. Of particular interest is the use of $L_1$-penalties on $u_k$ and $v_k$, which yields a decomposition of X using sparse vectors. We show that when the PMD is applied using an $L_1$-penalty on $v_k$ but not on $u_k$, a method for sparse principal components results. In fact, this yields an efficient algorithm for the "SCoTLASS" proposal (Jolliffe and others 2003) for obtaining sparse principal components. This method is demonstrated on a publicly available gene expression data set. We also establish connections between the SCoTLASS method for sparse principal component analysis and the method of Zou and others (2006). In addition, we show that when the PMD is applied to a cross-products matrix, it results in a method for penalized canonical correlation analysis (CCA). We apply this penalized CCA method to simulated data and to a genomic data set consisting of gene expression and DNA copy number measurements on the same set of samples.

    View details for DOI 10.1093/biostatistics/kxp008

    View details for Web of Science ID 000267213700010

    View details for PubMedID 19377034

  • Presence-Only Data and the EM Algorithm BIOMETRICS Ward, G., Hastie, T., Barry, S., Elith, J., Leathwick, J. R. 2009; 65 (2): 554-563


    In ecological modeling of the habitat of a species, it can be prohibitively expensive to determine species absence. Presence-only data consist of a sample of locations with observed presences and a separate group of locations sampled from the full landscape, with unknown presences. We propose an expectation-maximization algorithm to estimate the underlying presence-absence logistic model for presence-only data. This algorithm can be used with any off-the-shelf logistic model. For models with stepwise fitting procedures, such as boosted trees, the fitting process can be accelerated by interleaving expectation steps within the procedure. Preliminary analyses based on sampling from presence-absence records of fish in New Zealand rivers illustrate that this new procedure can reduce both deviance and the shrinkage of marginal effect estimates that occur in the naive model often used in practice. Finally, it is shown that the population prevalence of a species is only identifiable when there is some unrealistic constraint on the structure of the logistic model. In practice, it is strongly recommended that an estimate of population prevalence be provided.

    View details for DOI 10.1111/j.1541-0420.2008.01116.x

    View details for Web of Science ID 000266449900025

    View details for PubMedID 18759851

  • Genome-wide association analysis by lasso penalized logistic regression BIOINFORMATICS Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E., Lange, K. 2009; 25 (6): 714-721


    In ordinary regression, imposition of a lasso penalty makes continuous model selection straightforward. Lasso penalized regression is particularly advantageous when the number of predictors far exceeds the number of observations. The present article evaluates the performance of lasso penalized logistic regression in case-control disease gene mapping with a large number of SNP (single nucleotide polymorphism) predictors. The strength of the lasso penalty can be tuned to select a predetermined number of the most relevant SNPs and other predictors. For a given value of the tuning constant, the penalized likelihood is quickly maximized by cyclic coordinate ascent. Once the most potent marginal predictors are identified, their two-way and higher order interactions can also be examined by lasso penalized logistic regression. This strategy is tested on both simulated and real data. Our findings on coeliac disease replicate the previous SNP results and shed light on possible interactions among the SNPs. The software discussed is available in Mendel 9.0 at the UCLA Human Genetics web site. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btp041

    View details for Web of Science ID 000264189600003

    View details for PubMedID 19176549

  • Discovery of Molecular Subtypes in Leiomyosarcoma through Integrative Molecular Profiling 98th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology Beck, A. H., Lee, C. H., Witten, D. M., Zhou, S., Montgomery, K., Tibshirani, R., Hastie, T., West, R. B., van de Rijn, M. NATURE PUBLISHING GROUP. 2009: 368A–368A
  • Multi-class AdaBoost STATISTICS AND ITS INTERFACE Zhu, J., Zou, H., Rosset, S., Hastie, T. 2009; 2 (3): 349-360
  • The Elements of Statistical Learning: Prediction, Inference, and Data Mining Hastie, T., Tibshirani, R., Friedman, J. Springer Verlag. 2009

    View details for DOI 10.1214/08-AOAS198

    View details for Web of Science ID 000262731100009

  • New cutpoints to identify increased HER2 copy number: analysis of a large, population-based cohort with long-term follow-up BREAST CANCER RESEARCH AND TREATMENT Jensen, K. C., Turbin, D. A., Leung, S., Miller, M. A., Johnson, K., Norris, B., Hastie, T., McKinney, S., Nielsen, T. O., Huntsman, D. G., Gilks, C. B., West, R. B. 2008; 112 (3): 453-459


    HER2 gene amplification and/or protein overexpression in breast cancer is associated with a poor prognosis and predicts response to anti-HER2 therapy. We examine the natural history of breast cancers in relationship to increased HER2 copy numbers in a large population-based study. HER2 status was measured by fluorescence in situ hybridization (FISH) and immunohistochemistry (IHC) in approximately 1,400 breast cancer cases with greater than 15 years of follow-up. Protein expression was evaluated with two different commercially available antibodies. We looked for subgroups of breast cancer with different clinical outcomes, based on HER2 FISH amplification ratio. The current HER2 ratio cut point for classifying HER2 positive and negative cases is 2.2. However, we found an increased risk of disease-specific death associated with FISH ratios of >1.5. An 'intermediate' group of cases with HER2 ratios between 1.5 and 2.2 was found to have a significantly better outcome than the conventional 'amplified' group (HER2 ratio >2.2) but a significantly worse outcome than groups with FISH ratios less than 1.5. Breast cancers with increased HER2 copy numbers (low level HER2 amplification), below the currently accepted positive threshold ratio of 2.2, showed a distinct, intermediate outcome when compared to HER2 unamplified tumors and tumors with HER2 ratios greater than 2.2. These findings suggest that a new cut point to determine HER2 positivity, at a ratio of 1.5 (well below the current recommended cut point of 2.2), should be evaluated.

    View details for DOI 10.1007/s10549-007-9887-y

    View details for Web of Science ID 000261951000007

    View details for PubMedID 18193353

  • Risk estimation of distant metastasis in node-negative, estrogen receptor-positive breast cancer patients using an RT-PCR based prognostic expression signature BMC CANCER Tutt, A., Wang, A., Rowland, C., Gillett, C., Lau, K., Chew, K., Dai, H., Kwok, S., Ryder, K., Shu, H., Springall, R., Cane, P., McCallie, B., Kam-Morgan, L., Anderson, S., Buerger, H., Gray, J., Bennington, J., Esserman, L., Hastie, T., Broder, S., Sninsky, J., Brandt, B., Waldman, F. 2008; 8


    Given the large number of genes purported to be prognostic for breast cancer, it would be optimal if the genes identified are not confounded by the continuously changing systemic therapies. The aim of this study was to discover and validate a breast cancer prognostic expression signature for distant metastasis in untreated, early stage, lymph node-negative (N-) estrogen receptor-positive (ER+) patients with extensive follow-up times. 197 genes previously associated with metastasis and ER status were profiled from 142 untreated breast cancer subjects. A "metastasis score" (MS) representing fourteen differentially expressed genes was developed and evaluated for its association with distant-metastasis-free survival (DMFS). Categorical risk classification was established from the continuous MS and further evaluated on an independent set of 279 untreated subjects. A third set of 45 subjects was tested to determine the prognostic performance of the MS in tamoxifen-treated women. A 14-gene signature was found to be significantly associated (p < 0.05) with distant metastasis in a training set and subsequently in an independent validation set. In the validation set, the hazard ratios (HR) of the high risk compared to low risk groups were 4.02 (95% CI 1.91-8.44) for the endpoint of DMFS and 1.97 (95% CI 1.28-3.04) for overall survival after adjustment for age, tumor size and grade. The low and high MS risk groups had 10-year estimates (95% CI) of 96% (90-99%) and 72% (64-78%), respectively, for DMFS and 91% (84-95%) and 68% (61-75%), respectively, for overall survival. Performance characteristics of the signature in the two sets were similar. Ki-67 labeling index (LI) was predictive for recurrent disease in the training set, but lost significance after adjustment for the expression signature. In a study of tamoxifen-treated patients, the HR for DMFS in high compared to low risk groups was 3.61 (95% CI 0.86-15.14). The 14-gene signature is significantly associated with risk of distant metastasis. The signature has a predominance of proliferation genes which have prognostic significance above that of Ki-67 LI and may aid in prioritizing future mechanistic studies and therapeutic interventions.

    View details for DOI 10.1186/1471-2407-8-339

    View details for Web of Science ID 000262700100001

    View details for PubMedID 19025599

  • Combining biological gene expression signatures in predicting outcome in breast cancer: An alternative to supervised classification EUROPEAN JOURNAL OF CANCER Nuyten, D. S., Hastie, T., Chi, J. A., Chang, H. Y., van de Vijver, M. J. 2008; 44 (15): 2319-2329


    Gene expression profiling has been extensively used to predict outcome in breast cancer patients. We have previously reported on biological hypothesis-driven analysis of gene expression profiling data and we wished to extend this approach through the combinations of various gene signatures to improve the prediction of outcome in breast cancer. We have used gene expression data (25,000 gene probes) from a previously published study of tumours from 295 early stage breast cancer patients from the Netherlands Cancer Institute using updated follow-up. Tumours were assigned to three prognostic groups using the previously reported wound-response and hypoxia-response signatures, and the outcome in each of these subgroups was evaluated. We have assigned invasive breast carcinomas from 295 stage I and II breast cancer patients to three groups based on gene expression profiles subdivided by the wound-response signature (WS) and hypoxia-response signature (HS). These three groups are (1) quiescent WS/non-hypoxic HS; (2) activated WS/non-hypoxic HS or quiescent WS/hypoxic tumours and (3) activated WS/hypoxic HS. The overall survival at 15 years for patients with tumours in groups 1, 2 and 3 is 79%, 59% and 27%, respectively. In multivariate analysis, this signature is not only independent of clinical and pathological risk factors; it is also the strongest predictor of outcome. Compared to a previously identified 70-gene prognosis profile, obtained with supervised classification, the combination of signatures performs roughly equally well and might have additional value in the ER-negative subgroup. In the subgroup of lymph node positive patients, the combination signature outperforms the 70-gene signature in multivariate analysis. In addition, in multivariate analysis, the WS/HS combination is a stronger predictor of outcome compared to the recently reported invasiveness gene signature combined with the WS. A combination of biological gene expression signatures can be used to identify a powerful and independent predictor for outcome in breast cancer patients.

    View details for DOI 10.1016/j.ejca.2008.07.015

    View details for Web of Science ID 000261020800031

    View details for PubMedID 18715778

    View details for PubMedCentralID PMC3756930

  • "Preconditioning" for feature selection and regression in high-dimensional problems ANNALS OF STATISTICS Paul, D., Bair, E., Hastie, T., Tibshirani, R. 2008; 36 (4): 1595-1618
  • Dispersal, disturbance and the contrasting biogeographies of New Zealand's diadromous and non-diadromous fish species JOURNAL OF BIOGEOGRAPHY Leathwick, J. R., Elith, J., Chadderton, W. L., Rowe, D., Hastie, T. 2008; 35 (8): 1481-1497
  • Sparse inverse covariance estimation with the graphical lasso BIOSTATISTICS Friedman, J., Hastie, T., Tibshirani, R. 2008; 9 (3): 432-441


    We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop a simple algorithm--the graphical lasso--that is remarkably fast: It solves a 1000-node problem (approximately 500,000 parameters) in at most a minute and is 30-4000 times faster than competing methods. It also provides a conceptual link between the exact problem and the approximation suggested by Meinshausen and Bühlmann (2006). We illustrate the method on some cell-signaling data from proteomics.

    View details for DOI 10.1093/biostatistics/kxm045

    View details for Web of Science ID 000256977000005

    View details for PubMedID 18079126

  • A working guide to boosted regression trees JOURNAL OF ANIMAL ECOLOGY Elith, J., Leathwick, J. R., Hastie, T. 2008; 77 (4): 802-813


    1. Ecologists use statistical models for both explanation and prediction, and need techniques that are flexible enough to express typical features of their data, such as nonlinearities and interactions. 2. This study provides a working guide to boosted regression trees (BRT), an ensemble method for fitting statistical models that differs fundamentally from conventional techniques that aim to fit a single parsimonious model. Boosted regression trees combine the strengths of two algorithms: regression trees (models that relate a response to their predictors by recursive binary splits) and boosting (an adaptive method for combining many simple models to give improved predictive performance). The final BRT model can be understood as an additive regression model in which individual terms are simple trees, fitted in a forward, stagewise fashion. 3. Boosted regression trees incorporate important advantages of tree-based methods, handling different types of predictor variables and accommodating missing data. They have no need for prior data transformation or elimination of outliers, can fit complex nonlinear relationships, and automatically handle interaction effects between predictors. Fitting multiple trees in BRT overcomes the biggest drawback of single tree models: their relatively poor predictive performance. Although BRT models are complex, they can be summarized in ways that give powerful ecological insight, and their predictive performance is superior to most traditional modelling methods. 4. The unique features of BRT raise a number of practical issues in model fitting. We demonstrate the practicalities and advantages of using BRT through a distributional analysis of the short-finned eel (Anguilla australis Richardson), a native freshwater fish of New Zealand. We use a data set of over 13 000 sites to illustrate effects of several settings, and then fit and interpret a model using a subset of the data. 
    We provide code and a tutorial to enable the wider use of BRT by ecologists.

    View details for DOI 10.1111/j.1365-2656.2008.01390.x

    View details for Web of Science ID 000256539800020

    View details for PubMedID 18397250

  • Novel methods for the design and evaluation of marine protected areas in offshore waters CONSERVATION LETTERS Leathwick, J., Moilanen, A., Francis, M., Elith, J., Taylor, P., Julian, K., Hastie, T., Duffy, C. 2008; 1 (2): 91-102
  • Radiation-induced gene expression in human subcutaneous fibroblasts is predictive of radiation-induced fibrosis RADIOTHERAPY AND ONCOLOGY Rodningen, A. K., Borresen-Dale, A., Alsner, J., Hastie, T., Overgaard, J. 2008; 86 (3): 314-320


    Breast cancer patients show a large variation in normal tissue reactions after ionizing radiation (IR) therapy. One of the most common long-term adverse effects of ionizing radiotherapy is radiation-induced fibrosis (RIF), and several attempts have been made over the last years to develop predictive assays for RIF. Our aim was to identify basal and radiation-induced transcriptional profiles in fibroblasts from breast cancer patients that might be related to the individual risk of RIF in these patients. Fibroblast cell lines from 31 individuals with variable risk of RIF (grouped into five classes from low to high risk) were irradiated with two different schemes: 1 x 3.5 Gy with RNA isolated 2 and 24 h after irradiation, and a fractionated scheme with 3 x 3.5 Gy in intervals of 24 h with RNA isolated 2 h after the last dose. RNA was also isolated from non-treated fibroblasts. Transcriptional differences in basal and radiation-induced gene expression profiles were investigated using 15K cDNA microarrays, and results analyzed by both SAM and PAM. Sixty differentially expressed genes were identified by applying SAM on 10 patients with the highest risk of RIF and the four patients with the lowest risk of RIF after the fractionated scheme. The genes were associated with known functions in processes like apoptosis, extracellular matrix remodelling/cell adhesion, proliferation and ROS scavenging. A minimum set of 18 genes was identified that could differentiate high risk from low risk patients after the fractionated scheme. The classifier of 18 genes may provide a basis for a predictive assay for normal tissue reactions after radiotherapy, and provide new insight into the molecular mechanisms of RIF.

    View details for DOI 10.1016/j.radonc.2007.09.013

    View details for Web of Science ID 000255304300003

    View details for PubMedID 17963910

  • HER2 status in a large, population-based cohort: Analysis of distinct HER2 subgroups 97th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology Jensen, K. C., Turbin, D. A., Leung, S., Miller, M. A., Johnson, K., Norris, B., Hastie, T., McKinney, S., Nielsen, T. O., Huntsman, D. G., Gilks, C. B., West, R. B. NATURE PUBLISHING GROUP. 2008: 39A–39A
  • Penalized logistic regression for detecting gene interactions BIOSTATISTICS Park, M. Y., Hastie, T. 2008; 9 (1): 30-50


    We propose using a variant of logistic regression (LR) with $L_2$-regularization to fit gene-gene and gene-environment interaction models. Studies have shown that many common diseases are influenced by interaction of certain genes. LR models with quadratic penalization not only correctly characterize the influential genes along with their interaction structures but also yield additional benefits in handling high-dimensional, discrete factors with a binary response. We illustrate the advantages of using an $L_2$-regularization scheme and compare its performance with that of "multifactor dimensionality reduction" and "FlexTree," two recent tools for identifying gene-gene interactions. Through simulated and real data sets, we demonstrate that our method outperforms other methods in the identification of the interaction structures as well as prediction accuracy. In addition, we validate the significance of the factors selected through bootstrap analyses.

    View details for DOI 10.1093/biostatistics/kxm010

    View details for Web of Science ID 000251679400003

    View details for PubMedID 17429103

  • On the "degrees of freedom" of the lasso ANNALS OF STATISTICS Zou, H., Hastie, T., Tibshirani, R. 2007; 35 (5): 2173-2192
  • Nonlinear estimators and tail bounds for dimension reduction in $l_1$ using Cauchy random projections JOURNAL OF MACHINE LEARNING RESEARCH Li, P., Hastie, T. J., Church, K. W. 2007; 8: 2497-2532
  • Gene expression programs of human smooth muscle cells: Tissue-specific differentiation and prognostic significance in breast cancers PLOS GENETICS Chi, J., Rodriguez, E. H., Wang, Z., Nuyten, D. S., Mukherjee, S., van de Rijn, M., van de Vijver, M. J., Hastie, T., Brown, P. O. 2007; 3 (9): 1770-1784


    Smooth muscle is present in a wide variety of anatomical locations, such as blood vessels, various visceral organs, and hair follicles. Contraction of smooth muscle is central to functions as diverse as peristalsis, urination, respiration, and the maintenance of vascular tone. Despite the varied physiological roles of smooth muscle cells (SMCs), we possess only a limited knowledge of the heterogeneity underlying their functional and anatomic specializations. As a step toward understanding the intrinsic differences between SMCs from different anatomical locations, we used DNA microarrays to profile global gene expression patterns in 36 SMC samples from various tissues after propagation under defined conditions in cell culture. Significant variations were found between the cells isolated from blood vessels, bronchi, and visceral organs. Furthermore, pervasive differences were noted within the visceral organ subgroups that appear to reflect the distinct molecular pathways essential for organogenesis as well as those involved in organ-specific contractile and physiological properties. Finally, we sought to understand how this diversity may contribute to SMC-involving pathology. We found that a gene expression signature of the responses of vascular SMCs to serum exposure is associated with a significantly poorer prognosis in human cancers, potentially linking vascular injury response to tumor progression.

    View details for DOI 10.1371/journal.pgen.0030164

    View details for Web of Science ID 000249767800019

    View details for PubMedID 17907811

  • Averaged gene expressions for regression BIOSTATISTICS Park, M. Y., Hastie, T., Tibshirani, R. 2007; 8 (2): 212-227


    Although averaging is a simple technique, it plays an important role in reducing variance. We use this essential property of averaging in regression of the DNA microarray data, which poses the challenge of having far more features than samples. In this paper, we introduce a two-step procedure that combines (1) hierarchical clustering and (2) Lasso. By averaging the genes within the clusters obtained from hierarchical clustering, we define supergenes and use them to fit regression models, thereby attaining concise interpretation and accuracy. Our methods are supported with theoretical justifications and demonstrated on simulated and real data sets.

    View details for DOI 10.1093/biostatistics/kxl002

    View details for Web of Science ID 000245512000004

    View details for PubMedID 16698769

  • Margin trees for high-dimensional classification JOURNAL OF MACHINE LEARNING RESEARCH Tibshirani, R., Hastie, T. 2007; 8: 637-652
  • Outlier sums for differential gene expression analysis BIOSTATISTICS Tibshirani, R., Hastie, T. 2007; 8 (1): 2-8


    We propose a method for detecting genes that, in a disease group, exhibit unusually high gene expression in some but not all samples. This can be particularly useful in cancer studies, where mutations that can amplify or turn off gene expression often occur in only a minority of samples. In real and simulated examples, the new method often exhibits lower false discovery rates than simple t-statistic thresholding. We also compare our approach to the recent cancer profile outlier analysis proposal of Tomlins and others (2005).

    View details for DOI 10.1093/biostatistics/kxl005

    View details for Web of Science ID 000242715400001

    View details for PubMedID 16702229

  • $L_1$-regularization path algorithm for generalized linear models JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Park, M. Y., Hastie, T. 2007; 69: 659-677
  • Nonlinear estimators and tail bounds for dimension reduction in $l_1$ using Cauchy random projections 20th Annual Conference on Learning Theory Li, P., Hastie, T. J., Church, K. W. SPRINGER-VERLAG BERLIN. 2007: 514–529
  • Automatic bias correction methods in semi-supervised learning AMS/IMS/SIAM Joint Summer Research Conference on Machine and Statistical Learning - Prediction and Discovery Zou, H., Zhu, J., Rosset, S., Hastie, T. AMER MATHEMATICAL SOC. 2007: 165–175
  • Does cancer risk affect health-related quality of life in patients with Barrett's esophagus? Digestive Disease Week Meeting/106th Annual Meeting of the American-Gastroenterological-Association Gerson, L. B., Ullah, N., Hastie, T., Goldstein, M. K. MOSBY-ELSEVIER. 2007: 16–25


    Health-related quality of life is decreased in patients with GERD and Barrett's esophagus (BE). To determine whether time-tradeoff (TTO) values would differ in patients with BE when patients were asked to trade away the potential risk of esophageal adenocarcinoma rather than chronic heartburn symptoms. A prospective clinical trial. Subjects with biopsy-proven BE. Custom-designed computer program to elicit health-state utility values, quality of life in reflux and dyspepsia (QOLRAD), and Medical Outcomes Survey short form-36 surveys. TTO utility values for the annual cancer-risk-associated current health state and for hypothetical scenarios of dysplasia and esophageal cancer. We studied 60 patients in the cancer-risk cohort (57 men, 92% veteran; mean age [standard deviation; SD], 65 years [11 years], mean GERD duration 17 years [12 years]). The heartburn cohort included 40 patients with GERD and BE with TTO values derived for GERD symptoms. The mean (SD) utility for nondysplastic BE was 0.91 (0.13) compared with 0.90 (0.12) for the heartburn cohort (P = .7). The mean utility values were significantly lower for scenarios of low-grade dysplasia (0.85 [0.12], P = .02) and high-grade dysplasia (0.77 [0.14], P < .005). The mean TTO was 0.67 (0.19) for the scenario of esophageal cancer. There was no correlation between the utility scores and the disease-specific survey scores. TTO values were hypothetical for states of dysplasia and cancer. TTO utility values based on heartburn symptoms or annual risk of cancer in patients with nondysplastic BE are roughly equivalent. However, TTO utility values are significantly lower for health states with increasing cancer risks.

    View details for DOI 10.1016/j.gie.2006.05.018

    View details for Web of Science ID 000243361000005

    View details for PubMedID 17185075

  • Characterization of heterotypic interaction effects in vitro to deconvolute global gene expression profiles in cancer GENOME BIOLOGY Buess, M., Nuyten, D. S., Hastie, T., Nielsen, T., Pesich, R., Brown, P. O. 2007; 8 (9)


    Perturbations in cell-cell interactions are a key feature of cancer. However, little is known about the systematic effects of cell-cell interaction on global gene expression in cancer. We used an ex vivo model to simulate tumor-stroma interaction by systematically co-cultivating breast cancer cells with stromal fibroblasts and determined associated gene expression changes with cDNA microarrays. In the complex picture of epithelial-mesenchymal interaction effects, a prominent characteristic was an induction of interferon-response genes (IRGs) in a subset of cancer cells. In close proximity to these cancer cells, the fibroblasts secreted type I interferons, which, in turn, induced expression of the IRGs in the tumor cells. Paralleling this model, immunohistochemical analysis of human breast cancer tissues showed that STAT1, the key transcriptional activator of the IRGs, and itself an IRG, was expressed in a subset of the cancers, with a striking pattern of elevated expression in the cancer cells in close proximity to the stroma. In vivo, expression of the IRGs was remarkably coherent, providing a basis for segregation of 295 early-stage breast cancers into two groups. Tumors with high compared to low expression levels of IRGs were associated with significantly shorter overall survival; 59% versus 80% at 10 years (log-rank p = 0.001). In an effort to deconvolute global gene expression profiles of breast cancer by systematic characterization of heterotypic interaction effects in vitro, we found that an interaction between some breast cancer cells and stromal fibroblasts can induce an interferon-response, and that this response may be associated with a greater propensity for tumor progression.

    View details for DOI 10.1186/gb-2007-8-9-r191

    View details for Web of Science ID 000252100800017

    View details for PubMedID 17868458

  • Forward stagewise regression and the monotone lasso ELECTRONIC JOURNAL OF STATISTICS Hastie, T., Taylor, J., Tibshirani, R., Walther, G. 2007; 1: 1-29

    View details for DOI 10.1214/07-EJS004

    View details for Web of Science ID 000207854200001

  • Regularized linear discriminant analysis and its application in microarrays BIOSTATISTICS Guo, Y., Hastie, T., Tibshirani, R. 2007; 8 (1): 86-100


    In this paper, we introduce a modified version of linear discriminant analysis, called the "shrunken centroids regularized discriminant analysis" (SCRDA). This method generalizes the idea of the "nearest shrunken centroids" (NSC) (Tibshirani and others, 2003) into the classical discriminant analysis. The SCRDA method is specially designed for classification problems in high dimension low sample size situations, for example, microarray data. Through both simulated data and real life data, it is shown that this method performs very well in multivariate classification problems, often outperforms the PAM method (using the NSC algorithm) and can be as competitive as the support vector machines classifiers. It is also suitable for feature elimination and can be used as a gene selection method. The open source R package for this method (named "rda") is available on CRAN for download and testing.

    View details for DOI 10.1093/biostatistics/kxj035

    View details for Web of Science ID 000242715400006

    View details for PubMedID 16603682

  • Comparative performance of generalized additive models and multivariate adaptive regression splines for statistical modelling of species distributions 2nd Workshop on Advances in Predictive Species Distribution Models Leathwick, J. R., Elith, J., Hastie, T. ELSEVIER SCIENCE BV. 2006: 188–96
  • An RT-PCR-based multi-gene prognostic signature predicts distant metastasis of node negative, ER positive breast cancer from FFPE sections. 42nd Annual Meeting of the American-Society-of-Clinical-Oncology Lau, K. F., Wang, A., Chew, K., Dai, H., Hastie, T., Brandt, B., Waldman, F., Sninsky, J. AMER SOC CLINICAL ONCOLOGY. 2006: 4S–4S
  • Sparse principal component analysis JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Zou, H., Hastie, T., Tibshirani, R. 2006; 15 (2): 265-286
  • Gene expression programs in response to hypoxia: Cell type specificity and prognostic significance in human cancers PLOS MEDICINE Chi, J. T., Wang, Z., Nuyten, D. S., Rodriguez, E. H., Schaner, M. E., Salim, A., Wang, Y., Kristensen, G. B., Helland, A., Borresen-Dale, A. L., Giaccia, A., Longaker, M. T., Hastie, T., Yang, G. P., Van de Vijver, M. J., Brown, P. O. 2006; 3 (3): 395-409


    Inadequate oxygen (hypoxia) triggers a multifaceted cellular response that has important roles in normal physiology and in many human diseases. A transcription factor, hypoxia-inducible factor (HIF), plays a central role in the hypoxia response; its activity is regulated by the oxygen-dependent degradation of the HIF-1alpha protein. Despite the ubiquity and importance of hypoxia responses, little is known about the variation in the global transcriptional response to hypoxia among different cell types or how this variation might relate to tissue- and cell-specific diseases. We analyzed the temporal changes in global transcript levels in response to hypoxia in primary renal proximal tubule epithelial cells, breast epithelial cells, smooth muscle cells, and endothelial cells with DNA microarrays. The extent of the transcriptional response to hypoxia was greatest in the renal tubule cells. This heightened response was associated with a uniquely high level of HIF-1alpha RNA in renal cells, and it could be diminished by reducing HIF-1alpha expression via RNA interference. A gene-expression signature of the hypoxia response, derived from our studies of cultured mammary and renal tubular epithelial cells, showed coordinated variation in several human cancers, and was a strong predictor of clinical outcomes in breast and ovarian cancers. In an analysis of a large, published gene-expression dataset from breast cancers, we found that the prognostic information in the hypoxia signature was virtually independent of that provided by the previously reported wound signature and more predictive of outcomes than any of the clinical parameters in current use. The transcriptional response to hypoxia varies among human cells. Some of this variation is traceable to variation in expression of the HIF1A gene. A gene-expression signature of the cellular response to hypoxia is associated with a significantly poorer prognosis in breast and ovarian cancer.

    View details for DOI 10.1371/journal.pmed.0030047

    View details for Web of Science ID 000236897500020

    View details for PubMedID 16417408

  • Prediction by supervised principal components JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION Bair, E., Hastie, T., Paul, D., Tibshirani, R. 2006; 101 (473): 119-137
  • Variation in demersal fish species richness in the oceans surrounding New Zealand: an analysis using boosted regression trees MARINE ECOLOGY PROGRESS SERIES Leathwick, J. R., Elith, J., Francis, M. P., Hastie, T., Taylor, P. 2006; 321: 267-281
  • Improving random projections using marginal information 19th Annual Conference on Learning Theory (COLT 2006) Li, P., Hastie, T. J., Church, K. W. SPRINGER-VERLAG BERLIN. 2006: 635–649
  • Representing cyclic human motion using functional analysis 14th Annual Neural Information Processing Systems Conference (NIPS) Ormoneit, D., Black, M. J., Hastie, T., Kjellstrom, H. ELSEVIER SCIENCE BV. 2005: 1264–76
  • Microarray analysis of the transcriptional response to single or multiple doses of ionizing radiation in human subcutaneous fibroblasts RADIOTHERAPY AND ONCOLOGY Rodningen, A. K., Overgaard, J., Alsner, J., Hastie, T., Borresen-Dale, A. L. 2005; 77 (3): 231-240


    Transcriptional profiling of fibroblasts derived from breast cancer patients might improve our understanding of subcutaneous radiation-induced fibrosis. The aim of this study was to get a comprehensive overview of the changes in gene expression in subcutaneous fibroblast cell lines after various ionizing radiation (IR) schemes in order to provide information on potential targets for prevention and to suggest candidate genes for SNP association studies aimed at predicting individual risk of radiation-induced morbidity. Thirty different human fibroblast cell lines were included in the study, and two different radiation schemes were used: single-dose experiments with 3.5 Gy, or fractionated experiments with 3 x 3.5 Gy. Expression analyses were performed on unexposed and exposed cells after different time points. The IR response was analyzed using the statistical method Significance Analysis of Microarrays (SAM). While many of the identified genes were involved in known IR response pathways like cell cycle arrest, proliferation and detoxification, a substantial fraction of the genes were involved in processes not previously associated with IR response. Of particular interest are genes involved in ECM remodelling, Wnt signalling and IGF signalling. Many of the genes were identified after a single dose, but transcriptional changes in genes related to ROS scavenging and ECM remodelling were most profound after a fractionated scheme. We have identified a number of IR response pathways in fibroblasts derived from breast cancer patients. Besides previously identified pathways, we have identified new pathways and genes that could be relevant for prevention and intervention studies of subcutaneous radiation-induced fibrosis as well as being candidates for SNP association studies.

    View details for DOI 10.1016/j.radonc.2005.09.020

    View details for Web of Science ID 000234358900002

    View details for PubMedID 16297999

  • Using multivariate adaptive regression splines to predict the distributions of New Zealand's freshwater diadromous fish FRESHWATER BIOLOGY Leathwick, J. R., Rowe, D., Richardson, J., Elith, J., Hastie, T. 2005; 50 (12): 2034-2052
  • Constrained ordination analysis with flexible response functions ECOLOGICAL MODELLING Zhu, M., Hastie, T. J., Walther, G. 2005; 187 (4): 524-536
  • Combination of two biological gene expression signatures in predicting outcome in breast cancer as an alternative for supervised classification Nuyten, D. S., Chang, H. Y., Chi, J. T., Sneddon, J. B., Bartelink, H., Hastie, T., Brown, P. O., Van de Vijver, M. J. PERGAMON-ELSEVIER SCIENCE LTD. 2005: 71–72
  • Quantitative measurements of alternating finger tapping in Parkinson's disease correlate with UPDRS motor disability and reveal the improvement in fine motor control from medication and deep brain stimulation. Movement disorders Taylor Tavares, A. L., Jefferis, G. S., Koop, M., Hill, B. C., Hastie, T., Heit, G., Bronte-Stewart, H. M. 2005; 20 (10): 1286-1298


    The Unified Parkinson's Disease Rating Scale (UPDRS) is the primary outcome measure in most clinical trials of Parkinson's disease (PD) therapeutics. Each subscore of the motor section (UPDRS III) compresses a wide range of motor performance into a coarse-grained scale from 0 to 4; the assessment of performance can also be subjective. Quantitative digitography (QDG) is an objective, quantitative assessment of digital motor control using a computer-interfaced musical keyboard. In this study, we show that the kinematics of a repetitive alternating finger-tapping (RAFT) task using QDG correlate with the UPDRS motor score, particularly with the bradykinesia subscore, in 33 patients with PD. We show that dopaminergic medication and an average of 9.5 months of bilateral subthalamic nucleus deep brain stimulation (B-STN DBS) significantly improve UPDRS and QDG scores but may have different effects on certain kinematic parameters. This study substantiates the use of QDG to measure motor outcome in trials of PD therapeutics and shows that medication and B-STN DBS both improve fine motor control.

    View details for PubMedID 16001401

  • Quantitative measurements of alternating finger tapping in Parkinson's disease correlate with UPDRS motor disability and reveal the improvement in fine motor control from medication and deep brain stimulation MOVEMENT DISORDERS Tavares, A. L., Jefferis, G. S., Koop, M., Hill, B. C., Hastie, T., Heit, G., Bronte-Stewart, H. M. 2005; 20 (10): 1286-1298

    View details for DOI 10.1002/mds.20556

    View details for Web of Science ID 000232749300005

  • Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Chang, H. Y., Nuyten, D. S., Sneddon, J. B., Hastie, T., Tibshirani, R., Sorlie, T., Dai, H. Y., He, Y. D., Van't Veer, L. J., Bartelink, H., van de Rijn, M., Brown, P. O., van de Vijver, M. J. 2005; 102 (10): 3738-3743


    Based on the hypothesis that features of the molecular program of normal wound healing might play an important role in cancer metastasis, we previously identified consistent features in the transcriptional response of normal fibroblasts to serum, and used this "wound-response signature" to reveal links between wound healing and cancer progression in a variety of common epithelial tumors. Here, in a consecutive series of 295 early breast cancer patients, we show that both overall survival and distant metastasis-free survival are markedly diminished in patients whose tumors expressed this wound-response signature compared to tumors that did not express this signature. A gene expression centroid of the wound-response signature provides a basis for prospectively assigning a prognostic score that can be scaled to suit different clinical purposes. The wound-response signature improves risk stratification independently of known clinico-pathologic risk factors and previously established prognostic signatures based on unsupervised hierarchical clustering ("molecular subtypes") or supervised predictors of metastasis ("70-gene prognosis signature").

    View details for DOI 10.1073/pnas.0409462102

    View details for Web of Science ID 000227533100040

    View details for PubMedID 15701700

  • Kernel logistic regression and the import vector machine JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Zhu, J., Hastie, T. 2005; 14 (1): 185-205
  • Patient-derived health state utilities for gastroesophageal reflux disease AMERICAN JOURNAL OF GASTROENTEROLOGY Gerson, L. B., Ullah, N., Hastie, T., Triadafilopoulos, G., Goldstein, M. 2005; 100 (3): 524-533


    Gastroesophageal reflux disease is a chronic disease that adversely affects health-related quality of life. The purpose of this study was to derive health state utilities for patients with chronic heartburn symptoms. We used a custom-designed computer program in order to elicit utilities with the time-tradeoff and standard-gamble techniques. Patients with chronic (more than 6 months) symptoms of gastroesophageal reflux disease entered the study. Two interviews were performed in random sequence either initially on medications for heartburn that adequately controlled symptoms, or off of medications for 1 wk while the patient was symptomatic. We also collected data using visual-analog scales, quality of life in reflux and dyspepsia (QOLRAD), and Gastrointestinal Symptom Rating Scale (GSRS) scores. We invited 222 patients to participate; 158 (71%) patients (129 men, 29 women) completed the study. Barrett's esophagus was present in 40 (25%), erosive disease in 17 (11%), and 118 (74%) had comorbid conditions. The mean (+/-SD) utility ratings were 0.94 +/- 0.09 on medical therapy and 0.90 +/- 0.12 off medications for patients with reflux alone using time tradeoff (p = 0.004), and 0.94 +/- 8.0 both on and off of antireflux medications with standard-gamble assessment (p = 0.96). Mean time-tradeoff scores were also significantly lower off of medications for patients with other comorbid conditions (p = 0.002). There was no significant difference between mean utility scores for patients with or without Barrett's esophagus or erosive disease. Gastroesophageal reflux disease adversely affects health-related quality of life. Time-tradeoff utility for patients with reflux disease is substantially higher when patients are on medication than off medications.

    View details for DOI 10.1111/j.1572-0241.40588.x

    View details for Web of Science ID 000227697900005

    View details for PubMedID 15743346

  • Sample classification from protein mass spectrometry, by 'peak probability contrasts' BIOINFORMATICS Tibshirani, R., Hastie, T., Narasimhan, B., Soltys, S., Shi, G. Y., Koong, A., Le, Q. T. 2004; 20 (17): 3034-3044


    Early cancer detection has always been a major research focus in solid tumor oncology. Early tumor detection can theoretically result in lower stage tumors, more treatable diseases and ultimately higher cure rates with less treatment-related morbidities. Protein mass spectrometry is a potentially powerful tool for early cancer detection. We propose a novel method for sample classification from protein mass spectrometry data. When applied to spectra from both diseased and healthy patients, the 'peak probability contrast' technique provides a list of all common peaks among the spectra, their statistical significance and their relative importance in discriminating between the two groups. We illustrate the method on matrix-assisted laser desorption and ionization mass spectrometry data from a study of ovarian cancers. Compared to other statistical approaches for class prediction, the peak probability contrast method performs as well as or better than several methods that require the full spectra, rather than just labelled peaks. It is also much more interpretable biologically. The peak probability contrast method is a potentially useful tool for sample classification from protein mass spectrometry data.

    View details for DOI 10.1093/bioinformatics/bth357

    View details for Web of Science ID 000225361400017

    View details for PubMedID 15226172

  • Efficient quadratic regularization for expression arrays BIOSTATISTICS Hastie, T., Tibshirani, R. 2004; 5 (3): 329-340


    Gene expression arrays typically have 50 to 100 samples and 1000 to 20,000 variables (genes). There have been many attempts to adapt statistical models for regression and classification to these data, and in many cases these attempts have challenged the computational resources. In this article we expose a class of techniques based on quadratic regularization of linear models, including regularized (ridge) regression, logistic and multinomial regression, linear and mixture discriminant analysis, the Cox model and neural networks. For all of these models, we show that dramatic computational savings are possible over naive implementations, using standard transformations in numerical linear algebra.

    View details for DOI 10.1093/biostatistics/kxh010

    View details for Web of Science ID 000222723600001

    View details for PubMedID 15208198
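
    The abstract above does not say which linear-algebra transformations are used. As a hedged illustration of the kind of saving available when variables far outnumber samples, the sketch below solves ridge regression through an n x n system using the standard identity (X^T X + lam I)^{-1} X^T y = X^T (X X^T + lam I)^{-1} y. The function names (solve_linear, ridge_dual, normal_equation_residual) are hypothetical, and this is not presented as the paper's own algorithm.

```python
# Sketch only: ridge regression solved through an n x n system instead of a
# p x p one. This illustrates one standard identity, assumed for illustration;
# it is not taken from the paper itself.


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Build an augmented copy so the inputs are not modified.
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Pick the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_dual(X, y, lam):
    """Ridge coefficients via beta = X^T (X X^T + lam I)^{-1} y (no intercept)."""
    n, p = len(X), len(X[0])
    # K is only n x n, however many variables (columns) X has.
    K = [[sum(a * b for a, b in zip(X[i], X[j])) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve_linear(K, y)
    return [sum(X[i][j] * alpha[i] for i in range(n)) for j in range(p)]


def normal_equation_residual(X, y, lam, beta):
    """Largest absolute entry of (X^T X + lam I) beta - X^T y."""
    n, p = len(X), len(X[0])
    fitted = [sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)]
    return max(
        abs(sum(X[i][j] * (fitted[i] - y[i]) for i in range(n)) + lam * beta[j])
        for j in range(p)
    )
```

    With 50 to 100 samples and thousands of genes, the system solved here has the size of the sample count rather than the gene count; checking the result against the usual ridge normal equations, as normal_equation_residual does, confirms that both forms give the same coefficients.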

  • Classification of gene microarrays by penalized logistic regression BIOSTATISTICS Zhu, J., Hastie, T. 2004; 5 (3): 427-443


    Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. However, one weakness of the SVM is that given a tumor sample, it only predicts a cancer class label but does not provide any estimate of the underlying probability. We propose penalized logistic regression (PLR) as an alternative to the SVM for the microarray cancer diagnosis problem. We show that when using the same set of genes, PLR and the SVM perform similarly in cancer classification, but PLR has the advantage of additionally providing an estimate of the underlying probability. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. We consider two gene selection methods in this paper, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that PLR combined with RFE tends to select fewer genes than other methods and also performs well in both cross-validation and test samples. A fast algorithm for solving PLR is also described.

    View details for DOI 10.1093/biostatistics/kxg046

    View details for Web of Science ID 000222723600007

    View details for PubMedID 15208204

  • Microelectrode recording revealing a somatotopic body map in the subthalamic nucleus in humans with Parkinson disease JOURNAL OF NEUROSURGERY Romanelli, P., Heit, G., Hill, B. C., Kraus, A., Hastie, T., Bronte-Stewart, H. M. 2004; 100 (4): 611-618


    The subthalamic nucleus (STN) is a key structure for motor control through the basal ganglia. The aim of this study was to show that the STN in patients with Parkinson disease (PD) has a somatotopic organization similar to that in nonhuman primates. A functional map of the STN was obtained using electrophysiological microrecording during placement of deep brain stimulation (DBS) electrodes in patients with PD. Magnetic resonance imaging was combined with ventriculography and intraoperative x-ray film to assess the position of the electrodes and the STN units, which were activated by limb movements to map the sensorimotor region of the STN. Each activated cell was located relative to the anterior commissure-posterior commissure line. Three-dimensional coordinates of the cells were analyzed statistically to determine whether those cells activated by movements of the arm and leg were segregated spatially. Three hundred seventy-nine microelectrode tracks were created during placement of 71 DBS electrodes in 44 consecutive patients. Somatosensory driving was found in 288 tracks. The authors identified and localized 1213 movement-related cells and recorded responses from 29 orofacial cells, 480 arm-related cells, 558 leg-related cells, and 146 cells responsive to both arm and leg movements. Leg-related cells were localized in medial (p < 0.0001) and ventral (p < 0.0004) positions and tended to be situated anteriorly (p = 0.063) relative to arm-related cells. Evidence of somatotopic organization in the STN in patients with PD supports the current theory of highly segregated loops integrating cortex-basal ganglia connections. These loops are preserved in chronic degenerative diseases such as PD, but may subserve a distorted body map. This finding also supports the relevance of microelectrode mapping in the optimal placement of DBS electrodes along the subthalamic homunculus.

    View details for Web of Science ID 000220440900009

    View details for PubMedID 15070113

  • Mitral annular size predicts Alfieri stitch tension in mitral edge-to-edge repair JOURNAL OF HEART VALVE DISEASE Timek, T. A., Nielsen, S. L., Lai, D. T., Tibayan, F., Liang, D., Daughters, G. T., Beineke, P., Hastie, T., Ingels, N. B., Miller, D. C. 2004; 13 (2): 165-173


    Whilst increased 'Alfieri stitch' tension may reduce the durability of 'edge-to-edge' mitral repair, the factors affecting suture tension are unknown. In order to study hemodynamics and left ventricular (LV) and annular dynamics that determine suture tension, the central edge of the mitral leaflets was approximated with a miniature force transducer to measure leaflet tension (T) at the leaflet approximation point. Eight sheep were studied under open-chest conditions immediately after surgical placement of a force transducer and implantation of radiopaque markers on the left ventricle and mitral annulus (MA). Hemodynamic variables were altered by two caval occlusion steps (deltaV1 and deltaV2) and dobutamine infusion. Three-dimensional marker coordinates were obtained by simultaneous biplane videofluoroscopy to measure LV volume, MA area (MAA) and septal-lateral (SL) annular dimension throughout the cardiac cycle. At baseline, peak Alfieri stitch tension (0.30 +/- 0.18 N) was observed 96 +/- 61 ms prior to end-diastole coincident with peak annular SL diameter (98 +/- 58 ms before end-diastole). Dobutamine infusion decreased suture tension (from 0.30 +/- 0.18 N to 0.20 +/- 0.12 N, p = 0.01), although peak systolic pressure increased significantly (138 +/- 19 versus 115 +/- 14 mmHg; p = 0.03). A regression model was fitted with the goal of interpreting the hemodynamic and geometric predictors of tension as their influence varied with time: T_t (N) = 0.1916 + 0.2115 x SL (cm) - 0.1996 x MAA/SL (cm^2/cm) + f_t x LVP (mmHg), where T_t is tension at any time during the cardiac cycle and f_t is the time-varying coefficient of LVP. Tension on the leaflets in the edge-to-edge repair is determined primarily by MA SL size, and paradoxically is lower when the contractile state is enhanced. This indicates that annular and/or LV dilatation increase stitch tension and may adversely affect durability of the repair if concomitant ring annuloplasty is not performed.

    View details for Web of Science ID 000220417200003

    View details for PubMedID 15086253
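
    The regression equation quoted in the abstract above can be evaluated directly. The sketch below is a minimal, hedged illustration: the function name stitch_tension and its argument names are hypothetical, and because the abstract does not report values of the time-varying LVP coefficient, the caller must supply one.

```python
def stitch_tension(sl_cm, maa_cm2, lvp_mmhg, f_t):
    """Evaluate the fitted tension model from the abstract, in newtons.

    sl_cm    -- septal-lateral annular dimension (cm)
    maa_cm2  -- mitral annular area (cm^2)
    lvp_mmhg -- left ventricular pressure (mmHg)
    f_t      -- time-varying LVP coefficient; its values are not given in
                the abstract, so any number passed here is an assumption
    """
    return 0.1916 + 0.2115 * sl_cm - 0.1996 * (maa_cm2 / sl_cm) + f_t * lvp_mmhg


# Illustrative inputs only; these are not measurements from the study.
example = stitch_tension(sl_cm=2.0, maa_cm2=4.0, lvp_mmhg=100.0, f_t=0.0)
```

    Holding the other terms fixed, each additional centimetre of SL dimension adds 0.2115 N to the predicted tension through the SL term, while the negative MAA/SL coefficient offsets part of that increase.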

  • 1-norm support vector machines 17th Annual Conference on Neural Information Processing Systems (NIPS) Zhu, J., Rosset, S., Hastie, T., Tibshirani, R. M I T PRESS. 2004: 49–56
  • Margin maximizing loss functions 17th Annual Conference on Neural Information Processing Systems (NIPS) Rosset, S., Zhu, J., Hastie, T. M I T PRESS. 2004: 1237–1244
  • Gene expression patterns in ovarian carcinomas MOLECULAR BIOLOGY OF THE CELL Schaner, M. E., Ross, D. T., Ciaravino, G., Sorlie, T., Troyanskaya, O., Diehn, M., Wang, Y. C., Duran, G. E., Sikic, T. L., Caldeira, S., Skomedal, H., Tu, I. P., Hernandez-Boussard, T., Johnson, S. W., O'Dwyer, P. J., Fero, M. J., Kristensen, G. B., Borresen-Dale, A. L., Hastie, T., Tibshirani, R., van de Rijn, M., Teng, N. N., Longacre, T. A., Botstein, D., Brown, P. O., Sikic, B. I. 2003; 14 (11): 4376-4386


    We used DNA microarrays to characterize the global gene expression patterns in surface epithelial cancers of the ovary. We identified groups of genes that distinguished the clear cell subtype from other ovarian carcinomas, grade I and II from grade III serous papillary carcinomas, and ovarian from breast carcinomas. Six clear cell carcinomas were distinguished from 36 other ovarian carcinomas (predominantly serous papillary) based on their gene expression patterns. The differences may yield insights into the worse prognosis and therapeutic resistance associated with clear cell carcinomas. A comparison of the gene expression patterns in the ovarian cancers to published data of gene expression in breast cancers revealed a large number of differentially expressed genes. We identified a group of 62 genes that correctly classified all 125 breast and ovarian cancer specimens. Among the best discriminators more highly expressed in the ovarian carcinomas were PAX8 (paired box gene 8), mesothelin, and ephrin-B1 (EFNB1). Although estrogen receptor was expressed in both the ovarian and breast cancers, genes that are coregulated with the estrogen receptor in breast cancers, including GATA-3, LIV-1, and X-box binding protein 1, did not show a similar pattern of coexpression in the ovarian cancers.

    View details for Web of Science ID 000186738300005

    View details for PubMedID 12960427

  • Repeated observation of breast tumor subtypes in independent gene expression data sets PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Sorlie, T., Tibshirani, R., Parker, J., Hastie, T., Marron, J. S., Nobel, A., Deng, S., Johnsen, H., Pesich, R., Geisler, S., Demeter, J., Perou, C. M., Lonning, P. E., Brown, P. O., Borresen-Dale, A. L., Botstein, D. 2003; 100 (14): 8418-8423


    Characteristic patterns of gene expression measured by DNA microarrays have been used to classify tumors into clinically relevant subgroups. In this study, we have refined the previously defined subtypes of breast tumors that could be distinguished by their distinct patterns of gene expression. A total of 115 malignant breast tumors were analyzed by hierarchical clustering based on patterns of expression of 534 "intrinsic" genes and shown to subdivide into one basal-like, one ERBB2-overexpressing, two luminal-like, and one normal breast tissue-like subgroup. The genes used for classification were selected based on their similar expression levels between pairs of consecutive samples taken from the same tumor separated by 15 weeks of neoadjuvant treatment. Similar cluster analyses of two published, independent data sets representing different patient cohorts from different laboratories uncovered some of the same breast cancer subtypes. In the one data set that included information on time to development of distant metastasis, subtypes were associated with significant differences in this clinical feature. By including a group of tumors from BRCA1 carriers in the analysis, we found that this genotype predisposes to the basal tumor subtype. Our results strongly support the idea that many of these breast tumor subtypes represent biologically distinct disease entities.

    View details for DOI 10.1073/pnas.0932692100

    View details for Web of Science ID 000184222500069

    View details for PubMedID 12829800

  • Note on "Comparison of model selection for regression" by Vladimir Cherkassky and Yunqian Ma NEURAL COMPUTATION Hastie, T., Tibshirani, R., Friedman, J. 2003; 15 (7): 1477-1480


    While Cherkassky and Ma (2003) raise some interesting issues in comparing techniques for model selection, their article appears to be written largely in protest of comparisons made in our book, Elements of Statistical Learning (2001). Cherkassky and Ma feel that we falsely represented the structural risk minimization (SRM) method, which they defend strongly here. In a two-page section of our book (pp. 212-213), we made an honest attempt to compare the SRM method with two related techniques, Akaike information criterion (AIC) and Bayesian information criterion (BIC). Apparently, we did not apply SRM in the optimal way. We are also accused of using contrived examples, designed to make SRM look bad. Alas, we did introduce some careless errors in our original simulation--errors that were corrected in the second and subsequent printings. Some of these errors were pointed out to us by Cherkassky and Ma (we supplied them with our source code), and as a result we replaced the assessment "SRM performs poorly overall" with a more moderate "the performance of SRM is mixed" (p. 212).

    View details for Web of Science ID 000183421400002

    View details for PubMedID 12816562

  • Post-transplantation lymphoproliferative disease in heart and heart-lung transplant recipients: 30-year experience at Stanford University 21st Annual Meeting of the International-Society-for-Heart-and-Lung-Transplantation Gao, S. Z., Chaparro, S. V., Perlroth, M., Montoya, J. G., Miller, J. L., DiMiceli, S., Hastie, T., Oyer, P. E., Schroeder, J. ELSEVIER SCIENCE INC. 2003: 505–14


    Post-transplantation lymphoproliferative disease (PTLD) is an important source of morbidity and mortality in transplant recipients, with a reported incidence of 0.8% to 20%. Risk factors are thought to include immunosuppressive agents and viral infection. This study attempts to evaluate the impact of different immunosuppressive regimens, ganciclovir prophylaxis and other potential risk factors in the development of PTLD. We reviewed the records of 1026 (874 heart, 152 heart-lung) patients who underwent transplantation at Stanford between 1968 and 1997. Of these, 57 heart and 8 heart-lung recipients developed PTLD. During this interval, 4 different immunosuppressive regimens were utilized sequentially. In January 1987, ganciclovir prophylaxis for cytomegalovirus serologic-positive patients was introduced. Other potential risk factors evaluated included age, gender, prior cardiac diagnoses, HLA match, rejection frequency and calcium-channel blockade. No correlation of development of PTLD was found with different immunosuppression regimens consisting of azathioprine, prednisone, cyclosporine, OKT3 induction, tacrolimus and mycophenolate mofetil. A trend suggesting an influence of ganciclovir on the prevention of PTLD was not statistically significant (p = 0.12). Recipient age and rejection frequency, as well as high-dose cyclosporine immunosuppression, were significantly (p < 0.02) associated with PTLD development. The prevalence of PTLD at 13.3 years was 15%. The overall incidence of PTLD was 6.3%. It was not altered by sequential modifications in treatment regimens. Younger recipient age and higher rejection frequency were associated with increased PTLD occurrence. The 15% prevalence of PTLD in 58 long-term survivors was unexpectedly high.

    View details for DOI 10.1016/S1053-2498(02)01229-9

    View details for Web of Science ID 000182805100002

    View details for PubMedID 12742411

  • Ischemia in three left ventricular regions: Insights into the pathogenesis of acute ischemic mitral regurgitation 82nd Annual Meeting of the American-Association-for-Thoracic-Surgery Timek, T. A., Lai, D. T., Tibayan, F., Liang, D., Daughters, G. T., Dagum, P., Zasio, M. K., Lo, S., Hastie, T., Ingels, N. B., Miller, D. C. MOSBY-ELSEVIER. 2003: 559–69


    Acute posterolateral left ventricular ischemia in sheep results in ischemic mitral regurgitation, but the effects of ischemia in other left ventricular regions on ischemic mitral regurgitation are unknown. Six adult sheep had radiopaque markers placed on the left ventricle, mitral annulus, and anterior and posterior mitral leaflets at the valve center and near the anterior and posterior commissures. After 6 to 8 days, animals were studied with biplane videofluoroscopy and transesophageal echocardiography before and during sequential balloon occlusion of the left anterior descending, distal left circumflex, and proximal left circumflex coronary arteries. Time of valve closure was defined as the time when the distance between leaflet edge markers reached its minimum plateau, and systolic leaflet edge separation distance was calculated on the basis of left ventricular ejection. Only proximal left circumflex coronary artery occlusion resulted in ischemic mitral regurgitation, which was central and holosystolic. Delayed valve closure (anterior commissure, 58 +/- 29 vs 92 +/- 24 ms; valve center, 52 +/- 26 vs 92 +/- 23 ms; posterior commissure, 60 +/- 30 vs 94 +/- 14 ms; all P < .05) and increased leaflet edge separation distance during ejection (mean increase, 2.2 +/- 1.5 mm, 2.1 +/- 1.9 mm, and 2.1 +/- 1.5 mm at the anterior commissure, valve center, and posterior commissure, respectively; P < .05 for all) were seen during proximal left circumflex coronary artery occlusion but not during left anterior descending or distal left circumflex coronary artery occlusion. Ischemic mitral regurgitation was associated with a 19% +/- 10% increase in mitral annular area, and displacement of both papillary muscle tips away from the septal annulus at end systole. Acute ischemic mitral regurgitation in sheep occurred only after proximal left circumflex coronary artery occlusion along with delayed valve closure in early systole and increased leaflet edge separation throughout ejection in all 3 leaflet coaptation sites. The degree of left ventricular systolic dysfunction induced did not correlate with ischemic mitral regurgitation, but both altered valvular and subvalvular 3-dimensional geometry were necessary to produce ischemic mitral regurgitation during acute left ventricular ischemia.

    View details for DOI 10.1067/mtc.2003.43

    View details for Web of Science ID 000181949800019

    View details for PubMedID 12658198

  • Feature extraction for nonparametric discriminant analysis JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Zhu, M., Hastie, T. J. 2003; 12 (1): 101-120
  • Class prediction by nearest shrunken centroids, with applications to DNA microarrays STATISTICAL SCIENCE Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G. 2003; 18 (1): 104-117
  • Boosting and support vector machines as optimal separators Conference on Document Recognition and Retrieval X Rosset, S., Zhu, J., Hastie, T. SPIE-INT SOC OPTICAL ENGINEERING. 2003: 1–7
  • Generalized linear and generalized additive models in studies of species distributions: setting the scene ECOLOGICAL MODELLING Guisan, A., Edwards, T. C., Hastie, T. 2002; 157 (2-3): 89-100
  • Risk factors for progressive cartilage loss in the knee ARTHRITIS AND RHEUMATISM Biswal, S., Hastie, T., Andriacchi, T. P., Bergman, G. A., Dillingham, M. F., Lang, P. 2002; 46 (11): 2884-2892

    View details for DOI 10.1002/art.10573

    View details for Web of Science ID 000179239500008

  • Risk factors for progressive cartilage loss in the knee: a longitudinal magnetic resonance imaging study in forty-three patients. Arthritis and rheumatism Biswal, S., Hastie, T., Andriacchi, T. P., Bergman, G. A., Dillingham, M. F., Lang, P. 2002; 46 (11): 2884-2892


    To evaluate the rate of progression of cartilage loss in the knee joint using magnetic resonance imaging (MRI) and to evaluate potential risk factors for more rapid cartilage loss. We evaluated baseline and followup MRIs of the knees in 43 patients (minimum time interval of 1 year, mean 1.8 years, range 52-285 weeks). Cartilage loss was graded in the anterior, central, and posterior regions of the medial and lateral knee compartments. Knee joints were also evaluated for other pathology. Data were analyzed using analysis of variance models. Patients who had sustained meniscal tears showed a higher average rate of progression of cartilage loss (22%) than that seen in those who had intact menisci (14.9%) (P

    View details for PubMedID 12428228

  • Cortisol and behavior in fragile X syndrome PSYCHONEUROENDOCRINOLOGY Hessl, D., Glaser, B., Dyer-Friedman, J., Blasey, C., Hastie, T., Gunnar, M., Reiss, A. L. 2002; 27 (7): 855-872


    The purpose of this study was to determine if children with fragile X syndrome, who typically demonstrate a neurobehavioral phenotype that includes social anxiety, withdrawal, and hyper-arousal, have increased levels of cortisol, a hormone associated with stress. The relevance of adrenocortical activity to the fragile X phenotype also was examined. One hundred and nine children with the fragile X full mutation (70 males and 39 females) and their unaffected siblings (51 males and 58 females) completed an in-home evaluation including a cognitive assessment and a structured social challenge task. Multiple samples of salivary cortisol were collected throughout the evaluation day and on two typical non-school days. Measures of the fragile X mental retardation (FMR1) gene, child intelligence, the quality of the home environment, parental psychopathology, and the effectiveness of educational and therapeutic services also were collected. Linear mixed-effects analyses were used to examine differences in cortisol associated with the fragile X diagnosis and gender (fixed effects) and to estimate individual subject and familial variation (random effects) in cortisol hormone levels. Hierarchical multiple regression analyses were conducted to determine whether adrenocortical activity is associated with behavior problems after controlling for significant genetic and environmental factors. Results showed that children with fragile X, especially males, had higher levels of salivary cortisol on typical days and during the evaluation. Highly significant family effects on salivary cortisol were detected, consistent with previous work documenting genetic and environmental influences on adrenocortical activity. Increased cortisol was significantly associated with behavior problems in boys and girls with fragile X but not in their unaffected siblings. These results provide evidence that the function of the hypothalamic-pituitary-adrenal axis may have an independent association with behavioral problems in children with fragile X syndrome.

    View details for Web of Science ID 000178462800008

    View details for PubMedID 12183220

  • Degrees-of-freedom tests for smoothing splines BIOMETRIKA Cantoni, E., Hastie, T. 2002; 89 (2): 251-263
  • Diagnosis of multiple cancer types by shrunken centroids of gene expression PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Tibshirani, R., Hastie, T., Narasimhan, B., Chu, G. 2002; 99 (10): 6567-6572


    We have devised an approach to cancer class prediction from gene expression profiling, based on an enhancement of the simple nearest prototype (centroid) classifier. We shrink the prototypes and hence obtain a classifier that is often more accurate than competing methods. Our method of "nearest shrunken centroids" identifies subsets of genes that best characterize each class. The technique is general and can be used in many other classification problems. To demonstrate its effectiveness, we show that the method was highly efficient in finding genes for classifying small round blue cell tumors and leukemias.

    View details for Web of Science ID 000175637300012

    View details for PubMedID 12011421

  • Exploratory screening of genes and clusters from microarray experiments STATISTICA SINICA Tibshirani, R., Hastie, T., Narasimhan, B., Eisen, M., Sherlock, G., Brown, P., Botstein, D. 2002; 12 (1): 47-59
  • Kernel logistic regression and the import vector machine 15th Annual Conference on Neural Information Processing Systems (NIPS) Zhu, J., Hastie, T. M I T PRESS. 2002: 1081–1088
  • Optimization and evaluation of T7 based RNA linear amplification protocols for cDNA microarray analysis BMC GENOMICS Zhao, H. J., Hastie, T., Whitfield, M. L., Borresen-Dale, A. L., Jeffrey, S. S. 2002; 3


    T7 based linear amplification of RNA is used to obtain sufficient antisense RNA for microarray expression profiling. We optimized and systematically evaluated the fidelity and reproducibility of different amplification protocols using total RNA obtained from primary human breast carcinomas and high-density cDNA microarrays. Using an optimized protocol, the average correlation coefficient of gene expression of 11,123 cDNA clones between amplified and unamplified samples is 0.82 (0.85 when a virtual array was created using repeatedly amplified samples to minimize experimental variation). Less than 4% of genes show changes in expression level by 2-fold or greater after amplification compared to unamplified samples. Most changes due to amplification are not systematic both within one tumor sample and between different tumors. Amplification appears to dampen the variation of gene expression for some genes when compared to unamplified poly(A)+ RNA. The reproducibility between repeatedly amplified samples is 0.97 when performed on the same day, but drops to 0.90 when performed weeks apart. The fidelity and reproducibility of amplification is not affected by decreasing the amount of input total RNA in the 0.3-3 micrograms range. Adding template-switching primer, DNA ligase, or column purification of double-stranded cDNA does not improve the fidelity of amplification. The correlation coefficient between amplified and unamplified samples is higher when total RNA is used as template for both experimental and reference RNA amplification. T7 based linear amplification reproducibly generates amplified RNA that closely approximates original sample for gene expression profiling using cDNA microarrays.

    View details for Web of Science ID 000181477100031

    View details for PubMedID 12445333

  • Supervised learning from microarray data 15th Biannual Conference on Computational Statistics (COMPSTAT) Hastie, T., Tibshirani, R., Narasimhan, B., Chu, G. PHYSICA-VERLAG GMBH & CO. 2002: 67–77
  • Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Sorlie, T., Perou, C. M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., Thorsen, T., Quist, H., Matese, J. C., Brown, P. O., Botstein, D., Lonning, P. E., Borresen-Dale, A. L. 2001; 98 (19): 10869-10874


    The purpose of this study was to classify breast carcinomas based on variations in gene expression patterns derived from cDNA microarrays and to correlate tumor characteristics to clinical outcome. A total of 85 cDNA microarray experiments representing 78 cancers, three fibroadenomas, and four normal breast tissues were analyzed by hierarchical clustering. As reported previously, the cancers could be classified into a basal epithelial-like group, an ERBB2-overexpressing group and a normal breast-like group based on variations in gene expression. A novel finding was that the previously characterized luminal epithelial/estrogen receptor-positive group could be divided into at least two subgroups, each with a distinctive expression profile. These subtypes proved to be reasonably robust by clustering using two different gene sets: first, a set of 456 cDNA clones previously selected to reflect intrinsic properties of the tumors and, second, a gene set that highly correlated with patient outcome. Survival analyses on a subcohort of patients with locally advanced breast cancer uniformly treated in a prospective study showed significantly different outcomes for the patients belonging to the various groups, including a poor prognosis for the basal-like subtype and a significant difference in outcome for the two estrogen receptor-positive groups.

    View details for Web of Science ID 000170966800067

    View details for PubMedID 11553815

  • Brain anatomy, gender and IQ in children and adolescents with fragile X syndrome BRAIN Eliez, S., Blasey, C. M., Freund, L. S., Hastie, T., Reiss, A. L. 2001; 124: 1610-1618


    This study utilized MRI data to describe neuroanatomical morphology in children and adolescents with fragile X syndrome, the most common inherited cause of developmental disability. The syndrome provides a model for understanding how specific genetic factors can influence both neuroanatomy and cognitive capacity. Thirty-seven children and adolescents with fragile X syndrome received an MRI scan and cognitive testing. Scanning procedures and analytical strategies were identical to those reported in an earlier study of 85 typically developing children, permitting a comparison with a previously published template of normal brain development. Regression analyses indicated that there was a normative age-related decrease in grey matter and an increase in white matter. However, caudate and ventricular CSF volumes were significantly enlarged, and caudate volumes decreased with age. Rates of reduction of cortical grey matter were different for males and females. IQ scores were not significantly correlated with volumes of cortical and subcortical grey matter, and these relationships were statistically different from the correlational patterns observed in typically developing children. Children with fragile X syndrome exhibited several typical neurodevelopmental patterns. Aberrations in volumes of subcortical nuclei, gender differences in rates of cortical grey matter reduction and an absence of correlation between grey matter and cognitive performance provided indices of the deleterious effects of the fragile X mutation on the brain's structural organization.

    View details for Web of Science ID 000170453400013

    View details for PubMedID 11459752

  • Missing value estimation methods for DNA microarrays BIOINFORMATICS Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R. B. 2001; 17 (6): 520-525


    Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1-20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.

    View details for Web of Science ID 000169404700005

    View details for PubMedID 11395428

  • Posttransplantation lymphoproliferative disease in heart and heart-lung transplant recipients: thirty years experience at our hospital. journal of heart and lung transplantation Chaparro, S., Gao, S., Perlroth, M., Montoya, J., Hastie, T., Miller, J. L., Oyer, P. E., Schroeder, J. 2001; 20 (2): 258-?

    View details for PubMedID 11250519

  • Supervised harvesting of expression trees GENOME BIOLOGY Hastie, T., Tibshirani, R., Botstein, D., Brown, P. 2001; 2 (1)


    We propose a new method for supervised learning from gene expression data. We call it 'tree harvesting'. This technique starts with a hierarchical clustering of genes, then models the outcome variable as a sum of the average expression profiles of chosen clusters and their products. It can be applied to many different kinds of outcome measures such as censored survival times, or a response falling in two or more classes (for example, cancer classes). The method can discover genes that have strong effects on their own, and genes that interact with other genes. We illustrate the method on data from a lymphoma study, and on a dataset containing samples from eight different cancers. It identified some potentially interesting gene clusters. In simulation studies we found that the procedure may require a large number of experimental samples to successfully discover interactions. Tree harvesting is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worthy of further investigation.

    View details for Web of Science ID 000207583500011

    View details for PubMedID 11178280

  • Learning and tracking cyclic human motion 14th Annual Neural Information Processing Systems Conference (NIPS) Ormoneit, D., Sidenbladh, H., Black, M. J., Hastie, T. M I T PRESS. 2001: 894–900
  • Functional linear discriminant analysis for irregularly sampled curves JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY James, G. M., Hastie, T. J. 2001; 63: 533-550
  • The Elements of Statistical Learning: Prediction, Inference and Data Mining Hastie, T., Tibshirani, R., Friedman, J. Springer Verlag. 2001
  • Estimating the number of clusters in a data set via the gap statistic JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Tibshirani, R., Walther, G., Hastie, T. 2001; 63: 411-423
  • Principal component models for sparse functional data BIOMETRIKA James, G. M., Hastie, T. J., Sugar, C. A. 2000; 87 (3): 587-602
  • Bayesian backfitting STATISTICAL SCIENCE Hastie, T., Tibshirani, R. 2000; 15 (3): 196-213
  • Prediction of risk for patients with unstable angina. Evidence report/technology assessment (Summary) Heidenreich, P. A., Go, A., Melsop, K. A., Alloggiamento, T., McDonald, K. M., Hagan, V., Hastie, T., Hlatky, M. A. 2000: 1-3

    View details for PubMedID 11013605

  • 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns. Genome biology Hastie, T., Tibshirani, R., Eisen, M. B., Alizadeh, A., Levy, R., Staudt, L., Chan, W. C., Botstein, D., Brown, P. 2000; 1 (2): RESEARCH0003-?


    Large gene expression studies, such as those conducted using DNA arrays, often provide millions of different pieces of data. To address the problem of analyzing such data, we describe a statistical method, which we have called 'gene shaving'. The method identifies subsets of genes with coherent expression patterns and large variation across conditions. Gene shaving differs from hierarchical clustering and other widely used methods for analyzing gene expression studies in that genes may belong to more than one cluster, and the clustering may be supervised by an outcome measure. The technique can be 'unsupervised', that is, the genes and samples are treated as unlabeled, or partially or fully supervised by using known properties of the genes or samples to assist in finding meaningful groupings. We illustrate the use of the gene shaving method to analyze gene expression measurements made on samples from patients with diffuse large B-cell lymphoma. The method identifies a small cluster of genes whose expression is highly predictive of survival. The gene shaving method is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worth further investigation.

    View details for PubMedID 11178228

  • Optimal kernel shapes for local linear regression 13th Annual Conference on Neural Information Processing Systems (NIPS) Ormoneit, D., Hastie, T. M I T PRESS. 2000: 540–546
  • Bone mineral acquisition in healthy Asian, Hispanic, black, and Caucasian youth: A longitudinal study JOURNAL OF CLINICAL ENDOCRINOLOGY & METABOLISM Bachrach, L. K., Hastie, T., Wang, M. C., Narasimhan, B., Marcus, R. 1999; 84 (12): 4702-4712


    Ethnic and gender differences in bone mineral acquisition were examined in a longitudinal study of 423 healthy Asian, black, Hispanic, and white males and females (aged 9-25 yr). Bone mass of the spine, femoral neck, total hip, and whole body was measured annually for up to 4 yr by dual energy x-ray absorptiometry. Age-adjusted mean bone mineral curves for areal (BMD) and volumetric (BMAD) bone mineral density were compared for the 4 ethnic groups. Consistent differences in areal and volumetric bone density were observed only between black and nonblack subjects. Among females, blacks had greater mean levels of BMD and BMAD at all skeletal sites. Differences among Asians, Hispanics, and white females were significant for femoral neck BMD, whole body BMD, and whole body bone mineral content/height ratio, for which Asians had significantly lower values; femoral neck BMAD in Asian and white females was lower than that in Hispanics. Like the females, black males had consistently greater mean values than nonblacks for all BMD and BMAD measurements. A few differences were also observed among nonblack male subjects. Whites had greater mean total hip BMD, whole body BMD, and whole body bone mineral content/height ratio than Asian and Hispanic males; Hispanics had lower spine BMD than white and Asian males. The tempo of gains in BMD varied by gender and skeletal site. In females, total hip, spine, and whole body BMD reached a plateau at 14.1, 15.7, and 16.4 yr, respectively. For males, gains in BMD leveled off at 15.7 yr for total hip and at age 17.6 yr for spine and whole body. Black and Asian females and Asian males tended to reach a plateau in BMD earlier than the other ethnic groups. The use of gender- and ethnic-specific standards is recommended when interpreting pediatric bone densitometry data.

    View details for Web of Science ID 000084134100065

    View details for PubMedID 10599739

  • An evaluation of beta-blockers, calcium antagonists, nitrates, and alternative therapies for stable angina. Evidence report/technology assessment (Summary) Heidenreich, P. A., McDonald, K. M., Hastie, T., Fadel, B., Hagan, V., Lee, B. K., Hlatky, M. A. 1999: 1-2

    View details for PubMedID 11925969

  • Statistical measures for the computer-aided diagnosis of mammographic masses JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS Hastie, T., Ikeda, D., Tibshirani, R. 1999; 8 (3): 531-543
  • Meta-analysis of trials comparing beta-blockers, calcium antagonists, and nitrates for stable angina JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION Heidenreich, P. A., McDonald, K. M., Hastie, T., Fadel, B., Hagan, V., Lee, B. K., Hlatky, M. A. 1999; 281 (20): 1927-1936


    Which drug is most effective as a first-line treatment for stable angina is not known. To compare the relative efficacy and tolerability of treatment with beta-blockers, calcium antagonists, and long-acting nitrates for patients who have stable angina. We identified English-language studies published between 1966 and 1997 by searching the MEDLINE and EMBASE databases and reviewing the bibliographies of identified articles to locate additional relevant studies. Randomized or crossover studies comparing antianginal drugs from 2 or 3 different classes (beta-blockers, calcium antagonists, and long-acting nitrates) lasting at least 1 week were reviewed. Studies were selected if they reported at least 1 of the following outcomes: cardiac death, myocardial infarction, study withdrawal due to adverse events, angina frequency, nitroglycerin use, or exercise duration. Ninety (63%) of 143 identified studies met the inclusion criteria. Two independent reviewers extracted data from selected articles, settling any differences by consensus. Outcome data were extracted a third time by 1 of the investigators. We combined results using odds ratios (ORs) for discrete data and mean differences for continuous data. Studies of calcium antagonists were grouped by duration and type of drug (nifedipine vs nonnifedipine). Rates of cardiac death and myocardial infarction were not significantly different for treatment with beta-blockers vs calcium antagonists (OR, 0.97; 95% confidence interval [CI], 0.67-1.38; P = .79). There were 0.31 (95% CI, 0.00-0.62; P = .05) fewer episodes of angina per week with beta-blockers than with calcium antagonists. beta-Blockers were discontinued because of adverse events less often than were calcium antagonists (OR, 0.72; 95% CI, 0.60-0.86; P<.001). The differences between beta-blockers and calcium antagonists were most striking for nifedipine (OR for adverse events with beta-blockers vs nifedipine, 0.60; 95% CI, 0.47-0.77). Too few trials compared nitrates with calcium antagonists or beta-blockers to draw firm conclusions about relative efficacy. beta-Blockers provide similar clinical outcomes and are associated with fewer adverse events than calcium antagonists in randomized trials of patients who have stable angina.

    View details for Web of Science ID 000080427300033

    View details for PubMedID 10349897

  • Regression analysis of multiple protein structures JOURNAL OF COMPUTATIONAL BIOLOGY Wu, T. D., Schmidler, S. C., Hastie, T., Brutlag, D. L. 1998; 5 (3): 585-595


    A general framework is presented for analyzing multiple protein structures using statistical regression methods. The regression approach can superimpose protein structures rigidly or with shear. Also, this approach can superimpose multiple structures explicitly, without resorting to pairwise superpositions. The algorithm alternates between matching corresponding landmarks among the protein structures and superimposing these landmarks. Matching is performed using a robust dynamic programming technique that uses gap penalties that adapt to the given data. Superposition is performed using either orthogonal transformations, which impose the rigid-body assumption, or affine transformations, which allow shear. The resulting regression model of a protein family measures the amount of structural variability at each landmark. A variation of our algorithm permits a separate weight for each landmark, thereby allowing one to emphasize particular segments of a protein structure or to compensate for variances that differ at various positions in a structure. In addition, a method is introduced for finding an initial correspondence, by measuring the discrete curvature along each protein backbone. Discrete curvature also characterizes the secondary structure of a protein backbone, distinguishing among helical, strand, and loop regions. An example is presented involving a set of seven globin structures. Regression analysis, using both affine and orthogonal transformations, reveals that globins are most strongly conserved structurally in helical regions, particularly in the mid-regions of the E, F, and G helices.

    View details for Web of Science ID 000075921100016

    View details for PubMedID 9773352

  • The error coding method and PICTs JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS James, G., Hastie, T. 1998; 7 (3): 377-387
  • Classification by pairwise coupling ANNALS OF STATISTICS Hastie, T., Tibshirani, R. 1998; 26 (2): 451-471
  • Classification by pairwise coupling 11th Annual Conference on Neural Information Processing Systems (NIPS) Hastie, T., Tibshirani, R. MIT PRESS. 1998: 507–513
  • The error coding and substitution PaCTs 11th Annual Conference on Neural Information Processing Systems (NIPS) James, G., Hastie, T. MIT PRESS. 1998: 542–548
  • Modeling and superposition of multiple protein structures using affine transformations: analysis of the globins. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Wu, T. D., Schmidler, S. C., Hastie, T., Brutlag, D. L. 1998: 509-520


    A novel approach for analyzing multiple protein structures is presented. A family of related protein structures may be characterized by an affine model, obtained by applying transformation matrices that permit both rotation and shear. The affine model and transformation matrices can be computed efficiently using a single eigen-decomposition. A novel method for finding correspondences is also introduced. This method matches curvatures along the protein backbone. The algorithm is applied to analyze a set of seven globin structures. Our method identifies 100 corresponding landmarks across all seven structures. Results show that most helices in globins can be identified by high curvature, with the exception of the C and D helices. Analysis of the superposition reveals that globins are most strongly conserved structurally in the mid-regions of the E and G helices.

    View details for PubMedID 9697208

  • Discriminant adaptive nearest neighbor classification IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE Hastie, T., Tibshirani, R. 1996; 18 (6): 607-616
  • Discriminant adaptive nearest neighbor classification and regression 9th Annual Conference on Neural Information Processing Systems (NIPS) Hastie, T., Tibshirani, R. M I T PRESS. 1996: 409–415
  • Generalized additive models for medical research. Statistical methods in medical research Hastie, T., Tibshirani, R. 1995; 4 (3): 187-196


    This article reviews flexible statistical methods that are useful for characterizing the effect of potential prognostic factors on disease endpoints. Applications to survival models and binary outcome models are illustrated.

    View details for PubMedID 8548102

  • PENALIZED DISCRIMINANT-ANALYSIS ANNALS OF STATISTICS Hastie, T., Buja, A., Tibshirani, R. 1995; 23 (1): 73-102
  • WAVELET SHRINKAGE - ASYMPTOPIA - DISCUSSION JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL Speckman, P. L., Marron, J. S., Silverman, B., Nason, G., Wang, K. M., Seifert, B., Gasser, T., EFROMOVITCH, S., Nussbaum, M., Wang, Y. Z., VONSACHS, R., Brillinger, D. R., Neumann, M. H., Hastie, T., Fan, J. Q., Antoniadis, A., Birge, L., CRELLIN, N. J., Martin, M. A., Doukhan, P., Engel, J., GEORGIEV, A. A., Liu, H., Good, I. J., Hall, P., Patil, P., Herrmann, E., Kolaczyk, E. D., LEPSKII, O. V., Mammen, E., Spokoiny, V. G., LUCIER, B., McCullagh, P., Moulin, P., Muller, H. G., Olshen, R. A., Tsybakov, A. B., Wahba, G., Walter, G. G., Tibshirani, R. 1995; 57 (2): 337-369
  • NEURAL NETWORKS AND RELATED METHODS FOR CLASSIFICATION - DISCUSSION JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY Whittle, P., Kay, J., Hand, D. J., Tarassenko, L., Brown, P. J., Titterington, D. M., TAYLOR, C., Gilks, W. R., Critchley, F., Mayne, A. J., Wahba, G., Luttrell, S. P., Baczkowski, A. J., Mardia, K. V., Breiman, L., Buntine, W., Chatfield, C., DeVeaux, R. D., DARKEN, C. J., Ungar, L. H., Glendinning, R. H., Hastie, T., Tibshirani, R., McLachlan, G. J., Michie, D., Owen, A. B., Wolpert, D. H., Ripley, B. D. 1994; 56 (3): 437-456


    The proportional hazards model is frequently used in analyzing the results of clinical trials, when it is often the case that the outcomes are right-censored. This model allows one to measure treatment effects and simultaneously identify and adjust for prognostic factors that might influence the outcome. In this paper, we outline a class of semiparametric models that allows one to model prognostic factors nonlinearly, and have the data suggest the form of their effect. The methods are illustrated in an analysis of data from a breast cancer clinical trial.

    View details for Web of Science ID A1992JQ41500008

    View details for PubMedID 1391990

  • MULTIVARIATE ADAPTIVE REGRESSION SPLINES - DISCUSSION ANNALS OF STATISTICS Buja, A., Duffy, D., Hastie, T., Tibshirani, R. 1991; 19 (1): 93-99
  • Statistical Models in S Chambers, J., Hastie, T. Wadsworth/Brooks Cole, Pacific Grove, California . 1991


    We discuss an exploratory technique for investigating the nature of covariate effects in Cox's proportional hazards model. This technique features an additive term $\sum_{j=1}^{p} f_j(x_{ij})$, in place of the usual linear term $\sum_{j=1}^{p} x_{ij} \beta_j$, where $x_{i1}, x_{i2}, \ldots, x_{ip}$ are covariate values for the $i$th individual. The $f_j(\cdot)$ are unspecified smooth functions that are estimated using scatterplot smoothers. These functions can be used for descriptive purposes or to suggest transformations of the covariates. The estimation technique is a variation of the local scoring algorithm for generalized additive models (Hastie and Tibshirani, 1986, Statistical Science 1, 297-318).

    View details for Web of Science ID A1990EV52100010

    View details for PubMedID 1964808

  • Generalized Additive Models. Hastie, T., Tibshirani, R. Chapman and Hall. 1990


    The relationship between gestational age, neonatal size and neonatal death is complex. To date, most authors have used birth weight as a proxy for neonatal size and have neglected to examine head circumference and crown heel length. In addition, they have assumed the size and gestational age were linearly related to neonatal death. In this study we use nonparametric multiple logistic regression to examine the relationship between gestational age, neonatal size and neonatal death. On its own, gestational age was nonlinearly associated with neonatal death. This nonlinearity disappeared with the addition of birth weight, crown heel length and head circumference. Birth weight, head circumference and crown heel length all had significant nonlinear associations with neonatal death in univariate analysis. With all factors in the model, birth weight and head circumference were nonlinearly associated with neonatal death and crown heel length was linearly associated with neonatal death. The complex relations between gestational age, neonatal size and neonatal death were explored with greater ease with nonparametric logistic regression.

    View details for Web of Science ID A1990EK04600007

    View details for PubMedID 2243255

  • REGRESSION WITH AN ORDERED CATEGORICAL RESPONSE STATISTICS IN MEDICINE Hastie, T. J., Botha, J. L., Schnitzler, C. M. 1989; 8 (7): 785-794


    A survey on Mseleni joint disease in South Africa involved the scoring of pelvic X-rays of women to measure osteoporosis. The scores were ordinal by construction and ranged from 0 to 12. It is standard practice to use ordinary regression techniques with an ordinal response that has that many categories. We give evidence for these data that the constraints on the response result in a misleading regression analysis. McCullagh's proportional-odds model is designed specifically for the regression analysis of ordinal data. We demonstrate the technique on these data, and show how it fills the gap between ordinary regression and logistic regression (for discrete data with two categories). In addition, we demonstrate non-parametric versions of these models that do not make any linearity assumptions about the regression function.

    View details for Web of Science ID A1989AF73400002

    View details for PubMedID 2772438

  • LINEAR SMOOTHERS AND ADDITIVE-MODELS ANNALS OF STATISTICS Buja, A., Hastie, T., Tibshirani, R. 1989; 17 (2): 453-510
  • PROJECTION PURSUIT - DISCUSSION ANNALS OF STATISTICS Hastie, T., Tibshirani, R. 1985; 13 (2): 502-508


    The records of 654 patients with mitral stenosis who underwent closed mitral valvotomy over a 12-year period were submitted to actuarial analysis. This revealed a low (2.97%) operative mortality. At 12 years, the overall cumulative proportion surviving was 78%; 47% of patients survived without reoperation. The usual clinical indicators of suitability for closed valvotomy were successful in predicting improved survival. The surgeon's assessment of the suitability of the valve correlated well with outcome. Valvotomy during pregnancy was associated with a good long-term outlook. The presence of pulmonary hypertension and atrial fibrillation did not alter survival significantly. Sex and age were not associated with adverse prognosis. We conclude that closed mitral valvotomy still has a place in the management of mobile mitral stenosis, particularly in areas where there is a high incidence of rheumatic heart disease and a large number of young patients have mobile mitral stenosis.

    View details for Web of Science ID A1982NP11000007

    View details for PubMedID 7082084

  • Risk of asbestosis in crocidolite and amosite mines in South Africa. Annals of the New York Academy of Sciences Irwig, L. M., du Toit, R. S., Sluis-Cremer, G. K., Solomon, A., Thomas, R. G., Hamel, P. P., Webster, I., Hastie, T. 1979; 330: 35-52


    X-rays of all white and mixed-race men employed in crocidolite and amosite mines and mills were read independently by three experienced readers according to the ILO U/C classification. Abnormality was regarded as present if reported by two or more readers. Parenchymal abnormality, defined as the presence of small irregular opacities of profusion 1/0 or greater, was found in 7.3% of the workers. Pleural thickening was found in 4.5% of the workers, costophrenic angle obliteration in 3.2%, and pleural calcification in 1.7%. The prevalences of both pleural and parenchymal abnormality were strongly related to the duration of exposure to asbestos at work. The overall prevalence of abnormality increased from 4.0% in men with exposure for 1 year or less to 47.9% in men with more than 15 years of exposure. After taking into account the effects of age and duration of asbestos exposure, the prevalence of pleural abnormality was not predicted by fiber concentration. However, white men working with amosite tended to develop a higher prevalence of pleural abnormality than did those working with crocidolite. Compared to whites, men of mixed race, who only work with crocidolite, had a high prevalence of pleural abnormality in each exposure duration category. In contrast to pleural abnormality, the prevalence of parenchymal abnormality, after taking into account the effects of age and duration of exposure, was significantly predicted by fiber concentration but not by race or asbestos type. Our results suggest that parenchymal abnormality in workers in South African asbestos mines could be largely prevented by reducing exposure to fibers visible under the light microscope. However, this may not be the case for pleural abnormality.

    View details for PubMedID 294187



    Antibiotic resistance patterns in clinical isolates of selected gram-negative bacteria at Groote Schuur Hospital during two three-month periods with a ten-year interval were investigated. The antibiotic resistance is represented by means of the cross product, or odds ratio, using the log-linear model. This was found to be a simple method of monitoring the change or increase of antibiotic resistance, and enabled an overall analysis, catering for antibiotic and organism effects, to be performed.

    View details for Web of Science ID A1979HC38800037

    View details for PubMedID 384718