Trevor Hastie
John A. Overdeck Professor, Professor of Statistics and of Biomedical Data Sciences
Web page: http://www-stat.stanford.edu/~hastie
Bio
Trevor Hastie is the John A Overdeck Professor of Statistics at
Stanford University. Hastie is known for his research in applied
statistics, particularly in the fields of statistical modeling, bioinformatics
and machine learning. He has published six books and over 200
research articles in these areas. Prior to joining Stanford
University in 1994, Hastie worked at AT&T Bell Laboratories for nine
years, where he contributed to the development of the statistical modeling environment
popular in the R computing system. He received a B.Sc. (hons) in statistics
from Rhodes University in 1976, an M.Sc. from the University of Cape
Town in 1979, and a Ph.D. from Stanford in 1984. In 2018 he was elected
to the U.S. National Academy of Sciences. He is a dual citizen of the
United States and South Africa.
Academic Appointments
- Professor, Statistics
- Professor, Department of Biomedical Data Science
- Member, Bio-X
- Member, Stanford Cancer Institute
- Member, Wu Tsai Neurosciences Institute
Administrative Appointments
- Chair, Department of Statistics (2006 - 2009)
Honors & Awards
- Breiman Award (senior), American Statistical Association (2020)
- Elected Member, Royal Netherlands Academy of Arts and Sciences (2019)
- Sigillum Magnum, Bologna University (2019)
- Elected Member, United States National Academy of Sciences (2018)
- Honorary Doctorate, Leuphana University of Lüneburg, Germany (2018)
- The Emmanuel and Carol Parzen Prize for Statistical Innovation, Texas A&M University (2014)
- Bernard G. Greenberg Distinguished Lecturer, Department of Biostatistics, University of North Carolina (2013)
- Fellow, South African Statistical Association (2011)
- Fellow, American Statistical Association (1998)
- Craig Award, University of Iowa (1996)
- Fellow, Institute of Mathematical Statistics (1996)
- Myrto Lefkopoulou Award, Harvard School of Public Health (1996)
- Elected Member, International Statistical Institute (1994)
- Fellow, Royal Statistical Society (1979)
Professional Education
- Ph.D., Stanford University, Statistics (1984)
- M.Sc., University of Cape Town, Statistics (1979)
- B.Sc. (hons), Rhodes University, Statistics (1976)
Current Research and Scholarly Interests
Trevor Hastie specializes in applied statistical modeling, and he has written five books in this area:
"Generalized Additive Models" (with R. Tibshirani, Chapman and Hall,
1991), "Elements of Statistical Learning (second edition)"
(with R. Tibshirani and J. Friedman, Springer 2009),
"An Introduction to Statistical Learning" (with G. James, D. Witten and
R. Tibshirani, Springer 2013),
"Statistical Learning with Sparsity" (with R. Tibshirani and M. Wainwright, CRC Press 2015)
and "Computer Age Statistical Inference" (with B. Efron, Cambridge, 2016). He has also made contributions in
statistical computing, co-editing (with J. Chambers) a large software
library on modeling tools in the S language used in the R computing environment
("Statistical Models in S", Wadsworth, 1992). His current research
focuses on applied problems in biology and genomics, medicine and
industry, in particular data modeling, prediction and classification
problems.
2024-25 Courses
- Applied Statistics I
STATS 305A (Aut)
Independent Studies (9)
- Biomedical Informatics Teaching Methods
BIOMEDIN 290 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIODS 299 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIOMEDIN 299 (Aut, Win, Spr, Sum)
- Independent Study
STATS 299 (Win, Spr)
- Industrial Research for Statisticians
STATS 398 (Aut, Win, Spr)
- Medical Scholars Research
BIOMEDIN 370 (Aut, Win, Spr, Sum)
- Medical Scholars Research
HRP 370 (Aut, Win, Spr, Sum)
- Ph.D. Research
CME 400 (Aut, Win, Spr)
- Research
STATS 399 (Aut, Win, Spr)
Prior Year Courses
2023-24 Courses
- Applied Statistics III
STATS 305C (Spr)
2022-23 Courses
- Applied Multivariate Analysis
STATS 206 (Spr)
2021-22 Courses
- Applied Multivariate Analysis
STATS 206 (Spr)
- Applied Statistics III
Stanford Advisees
- Doctoral Dissertation Reader (AC)
Zhaomeng Chen, John Cherian, Paula Gablenz, Asher Spector, Mike Van Ness
- Doctoral Dissertation Advisor (AC)
Disha Ghandwani, Anav Sood, James Yang
- Master's Program Advisor
Moritz Bolling, Cam Burton, Sergio Charles, Dylan Chou, Salil Goyal, Martin Pollack, Malavi Ravindran, Charles Shaviro, Ivy Sun, Ella Yadav, Grace Yang, Xianchen Yang, Timothy Yao, Charlie Zhang, Minghe Zhang
Graduate and Fellowship Programs
- Biomedical Informatics (PhD Program)
All Publications
-
Cross-Validation: What Does It Estimate and How Well Does It Do It?
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2023
View details for DOI 10.1080/01621459.2023.2197686
View details for Web of Science ID 000989697400001
-
Elastic Net Regularization Paths for All Generalized Linear Models.
Journal of statistical software
2023; 106
Abstract
The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression and multinomial logistic regression, while Simon, Friedman, Hastie, and Tibshirani (2011) extended this work to Cox models for right-censored data. We further extend the reach of the elastic net-regularized regression to all generalized linear model families, Cox models with (start, stop] data and strata, and a simplified version of the relaxed lasso. We also discuss convenient utility functions for measuring the performance of these fitted models.
View details for DOI 10.18637/jss.v106.i01
View details for PubMedID 37138589
View details for PubMedCentralID PMC10153598
-
Elastic Net Regularization Paths for All Generalized Linear Models
JOURNAL OF STATISTICAL SOFTWARE
2023; 106 (1): 1-31
View details for DOI 10.18637/jss.v106.i01
View details for Web of Science ID 000957922400001
-
Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays.
Journal of machine learning research : JMLR
2022; 23
Abstract
Unmeasured or latent variables are often the cause of correlations between multivariate measurements, which are studied in a variety of fields such as psychology, ecology, and medicine. For Gaussian measurements, there are classical tools such as factor analysis or principal component analysis with a well-established theory and fast algorithms. Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses. However, current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets with thousands of observational units or responses. In this article, we propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood and then using a Newton method and Fisher scoring to learn the model parameters. Computationally, our method is noticeably faster and more stable, enabling GLLVM fits to much larger matrices than previously possible. We apply our method on a dataset of 48,000 observational units with over 2,000 observed species in each unit and find that most of the variability can be explained with a handful of factors. We publish an easy-to-use implementation of our proposed fitting algorithm.
View details for PubMedID 37102181
-
LARGE-SCALE MULTIVARIATE SPARSE REGRESSION WITH APPLICATIONS TO UK BIOBANK.
The annals of applied statistics
2022; 16 (3): 1891-1918
Abstract
In high-dimensional regression problems, often a relatively small subset of the features are relevant for predicting the outcome, and methods that impose sparsity on the solution are popular. When multiple correlated outcomes are available (multitask), reduced rank regression is an effective way to borrow strength and capture latent structures that underlie the data. Our proposal is motivated by the UK Biobank population-based cohort study, where we are faced with large-scale, ultrahigh-dimensional features, and have access to a large number of outcomes (phenotypes): lifestyle measures, biomarkers, and disease outcomes. We are hence led to fit sparse reduced-rank regression models, using computational strategies that allow us to scale to problems of this size. We use a scheme that alternates between solving the sparse regression problem and solving the reduced rank decomposition. For the sparse regression component we propose a scalable iterative algorithm based on adaptive screening that leverages the sparsity assumption and enables us to focus on solving much smaller subproblems. The full solution is reconstructed and tested via an optimality condition to make sure it is a valid solution for the original problem. We further extend the method to cope with practical issues, such as the inclusion of confounding variables and imputation of missing values among the phenotypes. Experiments on both synthetic data and the UK Biobank data demonstrate the effectiveness of the method and the algorithm. We present the multiSnpnet package, available at http://github.com/junyangq/multiSnpnet, which works on top of PLINK2 files and which we anticipate will be a valuable tool for generating polygenic risk scores from human genetic studies.
View details for DOI 10.1214/21-aoas1575
View details for PubMedID 36091495
View details for PubMedCentralID PMC9454085
-
LARGE-SCALE MULTIVARIATE SPARSE REGRESSION WITH APPLICATIONS TO UK BIOBANK
ANNALS OF APPLIED STATISTICS
2022; 16 (3): 1891-1918
View details for DOI 10.1214/21-AOAS1575
View details for Web of Science ID 000828472200030
-
SURPRISES IN HIGH-DIMENSIONAL RIDGELESS LEAST SQUARES INTERPOLATION.
Annals of statistics
2022; 50 (2): 949-986
Abstract
Interpolators (estimators that achieve zero training error) have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters $p$ is of the same order as the number of samples $n$. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in \mathbb{R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = \Sigma^{1/2} z_i$ (with $z_i \in \mathbb{R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = \varphi(W z_i)$ (with $z_i \in \mathbb{R}^d$, $W \in \mathbb{R}^{p \times d}$ a matrix of i.i.d. entries, and $\varphi$ an activation function acting componentwise on $W z_i$). We recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk, and the potential benefits of overparametrization.
View details for DOI 10.1214/21-aos2133
View details for PubMedID 36120512
View details for PubMedCentralID PMC9481183
-
Significant sparse polygenic risk scores across 813 traits in UK Biobank.
PLoS genetics
2022; 18 (3): e1010105
Abstract
We present a systematic assessment of polygenic risk score (PRS) prediction across more than 1,500 traits using genetic and phenotype data in the UK Biobank. We report 813 sparse PRS models with significant ($p < 2.5 \times 10^{-5}$) incremental predictive performance when compared against the covariate-only model that considers age, sex, types of genotyping arrays, and the principal component loadings of genotypes. We report a significant correlation between the number of genetic variants selected in the sparse PRS model and the incremental predictive performance (Spearman's $\rho = 0.61$, $p = 2.2 \times 10^{-59}$ for quantitative traits, $\rho = 0.21$, $p = 9.6 \times 10^{-4}$ for binary traits). The sparse PRS model trained on European individuals showed limited transferability when evaluated on non-European individuals in the UK Biobank. We provide the PRS model weights on the Global Biobank Engine (https://biobankengine.stanford.edu/prs).
View details for DOI 10.1371/journal.pgen.1010105
View details for PubMedID 35324888
-
LinCDE: Conditional Density Estimation via Lindsey's Method
JOURNAL OF MACHINE LEARNING RESEARCH
2022; 23: 1-55
View details for Web of Science ID 000752280200001
-
Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays
JOURNAL OF MACHINE LEARNING RESEARCH
2022; 23
View details for Web of Science ID 001003362700001
-
Transparency and reproducibility in artificial intelligence.
Nature
2020; 586 (7829): E14–E16
View details for DOI 10.1038/s41586-020-2766-y
View details for PubMedID 33057217
-
Ridge Regularization: An Essential Concept in Data Science.
Technometrics : a journal of statistics for the physical, chemical, and engineering sciences
2020; 62 (4): 426-433
Abstract
Ridge, or more formally $\ell_2$ regularization, shows up in many areas of statistics and machine learning. It is one of those essential devices that any good data scientist needs to master for their craft. In this brief ridge fest, I have collected together some of the magic and beauty of ridge that my colleagues and I have encountered over the past 40 years in applied statistics.
View details for DOI 10.1080/00401706.2020.1791959
View details for PubMedID 36033922
View details for PubMedCentralID PMC9410599
-
A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank.
PLoS genetics
2020; 16 (10): e1009141
Abstract
The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been shown to greatly improve the prediction performance for a variety of phenotypes. In high-dimensional settings, the lasso, since its first proposal in statistics, has been proved to be an effective method for simultaneous variable selection and estimation. However, the large scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports the $\ell_1$-penalized linear model, logistic regression, and Cox model, and also extends to the elastic net with $\ell_1$/$\ell_2$ penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
View details for DOI 10.1371/journal.pgen.1009141
View details for PubMedID 33095761
-
Learning Interactions via Hierarchical Group-Lasso Regularization
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2015; 24 (3): 627-654
Abstract
We introduce a method for learning pairwise interactions in a linear regression or logistic regression model in a manner that satisfies strong hierarchy: whenever an interaction is estimated to be nonzero, both its associated main effects are also included in the model. We motivate our approach by modeling pairwise interactions for categorical variables with arbitrary numbers of levels, and then show how we can accommodate continuous variables as well. Our approach allows us to dispense with explicitly applying constraints on the main effects and interactions for identifiability, which results in interpretable interaction models. We compare our method with existing approaches on both simulated and real data, including a genome-wide association study, all using our R package glinternet.
View details for DOI 10.1080/10618600.2014.938812
View details for Web of Science ID 000361373800002
View details for PubMedCentralID PMC4706754
-
Bias correction in species distribution models: pooling survey and collection data for multiple species
METHODS IN ECOLOGY AND EVOLUTION
2015; 6 (4): 424-438
View details for DOI 10.1111/2041-210X.12242
View details for Web of Science ID 000352794100007
-
Learning the Structure of Mixed Graphical Models
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2015; 24 (1): 230-253
View details for DOI 10.1080/10618600.2014.900500
View details for Web of Science ID 000352298800011
- Statistical Learning with Sparsity: The Lasso and Generalizations CRC Press. 2015
- An Introduction to Statistical Learning, with Applications in R Springer Texts in Statistics Springer. 2013
- Elements of Statistical Learning: Data Mining, Inference, and Prediction (second edition) Springer 2009
-
PATHWISE COORDINATE OPTIMIZATION
ANNALS OF APPLIED STATISTICS
2007; 1 (2): 302-332
View details for DOI 10.1214/07-AOAS131
View details for Web of Science ID 000261057600003
-
Regularization and variable selection via the elastic net
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
2005; 67: 301-320
View details for Web of Science ID 000227498200007
-
The entire regularization path for the support vector machine
JOURNAL OF MACHINE LEARNING RESEARCH
2004; 5: 1391-1415
View details for Web of Science ID 000236328300007
-
Boosting as a regularized path to a maximum margin classifier
JOURNAL OF MACHINE LEARNING RESEARCH
2004; 5: 941-973
View details for Web of Science ID 000236328000004
-
Least angle regression
ANNALS OF STATISTICS
2004; 32 (2): 407-451
View details for Web of Science ID 000221411000001
-
Bayesian backfitting - Comments and rejoinder
STATISTICAL SCIENCE
2000; 15 (3): 213-223
View details for Web of Science ID 000166404100003
-
Additive logistic regression: A statistical view of boosting
ANNALS OF STATISTICS
2000; 28 (2): 337-374
View details for Web of Science ID 000089669700001
-
Discriminant analysis by Gaussian mixtures
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL
1996; 58 (1): 155-176
View details for Web of Science ID A1996TU31400010
-
VARYING-COEFFICIENT MODELS
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
1993; 55 (4): 757-796
View details for Web of Science ID A1993LP19100001
- Generalized Additive Models -- 1990
-
The impact of exercise on gene regulation in association with complex trait genetics.
Nature communications
2024; 15 (1): 3346
Abstract
Endurance exercise training is known to reduce risk for a range of complex diseases. However, the molecular basis of this effect has been challenging to study and largely restricted to analyses of either few or easily biopsied tissues. Extensive transcriptome data collected across 15 tissues during exercise training in rats as part of the Molecular Transducers of Physical Activity Consortium has provided a unique opportunity to clarify how exercise can affect tissue-specific gene expression and further suggest how exercise adaptation may impact complex disease-associated genes. To build this map, we integrate this multi-tissue atlas of gene expression changes with gene-disease targets, genetic regulation of expression, and trait relationship data in humans. Consensus from multiple approaches prioritizes specific tissues and genes where endurance exercise impacts disease-relevant gene expression. Specifically, we identify a total of 5523 trait-tissue-gene triplets to serve as a valuable starting point for future investigations [Exercise; Transcription; Human Phenotypic Variation].
View details for DOI 10.1038/s41467-024-45966-w
View details for PubMedID 38693125
-
Temporal dynamics of the multi-omic response to endurance exercise training.
Nature
2024; 629 (8010): 174-183
Abstract
Regular exercise promotes whole-body health and prevents disease, but the underlying molecular mechanisms are incompletely understood. Here, the Molecular Transducers of Physical Activity Consortium profiled the temporal transcriptome, proteome, metabolome, lipidome, phosphoproteome, acetylproteome, ubiquitylproteome, epigenome and immunome in whole blood, plasma and 18 solid tissues in male and female Rattus norvegicus over eight weeks of endurance exercise training. The resulting data compendium encompasses 9,466 assays across 19 tissues, 25 molecular platforms and 4 training time points. Thousands of shared and tissue-specific molecular alterations were identified, with sex differences found in multiple tissues. Temporal multi-omic and multi-tissue analyses revealed expansive biological insights into the adaptive responses to endurance training, including widespread regulation of immune, metabolic, stress response and mitochondrial pathways. Many changes were relevant to human health, including non-alcoholic fatty liver disease, inflammatory bowel disease, cardiovascular health and tissue injury and recovery. The data and analyses presented in this study will serve as valuable resources for understanding and exploring the multi-tissue molecular effects of endurance training and are provided in a public repository (https://motrpac-data.org/).
View details for DOI 10.1038/s41586-023-06877-w
View details for PubMedID 38693412
View details for PubMedCentralID PMC11062907
-
In silico identification of putative causal genetic variants.
bioRxiv : the preprint server for biology
2024
Abstract
Understanding the causal genetic architecture of complex phenotypes is essential for future research into disease mechanisms and potential therapies. Despite the widespread availability of genome-wide data, existing methods to analyze genetic data still primarily focus on marginal association models, which fall short of fully capturing the polygenic nature of complex traits and elucidating biological causal mechanisms. Here we present a computationally efficient causal inference framework for genome-wide detection of putative causal variants underlying genetic associations. Our approach utilizes summary statistics from potentially overlapping studies as input, constructs in silico knockoff copies of summary statistics as negative controls to attenuate confounding effects induced by linkage disequilibrium, and employs efficient ultrahigh-dimensional sparse regression to jointly model all genetic variants across the genome. Our method is computationally efficient, requiring less than 15 minutes on a single CPU to analyze genome-wide summary statistics. In applications to a meta-analysis of ten large-scale genetic studies of Alzheimer's disease (AD) we identified 82 loci associated with AD, including 37 additional loci missed by conventional GWAS pipeline via marginal association testing. The identified putative causal variants achieve state-of-the-art agreement with massively parallel reporter assays and CRISPR-Cas9 experiments. Additionally, we applied the method to a retrospective analysis of large-scale genome-wide association studies (GWAS) summary statistics from 2013 to 2022. Results reveal the method's capacity to robustly discover additional loci for polygenic traits beyond conventional GWAS and pinpoint potential causal variants underpinning each locus (on average, 22.7% more loci and 78.7% fewer proxy variants), contributing to a deeper understanding of complex genetic architectures in post-GWAS analyses. 
We are making the discoveries and software freely available to the community and anticipate that routine end-to-end in silico identification of putative causal genetic variants will become an important tool that will facilitate downstream functional experiments and future research into disease etiology, as well as the exploration of novel therapeutic avenues.
View details for DOI 10.1101/2024.02.28.582621
View details for PubMedID 38464202
-
A modified Michaelis-Menten equation estimates growth from birth to 3 years in healthy babies in the USA.
BMC medical research methodology
2024; 24 (1): 27
Abstract
BACKGROUND: Standard pediatric growth curves cannot be used to impute missing height or weight measurements in individual children. The Michaelis-Menten equation, used for characterizing substrate-enzyme saturation curves, has been shown to model growth in many organisms including nonhuman vertebrates. We investigated whether this equation could be used to interpolate missing growth data in children in the first three years of life and compared this interpolation to several common interpolation methods and pediatric growth models. METHODS: We developed a modified Michaelis-Menten equation and compared expected to actual growth, first in a local birth cohort (N=97) then in a large, outpatient, pediatric sample (N=14,695). RESULTS: The modified Michaelis-Menten equation showed excellent fit for both infant weight (median RMSE: boys: 0.22 kg [IQR: 0.19; 90% < 0.43]; girls: 0.20 kg [IQR: 0.17; 90% < 0.39]) and height (median RMSE: boys: 0.93 cm [IQR: 0.53; 90% < 1.0]; girls: 0.91 cm [IQR: 0.50; 90% < 1.0]). Growth data were modeled accurately with as few as four values from routine well-baby visits in year 1 and seven values in years 1-3; birth weight or length was essential for best fit. Interpolation with this equation had comparable (for weight) or lower (for height) mean RMSE compared to the best performing alternative models. CONCLUSIONS: A modified Michaelis-Menten equation accurately describes growth in healthy babies aged 0-36 months, allowing interpolation of missing weight and height values in individual longitudinal measurement series. The growth pattern in healthy babies in resource-rich environments mirrors an enzymatic saturation curve.
View details for DOI 10.1186/s12874-024-02145-1
View details for PubMedID 38302887
-
Smooth Multi-Period Forecasting With Application to Prediction of COVID-19 Cases
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2024
View details for DOI 10.1080/10618600.2023.2285337
View details for Web of Science ID 001138480600001
-
Modeling Longitudinal Data Using Matrix Completion
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2023
View details for DOI 10.1080/10618600.2023.2257257
View details for Web of Science ID 001102629900001
-
Defining Usual Oral Temperature Ranges in Outpatients Using an Unsupervised Learning Algorithm.
JAMA internal medicine
2023
Abstract
Although oral temperature is commonly assessed in medical examinations, the range of usual or "normal" temperature is poorly defined. To determine normal oral temperature ranges by age, sex, height, weight, and time of day. This cross-sectional study used clinical visit information from the divisions of Internal Medicine and Family Medicine in a single large medical care system. All adult outpatient encounters that included temperature measurements from April 28, 2008, through June 4, 2017, were eligible for inclusion. The LIMIT (Laboratory Information Mining for Individualized Thresholds) filtering algorithm was applied to iteratively remove encounters with primary diagnoses overrepresented in the tails of the temperature distribution, leaving only those diagnoses unrelated to temperature. Mixed-effects modeling was applied to the remaining temperature measurements to identify independent factors associated with normal oral temperature and to generate individualized normal temperature ranges. Data were analyzed from July 5, 2017, to June 23, 2023. Primary diagnoses and medications, age, sex, height, weight, time of day, and month, abstracted from each outpatient encounter. Normal temperature ranges by age, sex, height, weight, and time of day. Of 618 306 patient encounters, 35.92% were removed by LIMIT because they included diagnoses or medications that fell disproportionately in the tails of the temperature distribution. The encounters removed due to overrepresentation in the upper tail were primarily linked to infectious diseases (76.81% of all removed encounters); type 2 diabetes was the only diagnosis removed for overrepresentation in the lower tail (15.71% of all removed encounters). The 396 195 encounters included in the analysis set consisted of 126 705 patients (57.35% women; mean [SD] age, 52.7 [15.9] years). Prior to running LIMIT, the mean (SD) overall oral temperature was 36.71 °C (0.43 °C); following LIMIT, the mean (SD) temperature was 36.64 °C (0.35 °C).
Using mixed-effects modeling, age, sex, height, weight, and time of day accounted for 6.86% (overall) and up to 25.52% (per patient) of the observed variability in temperature. Mean normal oral temperature did not reach 37 °C for any subgroup; the upper 99th percentile ranged from 36.81 °C (a tall man with underweight aged 80 years at 8:00 am) to 37.88 °C (a short woman with obesity aged 20 years at 2:00 pm). The findings of this cross-sectional study suggest that normal oral temperature varies in an expected manner based on sex, age, height, weight, and time of day, allowing individualized normal temperature ranges to be established. The clinical significance of a value outside of the usual range is an area for future study.
View details for DOI 10.1001/jamainternmed.2023.4291
View details for PubMedID 37669046
-
A modified Michaelis-Menten equation estimates growth from birth to 3 years in healthy babies in the US.
Research square
2023
Abstract
Standard pediatric growth curves cannot be used to impute missing height or weight measurements in individual children. The Michaelis-Menten equation, used for characterizing substrate-enzyme saturation curves, has been shown to model growth in many organisms including nonhuman vertebrates. We investigated whether this equation could be used to interpolate missing growth data in children in the first three years of life. We developed a modified Michaelis-Menten equation and compared expected to actual growth, first in a local birth cohort (N=97) then in a large, outpatient, pediatric sample (N=14,695). The modified Michaelis-Menten equation showed excellent fit for both infant weight (median RMSE: boys: 0.22 kg [IQR: 0.19; 90% < 0.43]; girls: 0.20 kg [IQR: 0.17; 90% < 0.39]) and height (median RMSE: boys: 0.93 cm [IQR: 0.53; 90% < 1.0]; girls: 0.91 cm [IQR: 0.50; 90% < 1.0]). Growth data were modeled accurately with as few as four values from routine well-baby visits in year 1 and seven values in years 1-3; birth weight or length was essential for best fit. A modified Michaelis-Menten equation accurately describes growth in healthy babies aged 0-36 months, allowing interpolation of missing weight and height values in individual longitudinal measurement series. The growth pattern in healthy babies in resource-rich environments mirrors an enzymatic saturation curve.
View details for DOI 10.21203/rs.3.rs-2375831/v1
View details for PubMedID 36711501
View details for PubMedCentralID PMC9882604
-
Canonical correlation analysis in high dimensions with structured regularization.
Statistical modelling
2023; 23 (3): 203-227
Abstract
Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. A regularized modification of canonical correlation analysis (RCCA) which imposes an ℓ2 penalty on the CCA coefficients is widely used in applications with high-dimensional data. One limitation of such regularization is that it ignores any data structure, treating all the features equally, which can be ill-suited for some applications. In this article we introduce several approaches to regularizing CCA that take the underlying data structure into account. In particular, the proposed group regularized canonical correlation analysis (GRCCA) is useful when the variables are correlated in groups. We illustrate some computational strategies to avoid excessive computations with regularized CCA in high dimensions. We demonstrate the application of these methods in our motivating application from neuroscience, as well as in a small simulation example.
View details for DOI 10.1177/1471082x211041033
View details for PubMedID 37334164
View details for PubMedCentralID PMC10274416
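The ℓ2-penalized CCA (RCCA) that the abstract above takes as its starting point can be sketched in a few lines. This is a minimal illustration of plain RCCA only, assuming the common formulation in which the penalty adds a multiple of the identity to each within-set covariance matrix; it does not implement the group structure of GRCCA, and the function name `rcca_first_pair` and the simulated data are hypothetical.

```python
import numpy as np

def rcca_first_pair(X, Y, lam_x, lam_y):
    """First canonical pair of ridge-regularized CCA (sketch, not the paper's code)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # The l2 penalty shifts the diagonal, so the matrices are invertible even if p > n.
    Sxx = X.T @ X / n + lam_x * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + lam_y * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(S):
        # Inverse symmetric square root via the eigendecomposition.
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    a = Wx @ U[:, 0]   # coefficients for X
    b = Wy @ Vt[0]     # coefficients for Y
    return a, b, float(s[0])  # s[0] is the regularized canonical correlation

# Simulated high-dimensional example (more features than samples).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))
Y = X[:, :5] @ rng.normal(size=(5, 8)) + 0.1 * rng.normal(size=(30, 8))
a, b, rho = rcca_first_pair(X, Y, lam_x=1.0, lam_y=1.0)
```

Forming the full 50 x 50 covariance is fine at this scale; the abstract notes that high-dimensional applications need computational strategies that avoid such direct calculations.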
-
Reorienting Latent Variable Modeling for Supervised Learning.
Multivariate behavioral research
2023: 1-15
Abstract
Despite its potential benefits, using prediction targets generated based on latent variable (LV) modeling is not a common practice in supervised learning, a dominating framework for developing prediction models. In supervised learning, it is typically assumed that the outcome to be predicted is clear and readily available, and therefore validating outcomes before predicting them is a foreign concept and an unnecessary step. The usual goal of LV modeling is inference, and therefore using it in supervised learning and in the prediction context requires a major conceptual shift. This study lays out methodological adjustments and conceptual shifts necessary for integrating LV modeling into supervised learning. It is shown that such integration is possible by combining the traditions of LV modeling, psychometrics, and supervised learning. In this interdisciplinary learning framework, generating practical outcomes using LV modeling and systematically validating them based on clinical validators are the two main strategies. In the example using the data from the Longitudinal Assessment of Manic Symptoms (LAMS) Study, a large pool of candidate outcomes is generated by flexible LV modeling. It is demonstrated that this exploratory situation can be used as an opportunity to tailor desirable prediction targets taking advantage of contemporary science and clinical insights.
View details for DOI 10.1080/00273171.2023.2182753
View details for PubMedID 37229653
-
A tissue atlas of ulcerative colitis revealing evidence of sex-dependent differences in disease-driving inflammatory cell types and resistance to TNF inhibitor therapy
SCIENCE ADVANCES
2023; 9 (3)
View details for Web of Science ID 000964550100033
-
Comparing spatial patterns of marine vessels between vessel-tracking data and satellite imagery
FRONTIERS IN MARINE SCIENCE
2023; 9
View details for DOI 10.3389/fmars.2022.1076775
View details for Web of Science ID 000924607800001
-
A tissue atlas of ulcerative colitis revealing evidence of sex-dependent differences in disease-driving inflammatory cell types and resistance to TNF inhibitor therapy.
Science advances
2023; 9 (3): eadd1166
Abstract
Although literature suggests that resistance to TNF inhibitor (TNFi) therapy in patients with ulcerative colitis (UC) is partially linked to immune cell populations in the inflamed region, there is still substantial uncertainty underlying the relevant spatial context. Here, we used the highly multiplexed immunofluorescence imaging technology CODEX to create a publicly browsable tissue atlas of inflammation in 42 tissue regions from 29 patients with UC and 5 healthy individuals. We analyzed 52 biomarkers on 1,710,973 spatially resolved single cells to determine cell types, cell-cell contacts, and cellular neighborhoods. We observed that cellular functional states are associated with cellular neighborhoods. We further observed that a subset of inflammatory cell types and cellular neighborhoods are present in patients with UC with TNFi treatment, potentially indicating resistant niches. Last, we explored applying convolutional neural networks (CNNs) to our dataset with respect to patient clinical variables. We note concerns and offer guidelines for reporting CNN-based predictions in similar datasets.
View details for DOI 10.1126/sciadv.add1166
View details for PubMedID 36662860
-
Feature-weighted elastic net: using "features of features" for better prediction.
Statistica Sinica
2023; 33 (1): 259-279
Abstract
In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penalties on the feature coefficients in the elastic net penalty. In our simulations, fwelnet outperforms the lasso in terms of test mean squared error and usually gives an improvement in true positive rate or false positive rate for feature selection. We also apply this method to early prediction of preeclampsia, where fwelnet outperforms the lasso in terms of 10-fold cross-validated area under the curve (0.86 vs. 0.80). We also provide a connection between fwelnet and the group lasso and suggest how fwelnet might be used for multi-task learning.
View details for DOI 10.5705/ss.202020.0226
View details for PubMedID 37102071
-
Principal component analysis
NATURE REVIEWS METHODS PRIMERS
2022; 2 (1)
View details for DOI 10.1038/s43586-022-00184-w
View details for Web of Science ID 000903008400001
-
Confounds in neuroimaging: A clear case of sex as a confound in brain-based prediction.
Frontiers in neurology
2022; 13: 960760
Abstract
Muscle weakness is common in many neurological, neuromuscular, and musculoskeletal conditions. Muscle size only partially explains muscle strength as adaptations within the nervous system also contribute to strength. Brain-based biomarkers of neuromuscular function could provide diagnostic, prognostic, and predictive value in treating these disorders. Therefore, we sought to characterize and quantify the brain's contribution to strength by developing multimodal MRI pipelines to predict grip strength. However, the prediction of strength was not straightforward, and we present a case of sex being a clear confound in brain decoding analyses. Each MRI modality, namely structural MRI (i.e., gray matter morphometry), diffusion MRI (i.e., white matter fractional anisotropy), resting state functional MRI (i.e., functional connectivity), and task-evoked functional MRI (i.e., left or right hand motor task activation), as well as a multimodal prediction pipeline, demonstrated significant predictive power for strength (R² = 0.108 to 0.536, p ≤ 0.001); after correcting for sex, however, the predictive power was substantially reduced (R² = -0.038 to 0.075). Next, we flipped the analysis and demonstrated that each MRI modality and a multimodal prediction pipeline could significantly predict sex (accuracy = 68.0%-93.3%, AUC = 0.780-0.982, p < 0.001). However, correcting the brain features for strength reduced the accuracy for predicting sex (accuracy = 57.3%-69.3%, AUC = 0.615-0.780). Here we demonstrate the effects of sex-correlated confounds in brain-based predictive models across multiple brain MRI modalities for both regression and classification models. We discuss implications of confounds in predictive modeling and the development of brain-based MRI biomarkers, as well as possible strategies to overcome these barriers.
View details for DOI 10.3389/fneur.2022.960760
View details for PubMedID 36601297
View details for PubMedCentralID PMC9806266
-
Shark detection and classification with machine learning
ECOLOGICAL INFORMATICS
2022; 69
View details for DOI 10.1016/j.ecoinf.2022.101673
View details for Web of Science ID 000911465500001
-
Multiclass-penalized logistic regression
COMPUTATIONAL STATISTICS & DATA ANALYSIS
2022; 169
View details for DOI 10.1016/j.csda.2021.107414
View details for Web of Science ID 000751459600011
-
SURPRISES IN HIGH-DIMENSIONAL RIDGELESS LEAST SQUARES INTERPOLATION
ANNALS OF STATISTICS
2022; 50 (2): 949-986
View details for DOI 10.1214/21-AOS2133
View details for Web of Science ID 000780956100013
-
BACKFITTING FOR LARGE SCALE CROSSED RANDOM EFFECTS REGRESSIONS
ANNALS OF STATISTICS
2022; 50 (1): 560-583
View details for DOI 10.1214/21-AOS2121
View details for Web of Science ID 000758697800023
-
Scalable logistic regression with crossed random effects
ELECTRONIC JOURNAL OF STATISTICS
2022; 16 (2): 4604-4635
View details for DOI 10.1214/22-EJS2047
View details for Web of Science ID 000953164900017
-
Author Correction: Genetics of 35 blood and urine biomarkers in the UK Biobank.
Nature genetics
2021
View details for DOI 10.1038/s41588-021-00956-2
View details for PubMedID 34608296
-
Canonical correlation analysis in high dimensions with structured regularization
STATISTICAL MODELLING
2021
View details for DOI 10.1177/1471082X211041033
View details for Web of Science ID 000705084300001
-
Prediction of Cognitive Function with Multimodal Brain MRI
WILEY. 2021: S42
View details for Web of Science ID 000704705300063
-
Author Correction: An inflammatory aging clock (iAge) based on deep learning tracks multimorbidity, immunosenescence, frailty and cardiovascular aging.
Nature aging
2021; 1 (8): 748
View details for DOI 10.1038/s43587-021-00102-x
View details for PubMedID 37117770
-
Corrigendum to: Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank.
Biostatistics (Oxford, England)
2021
View details for DOI 10.1093/biostatistics/kxab019
View details for PubMedID 34269393
-
An inflammatory aging clock (iAge) based on deep learning tracks multimorbidity, immunosenescence, frailty and cardiovascular aging.
Nature aging
2021; 1: 598-615
Abstract
While many diseases of aging have been linked to the immunological system, immune metrics capable of identifying the most at-risk individuals are lacking. From the blood immunome of 1,001 individuals aged 8-96 years, we developed a deep-learning method based on patterns of systemic age-related inflammation. The resulting inflammatory clock of aging (iAge) tracked with multimorbidity, immunosenescence, frailty and cardiovascular aging, and is also associated with exceptional longevity in centenarians. The strongest contributor to iAge was the chemokine CXCL9, which was involved in cardiac aging, adverse cardiac remodeling and poor vascular function. Furthermore, aging endothelial cells in humans and mice show loss of function, cellular senescence and hallmark phenotypes of arterial stiffness, all of which are reversed by silencing CXCL9. In conclusion, we identify a key role of CXCL9 in age-related chronic inflammation and derive a metric for multimorbidity that can be utilized for the early detection of age-related clinical phenotypes.
View details for DOI 10.1038/s43587-021-00082-y
View details for PubMedID 34888528
-
Fast Numerical Optimization for Genome Sequencing Data in Population Biobanks.
Bioinformatics (Oxford, England)
2021
Abstract
MOTIVATION: Large-scale and high-dimensional genome sequencing data poses computational challenges. General purpose optimization tools are usually not optimal in terms of computational and memory performance for genetic data. RESULTS: We develop two efficient solvers for optimization problems arising from large-scale regularized regressions on millions of genetic variants sequenced from hundreds of thousands of individuals. These genetic variants are encoded by the values in the set {0, 1, 2, NA}. We take advantage of this fact and use two bits to represent each entry in a genetic matrix, which reduces memory requirement by a factor of 32 compared to a double precision floating point representation. Using this representation, we implemented an iteratively reweighted least squares algorithm to solve Lasso regressions on genetic matrices, which we name snpnet-2.0. When the dataset contains many rare variants, the predictors can be encoded in a sparse matrix. We utilize the sparsity in the predictor matrix to further reduce memory requirement and computation time. Our sparse genetic matrix implementation uses both the compact 2-bit representation and a simplified version of compressed sparse block format so that matrix-vector multiplications can be effectively parallelized on multiple CPU cores. To demonstrate the effectiveness of this representation, we implement an accelerated proximal gradient method to solve group Lasso on these sparse genetic matrices. This solver is named sparse-snpnet, and will also be included as part of the snpnet R package. Our implementation is able to solve Lasso and group Lasso, linear, logistic and Cox regression problems on sparse genetic matrices that contain 1,000,000 variants and almost 100,000 individuals within 10 minutes and using less than 32 GB of memory. AVAILABILITY: https://github.com/rivas-lab/snpnet/tree/compact.
View details for DOI 10.1093/bioinformatics/btab452
View details for PubMedID 34146108
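The factor-of-32 memory saving mentioned in the abstract above follows from storing each genotype in 2 bits instead of a 64-bit double (64 / 2 = 32). The sketch below packs four genotypes into each byte and checks that ratio. It is an illustration of the general idea only, not the snpnet implementation; the code `3` for a missing value and the function names `pack_genotypes` and `unpack_genotypes` are assumptions.

```python
import numpy as np

NA_CODE = 3  # hypothetical 2-bit code for a missing genotype (NA)

def pack_genotypes(g):
    """Pack values in {0, 1, 2, NA_CODE} into 2 bits each, four per byte."""
    g = np.asarray(g, dtype=np.uint8)
    pad = (-len(g)) % 4  # pad with zeros to a multiple of four entries
    g = np.concatenate([g, np.zeros(pad, dtype=np.uint8)])
    q = g.reshape(-1, 4)
    packed = q[:, 0] | (q[:, 1] << 2) | (q[:, 2] << 4) | (q[:, 3] << 6)
    return packed.astype(np.uint8)

def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from the packed bytes."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    vals = (packed[:, None] >> shifts) & 3
    return vals.reshape(-1)[:n].astype(np.uint8)

# Simulated genotype column; not real data.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 4, size=1000).astype(np.uint8)

packed = pack_genotypes(genotypes)
restored = unpack_genotypes(packed, len(genotypes))
ratio = genotypes.astype(np.float64).nbytes / packed.nbytes  # 8000 / 250
```

The ratio is exactly 32 here because 1000 is a multiple of four; otherwise the padding adds up to one extra byte per column.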
-
Wearable sensors enable personalized predictions of clinical laboratory measurements.
Nature medicine
2021
Abstract
Vital signs, including heart rate and body temperature, are useful in detecting or monitoring medical conditions, but are typically measured in the clinic and require follow-up laboratory testing for more definitive diagnoses. Here we examined whether vital signs as measured by consumer wearable devices (that is, continuously monitored heart rate, body temperature, electrodermal activity and movement) can predict clinical laboratory test results using machine learning models, including random forest and Lasso models. Our results demonstrate that vital sign data collected from wearables give a more consistent and precise depiction of resting heart rate than do measurements taken in the clinic. Vital sign data collected from wearables can also predict several clinical laboratory measurements with lower prediction error than predictions made using clinically obtained vital sign measurements. The length of time over which vital signs are monitored and the proximity of the monitoring period to the date of prediction play a critical role in the performance of the machine learning models. These results demonstrate the value of commercial wearable devices for continuous and longitudinal assessment of physiological measurements that today can be measured only with clinical laboratory tests.
View details for DOI 10.1038/s41591-021-01339-0
View details for PubMedID 34031607
-
Relating whole-brain functional connectivity to self-reported negative emotion in a large sample of young adults using group regularized canonical correlation analysis.
NeuroImage
2021: 118137
Abstract
The goal of our study was to use functional connectivity to map brain function to self-reports of negative emotion. In a large dataset of healthy individuals derived from the Human Connectome Project (N=652), first we quantified functional connectivity during a negative face-matching task to isolate patterns induced by emotional stimuli. Then, we did the same in a complementary task-free resting state condition. To identify the relationship between functional connectivity in these two conditions and self-reports of negative emotion, we introduce group regularized canonical correlation analysis (GRCCA), a novel algorithm extending canonical correlation analysis to model the shared common properties of functional connectivity within established brain networks. To minimize overfitting, we optimized the regularization parameters of GRCCA using cross-validation and tested the significance of our results in a held-out portion of the data set using permutations. GRCCA consistently outperformed plain regularized canonical correlation analysis. The only canonical correlation that generalized to the held-out test set was based on resting state data (r=0.175, permutation test p=0.021). This canonical correlation loaded primarily on Anger-aggression. It showed high loadings in the cingulate, orbitofrontal, superior parietal, auditory and visual cortices, as well as in the insula. Subcortically, we observed high loadings in the globus pallidus. Regarding brain networks, it loaded primarily on the primary visual, orbito-affective and ventral multimodal networks. Here, we present the first neuroimaging application of GRCCA, a novel algorithm for regularized canonical correlation analyses that takes into account grouping of the variables during the regularization scheme.
Using GRCCA, we demonstrate that functional connections involving the visual, orbito-affective and multimodal networks are promising targets for investigating functional correlates of subjective anger and aggression. Crucially, our approach and findings also highlight the need for cross-validation, regularization and testing on held-out data in correlational neuroimaging studies to avoid inflated effects.
View details for DOI 10.1016/j.neuroimage.2021.118137
View details for PubMedID 33951512
-
Multi-Muscle Deep Learning Segmentation to Automate the Quantification of Muscle Fat Infiltration in Cervical Spine Conditions
CHURCHILL LIVINGSTONE. 2021: 600-601
View details for Web of Science ID 000661623200100
-
Assessment of heterogeneous treatment effect estimation accuracy via matching.
Statistics in medicine
2021
Abstract
We study the assessment of the accuracy of heterogeneous treatment effect (HTE) estimation, where the HTE is not directly observable so standard computation of prediction errors is not applicable. To tackle the difficulty, we propose an assessment approach by constructing pseudo-observations of the HTE based on matching. Our contributions are three-fold: first, we introduce a novel matching distance derived from proximity scores in random forests; second, we formulate the matching problem as an average minimum-cost flow problem and provide an efficient algorithm; third, we propose a match-then-split principle for the assessment with cross-validation. We demonstrate the efficacy of the assessment approach using simulations and a real dataset.
View details for DOI 10.1002/sim.9010
View details for PubMedID 33915600
-
Survival Analysis on Rare Events Using Group-Regularized Multi-Response Cox Regression.
Bioinformatics (Oxford, England)
2021
Abstract
MOTIVATION: The prediction performance of the Cox proportional hazards model suffers when there are only a few uncensored events in the training data. RESULTS: We propose a Sparse-Group regularized Cox regression method to improve the prediction performance of large-scale and high-dimensional survival data with few observed events. Our approach is applicable when there are one or more other survival responses that (1) have a large number of observed events and (2) share a common set of associated predictors with the rare event response. This scenario is common in the UK Biobank (Sudlow et al., 2015) dataset where records for a large number of common and less prevalent diseases of the same set of individuals are available. By analyzing these responses together, we hope to achieve higher prediction performance than when they are analyzed individually. To make this approach practical for large-scale data, we developed an accelerated proximal gradient optimization algorithm as well as a screening procedure inspired by Qian et al. (2020). AVAILABILITY: https://github.com/rivas-lab/multisnpnet-Cox. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btab095
View details for PubMedID 33560296
-
Polygenic risk modeling with latent trait-related genetic components.
European journal of human genetics : EJHG
2021
Abstract
Polygenic risk models have led to significant advances in understanding complex diseases and their clinical presentation. While polygenic risk scores (PRS) can effectively predict outcomes, they do not generally account for disease subtypes or pathways which underlie within-trait diversity. Here, we introduce a latent factor model of genetic risk based on components from Decomposition of Genetic Associations (DeGAs), which we call the DeGAs polygenic risk score (dPRS). We compute DeGAs using genetic associations for 977 traits and find that dPRS performs comparably to standard PRS while offering greater interpretability. We show how to decompose an individual's genetic risk for a trait across DeGAs components, with examples for body mass index (BMI) and myocardial infarction (heart attack) in 337,151 white British individuals in the UK Biobank, with replication in a further set of 25,486 non-British white individuals. We find that BMI polygenic risk factorizes into components related to fat-free mass, fat mass, and overall health indicators like physical activity. Most individuals with high dPRS for BMI have strong contributions from both a fat-mass component and a fat-free mass component, whereas a few "outlier" individuals have strong contributions from only one of the two components. Overall, our method enables fine-scale interpretation of the drivers of genetic risk for complex traits.
View details for DOI 10.1038/s41431-021-00813-0
View details for PubMedID 33558700
-
Genetics of 35 blood and urine biomarkers in the UK Biobank.
Nature genetics
2021
Abstract
Clinical laboratory tests are a critical component of the continuum of care. We evaluate the genetic basis of 35 blood and urine laboratory measurements in the UK Biobank (n=363,228 individuals). We identify 1,857 loci associated with at least one trait, containing 3,374 fine-mapped associations and additional sets of large-effect (>0.1 s.d.) protein-altering, human leukocyte antigen (HLA) and copy number variant (CNV) associations. Through Mendelian randomization (MR) analysis, we discover 51 causal relationships, including previously known agonistic effects of urate on gout and cystatin C on stroke. Finally, we develop polygenic risk scores (PRSs) for each biomarker and build 'multi-PRS' models for diseases using 35 PRSs simultaneously, which improved chronic kidney disease, type 2 diabetes, gout and alcoholic cirrhosis genetic risk stratification in an independent dataset (FinnGen; n=135,500) relative to single-disease PRSs. Together, our results delineate the genetic basis of biomarkers and their causal influences on diseases and improve genetic risk stratification for common diseases.
View details for DOI 10.1038/s41588-020-00757-z
View details for PubMedID 33462484
-
Multi-muscle deep learning segmentation to automate the quantification of muscle fat infiltration in cervical spine conditions.
Scientific reports
2021; 11 (1): 16567
Abstract
Muscle fat infiltration (MFI) has been widely reported across cervical spine disorders. The quantification of MFI requires time-consuming and rater-dependent manual segmentation techniques. A convolutional neural network (CNN) model was trained to segment seven cervical spine muscle groups (left and right muscles segmented separately, 14 muscles total) from Dixon MRI scans (n = 17, 17 scans < 2 weeks post motor vehicle collision (MVC), and 17 scans 12 months post MVC). The CNN MFI measures demonstrated high test reliability and accuracy in an independent testing dataset (n = 18, 9 scans < 2 weeks post MVC, and 9 scans 12 months post MVC). Using the CNN in 84 participants with scans < 2 weeks post MVC (61 females, 23 males, age = 34.2 ± 10.7 years), differences in MFI between the muscle groups and relationships between MFI and sex, age, and body mass index (BMI) were explored. Averaging across all muscles, females had significantly higher MFI than males (p = 0.026). The deep cervical muscles demonstrated significantly greater MFI than the more superficial muscles (p < 0.001), and only MFI within the deep cervical muscles was moderately correlated to age (r > 0.300, p ≤ 0.001). CNNs allow for the accurate, rapid, and quantitative assessment of the composition of the architecturally complex muscles traversing the cervical spine. Acknowledging the wider reports of MFI in cervical spine disorders and the time required to manually segment the individual muscles, this CNN may have diagnostic, prognostic, and predictive value in disorders of the cervical spine.
View details for DOI 10.1038/s41598-021-95972-x
View details for PubMedID 34400672
-
Discussion of "Prediction, Estimation, and Attribution" by Bradley Efron
INTERNATIONAL STATISTICAL REVIEW
2020; 88: S73–S74
View details for DOI 10.1111/insr.12414
View details for Web of Science ID 000603161400008
-
Principal curve approaches for inferring 3D chromatin architecture.
Biostatistics (Oxford, England)
2020
Abstract
Three-dimensional (3D) genome spatial organization is critical for numerous cellular processes, including transcription, while certain conformation-driven structural alterations are frequently oncogenic. Genome architecture had been notoriously difficult to elucidate, but the advent of the suite of chromatin conformation capture assays, notably Hi-C, has transformed understanding of chromatin structure and provided downstream biological insights. Although many findings have flowed from direct analysis of the pairwise proximity data produced by these assays, there is added value in generating corresponding 3D reconstructions deriving from superposing genomic features on the reconstruction. Accordingly, many methods for inferring 3D architecture from proximity data have been advanced. However, none of these approaches exploit the fact that single chromosome solutions constitute a one-dimensional (1D) curve in 3D. Rather, this aspect has either been addressed by imposition of constraints, which is both computationally burdensome and cell type specific, or ignored with contiguity imposed after the fact. Here, we target finding a 1D curve by extending principal curve methodology to the metric scaling problem. We illustrate how this approach yields a sequence of candidate solutions, indexed by an underlying smoothness or degrees-of-freedom parameter, and propose methods for selection from this sequence. We apply the methodology to Hi-C data obtained on IMR90 cells and so are positioned to evaluate reconstruction accuracy by referencing orthogonal imaging data. The results indicate the utility and reproducibility of our principal curve approach in the face of underlying structural variation.
View details for DOI 10.1093/biostatistics/kxaa046
View details for PubMedID 33221831
-
Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on Extensive Comparisons
STATISTICAL SCIENCE
2020; 35 (4): 579–92
View details for DOI 10.1214/19-STS733
View details for Web of Science ID 000591728200002
-
Rejoinder: Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on Extensive Comparisons
STATISTICAL SCIENCE
2020; 35 (4): 625–26
View details for DOI 10.1214/20-STS733REJ
View details for Web of Science ID 000591728200007
-
Integration of mechanistic immunological knowledge into a machine learning pipeline improves predictions
NATURE MACHINE INTELLIGENCE
2020
View details for DOI 10.1038/s42256-020-00232-8
View details for Web of Science ID 000579336000001
-
Brain Strength: Multi-Modal Brain MRI Predicts Grip Strength
WILEY. 2020: S223–S224
View details for Web of Science ID 000572509100411
-
Integration of mechanistic immunological knowledge into a machine learning pipeline improves predictions.
Nature machine intelligence
2020; 2 (10): 619-628
Abstract
The dense network of interconnected cellular signalling responses that are quantifiable in peripheral immune cells provides a wealth of actionable immunological insights. Although high-throughput single-cell profiling techniques, including polychromatic flow and mass cytometry, have matured to a point that enables detailed immune profiling of patients in numerous clinical settings, the limited cohort size and high dimensionality of data increase the possibility of false-positive discoveries and model overfitting. We introduce a generalizable machine learning platform, the immunological Elastic-Net (iEN), which incorporates immunological knowledge directly into the predictive models. Importantly, the algorithm maintains the exploratory nature of the high-dimensional dataset, allowing for the inclusion of immune features with strong predictive capabilities even if not consistent with prior knowledge. In three independent studies our method demonstrates improved predictions for clinically relevant outcomes from mass cytometry data generated from whole blood, as well as a large simulated dataset. The iEN is available under an open-source licence.
View details for DOI 10.1038/s42256-020-00232-8
View details for PubMedID 33294774
View details for PubMedCentralID PMC7720904
-
Ridge Regularization: An Essential Concept in Data Science
TECHNOMETRICS
2020
View details for DOI 10.1080/00401706.2020.1791959
View details for Web of Science ID 000557953000001
-
Projected geographic disparities in healthcare worker absenteeism from COVID-19 school closures and the economic feasibility of child care subsidies: a simulation study.
BMC medicine
2020; 18 (1): 218
Abstract
BACKGROUND: School closures have been enacted as a measure of mitigation during the ongoing coronavirus disease 2019 (COVID-19) pandemic. It has been shown that school closures could cause absenteeism among healthcare workers with dependent children, but there remains a need for spatially granular analyses of the relationship between school closures and healthcare worker absenteeism to inform local community preparedness. METHODS: We provide national- and county-level simulations of school closures and unmet child care needs across the USA. We develop individual simulations using county-level demographic and occupational data, and model school closure effectiveness with age-structured compartmental models. We perform multivariate quasi-Poisson ecological regressions to find associations between unmet child care needs and COVID-19 vulnerability factors. RESULTS: At the national level, we estimate the projected rate of unmet child care needs for healthcare worker households to range from 7.4 to 8.7%, and the effectiveness of school closures as a 7.6% and 8.4% reduction in hospital and intensive care unit (ICU) beds, respectively, at peak demand when varying across initial reproduction number estimates by state. At the county level, we find substantial variations of projected unmet child care needs and school closure effects, 9.5% (interquartile range (IQR) 8.2-10.9%) of healthcare worker households and 5.2% (IQR 4.1-6.5%) and 6.8% (IQR 4.8-8.8%) reduction in hospital and ICU beds, respectively, at peak demand. We find significant positive associations between estimated levels of unmet child care needs and diabetes prevalence, county rurality, and race (p<0.05).
We estimate costs of absenteeism and child care and observe from our models that an estimated 76.3 to 96.8% of counties would find it less expensive to provide child care to all healthcare workers with children than to bear the costs of healthcare worker absenteeism during school closures. CONCLUSIONS: School closures are projected to reduce peak ICU and hospital demand, but could disrupt healthcare systems through absenteeism, especially in counties that are already particularly vulnerable to COVID-19. Child care subsidies could help circumvent the ostensible trade-off between school closures and healthcare worker absenteeism.
View details for DOI 10.1186/s12916-020-01692-w
View details for PubMedID 32664927
-
Molecular Transducers of Physical Activity Consortium (MoTrPAC): Mapping the Dynamic Responses to Exercise.
Cell
2020; 181 (7): 1464–74
Abstract
Exercise provides a robust physiological stimulus that evokes cross-talk among multiple tissues and that, when repeated regularly (i.e., training), improves physiological capacity, benefits numerous organ systems, and decreases the risk for premature mortality. However, a gap remains in identifying the detailed molecular signals induced by exercise that benefit health and prevent disease. The Molecular Transducers of Physical Activity Consortium (MoTrPAC) was established to address this gap and generate a molecular map of exercise. Preclinical and clinical studies will examine the systemic effects of endurance and resistance exercise across a range of ages and fitness levels by molecular probing of multiple tissues before and after acute and chronic exercise. From this multi-omic and bioinformatic analysis, a molecular map of exercise will be established. Altogether, MoTrPAC will provide a public database that is expected to enhance our understanding of the health benefits of exercise and to provide insight into how physical activity mitigates disease.
View details for DOI 10.1016/j.cell.2020.06.004
View details for PubMedID 32589957
-
Discussion of "Prediction, Estimation, and Attribution" by Bradley Efron
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2020; 115 (530): 665–66
View details for DOI 10.1080/01621459.2020.1762617
View details for Web of Science ID 000538423300016
-
The human connectome project for disordered emotional states: Protocol and rationale for a research domain criteria study of brain connectivity in young adult anxiety and depression.
NeuroImage
2020: 116715
Abstract
Through the Human Connectome Project (HCP) our understanding of the functional connectome of the healthy brain has been dramatically accelerated. Given the pressing public health need, we must increase our understanding of how connectome dysfunctions give rise to disordered mental states. Mental disorders arising from high levels of negative emotion or from the loss of positive emotional experience affect over 400 million people globally. Such states of disordered emotion cut across multiple diagnostic categories of mood and anxiety disorders and are compounded by accompanying disruptions in cognitive function. Not surprisingly, these forms of psychopathology are the leading cause of disability worldwide. The Research Domain Criteria (RDoC) initiative spearheaded by NIMH offers a framework for characterizing the relations among connectome dysfunctions, anchored in neural circuits and phenotypic profiles of behavior and self-reported symptoms. Here, we report on our Connectomes Related to Human Disease protocol for integrating an RDoC framework with HCP protocols to characterize connectome dysfunctions in disordered emotional states, and present quality control data from a representative sample of participants. We focus on three RDoC domains and constructs most relevant to depression and anxiety: 1) loss and acute threat within the Negative Valence System (NVS) domain; 2) reward valuation and responsiveness within the Positive Valence System (PVS) domain; and 3) working memory and cognitive control within the Cognitive System (CS) domain. For 29 healthy controls, we present preliminary imaging data: functional magnetic resonance imaging collected in the resting state and in tasks matching our constructs of interest ("Emotion", "Gambling" and "Continuous Performance" tasks), as well as diffusion-weighted imaging. All functional scans demonstrated good signal-to-noise ratio. 
Established neural networks were robustly identified in the resting state condition by independent component analysis. Processing of negative emotional faces significantly activated the bilateral dorsolateral prefrontal and occipital cortices, fusiform gyrus and amygdalae. Reward elicited a response in the bilateral dorsolateral prefrontal, parietal and occipital cortices, and in the striatum. Working memory was associated with activation in the dorsolateral prefrontal, parietal, motor, temporal and insular cortices, in the striatum and cerebellum. Diffusion tractography showed consistent profiles of fractional anisotropy along known white matter tracts. We also show that results are comparable to those in a matched sample from the HCP Healthy Young Adult data release. These preliminary data provide the foundation for acquisition of 250 subjects who are experiencing disordered emotional states. When complete, these data will be used to develop a neurobiological model that maps connectome dysfunctions to specific behaviors and symptoms.
View details for DOI 10.1016/j.neuroimage.2020.116715
View details for PubMedID 32147367
-
Decreasing human body temperature in the United States since the industrial revolution.
eLife
2020; 9
Abstract
In the US, the normal, oral temperature of adults is, on average, lower than the canonical 37°C established in the 19th century. We postulated that body temperature has decreased over time. Using measurements from three cohorts--the Union Army Veterans of the Civil War (N = 23,710; measurement years 1860-1940), the National Health and Nutrition Examination Survey I (N = 15,301; 1971-1975), and the Stanford Translational Research Integrated Database Environment (N = 150,280; 2007-2017)--we determined that mean body temperature in men and women, after adjusting for age, height, weight and, in some models, date and time of day, has decreased monotonically by 0.03°C per birth decade. A similar decline within the Union Army cohort as between cohorts makes measurement error an unlikely explanation. This substantive and continuing shift in body temperature--a marker for metabolic rate--provides a framework for understanding changes in human health and longevity over 157 years.
View details for DOI 10.7554/eLife.49555
View details for PubMedID 31908267
-
Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank.
Biostatistics (Oxford, England)
2020
Abstract
We develop a scalable and highly efficient algorithm to fit a Cox proportional hazards model by maximizing the $L^1$-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale and high-dimensional data that do not fit in memory. The output of our algorithm is the full Lasso path, the parameter estimates at all predefined regularization parameters, as well as their validation accuracy measured using the concordance index (C-index) or the validation deviance. To demonstrate the effectiveness of our algorithm, we analyze a large genotype-survival time dataset across 306 disease outcomes from the UK Biobank (Sudlow and others, 2015). We provide a publicly available implementation of the proposed approach for genetics data on top of the PLINK2 package and name it snpnet-Cox.
View details for DOI 10.1093/biostatistics/kxaa038
View details for PubMedID 32989444
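The concordance index named in the abstract above as a validation metric can be illustrated with a small sketch. This is a naive O(n^2) version of Harrell's C-index, written here for illustration only; the function name and the toy data are hypothetical and are not taken from snpnet-Cox.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Naive Harrell's C-index for right-censored survival data.

    A pair is comparable when the subject with the shorter observed time
    had an event (events[k] is truthy). The pair is concordant when that
    subject also has the higher risk score; tied scores count as half.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # tied observed times are skipped for simplicity
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored: pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example (made-up numbers): four subjects, the third is censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.8, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risks)
```

In the toy data five pairs are comparable and four of them are ordered correctly by the risk score, so the index is 0.8. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering.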
-
Projected geographic disparities in healthcare worker absenteeism from COVID-19 school closures and the economic feasibility of child care subsidies: a simulation study.
medRxiv : the preprint server for health sciences
2020
Abstract
School closures have been enacted as a measure of mitigation during the ongoing COVID-19 pandemic. It has been shown that school closures could cause absenteeism amongst healthcare workers with dependent children, but there remains a need for spatially granular analyses of the relationship between school closures and healthcare worker absenteeism to inform local community preparedness. We provide national- and county-level simulations of school closures and unmet child care needs across the United States. We develop individual simulations using county-level demographic and occupational data, and model school closure effectiveness with age-structured compartmental models. We perform multivariate quasi-Poisson ecological regressions to find associations between unmet child care needs and COVID-19 vulnerability factors. At the national level, we estimate the projected rate of unmet child care needs for healthcare worker households to range from 7.5% to 8.6%, and the effectiveness of school closures to range from 3.2% (R0 = 4) to 7.2% (R0 = 2) reduction in ICU beds at peak demand. At the county level, we find substantial variations of projected unmet child care needs and school closure effects, ranging from 1.9% to 18.3% of healthcare worker households and 5.7% to 8.8% reduction in ICU beds at peak demand (R0 = 2). We find significant positive associations between estimated levels of unmet child care needs and diabetes prevalence, county rurality, and race (p < 0.05). We estimate costs of absenteeism and child care and observe from our models that an estimated 71.1% to 98.8% of counties would find it less expensive to provide child care to all healthcare workers with children than to bear the costs of healthcare worker absenteeism during school closures. School closures are projected to reduce peak ICU bed demand, but could disrupt healthcare systems through absenteeism, especially in counties that are already particularly vulnerable to COVID-19.
Child care subsidies could help circumvent the ostensible tradeoff between school closures and healthcare worker absenteeism.
View details for DOI 10.1101/2020.03.19.20039404
View details for PubMedID 32511455
View details for PubMedCentralID PMC7239083
-
Perioperative analgesic administration during the 2018 parenteral opioid shortage in the United States - A retrospective analysis.
Journal of clinical anesthesia
2020; 66: 109892
View details for DOI 10.1016/j.jclinane.2020.109892
View details for PubMedID 32502773
-
Detection of Circulating Tumor DNA in Patients With Uterine Leiomyomas
JCO PRECISION ONCOLOGY
2019; 3
View details for DOI 10.1200/PO.18.00409
View details for Web of Science ID 000491150900001
-
Detection of Circulating Tumor DNA in Patients With Uterine Leiomyomas.
JCO precision oncology
2019; 3
Abstract
The preoperative distinction between uterine leiomyoma (LM) and leiomyosarcoma (LMS) is difficult, which may result in dissemination of an unexpected malignancy during surgery for a presumed benign lesion. An assay based on circulating tumor DNA (ctDNA) could help in the preoperative distinction between LM and LMS. This study addresses the feasibility of applying the two most frequently used approaches for detection of ctDNA: profiling of copy number alterations (CNAs) and point mutations in the plasma of patients with LM. By shallow whole-genome sequencing, we prospectively examined whether LM-derived ctDNA could be detected in plasma specimens of 12 patients. Plasma levels of lactate dehydrogenase, a marker suggested for the distinction between LM and LMS by prior studies, were also determined. We also profiled 36 LM tumor specimens by exome sequencing to develop a panel for targeted detection of point mutations in ctDNA of patients with LM. We identified tumor-derived CNAs in the plasma DNA of 50% (six of 12) of patients with LM. The lactate dehydrogenase levels did not allow for an accurate distinction between patients with LM and patients with LMS. We identified only two recurrently mutated genes in LM tumors (MED12 and ACLY). Our results show that LMs do shed DNA into the circulation, which provides an opportunity for the development of ctDNA-based testing to distinguish LM from LMS. Although we could not design an LM-specific panel for ctDNA profiling, we propose that the detection of CNAs or point mutations in selected tumor suppressor genes in ctDNA may favor a diagnosis of LMS, since these genes are not affected in LM.
View details for DOI 10.1200/po.18.00409
View details for PubMedID 32232185
View details for PubMedCentralID PMC7105159
-
Resting State Functional Connectivity Machine Learning Classification of Chronic Back Pain
WILEY. 2019: S266
View details for Web of Science ID 000488891800418
-
CAUSAL INTERPRETATIONS OF BLACK-BOX MODELS.
Journal of business & economic statistics : a publication of the American Statistical Association
2019; 2019
Abstract
The fields of machine learning and causal inference have developed many concepts, tools, and theory that are potentially useful for each other. Through exploring the possibility of extracting causal interpretations from black-box machine-trained models, we briefly review the languages and concepts in causal inference that may be interesting to machine learning researchers. We start with the curious observation that Friedman's partial dependence plot has exactly the same formula as Pearl's back-door adjustment and discuss three requirements to make causal interpretations: a model with good predictive performance, some domain knowledge in the form of a causal diagram and suitable visualization tools. We provide several illustrative examples and find some interesting and potentially causal relations using visualization tools for black-box models.
View details for DOI 10.1080/07350015.2019.1624293
View details for PubMedID 33132490
View details for PubMedCentralID PMC7597863
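The correspondence mentioned in the abstract above can be written out explicitly. The notation below is chosen here for illustration and is not taken from the paper: g is the fitted black-box model, X_S the features of interest, and X_C the remaining features, which are assumed to form a valid adjustment set.

```latex
% Friedman's partial dependence of a fitted model g on features X_S,
% averaging over the marginal distribution of the remaining features X_C:
g_S(x_S) = \mathbb{E}_{X_C}\!\left[ g(x_S, X_C) \right]
         = \int g(x_S, x_C)\, dP(x_C)

% Pearl's back-door adjustment, assuming X_C satisfies the back-door
% criterion relative to (X_S, Y):
\mathbb{E}\!\left[ Y \mid do(X_S = x_S) \right]
         = \int \mathbb{E}\!\left[ Y \mid X_S = x_S, X_C = x_C \right] dP(x_C)
```

The two expressions coincide when g(x_S, x_C) is a good approximation of E[Y | X_S = x_S, X_C = x_C], which is where the requirement of good predictive performance enters. The causal diagram is what justifies treating X_C as a back-door adjustment set.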
-
Causal Interpretations of Black-Box Models
JOURNAL OF BUSINESS & ECONOMIC STATISTICS
2019
View details for DOI 10.1080/07350015.2019.1624293
View details for Web of Science ID 000474238800001
-
Automated Survival Prediction in Metastatic Cancer Patients Using High-Dimensional Electronic Medical Record Data
JNCI-JOURNAL OF THE NATIONAL CANCER INSTITUTE
2019; 111 (6): 568–74
View details for DOI 10.1093/jnci/djy178
View details for Web of Science ID 000474267400007
-
Deep Learning Convolutional Neural Networks for the Automatic Quantification of Muscle Fat Infiltration Following Whiplash Injury.
Scientific reports
2019; 9 (1): 7973
Abstract
Muscle fat infiltration (MFI) of the deep cervical spine extensors has been observed in cervical spine conditions using time-consuming and rater-dependent manual techniques. Deep learning convolutional neural network (CNN) models have demonstrated state-of-the-art performance in segmentation tasks. Here, we train and test a CNN for muscle segmentation and automatic MFI calculation using high-resolution fat-water images from 39 participants (26 female, average = 31.7 ± 9.3 years) 3 months post whiplash injury. First, we demonstrate high test reliability and accuracy of the CNN compared to manual segmentation. Then we explore the relationships between CNN muscle volume, CNN MFI, and clinical measures of pain and neck-related disability. Across all participants, we demonstrate that CNN muscle volume was negatively correlated to pain (R = -0.415, p = 0.006) and disability (R = -0.286, p = 0.045), while CNN MFI tended to be positively correlated to disability (R = 0.214, p = 0.105). Additionally, CNN MFI was higher in participants with persisting pain and disability (p = 0.049). Overall, CNNs may improve the efficiency and objectivity of muscle measures allowing for the quantitative monitoring of muscle properties in disorders of and beyond the cervical spine.
View details for DOI 10.1038/s41598-019-44416-8
View details for PubMedID 31138878
-
Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight adipocyte biology.
Nature communications
2019; 10 (1): 4064
Abstract
Population-based biobanks with genomic and dense phenotype data provide opportunities for generating effective therapeutic hypotheses and understanding the genomic role in disease predisposition. To characterize latent components of genetic associations, we apply truncated singular value decomposition (DeGAs) to matrices of summary statistics derived from genome-wide association analyses across 2,138 phenotypes measured in 337,199 White British individuals in the UK Biobank study. We systematically identify key components of genetic associations and the contributions of variants, genes, and phenotypes to each component. As an illustration of the utility of the approach to inform downstream experiments, we report putative loss of function variants, rs114285050 (GPR151) and rs150090666 (PDE3B), that substantially contribute to obesity-related traits and experimentally demonstrate the role of these genes in adipocyte biology. Our approach to dissect components of genetic associations across the human phenome will accelerate biomedical hypothesis generation by providing insights on previously unexplored latent structures.
View details for DOI 10.1038/s41467-019-11953-9
View details for PubMedID 31492854
-
A clinico-genomic analysis of soft tissue sarcoma patients reveals CDKN2A deletion as a biomarker for poor prognosis.
Clinical sarcoma research
2019; 9: 12
Abstract
Background: Sarcomas are a rare, heterogeneous group of tumors with variable tendencies for aggressive behavior. Molecular markers for prognosis are needed to risk stratify patients and identify those who might benefit from more intensive therapeutic strategies. Patients and methods: We analyzed somatic tumor genomic profiles and clinical outcomes of 152 soft tissue (STS) and bone sarcoma (BS) patients sequenced at Stanford Cancer Institute as well as 206 STS patients from The Cancer Genome Atlas. Genomic profiles of 7733 STS from the Foundation Medicine database were used to assess the frequency of CDKN2A alterations in histological subtypes of sarcoma. Results: Compared to all other tumor types, sarcomas were found to carry the highest relative percentage of gene amplifications/deletions/fusions and the lowest average mutation count. The most commonly altered genes in STS were TP53 (47%), CDKN2A (22%), RB1 (22%), NF1 (11%), and ATRX (11%). When all genomic alterations were tested for prognostic significance in the specific Stanford cohort of localized STS, only CDKN2A alterations correlated significantly with prognosis, with a hazard ratio (HR) of 2.83 for overall survival (p=0.017). These findings were validated in the TCGA dataset where CDKN2A altered patients had significantly worse overall survival with a HR of 2.7 (p=0.002). Analysis of 7733 STS patients from Foundation One showed high prevalence of CDKN2A alterations in malignant peripheral nerve sheath tumors, myxofibrosarcomas, and undifferentiated pleomorphic sarcomas. Conclusion: Our clinico-genomic profiling of STS shows that CDKN2A deletion was the most prevalent DNA copy number aberration and was associated with poor prognosis.
View details for DOI 10.1186/s13569-019-0122-5
View details for PubMedID 31528332
-
Aortic growth and development of partial false lumen thrombosis are associated with late adverse events in type B aortic dissection.
The Journal of thoracic and cardiovascular surgery
2019
Abstract
Patients with medically treated type B aortic dissection (TBAD) remain at significant risk for late adverse events (LAEs). We hypothesize that not only initial morphological features, but also their change over time at follow-up are associated with LAEs. Baseline and 188 follow-up computed tomography (CT) scans with a median follow-up time of 4 years (range, 10 days to 12.7 years) of 47 patients with acute uncomplicated TBAD were retrospectively reviewed. Morphological features (n = 8) were quantified at baseline and each follow-up. Medical records were reviewed for LAEs, which were defined according to current guidelines. To assess the effects of changes of morphological features over time, the linear mixed effects models were combined with Cox proportional hazards regression for the time-to-event outcome using a joint modeling approach. LAEs occurred in 21 of 47 patients at a median of 6.6 years (95% confidence interval [CI], 5.1-11.2 years). Among the 8 investigated morphological features, the following 3 features showed strong association with LAEs: increase in partial false lumen thrombosis area (hazard ratio [HR], 1.39; 95% CI, 1.18-1.66 per cm² increase; P < .001), increase of major aortic diameter (HR, 1.24; 95% CI, 1.13-1.37 per mm increase; P < .001), and increase in the circumferential extent of false lumen (HR, 1.05; 95% CI, 1.01-1.10 per degree increase; P < .001). In medically treated TBAD, increases in aortic diameter, new or increased partial false lumen thrombosis area, and increases of circumferential extent of the false lumen are strongly associated with LAEs.
View details for DOI 10.1016/j.jtcvs.2019.10.074
View details for PubMedID 31839226
-
Association of cardiovascular events and lipoprotein particle size: Development of a risk score based on functional data analysis.
PloS one
2019; 14 (3): e0213172
Abstract
BACKGROUND: Functional data is data represented by functions (curves or surfaces of a low-dimensional index). Functional data often arise when measurements are collected over time or across locations. In the field of medicine, plasma lipoprotein particles can be quantified according to particle diameter by ion mobility. GOAL: We wanted to evaluate the utility of functional analysis for assessing the association of plasma lipoprotein size distribution with cardiovascular disease after adjustment for established risk factors including standard lipids. METHODS: We developed a model to predict risk of cardiovascular disease among participants in a case-cohort study of the Malmo Prevention Project. We used a linear model with 311 coefficients, corresponding to measures of lipoprotein mass at each of 311 diameters, and assumed these coefficients varied smoothly along the diameter index. The smooth function was represented as an expansion of natural cubic splines where the smoothness parameter was chosen by assessment of a series of nested splines. Cox proportional hazards models of time to a first cardiovascular disease event were used to estimate the smooth coefficient function among a training set consisting of one half of the participants.
The resulting model was used to calculate a functional risk score for the remaining half of the participants (test set) and its association with events was assessed in Cox models that adjusted for traditional cardiovascular risk factors. RESULTS: In the test set, participants with a functional risk score in the highest quartile were found to be at increased risk of cardiovascular events compared with the lowest quartile (Hazard ratio = 1.34; 95% Confidence Interval: 1.05 to 1.70) after adjustment for established risk factors. CONCLUSION: In an independent test set of Malmo Prevention Project participants, the functional risk score was found to be associated with cardiovascular events after adjustment for traditional risk factors including standard lipids.
View details for PubMedID 30845215
-
Nuclear penalized multinomial regression with an application to predicting at bat outcomes in baseball
STATISTICAL MODELLING
2018; 18 (5-6): 388–410
View details for DOI 10.1177/1471082X18777669
View details for Web of Science ID 000452266900002
-
Machine learning in human movement biomechanics: Best practices, common pitfalls, and new opportunities
JOURNAL OF BIOMECHANICS
2018; 81: 1–11
View details for DOI 10.1016/j.jbiomech.2018.09.009
View details for Web of Science ID 000451104300001
-
Automated Survival Prediction in Metastatic Cancer Patients Using High-Dimensional Electronic Medical Record Data.
Journal of the National Cancer Institute
2018
Abstract
Background: Oncologists use patients' life expectancy to guide decisions and may benefit from a tool that accurately predicts prognosis. Existing prognostic models generally use only a few predictor variables. We used an electronic medical record dataset to train a prognostic model for patients with metastatic cancer. Methods: The model was trained and tested using 12588 patients treated for metastatic cancer in the Stanford Health Care system from 2008 to 2017. Data sources included provider note text, labs, vital signs, procedures, medication orders, and diagnosis codes. Patients were divided randomly into a training set used to fit the model coefficients and a test set used to evaluate model performance (80%/20% split). A regularized Cox model with 4126 predictor variables was used. A landmarking approach was used due to the multiple observations per patient, with t0 set to the time of metastatic cancer diagnosis. Performance was also evaluated using 399 palliative radiation courses in test set patients. Results: The C-index for overall survival was 0.786 in the test set (averaged across landmark times). For palliative radiation courses, the C-index was 0.745 (95% confidence interval [CI] = 0.715 to 0.775) compared with 0.635 (95% CI = 0.601 to 0.669) for a published model using performance status, primary tumor site, and treated site (two-sided P<.001). Our model's predictions were well-calibrated. Conclusions: The model showed high predictive performance, which will need to be validated using external data. Because it is fully automated, the model can be used to examine providers' practice patterns and could be deployed in a decision support tool to help improve quality of care.
View details for PubMedID 30346554
-
Automated survival prediction in metastatic cancer patients using high-dimensional electronic medical record data
OXFORD UNIV PRESS. 2018: 548
View details for Web of Science ID 000459277303282
-
Machine learning in human movement biomechanics: Best practices, common pitfalls, and new opportunities.
Journal of biomechanics
2018
Abstract
Traditional laboratory experiments, rehabilitation clinics, and wearable sensors offer biomechanists a wealth of data on healthy and pathological movement. To harness the power of these data and make research more efficient, modern machine learning techniques are starting to complement traditional statistical tools. This survey summarizes the current usage of machine learning methods in human movement biomechanics and highlights best practices that will enable critical evaluation of the literature. We carried out a PubMed/Medline database search for original research articles that used machine learning to study movement biomechanics in patients with musculoskeletal and neuromuscular diseases. Most studies that met our inclusion criteria focused on classifying pathological movement, predicting risk of developing a disease, estimating the effect of an intervention, or automatically recognizing activities to facilitate out-of-clinic patient monitoring. We found that research studies build and evaluate models inconsistently, which motivated our discussion of best practices. We provide recommendations for training and evaluating machine learning models and discuss the potential of several underutilized approaches, such as deep learning, to generate new knowledge about human movement. We believe that cross-training biomechanists in data science and a cultural shift toward sharing of data and tools are essential to maximize the impact of biomechanics research.
View details for PubMedID 30279002
-
Modeling and Predicting Osteoarthritis Progression: Data from the Osteoarthritis Initiative.
Osteoarthritis and cartilage
2018
Abstract
OBJECTIVE: The goal of this study was to model the longitudinal progression of knee osteoarthritis (OA) and build a prognostic tool that uses data collected in one year to predict disease progression over eight years. DESIGN: To model OA progression, we used a mixed-effects mixture model and eight-year data from the Osteoarthritis Initiative, specifically joint space width measurements from X-rays and pain scores from the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) questionnaire. We included 1243 subjects who at enrollment were classified as being at high risk of developing OA based on age, body mass index, and medical and occupational histories. After clustering subjects based on radiographic and pain progression, we used clinical variables collected within the first year to build LASSO regression models for predicting the probabilities of belonging to each cluster. Areas under the receiver operating characteristic curve (AUC) represent predictive performance on held-out data. RESULTS: Based on joint space narrowing, subjects clustered as progressing or non-progressing. Based on pain scores, they clustered as stable, improving, or worsening. Radiographic progression could be predicted with high accuracy (AUC = .86) using data from two visits spanning one year, whereas pain progression could be predicted with high accuracy (AUC = .95) using data from a single visit. Joint space narrowing and pain progression were not associated. CONCLUSION: Statistical models for characterizing and predicting OA progression promise to improve clinical trial design and OA prevention efforts in the future.
View details for PubMedID 30130590
-
Proteomic analysis of monolayer-integrated proteins on lipid droplets identifies amphipathic interfacial alpha-helical membrane anchors.
Proceedings of the National Academy of Sciences of the United States of America
2018
Abstract
Despite not spanning phospholipid bilayers, monotopic integral proteins (MIPs) play critical roles in organizing biochemical reactions on membrane surfaces. Defining the structural basis by which these proteins are anchored to membranes has been hampered by the paucity of unambiguously identified MIPs and a lack of computational tools that accurately distinguish monolayer-integrating motifs from bilayer-spanning transmembrane domains (TMDs). We used quantitative proteomics and statistical modeling to identify 87 high-confidence candidate MIPs in lipid droplets, including 21 proteins with predicted TMDs that cannot be accommodated in these monolayer-enveloped organelles. Systematic cysteine-scanning mutagenesis showed the predicted TMD of one candidate MIP, DHRS3, to be a partially buried amphipathic alpha-helix in both lipid droplet monolayers and the cytoplasmic leaflet of endoplasmic reticulum membrane bilayers. Coarse-grained molecular dynamics simulations support these observations, suggesting that this helix is most stable at the solvent-membrane interface. The simulations also predicted similar interfacial amphipathic helices when applied to seven additional MIPs from our dataset. Our findings suggest that interfacial helices may be a common motif by which MIPs are integrated into membranes, and provide high-throughput methods to identify and study MIPs.
View details for PubMedID 30104359
-
Physical activity is associated with changes in knee cartilage microstructure
OSTEOARTHRITIS AND CARTILAGE
2018; 26 (6): 770–74
Abstract
The purpose of this study was to determine if there is an association between objectively measured physical activity and longitudinal changes in knee cartilage microstructure. We used accelerometry and T2-weighted magnetic resonance imaging (MRI) data from the Osteoarthritis Initiative, restricting the analysis to men aged 45-60 years, with a body mass index (BMI) of 25-27 kg/m² and no radiographic evidence of knee osteoarthritis. After computing 4-year changes in mean T2 relaxation time for six femoral cartilage regions and mean daily times spent in the sedentary, light, moderate, and vigorous activity ranges, we performed canonical correlation analysis (CCA) to find a linear combination of times spent in different activity intensity ranges (Activity Index) that was maximally correlated with a linear combination of regional changes in cartilage microstructure (Cartilage Microstructure Index). We used leave-one-out pre-validation to test the robustness of the model on new data. Nineteen subjects satisfied the inclusion criteria. CCA identified an Activity Index and a Cartilage Microstructure Index that were significantly correlated (r = .82, P < .0001 on test data). Higher levels of sedentary time and vigorous activity were associated with greater medial-lateral differences in longitudinal T2 changes, whereas light activity was associated with smaller differences. Physical activity is better associated with an index that contrasts microstructural changes in different cartilage regions than it is with univariate or cumulative changes, likely because this index separates the effect of activity, which is greater in the medial loadbearing region, from that of patient-specific natural aging.
View details for PubMedID 29605382
-
Some methods for heterogeneous treatment effect estimation in high dimensions
STATISTICS IN MEDICINE
2018; 37 (11): 1767–87
Abstract
When devising a course of treatment for a patient, doctors often have little quantitative evidence on which to base their decisions, beyond their medical education and published clinical trials. Stanford Health Care alone has millions of electronic medical records that are only just recently being leveraged to inform better treatment recommendations. These data present a unique challenge because they are high dimensional and observational. Our goal is to make personalized treatment recommendations based on the outcomes for past patients similar to a new patient. We propose and analyze 3 methods for estimating heterogeneous treatment effects using observational data. Our methods perform well in simulations using a wide variety of treatment effect functions, and we present results of applying the 2 most promising methods to data from The SPRINT Data Analysis Challenge, from a large randomized trial of a treatment for high blood pressure.
View details for PubMedID 29508417
View details for PubMedCentralID PMC5938172
-
Gene expression profiling of low-grade endometrial stromal sarcoma indicates fusion protein-mediated activation of the Wnt signaling pathway
GYNECOLOGIC ONCOLOGY
2018; 149 (2): 388–93
Abstract
Low-grade endometrial stromal sarcomas (LGESS) harbor chromosomal translocations that affect proteins associated with chromatin remodeling Polycomb Repressive Complex 2 (PRC2), including SUZ12, PHF1 and EPC1. Roughly half of LGESS also demonstrate nuclear accumulation of β-catenin, which is a hallmark of Wnt signaling activation. However, the targets affected by the fusion proteins and the role of Wnt signaling in the pathogenesis of these tumors remain largely unknown. Here we report the results of a meta-analysis of three independent gene expression profiling studies on LGESS and immunohistochemical evaluation of nuclear expression of β-catenin and Lef1 in 112 uterine sarcoma specimens obtained from 20 LGESS and 89 LMS patients. Our results demonstrate that 143 out of 310 genes overexpressed in LGESS are known to be directly regulated by SUZ12. In addition, our gene expression meta-analysis shows activation of multiple genes implicated in Wnt signaling. We further emphasize the role of the Wnt signaling pathway by demonstrating concordant nuclear expression of β-catenin and Lef1 in 7/16 LGESS. Based on our findings, we suggest that LGESS-specific fusion proteins disrupt the repressive function of the PRC2 complex similar to the mechanism seen in synovial sarcoma, where the SS18-SSX fusion proteins disrupt the mSWI/SNF (BAF) chromatin remodeling complex. We propose that these fusion proteins in LGESS contribute to overexpression of Wnt ligands with subsequent activation of Wnt signaling pathway and formation of an active β-catenin/Lef1 transcriptional complex. These observations could lead to novel therapeutic approaches that focus on the Wnt pathway in LGESS.
View details for PubMedID 29544705
-
Saturating Splines and Feature Selection
JOURNAL OF MACHINE LEARNING RESEARCH
2018; 18
View details for Web of Science ID 000435454900001
-
Saturating Splines and Feature Selection.
Journal of machine learning research : JMLR
2018; 18
Abstract
We extend the adaptive regression spline model by incorporating saturation, the natural requirement that a function extend as a constant outside a certain range. We fit saturating splines to data via a convex optimization problem over a space of measures, which we solve using an efficient algorithm based on the conditional gradient method. Unlike many existing approaches, our algorithm solves the original infinite-dimensional (for splines of degree at least two) optimization problem without pre-specified knot locations. We then adapt our algorithm to fit generalized additive models with saturating splines as coordinate functions and show that the saturation requirement allows our model to simultaneously perform feature selection and nonlinear function fitting. Finally, we briefly sketch how the method can be extended to higher order splines and to different requirements on the extension outside the data range.
View details for PubMedID 31007630
-
CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING
ANNALS OF STATISTICS
2017; 45 (5): 1863–94
View details for DOI 10.1214/16-AOS1511
View details for Web of Science ID 000416455300002
-
CONFOUNDER ADJUSTMENT IN MULTIPLE HYPOTHESIS TESTING.
Annals of statistics
2017; 45 (5): 1863-1894
Abstract
We consider large-scale studies in which thousands of significance tests are performed simultaneously. In some of these studies, the multiple testing procedure can be severely biased by latent confounding factors such as batch effects and unmeasured covariates that correlate with both primary variable(s) of interest (e.g., treatment variable, phenotype) and the outcome. Over the past decade, many statistical methods have been proposed to adjust for the confounders in hypothesis testing. We unify these methods in the same framework, generalize them to include multiple primary variables and multiple nuisance variables, and analyze their statistical properties. In particular, we provide theoretical guarantees for RUV-4 [Gagnon-Bartsch, Jacob and Speed (2013)] and LEAPP [Ann. Appl. Stat. 6 (2012) 1664-1688], which correspond to two different identification conditions in the framework: the first requires a set of "negative controls" that are known a priori to follow the null distribution; the second requires the true nonnulls to be sparse. Two different estimators which are based on RUV-4 and LEAPP are then applied to these two scenarios. We show that if the confounding factors are strong, the resulting estimators can be asymptotically as powerful as the oracle estimator which observes the latent confounding factors. For hypothesis testing, we show the asymptotic z-tests based on the estimators can control the type I error. Numerical experiments show that the false discovery rate is also controlled by the Benjamini-Hochberg procedure when the sample size is reasonably large.
View details for DOI 10.1214/16-AOS1511
View details for PubMedID 31439967
View details for PubMedCentralID PMC6706069
-
Accuracy in Wrist-Worn, Sensor-Based Measurements of Heart Rate and Energy Expenditure in a Diverse Cohort.
Journal of personalized medicine
2017; 7 (2)
Abstract
The ability to measure physical activity through wrist-worn devices provides an opportunity for cardiovascular medicine. However, the accuracy of commercial devices is largely unknown. The aim of this work is to assess the accuracy of seven commercially available wrist-worn devices in estimating heart rate (HR) and energy expenditure (EE) and to propose a wearable sensor evaluation framework. We evaluated the Apple Watch, Basis Peak, Fitbit Surge, Microsoft Band, Mio Alpha 2, PulseOn, and Samsung Gear S2. Participants wore devices while being simultaneously assessed with continuous telemetry and indirect calorimetry while sitting, walking, running, and cycling. Sixty volunteers (29 male, 31 female, age 38 ± 11 years) of diverse age, height, weight, skin tone, and fitness level were selected. Error in HR and EE was computed for each subject/device/activity combination. Devices reported the lowest error for cycling and the highest for walking. Device error was higher for males, greater body mass index, darker skin tone, and walking. Six of the devices achieved a median error for HR below 5% during cycling. No device achieved an error in EE below 20 percent. The Apple Watch achieved the lowest overall error in both HR and EE, while the Samsung Gear S2 reported the highest. In conclusion, most wrist-worn devices adequately measure HR in laboratory-based activities, but poorly estimate EE, suggesting caution in the use of EE measurements as part of health improvement programs. We propose reference standards for the validation of consumer health devices (http://precision.stanford.edu/).
View details for DOI 10.3390/jpm7020003
View details for PubMedID 28538708
-
Targeted use of growth mixture modeling: a learning perspective
STATISTICS IN MEDICINE
2017; 36 (4): 671-686
Abstract
From the statistical learning perspective, this paper shows a new direction for the use of growth mixture modeling (GMM), a method of identifying latent subpopulations that manifest heterogeneous outcome trajectories. In the proposed approach, we utilize the benefits of the conventional use of GMM for the purpose of generating potential candidate models based on empirical model fitting, which can be viewed as unsupervised learning. We then evaluate candidate GMM models on the basis of a direct measure of success; how well the trajectory types are predicted by clinically and demographically relevant baseline features, which can be viewed as supervised learning. We examine the proposed approach focusing on a particular utility of latent trajectory classes, as outcomes that can be used as valid prediction targets in clinical prognostic models. Our approach is illustrated using data from the Longitudinal Assessment of Manic Symptoms study. Copyright © 2016 John Wiley & Sons, Ltd.
View details for DOI 10.1002/sim.7152
View details for Web of Science ID 000393304400008
View details for PubMedCentralID PMC5217165
-
Selection of effects in Cox frailty models by regularization methods.
Biometrics
2017
Abstract
In all sorts of regression problems, it has become more and more important to deal with high-dimensional data with lots of potentially influential covariates. A possible solution is to apply estimation methods that aim at the detection of the relevant effect structure by using penalization methods. In this article, the effect structure in the Cox frailty model, which is the most widely used model that accounts for heterogeneity in survival data, is investigated. Since in survival models one has to account for possible variation of the effect strength over time, the selection of the relevant features has to distinguish between several cases: covariates can have time-varying effects, time-constant effects, or be irrelevant. A penalization approach is proposed that is able to distinguish between these types of effects to obtain a sparse representation that includes the relevant effects in a proper form. It is shown in simulations that the method works well. The method is applied to model the time until pregnancy, illustrating that the complexity of the influence structure can be strongly reduced by using the proposed penalty approach.
View details for DOI 10.1111/biom.12637
View details for PubMedID 28085181
-
FIRE: functional inference of genetic variants that regulate gene expression.
Bioinformatics (Oxford, England)
2017; 33 (24): 3895–3901
Abstract
Interpreting genetic variation in noncoding regions of the genome is an important challenge for personal genome analysis. One mechanism by which noncoding single nucleotide variants (SNVs) influence downstream phenotypes is through the regulation of gene expression. Methods to predict whether or not individual SNVs are likely to regulate gene expression would aid interpretation of variants of unknown significance identified in whole-genome sequencing studies. We developed FIRE (Functional Inference of Regulators of Expression), a tool to score both noncoding and coding SNVs based on their potential to regulate the expression levels of nearby genes. FIRE consists of 23 random forests trained to recognize SNVs in cis-expression quantitative trait loci (cis-eQTLs) using a set of 92 genomic annotations as predictive features. FIRE scores discriminate cis-eQTL SNVs from non-eQTL SNVs in the training set with a cross-validated area under the receiver operating characteristic curve (AUC) of 0.807, and discriminate cis-eQTL SNVs shared across six populations of different ancestry from non-eQTL SNVs with an AUC of 0.939. FIRE scores are also predictive of cis-eQTL SNVs across a variety of tissue types. FIRE scores for genome-wide SNVs in hg19/GRCh37 are available for download at https://sites.google.com/site/fireregulatoryvariation/. Contact: nilah@stanford.edu. Supplementary data are available at Bioinformatics online.
View details for PubMedID 28961785
-
Sparse EEG/MEG source estimation via a group lasso.
PloS one
2017; 12 (6): e0176835
Abstract
Non-invasive recordings of human brain activity through electroencephalography (EEG) or magnetoencephalography (MEG) are of value for both basic science and clinical applications in sensory, cognitive, and affective neuroscience. Here we introduce a new approach to estimating the intra-cranial sources of EEG/MEG activity measured from extra-cranial sensors. The approach is based on the group lasso, a sparse-prior inverse that has been adapted to take advantage of functionally-defined regions of interest for the definition of physiologically meaningful groups within a functionally-based common space. Detailed simulations using realistic source-geometries and data from a human Visual Evoked Potential experiment demonstrate that the group-lasso method has improved performance over traditional ℓ2 minimum-norm methods. In addition, we show that pooling source estimates across subjects over functionally defined regions of interest results in improvements in the accuracy of source estimates for both the group-lasso and minimum-norm approaches.
View details for PubMedID 28604790
-
Combinatorial Extracellular Matrix Microenvironments for Probing Endothelial Differentiation of Human Pluripotent Stem Cells.
Scientific reports
2017; 7 (1): 6551
Abstract
Endothelial cells derived from human pluripotent stem cells are a promising cell type for enhancing angiogenesis in ischemic cardiovascular tissues. However, our understanding of microenvironmental factors that modulate the process of endothelial differentiation is limited. We examined the role of combinatorial extracellular matrix (ECM) proteins on endothelial differentiation systematically using an arrayed microscale platform. Human pluripotent stem cells were differentiated on the arrayed ECM microenvironments for 5 days. Combinatorial ECMs composed of collagen IV + heparan sulfate + laminin (CHL) or collagen IV + gelatin + heparan sulfate (CGH) demonstrated significantly higher expression of CD31, compared to single-factor ECMs. These results were corroborated by fluorescence activated cell sorting showing a 48% yield of CD31+/VE-cadherin+ cells on CHL, compared to 27% on matrigel. To elucidate the signaling mechanism, a gene expression time course revealed that VE-cadherin and FLK1 were upregulated in a dynamically similar manner as integrin subunit β3 (>50 fold). To demonstrate the functional importance of integrin β3 in promoting endothelial differentiation, the addition of neutralization antibody inhibited endothelial differentiation on CHL-modified dishes by >50%. These data suggest that optimal combinatorial ECMs enhance endothelial differentiation, compared to many single-factor ECMs, in part through an integrin β3-mediated pathway.
View details for PubMedID 28747756
-
Prognostic significance of early aortic remodeling in acute uncomplicated type B aortic dissection and intramural hematoma.
The Journal of thoracic and cardiovascular surgery
2017; 154 (4): 1192–1200
Abstract
Patients with Stanford type B aortic dissections (ADs) are at risk of long-term disease progression and late complications. The aim of this study was to evaluate the natural course and evolution of acute type B AD and intramural hematomas (IMHs) in patients who presented without complications during their initial hospital admission and who were treated with optimal medical management (MM). Databases from 2 aortic centers in Europe and the United States were used to identify 136 patients with acute type B AD (n = 92) and acute type B IMH (n = 44) who presented without complications during their index admission and were treated with MM. Computed tomography angiography scans were available at onset (≤14 days) and during follow-up for those patients. Relevant data, including evidence of adverse events during follow-up (AE; defined according to current guidelines), were retrieved from medical records and by reviewing computed tomography scan images. Aortic diameters were measured with dedicated 3-dimensional software. The 1-, 2-, and 5-year event-free survival rates of patients with type B AD were 84.3% (95% confidence interval [CI], 74.4-90.6), 75.4% (95% CI, 64.0-83.7), and 62.6% (95% CI, 68.9-73.6), respectively. Corresponding estimates for IMH were 76.5% (95% CI, 57.8-87.8), 76.5% (95% CI, 57.8-87.8), and 68.9% (95% CI, 45.2-83.9), respectively. In patients with type B AD, risk of an AE increased with aortic growth within the first 6 months after onset. A diameter increase of 5 mm in the first half year was associated with a relative risk for AE of 2.29 (95% CI, 1.70-3.09) compared with the median 6 months' growth of 2.4 mm.
In approximately 60% of patients with IMH, the abnormality resolved within 12 months and in the patients with nonresolving IMH, risk of an adverse event was greatest in the first year after onset and remained stable thereafter. More than one third of patients with initially uncomplicated type B AD suffer an AE under MM within 5 years of initial diagnosis. In patients with nonresolving IMH, most adverse events are observed in the first year after onset. In patients with type B AD an early aortic growth is associated with a greater risk of AE.
View details for PubMedID 28668458
-
Synergistic drug combinations from electronic health records and gene expression.
Journal of the American Medical Informatics Association
2016
Abstract
Using electronic health records (EHRs) and biomolecular data, we sought to discover drug pairs with synergistic repurposing potential. EHRs provide real-world treatment and outcome patterns, while complementary biomolecular data, including disease-specific gene expression and drug-protein interactions, provide mechanistic understanding. We applied Group Lasso INTERaction NETwork (glinternet), an overlap group lasso penalty on a logistic regression model, with pairwise interactions to identify variables and interacting drug pairs associated with reduced 5-year mortality using EHRs of 9945 breast cancer patients. We identified differentially expressed genes from 14 case-control human breast cancer gene expression datasets and integrated them with drug-protein networks. Drugs in the network were scored according to their association with breast cancer individually or in pairs. Lastly, we determined whether synergistic drug pairs found in the EHRs were enriched among synergistic drug pairs from gene-expression data using a method similar to gene set enrichment analysis. From EHRs, we discovered 3 drug-class pairs associated with lower mortality: anti-inflammatories and hormone antagonists, anti-inflammatories and lipid modifiers, and lipid modifiers and obstructive airway drugs. The first 2 pairs were also enriched among pairs discovered using gene expression data and are supported by molecular interactions in drug-protein networks and preclinical and epidemiologic evidence. This is a proof-of-concept study demonstrating that a combination of complementary data sources, such as EHRs and gene expression, can corroborate discoveries and provide mechanistic insight into drug synergism for repurposing.
View details for DOI 10.1093/jamia/ocw161
View details for PubMedID 27940607
-
Targeted use of growth mixture modeling: a learning perspective.
Statistics in medicine
2016
Abstract
From the statistical learning perspective, this paper shows a new direction for the use of growth mixture modeling (GMM), a method of identifying latent subpopulations that manifest heterogeneous outcome trajectories. In the proposed approach, we utilize the benefits of the conventional use of GMM for the purpose of generating potential candidate models based on empirical model fitting, which can be viewed as unsupervised learning. We then evaluate candidate GMM models on the basis of a direct measure of success; how well the trajectory types are predicted by clinically and demographically relevant baseline features, which can be viewed as supervised learning. We examine the proposed approach focusing on a particular utility of latent trajectory classes, as outcomes that can be used as valid prediction targets in clinical prognostic models. Our approach is illustrated using data from the Longitudinal Assessment of Manic Symptoms study. Copyright © 2016 John Wiley & Sons, Ltd.
View details for DOI 10.1002/sim.7152
View details for PubMedID 27804177
View details for PubMedCentralID PMC5217165
-
Human amygdala engagement moderated by early life stress exposure is a biobehavioral target for predicting recovery on antidepressants
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2016; 113 (42): 11955-11960
Abstract
Amygdala circuitry and early life stress (ELS) are both strongly and independently implicated in the neurobiology of depression. Importantly, animal models have revealed that the contribution of ELS to the development and maintenance of depression is likely a consequence of structural and physiological changes in amygdala circuitry in response to stress hormones. Despite these mechanistic foundations, amygdala engagement and ELS have not been investigated as biobehavioral targets for predicting functional remission in translational human studies of depression. Addressing this question, we integrated human neuroimaging and measurement of ELS within a controlled trial of antidepressant outcomes. Here we demonstrate that the interaction between amygdala activation engaged by emotional stimuli and ELS predicts functional remission on antidepressants with a greater than 80% cross-validated accuracy. Our model suggests that in depressed people with high ELS, the likelihood of remission is highest with greater amygdala reactivity to socially rewarding stimuli, whereas for those with low-ELS exposure, remission is associated with lower amygdala reactivity to both rewarding and threat-related stimuli. This full model predicted functional remission over and above the contribution of demographics, symptom severity, ELS, and amygdala reactivity alone. These findings identify a human target for elucidating the mechanisms of antidepressant functional remission and offer a target for developing novel therapeutics. The results also offer a proof-of-concept for using neuroimaging as a target for guiding neuroscience-informed intervention decisions at the level of the individual person.
View details for DOI 10.1073/pnas.1606671113
View details for PubMedID 27791054
-
Combinatorial extracellular matrix microenvironments promote survival and phenotype of human induced pluripotent stem cell-derived endothelial cells in hypoxia
ACTA BIOMATERIALIA
2016; 44: 188-199
Abstract
Recent developments in cell therapy using human induced pluripotent stem cell-derived endothelial cells (iPSC-ECs) hold great promise for treating ischemic cardiovascular tissues. However, poor post-transplantation viability largely limits the potential of stem cell therapy. Although the extracellular matrix (ECM) has become increasingly recognized as an important cell survival factor, conventional approaches primarily rely on single ECMs for in vivo co-delivery with cells, even though the endothelial basement membrane is comprised of a milieu of different ECMs. To address this limitation, we developed a combinatorial ECM microarray platform to simultaneously interrogate hundreds of micro-scale multi-component chemical compositions of ECMs on iPSC-EC response. After seeding iPSC-ECs onto ECM microarrays, we performed high-throughput analysis of the effects of combinatorial ECMs on iPSC-EC survival, endothelial phenotype, and nitric oxide production under conditions of hypoxia (1% O2) and reduced nutrients (1% fetal bovine serum), as is present in ischemic injury sites. Using automated image acquisition and analysis, we identified combinatorial ECMs such as collagen IV+gelatin+heparan sulfate+laminin and collagen IV+fibronectin+gelatin+heparan sulfate+laminin that significantly improved cell survival, nitric oxide production, and CD31 phenotypic expression, in comparison to single-component ECMs. These results were further validated in conventional cell culture platforms and within three-dimensional scaffolds. Furthermore, this approach revealed complex ECM interactions and non-intuitive cell behavior that otherwise could not be easily determined using conventional cell culture platforms. 
Together these data suggested that iPSC-EC delivery within optimal combinatorial ECMs may improve their survival and function under the condition of hypoxia with reduced nutrients. Human endothelial cells (ECs) derived from induced pluripotent stem cells (iPSC-ECs) are promising for treating diseases associated with reduced nutrient and oxygen supply like heart failure. However, diminished iPSC-EC survival after implantation into diseased environments limits their therapeutic potential. Since native ECs interact with numerous extracellular matrix (ECM) proteins for functional maintenance, we hypothesized that combinatorial ECMs may improve cell survival and function under conditions of reduced oxygen and nutrients. We developed a high-throughput system for simultaneous screening of iPSC-ECs cultured on multi-component ECM combinations under the condition of hypoxia and reduced serum. Using automated image acquisition and analytical algorithms, we identified combinatorial ECMs that significantly improved cell survival and function, in comparison to single ECMs. Furthermore, this approach revealed complex ECM interactions and non-intuitive cell behavior that otherwise could not be easily determined.
View details for DOI 10.1016/j.actbio.2016.08.003
View details for Web of Science ID 000385594700017
View details for PubMedCentralID PMC5045796
-
Combinatorial extracellular matrix microenvironments promote survival and phenotype of human induced pluripotent stem cell-derived endothelial cells in hypoxia.
Acta biomaterialia
2016; 44: 188-199
Abstract
Recent developments in cell therapy using human induced pluripotent stem cell-derived endothelial cells (iPSC-ECs) hold great promise for treating ischemic cardiovascular tissues. However, poor post-transplantation viability largely limits the potential of stem cell therapy. Although the extracellular matrix (ECM) has become increasingly recognized as an important cell survival factor, conventional approaches primarily rely on single ECMs for in vivo co-delivery with cells, even though the endothelial basement membrane is comprised of a milieu of different ECMs. To address this limitation, we developed a combinatorial ECM microarray platform to simultaneously interrogate hundreds of micro-scale multi-component chemical compositions of ECMs on iPSC-EC response. After seeding iPSC-ECs onto ECM microarrays, we performed high-throughput analysis of the effects of combinatorial ECMs on iPSC-EC survival, endothelial phenotype, and nitric oxide production under conditions of hypoxia (1% O2) and reduced nutrients (1% fetal bovine serum), as is present in ischemic injury sites. Using automated image acquisition and analysis, we identified combinatorial ECMs such as collagen IV+gelatin+heparan sulfate+laminin and collagen IV+fibronectin+gelatin+heparan sulfate+laminin that significantly improved cell survival, nitric oxide production, and CD31 phenotypic expression, in comparison to single-component ECMs. These results were further validated in conventional cell culture platforms and within three-dimensional scaffolds. Furthermore, this approach revealed complex ECM interactions and non-intuitive cell behavior that otherwise could not be easily determined using conventional cell culture platforms. 
Together these data suggested that iPSC-EC delivery within optimal combinatorial ECMs may improve their survival and function under the condition of hypoxia with reduced nutrients. Human endothelial cells (ECs) derived from induced pluripotent stem cells (iPSC-ECs) are promising for treating diseases associated with reduced nutrient and oxygen supply like heart failure. However, diminished iPSC-EC survival after implantation into diseased environments limits their therapeutic potential. Since native ECs interact with numerous extracellular matrix (ECM) proteins for functional maintenance, we hypothesized that combinatorial ECMs may improve cell survival and function under conditions of reduced oxygen and nutrients. We developed a high-throughput system for simultaneous screening of iPSC-ECs cultured on multi-component ECM combinations under the condition of hypoxia and reduced serum. Using automated image acquisition and analytical algorithms, we identified combinatorial ECMs that significantly improved cell survival and function, in comparison to single ECMs. Furthermore, this approach revealed complex ECM interactions and non-intuitive cell behavior that otherwise could not be easily determined.
View details for DOI 10.1016/j.actbio.2016.08.003
View details for PubMedID 27498178
View details for PubMedCentralID PMC5045796
-
REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants
AMERICAN JOURNAL OF HUMAN GENETICS
2016; 99 (4): 877-885
Abstract
The vast majority of coding variants are rare, and assessment of the contribution of rare variants to complex traits is hampered by low statistical power and limited functional data. Improved methods for predicting the pathogenicity of rare coding variants are needed to facilitate the discovery of disease variants from exome sequencing studies. We developed REVEL (rare exome variant ensemble learner), an ensemble method for predicting the pathogenicity of missense variants on the basis of individual tools: MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons. REVEL was trained with recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools. When applied to two independent test sets, REVEL had the best overall performance (p < 10⁻¹²) as compared to any individual tool and seven ensemble methods: MetaSVM, MetaLR, KGGSeq, Condel, CADD, DANN, and Eigen. Importantly, REVEL also had the best performance for distinguishing pathogenic from rare neutral variants with allele frequencies <0.5%. The area under the receiver operating characteristic curve (AUC) for REVEL was 0.046-0.182 higher in an independent test set of 935 recent SwissVar disease variants and 123,935 putatively neutral exome sequencing variants and 0.027-0.143 higher in an independent test set of 1,953 pathogenic and 2,406 benign variants recently reported in ClinVar than the AUCs for other ensemble methods. We provide pre-computed REVEL scores for all possible human missense variants to facilitate the identification of pathogenic variants in the sea of rare variants discovered as sequencing studies expand in scale.
View details for DOI 10.1016/j.ajhg.2016.08.016
View details for PubMedID 27666373
-
Evaluating quantitative proton-density-mapping methods
HUMAN BRAIN MAPPING
2016; 37 (10): 3623-3635
Abstract
Quantitative magnetic resonance imaging (qMRI) aims to quantify tissue parameters by eliminating instrumental bias. We describe qMRI theory, simulations, and software designed to estimate proton density (PD), the apparent local concentration of water protons in the living human brain. First, we show that, in the absence of noise, multichannel coil data contain enough information to separate PD and coil sensitivity, a limiting instrumental bias. Second, we show that, in the presence of noise, regularization by a constraint on the relationship between T1 and PD produces accurate coil sensitivity and PD maps. The ability to measure PD quantitatively has applications in the analysis of in-vivo human brain tissue and enables multisite comparisons between individuals and across instruments. Hum Brain Mapp 37:3623-3635, 2016. © 2016 Wiley Periodicals, Inc.
View details for DOI 10.1002/hbm.23264
View details for Web of Science ID 000383864500018
View details for PubMedID 27273015
-
ZeitZeiger: supervised learning for high-dimensional data from an oscillatory system
NUCLEIC ACIDS RESEARCH
2016; 44 (8)
Abstract
Numerous biological systems oscillate over time or space. Despite these oscillators' importance, data from an oscillatory system is problematic for existing methods of regularized supervised learning. We present ZeitZeiger, a method to predict a periodic variable (e.g. time of day) from a high-dimensional observation. ZeitZeiger learns a sparse representation of the variation associated with the periodic variable in the training observations, then uses maximum-likelihood to make a prediction for a test observation. We applied ZeitZeiger to a comprehensive dataset of genome-wide gene expression from the mammalian circadian oscillator. Using the expression of 13 genes, ZeitZeiger predicted circadian time (internal time of day) in each of 12 mouse organs to within ∼1 h, resulting in a multi-organ predictor of circadian time. Compared to the state-of-the-art approach, ZeitZeiger was faster, more accurate and used fewer genes. We then validated the multi-organ predictor on 20 additional datasets comprising nearly 800 samples. Our results suggest that ZeitZeiger not only makes accurate predictions, but also gives insight into the behavior and structure of the oscillator from which the data originated. As our ability to collect high-dimensional data from various biological oscillators increases, ZeitZeiger should enhance efforts to convert these data to knowledge.
View details for DOI 10.1093/nar/gkw030
View details for Web of Science ID 000376389000011
View details for PubMedID 26819407
View details for PubMedCentralID PMC4856978
-
Effect of long-term antibiotic use on weight in adolescents with acne.
Journal of antimicrobial chemotherapy
2016; 71 (4): 1098-1105
Abstract
Antibiotics increase weight in farm animals and may cause weight gain in humans. We used electronic health records from a large primary care organization to determine the effect of antibiotics on weight and BMI in healthy adolescents with acne. We performed a retrospective cohort study of adolescents with acne prescribed ≥4 weeks of oral antibiotics with weight measurements within 18 months pre-antibiotics and 12 months post-antibiotics. We compared within-individual changes in weight-for-age Z-scores (WAZs) and BMI-for-age Z-scores (BMIZs). We used: (i) paired t-tests to analyse changes between the last pre-antibiotics versus the first post-antibiotic measurements; (ii) piecewise-constant-mixed models to capture changes between mean measurements pre- versus post-antibiotics; (iii) piecewise-linear-mixed models to capture changes in trajectory slopes pre- versus post-antibiotics; and (iv) χ² tests to compare proportions of adolescents with ≥0.2 Z-scores WAZ or BMIZ increase or decrease. Our cohort included 1012 adolescents with WAZs; 542 also had BMIZs. WAZs decreased post-antibiotics in all analyses [change between last WAZ pre-antibiotics versus first WAZ post-antibiotics = -0.041 Z-scores (P < 0.001); change between mean WAZ pre- versus post-antibiotics = -0.050 Z-scores (P < 0.001); change in WAZ trajectory slopes pre- versus post-antibiotics = -0.025 Z-scores/6 months (P = 0.002)]. More adolescents had a WAZ decrease post-antibiotics ≥0.2 Z-scores than an increase (26% versus 18%; P < 0.001). Trends were similar, though not statistically significant, for BMIZ changes. Contrary to original expectations, long-term antibiotic use in healthy adolescents with acne was not associated with weight gain. This finding, which was consistent across all analyses, does not support a weight-promoting effect of antibiotics in adolescents.
View details for DOI 10.1093/jac/dkv455
View details for PubMedID 26782773
View details for PubMedCentralID PMC4790625
-
Construction of longitudinal prediction targets using semisupervised learning.
Statistical methods in medical research
2016: 962280216684163-?
Abstract
In establishing prognostic models, often aided by machine learning methods, much effort is concentrated in identifying good predictors. However, the same level of rigor is often absent in improving the outcome side of the models. In this study, we focus on this rather neglected aspect of model development. We are particularly interested in the use of longitudinal information as a way of improving the outcome side of prognostic models. This involves optimally characterizing individuals' outcome status, classifying them, and validating the formulated prediction targets. None of these tasks are straightforward, which may explain why longitudinal prediction targets are not commonly used in practice despite their compelling benefits. As a way of improving this situation, we explore the joint use of empirical model fitting, clinical insights, and cross-validation based on how well formulated targets are predicted by clinically relevant baseline characteristics (antecedent validators). The idea here is that all these methods are imperfect but can be used together to triangulate valid prediction targets. The proposed approach is illustrated using data from the longitudinal assessment of manic symptoms study.
View details for DOI 10.1177/0962280216684163
View details for PubMedID 28067113
-
CUSTOMIZED TRAINING WITH AN APPLICATION TO MASS SPECTROMETRIC IMAGING OF CANCER TISSUE
ANNALS OF APPLIED STATISTICS
2015; 9 (4): 1709-1725
View details for DOI 10.1214/15-AOAS866
View details for Web of Science ID 000370445600001
-
Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares
JOURNAL OF MACHINE LEARNING RESEARCH
2015; 16: 3367-3402
View details for Web of Science ID 000369888000033
-
CUSTOMIZED TRAINING WITH AN APPLICATION TO MASS SPECTROMETRIC IMAGING OF CANCER TISSUE.
The annals of applied statistics
2015; 9 (4): 1709-1725
Abstract
We introduce a simple, interpretable strategy for making predictions on test data when the features of the test data are available at the time of model fitting. Our proposal, customized training, clusters the data to find training points close to each test point and then fits an ℓ1-regularized model (lasso) separately in each training cluster. This approach combines the local adaptivity of k-nearest neighbors with the interpretability of the lasso. Although we use the lasso for the model fitting, any supervised learning method can be applied to the customized training sets. We apply the method to a mass-spectrometric imaging data set from an ongoing collaboration in gastric cancer detection which demonstrates the power and interpretability of the technique. Our idea is simple but potentially useful in situations where the data have some underlying structure.
View details for DOI 10.1214/15-AOAS866
View details for PubMedID 30370000
View details for PubMedCentralID PMC6200412
-
The mobilize center: an NIH big data to knowledge center to advance human movement research and improve mobility.
Journal of the American Medical Informatics Association
2015; 22 (6): 1120-1125
Abstract
Regular physical activity helps prevent heart disease, stroke, diabetes, and other chronic diseases, yet a broad range of conditions impair mobility at great personal and societal cost. Vast amounts of data characterizing human movement are available from research labs, clinics, and millions of smartphones and wearable sensors, but integration and analysis of this large quantity of mobility data are extremely challenging. The authors have established the Mobilize Center (http://mobilize.stanford.edu) to harness these data to improve human mobility and help lay the foundation for using data science methods in biomedicine. The Center is organized around 4 data science research cores: biomechanical modeling, statistical learning, behavioral and social modeling, and integrative modeling. Important biomedical applications, such as osteoarthritis and weight management, will focus the development of new data science methods. By developing these new approaches, sharing data and validated software tools, and training thousands of researchers, the Mobilize Center will transform human movement research.
View details for DOI 10.1093/jamia/ocv071
View details for PubMedID 26272077
View details for PubMedCentralID PMC4639715
-
Learning interactions via hierarchical group-lasso regularization.
Journal of computational and graphical statistics : a joint publication of American Statistical Association, Institute of Mathematical Statistics, Interface Foundation of North America
2015; 24 (3): 627-654
Abstract
We introduce a method for learning pairwise interactions in a linear regression or logistic regression model in a manner that satisfies strong hierarchy: whenever an interaction is estimated to be nonzero, both its associated main effects are also included in the model. We motivate our approach by modeling pairwise interactions for categorical variables with arbitrary numbers of levels, and then show how we can accommodate continuous variables as well. Our approach allows us to dispense with explicitly applying constraints on the main effects and interactions for identifiability, which results in interpretable interaction models. We compare our method with existing approaches on both simulated and real data, including a genome-wide association study, all using our R package glinternet.
View details for DOI 10.1080/10618600.2014.938812
View details for PubMedID 26759522
View details for PubMedCentralID PMC4706754
-
Clinically Relevant Molecular Subtypes in Leiomyosarcoma.
Clinical cancer research
2015; 21 (15): 3501-3511
Abstract
Leiomyosarcoma is a malignant neoplasm with smooth muscle differentiation. Little is known about its molecular heterogeneity and no targeted therapy currently exists for leiomyosarcoma. Recognition of different molecular subtypes is necessary to evaluate novel therapeutic options. In a previous study on 51 leiomyosarcomas, we identified three molecular subtypes in leiomyosarcoma. The current study was performed to determine whether the existence of these subtypes could be confirmed in independent cohorts. Ninety-nine cases of leiomyosarcoma were expression profiled with 3'end RNA-Sequencing (3SEQ). Consensus clustering was conducted to determine the optimal number of subtypes. We identified 3 leiomyosarcoma molecular subtypes and confirmed this finding by analyzing publicly available data on 82 leiomyosarcoma from The Cancer Genome Atlas (TCGA). We identified two new formalin-fixed, paraffin-embedded tissue-compatible diagnostic immunohistochemical markers: LMOD1 for subtype I leiomyosarcoma and ARL4C for subtype II leiomyosarcoma. A leiomyosarcoma tissue microarray with known clinical outcome was used to show that subtype I leiomyosarcoma is associated with good outcome in extrauterine leiomyosarcoma while subtype II leiomyosarcoma is associated with poor prognosis in both uterine and extrauterine leiomyosarcoma. The leiomyosarcoma subtypes showed significant differences in expression levels for genes for which novel targeted therapies are being developed, suggesting that leiomyosarcoma subtypes may respond differentially to these targeted therapies. We confirm the existence of 3 molecular subtypes in leiomyosarcoma using two independent datasets and show that the different molecular subtypes are associated with distinct clinical outcomes. The findings offer an opportunity for treating leiomyosarcoma in a subtype-specific targeted approach. Clin Cancer Res; 21(15); 3501-11. ©2015 AACR.
View details for DOI 10.1158/1078-0432.CCR-14-3141
View details for PubMedID 25896974
-
Effective degrees of freedom: a flawed metaphor
BIOMETRIKA
2015; 102 (2): 479-485
View details for DOI 10.1093/biomet/asv019
View details for Web of Science ID 000355677500016
-
Effective degrees of freedom: a flawed metaphor.
Biometrika
2015; 102 (2): 479-485
Abstract
To most applied statisticians, a fitting procedure's degrees of freedom is synonymous with its model complexity, or its capacity for overfitting to data. In particular, it is often used to parameterize the bias-variance tradeoff in model selection. We argue that, on the contrary, model complexity and degrees of freedom may correspond very poorly. We exhibit and theoretically explore various fitting procedures for which degrees of freedom is not monotonic in the model complexity parameter, and can exceed the total dimension of the ambient space even in very simple settings. We show that the degrees of freedom for any non-convex projection method can be unbounded.
View details for DOI 10.1093/biomet/asv019
View details for PubMedID 26977114
View details for PubMedCentralID PMC4787623
-
Detecting clinically meaningful biomarkers with repeated measurements: An illustration with electronic health records
BIOMETRICS
2015; 71 (2): 478-486
Abstract
Data sources with repeated measurements are an appealing resource to understand the relationship between changes in biological markers and risk of a clinical event. While longitudinal data present opportunities to observe changing risk over time, these analyses can be complicated if the measurement of clinical metrics is sparse and/or irregular, making typical statistical methods unsuitable. In this article, we use electronic health record (EHR) data as an example to present an analytic procedure to both create an analytic sample and analyze the data to detect clinically meaningful markers of acute myocardial infarction (MI). Using an EHR from a large national dialysis organization we abstracted the records of 64,318 individuals and identified 4769 people that had an MI during the study period. We describe a nested case-control design to sample appropriate controls and an analytic approach using regression splines. Fitting a mixed-model with truncated power splines we perform a series of goodness-of-fit tests to determine whether any of 11 regularly collected laboratory markers are useful clinical predictors. We test the clinical utility of each marker using an independent test set. The results suggest that EHR data can be easily used to detect markers of clinically acute events. Special software or analytic tools are not needed, even with irregular EHR data.
View details for DOI 10.1111/biom.12283
View details for Web of Science ID 000356810000024
View details for PubMedID 25652566
-
Point process models for presence-only analysis
METHODS IN ECOLOGY AND EVOLUTION
2015; 6 (4): 366-379
View details for DOI 10.1111/2041-210X.12352
View details for Web of Science ID 000352794100002
-
Bias correction in species distribution models: pooling survey and collection data for multiple species.
Methods in ecology and evolution
2015; 6 (4): 424-438
Abstract
Presence-only records may provide data on the distributions of rare species, but commonly suffer from large, unknown biases due to their typically haphazard collection schemes. Presence-absence or count data collected in systematic, planned surveys are more reliable but typically less abundant. We proposed a probabilistic model to allow for joint analysis of presence-only and survey data to exploit their complementary strengths. Our method pools presence-only and presence-absence data for many species and maximizes a joint likelihood, simultaneously estimating and adjusting for the sampling bias affecting the presence-only data. By assuming that the sampling bias is the same for all species, we can borrow strength across species to efficiently estimate the bias and improve our inference from presence-only data. We evaluate our model's performance on data for 36 eucalypt species in south-eastern Australia. We find that presence-only records exhibit a strong sampling bias towards the coast and towards Sydney, the largest city. Our data-pooling technique substantially improves the out-of-sample predictive performance of our model when the amount of available presence-absence data for a given species is scarce. If we have only presence-only data and no presence-absence data for a given species, but both types of data for several other species that suffer from the same spatial sampling bias, then our method can obtain an unbiased estimate of the first species' geographic range.
View details for DOI 10.1111/2041-210X.12242
View details for PubMedID 27840673
View details for PubMedCentralID PMC5102514
-
CATS regression - a model-based approach to studying trait-based community assembly
METHODS IN ECOLOGY AND EVOLUTION
2015; 6 (4): 389-398
View details for DOI 10.1111/2041-210X.12280
View details for Web of Science ID 000352794100004
-
Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.
Journal of machine learning research : JMLR
2015; 16: 3367-3402
Abstract
The matrix-completion problem has attracted a lot of attention, largely as a result of the celebrated Netflix competition. Two popular approaches for solving the problem are nuclear-norm-regularized matrix approximation (Candès and Tao, 2009; Mazumder et al., 2010), and maximum-margin matrix factorization (Srebro et al., 2005). These two procedures are in some cases solving equivalent problems, but with quite different algorithms. In this article we bring the two approaches together, leading to an efficient algorithm for large matrix factorization and completion that outperforms both of these. We develop a software package softImpute in R for implementing our approaches, and a distributed version for very large matrices using the Spark cluster programming environment.
View details for PubMedID 31130828
-
Learning the Structure of Mixed Graphical Models.
Journal of computational and graphical statistics : a joint publication of American Statistical Association, Institute of Mathematical Statistics, Interface Foundation of North America
2015; 24 (1): 230–53
Abstract
We consider the problem of learning the structure of a pairwise graphical model over continuous and discrete variables. We present a new pairwise model for graphical models with both continuous and discrete variables that is amenable to structure learning. In previous work, authors have considered structure learning of Gaussian graphical models and structure learning of discrete models. Our approach is a natural generalization of these two lines of work to the mixed case. The penalization scheme involves a novel symmetric use of the group-lasso norm and follows naturally from a particular parametrization of the model. Supplementary materials for this paper are available online.
View details for PubMedID 26085782
-
Bias correction in species distribution models: pooling survey and collection data for multiple species
METHODS IN ECOLOGY AND EVOLUTION
2015
View details for DOI 10.1111/2041-210X.12242
-
LOCAL CASE-CONTROL SAMPLING: EFFICIENT SUBSAMPLING IN IMBALANCED DATA SETS
ANNALS OF STATISTICS
2014; 42 (5): 1693-1724
View details for DOI 10.1214/14-AOS1220
View details for Web of Science ID 000344632400001
-
LOCAL CASE-CONTROL SAMPLING: EFFICIENT SUBSAMPLING IN IMBALANCED DATA SETS.
Annals of statistics
2014; 42 (5): 1693-1724
Abstract
For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features. The biased subsampling is corrected by a post-hoc analytic adjustment to the parameters. The method is simple and requires one parallelizable scan over the full data set. Standard case-control sampling is inconsistent under model misspecification for the population risk-minimizing coefficients θ*. By contrast, our estimator is consistent for θ* provided that the pilot estimate is. Moreover, under correct specification and with a consistent, independent pilot estimate, our estimator has exactly twice the asymptotic variance of the full-sample MLE, even if the selected subsample comprises a minuscule fraction of the full data set, as happens when the original data are severely imbalanced. The factor of two improves to [Formula: see text] if we multiply the baseline acceptance probabilities by c > 1 (and weight points with acceptance probability greater than 1), taking roughly [Formula: see text] times as many data points into the subsample. Experiments on simulated and real data show that our method can substantially outperform standard case-control subsampling.
View details for DOI 10.1214/14-AOS1220
View details for PubMedID 25492979
View details for PubMedCentralID PMC4258397
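The accept-reject scheme summarized in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the abstract does not spell out the acceptance probability or the adjustment, so the choices here (accept a point with probability |y - p̃(x)|, where p̃ is the pilot model's fitted probability, and add the pilot coefficients back to the subsample fit) are stated from memory of the method and should be treated as assumptions. All function names are hypothetical.

```python
import math
import random


def pilot_probability(theta_pilot, x):
    """Fitted probability P(y = 1 | x) under a logistic pilot model."""
    z = sum(t * xi for t, xi in zip(theta_pilot, x))
    return 1.0 / (1.0 + math.exp(-z))


def accept_probability(theta_pilot, x, y):
    """Assumed acceptance rule: keep points whose label is 'surprising'
    under the pilot model, i.e. with probability |y - p_pilot(x)|."""
    return abs(y - pilot_probability(theta_pilot, x))


def local_case_control_subsample(data, theta_pilot, rng):
    """One pass over (x, y) pairs; each point is kept independently."""
    return [
        (x, y)
        for x, y in data
        if rng.random() < accept_probability(theta_pilot, x, y)
    ]


def adjust(theta_subsample, theta_pilot):
    """Assumed post-hoc correction: add the pilot coefficients to the
    coefficients of a logistic regression fitted on the subsample."""
    return [a + b for a, b in zip(theta_subsample, theta_pilot)]


# Usage: a heavily imbalanced toy data set with an intercept-only pilot.
rng = random.Random(0)
data = [([1.0], 1 if rng.random() < 0.02 else 0) for _ in range(5000)]
theta_pilot = [math.log(0.02 / 0.98)]  # pilot matches the marginal rate
subsample = local_case_control_subsample(data, theta_pilot, rng)
```

Because each point is handled independently, the scan can be split across workers, which matches the "one parallelizable scan" remark in the abstract. Fitting the logistic regression on `subsample` is left to whatever solver is at hand.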
-
Assessing the significance of global and local correlations under spatial autocorrelation: A nonparametric approach
BIOMETRICS
2014; 70 (2): 409-418
Abstract
We propose a method to test the correlation of two random fields when they are both spatially autocorrelated. In this scenario, the assumption of independence for the pair of observations in the standard test does not hold, and as a result we reject in many cases where there is no effect (the precision of the null distribution is overestimated). Our method recovers the null distribution taking into account the autocorrelation. It uses Monte-Carlo methods, and focuses on permuting, and then smoothing and scaling one of the variables to destroy the correlation with the other, while maintaining at the same time the initial autocorrelation. With this simulation model, any test based on the independence of two (or more) random fields can be constructed. This research was motivated by a project in biodiversity and conservation in the Biology Department at Stanford University.
View details for DOI 10.1111/biom.12139
View details for Web of Science ID 000337621000016
View details for PubMedID 24571609
View details for PubMedCentralID PMC4108159
-
Boosted Varying-Coefficient Regression Models for Product Demand Prediction
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2014; 23 (2): 361-382
View details for DOI 10.1080/10618600.2013.778777
View details for Web of Science ID 000335938300004
-
Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife
JOURNAL OF MACHINE LEARNING RESEARCH
2014; 15: 1625-1651
View details for Web of Science ID 000344638100001
-
CATS regression–a model‐based approach to studying trait‐based community assembly
Methods & Statistics in Ecology: Methods in Ecology and Evolution
2014; 6 (4): 389-398
View details for DOI 10.1111/2041-210X.12280
-
Bias correction in species distribution models: pooling survey and collection data for multiple species
METHODS IN ECOLOGY AND EVOLUTION
2014; 6 (4): 424-438
View details for DOI 10.1111/2041-210X.12242
-
Learning interactions via hierarchical group-lasso regularization
Journal of Computational and Graphical Statistics
2014
View details for DOI 10.1080/10618600.2014.938812
- Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares Technical Report, Statistics Department, Stanford University 2014
-
Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife.
Journal of machine learning research : JMLR
2014; 15 (1): 1625-1651
Abstract
We study the variability of predictions made by bagged learners and random forests, and show how to estimate standard errors for these methods. Our work builds on variance estimates for bagging proposed by Efron (1992, 2013) that are based on the jackknife and the infinitesimal jackknife (IJ). In practice, bagged predictors are computed using a finite number B of bootstrap replicates, and working with a large B can be computationally expensive. Direct applications of jackknife and IJ estimators to bagging require B = Θ(n^1.5) bootstrap replicates to converge, where n is the size of the training set. We propose improved versions that only require B = Θ(n) replicates. Moreover, we show that the IJ estimator requires 1.7 times fewer bootstrap replicates than the jackknife to achieve a given accuracy. Finally, we study the sampling distributions of the jackknife and IJ variance estimates themselves. We illustrate our findings with multiple experiments and simulation studies.
View details for PubMedID 25580094
View details for PubMedCentralID PMC4286302
-
Learning the Structure of Mixed Graphical Models
Journal of Computational and Graphical Statistics
2014; 24 (1): 230-253
View details for DOI 10.1080/10618600.2014.900500
-
FINITE-SAMPLE EQUIVALENCE IN STATISTICAL MODELS FOR PRESENCE-ONLY DATA
ANNALS OF APPLIED STATISTICS
2013; 7 (4): 1917-1939
View details for DOI 10.1214/13-AOAS667
View details for Web of Science ID 000330044900011
-
Finite-Sample Equivalence in Statistical Models for Presence-Only Data.
The annals of applied statistics
2013; 7 (4): 1917-1939
Abstract
Statistical modeling of presence-only data has attracted much recent attention in the ecological literature, leading to a proliferation of methods, including the inhomogeneous Poisson process (IPP) model, maximum entropy (Maxent) modeling of species distributions and logistic regression models. Several recent articles have shown the close relationships between these methods. We explain why the IPP intensity function is a more natural object of inference in presence-only studies than occurrence probability (which is only defined with reference to quadrat size), and why presence-only data only allows estimation of relative, and not absolute intensity of species occurrence. All three of the above techniques amount to parametric density estimation under the same exponential family model (in the case of the IPP, the fitted density is multiplied by the number of presence records to obtain a fitted intensity). We show that IPP and Maxent give the exact same estimate for this density, but logistic regression in general yields a different estimate in finite samples. When the model is misspecified-as it practically always is-logistic regression and the IPP may have substantially different asymptotic limits with large data sets. We propose "infinitely weighted logistic regression," which is exactly equivalent to the IPP in finite samples. Consequently, many already-implemented methods extending logistic regression can also extend the Maxent and IPP models in directly analogous ways using this technique.
View details for DOI 10.1214/13-AOAS667
View details for PubMedID 25493106
View details for PubMedCentralID PMC4258396
-
Inference from presence-only data; the ongoing controversy
ECOGRAPHY
2013; 36 (8): 864-867
View details for DOI 10.1111/j.1600-0587.2013.00321.x
View details for Web of Science ID 000321328100002
-
A Sparse-Group Lasso
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2013; 22 (2): 231-245
View details for DOI 10.1080/10618600.2012.681250
View details for Web of Science ID 000319954000001
-
Boosted Varying-Coefficient Regression Models for Product Demand Prediction
Journal of Computational and Graphical Statistics
2013; 23 (2): 361-382
View details for DOI 10.1080/10618600.2013.778777
- Effective degrees of freedom: a flawed metaphor Technical Report, Statistics Department, Stanford University 2013
- A blockwise descent algorithm for group-penalized multiresponse and multinomial regression. Technical Report, Statistics Department, Stanford University 2013
- An Introduction to Statistical Learning with Applications in R Springer Texts in Statistics. 2013
- Compressive Feature Learning Curran Associates, Inc., 2013: 2931–39
- Structure Learning of Mixed Graphical Models Proceedings of the 16th International Conference on Artificial Intelligence and Statistics 2013: 388–396
-
The graphical lasso: New insights and alternatives.
Electronic journal of statistics
2012; 6: 2125-2149
Abstract
The graphical lasso [5] is an algorithm for learning the structure in an undirected Gaussian graphical model, using ℓ1 regularization to control the number of zeros in the precision matrix Θ = Σ⁻¹ [2, 11]. The R package GLASSO [5] is popular, fast, and allows one to efficiently build a path of models for different values of the tuning parameter. Convergence of GLASSO can be tricky; the converged precision matrix might not be the inverse of the estimated covariance, and occasionally it fails to converge with warm starts. In this paper we explain this behavior, and propose new algorithms that appear to outperform GLASSO. By studying the "normal equations" we see that GLASSO is solving the dual of the graphical lasso penalized likelihood, by block coordinate ascent; a result which can also be found in [2]. In this dual, the target of estimation is Σ, the covariance matrix, rather than the precision matrix Θ. We propose similar primal algorithms P-GLASSO and DP-GLASSO, that also operate by block-coordinate descent, where Θ is the optimization target. We study all of these algorithms, and in particular different approaches to solving their coordinate sub-problems. We conclude that DP-GLASSO is superior from several points of view.
View details for DOI 10.1214/12-EJS740
View details for PubMedID 25558297
View details for PubMedCentralID PMC4281944
-
No increased mortality with early aortic aneurysm disease
26th Annual Meeting of the Western-Vascular-Society
MOSBY-ELSEVIER. 2012: 1246–51
Abstract
In addition to increased risks for aneurysm-related death, previous studies have determined that all-cause mortality in abdominal aortic aneurysm (AAA) patients is excessive and equivalent to that associated with coronary heart disease. These studies largely preceded the current era of coronary heart disease risk factor management, however, and no recent study has examined contemporary mortality associated with early AAA disease (aneurysm diameter between 3 and 5 cm). As part of an ongoing natural history study of AAA, we report the mortality risk associated with presence of early disease. Participants were recruited from three distinct health care systems in Northern California between 2006 and 2011. Aneurysm diameter, demographic information, comorbidities, medication history, and plasma for biomarker analysis were collected at study entry. Survival status was determined at follow-up. Data were analyzed with t-tests or χ² tests where appropriate. Freedom from death was calculated via Cox proportional hazards modeling; the relevance of individual predictors on mortality was determined by log-rank test. The study enrolled 634 AAA patients; age 76.4 ± 8.0 years, aortic diameter 3.86 ± 0.7 cm. Participants were mostly male (88.8%), not current smokers (81.6%), and taking statins (76.7%). Mean follow-up was 2.1 ± 1.0 years. Estimated 1- and 3-year survival was 98.2% and 90.9%, respectively. Factors independently associated with mortality included larger aneurysm size (hazard ratio, 2.12; 95% confidence interval, 1.26-3.57 for diameter >4.0 cm) and diabetes (hazard ratio, 2.24; 95% confidence interval, 1.12-4.47). After adjusting for patient-level factors, health care system independently predicted mortality. Contemporary all-cause mortality for patients with early AAA disease is lower than that previously reported. Further research is warranted to determine important factors that contribute to improved survival in early AAA disease.
View details for DOI 10.1016/j.jvs.2012.04.023
View details for Web of Science ID 000310428200007
View details for PubMedID 22832264
View details for PubMedCentralID PMC3478494
-
Coronary risk assessment among intermediate risk patients using a clinical and biomarker based algorithm developed and validated in two population cohorts
CURRENT MEDICAL RESEARCH AND OPINION
2012; 28 (11): 1819-1830
Abstract
Many coronary heart disease (CHD) events occur in individuals classified as intermediate risk by commonly used assessment tools. Over half the individuals presenting with a severe cardiac event, such as myocardial infarction (MI), have at most one risk factor as included in the widely used Framingham risk assessment. Individuals classified as intermediate risk, who are actually at high risk, may not receive guideline recommended treatments. A clinically useful method for accurately predicting 5-year CHD risk among intermediate risk patients remains an unmet medical need. This study sought to develop a CHD Risk Assessment (CHDRA) model that improves 5-year risk stratification among intermediate risk individuals. Assay panels for biomarkers associated with atherosclerosis biology (inflammation, angiogenesis, apoptosis, chemotaxis, etc.) were optimized for measuring baseline serum samples from 1084 initially CHD-free Marshfield Clinic Personalized Medicine Research Project (PMRP) individuals. A multivariable Cox regression model was fit using the most powerful risk predictors within the clinical and protein variables identified by repeated cross-validation. The resulting CHDRA algorithm was validated in a Multiple-Ethnic Study of Atherosclerosis (MESA) case-cohort sample. A CHDRA algorithm of age, sex, diabetes, and family history of MI, combined with serum levels of seven biomarkers (CTACK, Eotaxin, Fas Ligand, HGF, IL-16, MCP-3, and sFas) yielded a clinical net reclassification index of 42.7% (p < 0.001) for MESA patients with a recalibrated Framingham 5-year intermediate risk level. Across all patients, the model predicted acute coronary events (hazard ratio = 2.17, p < 0.001), and remained an independent predictor after Framingham risk factor adjustments. Limitations include the slightly different event definition with the MESA samples and inability to include PMRP fatal CHD events. A novel risk score of serum protein levels plus clinical risk factors, developed and validated in independent cohorts, demonstrated clinical utility for assessing the true risk of CHD events in intermediate risk patients. Improved accuracy in cardiovascular risk classification could lead to improved preventive care and fewer deaths.
View details for DOI 10.1185/03007995.2012.742878
View details for Web of Science ID 000310985600009
View details for PubMedID 23092312
View details for PubMedCentralID PMC3666558
-
Strong rules for discarding predictors in lasso-type problems.
Journal of the Royal Statistical Society. Series B, Statistical methodology
2012; 74 (2): 245-266
Abstract
We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui and his colleagues have proposed 'SAFE' rules, based on univariate inner products between each predictor and the outcome, which guarantee that a coefficient will be 0 in the solution vector. This provides a reduction in the number of variables that need to be entered into the optimization. We propose strong rules that are very simple and yet screen out far more predictors than the SAFE rules. This great practical improvement comes at a price: the strong rules are not foolproof and can mistakenly discard active predictors, i.e. predictors that have non-zero coefficients in the solution. We therefore combine them with simple checks of the Karush-Kuhn-Tucker conditions to ensure that the exact solution to the convex problem is delivered. Of course, any (approximate) screening method can be combined with the Karush-Kuhn-Tucker conditions to ensure the exact solution; the strength of the strong rules lies in the fact that, in practice, they discard a very large number of the inactive predictors and almost never commit mistakes. We also derive conditions under which they are foolproof. Strong rules provide substantial savings in computational time for a variety of statistical optimization problems.
View details for DOI 10.1111/j.1467-9868.2011.01004.x
View details for PubMedID 25506256
View details for PubMedCentralID PMC4262615
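As a rough illustration of the screen-then-check idea described in the abstract above, the sketch below applies the basic strong rule as it is commonly stated for the lasso objective (1/2)||y - Xb||^2 + lam * ||b||_1 with standardized predictors: predictor j is discarded when |x_j'y| < 2*lam - lam_max, where lam_max = max_j |x_j'y|. The abstract itself does not give this formula, so treat it as an assumption of this sketch. The function names and the list-of-columns data layout are hypothetical and are not taken from the paper or any package.

```python
def basic_strong_rule(X_cols, y, lam):
    """Split predictor indices into (kept, discarded) using the basic strong rule.

    X_cols is a list of predictor columns (each a list of floats), assumed
    standardized; y is the response; lam is the lasso penalty.
    """
    inner = [abs(sum(x * t for x, t in zip(col, y))) for col in X_cols]
    lam_max = max(inner)            # smallest penalty giving an all-zero solution
    threshold = 2.0 * lam - lam_max
    kept = [j for j, c in enumerate(inner) if c >= threshold]
    discarded = [j for j, c in enumerate(inner) if c < threshold]
    return kept, discarded


def kkt_violations(X_cols, residual, discarded, lam):
    """Return discarded predictors that fail the KKT check |x_j'r| <= lam.

    `residual` is y minus the fitted values from a lasso fit on the kept
    predictors only. Any index returned here was screened out by mistake and
    should be added back before refitting.
    """
    bad = []
    for j in discarded:
        corr = abs(sum(x * r for x, r in zip(X_cols[j], residual)))
        if corr > lam:
            bad.append(j)
    return bad
```

In use, one would fit the lasso on the kept predictors only, call `kkt_violations` on the resulting residual, and refit with any violators restored until none remain. That loop is what makes the final solution exact even though the screening rule is not foolproof.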
-
Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso.
Journal of machine learning research : JMLR
2012; 13: 781-794
Abstract
We consider the sparse inverse covariance regularization problem or graphical lasso with regularization parameter λ. Suppose the sample covariance graph formed by thresholding the entries of the sample covariance matrix at λ is decomposed into connected components. We show that the vertex-partition induced by the connected components of the thresholded sample covariance graph (at λ) is exactly equal to that induced by the connected components of the estimated concentration graph, obtained by solving the graphical lasso problem for the same λ. This characterizes a very interesting property of a path of graphical lasso solutions. Furthermore, this simple rule, when used as a wrapper around existing algorithms for the graphical lasso, leads to enormous performance gains. For a range of values of λ, our proposal splits a large graphical lasso problem into smaller tractable problems, making it possible to solve an otherwise infeasible large-scale problem. We illustrate the graceful scalability of our proposal via synthetic and real-life microarray examples.
View details for PubMedID 25392704
View details for PubMedCentralID PMC4225650
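The thresholding step described in the abstract above can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it builds the graph with an edge between i and j whenever the off-diagonal entry satisfies |S[i][j]| > lam and returns its connected components. The function name and the nested-list representation of the sample covariance matrix are assumptions made for this example.

```python
def threshold_components(S, lam):
    """Connected components of the thresholded sample covariance graph.

    S is a symmetric p x p matrix given as nested lists. Vertices i and j
    (i != j) are joined when |S[i][j]| > lam. Returns a list of sorted
    index lists, one per component.
    """
    p = len(S)
    seen = [False] * p
    components = []
    for start in range(p):
        if seen[start]:
            continue
        seen[start] = True
        stack, comp = [start], []
        while stack:                      # depth-first search from `start`
            i = stack.pop()
            comp.append(i)
            for j in range(p):
                if j != i and not seen[j] and abs(S[i][j]) > lam:
                    seen[j] = True
                    stack.append(j)
        components.append(sorted(comp))
    return components
```

According to the result stated in the abstract, each returned block of indices can then be passed to any graphical lasso solver separately at the same lam, and the block solutions together give the solution of the full problem.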
-
Exact Covariance Thresholding into Connected Components for Large-Scale Graphical Lasso
JOURNAL OF MACHINE LEARNING RESEARCH
2012; 13: 781-794
View details for Web of Science ID 000303772100011
-
Strong rules for discarding predictors in lasso-type problems
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
2012; 74: 245-266
Abstract
We consider rules for discarding predictors in lasso regression and related problems, for computational efficiency. El Ghaoui and his colleagues have proposed 'SAFE' rules, based on univariate inner products between each predictor and the outcome, which guarantee that a coefficient will be 0 in the solution vector. This provides a reduction in the number of variables that need to be entered into the optimization. We propose strong rules that are very simple and yet screen out far more predictors than the SAFE rules. This great practical improvement comes at a price: the strong rules are not foolproof and can mistakenly discard active predictors, i.e. predictors that have non-zero coefficients in the solution. We therefore combine them with simple checks of the Karush-Kuhn-Tucker conditions to ensure that the exact solution to the convex problem is delivered. Of course, any (approximate) screening method can be combined with the Karush-Kuhn-Tucker conditions to ensure the exact solution; the strength of the strong rules lies in the fact that, in practice, they discard a very large number of the inactive predictors and almost never commit mistakes. We also derive conditions under which they are foolproof. Strong rules provide substantial savings in computational time for a variety of statistical optimization problems.
View details for DOI 10.1111/j.1467-9868.2011.01004.x
View details for Web of Science ID 000301286200004
View details for PubMedCentralID PMC4262615
-
The graphical lasso: New insights and alternatives
ELECTRONIC JOURNAL OF STATISTICS
2012; 6: 2125-2149
View details for DOI 10.1214/12-EJS740
View details for Web of Science ID 000321016800001
- Improved coronary risk assessment among intermediate risk patients using a clinical and biomarker based algorithm developed and validated in two population cohorts Current Medical Research and Opinion 2012
-
Sparse Discriminant Analysis
TECHNOMETRICS
2011; 53 (4): 406-413
View details for DOI 10.1198/TECH.2011.08118
View details for Web of Science ID 000297904600007
-
A fused lasso latent feature model for analyzing multi-sample aCGH data
BIOSTATISTICS
2011; 12 (4): 776-791
Abstract
Array-based comparative genomic hybridization (aCGH) enables the measurement of DNA copy number across thousands of locations in a genome. The main goals of analyzing aCGH data are to identify the regions of copy number variation (CNV) and to quantify the amount of CNV. Although there are many methods for analyzing single-sample aCGH data, the analysis of multi-sample aCGH data is a relatively new area of research. Further, many of the current approaches for analyzing multi-sample aCGH data do not appropriately utilize the additional information present in the multiple samples. We propose a procedure called the Fused Lasso Latent Feature Model (FLLat) that provides a statistical framework for modeling multi-sample aCGH data and identifying regions of CNV. The procedure involves modeling each sample of aCGH data as a weighted sum of a fixed number of features. Regions of CNV are then identified through an application of the fused lasso penalty to each feature. Some simulation analyses show that FLLat outperforms single-sample methods when the simulated samples share common information. We also propose a method for estimating the false discovery rate. An analysis of an aCGH data set obtained from human breast tumors, focusing on chromosomes 8 and 17, shows that FLLat and Significance Testing of Aberrant Copy number (an alternative, existing approach) identify similar regions of CNV that are consistent with previous findings. However, through the estimated features and their corresponding weights, FLLat is further able to discern specific relationships between the samples, for example, identifying 3 distinct groups of samples based on their patterns of CNV for chromosome 17.
View details for DOI 10.1093/biostatistics/kxr012
View details for Web of Science ID 000294806800014
View details for PubMedID 21642389
-
SparseNet: Coordinate Descent With Nonconvex Penalties
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2011; 106 (495): 1125-1138
Abstract
We address the problem of sparse selection in linear models. A number of nonconvex penalties have been proposed in the literature for this purpose, along with a variety of convex-relaxation algorithms for finding good solutions. In this article we pursue a coordinate-descent approach for optimization, and study its convergence properties. We characterize the properties of penalties suitable for this approach, study their corresponding threshold functions, and describe a df-standardizing reparametrization that assists our pathwise algorithm. The MC+ penalty is ideally suited to this task, and we use it to demonstrate the performance of our algorithm. Certain technical derivations and experiments related to this article are included in the Supplementary Materials section.
View details for DOI 10.1198/jasa.2011.tm09738
View details for Web of Science ID 000296224200037
View details for PubMedCentralID PMC4286300
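The abstract above refers to the threshold functions that go with each penalty. As a hedged illustration, the sketch below writes out the univariate threshold function usually associated with the MC+ penalty for a standardized predictor, with penalty level lam and concavity parameter gamma > 1. The formula is my reading of the standard MC+ operator rather than something stated in the abstract, and the function name is hypothetical.

```python
import math


def mcplus_threshold(b, lam, gamma):
    """MC+ threshold applied to a univariate least-squares coefficient b.

    Values with |b| <= lam are set to zero, values with |b| > lam * gamma
    are left untouched, and values in between are shrunk linearly so that
    the function is continuous. Large gamma approaches soft thresholding;
    gamma near 1 approaches hard thresholding.
    """
    if gamma <= 1.0:
        raise ValueError("gamma must be greater than 1")
    a = abs(b)
    if a <= lam:
        return 0.0
    if a <= lam * gamma:
        return math.copysign((a - lam) / (1.0 - 1.0 / gamma), b)
    return b
```

In a coordinate-descent sweep, this function would take the place that soft thresholding occupies in the lasso update, which is why a single parameter gamma can bridge the lasso and best-subset-like behavior.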
-
Regularization Paths for Cox's Proportional Hazards Model via Coordinate Descent
JOURNAL OF STATISTICAL SOFTWARE
2011; 39 (5): 1-13
Abstract
We introduce a pathwise algorithm for the Cox proportional hazards model, regularized by convex combinations of ℓ1 and ℓ2 penalties (elastic net). Our algorithm fits via cyclical coordinate descent, and employs warm starts to find a solution along a regularization path. We demonstrate the efficacy of our algorithm on real and simulated data sets, and find considerable speedup between our algorithm and competing methods.
View details for Web of Science ID 000288204000001
View details for PubMedCentralID PMC4824408
-
Regularization Paths for Cox's Proportional Hazards Model via Coordinate Descent.
Journal of statistical software
2011; 39 (5): 1-13
Abstract
We introduce a pathwise algorithm for the Cox proportional hazards model, regularized by convex combinations of ℓ1 and ℓ2 penalties (elastic net). Our algorithm fits via cyclical coordinate descent, and employs warm starts to find a solution along a regularization path. We demonstrate the efficacy of our algorithm on real and simulated data sets, and find considerable speedup between our algorithm and competing methods.
View details for DOI 10.18637/jss.v039.i05
View details for PubMedID 27065756
View details for PubMedCentralID PMC4824408
-
A statistical explanation of MaxEnt for ecologists
DIVERSITY AND DISTRIBUTIONS
2011; 17 (1): 43-57
View details for DOI 10.1111/j.1472-4642.2010.00725.x
View details for Web of Science ID 000285246700005
-
SparseNet: Coordinate Descent With Nonconvex Penalties.
Journal of the American Statistical Association
2011; 106 (495): 1125-1138
Abstract
We address the problem of sparse selection in linear models. A number of nonconvex penalties have been proposed in the literature for this purpose, along with a variety of convex-relaxation algorithms for finding good solutions. In this article we pursue a coordinate-descent approach for optimization, and study its convergence properties. We characterize the properties of penalties suitable for this approach, study their corresponding threshold functions, and describe a df-standardizing reparametrization that assists our pathwise algorithm. The MC+ penalty is ideally suited to this task, and we use it to demonstrate the performance of our algorithm. Certain technical derivations and experiments related to this article are included in the Supplementary Materials section.
View details for DOI 10.1198/jasa.2011.tm09738
View details for PubMedID 25580042
View details for PubMedCentralID PMC4286300
-
Regularization Paths for Generalized Linear Models via Coordinate Descent.
Journal of statistical software
2010; 33 (1): 1-22
Abstract
We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.
View details for PubMedID 20808728
View details for PubMedCentralID PMC2929880
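To make the phrase "cyclical coordinate descent, computed along a regularization path" in the abstract above concrete, here is a minimal pure-Python sketch for the Gaussian (linear regression) case only. It assumes the objective (1/(2n)) * sum of squared residuals + lam * ((1 - alpha)/2 * ||b||_2^2 + alpha * ||b||_1), a centered response with no intercept, and no all-zero columns. The function names are hypothetical; this is not the glmnet implementation and omits its many optimizations.

```python
def soft_threshold(z, g):
    """Soft-thresholding operator: shrink z toward zero by g."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def elastic_net_cd(X, y, lam, alpha=1.0, n_sweeps=100, beta=None):
    """Cyclical coordinate descent for the elastic net (Gaussian case).

    X is a list of rows, y a list of responses. `beta` is an optional warm
    start. Returns the coefficient list after n_sweeps full cycles.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p if beta is None else list(beta)
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    for _ in range(n_sweeps):
        for j in range(p):
            col_sq = sum(X[i][j] ** 2 for i in range(n)) / n
            # Inner product of x_j with the partial residual that excludes x_j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta   # keep residuals in sync
                beta[j] = new
    return beta


def regularization_path(X, y, lambdas, alpha=1.0):
    """Fit from the largest lam to the smallest, warm-starting each fit."""
    beta, path = None, []
    for lam in sorted(lambdas, reverse=True):
        beta = elastic_net_cd(X, y, lam, alpha, beta=beta)
        path.append((lam, beta))
    return path
```

Starting at a large penalty, where the solution is all zeros, and reusing each solution as the starting point for the next smaller penalty is what keeps the per-lambda work small along the path.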
-
Spectral Regularization Algorithms for Learning Large Incomplete Matrices
JOURNAL OF MACHINE LEARNING RESEARCH
2010; 11: 2287-2322
Abstract
We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm Soft-Impute iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices: for example it can obtain a rank-80 approximation of a 10^6 × 10^6 incomplete matrix with 10^5 observed entries in 2.5 hours, and can fit a rank-40 approximation to the full Netflix training set in 6.6 hours. Our methods show very good performance both in training and test error when compared to other competitive state-of-the-art techniques.
View details for Web of Science ID 000282523300010
View details for PubMedCentralID PMC3087301
-
Dynamic visualization of statistical learning in the context of high-dimensional textual data
JOURNAL OF WEB SEMANTICS
2010; 8 (2-3): 163-168
View details for DOI 10.1016/j.websem.2010.03.007
View details for Web of Science ID 000279532700009
-
Likelihood-Based Sufficient Dimension Reduction
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2010; 105 (490): 880-880
View details for DOI 10.1198/jasa.2010.tm09295
View details for Web of Science ID 000280216700036
-
Cell type-specific gene expression differences in complex tissues
NATURE METHODS
2010; 7 (4): 287-289
Abstract
We describe cell type-specific significance analysis of microarrays (csSAM) for analyzing differential gene expression for each cell type in a biological sample from microarray data and relative cell-type frequencies. First, we validated csSAM with predesigned mixtures and then applied it to whole-blood gene expression datasets from stable post-transplant kidney transplant recipients and those experiencing acute transplant rejection, which revealed hundreds of differentially expressed genes that were otherwise undetectable.
View details for DOI 10.1038/NMETH.1439
View details for Web of Science ID 000276150600017
View details for PubMedID 20208531
-
Spectral Regularization Algorithms for Learning Large Incomplete Matrices.
Journal of machine learning research : JMLR
2010; 11: 2287-2322
Abstract
We use convex relaxation techniques to provide a sequence of regularized low-rank solutions for large-scale matrix completion problems. Using the nuclear norm as a regularizer, we provide a simple and very efficient convex algorithm for minimizing the reconstruction error subject to a bound on the nuclear norm. Our algorithm Soft-Impute iteratively replaces the missing elements with those obtained from a soft-thresholded SVD. With warm starts this allows us to efficiently compute an entire regularization path of solutions on a grid of values of the regularization parameter. The computationally intensive part of our algorithm is in computing a low-rank SVD of a dense matrix. Exploiting the problem structure, we show that the task can be performed with a complexity linear in the matrix dimensions. Our semidefinite-programming algorithm is readily scalable to large matrices: for example it can obtain a rank-80 approximation of a 10^6 × 10^6 incomplete matrix with 10^5 observed entries in 2.5 hours, and can fit a rank-40 approximation to the full Netflix training set in 6.6 hours. Our methods show very good performance both in training and test error when compared to other competitive state-of-the-art techniques.
View details for PubMedID 21552465
View details for PubMedCentralID PMC3087301
-
Discovery of molecular subtypes in leiomyosarcoma through integrative molecular profiling
ONCOGENE
2010; 29 (6): 845-854
Abstract
Leiomyosarcoma (LMS) is a soft tissue tumor with a significant degree of morphologic and molecular heterogeneity. We used integrative molecular profiling to discover and characterize molecular subtypes of LMS. Gene expression profiling was performed on 51 LMS samples. Unsupervised clustering showed three reproducible LMS clusters. Array comparative genomic hybridization (aCGH) was performed on 20 LMS samples and showed that the molecular subtypes defined by gene expression showed distinct genomic changes. Tumors from the 'muscle-enriched' cluster showed significantly increased copy number changes (P=0.04). A majority of the muscle-enriched cases showed loss at 16q24, which contains Fanconi anemia, complementation group A, known to have an important role in DNA repair, and loss at 1p36, which contains PRDM16, of which loss promotes muscle differentiation. Immunohistochemistry (IHC) was performed on LMS tissue microarrays (n=377) for five markers with high levels of messenger RNA in the muscle-enriched cluster (ACTG2, CASQ2, SLMAP, CFL2 and MYLK) and showed significantly correlated expression of the five proteins (all pairwise P<0.005). Expression of the five markers was associated with improved disease-specific survival in a multivariate Cox regression analysis (P<0.04). In this analysis that combined gene expression profiling, aCGH and IHC, we characterized distinct molecular LMS subtypes, provided insight into their pathogenesis, and identified prognostic biomarkers.
View details for DOI 10.1038/onc.2009.381
View details for Web of Science ID 000274397800007
View details for PubMedID 19901961
View details for PubMedCentralID PMC2820592
-
Network-Based Elucidation of Human Disease Similarities Reveals Common Functional Modules Enriched for Pluripotent Drug Targets
PLOS COMPUTATIONAL BIOLOGY
2010; 6 (2)
Abstract
Current work in elucidating relationships between diseases has largely been based on pre-existing knowledge of disease genes. Consequently, these studies are limited in their discovery of new and unknown disease relationships. We present the first quantitative framework to compare and contrast diseases by an integrated analysis of disease-related mRNA expression data and the human protein interaction network. We identified 4,620 functional modules in the human protein network and provided a quantitative metric to record their responses in 54 diseases leading to 138 significant similarities between diseases. Fourteen of the significant disease correlations also shared common drugs, supporting the hypothesis that similar diseases can be treated by the same drugs, allowing us to make predictions for new uses of existing drugs. Finally, we also identified 59 modules that were dysregulated in at least half of the diseases, representing a common disease-state "signature". These modules were significantly enriched for genes that are known to be drug targets. Interestingly, drugs known to target these genes/proteins are already known to treat significantly more diseases than drugs targeting other genes/proteins, highlighting the importance of these core modules as prime therapeutic opportunities.
View details for DOI 10.1371/journal.pcbi.1000662
View details for Web of Science ID 000275260000026
View details for PubMedID 20140234
View details for PubMedCentralID PMC2816673
-
Regularization Paths for Generalized Linear Models via Coordinate Descent
JOURNAL OF STATISTICAL SOFTWARE
2010; 33 (1): 1-22
Abstract
We develop fast algorithms for estimation of generalized linear models with convex penalties. The models include linear regression, two-class logistic regression, and multinomial regression problems while the penalties include ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two (the elastic net). The algorithms use cyclical coordinate descent, computed along a regularization path. The methods can handle large problems and can also deal efficiently with sparse features. In comparative timings we find that the new algorithms are considerably faster than competing methods.
View details for Web of Science ID 000275203200001
View details for PubMedCentralID PMC2929880
-
A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis
BIOSTATISTICS
2009; 10 (3): 515-534
Abstract
We present a penalized matrix decomposition (PMD), a new framework for computing a rank-$K$ approximation for a matrix. We approximate the matrix $X$ as $\hat{X} = \sum_{k=1}^{K} d_k u_k v_k^T$, where $d_k$, $u_k$, and $v_k$ minimize the squared Frobenius norm of $X - \hat{X}$, subject to penalties on $u_k$ and $v_k$. This results in a regularized version of the singular value decomposition. Of particular interest is the use of $L_1$-penalties on $u_k$ and $v_k$, which yields a decomposition of $X$ using sparse vectors. We show that when the PMD is applied using an $L_1$-penalty on $v_k$ but not on $u_k$, a method for sparse principal components results. In fact, this yields an efficient algorithm for the "SCoTLASS" proposal (Jolliffe and others 2003) for obtaining sparse principal components. This method is demonstrated on a publicly available gene expression data set. We also establish connections between the SCoTLASS method for sparse principal component analysis and the method of Zou and others (2006). In addition, we show that when the PMD is applied to a cross-products matrix, it results in a method for penalized canonical correlation analysis (CCA). We apply this penalized CCA method to simulated data and to a genomic data set consisting of gene expression and DNA copy number measurements on the same set of samples.
View details for DOI 10.1093/biostatistics/kxp008
View details for Web of Science ID 000267213700010
View details for PubMedID 19377034
View details for PubMedCentralID PMC2697346
-
Presence-Only Data and the EM Algorithm
BIOMETRICS
2009; 65 (2): 554-563
Abstract
In ecological modeling of the habitat of a species, it can be prohibitively expensive to determine species absence. Presence-only data consist of a sample of locations with observed presences and a separate group of locations sampled from the full landscape, with unknown presences. We propose an expectation-maximization algorithm to estimate the underlying presence-absence logistic model for presence-only data. This algorithm can be used with any off-the-shelf logistic model. For models with stepwise fitting procedures, such as boosted trees, the fitting process can be accelerated by interleaving expectation steps within the procedure. Preliminary analyses based on sampling from presence-absence records of fish in New Zealand rivers illustrate that this new procedure can reduce both deviance and the shrinkage of marginal effect estimates that occur in the naive model often used in practice. Finally, it is shown that the population prevalence of a species is only identifiable when there is some unrealistic constraint on the structure of the logistic model. In practice, it is strongly recommended that an estimate of population prevalence be provided.
View details for DOI 10.1111/j.1541-0420.2008.01116.x
View details for Web of Science ID 000266449900025
View details for PubMedID 18759851
-
Genome-wide association analysis by lasso penalized logistic regression
BIOINFORMATICS
2009; 25 (6): 714-721
Abstract
In ordinary regression, imposition of a lasso penalty makes continuous model selection straightforward. Lasso penalized regression is particularly advantageous when the number of predictors far exceeds the number of observations. The present article evaluates the performance of lasso penalized logistic regression in case-control disease gene mapping with a large number of SNP (single nucleotide polymorphism) predictors. The strength of the lasso penalty can be tuned to select a predetermined number of the most relevant SNPs and other predictors. For a given value of the tuning constant, the penalized likelihood is quickly maximized by cyclic coordinate ascent. Once the most potent marginal predictors are identified, their two-way and higher order interactions can also be examined by lasso penalized logistic regression. This strategy is tested on both simulated and real data. Our findings on coeliac disease replicate the previous SNP results and shed light on possible interactions among the SNPs. The software discussed is available in Mendel 9.0 at the UCLA Human Genetics web site. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btp041
View details for Web of Science ID 000264189600003
View details for PubMedID 19176549
View details for PubMedCentralID PMC2732298
-
Discovery of Molecular Subtypes in Leiomyosarcoma through Integrative Molecular Profiling
98th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology
NATURE PUBLISHING GROUP. 2009: 368A–368A
View details for Web of Science ID 000262486301668
-
Multi-class AdaBoost
STATISTICS AND ITS INTERFACE
2009; 2 (3): 349-360
View details for Web of Science ID 000282650400009
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction Springer Verlag. 2009
-
Discovery of Molecular Subtypes in Leiomyosarcoma through Integrative Molecular Profiling
98th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology
NATURE PUBLISHING GROUP. 2009: 368A–368A
View details for Web of Science ID 000262371501667
-
New cutpoints to identify increased HER2 copy number: analysis of a large, population-based cohort with long-term follow-up
BREAST CANCER RESEARCH AND TREATMENT
2008; 112 (3): 453-459
Abstract
HER2 gene amplification and/or protein overexpression in breast cancer is associated with a poor prognosis and predicts response to anti-HER2 therapy. We examine the natural history of breast cancers in relationship to increased HER2 copy numbers in a large population-based study. HER2 status was measured by fluorescence in situ hybridization (FISH) and immunohistochemistry (IHC) in approximately 1,400 breast cancer cases with greater than 15 years of follow-up. Protein expression was evaluated with two different commercially available antibodies. We looked for subgroups of breast cancer with different clinical outcomes, based on HER2 FISH amplification ratio. The current HER2 ratio cut point for classifying HER2 positive and negative cases is 2.2. However, we found an increased risk of disease-specific death associated with FISH ratios of >1.5. An 'intermediate' group of cases with HER2 ratios between 1.5 and 2.2 was found to have a significantly better outcome than the conventional 'amplified' group (HER2 ratio >2.2) but a significantly worse outcome than groups with FISH ratios less than 1.5. Breast cancers with increased HER2 copy numbers (low level HER2 amplification), below the currently accepted positive threshold ratio of 2.2, showed a distinct, intermediate outcome when compared to HER2 unamplified tumors and tumors with HER2 ratios greater than 2.2. These findings suggest that a new cut point to determine HER2 positivity, at a ratio of 1.5 (well below the current recommended cut point of 2.2), should be evaluated.
View details for DOI 10.1007/s10549-007-9887-y
View details for Web of Science ID 000261951000007
View details for PubMedID 18193353
-
NEW MULTICATEGORY BOOSTING ALGORITHMS BASED ON MULTICATEGORY FISHER-CONSISTENT LOSSES.
The annals of applied statistics
2008; 2 (4): 1290-1306
Abstract
Fisher-consistent loss functions play a fundamental role in the construction of successful binary margin-based classifiers. In this paper we establish the Fisher-consistency condition for multicategory classification problems. Our approach uses the margin vector concept which can be regarded as a multicategory generalization of the binary margin. We characterize a wide class of smooth convex loss functions that are Fisher-consistent for multicategory classification. We then consider using the margin-vector-based loss functions to derive multicategory boosting algorithms. In particular, we derive two new multicategory boosting algorithms by using the exponential and logistic regression losses.
View details for DOI 10.1214/08-AOAS198
View details for PubMedID 27347277
View details for PubMedCentralID PMC4918057
-
NEW MULTICATEGORY BOOSTING ALGORITHMS BASED ON MULTICATEGORY FISHER-CONSISTENT LOSSES
ANNALS OF APPLIED STATISTICS
2008; 2 (4): 1290-1306
View details for DOI 10.1214/08-AOAS198
View details for Web of Science ID 000262731100009
-
Risk estimation of distant metastasis in node-negative, estrogen receptor-positive breast cancer patients using an RT-PCR based prognostic expression signature
BMC CANCER
2008; 8
Abstract
Given the large number of genes purported to be prognostic for breast cancer, it would be optimal if the genes identified are not confounded by the continuously changing systemic therapies. The aim of this study was to discover and validate a breast cancer prognostic expression signature for distant metastasis in untreated, early stage, lymph node-negative (N-) estrogen receptor-positive (ER+) patients with extensive follow-up times. 197 genes previously associated with metastasis and ER status were profiled from 142 untreated breast cancer subjects. A "metastasis score" (MS) representing fourteen differentially expressed genes was developed and evaluated for its association with distant-metastasis-free survival (DMFS). Categorical risk classification was established from the continuous MS and further evaluated on an independent set of 279 untreated subjects. A third set of 45 subjects was tested to determine the prognostic performance of the MS in tamoxifen-treated women. A 14-gene signature was found to be significantly associated (p < 0.05) with distant metastasis in a training set and subsequently in an independent validation set. In the validation set, the hazard ratios (HR) of the high risk compared to low risk groups were 4.02 (95% CI 1.91-8.44) for the endpoint of DMFS and 1.97 (95% CI 1.28 to 3.04) for overall survival after adjustment for age, tumor size and grade. The low and high MS risk groups had 10-year estimates (95% CI) of 96% (90-99%) and 72% (64-78%) respectively, for DMFS and 91% (84-95%) and 68% (61-75%), respectively for overall survival. Performance characteristics of the signature in the two sets were similar. Ki-67 labeling index (LI) was predictive for recurrent disease in the training set, but lost significance after adjustment for the expression signature. In a study of tamoxifen-treated patients, the HR for DMFS in high compared to low risk groups was 3.61 (95% CI 0.86-15.14). The 14-gene signature is significantly associated with risk of distant metastasis. The signature has a predominance of proliferation genes which have prognostic significance above that of Ki-67 LI and may aid in prioritizing future mechanistic studies and therapeutic interventions.
View details for DOI 10.1186/1471-2407-8-339
View details for Web of Science ID 000262700100001
View details for PubMedID 19025599
View details for PubMedCentralID PMC2631011
-
Combining biological gene expression signatures in predicting outcome in breast cancer: An alternative to supervised classification
EUROPEAN JOURNAL OF CANCER
2008; 44 (15): 2319-2329
Abstract
Gene expression profiling has been extensively used to predict outcome in breast cancer patients. We have previously reported on biological hypothesis-driven analysis of gene expression profiling data and we wished to extend this approach through the combinations of various gene signatures to improve the prediction of outcome in breast cancer.We have used gene expression data (25.000 gene probes) from a previously published study of tumours from 295 early stage breast cancer patients from the Netherlands Cancer Institute using updated follow-up. Tumours were assigned to three prognostic groups using the previously reported Wound-response and hypoxia-response signatures, and the outcome in each of these subgroups was evaluated.We have assigned invasive breast carcinomas from 295 stages I and II breast cancer patients to three groups based on gene expression profiles subdivided by the wound-response signature (WS) and hypoxia-response signature (HS). These three groups are (1) quiescent WS/non-hypoxic HS; (2) activated WS/non-hypoxic HS or quiescent WS/hypoxic tumours and (3) activated WS/hypoxic HS. The overall survival at 15 years for patients with tumours in groups 1, 2 and 3 are 79%, 59% and 27%, respectively. In multivariate analysis, this signature is not only independent of clinical and pathological risk factors; it is also the strongest predictor of outcome. Compared to a previously identified 70-gene prognosis profile, obtained with supervised classification, the combination of signatures performs roughly equally well and might have additional value in the ER-negative subgroup. In the subgroup of lymph node positive patients, the combination signature outperforms the 70-gene signature in multivariate analysis. 
In addition, in multivariate analysis, the WS/HS combination is a stronger predictor of outcome compared to the recently reported invasiveness gene signature combined with the WS. A combination of biological gene expression signatures can be used to identify a powerful and independent predictor for outcome in breast cancer patients.
View details for DOI 10.1016/j.ejca.2008.07.015
View details for Web of Science ID 000261020800031
View details for PubMedID 18715778
View details for PubMedCentralID PMC3756930
-
"Preconditioning" for feature selection and regression in high-dimensional problems
ANNALS OF STATISTICS
2008; 36 (4): 1595-1618
View details for DOI 10.1214/009053607000000578
View details for Web of Science ID 000258243000007
-
Dispersal, disturbance and the contrasting biogeographies of New Zealand's diadromous and non-diadromous fish species
JOURNAL OF BIOGEOGRAPHY
2008; 35 (8): 1481-1497
View details for DOI 10.1111/j.1365-2699.2008.01887.x
View details for Web of Science ID 000258260200013
-
A working guide to boosted regression trees
JOURNAL OF ANIMAL ECOLOGY
2008; 77 (4): 802-813
Abstract
1. Ecologists use statistical models for both explanation and prediction, and need techniques that are flexible enough to express typical features of their data, such as nonlinearities and interactions. 2. This study provides a working guide to boosted regression trees (BRT), an ensemble method for fitting statistical models that differs fundamentally from conventional techniques that aim to fit a single parsimonious model. Boosted regression trees combine the strengths of two algorithms: regression trees (models that relate a response to their predictors by recursive binary splits) and boosting (an adaptive method for combining many simple models to give improved predictive performance). The final BRT model can be understood as an additive regression model in which individual terms are simple trees, fitted in a forward, stagewise fashion. 3. Boosted regression trees incorporate important advantages of tree-based methods, handling different types of predictor variables and accommodating missing data. They have no need for prior data transformation or elimination of outliers, can fit complex nonlinear relationships, and automatically handle interaction effects between predictors. Fitting multiple trees in BRT overcomes the biggest drawback of single tree models: their relatively poor predictive performance. Although BRT models are complex, they can be summarized in ways that give powerful ecological insight, and their predictive performance is superior to most traditional modelling methods. 4. The unique features of BRT raise a number of practical issues in model fitting. We demonstrate the practicalities and advantages of using BRT through a distributional analysis of the short-finned eel (Anguilla australis Richardson), a native freshwater fish of New Zealand. We use a data set of over 13 000 sites to illustrate effects of several settings, and then fit and interpret a model using a subset of the data. 
We provide code and a tutorial to enable the wider use of BRT by ecologists.
View details for DOI 10.1111/j.1365-2656.2008.01390.x
View details for Web of Science ID 000256539800020
View details for PubMedID 18397250
-
Sparse inverse covariance estimation with the graphical lasso
BIOSTATISTICS
2008; 9 (3): 432-441
Abstract
We consider the problem of estimating sparse graphs by a lasso penalty applied to the inverse covariance matrix. Using a coordinate descent procedure for the lasso, we develop a simple algorithm--the graphical lasso--that is remarkably fast: It solves a 1000-node problem (approximately 500,000 parameters) in at most a minute and is 30-4000 times faster than competing methods. It also provides a conceptual link between the exact problem and the approximation suggested by Meinshausen and Bühlmann (2006). We illustrate the method on some cell-signaling data from proteomics.
View details for DOI 10.1093/biostatistics/kxm045
View details for Web of Science ID 000256977000005
View details for PubMedID 18079126
View details for PubMedCentralID PMC3019769
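The abstract above describes coordinate descent for a lasso problem applied column by column to the covariance estimate. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the function names (`soft_threshold`, `graphical_lasso`), the parameter names (`rho`, `n_sweeps`, `tol`) and the convergence checks are assumptions made for illustration, and the code is written for readability rather than speed.

```python
import numpy as np


def soft_threshold(x, t):
    """Scalar soft-thresholding operator used by lasso coordinate descent."""
    return np.sign(x) * max(abs(x) - t, 0.0)


def graphical_lasso(S, rho, n_sweeps=100, tol=1e-6):
    """Illustrative block coordinate descent for an L1-penalized inverse covariance.

    S   : (p, p) empirical covariance matrix
    rho : non-negative penalty weight (hypothetical parameter name)
    Returns (W, Theta): the covariance estimate and its (sparse) inverse estimate.
    """
    p = S.shape[0]
    W = S + rho * np.eye(p)          # diagonal of W stays fixed from here on
    B = np.zeros((p, p))             # B[rest, j] stores the lasso coefficients for column j
    idx = np.arange(p)
    for _ in range(n_sweeps):
        W_old = W.copy()
        for j in range(p):
            rest = idx[idx != j]
            W11 = W[np.ix_(rest, rest)]
            s12 = S[rest, j]
            beta = B[rest, j].copy()  # warm start from the previous sweep
            # Inner lasso: minimize 0.5 * b' W11 b - b' s12 + rho * ||b||_1
            for _ in range(100):
                beta_old = beta.copy()
                for k in range(p - 1):
                    r = s12[k] - W11[k] @ beta + W11[k, k] * beta[k]
                    beta[k] = soft_threshold(r, rho) / W11[k, k]
                if np.max(np.abs(beta - beta_old)) < tol:
                    break
            B[rest, j] = beta
            w12 = W11 @ beta          # updated off-diagonal column of W
            W[rest, j] = w12
            W[j, rest] = w12
        if np.max(np.abs(W - W_old)) < tol:
            break
    # Recover the inverse covariance estimate from the stored coefficients.
    Theta = np.zeros((p, p))
    for j in range(p):
        rest = idx[idx != j]
        theta_jj = 1.0 / (W[j, j] - W[rest, j] @ B[rest, j])
        Theta[j, j] = theta_jj
        Theta[rest, j] = -B[rest, j] * theta_jj
    return W, Theta
```

When `rho` is at least as large as every off-diagonal entry of `S` in absolute value, all coefficients are thresholded to zero and the estimated graph has no edges; smaller values of `rho` admit progressively more edges.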
-
Novel methods for the design and evaluation of marine protected areas in offshore waters
CONSERVATION LETTERS
2008; 1 (2): 91-102
View details for DOI 10.1111/j.1755-263X.2008.00012.x
View details for Web of Science ID 000207586900006
-
Radiation-induced gene expression in human subcutaneous fibroblasts is predictive of radiation-induced fibrosis
RADIOTHERAPY AND ONCOLOGY
2008; 86 (3): 314-320
Abstract
Breast cancer patients show a large variation in normal tissue reactions after ionizing radiation (IR) therapy. One of the most common long-term adverse effects of ionizing radiotherapy is radiation-induced fibrosis (RIF), and several attempts have been made over the last years to develop predictive assays for RIF. Our aim was to identify basal and radiation-induced transcriptional profiles in fibroblasts from breast cancer patients that might be related to the individual risk of RIF in these patients. Fibroblast cell lines from 31 individuals with variable risk of RIF (grouped into five classes from low to high risk) were irradiated with two different schemes: 1 x 3.5 Gy with RNA isolated 2 and 24 h after irradiation, and a fractionated scheme with 3 x 3.5 Gy in intervals of 24 h with RNA isolated 2 h after the last dose. RNA was also isolated from non-treated fibroblasts. Transcriptional differences in basal and radiation-induced gene expression profiles were investigated using 15K cDNA microarrays, and results analyzed by both SAM and PAM. Sixty differentially expressed genes were identified by applying SAM on 10 patients with the highest risk of RIF and the four patients with the lowest risk of RIF after the fractionated scheme. The genes were associated with known functions in processes like apoptosis, extracellular matrix remodelling/cell adhesion, proliferation and ROS scavenging. A minimum set of 18 genes was identified that could differentiate high-risk from low-risk patients after the fractionated scheme. The classifier of 18 genes may provide a basis for a predictive assay for normal tissue reactions after radiotherapy, and provide new insight into the molecular mechanisms of RIF.
View details for DOI 10.1016/j.radonc.2007.09.013
View details for Web of Science ID 000255304300003
View details for PubMedID 17963910
-
HER2 status in a large, population-based cohort: Analysis of distinct HER2 subgroups
97th Annual Meeting of the United-States-and-Canadian-Academy-of-Pathology
NATURE PUBLISHING GROUP. 2008: 39A–39A
View details for Web of Science ID 000252181100167
-
Penalized logistic regression for detecting gene interactions
BIOSTATISTICS
2008; 9 (1): 30-50
Abstract
We propose using a variant of logistic regression (LR) with L2-regularization to fit gene-gene and gene-environment interaction models. Studies have shown that many common diseases are influenced by interaction of certain genes. LR models with quadratic penalization not only correctly characterize the influential genes along with their interaction structures but also yield additional benefits in handling high-dimensional, discrete factors with a binary response. We illustrate the advantages of using an L2-regularization scheme and compare its performance with that of "multifactor dimensionality reduction" and "FlexTree," 2 recent tools for identifying gene-gene interactions. Through simulated and real data sets, we demonstrate that our method outperforms other methods in the identification of the interaction structures as well as prediction accuracy. In addition, we validate the significance of the factors selected through bootstrap analyses.
View details for DOI 10.1093/biostatistics/kxm010
View details for Web of Science ID 000251679400003
View details for PubMedID 17429103
-
On the "degrees of freedom" of the lasso
ANNALS OF STATISTICS
2007; 35 (5): 2173-2192
View details for DOI 10.1214/009053607000000127
View details for Web of Science ID 000251096100013
-
Nonlinear estimators and tail bounds for dimension reduction in l(1) using Cauchy random projections
JOURNAL OF MACHINE LEARNING RESEARCH
2007; 8: 2497-2532
View details for Web of Science ID 000252744800010
-
Gene expression programs of human smooth muscle cells: Tissue-specific differentiation and prognostic significance in breast cancers
PLOS GENETICS
2007; 3 (9): 1770-1784
Abstract
Smooth muscle is present in a wide variety of anatomical locations, such as blood vessels, various visceral organs, and hair follicles. Contraction of smooth muscle is central to functions as diverse as peristalsis, urination, respiration, and the maintenance of vascular tone. Despite the varied physiological roles of smooth muscle cells (SMCs), we possess only a limited knowledge of the heterogeneity underlying their functional and anatomic specializations. As a step toward understanding the intrinsic differences between SMCs from different anatomical locations, we used DNA microarrays to profile global gene expression patterns in 36 SMC samples from various tissues after propagation under defined conditions in cell culture. Significant variations were found between the cells isolated from blood vessels, bronchi, and visceral organs. Furthermore, pervasive differences were noted within the visceral organ subgroups that appear to reflect the distinct molecular pathways essential for organogenesis as well as those involved in organ-specific contractile and physiological properties. Finally, we sought to understand how this diversity may contribute to SMC-involving pathology. We found that a gene expression signature of the responses of vascular SMCs to serum exposure is associated with a significantly poorer prognosis in human cancers, potentially linking vascular injury response to tumor progression.
View details for DOI 10.1371/journal.pgen.0030164
View details for Web of Science ID 000249767800019
View details for PubMedID 17907811
View details for PubMedCentralID PMC1994710
-
Averaged gene expressions for regression
BIOSTATISTICS
2007; 8 (2): 212-227
Abstract
Although averaging is a simple technique, it plays an important role in reducing variance. We use this essential property of averaging in regression of the DNA microarray data, which poses the challenge of having far more features than samples. In this paper, we introduce a two-step procedure that combines (1) hierarchical clustering and (2) Lasso. By averaging the genes within the clusters obtained from hierarchical clustering, we define supergenes and use them to fit regression models, thereby attaining concise interpretation and accuracy. Our methods are supported with theoretical justifications and demonstrated on simulated and real data sets.
View details for DOI 10.1093/biostatistics/kxl002
View details for Web of Science ID 000245512000004
View details for PubMedID 16698769
-
Margin trees for high-dimensional classification
JOURNAL OF MACHINE LEARNING RESEARCH
2007; 8: 637-652
View details for Web of Science ID 000247002700009
-
Characterization of heterotypic interaction effects in vitro to deconvolute global gene expression profiles in cancer
GENOME BIOLOGY
2007; 8 (9)
Abstract
Perturbations in cell-cell interactions are a key feature of cancer. However, little is known about the systematic effects of cell-cell interaction on global gene expression in cancer. We used an ex vivo model to simulate tumor-stroma interaction by systematically co-cultivating breast cancer cells with stromal fibroblasts and determined associated gene expression changes with cDNA microarrays. In the complex picture of epithelial-mesenchymal interaction effects, a prominent characteristic was an induction of interferon-response genes (IRGs) in a subset of cancer cells. In close proximity to these cancer cells, the fibroblasts secreted type I interferons, which, in turn, induced expression of the IRGs in the tumor cells. Paralleling this model, immunohistochemical analysis of human breast cancer tissues showed that STAT1, the key transcriptional activator of the IRGs, and itself an IRG, was expressed in a subset of the cancers, with a striking pattern of elevated expression in the cancer cells in close proximity to the stroma. In vivo, expression of the IRGs was remarkably coherent, providing a basis for segregation of 295 early-stage breast cancers into two groups. Tumors with high compared to low expression levels of IRGs were associated with significantly shorter overall survival; 59% versus 80% at 10 years (log-rank p = 0.001). In an effort to deconvolute global gene expression profiles of breast cancer by systematic characterization of heterotypic interaction effects in vitro, we found that an interaction between some breast cancer cells and stromal fibroblasts can induce an interferon-response, and that this response may be associated with a greater propensity for tumor progression.
View details for DOI 10.1186/gb-2007-8-9-r191
View details for Web of Science ID 000252100800017
View details for PubMedID 17868458
View details for PubMedCentralID PMC2375029
-
L-1-regularization path algorithm for generalized linear models
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
2007; 69: 659-677
View details for Web of Science ID 000249250000008
-
Nonlinear estimators and tail bounds for dimension reduction in l(1) using Cauchy random projections
20th Annual Conference on Learning Theory
SPRINGER-VERLAG BERLIN. 2007: 514–529
View details for Web of Science ID 000247339600037
-
Automatic bias correction methods in semi-supervised learning
AMS/IMS/SIAM Joint Summer Research Conference on Machine and Statistical Learning - Prediction and Discovery
AMER MATHEMATICAL SOC. 2007: 165–175
View details for Web of Science ID 000250954400012
-
Forward stagewise regression and the monotone lasso
ELECTRONIC JOURNAL OF STATISTICS
2007; 1: 1-29
View details for DOI 10.1214/07-EJS004
View details for Web of Science ID 000207854200001
-
Regularized linear discriminant analysis and its application in microarrays
BIOSTATISTICS
2007; 8 (1): 86-100
Abstract
In this paper, we introduce a modified version of linear discriminant analysis, called the "shrunken centroids regularized discriminant analysis" (SCRDA). This method generalizes the idea of the "nearest shrunken centroids" (NSC) (Tibshirani and others, 2003) into the classical discriminant analysis. The SCRDA method is specially designed for classification problems in high dimension low sample size situations, for example, microarray data. Through both simulated data and real life data, it is shown that this method performs very well in multivariate classification problems, often outperforms the PAM method (using the NSC algorithm) and can be as competitive as the support vector machines classifiers. It is also suitable for feature elimination purpose and can be used as gene selection method. The open source R package for this method (named "rda") is available on CRAN (http://www.r-project.org) for download and testing.
View details for DOI 10.1093/biostatistics/kxj035
View details for Web of Science ID 000242715400006
View details for PubMedID 16603682
-
Does cancer risk affect health-related quality of life in patients with Barrett's esophagus?
Digestive Disease Week Meeting/106th Annual Meeting of the American-Gastroenterological-Association
MOSBY-ELSEVIER. 2007: 16–25
Abstract
Health-related quality of life is decreased in patients with GERD and Barrett's esophagus (BE). To determine whether time-tradeoff (TTO) values would differ in patients with BE when patients were asked to trade away the potential risk of esophageal adenocarcinoma rather than chronic heartburn symptoms. A prospective clinical trial. Subjects with biopsy-proven BE. Custom-designed computer program to elicit health-state utility values, quality of life in reflux and dyspepsia (QOLRAD), and Medical Outcomes Survey short form-36 surveys. TTO utility values for the annual cancer-risk-associated current health state and for hypothetical scenarios of dysplasia and esophageal cancer. We studied 60 patients in the cancer-risk cohort (57 men, 92% veteran; mean age [standard deviation; SD], 65 years [11 years], mean GERD duration 17 years [12 years]). The heartburn cohort included 40 patients with GERD and BE with TTO values derived for GERD symptoms. The mean (SD) utility for nondysplastic BE was 0.91 (0.13) compared with 0.90 (0.12) for the heartburn cohort (P = .7). The mean utility values were significantly lower for scenarios of low-grade dysplasia (0.85 [0.12], P = .02) and high-grade dysplasia (0.77 [0.14], P < .005). The mean TTO was 0.67 (0.19) for the scenario of esophageal cancer. There was no correlation between the utility scores and the disease-specific survey scores. TTO values were hypothetical for states of dysplasia and cancer. TTO utility values based on heartburn symptoms or annual risk of cancer in patients with nondysplastic BE are roughly equivalent. However, TTO utility values are significantly lower for health states with increasing cancer risks.
View details for DOI 10.1016/j.gie.2006.05.018
View details for Web of Science ID 000243361000005
View details for PubMedID 17185075
-
Outlier sums for differential gene expression analysis
BIOSTATISTICS
2007; 8 (1): 2-8
Abstract
We propose a method for detecting genes that, in a disease group, exhibit unusually high gene expression in some but not all samples. This can be particularly useful in cancer studies, where mutations that can amplify or turn off gene expression often occur in only a minority of samples. In real and simulated examples, the new method often exhibits lower false discovery rates than simple t-statistic thresholding. We also compare our approach to the recent cancer profile outlier analysis proposal of Tomlins and others (2005).
View details for DOI 10.1093/biostatistics/kxl005
View details for Web of Science ID 000242715400001
View details for PubMedID 16702229
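The abstract above describes scoring genes that are unusually high in only some disease samples. A minimal NumPy sketch of one such score follows. The abstract does not spell out the statistic, so every detail here is an assumption made for illustration: each gene is standardized by its median and median absolute deviation over all samples, and the score sums the standardized disease-group values that exceed the 75th percentile plus the interquartile range. The function name `outlier_sums` and its arguments are hypothetical.

```python
import numpy as np


def outlier_sums(X, disease):
    """Illustrative outlier-sum score per gene.

    X       : (genes, samples) expression matrix
    disease : boolean mask of length `samples`, True for disease-group samples
    Returns an array with one score per gene; larger means more outlying values.
    """
    X = np.asarray(X, dtype=float)
    disease = np.asarray(disease, dtype=bool)
    # Robust standardization per gene (assumed: median and scaled MAD over all samples).
    med = np.median(X, axis=1, keepdims=True)
    mad = 1.4826 * np.median(np.abs(X - med), axis=1, keepdims=True)
    Z = (X - med) / mad
    # Outlier cutoff per gene (assumed: 75th percentile plus interquartile range).
    q25, q75 = np.percentile(Z, [25, 75], axis=1)
    cutoff = (q75 + (q75 - q25))[:, None]
    # Sum only the disease-group values that exceed the cutoff.
    Zd = Z[:, disease]
    return np.where(Zd > cutoff, Zd, 0.0).sum(axis=1)
```

A gene whose expression is elevated in only a handful of disease samples keeps an unremarkable mean difference between groups, which is why a t-statistic can miss it, but it accumulates a large sum here because only the extreme values contribute.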
-
Comparative performance of generalized additive models and multivariate adaptive regression splines for statistical modelling of species distributions
2nd Workshop on Advances in Predictive Species Distribution Models
ELSEVIER SCIENCE BV. 2006: 188–96
View details for DOI 10.1016/j.ecolmodel.2006.05.022
View details for Web of Science ID 000241994800007
-
An RT-PCR-based multi-gene prognostic signature predicts distant metastasis of node negative, ER positive breast cancer from FFPE sections.
42nd Annual Meeting of the American-Society-of-Clinical-Oncology
AMER SOC CLINICAL ONCOLOGY. 2006: 4S–4S
View details for Web of Science ID 000239009400013
-
Sparse principal component analysis
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2006; 15 (2): 265-286
View details for DOI 10.1198/106186006X113430
View details for Web of Science ID 000238044400001
-
Gene expression programs in response to hypoxia: Cell type specificity and prognostic significance in human cancers
PLOS MEDICINE
2006; 3 (3): 395-409
Abstract
Inadequate oxygen (hypoxia) triggers a multifaceted cellular response that has important roles in normal physiology and in many human diseases. A transcription factor, hypoxia-inducible factor (HIF), plays a central role in the hypoxia response; its activity is regulated by the oxygen-dependent degradation of the HIF-1alpha protein. Despite the ubiquity and importance of hypoxia responses, little is known about the variation in the global transcriptional response to hypoxia among different cell types or how this variation might relate to tissue- and cell-specific diseases. We analyzed the temporal changes in global transcript levels in response to hypoxia in primary renal proximal tubule epithelial cells, breast epithelial cells, smooth muscle cells, and endothelial cells with DNA microarrays. The extent of the transcriptional response to hypoxia was greatest in the renal tubule cells. This heightened response was associated with a uniquely high level of HIF-1alpha RNA in renal cells, and it could be diminished by reducing HIF-1alpha expression via RNA interference. A gene-expression signature of the hypoxia response, derived from our studies of cultured mammary and renal tubular epithelial cells, showed coordinated variation in several human cancers, and was a strong predictor of clinical outcomes in breast and ovarian cancers. In an analysis of a large, published gene-expression dataset from breast cancers, we found that the prognostic information in the hypoxia signature was virtually independent of that provided by the previously reported wound signature and more predictive of outcomes than any of the clinical parameters in current use. The transcriptional response to hypoxia varies among human cells. Some of this variation is traceable to variation in expression of the HIF1A gene. A gene-expression signature of the cellular response to hypoxia is associated with a significantly poorer prognosis in breast and ovarian cancer.
View details for DOI 10.1371/journal.pmed.0030047
View details for Web of Science ID 000236897500020
View details for PubMedID 16417408
View details for PubMedCentralID PMC1334226
-
Prediction by supervised principal components
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2006; 101 (473): 119-137
View details for DOI 10.1198/016214505000000628
View details for Web of Science ID 000235958400016
-
Variation in demersal fish species richness in the oceans surrounding New Zealand: an analysis using boosted regression trees
MARINE ECOLOGY PROGRESS SERIES
2006; 321: 267-281
View details for Web of Science ID 000241282700023
-
Improving random projections using marginal information
19th Annual Conference on Learning Theory (COLT 2006)
SPRINGER-VERLAG BERLIN. 2006: 635–649
View details for Web of Science ID 000239587900046
-
Representing cyclic human motion using functional analysis
14th Annual Neural Information Processing Systems Conference (NIPS)
ELSEVIER SCIENCE BV. 2005: 1264–76
View details for DOI 10.1016/j.imavis.2005.09.004
View details for Web of Science ID 000234243800003
-
Using multivariate adaptive regression splines to predict the distributions of New Zealand's freshwater diadromous fish
FRESHWATER BIOLOGY
2005; 50 (12): 2034-2052
View details for DOI 10.1111/j.1365-2427.2005.01448.x
View details for Web of Science ID 000233290000011
-
Microarray analysis of the transcriptional response to single or multiple doses of ionizing radiation in human subcutaneous fibroblasts
RADIOTHERAPY AND ONCOLOGY
2005; 77 (3): 231-240
Abstract
Transcriptional profiling of fibroblasts derived from breast cancer patients might improve our understanding of subcutaneous radiation-induced fibrosis. The aim of this study was to get a comprehensive overview of the changes in gene expression in subcutaneous fibroblast cell lines after various ionizing radiation (IR) schemes in order to provide information on potential targets for prevention and to suggest candidate genes for SNP association studies aimed at predicting individual risk of radiation-induced morbidity. Thirty different human fibroblast cell lines were included in the study, and two different radiation schemes: single dose experiments with 3.5 Gy or fractionated with 3 x 3.5 Gy. Expression analyses were performed on unexposed and exposed cells after different time points. The IR response was analyzed using the statistical method Significance Analysis of Microarrays (SAM). While many of the identified genes were involved in known IR response pathways like cell cycle arrest, proliferation and detoxification, a substantial fraction of the genes were involved in processes not previously associated with IR response. Of particular interest are genes involved in ECM remodelling, Wnt signalling and IGF signalling. Many of the genes were identified after a single dose, but transcriptional changes in genes related to ROS scavenging and ECM remodelling were most profound after a fractionated scheme. We have identified a number of IR response pathways in fibroblasts derived from breast cancer patients. Besides previously identified pathways, we have identified new pathways and genes that could be relevant for prevention and intervention studies of subcutaneous radiation-induced fibrosis as well as being candidates for SNP association studies.
View details for DOI 10.1016/j.radonc.2005.09.020
View details for Web of Science ID 000234358900002
View details for PubMedID 16297999
-
Constrained ordination analysis with flexible response functions
ECOLOGICAL MODELLING
2005; 187 (4): 524-536
View details for DOI 10.1016/j.ecolmodel.2005.01.049
View details for Web of Science ID 000232940400009
-
Quantitative measurements of alternating finger tapping in Parkinson's disease correlate with UPDRS motor disability and reveal the improvement in fine motor control from medication and deep brain stimulation
MOVEMENT DISORDERS
2005; 20 (10): 1286-1298
Abstract
The Unified Parkinson's Disease Rating Scale (UPDRS) is the primary outcome measure in most clinical trials of Parkinson's disease (PD) therapeutics. Each subscore of the motor section (UPDRS III) compresses a wide range of motor performance into a coarse-grained scale from 0 to 4; the assessment of performance can also be subjective. Quantitative digitography (QDG) is an objective, quantitative assessment of digital motor control using a computer-interfaced musical keyboard. In this study, we show that the kinematics of a repetitive alternating finger-tapping (RAFT) task using QDG correlate with the UPDRS motor score, particularly with the bradykinesia subscore, in 33 patients with PD. We show that dopaminergic medication and an average of 9.5 months of bilateral subthalamic nucleus deep brain stimulation (B-STN DBS) significantly improve UPDRS and QDG scores but may have different effects on certain kinematic parameters. This study substantiates the use of QDG to measure motor outcome in trials of PD therapeutics and shows that medication and B-STN DBS both improve fine motor control.
View details for PubMedID 16001401
View details for DOI 10.1002/mds.20556
View details for Web of Science ID 000232749300005
-
Combination of two biological gene expression signatures in predicting outcome in breast cancer as an alternative for supervised classification
PERGAMON-ELSEVIER SCIENCE LTD. 2005: 71–72
View details for Web of Science ID 000247564800237
-
Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2005; 102 (10): 3738-3743
Abstract
Based on the hypothesis that features of the molecular program of normal wound healing might play an important role in cancer metastasis, we previously identified consistent features in the transcriptional response of normal fibroblasts to serum, and used this "wound-response signature" to reveal links between wound healing and cancer progression in a variety of common epithelial tumors. Here, in a consecutive series of 295 early breast cancer patients, we show that both overall survival and distant metastasis-free survival are markedly diminished in patients whose tumors expressed this wound-response signature compared to tumors that did not express this signature. A gene expression centroid of the wound-response signature provides a basis for prospectively assigning a prognostic score that can be scaled to suit different clinical purposes. The wound-response signature improves risk stratification independently of known clinico-pathologic risk factors and previously established prognostic signatures based on unsupervised hierarchical clustering ("molecular subtypes") or supervised predictors of metastasis ("70-gene prognosis signature").
View details for DOI 10.1073/pnas.0409462102
View details for PubMedID 15701700
-
Patient-derived health state utilities for gastroesophageal reflux disease
AMERICAN JOURNAL OF GASTROENTEROLOGY
2005; 100 (3): 524-533
Abstract
Gastroesophageal reflux disease is a chronic disease that adversely affects health-related quality of life. The purpose of this study was to derive health state utilities for patients with chronic heartburn symptoms. We used a custom-designed computer program in order to elicit utilities with the time-tradeoff and standard-gamble techniques. Patients with chronic (more than 6 months) symptoms of gastroesophageal reflux disease entered the study. Two interviews were performed in random sequence either initially on medications for heartburn that adequately controlled symptoms, or off of medications for 1 wk while the patient was symptomatic. We also collected data using visual-analog scales, quality of life in reflux and dyspepsia (QOLRAD), and Gastrointestinal Symptom Rating Scale (GSRS) scores. We invited 222 patients to participate; 158 (71%) patients (129 men, 29 women) completed the study. Barrett's esophagus was present in 40 (25%), erosive disease in 17 (11%), and 118 (74%) had comorbid conditions. The mean (+/-SD) utility ratings were 0.94 +/- 0.09 on medical therapy and 0.90 +/- 0.12 off medications for patients with reflux alone using time tradeoff (p = 0.004), and 0.94 +/- 8.0 both on and off of antireflux medications with standard-gamble assessment (p = 0.96). Mean time-tradeoff scores were also significantly lower off of medications for patients with other comorbid conditions (p = 0.002). There was no significant difference between mean utility scores for patients with or without Barrett's esophagus or erosive disease. Gastroesophageal reflux disease adversely affects health-related quality of life. Time-tradeoff utility for patients with reflux disease is substantially higher when patients are on medication than off medications.
View details for DOI 10.1111/j.1572-0241.40588.x
View details for Web of Science ID 000227697900005
View details for PubMedID 15743346
-
Kernel logistic regression and the import vector machine
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2005; 14 (1): 185-205
View details for DOI 10.1198/106186005X25619
View details for Web of Science ID 000235041600011
-
Sample classification from protein mass spectrometry, by 'peak probability contrasts'
BIOINFORMATICS
2004; 20 (17): 3034-3044
Abstract
Early cancer detection has always been a major research focus in solid tumor oncology. Early tumor detection can theoretically result in lower stage tumors, more treatable diseases and ultimately higher cure rates with less treatment-related morbidities. Protein mass spectrometry is a potentially powerful tool for early cancer detection. We propose a novel method for sample classification from protein mass spectrometry data. When applied to spectra from both diseased and healthy patients, the 'peak probability contrast' technique provides a list of all common peaks among the spectra, their statistical significance and their relative importance in discriminating between the two groups. We illustrate the method on matrix-assisted laser desorption and ionization mass spectrometry data from a study of ovarian cancers. Compared to other statistical approaches for class prediction, the peak probability contrast method performs as well or better than several methods that require the full spectra, rather than just labelled peaks. It is also much more interpretable biologically. The peak probability contrast method is a potentially useful tool for sample classification from protein mass spectrometry data.
View details for DOI 10.1093/bioinformatics/bth357
View details for Web of Science ID 000225361400017
View details for PubMedID 15226172
-
Efficient quadratic regularization for expression arrays
BIOSTATISTICS
2004; 5 (3): 329-340
Abstract
Gene expression arrays typically have 50 to 100 samples and 1000 to 20,000 variables (genes). There have been many attempts to adapt statistical models for regression and classification to these data, and in many cases these attempts have challenged the computational resources. In this article we expose a class of techniques based on quadratic regularization of linear models, including regularized (ridge) regression, logistic and multinomial regression, linear and mixture discriminant analysis, the Cox model and neural networks. For all of these models, we show that dramatic computational savings are possible over naive implementations, using standard transformations in numerical linear algebra.
View details for DOI 10.1093/biostatistics/kxh010
View details for Web of Science ID 000222723600001
View details for PubMedID 15208198
-
Classification of gene microarrays by penalized logistic regression
BIOSTATISTICS
2004; 5 (3): 427-443
Abstract
Classification of patient samples is an important aspect of cancer diagnosis and treatment. The support vector machine (SVM) has been successfully applied to microarray cancer diagnosis problems. However, one weakness of the SVM is that given a tumor sample, it only predicts a cancer class label but does not provide any estimate of the underlying probability. We propose penalized logistic regression (PLR) as an alternative to the SVM for the microarray cancer diagnosis problem. We show that when using the same set of genes, PLR and the SVM perform similarly in cancer classification, but PLR has the advantage of additionally providing an estimate of the underlying probability. Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. We consider two gene selection methods in this paper, univariate ranking (UR) and recursive feature elimination (RFE). Empirical results indicate that PLR combined with RFE tends to select fewer genes than other methods and also performs well in both cross-validation and test samples. A fast algorithm for solving PLR is also described.
View details for DOI 10.1093/biostatistics/kxg046
View details for Web of Science ID 000222723600007
View details for PubMedID 15208204
-
Microelectrode recording revealing a somatotopic body map in the subthalamic nucleus in humans with Parkinson disease
JOURNAL OF NEUROSURGERY
2004; 100 (4): 611-618
Abstract
The subthalamic nucleus (STN) is a key structure for motor control through the basal ganglia. The aim of this study was to show that the STN in patients with Parkinson disease (PD) has a somatotopic organization similar to that in nonhuman primates. A functional map of the STN was obtained using electrophysiological microrecording during placement of deep brain stimulation (DBS) electrodes in patients with PD. Magnetic resonance imaging was combined with ventriculography and intraoperative x-ray film to assess the position of the electrodes and the STN units, which were activated by limb movements to map the sensorimotor region of the STN. Each activated cell was located relative to the anterior commissure-posterior commissure line. Three-dimensional coordinates of the cells were analyzed statistically to determine whether those cells activated by movements of the arm and leg were segregated spatially. Three hundred seventy-nine microelectrode tracks were created during placement of 71 DBS electrodes in 44 consecutive patients. Somatosensory driving was found in 288 tracks. The authors identified and localized 1213 movement-related cells and recorded responses from 29 orofacial cells, 480 arm-related cells, 558 leg-related cells, and 146 cells responsive to both arm and leg movements. Leg-related cells were localized in medial (p < 0.0001) and ventral (p < 0.0004) positions and tended to be situated anteriorly (p = 0.063) relative to arm-related cells. Evidence of somatotopic organization in the STN in patients with PD supports the current theory of highly segregated loops integrating cortex-basal ganglia connections. These loops are preserved in chronic degenerative diseases such as PD, but may subserve a distorted body map. This finding also supports the relevance of microelectrode mapping in the optimal placement of DBS electrodes along the subthalamic homunculus.
View details for Web of Science ID 000220440900009
View details for PubMedID 15070113
-
Mitral annular size predicts Alfieri stitch tension in mitral edge-to-edge repair
JOURNAL OF HEART VALVE DISEASE
2004; 13 (2): 165-173
Abstract
Whilst increased 'Alfieri stitch' tension may reduce the durability of 'edge-to-edge' mitral repair, the factors affecting suture tension are unknown. In order to study hemodynamics and left ventricular (LV) and annular dynamics that determine suture tension, the central edge of the mitral leaflets was approximated with a miniature force transducer to measure leaflet tension (T) at the leaflet approximation point. Eight sheep were studied under open-chest conditions immediately after surgical placement of a force transducer and implantation of radiopaque markers on the left ventricle and mitral annulus (MA). Hemodynamic variables were altered by two caval occlusion steps (deltaV1 and deltaV2) and dobutamine infusion. Three-dimensional marker coordinates were obtained by simultaneous biplane videofluoroscopy to measure LV volume, MA area (MAA) and septal-lateral (SL) annular dimension throughout the cardiac cycle. At baseline, peak Alfieri stitch tension (0.30 +/- 0.18 N) was observed 96 +/- 61 ms prior to end-diastole coincident with peak annular SL diameter (98 +/- 58 ms before end-diastole). Dobutamine infusion decreased suture tension (from 0.30 +/- 0.18 N to 0.20 +/- 0.12 N, p = 0.01), although peak systolic pressure increased significantly (138 +/- 19 versus 115 +/- 14 mmHg; p = 0.03). A regression model was fitted with the goal of interpreting the hemodynamic and geometric predictors of tension as their influence varied with time: Tt (N) = 0.1916 + 0.2115 x SL (cm) - 0.1996 x MAA/SL (cm2/cm) + ft x LVP (mmHg), where Tt is tension at any time during the cardiac cycle and ft is the time-varying coefficient of LVP. Tension on the leaflets in the edge-to-edge repair is determined primarily by MA SL size, and paradoxically is lower when the contractile state is enhanced. This indicates that annular and/or LV dilatation increase stitch tension and may adversely affect durability of the repair if concomitant ring annuloplasty is not performed.
View details for PubMedID 15086253
-
1-norm support vector machines
17th Annual Conference on Neural Information Processing Systems (NIPS)
M I T PRESS. 2004: 49–56
View details for Web of Science ID 000225309500007
-
Margin maximizing loss functions
17th Annual Conference on Neural Information Processing Systems (NIPS)
M I T PRESS. 2004: 1237–1244
View details for Web of Science ID 000225309500154
-
Gene expression patterns in ovarian carcinomas
MOLECULAR BIOLOGY OF THE CELL
2003; 14 (11): 4376-4386
Abstract
We used DNA microarrays to characterize the global gene expression patterns in surface epithelial cancers of the ovary. We identified groups of genes that distinguished the clear cell subtype from other ovarian carcinomas, grade I and II from grade III serous papillary carcinomas, and ovarian from breast carcinomas. Six clear cell carcinomas were distinguished from 36 other ovarian carcinomas (predominantly serous papillary) based on their gene expression patterns. The differences may yield insights into the worse prognosis and therapeutic resistance associated with clear cell carcinomas. A comparison of the gene expression patterns in the ovarian cancers to published data of gene expression in breast cancers revealed a large number of differentially expressed genes. We identified a group of 62 genes that correctly classified all 125 breast and ovarian cancer specimens. Among the best discriminators more highly expressed in the ovarian carcinomas were PAX8 (paired box gene 8), mesothelin, and ephrin-B1 (EFNB1). Although estrogen receptor was expressed in both the ovarian and breast cancers, genes that are coregulated with the estrogen receptor in breast cancers, including GATA-3, LIV-1, and X-box binding protein 1, did not show a similar pattern of coexpression in the ovarian cancers.
View details for PubMedID 12960427
-
Repeated observation of breast tumor subtypes in independent gene expression data sets
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2003; 100 (14): 8418-8423
Abstract
Characteristic patterns of gene expression measured by DNA microarrays have been used to classify tumors into clinically relevant subgroups. In this study, we have refined the previously defined subtypes of breast tumors that could be distinguished by their distinct patterns of gene expression. A total of 115 malignant breast tumors were analyzed by hierarchical clustering based on patterns of expression of 534 "intrinsic" genes and shown to subdivide into one basal-like, one ERBB2-overexpressing, two luminal-like, and one normal breast tissue-like subgroup. The genes used for classification were selected based on their similar expression levels between pairs of consecutive samples taken from the same tumor separated by 15 weeks of neoadjuvant treatment. Similar cluster analyses of two published, independent data sets representing different patient cohorts from different laboratories, uncovered some of the same breast cancer subtypes. In the one data set that included information on time to development of distant metastasis, subtypes were associated with significant differences in this clinical feature. By including a group of tumors from BRCA1 carriers in the analysis, we found that this genotype predisposes to the basal tumor subtype. Our results strongly support the idea that many of these breast tumor subtypes represent biologically distinct disease entities.
View details for DOI 10.1073/pnas.0932692100
View details for Web of Science ID 000184222500069
View details for PubMedID 12829800
View details for PubMedCentralID PMC166244
-
Note on "Comparison of model selection for regression" by Vladimir Cherkassky and Yunqian Ma
NEURAL COMPUTATION
2003; 15 (7): 1477-1480
Abstract
While Cherkassky and Ma (2003) raise some interesting issues in comparing techniques for model selection, their article appears to be written largely in protest of comparisons made in our book, Elements of Statistical Learning (2001). Cherkassky and Ma feel that we falsely represented the structural risk minimization (SRM) method, which they defend strongly here. In a two-page section of our book (pp. 212-213), we made an honest attempt to compare the SRM method with two related techniques, Akaike information criterion (AIC) and Bayesian information criterion (BIC). Apparently, we did not apply SRM in the optimal way. We are also accused of using contrived examples, designed to make SRM look bad. Alas, we did introduce some careless errors in our original simulation--errors that were corrected in the second and subsequent printings. Some of these errors were pointed out to us by Cherkassky and Ma (we supplied them with our source code), and as a result we replaced the assessment "SRM performs poorly overall" with a more moderate "the performance of SRM is mixed" (p. 212).
View details for Web of Science ID 000183421400002
View details for PubMedID 12816562
-
Post-transplantation lymphoproliferative disease in heart and heart-lung transplant recipients: 30-year experience at Stanford University
21st Annual Meeting of the International-Society-for-Heart-and-Lung-Transplantation
ELSEVIER SCIENCE INC. 2003: 505–14
Abstract
Post-transplantation lymphoproliferative disease (PTLD) is an important source of morbidity and mortality in transplant recipients, with a reported incidence of 0.8% to 20%. Risk factors are thought to include immunosuppressive agents and viral infection. This study attempts to evaluate the impact of different immunosuppressive regimens, ganciclovir prophylaxis and other potential risk factors in the development of PTLD. We reviewed the records of 1026 (874 heart, 152 heart-lung) patients who underwent transplantation at Stanford between 1968 and 1997. Of these, 57 heart and 8 heart-lung recipients developed PTLD. During this interval, 4 different immunosuppressive regimens were utilized sequentially. In January 1987, ganciclovir prophylaxis for cytomegalovirus serologic-positive patients was introduced. Other potential risk factors evaluated included age, gender, prior cardiac diagnoses, HLA match, rejection frequency and calcium-channel blockade. No correlation of development of PTLD was found with different immunosuppression regimens consisting of azathioprine, prednisone, cyclosporine, OKT3 induction, tacrolimus and mycophenolate mofetil. A trend suggesting an influence of ganciclovir on the prevention of PTLD was not statistically significant (p = 0.12). Recipient age and rejection frequency, as well as high-dose cyclosporine immunosuppression, were significantly (p < 0.02) associated with PTLD development. The prevalence of PTLD at 13.3 years was 15%. The overall incidence of PTLD was 6.3%. It was not altered by sequential modifications in treatment regimens. Younger recipient age and higher rejection frequency were associated with increased PTLD occurrence. The 15% prevalence of PTLD in 58 long-term survivors was unexpectedly high.
View details for DOI 10.1016/S1053-2498(02)01229-9
View details for PubMedID 12742411
-
Ischemia in three left ventricular regions: Insights into the pathogenesis of acute ischemic mitral regurgitation
82nd Annual Meeting of the American-Association-for-Thoracic-Surgery
MOSBY-ELSEVIER. 2003: 559–69
Abstract
Acute posterolateral left ventricular ischemia in sheep results in ischemic mitral regurgitation, but the effects of ischemia in other left ventricular regions on ischemic mitral regurgitation are unknown. Six adult sheep had radiopaque markers placed on the left ventricle, mitral annulus, and anterior and posterior mitral leaflets at the valve center and near the anterior and posterior commissures. After 6 to 8 days, animals were studied with biplane videofluoroscopy and transesophageal echocardiography before and during sequential balloon occlusion of the left anterior descending, distal left circumflex, and proximal left circumflex coronary arteries. Time of valve closure was defined as the time when the distance between leaflet edge markers reached its minimum plateau, and systolic leaflet edge separation distance was calculated on the basis of left ventricular ejection. Only proximal left circumflex coronary artery occlusion resulted in ischemic mitral regurgitation, which was central and holosystolic. Delayed valve closure (anterior commissure, 58 +/- 29 vs 92 +/- 24 ms; valve center, 52 +/- 26 vs 92 +/- 23 ms; posterior commissure, 60 +/- 30 vs 94 +/- 14 ms; all P < .05) and increased leaflet edge separation distance during ejection (mean increase, 2.2 +/- 1.5 mm, 2.1 +/- 1.9 mm, and 2.1 +/- 1.5 mm at the anterior commissure, valve center, and posterior commissure, respectively; P < .05 for all) were seen during proximal left circumflex coronary artery occlusion but not during left anterior descending or distal left circumflex coronary artery occlusion. Ischemic mitral regurgitation was associated with a 19% +/- 10% increase in mitral annular area, and displacement of both papillary muscle tips away from the septal annulus at end systole. Acute ischemic mitral regurgitation in sheep occurred only after proximal left circumflex coronary artery occlusion along with delayed valve closure in early systole and increased leaflet edge separation throughout ejection in all 3 leaflet coaptation sites. The degree of left ventricular systolic dysfunction induced did not correlate with ischemic mitral regurgitation, but both altered valvular and subvalvular 3-dimensional geometry were necessary to produce ischemic mitral regurgitation during acute left ventricular ischemia.
View details for DOI 10.1067/mtc.2003.43
View details for PubMedID 12658198
-
Feature extraction for nonparametric discriminant analysis
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
2003; 12 (1): 101-120
View details for DOI 10.1198/1061860031220
View details for Web of Science ID 000181549800005
-
Class prediction by nearest shrunken centroids, with applications to DNA microarrays
STATISTICAL SCIENCE
2003; 18 (1): 104-117
View details for Web of Science ID 000184301600006
-
Boosting and support vector machines as optimal separators
Conference on Document Recognition and Retrieval X
SPIE-INT SOC OPTICAL ENGINEERING. 2003: 1–7
View details for Web of Science ID 000181749800001
-
Generalized linear and generalized additive models in studies of species distributions: setting the scene
ECOLOGICAL MODELLING
2002; 157 (2-3): 89-100
View details for Web of Science ID 000179241300001
-
Risk factors for progressive cartilage loss in the knee: a longitudinal magnetic resonance imaging study in forty-three patients.
Arthritis and rheumatism
2002; 46 (11): 2884-2892
Abstract
To evaluate the rate of progression of cartilage loss in the knee joint using magnetic resonance imaging (MRI) and to evaluate potential risk factors for more rapid cartilage loss. We evaluated baseline and followup MRIs of the knees in 43 patients (minimum time interval of 1 year, mean 1.8 years, range 52-285 weeks). Cartilage loss was graded in the anterior, central, and posterior regions of the medial and lateral knee compartments. Knee joints were also evaluated for other pathology. Data were analyzed using analysis of variance models. Patients who had sustained meniscal tears showed a higher average rate of progression of cartilage loss (22%) than that seen in those who had intact menisci (14.9%) (P
View details for PubMedID 12428228
-
Risk factors for progressive cartilage loss in the knee
ARTHRITIS AND RHEUMATISM
2002; 46 (11): 2884-2892
View details for DOI 10.1002/art.10573
View details for Web of Science ID 000179239500008
-
Cortisol and behavior in fragile X syndrome
PSYCHONEUROENDOCRINOLOGY
2002; 27 (7): 855-872
Abstract
The purpose of this study was to determine if children with fragile X syndrome, who typically demonstrate a neurobehavioral phenotype that includes social anxiety, withdrawal, and hyper-arousal, have increased levels of cortisol, a hormone associated with stress. The relevance of adrenocortical activity to the fragile X phenotype also was examined. One hundred and nine children with the fragile X full mutation (70 males and 39 females) and their unaffected siblings (51 males and 58 females) completed an in-home evaluation including a cognitive assessment and a structured social challenge task. Multiple samples of salivary cortisol were collected throughout the evaluation day and on two typical non-school days. Measures of the fragile X mental retardation (FMR1) gene, child intelligence, the quality of the home environment, parental psychopathology, and the effectiveness of educational and therapeutic services also were collected. Linear mixed-effects analyses were used to examine differences in cortisol associated with the fragile X diagnosis and gender (fixed effects) and to estimate individual subject and familial variation (random effects) in cortisol hormone levels. Hierarchical multiple regression analyses were conducted to determine whether adrenocortical activity is associated with behavior problems after controlling for significant genetic and environmental factors. Results showed that children with fragile X, especially males, had higher levels of salivary cortisol on typical days and during the evaluation. Highly significant family effects on salivary cortisol were detected, consistent with previous work documenting genetic and environmental influences on adrenocortical activity. Increased cortisol was significantly associated with behavior problems in boys and girls with fragile X but not in their unaffected siblings. These results provide evidence that the function of the hypothalamic-pituitary-adrenal axis may have an independent association with behavioral problems in children with fragile X syndrome.
View details for Web of Science ID 000178462800008
View details for PubMedID 12183220
-
Degrees-of-freedom tests for smoothing splines
BIOMETRIKA
2002; 89 (2): 251-263
View details for Web of Science ID 000176520500001
-
Diagnosis of multiple cancer types by shrunken centroids of gene expression
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2002; 99 (10): 6567-6572
Abstract
We have devised an approach to cancer class prediction from gene expression profiling, based on an enhancement of the simple nearest prototype (centroid) classifier. We shrink the prototypes and hence obtain a classifier that is often more accurate than competing methods. Our method of "nearest shrunken centroids" identifies subsets of genes that best characterize each class. The technique is general and can be used in many other classification problems. To demonstrate its effectiveness, we show that the method was highly efficient in finding genes for classifying small round blue cell tumors and leukemias.
View details for Web of Science ID 000175637300012
View details for PubMedID 12011421
-
Exploratory screening of genes and clusters from microarray experiments
STATISTICA SINICA
2002; 12 (1): 47-59
View details for Web of Science ID 000174372800004
-
Kernel logistic regression and the import vector machine
15th Annual Conference on Neural Information Processing Systems (NIPS)
M I T PRESS. 2002: 1081–1088
View details for Web of Science ID 000180520100135
-
Supervised learning from microarray data
15th Biannual Conference on Computational Statistics (COMPSTAT)
PHYSICA-VERLAG GMBH & CO. 2002: 67–77
View details for Web of Science ID 000179942900007
-
Optimization and evaluation of T7 based RNA linear amplification protocols for cDNA microarray analysis
BMC GENOMICS
2002; 3
Abstract
T7 based linear amplification of RNA is used to obtain sufficient antisense RNA for microarray expression profiling. We optimized and systematically evaluated the fidelity and reproducibility of different amplification protocols using total RNA obtained from primary human breast carcinomas and high-density cDNA microarrays. Using an optimized protocol, the average correlation coefficient of gene expression of 11,123 cDNA clones between amplified and unamplified samples is 0.82 (0.85 when a virtual array was created using repeatedly amplified samples to minimize experimental variation). Less than 4% of genes show changes in expression level by 2-fold or greater after amplification compared to unamplified samples. Most changes due to amplification are not systematic both within one tumor sample and between different tumors. Amplification appears to dampen the variation of gene expression for some genes when compared to unamplified poly(A)+ RNA. The reproducibility between repeatedly amplified samples is 0.97 when performed on the same day, but drops to 0.90 when performed weeks apart. The fidelity and reproducibility of amplification is not affected by decreasing the amount of input total RNA in the 0.3-3 micrograms range. Adding template-switching primer, DNA ligase, or column purification of double-stranded cDNA does not improve the fidelity of amplification. The correlation coefficient between amplified and unamplified samples is higher when total RNA is used as template for both experimental and reference RNA amplification. T7 based linear amplification reproducibly generates amplified RNA that closely approximates original sample for gene expression profiling using cDNA microarrays.
View details for Web of Science ID 000181477100031
View details for PubMedID 12445333
-
Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2001; 98 (19): 10869-10874
Abstract
The purpose of this study was to classify breast carcinomas based on variations in gene expression patterns derived from cDNA microarrays and to correlate tumor characteristics to clinical outcome. A total of 85 cDNA microarray experiments representing 78 cancers, three fibroadenomas, and four normal breast tissues were analyzed by hierarchical clustering. As reported previously, the cancers could be classified into a basal epithelial-like group, an ERBB2-overexpressing group and a normal breast-like group based on variations in gene expression. A novel finding was that the previously characterized luminal epithelial/estrogen receptor-positive group could be divided into at least two subgroups, each with a distinctive expression profile. These subtypes proved to be reasonably robust by clustering using two different gene sets: first, a set of 456 cDNA clones previously selected to reflect intrinsic properties of the tumors and, second, a gene set that highly correlated with patient outcome. Survival analyses on a subcohort of patients with locally advanced breast cancer uniformly treated in a prospective study showed significantly different outcomes for the patients belonging to the various groups, including a poor prognosis for the basal-like subtype and a significant difference in outcome for the two estrogen receptor-positive groups.
View details for Web of Science ID 000170966800067
View details for PubMedID 11553815
View details for PubMedCentralID PMC58566
-
Brain anatomy, gender and IQ in children and adolescents with fragile X syndrome
BRAIN
2001; 124: 1610-1618
Abstract
This study utilized MRI data to describe neuroanatomical morphology in children and adolescents with fragile X syndrome, the most common inherited cause of developmental disability. The syndrome provides a model for understanding how specific genetic factors can influence both neuroanatomy and cognitive capacity. Thirty-seven children and adolescents with fragile X syndrome received an MRI scan and cognitive testing. Scanning procedures and analytical strategies were identical to those reported in an earlier study of 85 typically developing children, permitting a comparison with a previously published template of normal brain development. Regression analyses indicated that there was a normative age-related decrease in grey matter and an increase in white matter. However, caudate and ventricular CSF volumes were significantly enlarged, and caudate volumes decreased with age. Rates of reduction of cortical grey matter were different for males and females. IQ scores were not significantly correlated with volumes of cortical and subcortical grey matter, and these relationships were statistically different from the correlational patterns observed in typically developing children. Children with fragile X syndrome exhibited several typical neurodevelopmental patterns. Aberrations in volumes of subcortical nuclei, gender differences in rates of cortical grey matter reduction and an absence of correlation between grey matter and cognitive performance provided indices of the deleterious effects of the fragile X mutation on the brain's structural organization.
View details for Web of Science ID 000170453400013
View details for PubMedID 11459752
-
Missing value estimation methods for DNA microarrays
BIOINFORMATICS
2001; 17 (6): 520-525
Abstract
Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1-20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.
View details for Web of Science ID 000169404700005
View details for PubMedID 11395428
-
Posttransplantation lymphoproliferative disease in heart and heart-lung transplant recipients: thirty years experience at our hospital.
Journal of heart and lung transplantation
2001; 20 (2): 258-?
View details for PubMedID 11250519
-
Supervised harvesting of expression trees
GENOME BIOLOGY
2001; 2 (1)
Abstract
We propose a new method for supervised learning from gene expression data. We call it 'tree harvesting'. This technique starts with a hierarchical clustering of genes, then models the outcome variable as a sum of the average expression profiles of chosen clusters and their products. It can be applied to many different kinds of outcome measures such as censored survival times, or a response falling in two or more classes (for example, cancer classes). The method can discover genes that have strong effects on their own, and genes that interact with other genes. We illustrate the method on data from a lymphoma study, and on a dataset containing samples from eight different cancers. It identified some potentially interesting gene clusters. In simulation studies we found that the procedure may require a large number of experimental samples to successfully discover interactions. Tree harvesting is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worthy of further investigation.
View details for Web of Science ID 000207583500011
View details for PubMedID 11178280
-
Learning and tracking cyclic human motion
14th Annual Neural Information Processing Systems Conference (NIPS)
M I T PRESS. 2001: 894–900
View details for Web of Science ID 000171891800126
-
Functional linear discriminant analysis for irregularly sampled curves
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
2001; 63: 533-550
View details for Web of Science ID 000170353900006
-
The Elements of Statistical Learning: Prediction, Inference and Data Mining
Springer Verlag. 2001
-
Estimating the number of clusters in a data set via the gap statistic
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
2001; 63: 411-423
View details for Web of Science ID 000168837200013
-
Principal component models for sparse functional data
BIOMETRIKA
2000; 87 (3): 587-602
View details for Web of Science ID 000089678500007
-
Prediction of risk for patients with unstable angina.
Evidence report/technology assessment (Summary)
2000: 1-3
View details for PubMedID 11013605
-
Bayesian backfitting
STATISTICAL SCIENCE
2000; 15 (3): 196-213
View details for Web of Science ID 000166404100002
-
'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns.
Genome biology
2000; 1 (2): RESEARCH0003-?
Abstract
Large gene expression studies, such as those conducted using DNA arrays, often provide millions of different pieces of data. To address the problem of analyzing such data, we describe a statistical method, which we have called 'gene shaving'. The method identifies subsets of genes with coherent expression patterns and large variation across conditions. Gene shaving differs from hierarchical clustering and other widely used methods for analyzing gene expression studies in that genes may belong to more than one cluster, and the clustering may be supervised by an outcome measure. The technique can be 'unsupervised', that is, the genes and samples are treated as unlabeled, or partially or fully supervised by using known properties of the genes or samples to assist in finding meaningful groupings. We illustrate the use of the gene shaving method to analyze gene expression measurements made on samples from patients with diffuse large B-cell lymphoma. The method identifies a small cluster of genes whose expression is highly predictive of survival. The gene shaving method is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worth further investigation.
View details for PubMedID 11178228
-
Optimal kernel shapes for local linear regression
13th Annual Conference on Neural Information Processing Systems (NIPS)
MIT PRESS. 2000: 540–546
View details for Web of Science ID 000165048700077
-
Bone mineral acquisition in healthy Asian, Hispanic, black, and Caucasian youth: A longitudinal study
JOURNAL OF CLINICAL ENDOCRINOLOGY & METABOLISM
1999; 84 (12): 4702-4712
Abstract
Ethnic and gender differences in bone mineral acquisition were examined in a longitudinal study of 423 healthy Asian, black, Hispanic, and white males and females (aged 9-25 yr). Bone mass of the spine, femoral neck, total hip, and whole body was measured annually for up to 4 yr by dual energy x-ray absorptiometry. Age-adjusted mean bone mineral curves for areal (BMD) and volumetric (BMAD) bone mineral density were compared for the 4 ethnic groups. Consistent differences in areal and volumetric bone density were observed only between black and nonblack subjects. Among females, blacks had greater mean levels of BMD and BMAD at all skeletal sites. Differences among Asians, Hispanics, and white females were significant for femoral neck BMD, whole body BMD, and whole body bone mineral content/height ratio, for which Asians had significantly lower values; femoral neck BMAD in Asian and white females was lower than that in Hispanics. Like the females, black males had consistently greater mean values than nonblacks for all BMD and BMAD measurements. A few differences were also observed among nonblack male subjects. Whites had greater mean total hip BMD, whole body BMD, and whole body bone mineral content/height ratio than Asian and Hispanic males; Hispanics had lower spine BMD than white and Asian males. The tempo of gains in BMD varied by gender and skeletal site. In females, total hip, spine, and whole body BMD reached a plateau at 14.1, 15.7, and 16.4 yr, respectively. For males, gains in BMD leveled off at 15.7 yr for total hip and at age 17.6 yr for spine and whole body. Black and Asian females and Asian males tended to reach a plateau in BMD earlier than the other ethnic groups. The use of gender- and ethnic-specific standards is recommended when interpreting pediatric bone densitometry data.
View details for Web of Science ID 000084134100065
View details for PubMedID 10599739
-
An evaluation of beta-blockers, calcium antagonists, nitrates, and alternative therapies for stable angina.
Evidence report/technology assessment (Summary)
1999: 1-2
View details for PubMedID 11925969
-
Statistical measures for the computer-aided diagnosis of mammographic masses
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
1999; 8 (3): 531-543
View details for Web of Science ID 000083134100011
-
Meta-analysis of trials comparing beta-blockers, calcium antagonists, and nitrates for stable angina
JAMA-JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION
1999; 281 (20): 1927-1936
Abstract
Which drug is most effective as a first-line treatment for stable angina is not known. To compare the relative efficacy and tolerability of treatment with beta-blockers, calcium antagonists, and long-acting nitrates for patients who have stable angina. We identified English-language studies published between 1966 and 1997 by searching the MEDLINE and EMBASE databases and reviewing the bibliographies of identified articles to locate additional relevant studies. Randomized or crossover studies comparing antianginal drugs from 2 or 3 different classes (beta-blockers, calcium antagonists, and long-acting nitrates) lasting at least 1 week were reviewed. Studies were selected if they reported at least 1 of the following outcomes: cardiac death, myocardial infarction, study withdrawal due to adverse events, angina frequency, nitroglycerin use, or exercise duration. Ninety (63%) of 143 identified studies met the inclusion criteria. Two independent reviewers extracted data from selected articles, settling any differences by consensus. Outcome data were extracted a third time by 1 of the investigators. We combined results using odds ratios (ORs) for discrete data and mean differences for continuous data. Studies of calcium antagonists were grouped by duration and type of drug (nifedipine vs nonnifedipine). Rates of cardiac death and myocardial infarction were not significantly different for treatment with beta-blockers vs calcium antagonists (OR, 0.97; 95% confidence interval [CI], 0.67-1.38; P = .79). There were 0.31 (95% CI, 0.00-0.62; P = .05) fewer episodes of angina per week with beta-blockers than with calcium antagonists. beta-Blockers were discontinued because of adverse events less often than were calcium antagonists (OR, 0.72; 95% CI, 0.60-0.86; P<.001). The differences between beta-blockers and calcium antagonists were most striking for nifedipine (OR for adverse events with beta-blockers vs nifedipine, 0.60; 95% CI, 0.47-0.77).
Too few trials compared nitrates with calcium antagonists or beta-blockers to draw firm conclusions about relative efficacy. beta-Blockers provide similar clinical outcomes and are associated with fewer adverse events than calcium antagonists in randomized trials of patients who have stable angina.
View details for Web of Science ID 000080427300033
View details for PubMedID 10349897
-
Regression analysis of multiple protein structures
JOURNAL OF COMPUTATIONAL BIOLOGY
1998; 5 (3): 585-595
Abstract
A general framework is presented for analyzing multiple protein structures using statistical regression methods. The regression approach can superimpose protein structures rigidly or with shear. Also, this approach can superimpose multiple structures explicitly, without resorting to pairwise superpositions. The algorithm alternates between matching corresponding landmarks among the protein structures and superimposing these landmarks. Matching is performed using a robust dynamic programming technique that uses gap penalties that adapt to the given data. Superposition is performed using either orthogonal transformations, which impose the rigid-body assumption, or affine transformations, which allow shear. The resulting regression model of a protein family measures the amount of structural variability at each landmark. A variation of our algorithm permits a separate weight for each landmark, thereby allowing one to emphasize particular segments of a protein structure or to compensate for variances that differ at various positions in a structure. In addition, a method is introduced for finding an initial correspondence, by measuring the discrete curvature along each protein backbone. Discrete curvature also characterizes the secondary structure of a protein backbone, distinguishing among helical, strand, and loop regions. An example is presented involving a set of seven globin structures. Regression analysis, using both affine and orthogonal transformations, reveals that globins are most strongly conserved structurally in helical regions, particularly in the mid-regions of the E, F, and G helices.
View details for Web of Science ID 000075921100016
View details for PubMedID 9773352
-
The error coding method and PICTs
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS
1998; 7 (3): 377-387
View details for Web of Science ID 000076008800008
-
Classification by pairwise coupling
ANNALS OF STATISTICS
1998; 26 (2): 451-471
View details for Web of Science ID 000079135400001
-
Modeling and superposition of multiple protein structures using affine transformations: analysis of the globins.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
1998: 509-520
Abstract
A novel approach for analyzing multiple protein structures is presented. A family of related protein structures may be characterized by an affine model, obtained by applying transformation matrices that permit both rotation and shear. The affine model and transformation matrices can be computed efficiently using a single eigen-decomposition. A novel method for finding correspondences is also introduced. This method matches curvatures along the protein backbone. The algorithm is applied to analyze a set of seven globin structures. Our method identifies 100 corresponding landmarks across all seven structures. Results show that most helices in globins can be identified by high curvature, with the exception of the C and D helices. Analysis of the superposition reveals that globins are most strongly conserved structurally in the mid-regions of the E and G helices.
View details for PubMedID 9697208
-
The error coding and substitution PaCTs
11th Annual Conference on Neural Information Processing Systems (NIPS)
MIT PRESS. 1998: 542–548
View details for Web of Science ID 000075130700077
-
Classification by pairwise coupling
11th Annual Conference on Neural Information Processing Systems (NIPS)
MIT PRESS. 1998: 507–513
View details for Web of Science ID 000075130700072
-
Discriminant adaptive nearest neighbor classification
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
1996; 18 (6): 607-616
View details for Web of Science ID A1996UR25400004
-
Discriminant adaptive nearest neighbor classification and regression
9th Annual Conference on Neural Information Processing Systems (NIPS)
MIT PRESS. 1996: 409–415
View details for Web of Science ID A1996BG45M00058
-
Generalized additive models for medical research.
Statistical methods in medical research
1995; 4 (3): 187-196
Abstract
This article reviews flexible statistical methods that are useful for characterizing the effect of potential prognostic factors on disease endpoints. Applications to survival models and binary outcome models are illustrated.
View details for PubMedID 8548102
-
PENALIZED DISCRIMINANT-ANALYSIS
ANNALS OF STATISTICS
1995; 23 (1): 73-102
View details for Web of Science ID A1995RE61100006
-
WAVELET SHRINKAGE - ASYMPTOPIA - DISCUSSION
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL
1995; 57 (2): 337-369
View details for Web of Science ID A1995QL97000002
-
FLEXIBLE DISCRIMINANT-ANALYSIS BY OPTIMAL SCORING
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
1994; 89 (428): 1255-1270
View details for Web of Science ID A1994PU33000012
-
THE USE OF POLYNOMIAL SPLINES AND THEIR TENSOR-PRODUCTS IN MULTIVARIATE FUNCTION ESTIMATION
ANNALS OF STATISTICS
1994; 22 (1): 118-184
View details for Web of Science ID A1994NH41200008
-
NEURAL NETWORKS AND RELATED METHODS FOR CLASSIFICATION - DISCUSSION
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
1994; 56 (3): 437-456
View details for Web of Science ID A1994NK40900002
-
LOCAL REGRESSION - AUTOMATIC KERNEL CARPENTRY - COMMENTS AND REJOINDER
STATISTICAL SCIENCE
1993; 8 (2): 129-143
View details for Web of Science ID A1993ME64300004
-
LOCAL REGRESSION - AUTOMATIC KERNEL CARPENTRY
STATISTICAL SCIENCE
1993; 8 (2): 120-129
View details for Web of Science ID A1993ME64300003
-
FLEXIBLE COVARIATE EFFECTS IN THE PROPORTIONAL HAZARDS MODEL
BREAST CANCER RESEARCH AND TREATMENT
1992; 22 (3): 241-250
Abstract
The proportional hazards model is frequently used in analyzing the results of clinical trials, when it is often the case that the outcomes are right-censored. This model allows one to measure treatment effects and simultaneously identify and adjust for prognostic factors that might influence the outcome. In this paper, we outline a class of semiparametric models that allows one to model prognostic factors nonlinearly, and have the data suggest the form of their effect. The methods are illustrated in an analysis of data from a breast cancer clinical trial.
View details for Web of Science ID A1992JQ41500008
View details for PubMedID 1391990
-
THE Π METHOD FOR ESTIMATING MULTIVARIATE FUNCTIONS FROM NOISY DATA - DISCUSSION
TECHNOMETRICS
1991; 33 (2): 155-155
View details for Web of Science ID A1991FJ19800004
-
MULTIVARIATE ADAPTIVE REGRESSION SPLINES - DISCUSSION
ANNALS OF STATISTICS
1991; 19 (1): 93-99
View details for Web of Science ID A1991FF04700005
- Statistical Models in S. Wadsworth/Brooks Cole, Pacific Grove, California. 1991
-
EXPLORING THE NATURE OF COVARIATE EFFECTS IN THE PROPORTIONAL HAZARDS MODEL
BIOMETRICS
1990; 46 (4): 1005-1016
Abstract
We discuss an exploratory technique for investigating the nature of covariate effects in Cox's proportional hazards model. This technique features an additive term $\sum_{j=1}^{p} f_j(x_{ij})$ in place of the usual linear term $\sum_{j=1}^{p} x_{ij}\beta_j$, where $x_{i1}, x_{i2}, \ldots, x_{ip}$ are covariate values for the $i$th individual. The $f_j(\cdot)$ are unspecified smooth functions that are estimated using scatterplot smoothers. These functions can be used for descriptive purposes or to suggest transformations of the covariates. The estimation technique is a variation of the local scoring algorithm for generalized additive models (Hastie and Tibshirani, 1986, Statistical Science 1, 297-318).
View details for Web of Science ID A1990EV52100010
View details for PubMedID 1964808
-
AN ANALYSIS OF GESTATIONAL-AGE, NEONATAL SIZE AND NEONATAL DEATH USING NONPARAMETRIC LOGISTIC-REGRESSION
JOURNAL OF CLINICAL EPIDEMIOLOGY
1990; 43 (11): 1179-1190
Abstract
The relationship between gestational age, neonatal size and neonatal death is complex. To date, most authors have used birth weight as a proxy for neonatal size and have neglected to examine head circumference and crown heel length. In addition, they have assumed the size and gestational age were linearly related to neonatal death. In this study we use nonparametric multiple logistic regression to examine the relationship between gestational age, neonatal size and neonatal death. On its own, gestational age was nonlinearly associated with neonatal death. This nonlinearity disappeared with the addition of birth weight, crown heel length and head circumference. Birth weight, head circumference and crown heel length all had significant nonlinear associations with neonatal death in univariate analysis. With all factors in the model, birth weight and head circumference were nonlinearly associated with neonatal death and crown heel length was linearly associated with neonatal death. The complex relations between gestational age, neonatal size and neonatal death were explored with greater ease with nonparametric logistic regression.
View details for Web of Science ID A1990EK04600007
View details for PubMedID 2243255
- Generalized Additive Models. Chapman and Hall. 1990
-
REGRESSION WITH AN ORDERED CATEGORICAL RESPONSE
STATISTICS IN MEDICINE
1989; 8 (7): 785-794
Abstract
A survey on Mseleni joint disease in South Africa involved the scoring of pelvic X-rays of women to measure osteoporosis. The scores were ordinal by construction and ranged from 0 to 12. It is standard practice to use ordinary regression techniques with an ordinal response that has that many categories. We give evidence for these data that the constraints on the response result in a misleading regression analysis. McCullagh's proportional-odds model is designed specifically for the regression analysis of ordinal data. We demonstrate the technique on these data, and show how it fills the gap between ordinary regression and logistic regression (for discrete data with two categories). In addition, we demonstrate non-parametric versions of these models that do not make any linearity assumptions about the regression function.
View details for Web of Science ID A1989AF73400002
View details for PubMedID 2772438
-
LINEAR SMOOTHERS AND ADDITIVE-MODELS
ANNALS OF STATISTICS
1989; 17 (2): 453-510
View details for Web of Science ID A1989AB89300001
-
LOCAL LIKELIHOOD ESTIMATION
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
1987; 82 (398): 559-567
View details for Web of Science ID A1987J105700027
-
THE GEOMETRIC INTERPRETATION OF CORRESPONDENCE-ANALYSIS
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
1987; 82 (398): 437-447
View details for Web of Science ID A1987J105700007
-
GENERALIZED ADDITIVE-MODELS - SOME APPLICATIONS
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
1987; 82 (398): 371-386
View details for Web of Science ID A1987J105700001
-
NONPARAMETRIC LOGISTIC AND PROPORTIONAL ODDS REGRESSION
APPLIED STATISTICS-JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES C
1987; 36 (3): 260-276
View details for Web of Science ID A1987K880300002
-
PROJECTION PURSUIT - DISCUSSION
ANNALS OF STATISTICS
1985; 13 (2): 502-508
View details for Web of Science ID A1985AKQ9500010
-
CLOSED MITRAL VALVOTOMY - ACTUARIAL ANALYSIS OF RESULTS IN 654 PATIENTS OVER 12 YEARS AND ANALYSIS OF PREOPERATIVE PREDICTORS OF LONG-TERM SURVIVAL
ANNALS OF THORACIC SURGERY
1982; 33 (5): 473-479
Abstract
The records of 654 patients with mitral stenosis who underwent closed mitral valvotomy over a 12-year period were submitted to actuarial analysis. This revealed a low (2.97%) operative mortality. At 12 years, the overall cumulative proportion surviving was 78%; 47% of patients survived without reoperation. The usual clinical indicators of suitability for closed valvotomy were successful in predicting improved survival. The surgeon's assessment of the suitability of the valve correlated well with outcome. Valvotomy during pregnancy was associated with a good long-term outlook. The presence of pulmonary hypertension and atrial fibrillation did not alter survival significantly. Sex and age were not associated with adverse prognosis. We conclude that closed mitral valvotomy still has a place in the management of mobile mitral stenosis, particularly in areas where there is a high incidence of rheumatic heart disease and a large number of young patients have mobile mitral stenosis.
View details for Web of Science ID A1982NP11000007
View details for PubMedID 7082084
-
SURVEY OF ANTIBIOTIC-RESISTANCE IN GRAM-NEGATIVE BACTERIA USING THE CROSS PRODUCT RATIO
ZENTRALBLATT FUR BAKTERIOLOGIE MIKROBIOLOGIE UND HYGIENE SERIES A-MEDICAL MICROBIOLOGY INFECTIOUS DISEASES VIROLOGY PARASITOLOGY
1979; 243 (4): 483-489
Abstract
Antibiotic resistance patterns in clinical isolates of selected gram-negative bacteria at Groote Schuur Hospital during two three-month periods with a ten-year interval were investigated. The antibiotic resistance is represented by means of the cross product, or odds ratio, using the log-linear model. This was found to be a simple method of monitoring the change or increase of antibiotic resistance, and enabled an overall analysis, catering for antibiotic and organism effects, to be performed.
View details for Web of Science ID A1979HC38800037
View details for PubMedID 384718
-
Risk of asbestosis in crocidolite and amosite mines in South Africa.
Annals of the New York Academy of Sciences
1979; 330: 35-52
Abstract
X-rays of all white and mixed-race men employed in crocidolite and amosite mines and mills were read independently by three experienced readers according to the ILO U/C classification. Abnormality was regarded as present if reported by two or more readers. Parenchymal abnormality, defined as the presence of small irregular opacities of profusion 1/0 or greater, was found in 7.3% of the workers. Pleural thickening was found in 4.5% of the workers, costophrenic angle obliteration in 3.2%, and pleural calcification in 1.7%. The prevalences of both pleural and parenchymal abnormality were strongly related to the duration of exposure to asbestos at work. The overall prevalence of abnormality increased from 4.0% in men with exposure for 1 year or less to 47.9% in men with more than 15 years of exposure. After taking into account the effects of age and duration of asbestos exposure, the prevalence of pleural abnormality was not predicted by fiber concentration. However, white men working with amosite tended to develop a higher prevalence of pleural abnormality than did those working with crocidolite. Compared to whites, men of mixed race, who only work with crocidolite, had a high prevalence of pleural abnormality in each exposure duration category. In contrast to pleural abnormality, the prevalence of parenchymal abnormality, after taking into account the effects of age and duration of exposure, was significantly predicted by fiber concentration but not by race or asbestos type. Our results suggest that parenchymal abnormality in workers in South African asbestos mines could be largely prevented by reducing exposure to fibers visible under the light microscope. However, this may not be the case for pleural abnormality.
View details for PubMedID 294187